Re: [HACKERS] all_visible replay aborting due to uninitialized pages
On 2013-05-29 23:01:31 -0400, Robert Haas wrote: On Wed, May 29, 2013 at 9:57 AM, Andres Freund and...@2ndquadrant.com wrote: Thought about that, but given that 9.3's visibilitymap_set already will already FPI heap pages I concluded it wouldn't really be an improvement since it's only one ||log_heap_page or so there. Not sure what's better. Will write the patch and see how it goes. Ended up using log_newpage_buffer since reusing visibilitymap_set's record would break the wal format as we currently do not accept an FPI on the heap pages during replay when 9.3. Forcing to upgrade the client first would be rather unfriendly... That has the disadvantage of logging a full heap page since it doesn't use the hole optimization but this happens really infrequently, so ... Yeah, I think it's fine. The patch also looks fine, although I think the comments could use a bit of tidying. I guess we need to back-patch this all the way back to 8.4? It will require some adjustments for the older branches. I think 9.2 is actually far enough and it should apply there. Before that we only logged the unsetting of all_visible via heap_(inset|update|delete)'s wal records not the setting as far as I can tell. So I don't immediately see a danger 9.2. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] units in postgresql.conf comments
On 30.05.2013 06:43, Bruce Momjian wrote: On Wed, May 29, 2013 at 09:59:10PM -0400, Peter Eisentraut wrote: I think these sort of entries don't make much sense: #wal_sender_timeout = 60s # in milliseconds; 0 disables I think we should remove units from the comments when it's clear from the name or the default value that time units are accepted. We are documenting what happens when there are no units. Are people are going to change '60s' to '50' and assume that is '50s'? Hopefully not. I do like the clutter avoidance of removing the units from the comments. We could make it mandatory to specify the unit in the value. Ie. throw an error on wal_sender_timeout = 50: ERROR: unit required for option wal_sender_timeout HINT: Valid units for this parameter are ms, s, min, h, and d. Then you wouldn't need a comment to explain what the unit of a naked value is. The only problem I see with that is backwards-compatibility. Old postgresql.conf files containing naked values would no longer work. But all you'd need to do is to add in the units, which would work on older versions too, and would be good for readability anyway. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)
On 05/28/2013 07:52 PM, Fabien COELHO wrote: However I'm not sure that pg_stat_replication currently has the necessary information on either side to measure the lag (in time transactions, but how do I know when a transaction was committed? or number of transactions?). The BDR codebase now has a handy function to report when a transaction was committed, pg_get_transaction_committime(xid) . It also adds pg_xlog_wait_remote_apply and pg_xlog_wait_remote_receive that can be used with pg_current_xlog_location() to wait until one or all replicas have caught up, or with LSNs from pg_stat_replication to (say) wait until all replicas have caught up with the most up-to-date one. I don't think these depend on anything BDR-specific, though Andres or Álvaro would be able to say for sure. Take a look in: git://git.postgresql.org/git/users/andresfreund/postgres.git on the 'bdr' branch. Be aware that it is rebased regularly, though the '0.4' tag applied earlier today will remain constant and contains the functions of interest. I hope this helps. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] units in postgresql.conf comments
On 05/30/2013 12:01 AM, Heikki Linnakangas wrote: We could make it mandatory to specify the unit in the value. Ie. throw an error on wal_sender_timeout = 50: ERROR: unit required for option wal_sender_timeout HINT: Valid units for this parameter are ms, s, min, h, and d. Then you wouldn't need a comment to explain what the unit of a naked value is. The only problem I see with that is backwards-compatibility. Old postgresql.conf files containing naked values would no longer work. But all you'd need to do is to add in the units, which would work on older versions too, and would be good for readability anyway. I like this idea with one addition. We should have a default unit for each. For wal_sender_timeout seconds makes sense, but for checkpoint_timeout minutes makes sense (for example). JD - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)
On 05/30/2013 03:10 PM, Craig Ringer wrote: On 05/28/2013 07:52 PM, Fabien COELHO wrote: However I'm not sure that pg_stat_replication currently has the necessary information on either side to measure the lag (in time transactions, but how do I know when a transaction was committed? or number of transactions?). The BDR codebase now has a handy function to report when a transaction was committed, pg_get_transaction_committime(xid) . It also adds pg_xlog_wait_remote_apply and pg_xlog_wait_remote_receive that can be used with pg_current_xlog_location() to wait until one or all replicas have caught up, or with LSNs from pg_stat_replication to (say) wait until all replicas have caught up with the most up-to-date one. I don't think these depend on anything BDR-specific They do, however, require changes to Pg core. These aren't functions you can just borrow and add to an extension, they require additional changes to core to collect the data they use. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] units in postgresql.conf comments
On Thu, May 30, 2013 at 3:52 AM, Joshua D. Drake j...@commandprompt.com wrote: On 05/30/2013 12:01 AM, Heikki Linnakangas wrote: We could make it mandatory to specify the unit in the value. Ie. throw an error on wal_sender_timeout = 50: ERROR: unit required for option wal_sender_timeout HINT: Valid units for this parameter are ms, s, min, h, and d. Then you wouldn't need a comment to explain what the unit of a naked value is. The only problem I see with that is backwards-compatibility. Old postgresql.conf files containing naked values would no longer work. But all you'd need to do is to add in the units, which would work on older versions too, and would be good for readability anyway. In general, I like this. Requiring full specification is never wrong. Except possibly for thje backwards compatible thing. I like this idea with one addition. We should have a default unit for each. For wal_sender_timeout seconds makes sense, but for checkpoint_timeout minutes makes sense (for example). This sounds like a good way to make things even more confusing. Right now the confusion is only in the comments - this would make it confusing in the actual values. Requiring a unit seems like a much better idea. That way, there is no way for confusion. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)
On 2013-05-30 15:54:01 +0800, Craig Ringer wrote: On 05/30/2013 03:10 PM, Craig Ringer wrote: On 05/28/2013 07:52 PM, Fabien COELHO wrote: However I'm not sure that pg_stat_replication currently has the necessary information on either side to measure the lag (in time transactions, but how do I know when a transaction was committed? or number of transactions?). The BDR codebase now has a handy function to report when a transaction was committed, pg_get_transaction_committime(xid) . It also adds pg_xlog_wait_remote_apply and pg_xlog_wait_remote_receive that can be used with pg_current_xlog_location() to wait until one or all replicas have caught up, or with LSNs from pg_stat_replication to (say) wait until all replicas have caught up with the most up-to-date one. I don't think these depend on anything BDR-specific They do, however, require changes to Pg core. These aren't functions you can just borrow and add to an extension, they require additional changes to core to collect the data they use. pg_xlog_wait_remote_receive() doesn't require changes afaics and should be easily packable as an extension. We might want to make it use the sync commit infrastructure at some point instead of essentially busy waiting, but... 'committs' - the mapping of xids to timestamp certainly does though. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] units in postgresql.conf comments
On 05/30/2013 12:55 AM, Magnus Hagander wrote: I like this idea with one addition. We should have a default unit for each. For wal_sender_timeout seconds makes sense, but for checkpoint_timeout minutes makes sense (for example). This sounds like a good way to make things even more confusing. Right now the confusion is only in the comments - this would make it confusing in the actual values. Requiring a unit seems like a much better idea. That way, there is no way for confusion. I can buy into that. JD -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] units in postgresql.conf comments
On 30.05.2013 10:52, Joshua D. Drake wrote: On 05/30/2013 12:01 AM, Heikki Linnakangas wrote: We could make it mandatory to specify the unit in the value. Ie. throw an error on wal_sender_timeout = 50: ERROR: unit required for option wal_sender_timeout HINT: Valid units for this parameter are ms, s, min, h, and d. Then you wouldn't need a comment to explain what the unit of a naked value is. The only problem I see with that is backwards-compatibility. Old postgresql.conf files containing naked values would no longer work. But all you'd need to do is to add in the units, which would work on older versions too, and would be good for readability anyway. I like this idea with one addition. We should have a default unit for each. For wal_sender_timeout seconds makes sense, but for checkpoint_timeout minutes makes sense (for example). Uh, if specifying the unit is mandatory, what exactly would the default unit mean? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Eliminating PD_ALL_VISIBLE, take 2
On 30.05.2013 06:54, Jeff Davis wrote: Continuation of: http://www.postgresql.org/message-id/1353551097.11440.128.camel@sussancws0025 Rebased patch attached; no other changes. @@ -675,6 +675,16 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, } /* +* If this page is left over from an upgraded system, it may have a +* PD_ALL_VISIBLE bit set (which is deprecated). If so, clear it. +*/ + if (PageIsAllVisible(page)) + { + PageClearAllVisible(page); + MarkBufferDirty(buf); + } + + /* * Prune all HOT-update chains in this page. * * We count tuples removed by the pruning step as removed by VACUUM. That could cause a torn page and checksum failure if checksums are enabled. Actually, I think the later PageClearAllVisible() call later in the function has the same problem, even without this patch. Instead of adding a new vmbuffer argument to heap_insert() and friends, could we put that into BulkInsertStateData? The new argument is similar to the current bulk-insert state in spirit. That would simplify the callers and make the heapam API cleaner. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] units in postgresql.conf comments
On 05/30/2013 01:14 AM, Heikki Linnakangas wrote: On 30.05.2013 10:52, Joshua D. Drake wrote: On 05/30/2013 12:01 AM, Heikki Linnakangas wrote: We could make it mandatory to specify the unit in the value. Ie. throw an error on wal_sender_timeout = 50: ERROR: unit required for option wal_sender_timeout HINT: Valid units for this parameter are ms, s, min, h, and d. Then you wouldn't need a comment to explain what the unit of a naked value is. The only problem I see with that is backwards-compatibility. Old postgresql.conf files containing naked values would no longer work. But all you'd need to do is to add in the units, which would work on older versions too, and would be good for readability anyway. I like this idea with one addition. We should have a default unit for each. For wal_sender_timeout seconds makes sense, but for checkpoint_timeout minutes makes sense (for example). Uh, if specifying the unit is mandatory, what exactly would the default unit mean? Yeah, see my other email. I missed that part. It is late for me. Sorry for the noise. JD - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Eliminating PD_ALL_VISIBLE, take 2
On 30.05.2013 11:26, Heikki Linnakangas wrote: On 30.05.2013 06:54, Jeff Davis wrote: Continuation of: http://www.postgresql.org/message-id/1353551097.11440.128.camel@sussancws0025 Rebased patch attached; no other changes. @@ -675,6 +675,16 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, } /* + * If this page is left over from an upgraded system, it may have a + * PD_ALL_VISIBLE bit set (which is deprecated). If so, clear it. + */ + if (PageIsAllVisible(page)) + { + PageClearAllVisible(page); + MarkBufferDirty(buf); + } + + /* * Prune all HOT-update chains in this page. * * We count tuples removed by the pruning step as removed by VACUUM. That could cause a torn page and checksum failure if checksums are enabled. Come to think of it, even without the torn page checksum issue, do we really want to actively clear the all-visible flags after upgrade? For tables that haven't been changed, and thus have the all-visible bits set, that amounts to a complete rewrite on the first vacuum after upgrade. That's expensive. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
Joe Conway m...@joeconway.com writes: Here's a cleaned up version, which also includes documentation. I'll commit back to 9.1 in a day or two unless there are any objections. Looks good to me. Were you able to test it against an extension containing both rules and views, to check that pg_dump has no problem with the new set of dependencies? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com wrote: FWIW, posix' description about posix_fallocate() doesn't actually say *anything* about reading. The guarantee it makes is: If posix_fallocate() returns successfully, subsequent writes to the specified file data shall not fail due to the lack of free space on the file system storage media.. http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html So we don't even know whether we can read. I think that means we need to zero the file anyway... Surely this is undue pessimism. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/30/13 6:49 AM, Robert Haas wrote: On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com wrote: So we don't even know whether we can read. I think that means we need to zero the file anyway... Surely this is undue pessimism. There have been many occasions where I've found the Linux kernel defining support for POSIX behavior with a NOP stub that basically says we should make this work one day. I don't know whether the fallocate code is one of those or a fully implemented call. Based on that history, until I see a reader that validates the resulting files are good I have to assume they're not. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 2013-05-30 06:49:42 -0400, Robert Haas wrote: On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com wrote: FWIW, posix' description about posix_fallocate() doesn't actually say *anything* about reading. The guarantee it makes is: If posix_fallocate() returns successfully, subsequent writes to the specified file data shall not fail due to the lack of free space on the file system storage media.. http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html So we don't even know whether we can read. I think that means we need to zero the file anyway... Surely this is undue pessimism. Why? The spec doesn't specify that case and that very well allows other behaviour. Glibc sure does behave sensibly and zeroes the data (sysdeps/posix/posix_fallocate64.c for the generic implementation) and so does linux' fallocate() syscall, but that doesn't say much about other implementations. None of the manpages I could find, nor the spec says anything about the file's contents in the extended range. Given there were at least three manpages of different origins that didn't specify that behaviour I am not too optimistic. Why they didn't specify that completely obvious question is hard to understand from my pov. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 2013-05-30 06:55:16 -0400, Greg Smith wrote: On 5/30/13 6:49 AM, Robert Haas wrote: On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com wrote: So we don't even know whether we can read. I think that means we need to zero the file anyway... Surely this is undue pessimism. There have been many occasions where I've found the Linux kernel defining support for POSIX behavior with a NOP stub that basically says we should make this work one day. I don't know whether the fallocate code is one of those or a fully implemented call. Based on that history, until I see a reader that validates the resulting files are good I have to assume they're not. That argument in contrast I find not very convincing though. What was the last incidence of such a system call that did not just error out with ENOTSUPP or such? The linux fallocate call is fully specified for this behaviour and got added 2.6.23, there wasn't a stub before, so I am far less worried about it than about the underspecifiedness of posix_fallocate(). Also, if some system call doesn't follow its documented specifications it's not fully our problem anymore. If we rely on undocumented behaviour though... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] extensible external toast tuple support
Hi, In http://archives.postgresql.org/message-id/20130216164231.GA15069%40awork2.anarazel.de I presented the need for 'indirect' toast tuples which point into memory instead of a toast table. In the comments to that proposal, off-list and in-person talks the wish to make that a more general concept has been voiced. The previous patch used varattrib_1b_e.va_len_1be to discern between different types of external tuples. That obviously only works if the data sizes of all possibly stored datum types are distinct which isn't nice. So what the newer patch now does is to rename that field into 'va_tag' and decide based on that what kind of Datum we have. To get the actual length of that datum there now is a VARTAG_SIZE() macro which maps the tags back to size. To keep on-disk compatibility the size of an external toast tuple containing a varatt_external is used as its tag value. This should allow for fairly easy development of a new compression scheme for out-of-line toast tuples. It will *not* work for compressed inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a problem or that if it is, that it cannot be solved separately. FWIW, in some quick microbenchmarks I couldn't find any performance difference due to the slightly more complex size computation which I do *not* find surprising. Opinions? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 43416ee71033a1bd12bec5e651ff45ea9eeafd56 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Sun, 17 Feb 2013 01:38:17 +0100 Subject: [PATCH] Add support for multiple kinds of external toast datums There are several usecases where our current representation of external toast datums is limiting: * adding new compression schemes * avoidance of repeated detoasting * externally decoded toast tuples For that support 'tags' on external (varattrib_1b_e) varlenas which recoin the current va_len_1be field to store the tag (or type) of a varlena. To determine the actual length a macro VARTAG_SIZE(tag) is added which can be used to map from a tag to the actual length. This patch adds support for 'indirect' tuples which point to some externally allocated memory containing a toast tuple. It also implements the stub for a different compression algorithm. --- src/backend/access/heap/tuptoaster.c | 115 --- src/include/postgres.h | 100 +++--- 2 files changed, 183 insertions(+), 32 deletions(-) diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c index fc37ceb..46c7cf4 100644 --- a/src/backend/access/heap/tuptoaster.c +++ b/src/backend/access/heap/tuptoaster.c @@ -128,7 +128,7 @@ heap_tuple_fetch_attr(struct varlena * attr) struct varlena * heap_tuple_untoast_attr(struct varlena * attr) { - if (VARATT_IS_EXTERNAL(attr)) + if (VARATT_IS_EXTERNAL_OLDSTYLE(attr)) { /* * This is an externally stored datum --- fetch it back from there @@ -145,6 +145,19 @@ heap_tuple_untoast_attr(struct varlena * attr) pfree(tmp); } } + else if (VARATT_IS_EXTERNAL_COMPRESSED(attr)) + { + elog(ERROR, not yet); + } + else if (VARATT_IS_EXTERNAL_INDIRECT(attr)) + { + struct varatt_indirect redirect; + VARATT_EXTERNAL_GET_POINTER(redirect, attr); + attr = (struct varlena *)redirect.pointer; + Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr)); + + attr = heap_tuple_untoast_attr(attr); + } else if (VARATT_IS_COMPRESSED(attr)) { /* @@ -191,7 +204,7 @@ heap_tuple_untoast_attr_slice(struct varlena * attr, char *attrdata; int32 attrsize; - if (VARATT_IS_EXTERNAL(attr)) + if (VARATT_IS_EXTERNAL_OLDSTYLE(attr)) { struct varatt_external toast_pointer; @@ -204,6 +217,13 @@ heap_tuple_untoast_attr_slice(struct varlena * attr, /* fetch it back (compressed marker will get set automatically) */ preslice = toast_fetch_datum(attr); } + else if (VARATT_IS_EXTERNAL_INDIRECT(attr)) + { + struct varatt_indirect redirect; + VARATT_EXTERNAL_GET_POINTER(redirect, attr); + return heap_tuple_untoast_attr_slice(redirect.pointer, + sliceoffset, slicelength); + } else preslice = attr; @@ -267,7 +287,7 @@ toast_raw_datum_size(Datum value) struct varlena *attr = (struct varlena *) DatumGetPointer(value); Size result; - if (VARATT_IS_EXTERNAL(attr)) + if (VARATT_IS_EXTERNAL_OLDSTYLE(attr)) { /* va_rawsize is the size of the original datum -- including header */ struct varatt_external toast_pointer; @@ -275,6 +295,17 @@ toast_raw_datum_size(Datum value) VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr); result = toast_pointer.va_rawsize; } + else if (VARATT_IS_EXTERNAL_COMPRESSED(attr)) + { + elog(ERROR, not yet); + } + else if (VARATT_IS_EXTERNAL_INDIRECT(attr)) + { + struct varatt_indirect toast_pointer; + + VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr); + return
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/30/13 7:17 AM, Andres Freund wrote: That argument in contrast I find not very convincing though. What was the last incidence of such a system call that did not just error out with ENOTSUPP or such? http://linux.die.net/man/2/posix_fadvise talks about POSIX_FADV_NOREUSE and POSIX_FADV_WILLNEED being both buggy and quietly mapped to a no-op, depending on your version. I know there were more examples than just that one that popped up during the testing of effective_io_concurrency. My starting position has to assume that posix_fallocate can have the same sort of surprising behavior that showed up repeatedly when we were trying to use posix_fadvise more aggressively. The way O_SYNC was quietly mapped to O_DSYNC (which isn't the same thing) was a similar issue, and that's the first one that left me forever skeptical of Linux kernel claims in this area until they are explicitly validated: http://lwn.net/Articles/350225/ -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 2013-05-30 07:48:51 -0400, Greg Smith wrote: On 5/30/13 7:17 AM, Andres Freund wrote: That argument in contrast I find not very convincing though. What was the last incidence of such a system call that did not just error out with ENOTSUPP or such? http://linux.die.net/man/2/posix_fadvise talks about POSIX_FADV_NOREUSE and POSIX_FADV_WILLNEED being both buggy and quietly mapped to a no-op, depending on your version. I know there were more examples than just that one that popped up during the testing of effective_io_concurrency. My starting position has to assume that posix_fallocate can have the same sort of surprising behavior that showed up repeatedly when we were trying to use posix_fadvise more aggressively. Uh. How is that a correctness problem? fadvise is a hint which is pretty different from a fallocate where ignoring would have way much more severe consequences. I don't think that's a very meaningful comparison. The way O_SYNC was quietly mapped to O_DSYNC (which isn't the same thing) was a similar issue, and that's the first one that left me forever skeptical of Linux kernel claims in this area until they are explicitly validated: http://lwn.net/Articles/350225/ Yea, but that mistake is literally decades old... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] removing PD_ALL_VISIBLE
On Thu, May 30, 2013 at 12:06 AM, Jeff Davis pg...@j-davis.com wrote: AFAICS, the main benefit of eliminating PD_ALL_VISIBLE is that we eliminate one write cycle; that is, we won't dirty the page once to hint it and then again to mark it all-visible. But as of 9.3, that should really only be a problem in the insert-only case. And in that case, my proposal to consider all-visible pages as frozen would be a huge win, because you'd only need to emit XLOG_HEAP_VISIBLE for every page in the heap, rather than XLOG_HEAP_FREEZE. Agreed. Just to quantify that a bit more, I ran this command a couple of times: dropdb rhaas ; sleep 5 ; createdb ; sleep 5 ; pgbench -i -s 1000 -n; sleep 5 ; time psql -c checkpoint ; time psql -c 'vacuum' And also this one: dropdb rhaas ; sleep 5 ; createdb ; sleep 5 ; pgbench -i -s 1000 -n; sleep 5 ; time psql -c checkpoint ; time psql -c 'vacuum freeze' In the first one, the vacuum at the end takes about 25 seconds. In the second one, it takes about 15 minutes, during which time there's one CPU core running at about 10%; the remainder of the time is spent waiting for disk I/O. A little follow-up testing shows that the vacuum emits 88MB of WAL, while the vacuum freeze emits 13GB of WAL. This is on the 16-core, 64-thread IBM POWER box with the following non-default configuration settings: shared_buffers = 8GB maintenance_work_mem = 1GB synchronous_commit = off checkpoint_segments = 300 checkpoint_timeout = 15min checkpoint_completion_target = 0.9 log_line_prefix = '%t [%p] ' Andres' proposal for freezing at the same time we mark pages all-visible relies on emitting FPIs when we mark pages all-visible, but I hope that the test above is convincing evidence that it would be *really* expensive for some users. My proposal to consider all-visible pages as frozen avoids that cost, but as far as I can see, it also requires PD_ALL_VISIBLE to stick around. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/30/13 7:52 AM, Andres Freund wrote: fadvise is a hint which is pretty different from a fallocate where ignoring would have way much more severe consequences. Yes, it will. That's why I want to see it tested. There is more than enough past examples of bad behavior here to be skeptical that this sort of API may not work exactly as specified. If you're willing to believe the spec, that's fine, but I think that's dangerously optimistic. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Thu, May 30, 2013 at 7:13 AM, Andres Freund and...@2ndquadrant.com wrote: Surely this is undue pessimism. Why? The spec doesn't specify that case and that very well allows other behaviour. Glibc sure does behave sensibly and zeroes the data (sysdeps/posix/posix_fallocate64.c for the generic implementation) and so does linux' fallocate() syscall, but that doesn't say much about other implementations. None of the manpages I could find, nor the spec says anything about the file's contents in the extended range. Given there were at least three manpages of different origins that didn't specify that behaviour I am not too optimistic. Why they didn't specify that completely obvious question is hard to understand from my pov. I think they didn't specify it because it IS obvious. As Stephen says, it's been understood for decades that allowing unzeroed pages to be reallocated to some other file is a major security hole. I think we can assume that no credible OS does that. If there's some OS out there that chooses to fill the pre-extended pages with 0x55 or cat /dev/urandom instead of 0x00, they probably deserve what they get. It's hard for me to be believe that anything that silly actually exists. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] removing PD_ALL_VISIBLE
On 2013-05-30 07:54:38 -0400, Robert Haas wrote: On Thu, May 30, 2013 at 12:06 AM, Jeff Davis pg...@j-davis.com wrote: AFAICS, the main benefit of eliminating PD_ALL_VISIBLE is that we eliminate one write cycle; that is, we won't dirty the page once to hint it and then again to mark it all-visible. But as of 9.3, that should really only be a problem in the insert-only case. And in that case, my proposal to consider all-visible pages as frozen would be a huge win, because you'd only need to emit XLOG_HEAP_VISIBLE for every page in the heap, rather than XLOG_HEAP_FREEZE. Agreed. Just to quantify that a bit more, I ran this command a couple of times: dropdb rhaas ; sleep 5 ; createdb ; sleep 5 ; pgbench -i -s 1000 -n; sleep 5 ; time psql -c checkpoint ; time psql -c 'vacuum' And also this one: dropdb rhaas ; sleep 5 ; createdb ; sleep 5 ; pgbench -i -s 1000 -n; sleep 5 ; time psql -c checkpoint ; time psql -c 'vacuum freeze' In the first one, the vacuum at the end takes about 25 seconds. In the second one, it takes about 15 minutes, during which time there's one CPU core running at about 10%; the remainder of the time is spent waiting for disk I/O. A little follow-up testing shows that the vacuum emits 88MB of WAL, while the vacuum freeze emits 13GB of WAL. This is on the 16-core, 64-thread IBM POWER box with the following non-default configuration settings: shared_buffers = 8GB maintenance_work_mem = 1GB synchronous_commit = off checkpoint_segments = 300 checkpoint_timeout = 15min checkpoint_completion_target = 0.9 log_line_prefix = '%t [%p] ' Andres' proposal for freezing at the same time we mark pages all-visible relies on emitting FPIs when we mark pages all-visible, but I hope that the test above is convincing evidence that it would be *really* expensive for some users. My proposal to consider all-visible pages as frozen avoids that cost I think I basically suggested treating all visible as frozen, didn't I? If not, I had lost sync between my fingers and my thoughts which happens too often ;). You had noticed that my proposed was lacking a bit around when we omit FPIs for the page while setting all-visible, but we both thought that we may find a workaround that - which looking at the page level flag first basically is. As far as I understand the trick basically is that we can rely on an FPI being logged when an action unsetting ALL_VISIBLE is performed. That all-visible would then make sure the hint-bits marking indvidual tuples as frozen would hit disk. For that we need to add some more work though, consider: 1) write tuples on a page 2) freeze page by setting ALL_VISIBLE and setting hint bits. Setting ALL_VISIBLE is wall logged 3) crash 4) replay ALL_VISIBLE, set it on the page level. The individual tuples are *not* guaranteed to be marked frozen. 5) update tuple on the page unsetting all visible. Emits an FPI which does *not* have the tuples marked as frozen. Easy enough and fairly cheap to fix by having a function that checks that updates the hint bits on a page when unsetting all visible since we can just set it for all pre-existing tuples. but as far as I can see, it also requires PD_ALL_VISIBLE to stick around. Now, I am far from being convinced its a good idea to get rid of PD_ALL_VISIBLE, but I don't think it does. Except that it currently is legal for the page level ALL_VISIBLE being set while the corresponding visibilitymap one isn't there's not much prohibiting us fundamentally from looking in the vm when we need to know whether the page is all visible, is there? To the contrary, this actually seems to be a pretty good case for Jeff's proposed behaviour since it would allow freezing while only writing the vm? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 2013-05-30 08:02:56 -0400, Robert Haas wrote: On Thu, May 30, 2013 at 7:13 AM, Andres Freund and...@2ndquadrant.com wrote: Surely this is undue pessimism. Why? The spec doesn't specify that case and that very well allows other behaviour. Glibc sure does behave sensibly and zeroes the data (sysdeps/posix/posix_fallocate64.c for the generic implementation) and so does linux' fallocate() syscall, but that doesn't say much about other implementations. None of the manpages I could find, nor the spec says anything about the file's contents in the extended range. Given there were at least three manpages of different origins that didn't specify that behaviour I am not too optimistic. Why they didn't specify that completely obvious question is hard to understand from my pov. I think they didn't specify it because it IS obvious. As Stephen says, it's been understood for decades that allowing unzeroed pages to be reallocated to some other file is a major security hole. I think we can assume that no credible OS does that. If there's some OS out there that chooses to fill the pre-extended pages with 0x55 or cat /dev/urandom instead of 0x00, they probably deserve what they get. It's hard for me to be believe that anything that silly actually exists. I don't think there's much danger of getting uninitialized data or such. That clearly would be insane. I think somebody might interpret it as read(2) returning an error until the page has been written to which isn't completely crazy. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/30/13 7:13 AM, Andres Freund wrote: Why? The spec doesn't specify that case and that very well allows other behaviour. Glibc sure does behave sensibly and zeroes the data (sysdeps/posix/posix_fallocate64.c for the generic implementation) and so does linux' fallocate() syscall, but that doesn't say much about other implementations. glibc actually only writes one byte to every file system block, to make sure the block is allocated. It doesn't actually zero every byte. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/30/13 8:02 AM, Robert Haas wrote: If there's some OS out there that chooses to fill the pre-extended pages with 0x55 or cat /dev/urandom instead of 0x00, they probably deserve what they get. Even that wouldn't be a problem for our purpose. The only problem would be if you can't read from the allocated region at all. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 2013-05-30 08:19:17 -0400, Peter Eisentraut wrote: On 5/30/13 8:02 AM, Robert Haas wrote: If there's some OS out there that chooses to fill the pre-extended pages with 0x55 or cat /dev/urandom instead of 0x00, they probably deserve what they get. Even that wouldn't be a problem for our purpose. The only problem would be if you can't read from the allocated region at all. Well, only as long as we only use it for preallocation of wal files. I am much, much more interested in doing that for the heap. And there that surely would be a problem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 2013-05-30 08:17:28 -0400, Peter Eisentraut wrote: On 5/30/13 7:13 AM, Andres Freund wrote: Why? The spec doesn't specify that case and that very well allows other behaviour. Glibc sure does behave sensibly and zeroes the data (sysdeps/posix/posix_fallocate64.c for the generic implementation) and so does linux' fallocate() syscall, but that doesn't say much about other implementations. glibc actually only writes one byte to every file system block, to make sure the block is allocated. It doesn't actually zero every byte. Which is fine since that guarantees we can read from those areas... And unless I misremember something that actually guarantees that the rest of the data is initialized to zero as well. Yes: subsequent reads of data in the gap shall return bytes with the value 0 until data is actually written into the gap. But really, I am not at all concerned about some obscure values being returned, but about a read() not being successful.. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PostgreSQL 9.3 beta breaks some extensions make install
On 05/29/2013 06:08 PM, Cédric Villemain wrote: I just took time to inspect our contribs, USE_PGXS is not supported by all of them atm because of SHLIB_PREREQS (it used submake) I have a patch pending here to fix that. Once all our contribs can build with USE_PGXS I fix the VPATH. I've added 'most' of the patches to the commitfest... (I am not sure it is required, as this is more bugfix than anything else IMHO) See https://commitfest.postgresql.org/action/patch_view?id=1122 https://commitfest.postgresql.org/action/patch_view?id=1123 https://commitfest.postgresql.org/action/patch_view?id=1124 I stopped trying to add new item after too many failures from https://commitfest.postgresql.org/action/patch_form So one patch is not in the commitfest yet (fix_install_ext_vpath.patch) failures? what kind of issues did you experience? Stefan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] removing PD_ALL_VISIBLE
On 30.05.2013 15:12, Andres Freund wrote: Now, I am far from being convinced its a good idea to get rid of PD_ALL_VISIBLE, but I don't think it does. Except that it currently is legal for the page level ALL_VISIBLE being set while the corresponding visibilitymap one isn't there's not much prohibiting us fundamentally from looking in the vm when we need to know whether the page is all visible, is there? Hmm, so you're suggesting that the visibility map would be *required* to interpret the pages correctly. Ie. if you just zapped the visibility map, you'd lose critical information and the heap would appear to be corrupt. I guess that's possible, but it makes me quite uneasy. At the moment, it's relieving to know that it's always safe to just truncate the visibility map in case of emergency. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] removing PD_ALL_VISIBLE
On 2013-05-30 15:34:04 +0300, Heikki Linnakangas wrote: On 30.05.2013 15:12, Andres Freund wrote: Now, I am far from being convinced its a good idea to get rid of PD_ALL_VISIBLE, but I don't think it does. Except that it currently is legal for the page level ALL_VISIBLE being set while the corresponding visibilitymap one isn't there's not much prohibiting us fundamentally from looking in the vm when we need to know whether the page is all visible, is there? Hmm, so you're suggesting that the visibility map would be *required* to interpret the pages correctly. Ie. if you just zapped the visibility map, you'd lose critical information and the heap would appear to be corrupt. I guess that's possible, but it makes me quite uneasy. At the moment, it's relieving to know that it's always safe to just truncate the visibility map in case of emergency. I didn't say its a good idea, just that I don't think Robert's conclusion is necessarily valid. But requiring only the few kbytes of the vm to be written instead of all of the heap during freezeing (or whatever we would call it) has quite some allure, I admit that. But I think that should be a separate project to reeingineering how freezing works. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] pg_upgrade -u
On May 29, 2013, at 11:07 AM, Bruce Momjian wrote: On Wed, May 29, 2013 at 08:59:42AM -0400, Ray Stell wrote: [ moved to hacker ] The question is whether hard-wiring these helps more than it hurts, and which ones should be hard-wired. I seems to me that superuser is exactly that special case and that if an alternate superuser is hardwired in the src cluster then -u/-U and that specific value will be required on both sides of pg_upgrade, no variability is needed and perhaps not possible. You're point is well taken for port. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
* Peter Eisentraut (pete...@gmx.net) wrote: On 5/30/13 7:13 AM, Andres Freund wrote: Why? The spec doesn't specify that case and that very well allows other behaviour. Glibc sure does behave sensibly and zeroes the data (sysdeps/posix/posix_fallocate64.c for the generic implementation) and so does linux' fallocate() syscall, but that doesn't say much about other implementations. glibc actually only writes one byte to every file system block, to make sure the block is allocated. It doesn't actually zero every byte. That goes back to the 'sane implementation' question.. Is there a case where that would actually be different from writing zeros for the entire block..? Is there some OS that gives you random data for the 'hole' when you write a byte, seek to the start of the next block and then write another byte? That actually *would* be against what's documented and required by spec, no? Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
* Andres Freund (and...@2ndquadrant.com) wrote: On 2013-05-30 08:19:17 -0400, Peter Eisentraut wrote: On 5/30/13 8:02 AM, Robert Haas wrote: If there's some OS out there that chooses to fill the pre-extended pages with 0x55 or cat /dev/urandom instead of 0x00, they probably deserve what they get. Even that wouldn't be a problem for our purpose. The only problem would be if you can't read from the allocated region at all. Well, only as long as we only use it for preallocation of wal files. I am much, much more interested in doing that for the heap. And there that surely would be a problem. Yes, that was my thinking as well. If posix_fallocate is faster than writing out 8K of zeros, and the block can immediately be read as if it had actually been written to, then I'd be very interested in using it to extend heap files. As I mentioned in this thread (or perhaps it was another), I don't think this solves the locking issue around the relation extention lock, but it might help some and it may make it easier to tweak that logic to allocate larger chunks in the future. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
* Andres Freund (and...@2ndquadrant.com) wrote: But really, I am not at all concerned about some obscure values being returned, but about a read() not being successful.. Alright, so what do we need to do to test this? We really just need a short C program written up and then a bunch of folks to run it on various architectures, right? Gee, sounds like what the buildfarm was made for (alright, alright, PostgreSQL isn't exactly a 'short C program', but you get the idea). As I recall, Andrew reworked the buildfarm code to be more modular too.. Anyone have thoughts about how we could run these kinds of tests with it? Or do people think that's a bad idea? Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 2013-05-30 08:53:37 -0400, Stephen Frost wrote: * Andres Freund (and...@2ndquadrant.com) wrote: But really, I am not at all concerned about some obscure values being returned, but about a read() not being successful.. Alright, so what do we need to do to test this? We really just need a short C program written up and then a bunch of folks to run it on various architectures, right? After a bit of standard perusing writing a single byte to the end of the file after the fallocate ought to make at least the reading guaranteed to be defined. If we did seek(last_byte); write(); posix_fallocate() we should even always have defined content. Yuck. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
* Andres Freund (and...@2ndquadrant.com) wrote: After a bit of standard perusing writing a single byte to the end of the file after the fallocate ought to make at least the reading guaranteed to be defined. If we did seek(last_byte); write(); posix_fallocate() we should even always have defined content. Yuck. Alright, but would that actually be any better than just doing what glibc's posix_fallocate() does in the generic case? And, to be honest, it makes me a bit nervous to seek/write like that because it looks like the typical create a hole setup, which we certainly aren't intending, yet if the posix_fallocate() call disappeared, or did nothing, or this code was copied w/o it, or someone didn't understand what it did, we could end up with that. Not a fan. :( Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] PostgreSQL 9.3 beta breaks some extensions make install
Le jeudi 30 mai 2013 14:32:46, Stefan Kaltenbrunner a écrit : On 05/29/2013 06:08 PM, Cédric Villemain wrote: I just took time to inspect our contribs, USE_PGXS is not supported by all of them atm because of SHLIB_PREREQS (it used submake) I have a patch pending here to fix that. Once all our contribs can build with USE_PGXS I fix the VPATH. I've added 'most' of the patches to the commitfest... (I am not sure it is required, as this is more bugfix than anything else IMHO) See https://commitfest.postgresql.org/action/patch_view?id=1122 https://commitfest.postgresql.org/action/patch_view?id=1123 https://commitfest.postgresql.org/action/patch_view?id=1124 I stopped trying to add new item after too many failures from https://commitfest.postgresql.org/action/patch_form So one patch is not in the commitfest yet (fix_install_ext_vpath.patch) failures? what kind of issues did you experience? I didn't sent too much details as I am not sure if it was my setup which breaks things or not. [...] Just tested with Stephen, looks like a problem with something on my side, sorry for the noise. (rekonq 0 - chromium 1) -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation signature.asc Description: This is a digitally signed message part.
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/30/13 8:50 AM, Stephen Frost wrote: I don't think this solves the locking issue around the relation extention lock, but it might help some and it may make it easier to tweak that logic to allocate larger chunks in the future. It might make it a bit faster, but it doesn't make it any easier to implement. The messy part of extending relations in larger chunks is how to communicate that back into the buffer manager usefully. The extension path causing trouble is RelationGetBufferForTuple calling ReadBufferBI. All of that is passing a single buffer around. There's no simple way I can see to rewrite it to handle more than one at a time. I have a test case for relation extension that I'm going to package up soon. That makes it easy to see where the problem is at. Far as I can tell the reason it hasn't been fixed before now is that it's a pain to write the code. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Freezing without write I/O
Since we're bashing around ideas around freezing, let me write down the idea I've been pondering and discussing with various people for years. I don't think I invented this myself, apologies to whoever did for not giving credit. The reason we have to freeze is that otherwise our 32-bit XIDs wrap around and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but that would waste a lot space. The trick is to add a field to the page header indicating the 'epoch' of the XID, while keeping the XIDs in tuple header 32-bit wide (*). The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we could implicitly treat anything older than relfrozenxid as frozen. That's the basic idea. Vacuum freeze only needs to remove dead tuples, but doesn't need to dirty pages that contain no dead tuples. Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to replace the XIDs with FrozenXid whenever the difference between the smallest and largest XID on a page exceeds 2^31. But that would only happen when you're updating the page, in which case the page is dirtied anyway, so it wouldn't cause any extra I/O. This would also be the first step in allowing the clog to grow larger than 2 billion transactions, eliminating the need for anti-wraparound freezing altogether. You'd still want to truncate the clog eventually, but it would be nice to not be pressed against the wall with run vacuum freeze now, or the system will shut down. (*) Adding an epoch is inaccurate, but I like to use that as my mental model. If you just add a 32-bit epoch field, then you cannot have xids from different epochs on the page, which would be a problem. In reality, you would store one 64-bit XID value in the page header, and use that as the reference point for all the 32-bit XIDs on the tuples. See existing convert_txid() function for how that works. Another method is to store the 32-bit xid values in tuple headers as offsets from the per-page 64-bit value, but then you'd always need to have the 64-bit value at hand when interpreting the XIDs, even if they're all recent. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Thu, May 30, 2013 at 8:14 AM, Andres Freund and...@2ndquadrant.com wrote: I don't think there's much danger of getting uninitialized data or such. That clearly would be insane. I think somebody might interpret it as read(2) returning an error until the page has been written to which isn't completely crazy. In the absence of tangible evidence of some implementation that behaves that way, I think that's just paranoia. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] removing PD_ALL_VISIBLE
On Thu, May 30, 2013 at 8:12 AM, Andres Freund and...@2ndquadrant.com wrote: As far as I understand the trick basically is that we can rely on an FPI being logged when an action unsetting ALL_VISIBLE is performed. That all-visible would then make sure the hint-bits marking indvidual tuples as frozen would hit disk. For that we need to add some more work though, consider: 1) write tuples on a page 2) freeze page by setting ALL_VISIBLE and setting hint bits. Setting ALL_VISIBLE is wall logged 3) crash 4) replay ALL_VISIBLE, set it on the page level. The individual tuples are *not* guaranteed to be marked frozen. 5) update tuple on the page unsetting all visible. Emits an FPI which does *not* have the tuples marked as frozen. Easy enough and fairly cheap to fix by having a function that checks that updates the hint bits on a page when unsetting all visible since we can just set it for all pre-existing tuples. Basically, yes, though I would say infomask bits rather than hint bits, since not all of them are only hints, and this case would not be merely a hint. but as far as I can see, it also requires PD_ALL_VISIBLE to stick around. Now, I am far from being convinced its a good idea to get rid of PD_ALL_VISIBLE, but I don't think it does. Except that it currently is legal for the page level ALL_VISIBLE being set while the corresponding visibilitymap one isn't there's not much prohibiting us fundamentally from looking in the vm when we need to know whether the page is all visible, is there? To the contrary, this actually seems to be a pretty good case for Jeff's proposed behaviour since it would allow freezing while only writing the vm? Well, as Heikki points out, I think that's unacceptably dangerous. Loss or corruption of a single visibility map page means possible loss of half a gigabyte of data. Also, if we go that route, looking at the visibility map is no longer an optimization; it's essential for correctness. We can't decide to skip it when it seems expensive, for example, as Jeff was proposing. There's another thing that's bothering me about this whole discussion, too. If looking at another page for the information we need to make visibility decisions is so cheap that we need not be concerned with it, then why do we need hint bits? I realize that it's not quite the same thing, because CLOG doesn't have as much locality of access as the visibility map; you're guaranteed to find all the information you need for a single heap page on a single VM page. Also, CLOG is per-tuple, not per-page, and we get a decent speed-up from checking once for the whole page rather than for each tuple. Nonetheless, I think the contrast between Jeff's tests, which aren't showing much impact from the increased visibility map traffic, and previous hint-bit removal tests, which have crashed and burned, may be caused in part by the fact that our algorithms and locking regimen for shared_buffers are much more sophisticated than for SLRU. I'm not eager to have our design decisions driven by that gap. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 05/30/2013 02:02 AM, Dimitri Fontaine wrote: Joe Conway m...@joeconway.com writes: Here's a cleaned up version, which also includes documentation. I'll commit back to 9.1 in a day or two unless there are any objections. Looks good to me. Were you able to test it against an extension containing both rules and views, to check that pg_dump has no problem with the new set of dependencies? PostGIS has both: test=# \dv List of relations Schema | Name| Type | Owner - +---+--+-- public | geography_columns | view | postgres public | geometry_columns | view | postgres public | raster_columns| view | postgres public | raster_overviews | view | postgres (4 rows) select tablename, rulename from pg_rules ; tablename |rulename - --+- pg_settings | pg_settings_n pg_settings | pg_settings_u geometry_columns | geometry_columns_delete geometry_columns | geometry_columns_update geometry_columns | geometry_columns_insert (5 rows) 8 # pg_dump test /tmp/test-01.dmp # dropdb test # createdb test # psql test /tmp/test-01.dmp SET SET SET SET SET SET CREATE EXTENSION COMMENT CREATE EXTENSION COMMENT SET REVOKE REVOKE GRANT GRANT # pg_dump test /tmp/test-02.dmp # diff /tmp/test-01.dmp /tmp/test-02.dmp ...no differences... 8 Joe - -- Joe Conway credativ LLC: http://www.credativ.us Linux, PostgreSQL, and general Open Source Training, Service, Consulting, 24x7 Support -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQIcBAEBAgAGBQJRp1ykAAoJEDfy90M199hlLQcP/1OkpFeb99EO9xca0RD+WIHS FrqhBEJDHA4ujODvitZRMTFjpS1WH4Difm7P05Lvbr1xEUmwuSD6oBw/VQ1p6cxs RyIvUM1uLVhR/nwjMeymner9kOINPu4rBVKf+7EgPJQcFvZuUSzafNGH1l70p6wk dXMA2ggjjFdvF6voVxaKkHFbs+uttURNDZ2l0f6eb4QJRZta+NuFCOtIkPTqBESx oABWuoAutAEX0Z7b0iEyNjjDLduPzjMIqQm8Y6NfsGmkEYd2jrpYVl04T8hcbqf0 vFJ2NvblvuaHoRIhq/ZYbFt9dQKIdoUtNuzR8MOK474mD/VrX6v/xquGh1rcmL4x 1k94Lis3Cf/QEBUEqwKNribkOLemaxEEDVnVTCWSC59FoDbaAZuiwtYspHYwAQED D3nZ9jknEr+Bziqw6y8KP3wQGAbyssIKdtXFlw2u1BFeFjuWK5pL8vR3vi08j/Ij diycCOBLotJGlkaHEt7vCNMTbHlIru4d4yblh0hbB6wL6JvI2HbGlK5chPPqIu+O zHpPGUuTy7lgi+0809k5ceoqYUDJJTo0yu/3BuvLeaZwJqfS9QBIjCdryb/0MCVn QJ6u3r54aSz4FQHP8iDoDnfbZIAdpCtjlqiTxLARxxYZqWt9nHoW0bC9fZpTkBfT YxJX5C74NCVHE2Qdqnqx =fbuL -END PGP SIGNATURE- -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries
On Wed, May 29, 2013 at 10:51 PM, Amit Langote amitlangot...@gmail.com wrote: So, it appears, for search strings consisting of 2 (or 3) characters, trigrams can not be utilized. No? I think that's right. trigram means a sequence of three characters, and what's stored in the indexes are three-character sequences from the original text. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries
On Thu, May 30, 2013 at 11:47 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, May 29, 2013 at 10:51 PM, Amit Langote amitlangot...@gmail.com wrote: So, it appears, for search strings consisting of 2 (or 3) characters, trigrams can not be utilized. No? I think that's right. trigram means a sequence of three characters, and what's stored in the indexes are three-character sequences from the original text. Was there any improvement to pg_trgm in recent past that could make it better for partial matching (the case in question I suppose) or is partial-matching a different thing altogether? -- Amit Langote -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
Joe Conway m...@joeconway.com writes: Were you able to test it against an extension containing both rules and views, to check that pg_dump has no problem with the new set of dependencies? PostGIS has both: [...] # pg_dump test /tmp/test-02.dmp # diff /tmp/test-01.dmp /tmp/test-02.dmp ...no differences... Perfect, thanks! -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
Greg Smith escribió: The messy part of extending relations in larger chunks is how to communicate that back into the buffer manager usefully. The extension path causing trouble is RelationGetBufferForTuple calling ReadBufferBI. All of that is passing a single buffer around. There's no simple way I can see to rewrite it to handle more than one at a time. No, but we can have it create several pages and insert them into the FSM. So they aren't returned to the original caller but are available to future users. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/30/13 11:21 AM, Alvaro Herrera wrote: Greg Smith escribió: The messy part of extending relations in larger chunks is how to communicate that back into the buffer manager usefully. The extension path causing trouble is RelationGetBufferForTuple calling ReadBufferBI. All of that is passing a single buffer around. There's no simple way I can see to rewrite it to handle more than one at a time. No, but we can have it create several pages and insert them into the FSM. So they aren't returned to the original caller but are available to future users. There's actually a code comment wondering about this topic for the pages that are already created, in src/backend/access/heap/hio.c : Remember the new page as our target for future insertions. XXX should we enter the new page into the free space map immediately, or just keep it for this backend's exclusive use in the short run (until VACUUM sees it)? Seems to depend on whether you expect the current backend to make more insertions or not, which is probably a good bet most of the time. So for now, don't add it to FSM yet. We have to be careful about touching too much at that particular point, because it's holding a relation extension lock at the obvious spot to make a change. There's an interesting overlap with these questions about how files are extended too, with this comment in that file too, just before the above: XXX This does an lseek - rather expensive - but at the moment it is the only way to accurately determine how many blocks are in a relation. Is it worth keeping an accurate file length in shared memory someplace, rather than relying on the kernel to do it for us? That whole sequence of code took the easy way forward when it was written, but it's obvious the harder one (also touching the FSM) was considered even then. The whole sequence needs to be revisited to pull off multiple page extension. I wouldn't say it's hard, but it's enough work that I haven't been able to find a block of time to go through the whole thing. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries
On Thu, May 30, 2013 at 11:00 AM, Amit Langote amitlangot...@gmail.com wrote: Was there any improvement to pg_trgm in recent past that could make it better for partial matching (the case in question I suppose) or is partial-matching a different thing altogether? Sorry, I am not sure. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries
2013/5/31 Amit Langote amitlangot...@gmail.com On Thu, May 30, 2013 at 11:47 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, May 29, 2013 at 10:51 PM, Amit Langote amitlangot...@gmail.com wrote: So, it appears, for search strings consisting of 2 (or 3) characters, trigrams can not be utilized. No? I think that's right. trigram means a sequence of three characters, and what's stored in the indexes are three-character sequences from the original text. Was there any improvement to pg_trgm in recent past that could make it better for partial matching (the case in question I suppose) or is partial-matching a different thing altogether? Hi Amit, following emails are discussed about partial match of pg_trgm. I hope will this help. http://www.postgresql.org/message-id/cahgqgwfjshvv2ngme19wdtw9tefw_w7h2ns4e+yysjkb9wd...@mail.gmail.com as you may know, if search string contains multibyte characters trigram key is converted to CRC of 4 byte and it is used as key. (but only use upper 3 byte from CRC) so we can do partial matching if KEEPONLYALNUM is enabled. Regards, Masahiko Sawada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] 9.3: Empty arrays returned by array_remove()
Testing 9.3beta, it seems that array_remove() may return an empty 1-d array whose upper bound is lower than its lower bound. I know that we discussed allowing this kind of array, but I don't think that discussion reached any conclusion, other than to agree that the current empty 0-d array behaviour would be kept in 9.3. I don't think it's intentional, but the current code in array_remove() can return something like this: SELECT array_dims(array_remove(array[1], 1)); array_dims [1:0] (1 row) and so the resulting empty 1-d array won't compare as equal to the usual 0-d empty array: SELECT array_remove(array[1], 1) = '{}'; ?column? -- f (1 row) The LHS is effectively '[1:0]={}', but we don't currently allow that syntax, so I don't think we should be returning it (it wouldn't survive a dump/restore, for example). Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Eliminating PD_ALL_VISIBLE, take 2
On Thu, 2013-05-30 at 11:32 +0300, Heikki Linnakangas wrote: That could cause a torn page and checksum failure if checksums are enabled. Thank you, I missed that in the rebase; it should be MarkBufferDirtyHint(). Come to think of it, even without the torn page checksum issue, do we really want to actively clear the all-visible flags after upgrade? For tables that haven't been changed, and thus have the all-visible bits set, that amounts to a complete rewrite on the first vacuum after upgrade. That's expensive. I expected that question and intended to raise it for discussion when and if we get past performance concerns (I believe Robert is still not convinced that the patch is viable). We have a few options: We can ignore the bit entirely, or we can aggressively unset it, or we can opportunistically unset it if the page is already dirty. I don't think we're in a hurry to reuse that bit for something else, so maybe it's best to just ignore it entirely. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
Heikki, This sounds a lot like my idea for 9.3, which didn't go anywhere. You've worked out the issues I couldn't, I think. Another method is to store the 32-bit xid values in tuple headers as offsets from the per-page 64-bit value, but then you'd always need to have the 64-bit value at hand when interpreting the XIDs, even if they're all recent. Yeah, -1 on the latter, not least because it would require a 100% rewrite of the tables in order to upgrade. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The reason we have to freeze is that otherwise our 32-bit XIDs wrap around and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but that would waste a lot space. The trick is to add a field to the page header indicating the 'epoch' of the XID, while keeping the XIDs in tuple header 32-bit wide (*). Check. The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we could implicitly treat anything older than relfrozenxid as frozen. Check. That's the basic idea. Vacuum freeze only needs to remove dead tuples, but doesn't need to dirty pages that contain no dead tuples. Check. Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to replace the XIDs with FrozenXid whenever the difference between the smallest and largest XID on a page exceeds 2^31. But that would only happen when you're updating the page, in which case the page is dirtied anyway, so it wouldn't cause any extra I/O. It would cause some extra WAL activity, but it wouldn't dirty the page an extra time. This would also be the first step in allowing the clog to grow larger than 2 billion transactions, eliminating the need for anti-wraparound freezing altogether. You'd still want to truncate the clog eventually, but it would be nice to not be pressed against the wall with run vacuum freeze now, or the system will shut down. Interesting. That seems like a major advantage. (*) Adding an epoch is inaccurate, but I like to use that as my mental model. If you just add a 32-bit epoch field, then you cannot have xids from different epochs on the page, which would be a problem. In reality, you would store one 64-bit XID value in the page header, and use that as the reference point for all the 32-bit XIDs on the tuples. See existing convert_txid() function for how that works. Another method is to store the 32-bit xid values in tuple headers as offsets from the per-page 64-bit value, but then you'd always need to have the 64-bit value at hand when interpreting the XIDs, even if they're all recent. As I see it, the main downsides of this approach are: (1) It breaks binary compatibility (unless you do something to provided for it, like put the epoch in the special space). (2) It consumes 8 bytes per page. I think it would be possible to get this down to say 5 bytes per page pretty easily; we'd simply decide that the low-order 3 bytes of the reference XID must always be 0. Possibly you could even do with 4 bytes, or 4 bytes plus some number of extra bits. (3) You still need to periodically scan the entire relation, or else have a freeze map as Simon and Josh suggested. The upsides of this approach as compared with what Andres and I are proposing are: (1) It provides a stepping stone towards allowing indefinite expansion of CLOG, which is quite appealing as an alternative to a hard shut-down. (2) It doesn't place any particular requirements on PD_ALL_VISIBLE. I don't personally find this much of a benefit as I want to keep PD_ALL_VISIBLE, but I know Jeff and perhaps others disagree. Random thought: Could you compute the reference XID based on the page LSN? That would eliminate the storage overhead. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On Thu, May 30, 2013 at 1:39 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The reason we have to freeze is that otherwise our 32-bit XIDs wrap around and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but that would waste a lot space. The trick is to add a field to the page header indicating the 'epoch' of the XID, while keeping the XIDs in tuple header 32-bit wide (*). (3) You still need to periodically scan the entire relation, or else have a freeze map as Simon and Josh suggested. Why is this scan required? Also, what happens if you delete a tuple on a page when another tuple on the same page with age 2^32 that is still in an open transaction? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Vacuum, Freeze and Analyze: the big picture
Folks, There's currently some great ideas bouncing around about eliminating the overhead associated with FREEZE. However, I wanted to take a step back and take a look at the big picture for VACUUM, FREEZE and ANALYZE. Otherwise, we're liable to repeat the 8.4 problem of making one operation better (background vacuum) while making another one worse (freezing). The big, big picture is this: 90% of our users need to think about VACUUM/ANALYZE at least 10% of the time and 10% of our users need to think about it almost 90% of the time. That's considerably better than was the case 5 years ago, when vacuum management was a daily or weekly responsibility for nearly 100% of our users, but it's still not good enough. Our target should be that only those with really unusual setups should have to *ever* think about vacuum and analyze. So I've set up a wiki page to document the various problems that force users to think about vacuum and analyze and try to troubleshoot it: https://wiki.postgresql.org/wiki/VacuumHeadaches We can also collect suggested solutions here. I'm looking to create a long-term development target which removes most of these vacuum headaches over the next 3 or 4 releases, without making the unremoved headaches siginficantly worse. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
Hi, On 2013-05-30 11:48:12 -0700, Josh Berkus wrote: There's currently some great ideas bouncing around about eliminating the overhead associated with FREEZE. However, I wanted to take a step back and take a look at the big picture for VACUUM, FREEZE and ANALYZE. Otherwise, we're liable to repeat the 8.4 problem of making one operation better (background vacuum) while making another one worse (freezing). Inhowfar did 8.4 make freezing worse? I can't remember any new problems there? I agree that we need to be careful not to make things worse... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
On 30 May 2013 19:48, Josh Berkus j...@agliodbs.com wrote: Folks, There's currently some great ideas bouncing around about eliminating the overhead associated with FREEZE. However, I wanted to take a step back and take a look at the big picture for VACUUM, FREEZE and ANALYZE. Otherwise, we're liable to repeat the 8.4 problem of making one operation better (background vacuum) while making another one worse (freezing). The big, big picture is this: 90% of our users need to think about VACUUM/ANALYZE at least 10% of the time and 10% of our users need to think about it almost 90% of the time. That's considerably better than was the case 5 years ago, when vacuum management was a daily or weekly responsibility for nearly 100% of our users, but it's still not good enough. Our target should be that only those with really unusual setups should have to *ever* think about vacuum and analyze. So I've set up a wiki page to document the various problems that force users to think about vacuum and analyze and try to troubleshoot it: https://wiki.postgresql.org/wiki/VacuumHeadaches We can also collect suggested solutions here. I'm looking to create a long-term development target which removes most of these vacuum headaches over the next 3 or 4 releases, without making the unremoved headaches siginficantly worse. Great collection of issues. I'm not sure I understand this: Problem: As of 9.3, there's a significant benefit to vacuum freezing tables early so that index-only scan is enabled, since freezing also updates the visibility map. However, with default settings, such freezing only happens for data which is very old. This means that index-only scan is less effective than it could be for tables which have relatively infrequent updates and deletes. Why specifically VACUUM FREEZE rather than regular VACUUM? I thought regular VACUUM updated the visibility map too? And why as of 9.3 instead of 9.2? -- Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On 30.05.2013 21:46, Merlin Moncure wrote: On Thu, May 30, 2013 at 1:39 PM, Robert Haasrobertmh...@gmail.com wrote: On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The reason we have to freeze is that otherwise our 32-bit XIDs wrap around and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but that would waste a lot space. The trick is to add a field to the page header indicating the 'epoch' of the XID, while keeping the XIDs in tuple header 32-bit wide (*). (3) You still need to periodically scan the entire relation, or else have a freeze map as Simon and Josh suggested. Why is this scan required? To find all the dead tuples and remove them, and advance relfrozenxid. That in turn is required so that you can truncate the clog. This scheme relies on assuming that everything older than relfrozenxid committed, so if there are any aborted XIDs present in the table, you can't advance relfrozenxid past them. Come to think of it, if there are no aborted XIDs in a range of XIDs, only commits, then you could just advance relfrozenxid past that range and truncate away the clog, without scanning the table. But that's quite a special case - generally there would be at least a few aborted XIDs - so it's probably not worth adding any special code to take advantage of that. Also, what happens if you delete a tuple on a page when another tuple on the same page with age 2^32 that is still in an open transaction? Can't let that happen. Same as today. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
Problem: As of 9.3, there's a significant benefit to vacuum freezing tables early so that index-only scan is enabled, since freezing also updates the visibility map. However, with default settings, such freezing only happens for data which is very old. This means that index-only scan is less effective than it could be for tables which have relatively infrequent updates and deletes. Why specifically VACUUM FREEZE rather than regular VACUUM? I thought regular VACUUM updated the visibility map too? And why as of 9.3 instead of 9.2? As of 9.2, that was a typo. Allvisible only gets set if there was some reason for VACUUM to visit the page anyway, no? Which means that an insert-only or insert-mostly table doesn't get set allvisible until FREEZE. And insert-only tables are usually very large, and thus really *need* index-only scan. Hmmm. I should rewrite that item entirely. It has nothing to do with FREEZE, really. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
Inhowfar did 8.4 make freezing worse? I can't remember any new problems there? Before the Visibility Map, we always vacuumed all pages in a relation when it was vacuumed at all. This means that we froze tuples at vacuum_min_freeze_age, not at autovacuum_max_freeze_age, which is when we do it post-8.4. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
On 2013-05-30 12:11:23 -0700, Josh Berkus wrote: Inhowfar did 8.4 make freezing worse? I can't remember any new problems there? Before the Visibility Map, we always vacuumed all pages in a relation when it was vacuumed at all. This means that we froze tuples at vacuum_min_freeze_age, not at autovacuum_max_freeze_age, which is when we do it post-8.4. If we have reason to vacuum the relation we do it at vacuum_freeze_table_age, not only at autovacuum_max_freeze_age. The difference is that the latter triggers a vacuum, while the former only changes a partial vacuum into a full one. Calling that behaviour unconditionally worse is, err, interesting... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
On 2013-05-30 20:01:01 +0100, Thom Brown wrote: Problem: As of 9.3, there's a significant benefit to vacuum freezing tables early so that index-only scan is enabled, since freezing also updates the visibility map. However, with default settings, such freezing only happens for data which is very old. This means that index-only scan is less effective than it could be for tables which have relatively infrequent updates and deletes. Why specifically VACUUM FREEZE rather than regular VACUUM? I thought regular VACUUM updated the visibility map too? It does. It's after all what it uses to decide which parts of the table to scan if not doing a full table vacuum. And why as of 9.3 instead of 9.2? Mabe because 9.3 updates the vm quicker than earlier version by checking whether all tuples are visible after we've actually removed the dead tuples. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
If we have reason to vacuum the relation we do it at vacuum_freeze_table_age, not only at autovacuum_max_freeze_age. The difference is that the latter triggers a vacuum, while the former only changes a partial vacuum into a full one. Calling that behaviour unconditionally worse is, err, interesting... *overall* it's better. But as far as FREEZE itself is concerned, it's worse. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On 2013-05-30 14:39:46 -0400, Robert Haas wrote: Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to replace the XIDs with FrozenXid whenever the difference between the smallest and largest XID on a page exceeds 2^31. But that would only happen when you're updating the page, in which case the page is dirtied anyway, so it wouldn't cause any extra I/O. It would cause some extra WAL activity, but it wouldn't dirty the page an extra time. You probably could do it similarly to how we currently do XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The insert/update/delete record will already contain a FPI if necessary, so that should be safe. This would also be the first step in allowing the clog to grow larger than 2 billion transactions, eliminating the need for anti-wraparound freezing altogether. You'd still want to truncate the clog eventually, but it would be nice to not be pressed against the wall with run vacuum freeze now, or the system will shut down. Interesting. That seems like a major advantage. Hm. Why? If freezing gets notably cheaper I don't really see much need for keeping that much clog around? If we still run into anti-wraparound areas, there has to be some major operational problem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries
On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko sawada.m...@gmail.comwrote: following emails are discussed about partial match of pg_trgm. I hope will this help. http://www.postgresql.org/message-id/cahgqgwfjshvv2ngme19wdtw9tefw_w7h2ns4e+yysjkb9wd...@mail.gmail.com as you may know, if search string contains multibyte characters trigram key is converted to CRC of 4 byte and it is used as key. (but only use upper 3 byte from CRC) so we can do partial matching if KEEPONLYALNUM is enabled. Please, read the further discussion on that thread. We can't do partial matching because of CRC independently of KEEPONLYALNUM. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
On 2013-05-30 12:18:29 -0700, Josh Berkus wrote: If we have reason to vacuum the relation we do it at vacuum_freeze_table_age, not only at autovacuum_max_freeze_age. The difference is that the latter triggers a vacuum, while the former only changes a partial vacuum into a full one. Calling that behaviour unconditionally worse is, err, interesting... *overall* it's better. But as far as FREEZE itself is concerned, it's worse. I am not trying to give you a hard time, but I really can't follow. In 8.3 we only froze tuples that were older than vacuum_freeze_min_age, just as today (although the default was higher then than today). 100mio transactions is long enough that you almost guaranteedly be in a different checkpoint cycle when freezing than when initially writing the tuple's buffer. So independent of the time the buffer is frozen (be it a) we always scan the whole relation, b) we have a partial vacuum upgraded to a full one due to vacuum_freeze_table_age c) an anti wraparound vacuum) we will usually write a buffer multiple times. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On Thu, May 30, 2013 at 04:33:50PM +0300, Heikki Linnakangas wrote: This would also be the first step in allowing the clog to grow larger than 2 billion transactions, eliminating the need for anti-wraparound freezing altogether. You'd still want to truncate the clog eventually, but it would be nice to not be pressed against the wall with run vacuum freeze now, or the system will shut down. Keep in mind that autovacuum_freeze_max_age is 200M to allow faster clog truncation. Does this help that? -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries
On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko sawada.m...@gmail.com wrote: following emails are discussed about partial match of pg_trgm. I hope will this help. http://www.postgresql.org/message-id/cahgqgwfjshvv2ngme19wdtw9tefw_w7h2ns4e+yysjkb9wd...@mail.gmail.com as you may know, if search string contains multibyte characters trigram key is converted to CRC of 4 byte and it is used as key. (but only use upper 3 byte from CRC) so we can do partial matching if KEEPONLYALNUM is enabled. Please, read the further discussion on that thread. We can't do partial matching because of CRC independently of KEEPONLYALNUM. Thank you Sawada-san and Alexander. I think the idea of using trigram text itself rather than its CRC (due to its problems in partial matching) as GIN key (?) has not been implemented into pg_trgm yet, right? And even though, such a facility would be added, we would still need to handle multibyte characters case differently (even for partial matching), is that right? -- Amit Langote -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On Thu, May 30, 2013 at 3:22 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-05-30 14:39:46 -0400, Robert Haas wrote: Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to replace the XIDs with FrozenXid whenever the difference between the smallest and largest XID on a page exceeds 2^31. But that would only happen when you're updating the page, in which case the page is dirtied anyway, so it wouldn't cause any extra I/O. It would cause some extra WAL activity, but it wouldn't dirty the page an extra time. You probably could do it similarly to how we currently do XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The insert/update/delete record will already contain a FPI if necessary, so that should be safe. Ah, good point. This would also be the first step in allowing the clog to grow larger than 2 billion transactions, eliminating the need for anti-wraparound freezing altogether. You'd still want to truncate the clog eventually, but it would be nice to not be pressed against the wall with run vacuum freeze now, or the system will shut down. Interesting. That seems like a major advantage. Hm. Why? If freezing gets notably cheaper I don't really see much need for keeping that much clog around? If we still run into anti-wraparound areas, there has to be some major operational problem. That is true, but we have a decent number of customers who do in fact have such problems. I think that's only going to get more common. As hardware gets faster and PostgreSQL improves, people are going to process more and more transactions in shorter and shorter periods of time. Heikki's benchmark results for the XLOG scaling patch show rates of 80,000 tps. Even at a more modest 10,000 tps, with default settings, you'll do anti-wraparound vacuums of the entire cluster about every 8 hours. That's not fun. Being able to do such scans only of the not-all-visible pages would be a huge step forward, of course. But not having to do them on any particular deadline would be a whole lot better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries
On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko sawada.m...@gmail.com wrote: following emails are discussed about partial match of pg_trgm. I hope will this help. http://www.postgresql.org/message-id/cahgqgwfjshvv2ngme19wdtw9tefw_w7h2ns4e+yysjkb9wd...@mail.gmail.com as you may know, if search string contains multibyte characters trigram key is converted to CRC of 4 byte and it is used as key. (but only use upper 3 byte from CRC) so we can do partial matching if KEEPONLYALNUM is enabled. Please, read the further discussion on that thread. We can't do partial matching because of CRC independently of KEEPONLYALNUM. Also, a few more questions: 1) When building a trgm index, are there any differences for multi-byte character strings. For example, would a 2 character Japanese string (multi-byte offcourse) produce exactly 3 trigrams to be stored in the index which would later be used while look-up? 2) And if that is so, is there problem in gin_extract_query_trgm(), that is while generating trigrams from a query search term that causes trigrams (stored in the index if answer to (1) is yes) NOT to be used in such a partial matching case? -- Amit Langote -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On Thu, May 30, 2013 at 2:39 PM, Robert Haas robertmh...@gmail.com wrote: Random thought: Could you compute the reference XID based on the page LSN? That would eliminate the storage overhead. After mulling this over a bit, I think this is definitely possible. We begin a new half-epoch every 2 billion transactions. We remember the LSN at which the current half-epoch began and the LSN at which the previous half-epoch began. When a new half-epoch begins, the first backend that wants to stamp a tuple with an XID from the new half-epoch must first emit a new half-epoch WAL record, which becomes the starting LSN for the new half-epoch. We define a new page-level bit, something like PD_RECENTLY_FROZEN. When this bit is set, it means there are no unfrozen tuples on the page with XIDs that predate the current half-epoch. Whenever we know this to be true, we set the bit. If the page LSN crosses more than one half-epoch boundary at a time, we freeze the page and set the bit. If the page LSN crosses exactly one half-epoch boundary, then (1) if the bit is set, we clear it and (2) if the bit is not set, we freeze the page and set the bit. The advantage of this is that we avoid an epidemic of freezing right after a half-epoch change. Immediately after a half-epoch change, many pages will mix tuples from the current and previous half-epoch - but relatively few pages will have tuples from the current half-epoch and a half-epoch more than one in the past. As things stand today, we really only need to remember the last two half-epoch boundaries; they could be stored, for example, in the control file. But if we someday generalize CLOG to allow indefinite retention as you suggest, we could instead remember all half-epoch boundaries that have ever occurred; just maintain a file someplace with 8 bytes of data for every 2 billion XIDs consumed over the lifetime of the cluster. In fact, we might want to do it that way anyhow, just to keep our options open, and perhaps for forensics. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers