Re: [HACKERS] Compression of full-page-writes
On 2013-10-22 12:52:09 +0900, Fujii Masao wrote: So, our consensus is to introduce the hooks for FPW compression so that users can freely select their own best compression algorithm? No, I don't think that's consensus yet. If you want to make it configurable on that level you need to have: 1) compression format signature on fpws 2) mapping between identifiers for compression formats and the libraries implementing them. Otherwise you can only change the configuration at initdb time... Also, probably we need to implement at least one compression contrib module using that hook, maybe it's based on pglz or snappy. From my tests for toast compression I'd suggest starting with lz4. I'd suggest starting by publishing test results with more modern compression formats, but without hacks like increasing padding. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cube extension point support // GSoC'13
On Tue, Oct 22, 2013 at 2:00 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: Tom Lane t...@sss.pgh.pa.us writes: I believe the reason GIST has compress/decompress functions is not for TOAST (they predate that, if memory serves), but to allow the on-disk representation of an index entry to be different from the data type's normal representation in other ways --- think lossy storage in particular. My understanding of the use case for those functions is to do with storing a different data type in the index upper nodes and in the index leaves. It should be possible to do that in a non-lossy way, so that you would implement compress/decompress and not declare the RECHECK bits. Then again I'm talking from 8.3 era memories of when I tried to understand GiST enough to code the prefix extension. Actually, I mean the purpose of this particular decompress function implementation, not compress/decompress in general. I understand that in general compress/decompress can do a useful job. -- With best regards, Alexander Korotkov.
[HACKERS] Why the asprintf patch is still breaking the buildfarm
So I returned from vacation only to find that the buildfarm has a bad case of acne. All the Windows members are red or pink, and have been for awhile. Sigh. After some research I believe that I understand the reason for the CHECK failures, at least: 1. src/port/asprintf.c exhibits a truly touching faith that vsnprintf will report exactly the number of bytes that would have been required, even if the buffer is not that large. While this is what is specified in recent versions of the POSIX standard, older platforms have much sketchier behavior. 2. In particular, our own src/port/snprintf.c follows the SUS v2 rule that it should report the number of bytes it *actually wrote*. This means that asprintf.c will never think that its initial 128-byte allocation was insufficient. So, on platforms where we use this implementation (notably including Windows), the result of any asprintf call is effectively truncated at 128 bytes. 3. I believe the exact cause of the reported failures is that the add_to_path calls in pg_regress.c result in truncating the value of the PATH environment variable, causing system() to not find the perl executable. (jacana is probably failing because of truncation of LD_LIBRARY_PATH instead, which is unsurprising since the problem would move around depending on the directory path lengths in use.) IMO src/port/asprintf.c is hopelessly naive, as well as ugly and undocumented. We should throw it away and replace it with an implementation more like stringinfo.c's appendStringInfo, which is code that has been through the wars and is known to be pretty bulletproof these days. Aside from the immediate problem, that would allow us to get rid of the unportable va_copy calls. (I say they're unportable because no such functionality is specified in SUS v2. And no, I do not have any faith at all in commit c2316dcda1cd057d7d4a56e3a51e3f8f0527e906 as a workaround.) I have a lot of other gripes about this whole patch, but they can wait till tomorrow.
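To make the failure mode concrete, here is a minimal, hypothetical sketch (not the actual src/port code) of an asprintf that trusts the vsnprintf return value, paired with a stand-in vsnprintf that follows the SUS v2 rule of returning the bytes actually written. With that rule, the enlargement branch can never fire, so output is silently truncated at the initial 128 bytes:

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Simulated SUS v2 vsnprintf: returns the number of bytes actually
 * written (excluding the NUL), never more than size - 1.  This is a
 * hypothetical stand-in for the behavior of src/port/snprintf.c. */
static int
susv2_vsnprintf(char *buf, size_t size, const char *fmt, va_list ap)
{
    int needed = vsnprintf(buf, size, fmt, ap); /* host's C99 vsnprintf */

    if (needed < 0)
        return -1;
    return needed >= (int) size ? (int) size - 1 : needed;
}

/* Naive asprintf in the style criticized above: it trusts the return
 * value to be the full required length.  Under the SUS v2 semantics
 * simulated here, the retry branch is unreachable, so results are
 * truncated at 128 bytes.  Error handling elided. */
static char *
naive_asprintf(const char *fmt, ...)
{
    size_t  size = 128;
    char   *buf = malloc(size);
    va_list ap;
    int     needed;

    va_start(ap, fmt);
    needed = susv2_vsnprintf(buf, size, fmt, ap);
    va_end(ap);

    if (needed >= (int) size)
    {
        /* Never true with SUS v2 semantics -- this is the bug. */
        /* would realloc to needed + 1 and retry here */
    }
    return buf;
}
```

A PATH string longer than 128 bytes passed through such a helper comes back truncated, which matches the pg_regress symptom described above.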
regards, tom lane
Re: [HACKERS] Why the asprintf patch is still breaking the buildfarm
On Tue, Oct 22, 2013 at 8:58 PM, Tom Lane t...@sss.pgh.pa.us wrote: So I returned from vacation only to find that the buildfarm has a bad case of acne. All the Windows members are red or pink, and have been for awhile. Sigh. After some research I believe that I understand the reason for the CHECK failures, at least: 1. src/port/asprintf.c exhibits a truly touching faith that vsnprintf will report exactly the number of bytes that would have been required, even if the buffer is not that large. While this is what is specified in recent versions of the POSIX standard, older platforms have much sketchier behavior. 2. In particular, our own src/port/snprintf.c follows the SUS v2 rule that it should report the number of bytes it *actually wrote*. This means that asprintf.c will never think that its initial 128-byte allocation was insufficient. So, on platforms where we use this implementation (notably including Windows), the result of any asprintf call is effectively truncated at 128 bytes. Thanks for looking at this. I had a bash at trying to figure out why vcregress check would not work last night and didn't get very far... I can confirm that you are right just by changing the 128 into 12800 and compiling, vcregress check worked after that. Regards David Rowley I have a lot of other gripes about this whole patch, but they can wait till tomorrow. regards, tom lane
Re: [HACKERS] Heavily modified big table bloat even in auto vacuum is running
On 22 October 2013 10:15 Amit Kapila wrote: On Mon, Oct 21, 2013 at 10:54 AM, Haribabu kommi haribabu.ko...@huawei.com wrote: Yes, it's correct. The nkeep counter has the dead tuples which are recently dead and are not vacuumed. The removal of tuples vacuumed from dead tuples should be the same as the nkeep counter. So if we remove nkeep from num_tuples, that gives us the proper live tuples. How about the following statement at the end of the scan of all blocks: num_tuples -= nkeep; Actually what I had in mind was to use nkeep to estimate n_dead_tuples similar to how num_tuples is used to estimate n_live_tuples. I think it will match what Tom had pointed out in his response (What would make more sense to me is for VACUUM to estimate the number of remaining dead tuples somehow and send that in its message. However, since the whole point here is that we aren't accounting for transactions that commit while VACUUM runs, it's not very clear how to do that.) I changed the patch to pass the nkeep counter data as the new dead tuples in the relation to stats, like new_rel_tuples. The nkeep counter is an approximation of the dead tuples data of a relation. Instead of resetting the dead tuples stats to zero, this value is used to set n_dead_tuples, the same way as n_live_tuples. Patch is attached in the mail. Please let me know if any changes are required. Regards, Hari Babu. Attachment: vacuum_fix_v3.patch
Re: [HACKERS] Why the asprintf patch is still breaking the buildfarm
On 22/10/2013 09:58, Tom Lane wrote: So I returned from vacation only to find that the buildfarm has a bad case of acne. All the Windows members are red or pink, and have been for awhile. Sigh. After some research I believe that I understand the reason for the CHECK failures, at least: 1. src/port/asprintf.c exhibits a truly touching faith that vsnprintf will report exactly the number of bytes that would have been required, even if the buffer is not that large. While this is what is specified in recent versions of the POSIX standard, older platforms have much sketchier behavior. Just to be pedantic, this is required by C99. [...] Regards Manlio Perillo
Re: [HACKERS] Getting Tuples list from an index
Naman wrote: I have 3 indexes on a relation t2(A,B,C): index1, index2, index3. What I need is, if I know the indexname (say index1), then is there any programmatic way by which I can get the list of tuples which comes under the index specified (i.e. index1)? Do you need anything that exceeds a catalog query? Yours, Laurenz Albe
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Tom Lane t...@sss.pgh.pa.us writes: Hm. It's been a long time since college statistics, but doesn't the entire concept of standard deviation depend on the assumption that the underlying distribution is more-or-less normal (Gaussian)? Is there a I just had a quick chat with a statistician friend of mine on that topic, and it seems that the only way to make sense of an average is if you know already the distribution. In our case, what I keep experiencing with tuning queries is that we have like 99% of them running under an acceptable threshold and 1% of them taking more and more time. In a normal (Gaussian) distribution, no query time would be much farther away from the average than any other, so my experience tells me that the query time distribution is anything BUT normal (Gaussian). good reason to suppose that query runtime is Gaussian? (I'd bet not; in particular, multimodal behavior seems very likely due to things like plan changes.) If not, how much does that affect the usefulness of a standard-deviation calculation? I don't know what multi-modal is. What I've been gathering from my quick chat this morning is that either you know how to characterize the distribution and then the min, max and average are useful on their own, or you need to keep track of a histogram where all the bins are of the same size to be able to learn what the distribution actually is. We didn't get to the point where I could understand if storing a histogram with a constant size on log10 of the data rather than the data itself is going to allow us to properly characterize the distribution. The main question I want to answer here would be the percentiles one, I want to get the query max execution timing for 95% of the executions, then 99%, then 99.9% etc. There's no way to answer that without knowing the distribution shape, so we need enough stats to learn what the distribution shape is (hence, histograms).
Of course keeping enough stats seems to always begin with keeping the min, max and average, so we can just begin there. We would just be unable to answer interesting questions with just that. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On Tue, Oct 22, 2013 at 2:56 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: Tom Lane t...@sss.pgh.pa.us writes: Hm. It's been a long time since college statistics, but doesn't the entire concept of standard deviation depend on the assumption that the underlying distribution is more-or-less normal (Gaussian)? Is there a I just had a quick chat with a statistician friend of mine on that topic, and it seems that the only way to make sense of an average is if you know already the distribution. In our case, what I keep experiencing with tuning queries is that we have like 99% of them running under an acceptable threshold and 1% of them taking more and more time. Agreed. In a lot of Heroku's performance work, the Perc99 and Perc95 have provided a lot more value than stddev, although stddev is a lot better than nothing and probably easier to implement. There are apparently high-quality statistical approximations of these that are not expensive to compute and are small in memory representation. That said, I'd take stddev over nothing for sure. Handily for stddev, I think that by taking snapshots of count(x), sum(x), sum(x**2) (which I understand to be the components of stddev), one can compute stddevs across different time spans using auxiliary tools that sample this triplet on occasion. That's kind of a handy property that I'm not sure if percN-approximates can get too easily.
Re: [HACKERS] proposal: lob conversion functionality
Hello, here is the patch. Regards Pavel 2013/10/21 Noah Misch n...@leadboat.com On Mon, Oct 21, 2013 at 08:10:24PM +0200, Pavel Stehule wrote: On 10/20/2013 07:52 PM, Noah Misch wrote: Anything we do here effectively provides wrappers around the existing functions tailored toward the needs of libpq. To clarify the above statement: the existing lo* SQL functions are designed to fit the needs of the libpq APIs that call those SQL functions internally. The additions we're discussing are SQL functions designed to fit the needs of user-written SQL statements. I am for including in core - we have no built-in SQL functions that allow simple and fast access on the binary level. Next - these functions complete the lo functionality. Other questions - should these functions be propagated to libpq? No; I agree that the existing libpq large object API is adequate. and who will write the patch? You or me? If you're prepared to change the function names and add the subset-oriented functions, I would appreciate that. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com Attachment: load_lo_v3.patch
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 22/10/13 22:56, Dimitri Fontaine wrote: Tom Lane t...@sss.pgh.pa.us writes: Hm. It's been a long time since college statistics, but doesn't the entire concept of standard deviation depend on the assumption that the underlying distribution is more-or-less normal (Gaussian)? Is there a I just had a quick chat with a statistician friend of mine on that topic, and it seems that the only way to make sense of an average is if you know already the distribution. In our case, what I keep experiencing with tuning queries is that we have like 99% of them running under an acceptable threshold and 1% of them taking more and more time. In a normal (Gaussian) distribution, there would be no query time farther away from the average than any other, so my experience tells me that the query time distribution is anything BUT normal (Gaussian). good reason to suppose that query runtime is Gaussian? (I'd bet not; in particular, multimodal behavior seems very likely due to things like plan changes.) If not, how much does that affect the usefulness of a standard-deviation calculation? I don't know what multi-modal is. [...] Multi-modal is basically having more than one hump when you graph the frequencies of values. If you gave a series of mathematical questions of varying degrees of difficulty and diverse areas in mathematics to a group of people between the ages of 20 to 25 selected at random in New Zealand, then you would have at least 2 humps. One hump would be those who had little mathematical training and/or no interest, the other those who had had more advanced mathematical training and/or were interested in mathematics. You would also get at least 2 humps if you plotted numbers of people under the age of 50 with the number of visits to medical practitioners. Basically those people with chronic illnesses versus those who tend not to have extended periods of illness - this implies 2 humps, but it may be more complicated.
Grabbing people at random and getting them to fire a rifle at targets would also be multimodal. A lot of people with low scores and a lesser percentage with reasonable scores. I would expect this to be quite pronounced, people with lots of rifle practice will tend to do significantly better. Cheers, Gavin
Re: [HACKERS] all_visible replay aborting due to uninitialized pages
Hi Robert, Heikki, On 2013-09-24 13:25:41 +0200, Andres Freund wrote: I'm afraid this patch was a few bricks shy of a load. The log_newpage_buffer() function asserts that: /* We should be in a critical section. */ Assert(CritSectionCount > 0); But the call in vacuumlazy.c is not inside a critical section. Also, the comments in log_newpage_buffer() say that the caller should mark the buffer dirty *before* calling log_newpage_buffer(), but in vacuumlazy.c, it's marked dirty afterwards. I'm not sure what consequences that might have, but at least it contradicts the comment. (spotted this while working on a patch, and ran into the assertion on crash recovery) What about the attached patches (one for 9.3 and master, the other for 9.2)? I've tested that I can trigger the assert before and not after by inserting faults... Would either of you commit those patches to the corresponding branches? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
All, * Dimitri Fontaine (dimi...@2ndquadrant.fr) wrote: In our case, what I keep experiencing with tuning queries is that we have like 99% of them running under acceptable threshold and 1% of them taking more and more time. This is usually described (at least where I come from) as 'rare events', which goes to Tom's point that averages, stddev, etc, are not ideal (though they are still better than nothing). good reason to suppose that query runtime is Gaussian? (I'd bet not; in particular, multimodal behavior seems very likely due to things like plan changes.) If not, how much does that affect the usefulness of a standard-deviation calculation? Oscillating plan changes may fit multimodal but I don't feel that's typical. My experience has been it's either an extremely rare plan difference or it's a shift from one plan to another over time. What I've been gathering from my quick chat this morning is that either you know how to characterize the distribution and then the min max and average are useful on their own, or you need to keep track of an histogram where all the bins are of the same size to be able to learn what the distribution actually is. A histogram would certainly be useful. We may also wish to look into outlier/rare event detection methods and increase the logging we do in those cases (if possible). Of course keeping enough stats seems to always begin with keeping the min, max and average, so we can just begin there. We would just be unable to answer interesting questions with just that. It would probably be good to do some research into techniques for outlier detection which minimizes CPU and storage cost. Thanks, Stephen
Re: [HACKERS] logical changeset generation v6.2
On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote: So. As it turns out that solution isn't sufficient in the face of VACUUM FULL and mixed DML/DDL transactions that have not yet been decoded. To reiterate, as published it works like: For every modification of a catalog tuple (insert, multi_insert, update, delete) that has influence over visibility, issue a record that contains: * filenode * ctid * (cmin, cmax) When doing a visibility check on a catalog row during decoding of a mixed DML/DDL transaction, look up (cmin, cmax) for that row since we don't store both for the tuple. That mostly works great. The problematic scenario is decoding a transaction that has done mixed DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM FULL obviously changes the filenode and the ctid of a tuple, so we cannot successfully do a lookup based on what we logged before. So I have a new idea for handling this problem, which seems obvious in retrospect. What if we make the VACUUM FULL or CLUSTER log the old CTID -> new CTID mappings? This would only need to be done for catalog tables, and maybe could be skipped for tuples whose XIDs are old enough that we know those transactions must already be decoded. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.4
On Tue, Oct 22, 2013 at 10:07 AM, Andres Freund and...@2ndquadrant.com wrote: On 2013-10-21 20:16:29 +0200, Andres Freund wrote: On 2013-10-18 20:50:58 +0200, Andres Freund wrote: How about modifying the selection to go from: * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL; * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb) Current draft is: ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL|DEFAULT ALTER TABLE ... REPLICA IDENTITY USING INDEX ...; which leaves the door open for ALTER TABLE ... REPLICA IDENTITY USING '(' column_name_list ')'; Does anybody have a strong feeling about requiring support for CREATE TABLE for this? Attached is a patch on top of master implementing this syntax. It's not wired up to the changeset extraction patch yet as I am not sure whether others agree about the storage. pg_class grew a 'relreplident' char, storing: * 'd' default * 'n' nothing * 'f' full * 'i' index with indisreplident set, or default if index has been dropped pg_index grew an 'indisreplident' bool indicating it is set as the replica identity for a replident = 'i' relation. Both changes shouldn't change the width of the affected relations, they should reuse existing padding. Does somebody prefer a different storage? I had imagined that the storage might consist simply of a pg_attribute boolean. So full would turn them all on, null would turn them all off, etc. But that does make it hard to implement the "whatever the pkey happens to be right now" behavior, so maybe your idea is better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-22 10:52:48 -0400, Robert Haas wrote: On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote: So. As it turns out that solution isn't sufficient in the face of VACUUM FULL and mixed DML/DDL transaction that have not yet been decoded. To reiterate, as published it works like: For every modification of catalog tuple (insert, multi_insert, update, delete) that has influence over visibility issue a record that contains: * filenode * ctid * (cmin, cmax) When doing a visibility check on a catalog row during decoding of mixed DML/DDL transaction lookup (cmin, cmax) for that row since we don't store both for the tuple. That mostly works great. The problematic scenario is decoding a transaction that has done mixed DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM FULL obviously changes the filenode and the ctid of a tuple, so we cannot successfully do a lookup based on what we logged before. So I have a new idea for handling this problem, which seems obvious in retrospect. What if we make the VACUUM FULL or CLUSTER log the old CTID -> new CTID mappings? This would only need to be done for catalog tables, and maybe could be skipped for tuples whose XIDs are old enough that we know those transactions must already be decoded. Ah. If it only were so simple ;). That was my first idea, and after I'd bragged in a 2ndq internal chat that I'd found a simple idea I obviously had to realize it doesn't work. Consider: INIT_LOGICAL_REPLICATION; CREATE TABLE foo(...); BEGIN; INSERT INTO foo; ALTER TABLE foo ...; INSERT INTO foo; COMMIT TX 3; VACUUM FULL pg_class; START_LOGICAL_REPLICATION; When we decode tx 3 we haven't yet read the mapping from the vacuum freeze. That scenario can happen either because decoding was stopped for a moment, or because decoding couldn't keep up (slow connection, whatever).
There also can be nasty variations where the VACUUM FULL happens while a transaction is writing data since we don't hold locks on system relations for very long. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Why the asprintf patch is still breaking the buildfarm
Manlio Perillo manlio.peri...@gmail.com writes: On 22/10/2013 09:58, Tom Lane wrote: 1. src/port/asprintf.c exhibits a truly touching faith that vsnprintf will report exactly the number of bytes that would have been required, even if the buffer is not that large. While this is what is specified in recent versions of the POSIX standard, older platforms have much sketchier behavior. Just to be pedantic, this is required by C99. Yeah. As a separate matter, it might be useful to revise stringinfo.c and the asprintf code so that *if* the returned value is larger than the given buffer size, we use it as a guide to resizing, avoiding the possible need to loop multiple times to make the buffer large enough. And we could also improve our own implementation of snprintf to follow the C99 spec. The point here is that we still need to cope with pre-C99 implementations that might return -1 or the given buffer size on overflow. The NetBSD implementation doesn't do that, which is reasonable in their context, but not workable for us. regards, tom lane
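A hedged sketch of the kind of loop described above — use the return value as a resize hint when it looks like a C99-style "bytes needed" answer, and fall back to doubling when the implementation returns -1 or a value consistent with truncation. This is illustrative only, not the actual stringinfo.c or src/port code:

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Portable asprintf-style helper.  Handles three vsnprintf return
 * conventions on overflow: C99 (bytes that would have been needed),
 * pre-C99 returning -1, and SUS v2 returning the bytes actually
 * written (at most size - 1).  Error handling elided for brevity. */
static char *
portable_asprintf(const char *fmt, ...)
{
    size_t size = 128;
    char  *buf = malloc(size);

    for (;;)
    {
        va_list ap;
        int     n;

        va_start(ap, fmt);
        n = vsnprintf(buf, size, fmt, ap);
        va_end(ap);

        /* Fit with room to spare: unambiguous success under both the
         * C99 and the SUS v2 return conventions. */
        if (n >= 0 && (size_t) n < size - 1)
            return buf;

        if (n >= 0 && (size_t) n >= size)
            size = (size_t) n + 2;  /* C99 gave the exact requirement */
        else
            size *= 2;              /* -1 or size-1: no hint, double */

        buf = realloc(buf, size);
    }
}
```

On a C99 platform the loop resizes at most once; on older platforms it degrades gracefully to doubling, which is the property the naive src/port/asprintf.c lacked.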
Re: [HACKERS] COPY table FROM STDIN doesn't show count tag
On 21 October 2013 20:48, Robert Haas robertmh...@gmail.com wrote: On Fri, Oct 18, 2013 at 7:37 AM, Rajeev rastogi rajeev.rast...@huawei.com wrote: From the following mail, there is some inconsistency in copy behaviour between stdin and a normal file. http://www.postgresql.org/message-id/ce85a517.4878e%tim.k...@gmail.com The issue was that if copy is executed from stdin, it goes to the server to execute the command; the server then requests the input and sends control back to the client to enter the data. Once the client sends the input to the server, the server executes the copy command and sends the result back to the client, but the client does not print the result; instead it just clears it out. Changes are made to ensure the final result from the server gets printed before clearing the result. Please find the patch for the same and let me know your suggestions. Please add your patch to the currently-open CommitFest so that it does not get forgotten: https://commitfest.postgresql.org/action/commitfest_view/open Added to the currently-open CommitFest. Thanks and Regards, Kumar Rajeev Rastogi
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 22, 2013 at 11:02 AM, Andres Freund and...@2ndquadrant.com wrote: On 2013-10-22 10:52:48 -0400, Robert Haas wrote: On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote: So. As it turns out that solution isn't sufficient in the face of VACUUM FULL and mixed DML/DDL transaction that have not yet been decoded. To reiterate, as published it works like: For every modification of catalog tuple (insert, multi_insert, update, delete) that has influence over visibility issue a record that contains: * filenode * ctid * (cmin, cmax) When doing a visibility check on a catalog row during decoding of mixed DML/DDL transaction lookup (cmin, cmax) for that row since we don't store both for the tuple. That mostly works great. The problematic scenario is decoding a transaction that has done mixed DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM FULL obviously changes the filenode and the ctid of a tuple, so we cannot successfully do a lookup based on what we logged before. So I have a new idea for handling this problem, which seems obvious in retrospect. What if we make the VACUUM FULL or CLUSTER log the old CTID -> new CTID mappings? This would only need to be done for catalog tables, and maybe could be skipped for tuples whose XIDs are old enough that we know those transactions must already be decoded. Ah. If it only were so simple ;). That was my first idea, and after I'd bragged in a 2ndq internal chat that I'd found a simple idea I obviously had to realize it doesn't work. Consider: INIT_LOGICAL_REPLICATION; CREATE TABLE foo(...); BEGIN; INSERT INTO foo; ALTER TABLE foo ...; INSERT INTO foo; COMMIT TX 3; VACUUM FULL pg_class; START_LOGICAL_REPLICATION; When we decode tx 3 we haven't yet read the mapping from the vacuum freeze. That scenario can happen either because decoding was stopped for a moment, or because decoding couldn't keep up (slow connection, whatever).
That strikes me as a flaw in the implementation rather than the idea. You're presupposing a patch where the necessary information is available in WAL yet you don't make use of it at the proper time. It seems to me that you have to think of the CTID map as tied to a relfilenode; if you try to use one relfilenode's map with a different relfilenode, it's obviously not going to work. So don't do that. /me looks innocent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-18 20:26:16 +0200, Andres Freund wrote: 4) Store both (cmin, cmax) for catalog tuples. BTW: That would have the nice side-effect of delivering the basis of what you need to do parallel sort in a transaction that previously has performed DDL. Currently you cannot do anything in parallel after DDL, even if you only scan the table in one backend, because operators et al. have to do catalog lookups which you can't do consistently since cmin/cmax aren't available in both. Greetings, Andres Freund
Re: [HACKERS] logical changeset generation v6.2
On 22.10.2013 19:12, Andres Freund wrote: On 2013-10-18 20:26:16 +0200, Andres Freund wrote: 4) Store both (cmin, cmax) for catalog tuples. BTW: That would have the nice side-effect of delivering the basis of what you need to do parallel sort in a transaction that previously has performed DDL. Currently you cannot do anything in parallel after DDL, even if you only scan the table in one backend, because operators et al. have to do catalog lookups which you can't do consistently since cmin/cmax aren't available in both. Parallel workers will need cmin/cmax for user tables too, to know which tuples are visible to the snapshot. - Heikki
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote: On 22.10.2013 19:12, Andres Freund wrote: On 2013-10-18 20:26:16 +0200, Andres Freund wrote: 4) Store both (cmin, cmax) for catalog tuples. BTW: That would have the nice side-effect of delivering the basis of what you need to do parallel sort in a transaction that previously has performed DDL. Currently you cannot do anything in parallel after DDL, even if you only scan the table in one backend, because operators et al. have to do catalog lookups which you can't do consistently since cmin/cmax aren't available in both. Parallel workers will need cmin/cmax for user tables too, to know which tuples are visible to the snapshot. The existing proposals were mostly about just parallelizing the sort and similar operations, right? In such scenarios you really need it only for the catalog. But we could easily generalize it for user data too. We should even be able to only use wide cids when a backend needs it, since inherently it's only needed within a single transaction. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On 22.10.2013 19:23, Andres Freund wrote: On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote: On 22.10.2013 19:12, Andres Freund wrote: On 2013-10-18 20:26:16 +0200, Andres Freund wrote: 4) Store both (cmin, cmax) for catalog tuples. BTW: That would have the nice side-effect of delivering the basis of what you need to do parallel sort in a transaction that previously has performed DDL. Currently you cannot do anything in parallel after DDL, even if you only scan the table in one backend, because operators et al. have to do catalog lookups which you can't do consistently since cmin/cmax aren't available in both. Parallel workers will need cmin/cmax for user tables too, to know which tuples are visible to the snapshot. The existing proposals were mostly about just parallelizing the sort and similar operations, right? In such scenarios you really need it only for the catalog. But we could easily generalize it for user data too. We should even be able to only use wide cids when a backend needs it, since inherently it's only needed within a single transaction. Or just hand over a copy of the combocid map to the worker, along with the snapshot. Seems a lot simpler than this wide cids business.. - Heikki
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 22, 2013 at 12:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Or just hand over a copy of the combocid map to the worker, along with the snapshot. Seems a lot simpler than this wide cids business.. Yes, that's what Noah and I talked about doing. Or possibly even making the map into a hash table in dynamic shared memory, so that new combo CIDs could be allocated by any backend in the parallel group. But that seems hard, so for starters I think we'll only parallelize read-only operations and just hand over a copy of the map. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
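The combocid map being discussed can be illustrated with a toy model (this is nothing like the real combocid.c; all names here are invented): when a tuple is created and deleted in the same transaction, a single combo command id goes on the tuple, and a backend-local array maps it back to the real (cmin, cmax) pair. A read-only worker only needs a flat copy of that array.

```c
#include <stdlib.h>
#include <string.h>

typedef struct { unsigned cmin, cmax; } ComboPair;

typedef struct
{
    ComboPair *pairs;           /* combo cid -> (cmin, cmax) */
    int        used, allocated;
} ComboCidMap;

static void
map_init(ComboCidMap *m)
{
    m->allocated = 8;
    m->used = 0;
    m->pairs = malloc(m->allocated * sizeof(ComboPair));
}

/* Return an existing combo cid for (cmin, cmax), or allocate a new one. */
static int
get_combo_cid(ComboCidMap *m, unsigned cmin, unsigned cmax)
{
    int i;

    for (i = 0; i < m->used; i++)
        if (m->pairs[i].cmin == cmin && m->pairs[i].cmax == cmax)
            return i;
    if (m->used == m->allocated)
    {
        m->allocated *= 2;
        m->pairs = realloc(m->pairs, m->allocated * sizeof(ComboPair));
    }
    m->pairs[m->used].cmin = cmin;
    m->pairs[m->used].cmax = cmax;
    return m->used++;
}

/* "Hand over a copy of the combocid map to the worker." */
static void
copy_map_to_worker(const ComboCidMap *leader, ComboCidMap *worker)
{
    map_init(worker);
    worker->pairs = realloc(worker->pairs, leader->used * sizeof(ComboPair));
    memcpy(worker->pairs, leader->pairs, leader->used * sizeof(ComboPair));
    worker->used = worker->allocated = leader->used;
}
```

Since the worker is read-only, it never calls get_combo_cid itself, which is why a plain snapshot copy suffices; allowing workers to allocate new combo cids is the "hash table in dynamic shared memory" variant Robert calls hard.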
Re: [HACKERS] [PATCH] Statistics collection for CLUSTER command
On Sun, Oct 20, 2013 at 1:37 AM, Noah Misch n...@leadboat.com wrote: (2013/08/08 20:52), Vik Fearing wrote: As part of routine maintenance monitoring, it is interesting for us to have statistics on the CLUSTER command (timestamp of last run, and number of runs since stat reset) like we have for (auto)ANALYZE and (auto)VACUUM. Patch against today's HEAD attached. Adding new fields to PgStat_StatTabEntry imposes a substantial distributed cost, because every database stats file write-out grows by the width of those fields times the number of tables in the database. Associated costs have been and continue to be a pain point with large table counts: http://www.postgresql.org/message-id/flat/1718942738eb65c8407fcd864883f...@fuzzy.cz http://www.postgresql.org/message-id/flat/52268887.9010...@uptime.jp In that light, I can't justify widening PgStat_StatTabEntry by 9.5% for this. I recommend satisfying this monitoring need in your application by creating a cluster_table wrapper function that issues CLUSTER and then updates statistics you store in an ordinary table. Issue all routine CLUSTERs by way of that wrapper function. A backend change that would help here is to extend event triggers to cover the CLUSTER command, permitting you to inject monitoring after plain CLUSTER and dispense with the wrapper. I unfortunately have to agree with this, but I think it points to the need for further work on the pgstat infrastructure. We used to have one file; now we have one per database. That's better for people with lots of databases, but many people just have one big database. We need a solution here that relieves the pain for those people. (I can't help thinking that the root of the problem here is that we're rewriting the whole file, and that any solution that doesn't somehow facilitate updates of individual records will be only a small improvement.) 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
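Noah's wrapper-function recommendation could look something like the following sketch; the table and function names (cluster_stats, cluster_table) are invented for illustration, not an existing API:

```sql
-- Hypothetical wrapper: issue CLUSTER, then record the statistics that the
-- rejected PgStat_StatTabEntry fields would have held.
CREATE TABLE cluster_stats (
    tablename     regclass PRIMARY KEY,
    last_cluster  timestamptz,
    cluster_count bigint NOT NULL DEFAULT 0
);

CREATE FUNCTION cluster_table(tbl regclass) RETURNS void AS $$
BEGIN
    EXECUTE format('CLUSTER %s', tbl);
    UPDATE cluster_stats
       SET last_cluster = now(), cluster_count = cluster_count + 1
     WHERE tablename = tbl;
    IF NOT FOUND THEN
        INSERT INTO cluster_stats VALUES (tbl, now(), 1);
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Issue all routine CLUSTERs through the wrapper:
-- SELECT cluster_table('my_big_table');
```

If event triggers gained CLUSTER support as suggested, the UPDATE/INSERT part could move into an event trigger function and the wrapper could be dropped.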
[HACKERS] Reasons not to like asprintf
I wrote: I have a lot of other gripes about this whole patch, but they can wait till tomorrow. The other big thing I don't like about this patch is that adopting asprintf() as a model was not any better thought-out than the portability considerations were. It's a bad choice because: 1. asprintf() is not in C99 nor any POSIX standard. 2. The two known implementations, glibc and *BSD, behave differently, as admitted in the Linux man page. Thus, there is nothing resembling a standard here. 3. Neither implementation has given any thought to error reporting; it's not possible to tell whether a failure was due to out-of-memory or a vsnprintf failure such as a format-string error. (pg_asprintf assumes the former, which probably has something to do with the bogus out-of-memory failures we're seeing on buildfarm member anole, though I'm not sure about the details.) 4. The use of a char** output parameter is not very friendly, because (at least on older versions of gcc) it disables uninitialized-variable detection. That is, if you have code like char *str; if (...) asprintf(&str, ...); else asprintf(&str, ...); ... use str for something ... many compilers will not be able to check whether you actually assigned to str in both code paths. 5. The reason for the char** parameter is that the function return value is needed for error/buffer overrun handling --- but no caller of pg_asprintf looks at the return value at all. 6. Thus, I much prefer the API design that was used for psprintf, that is return the allocated buffer pointer as the function result. It seems like a bad idea to me that we adopted two different APIs for frontend and backend code; if we're going to hardwire exit()-on-failure into pg_asprintf there is no benefit whatsoever to reserving the function return value for failure indications. In short, I want to change pg_asprintf to return the malloc'd buffer as its function result. 
This probably means we should change the function name too, since the allusion to asprintf is completely useless if we do that. I'm thinking pg_psprintf for the frontend version of psprintf, but maybe somebody has a better suggestion? regards, tom lane
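The advocated API shape can be sketched as below. This is a minimal illustration, not PostgreSQL's actual implementation: the function returns the allocated buffer as its result and exits on failure, avoiding asprintf()'s char** output parameter (Tom's point 4) and its unused return value (point 5). The error handling is deliberately simplified.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static char *
pg_psprintf(const char *fmt, ...)
{
    va_list args;
    int     len;
    char   *buf;

    /* C99 vsnprintf reports the would-be length when given a zero-size buffer */
    va_start(args, fmt);
    len = vsnprintf(NULL, 0, fmt, args);
    va_end(args);
    if (len < 0)
    {
        fprintf(stderr, "vsnprintf failed (format error?)\n");
        exit(EXIT_FAILURE);
    }

    buf = malloc(len + 1);
    if (buf == NULL)
    {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }

    /* format for real into the exactly-sized buffer */
    va_start(args, fmt);
    vsnprintf(buf, len + 1, fmt, args);
    va_end(args);
    return buf;
}
```

With this shape, branchy callers can write `char *str; if (...) str = pg_psprintf(...); else str = pg_psprintf(...);` and the compiler's uninitialized-variable analysis keeps working, since str is only ever assigned through ordinary assignment.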
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-22 19:25:31 +0300, Heikki Linnakangas wrote: On 22.10.2013 19:23, Andres Freund wrote: On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote: On 22.10.2013 19:12, Andres Freund wrote: On 2013-10-18 20:26:16 +0200, Andres Freund wrote: 4) Store both (cmin, cmax) for catalog tuples. BTW: That would have the nice side-effect of delivering the basis of what you need to do parallel sort in a transaction that previously has performed DDL. Currently you cannot do anything in parallel after DDL, even if you only scan the table in one backend, because operators et al. have to do catalog lookups which you can't do consistently since cmin/cmax aren't available in both. Parallel workers will need cmin/cmax for user tables too, to know which tuples are visible to the snapshot. The existing proposals were mostly about just parallelizing the sort and similar operations, right? In such scenarios you really need it only for the catalog. But we could easily generalize it for user data too. We should even be able to only use wide cids when a backend needs it, since inherently it's only needed within a single transaction. Or just hand over a copy of the combocid map to the worker, along with the snapshot. Seems a lot simpler than this wide cids business.. That's not sufficient if you want to continue writing in the primary backend, though, which isn't an uninteresting thing. I am not saying that parallel XXX is a sufficient reason for this, just that it might be a co-benefactor. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Failure while inserting parent tuple to B-tree is not fun
Splitting a B-tree page is a two-stage process: First, the page is split, and then a downlink for the new right page is inserted into the parent (which might recurse to split the parent page, too). What happens if inserting the downlink fails for some reason? I tried that out, and it turns out that it's not nice. I used this to cause a failure: --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel, _bt_relbuf(rel, pbuf); } + elog(ERROR, "fail!"); + /* get high key from left page == lowest key on new right page */ ritem = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY)); postgres=# create table foo (i int4 primary key); CREATE TABLE postgres=# insert into foo select generate_series(1, 1); ERROR: fail! That's not surprising. But when I removed that elog again and restarted the server, I still can't insert. The index is permanently broken: postgres=# insert into foo select generate_series(1, 1); ERROR: failed to re-find parent key in index foo_pkey for split pages 4/5 In real life, you would get a failure like this e.g. if you run out of memory or disk space while inserting the downlink to the parent. Although rare in practice, it's no fun if it happens. I fixed the same problem in GiST a few years back, by making it tolerate missing downlinks and insert them lazily. The B-tree code tolerates them already on scans, but gets confused on insertion, as seen above. I propose that we use the same approach I used with GiST, and add a flag to the page header to indicate the downlink hasn't been inserted yet. When insertion (or vacuum) bumps into a flagged page, it can finish the incomplete action by inserting the downlink. - Heikki
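The proposed fix can be pictured with a toy model (nothing like the real nbtree code; everything here is simplified): a split that fails to insert its downlink flags the left page instead of leaving the tree broken, and the next insertion that visits that page finishes the action lazily.

```c
#include <stdbool.h>

typedef struct Page
{
    int          keys[8];
    int          nkeys;
    struct Page *right;             /* right sibling created by a split */
    bool         incomplete_split;  /* downlink not yet in the parent */
} Page;

/* Split 'left' into left/right; if inserting the downlink into the parent
 * fails (e.g. out of memory or disk space), flag the page and carry on. */
static void
split_page(Page *parent, Page *left, Page *right, bool downlink_fails)
{
    int mid = left->nkeys / 2;
    int i;

    right->nkeys = 0;
    for (i = mid; i < left->nkeys; i++)
        right->keys[right->nkeys++] = left->keys[i];
    left->nkeys = mid;
    left->right = right;

    if (downlink_fails)
        left->incomplete_split = true;       /* tolerate, finish later */
    else
        parent->keys[parent->nkeys++] = right->keys[0];
}

static void
insert_key(Page *parent, Page *leaf, int key)
{
    if (leaf->incomplete_split)
    {
        /* Finish the incomplete split before doing anything else. */
        parent->keys[parent->nkeys++] = leaf->right->keys[0];
        leaf->incomplete_split = false;
    }
    leaf->keys[leaf->nkeys++] = key;
}
```

The point of the scheme is that the failed ERROR only loses the downlink, never the split itself, so any later visitor has enough information (the flagged page plus its right sibling) to complete the action.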
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-22 11:59:35 -0400, Robert Haas wrote: So I have a new idea for handling this problem, which seems obvious in retrospect. What if we make the VACUUM FULL or CLUSTER log the old CTID - new CTID mappings? This would only need to be done for catalog tables, and maybe could be skipped for tuples whose XIDs are old enough that we know those transactions must already be decoded. Ah. If it only were so simple ;). That was my first idea, and after I'd bragged in a 2ndq internal chat that I'd found a simple idea I obviously had to realize it doesn't work. Consider: INIT_LOGICAL_REPLICATION; CREATE TABLE foo(...); BEGIN; INSERT INTO foo; ALTER TABLE foo ...; INSERT INTO foo; COMMIT TX 3; VACUUM FULL pg_class; START_LOGICAL_REPLICATION; When we decode tx 3 we haven't yet read the mapping from the VACUUM FULL. That scenario can happen either because decoding was stopped for a moment, or because decoding couldn't keep up (slow connection, whatever). It seems to me that you have to think of the CTID map as tied to a relfilenode; if you try to use one relfilenode's map with a different relfilenode, it's obviously not going to work. So don't do that. It has to be tied to relfilenode (+ctid) *and* transaction unfortunately. That strikes me as a flaw in the implementation rather than the idea. You're presupposing a patch where the necessary information is available in WAL yet you don't make use of it at the proper time. The problem is that the mapping would be somewhere *ahead* from the transaction/WAL we're currently decoding. We'd need to read ahead till we find the correct one. But I think I mainly misunderstood what you proposed. That mapping could be written beside the relfilenode, instead of into the WAL. Then my imagined problem doesn't exist anymore. We only would need to write out mappings for tuples modified since the xmin horizon, so it wouldn't even be *too* bad for bigger relations. 
This won't easily work for two+ rewrites because we'd need to apply all mappings in order and thus would have to keep a history of intermediate nodes/mappings. But it'd be perfectly doable to simply wait till decoders are caught up. I still feel that simply storing both cmin, cmax is cleaner, but if that's not acceptable, I can certainly live with something like this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
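The shape of the mapping under discussion, including the in-order application needed for repeated rewrites, can be sketched as follows (an illustrative model with invented names, not a proposed on-disk format): each rewrite records, per transaction whose tuples it moved, where every old ctid ended up, so a decoder catching up later can still resolve tuples it logged under the old relfilenode.

```c
typedef struct { unsigned block, offset; } Ctid;

typedef struct
{
    unsigned xid;          /* transaction whose tuple was moved */
    Ctid     old_ctid;     /* location in the old relfilenode */
    Ctid     new_ctid;     /* location in the new relfilenode */
} RewriteEntry;

typedef struct
{
    RewriteEntry entries[64];   /* fixed size, for illustration only */
    int          nentries;
} RewriteMap;

static void
record_move(RewriteMap *map, unsigned xid, Ctid oldc, Ctid newc)
{
    RewriteEntry *e = &map->entries[map->nentries++];

    e->xid = xid;
    e->old_ctid = oldc;
    e->new_ctid = newc;
}

/* Translate a pre-rewrite ctid through several successive rewrite maps,
 * oldest map first -- the ordering requirement Andres notes for 2+ rewrites. */
static Ctid
translate(const RewriteMap *maps, int nmaps, unsigned xid, Ctid ctid)
{
    int m, i;

    for (m = 0; m < nmaps; m++)
        for (i = 0; i < maps[m].nentries; i++)
            if (maps[m].entries[i].xid == xid &&
                maps[m].entries[i].old_ctid.block == ctid.block &&
                maps[m].entries[i].old_ctid.offset == ctid.offset)
            {
                ctid = maps[m].entries[i].new_ctid;
                break;
            }
    return ctid;
}
```

The "history of intermediate mappings" problem is visible here: dropping any one map from the chain breaks translation for every tuple that moved through it, which is why waiting for decoders to catch up before discarding maps is the simpler option.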
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On Tue, Oct 22, 2013 at 9:55 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I propose that we use the same approach I used with GiST, and add a flag to the page header to indicate the downlink hasn't been inserted yet. When insertion (or vacuum) bumps into a flagged page, it can finish the incomplete action by inserting the downlink. Sounds very reasonable, but I'm interested in how you intend to structure things, given this sounds like what could loosely be called btree private state. I may also need to use a flag bit for something that is of interest to the btree code alone. -- Peter Geoghegan
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On 22.10.2013 20:27, Peter Geoghegan wrote: On Tue, Oct 22, 2013 at 9:55 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I propose that we use the same approach I used with GiST, and add a flag to the page header to indicate the downlink hasn't been inserted yet. When insertion (or vacuum) bumps into a flagged page, it can finish the incomplete action by inserting the downlink. Sounds very reasonable, but I'm interested in how you intend to structure things, given this sounds like what could loosely be called btree private state. I may also need to use a flag bit for something that is of interest to the btree code alone. I may be missing something, but there are already plenty of b-tree specific flags. See BTP_* in nbtree.h. I'll just add another to that list. - Heikki
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On Tue, Oct 22, 2013 at 10:30 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I may be missing something, but there are already plenty of b-tree specific flags. See BTP_* in nbtree.h. I'll just add another to that list. Based on your remarks, I thought that you were intent on directly using page level bits (pd_flags), rather than the private state flag bits. Blame it on the lack of coffee, I suppose. -- Peter Geoghegan
Re: [HACKERS] Reasons not to like asprintf
Tom, * Tom Lane (t...@sss.pgh.pa.us) wrote: In short, I want to change pg_asprintf to return the malloc'd buffer as its function result. This probably means we should change the function name too, since the allusion to asprintf is completely useless if we do that. I'm thinking pg_psprintf for the frontend version of psprintf, but maybe somebody has a better suggestion? Sounds reasonable, and I haven't got a better name, but I'm trying to figure out why psprintf hasn't got the same issues which you mention in your other thread (eg: depending on vsnprintf to return the would-be result size). I'm also a bit nervous that we only check vsnprintf() success in Assert-enabled builds in psprintf, though I suppose that's all we're doing in appendStringInfo too. Thanks, Stephen
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund and...@2ndquadrant.com wrote: It seems to me that you have to think of the CTID map as tied to a relfilenode; if you try to use one relfilenode's map with a different relfilenode, it's obviously not going to work. So don't do that. It has to be tied to relfilenode (+ctid) *and* transaction unfortunately. I agree that it does, but it doesn't seem particularly unfortunate to me. That strikes me as a flaw in the implementation rather than the idea. You're presupposing a patch where the necessary information is available in WAL yet you don't make use of it at the proper time. The problem is that the mapping would be somewhere *ahead* from the transaction/WAL we're currently decoding. We'd need to read ahead till we find the correct one. Yes, I think that's what you need to do. But I think I mainly misunderstood what you proposed. That mapping could be written beside the relfilenode, instead of into the WAL. Then my imagined problem doesn't exist anymore. That's pretty ugly. I think it should be written into WAL. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-22 13:57:53 -0400, Robert Haas wrote: On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund and...@2ndquadrant.com wrote: That strikes me as a flaw in the implementation rather than the idea. You're presupposing a patch where the necessary information is available in WAL yet you don't make use of it at the proper time. The problem is that the mapping would be somewhere *ahead* from the transaction/WAL we're currently decoding. We'd need to read ahead till we find the correct one. Yes, I think that's what you need to do. My problem with that is that the rewrite can be gigabytes into the future. When reading forward we could either just continue reading data into the reorderbuffer, but delay replaying all future commits till we found the currently needed remap. That might have quite the additional storage/memory cost, but runtime complexity should be the same as normal decoding. Or we could individually read ahead for every transaction. But doing so for every transaction will get rather expensive (roughly O(amount_of_wal^2)). I think that'd be pretty similar to just disallowing VACUUM FULL/CLUSTER on catalog relations since effectively it'd be too expensive to use. But I think I mainly misunderstood what you proposed. That mapping could be written beside the relfilenode, instead of into the WAL. Then my imagined problem doesn't exist anymore. That's pretty ugly. I think it should be written into WAL. It basically has O(1) access, that's why I was thinking about it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On Mon, Oct 21, 2013 at 4:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andrew Dunstan and...@dunslane.net writes: This is why I suggested the standard deviation, and why I find it would be more useful than just min and max. A couple of outliers will set the min and max to possibly extreme values but hardly perturb the standard deviation over a large number of observations. Hm. It's been a long time since college statistics, but doesn't the entire concept of standard deviation depend on the assumption that the underlying distribution is more-or-less normal (Gaussian)? It is easy to misinterpret the standard deviation if the distribution is not gaussian, but that is also true of the average. The standard deviation (or the variance) is commonly used with non-gaussian distributions, either because it is the most efficient estimator for those particular distributions, or just because it is so commonly available. Cheers, Jeff
Re: [HACKERS] Reasons not to like asprintf
Stephen Frost sfr...@snowman.net writes: Sounds reasonable, and I haven't got a better name, but I'm trying to figure out why psprintf hasn't got the same issues which you mention in your other thread (eg: depending on vsnprintf to return the would-be result size). It does, though not directly in psprintf but rather in the underlying vasprintf implementation. (Although psprintf is exposed to the problem that it can't tell out-of-memory from format errors.) What I'm on about in this thread is the API used by all the call sites, not the underlying implementation issues. As long as we're okay with exit or elog on any error, I think we should just have the widely-used functions returning a buffer pointer. Underneath that, we need some work so that we can print more-specific error messages, but that will not be of concern to the callers. It's possibly debatable whether exit-on-error is okay or not. I think it is though, because none of our frontend code is prepared to do anything except go belly-up on out-of-memory, while in the backend case it's only elog(ERROR) anyway. The other possible class of failures is format string or encoding errors, but those seem to reduce to the same cases: I can't envision that frontend code would have any useful recovery strategies. I'd like the printed error message to distinguish these cases, but I don't think the callers need to care. I'm also a bit nervous that we only check vsnprintf() success in Assert-enabled builds in psprintf, though I suppose that's all we're doing in appendStringInfo too. Actually, appendStringInfo treats a result < 0 as a buffer overrun report; if the failure is persistent because it's really a format/encoding problem, it'll loop till it runs out of memory, and then report the error as that. The trouble here is that in pre-C99 implementations of vsnprintf, a return code < 0 might mean either buffer overrun or a format/encoding problem. 
We can't do too much about this unless we are prepared to move our goalposts about portability assumptions. (Hmm ... current POSIX requires *printf to set errno on error, so maybe we could look at errno?) regards, tom lane
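The appendStringInfo-style retry loop Tom describes can be sketched as below, assuming C99 semantics where vsnprintf returns the needed length on overrun and a negative value only on a genuine format/encoding error; on pre-C99 implementations both cases can come back negative, which is exactly the ambiguity discussed above. This is a simplified illustration, not the PostgreSQL code.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static char *
format_retry(const char *fmt, ...)
{
    size_t  size = 8;                /* deliberately small initial buffer */
    char   *buf = malloc(size);

    while (buf != NULL)
    {
        va_list args;
        int     needed;
        char   *nbuf;

        errno = 0;
        va_start(args, fmt);
        needed = vsnprintf(buf, size, fmt, args);
        va_end(args);

        if (needed < 0)
        {
            /* genuine failure; POSIX says errno identifies the cause */
            free(buf);
            return NULL;
        }
        if ((size_t) needed < size)
            return buf;              /* it fit */

        /* buffer overrun: grow to exactly the reported size and retry */
        size = (size_t) needed + 1;
        nbuf = realloc(buf, size);
        if (nbuf == NULL)
            free(buf);
        buf = nbuf;
    }
    return NULL;                     /* out of memory */
}
```

Under pre-C99 semantics this loop cannot grow "exactly"; it would have to double blindly on every negative return, which is why a persistent format error ends up misreported as out-of-memory, as described in the mail.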
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote: Splitting a B-tree page is a two-stage process: First, the page is split, and then a downlink for the new right page is inserted into the parent (which might recurse to split the parent page, too). What happens if inserting the downlink fails for some reason? I tried that out, and it turns out that it's not nice. I used this to cause a failure: --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel, _bt_relbuf(rel, pbuf); } +elog(ERROR, "fail!"); + /* get high key from left page == lowest key on new right page */ ritem = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY)); postgres=# create table foo (i int4 primary key); CREATE TABLE postgres=# insert into foo select generate_series(1, 1); ERROR: fail! That's not surprising. But when I removed that elog again and restarted the server, I still can't insert. The index is permanently broken: postgres=# insert into foo select generate_series(1, 1); ERROR: failed to re-find parent key in index foo_pkey for split pages 4/5 In real life, you would get a failure like this e.g. if you run out of memory or disk space while inserting the downlink to the parent. Although rare in practice, it's no fun if it happens. Why doesn't the incomplete split mechanism prevent this? Because we do not delay checkpoints on the primary and a checkpoint happened just before your elog(ERROR) above? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Location for external scripts for Extensions?
All, pg_partman has several external (python) scripts which help the extension, located in /extras/ in its source. The problem currently is that if you install pg_partman via pgxn or package, you don't get those scripts, because there's no install location for them. Where should these go? PGSHARE/pg_partman/? Or somewhere else? I'm looking to set a canonical location for extensions with external scripts. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On 22.10.2013 21:25, Andres Freund wrote: On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote: Splitting a B-tree page is a two-stage process: First, the page is split, and then a downlink for the new right page is inserted into the parent (which might recurse to split the parent page, too). What happens if inserting the downlink fails for some reason? I tried that out, and it turns out that it's not nice. I used this to cause a failure: --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel, _bt_relbuf(rel, pbuf); } + elog(ERROR, "fail!"); + /* get high key from left page == lowest key on new right page */ ritem = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY)); postgres=# create table foo (i int4 primary key); CREATE TABLE postgres=# insert into foo select generate_series(1, 1); ERROR: fail! That's not surprising. But when I removed that elog again and restarted the server, I still can't insert. The index is permanently broken: postgres=# insert into foo select generate_series(1, 1); ERROR: failed to re-find parent key in index foo_pkey for split pages 4/5 In real life, you would get a failure like this e.g. if you run out of memory or disk space while inserting the downlink to the parent. Although rare in practice, it's no fun if it happens. Why doesn't the incomplete split mechanism prevent this? Because we do not delay checkpoints on the primary and a checkpoint happened just before your elog(ERROR) above? Because there's no recovery involved. The failure I injected (or an out-of-memory or out-of-disk-space in the real world) doesn't cause a PANIC, just an ERROR that rolls back the current transaction, nothing more. We could put a critical section around the whole recursion that inserts the downlinks, so that you would get a PANIC and the incomplete split mechanism would fix it at recovery. But that would hardly be an improvement. 
- Heikki
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On Mon, Oct 21, 2013 at 1:36 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Oct 21, 2013 at 4:01 PM, Tom Lane t...@sss.pgh.pa.us wrote: Gavin Flower gavinflo...@archidevsys.co.nz writes: If we're going to extend pg_stat_statements, even more than min and max I'd like to see the standard deviation in execution time. How about the 'median', often a lot more useful than the 'arithmetic mean' (which most people call the 'average'). AFAIK, median is impossible to calculate cheaply (in particular, with a fixed amount of workspace). So this apparently innocent request is actually moving the goalposts a long way, because the space per query table entry is a big concern for pg_stat_statements. Yeah, and I worry about min and max not being very usable - once they get pushed out to extreme values, there's nothing to drag them back toward normality except resetting the stats, and that's not something we want to encourage people to do frequently. Of course, averages over very long sampling intervals may not be too useful anyway, dunno. I think the pg_stat_statements_reset() should be done every time you make a change which you think (or hope) will push the system into a new regime, which goes for either min/max or for average/stdev. A histogram would be cool, but it doesn't seem very practical to implement. If I really needed that I'd probably set log_min_duration_statement = 0 and mine the log files. But that means I'd have to wait to accumulate enough logs once I made that change, then remember to turn it off. What I'd like most in pg_stat_statements now is the ability to distinguish which queries have a user grinding their teeth, versus which ones have a cron job patiently doing a wait4. I don't know the best way to figure that out, other than stratify on application_name. Or maybe a way to selectively undo the query text normalization, so I could see which parameters were causing the problem. Cheers, Jeff
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote: On 22.10.2013 21:25, Andres Freund wrote: On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote: Splitting a B-tree page is a two-stage process: First, the page is split, and then a downlink for the new right page is inserted into the parent (which might recurse to split the parent page, too). What happens if inserting the downlink fails for some reason? I tried that out, and it turns out that it's not nice. I used this to cause a failure: --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel, _bt_relbuf(rel, pbuf); } + elog(ERROR, "fail!"); + /* get high key from left page == lowest key on new right page */ ritem = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY)); postgres=# create table foo (i int4 primary key); CREATE TABLE postgres=# insert into foo select generate_series(1, 1); ERROR: fail! That's not surprising. But when I removed that elog again and restarted the server, I still can't insert. The index is permanently broken: postgres=# insert into foo select generate_series(1, 1); ERROR: failed to re-find parent key in index foo_pkey for split pages 4/5 In real life, you would get a failure like this e.g. if you run out of memory or disk space while inserting the downlink to the parent. Although rare in practice, it's no fun if it happens. Why doesn't the incomplete split mechanism prevent this? Because we do not delay checkpoints on the primary and a checkpoint happened just before your elog(ERROR) above? Because there's no recovery involved. The failure I injected (or an out-of-memory or out-of-disk-space in the real world) doesn't cause a PANIC, just an ERROR that rolls back the current transaction, nothing more. We could put a critical section around the whole recursion that inserts the downlinks, so that you would get a PANIC and the incomplete split mechanism would fix it at recovery. 
But that would hardly be an improvement.

You were talking about restarting the server, that's why I assumed recovery had been involved... But you just were talking about removing the elog() again. For me this clearly *has* to be in a critical section with the current code. I had always assumed all multi-part actions would be. Do you foresee the fix with ignoring missing downlinks to be back-patchable? FWIW, I think I might have seen real-world cases of this.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reasons not to like asprintf
* Tom Lane (t...@sss.pgh.pa.us) wrote: Stephen Frost sfr...@snowman.net writes: Sounds reasonable, and I haven't got a better name, but I'm trying to figure out why psprintf hasn't got the same issues which you mention in your other thread (eg: depending on vsnprintf to return the would-be result size). It does, though not directly in psprintf but rather in the underlying vasprintf implementation. (Although psprintf is exposed to the problem that it can't tell out-of-memory from format errors.) Right, pvsprintf(). Sorry for any confusion. What I'm on about in this thread is the API used by all the call sites, not the underlying implementation issues. As long as we're okay with exit or elog on any error, I think we should just have the widely-used functions returning a buffer pointer. Underneath that, we need some work so that we can print more-specific error messages, but that will not be of concern to the callers. I agree that exit/Assert-or-elog is the right API here. I wonder if it'd be reasonable or worthwhile to try and actually make that happen- that is to say, we really do only have one implementation for both front-end and back-end here, but on the front-end we exit, while on the back-end we elog(ERROR). I'm guessing that's a pipe-dream, but figured I'd mention it anyway, in case people have crafty ideas about how that might be made to work (and if others think it's even a good idea). The other possible class of failures is format string or encoding errors, but those seem to reduce to the same cases: I can't envision that frontend code would have any useful recovery strategies. I'd like the printed error message to distinguish these cases, but I don't think the callers need to care. Agreed. I'm also a bit nervous that we only check vsnprintf() success in Assert-enabled builds in psprintf, though I suppose that's all we're doing in appendStringInfo too. 
Actually, appendStringInfo treats result < 0 as a buffer overrun report;

I had been looking at the Assert() that the last character is always '\0' in appendStringInfoVA. Probably a stretch to say that has to be upgraded to an elog(), but I'd certainly be much happier with a runtime "did this error?" check of some kind than not; hence my support for your errno notion below.

if the failure is persistent because it's really a format/encoding problem, it'll loop till it runs out of memory, and then report the error as that.

Ah, yes, I see that in enlargeStringInfo, and as long as we keep MaxAllocSize reasonable that isn't terrible.

(Hmm ... current POSIX requires *printf to set errno on error, so maybe we could look at errno?)

Sounds like a good idea to me. Being explicit about checking for error results, rather than hoping-against-hope that the only reason we'd ever get a value less than the total length is when we need to provide more memory, would be a lot better. Perhaps it'd be worthwhile to run a test against the build farm and see what kind of errno's are being set in those cases where we're now getting the 'out of memory' failures you mentioned upthread?

Thanks,

Stephen
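The distinction being discussed here can be sketched in a small standalone helper (this is an illustration of the errno idea, not PostgreSQL code; `try_format` is a hypothetical name): POSIX says vsnprintf returns a negative value and sets errno on a real format/encoding failure, while a non-negative return value that is at least the buffer size simply means the buffer was too small and reports how much space would have been needed.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>

/*
 * Illustrative sketch (not PostgreSQL code): distinguish "buffer too
 * small" from a genuine format/encoding failure when calling vsnprintf.
 * Returns 0 on success, -1 on hard failure (errno says why), or the
 * number of bytes the buffer would need for the call to succeed.
 */
static int
try_format(char *buf, size_t buflen, const char *fmt, ...)
{
    va_list     args;
    int         nprinted;

    errno = 0;
    va_start(args, fmt);
    nprinted = vsnprintf(buf, buflen, fmt, args);
    va_end(args);

    if (nprinted < 0)
        return -1;              /* hard failure: inspect errno for cause */
    if ((size_t) nprinted >= buflen)
        return nprinted + 1;    /* buffer too small: bytes actually needed */
    return 0;                   /* success: buf holds the formatted string */
}
```

The caller can then retry with a larger buffer only in the too-small case, instead of looping until out-of-memory on a persistent format error.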
Re: [HACKERS] Reasons not to like asprintf
Stephen Frost sfr...@snowman.net writes: I agree that exit/Assert-or-elog is the right API here. I wonder if it'd be reasonable or worthwhile to try and actually make that happen- that is to say, we really do only have one implementation for both front-end and back-end here, but on the front-end we exit, while on the back-end we elog(ERROR). I'm guessing that's a pipe-dream, It's not really --- we could make it happen with a single source-file in libpgcommon, with a few #ifdef FRONTENDs, similarly to exec.c for instance. I'm not entirely sure that it's worth the trouble, since once you take away the error and memory management there's not much left of these functions; so the #ifdefs might end up being most of the code. I'm going to go code it shortly, and we'll find out. I had been looking at the Assert() that the last character is always '\0' in allocStringInfoVA. Probably a stretch to say that has to be upgraded to an elog(), but I'd certainly be much happier with a runtime did this error? check of some kind than not; IIRC, that Assert is meant to catch an ancient buggy version of vsnprintf, so it can't really be replaced by an errno check --- the whole point is that that vsnprintf failed to recognize it was doing something wrong. We could upgrade it to an elog; but we had that debate years ago when the code went in, and I don't think the passage of time has made it more important rather than less so to have an always-on check for that case. (Hmm ... current POSIX requires *printf to set errno on error, so maybe we could look at errno?) Sounds like a good idea to me.. Being explicit about checking for error results, rather than hoping-against-hope that the only reason we'd ever get a value less than the total length is when we need to provide more memory, would be a lot better. 
Perhaps it'd be worthwhile to run a test against the build farm and see what kind of errno's are being set in those cases where we're now getting the 'out of memory' failures you mentioned upthread?

Yeah, I'm hoping that reporting errno will give us some more insight, if anole's problem is still there after we replace the code.

regards, tom lane
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
Andres Freund and...@2ndquadrant.com writes: On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote: We could put a critical section around the whole recursion that inserts the downlinks, so that you would get a PANIC and the incomplete split mechanism would fix it at recovery. But that would hardly be an improvement. For me this clearly *has* to be in a critical section with the current code. I had always assumed all multi-part actions would be.

No, that's hardly a good idea. As Heikki says, that would amount to converting an entirely foreseeable situation into a PANIC. Note also that the problem might be persistent, eg if you're out of disk space. In that case, you'd just get the PANIC again at recovery, so now not only have you crashed all your sessions but you have a database that won't start up.

I wonder whether Heikki's approach could be used to remove the need for the incomplete-split-fixup code altogether, thus eliminating a class of recovery failure possibilities.

regards, tom lane
Re: [HACKERS] Location for external scripts for Extensions?
Josh Berkus j...@agliodbs.com writes: pg_partman has several external (python) scripts which help the extension, located in /extras/ in its source. The problem currently is that if you install pg_partman via pgxn or package, you don't get those scripts, because there's no install location for them.

See also my proposal to solve that, I'd welcome some design level discussions about it: http://www.postgresql.org/message-id/m28uyzgof3@2ndquadrant.fr

Regards,
--
Dimitri Fontaine   http://2ndQuadrant.fr
PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Commitfest II CLosed
On 10/21/2013 11:59 AM, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: Either you're proposing a solution, supporting someone else's solution, or you're saying the problem isn't important. There is no fourth alternative. Nonsense. Pointing out that a proposed solution isn't workable is not saying that the problem isn't important. Or are you trying to establish a rule that we can't complain about a proposal unless we have another one to make? Sorry, I won't accept that.

In some cases the other solution is "we need to search for a better solution". But if you say the proposed solution is bad without even proposing criteria for a better solution, then you are *de facto* saying that the problem isn't important, whether or not you would like to pretend that you're saying something else.

If a problem is important, then it is worth solving. If it's not worth solving, then it's not important. This is just as true of bugs in our process as it is of bugs in our code; if we release without fixing a bug, then we are making a concrete statement that the bug is not important. For the past two years, we've proceeded without fixing the bugs in our process, which is a material statement that most contributors don't feel that the process is buggy. Heck, the whole discussion about reviews got cut from the Developer's Meeting agenda this year, and if that's not a statement of unimportance, I don't know what is.

When I came up with the idea of CommitFests they were supposed to be an incremental improvement for us to build on. Instead it's remained frozen in amber, and steadily becoming less and less effective. I've suggested a number of improvements and changes over the years, and largely been rewarded with denial, attacks, ridicule, and general sandbaggery. I'm done. If the community doesn't think there's a problem, then clearly I'm in error for proposing fixes. Not sure who you're going to get to do CF3, though.
I'm not going to be CFM again, and I'm pretty sure nobody else wants the job either.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On 22.10.2013 22:24, Tom Lane wrote: I wonder whether Heikki's approach could be used to remove the need for the incomplete-split-fixup code altogether, thus eliminating a class of recovery failure possibilities.

Yes. I intend to do that, too.

- Heikki
Re: [HACKERS] Commitfest II CLosed
Josh,

* Josh Berkus (j...@agliodbs.com) wrote: In some cases the other solution is "we need to search for a better solution". But if you say the proposed solution is bad without even proposing criteria for a better solution, then you are *de facto* saying that the problem isn't important, whether or not you would like to pretend that you're saying something else. If a problem is important, then it is worth solving. If it's not worth solving, then it's not important.

Or you're simply saying that other things hold priority over this particular problem. Sure, that makes it *less* important than other things (if priority is your measure of importance, and it may or may not be) but it is not the same to say that something is unimportant.

This is just as true of bugs in our process as it is of bugs in our code; if we release without fixing a bug, then we are making a concrete statement that the bug is not important.

Or that the priority of the release is *more* important. Things are not all either red-or-blue here (to use politically correct colors).

For the past two years, we've proceeded without fixing the bugs in our process, which is a material statement that most contributors don't feel that the process is buggy.

It's not hard to imagine that developers might feel that bugs, code, hacking, etc, are of a higher priority than very vague and extremely challenging process 'bugs'.

If the community doesn't think there's a problem, then clearly I'm in error for proposing fixes.

For my 2c, I agree that there's a problem, but I've got a ton of other tasks, not to mention $dayjob, that makes it unlikely that I'll find time to come up with alternate proposals. I will also say that I feel that this process is still an *improvement* over the previous process, which is really the only other one we've actually tested. Perhaps that means we should just try different things out and see if we can't build the best process out there through supervised learning.
Thanks,

Stephen
Re: [HACKERS] Commitfest II CLosed
On 10/21/2013 08:11 AM, Robert Haas wrote: Supposedly, we have a policy that for each patch you submit, you ought to review a patch. That right there ought to provide enough reviewers for all the patches, but clearly it didn't. And I'm pretty sure that some people (like me) looked at a lot MORE patches than they themselves submitted. I think auditing who is not contributing in that area and finding tactful ways to encourage them to contribute would be a very useful service to the project.

What if as part of the patch submission process you had to pick the patch you were going to review? If there are no patches to review, then we obviously don't have a problem. If there are patches to review then we are all set. I guess there is the problem of there only being patches that a submitter is not qualified to review, but I find that minuscule as every person on this list (myself included) can do a cursory review (patch applies, docs are good, indentation is appropriate, works as advertised).

The commitfest app would have to be modified for this but what do people think?

Joshua D. Drake

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
Re: [HACKERS] Reasons not to like asprintf
... BTW, another reason to choose identical APIs for frontend and backend versions of these functions is that it greatly eases use of them in shared frontend/backend code. As I notice somebody has *already done* in common/relpath.c. I'm not exactly sure how those psprintf calls are working at all in frontend builds. Maybe they aren't quite, and that has something to do with the failures on anole?

In order to avoid having to clutter stuff like that with #ifdef FRONTENDs, I'm now thinking we should use exactly the same names for the frontend and backend versions, ie psprintf() and pvsprintf(). The main reason for considering a pg_ prefix for the frontend versions was to avoid cluttering application namespace; but it's already the case that we don't expect libpgcommon to be namespace clean.

regards, tom lane
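The single-source-file pattern being discussed can be illustrated with a standalone toy (names here are hypothetical, and this is not the committed code): the same function body is compiled for both environments, with only the error path differing under the standard FRONTEND compile-time switch. FRONTEND is defined below so the sketch is self-contained.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Standalone illustration of one source file serving both builds:
 * under FRONTEND we print to stderr and exit; in the real backend the
 * #else branch would call elog(ERROR) and longjmp out instead.
 */
#define FRONTEND

static char *
xstrdup_or_die(const char *s)
{
    char       *copy = malloc(strlen(s) + 1);

    if (copy == NULL)
    {
#ifdef FRONTEND
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);             /* frontend: report and exit */
#else
        elog(ERROR, "out of memory");   /* backend: recoverable error */
#endif
    }
    strcpy(copy, s);
    return copy;
}
```

Callers on both sides can then treat the function as infallible, which is the point of the exit-or-elog API.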
Re: [HACKERS] Reasons not to like asprintf
* Tom Lane (t...@sss.pgh.pa.us) wrote: ... BTW, another reason to choose identical APIs for frontend and backend versions of these functions is that it greatly eases use of them in shared frontend/backend code. As I notice somebody has *already done* in common/relpath.c. I'm not exactly sure how those psprintf calls are working at all in frontend builds. Maybe they aren't quite, and that has something to do with the failures on anole?

Seems plausible.

In order to avoid having to clutter stuff like that with #ifdef FRONTENDs, I'm now thinking we should use exactly the same names for the frontend and backend versions, ie psprintf() and pvsprintf(). The main reason for considering a pg_ prefix for the frontend versions was to avoid cluttering application namespace; but it's already the case that we don't expect libpgcommon to be namespace clean.

To be honest, I had been assuming we'd use the same names. As for which direction to go, I'd personally prefer psprintf but that's just my lazy developer fingers talking.

Thanks,

Stephen
Re: [HACKERS] Reasons not to like asprintf
Tom Lane wrote: ... BTW, another reason to choose identical APIs for frontend and backend versions of these functions is that it greatly eases use of them in shared frontend/backend code. As I notice somebody has *already done* in common/relpath.c. I'm not exactly sure how those psprintf calls are working at all in frontend builds.

There's psprintf in src/common/fe_memutils.c, too, using the backend's API, which is why this works.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On 2013-10-22 15:24:40 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote: We could put a critical section around the whole recursion that inserts the downlinks, so that you would get a PANIC and the incomplete split mechanism would fix it at recovery. But that would hardly be an improvement. For me this clearly *has* to be in a critical section with the current code. I had always assumed all multi-part actions would be. No, that's hardly a good idea. As Heikki says, that would amount to converting an entirely foreseeable situation into a PANIC.

But IIUC this can currently lead to an index giving wrong answers, not just fail at further inserts? I think if we half-split (without inserting the downlink) a page several times, via different child-pages, we might succeed in splitting but we'll have corrupted left/right links. With complete splits such things should be prevented by the page-level locks we hold, but that's not the case anymore. If so, that could, especially in combination with unique indexes, lead to quite nasty data corruption.

I wonder whether Heikki's approach could be used to remove the need for the incomplete-split-fixup code altogether, thus eliminating a class of recovery failure possibilities.

That'd be better...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Reasons not to like asprintf
Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: ... BTW, another reason to choose identical APIs for frontend and backend versions of these functions is that it greatly eases use of them in shared frontend/backend code. As I notice somebody has *already done* in common/relpath.c. I'm not exactly sure how those psprintf calls are working at all in frontend builds. There's psprintf in src/common/fe_memutils.c, too, using the backend's API, which is why this works.

Ah --- that's why it links, anyway. As for "works", I suspect that the answer is that anole has a vsnprintf that returns -1 on buffer overrun. vasprintf and then psprintf will interpret that as out of memory, resulting in the observed behavior. It'd be interesting to see the stack trace from that error, though, just to confirm this theory.

regards, tom lane
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
Andres Freund and...@2ndquadrant.com writes: On 2013-10-22 15:24:40 -0400, Tom Lane wrote: No, that's hardly a good idea. As Heikki says, that would amount to converting an entirely foreseeable situation into a PANIC. But IIUC this can currently lead to an index giving wrong answers, not just fail at further inserts?

No. It's exactly the same situation as when the insert is still in progress, ie searches have to move right when following the original downlink. Go read the nbtree README.

regards, tom lane
Re: [HACKERS] Commitfest II CLosed
On Tue, Oct 22, 2013 at 2:38 PM, Joshua D. Drake j...@commandprompt.com wrote: On 10/21/2013 08:11 AM, Robert Haas wrote: Supposedly, we have a policy that for each patch you submit, you ought to review a patch. That right there ought to provide enough reviewers for all the patches, but clearly it didn't. And I'm pretty sure that some people (like me) looked at a lot MORE patches than they themselves submitted. I think auditing who is not contributing in that area and finding tactful ways to encourage them to contribute would be a very useful service to the project. What if as part of the patch submission process you had to pick the patch you were going to review? If there are no patches to review, then we obviously don't have a problem. If there are patches to review then we are all set.

If we are going to modify the CF app (not offering myself, and I'm not trying to bind anyone either), I would prefer to see a flag indicating whether the number of reviews registered is less than the number of submitted patches. This could be a column just after the author of a patch, so people can give preference to patches from submitters who are also reviewing other people's patches.

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On 2013-10-22 16:38:05 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2013-10-22 15:24:40 -0400, Tom Lane wrote: No, that's hardly a good idea. As Heikki says, that would amount to converting an entirely foreseeable situation into a PANIC. But IIUC this can currently lead to an index giving wrong answers, not just fail at further inserts? No. It's exactly the same situation as when the insert is still in progress, ie searches have to move right when following the original downlink. Go read the nbtree README.

If the parent insert is still in progress we have a write lock on the left and right page preventing such issues, right? At least that's what the nbtree README says... That doesn't hold true if we split a page but fail before re-inserting the parent since the transaction rollback will obviously have released the locks.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Reasons not to like asprintf
I wrote: Stephen Frost sfr...@snowman.net writes: I agree that exit/Assert-or-elog is the right API here. I wonder if it'd be reasonable or worthwhile to try and actually make that happen- that is to say, we really do only have one implementation for both front-end and back-end here, but on the front-end we exit, while on the back-end we elog(ERROR). I'm guessing that's a pipe-dream, It's not really --- we could make it happen with a single source-file in libpgcommon, with a few #ifdef FRONTENDs, similarly to exec.c for instance. I'm not entirely sure that it's worth the trouble, since once you take away the error and memory management there's not much left of these functions; so the #ifdefs might end up being most of the code. I'm going to go code it shortly, and we'll find out. Attached is a draft, which compiles though I've not yet actually tested it. It seems pretty reasonable as far as readability goes. A couple of notes: 1. I omitted pvsprintf(), which we don't actually use anywhere anyway. I don't see any way to implement that API without va_copy, which is one of the main things I'm trying to get rid of. Since we've never needed a va_args variant in stringinfo.c in all the years we've had that, I think we can get away with omitting it here. 2. The file includes utils/memutils.h, which I'm not 100% sure is safe to include in the FRONTEND case. This is so that it can refer to MaxAllocSize. If it turns out that that causes build problems on some platforms, we could use some more ad-hoc way to define the limit (maybe just INT_MAX/4?), or we could move the definition of MaxAllocSize into palloc.h. Barring objections, I'll push forward with replacing the existing code with this. 
regards, tom lane

/*-------------------------------------------------------------------------
 *
 * psprintf.c
 *		sprintf into an allocated-on-demand buffer
 *
 * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *	  src/common/psprintf.c
 *
 *-------------------------------------------------------------------------
 */

#ifndef FRONTEND
#include "postgres.h"
#else
#include "postgres_fe.h"
#endif

#include "utils/memutils.h"

static size_t pvsnprintf(char *buf, size_t len, const char *fmt, va_list args);

/*
 * psprintf
 *
 * Format text data under the control of fmt (an sprintf-style format string)
 * and return it in an allocated-on-demand buffer.  The buffer is allocated
 * with palloc in the backend, or malloc in frontend builds.  Caller is
 * responsible to free the buffer when no longer needed, if appropriate.
 *
 * Errors are not returned to the caller, but are reported via elog(ERROR)
 * in the backend, or printf-to-stderr-and-exit() in frontend builds.
 * One should therefore think twice about using this in libpq.
 */
char *
psprintf(const char *fmt,...)
{
	size_t		len = 128;		/* initial assumption about buffer size */

	for (;;)
	{
		char	   *result;
		va_list		args;

		/*
		 * Allocate result buffer.  Note that in frontend this maps to malloc
		 * with exit-on-error.
		 */
		result = (char *) palloc(len);

		/* Try to format the data. */
		va_start(args, fmt);
		len = pvsnprintf(result, len, fmt, args);
		va_end(args);

		if (len == 0)
			return result;		/* success */

		/* Release buffer and loop around to try again with larger len. */
		pfree(result);
	}
}

/*
 * pvsnprintf
 *
 * Attempt to format text data under the control of fmt (an sprintf-style
 * format string) and insert it into buf (which has length len).
 *
 * If successful, return zero.  If there's not enough space in buf, return
 * an estimate of the buffer size needed to succeed (this *must* be more
 * than "len", else psprintf might loop infinitely).
 * Other error cases do not return.
 *
 * XXX This API is ugly, but there seems no alternative given the C spec's
 * restrictions on what can portably be done with va_list arguments: you have
 * to redo va_start before you can rescan the argument list, and we can't do
 * that from here.
 */
static size_t
pvsnprintf(char *buf, size_t len, const char *fmt, va_list args)
{
	int			nprinted;

	Assert(len > 0);

	errno = 0;

	/*
	 * Assert check here is to catch buggy vsnprintf that overruns the
	 * specified buffer length.  Solaris 7 in 64-bit mode is an example of a
	 * platform with such a bug.
	 */
#ifdef USE_ASSERT_CHECKING
	buf[len - 1] = '\0';
#endif

	nprinted = vsnprintf(buf, len, fmt, args);

	Assert(buf[len - 1] == '\0');

	/* *
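The allocate-and-retry contract described above can also be sketched standalone, with plain malloc in place of palloc (hypothetical names, not the patch's code). It shows why the helper reports the needed size back to the caller: only the caller can redo va_start before rescanning the argument list, which is exactly the va_list restriction Tom mentions.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Standalone sketch of the psprintf retry loop: format into a
 * guessed-size buffer; if vsnprintf reports the buffer was too small,
 * reallocate with the size it asked for and redo va_start.  Returns
 * NULL on out-of-memory or on a genuine format error (errno says why).
 */
static char *
xsprintf(const char *fmt, ...)
{
    size_t      len = 32;       /* initial guess at the needed size */

    for (;;)
    {
        char       *result = malloc(len);
        va_list     args;
        int         nprinted;

        if (result == NULL)
            return NULL;        /* out of memory */

        errno = 0;
        va_start(args, fmt);
        nprinted = vsnprintf(result, len, fmt, args);
        va_end(args);

        if (nprinted >= 0 && (size_t) nprinted < len)
            return result;      /* success: result fits */

        free(result);
        if (nprinted < 0)
            return NULL;        /* hard format error */
        len = (size_t) nprinted + 1;    /* exact size needed; retry */
    }
}
```

The real code differs mainly in that palloc/elog make the error cases non-returning, so callers never see NULL.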
[HACKERS] tracking commit timestamps
Hi,

There has been some interest in keeping track of the timestamp of transaction commits. This patch implements that.

There are some seemingly curious choices here. First, this module can be disabled, and in fact it's turned off by default. At startup, we verify whether it's enabled, and create the necessary SLRU segments if so. And if the server is started with this disabled, we set the oldest value we know about, to avoid trying to read the commit TS of transactions of which we didn't keep record. The ability to turn this off is there to avoid imposing the overhead on systems that don't need this feature.

Another thing of note is that we allow for some extra data alongside the timestamp proper. This might be useful for a replication system that wants to keep track of the origin node ID of a committed transaction, for example. Exactly what we will do with the bit space we have is unclear, so I have kept it generic and called it "commit extra data".

This offers the chance for outside modules to set the commit TS of a transaction; there is support for WAL-logging such values. But the core user of the feature (RecordTransactionCommit) doesn't use it, because xact.c's WAL logging itself is enough. For systems that are replicating transactions from remote nodes, it is useful.

We also keep track of the latest committed transaction. This is supposed to be useful to calculate replication lag.
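As a rough sketch of how per-transaction commit timestamps map onto fixed-size SLRU pages indexed by xid (the constants and names below are illustrative only; the patch's actual entry layout with its "commit extra data" and page math may differ):

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Illustrative constants, not necessarily the patch's values. */
#define BLCKSZ              8192
#define COMMIT_TS_ENTRY_SZ  12  /* e.g. 8-byte timestamp + 4 bytes extra */
#define ENTRIES_PER_PAGE    (BLCKSZ / COMMIT_TS_ENTRY_SZ)

/* Which SLRU page holds the commit timestamp of a given xid? */
static int
xid_to_page(TransactionId xid)
{
    return (int) (xid / ENTRIES_PER_PAGE);
}

/* Slot within that page for the given xid. */
static int
xid_to_entry(TransactionId xid)
{
    return (int) (xid % ENTRIES_PER_PAGE);
}
```

This is why the module must record the oldest xid it kept data for: a lookup below that boundary would otherwise try to read an SLRU page that was never written.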
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***************
*** 9,14 ****
--- 9,15 ----
  #include "postgres.h"
  
  #include "access/clog.h"
+ #include "access/committs.h"
  #include "access/gin.h"
  #include "access/gist_private.h"
  #include "access/hash.h"
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2257,2262 ****
  include 'filename'
--- 2257,2277 ----
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
+      <term><varname>track_commit_timestamp</varname> (<type>bool</type>)</term>
+      <indexterm>
+       <primary><varname>track_commit_timestamp</varname> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Record commit time of transactions.  This parameter
+        can only be set in
+        the <filename>postgresql.conf</filename> file or on the server command line.
+        The default value is off.
+       </para>
+      </listitem>
+     </varlistentry>
+ 
     </variablelist>
    </sect2>
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***************
*** 8,14 ****
  subdir = src/backend/access/rmgrdesc
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
 
! OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
  	mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
  	standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
--- 8,15 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
 
!
OBJS = clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o \
!	heapdesc.o \
  	mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
  	standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
*** /dev/null
--- b/src/backend/access/rmgrdesc/committsdesc.c
***************
*** 0 ****
--- 1,53 ----
+ /*-------------------------------------------------------------------------
+  *
+  * committsdesc.c
+  *	  rmgr descriptor routines for access/transam/committs.c
+  *
+  * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/access/rmgrdesc/committsdesc.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+ 
+ #include "access/committs.h"
+ #include "utils/timestamp.h"
+ 
+ 
+ void
+ committs_desc(StringInfo buf, uint8 xl_info, char *rec)
+ {
+ 	uint8		info = xl_info & ~XLR_INFO_MASK;
+ 
+ 	if (info == COMMITTS_ZEROPAGE)
+ 	{
+ 		int			pageno;
+ 
+ 		memcpy(&pageno, rec, sizeof(int));
+ 		appendStringInfo(buf, "zeropage: %d", pageno);
+ 	}
+ 	else if (info == COMMITTS_TRUNCATE)
+ 	{
+ 		int			pageno;
+ 
+ 		memcpy(&pageno, rec, sizeof(int));
+ 		appendStringInfo(buf, "truncate before: %d", pageno);
+ 	}
+ 	else if (info == COMMITTS_SETTS)
+ 	{
+ 		xl_committs_set *xlrec = (xl_committs_set *) rec;
+ 		int			i;
+ 
+ 		appendStringInfo(buf, "set committs %s for: %u",
+ 						 timestamptz_to_str(xlrec->timestamp),
+ 						 xlrec->mainxid);
+ 		for (i = 0; i < xlrec->nsubxids; i++)
+ 			appendStringInfo(buf, ", %u", xlrec->subxids[i]);
+ 	}
+ 	else
+ 		appendStringInfo(buf, "UNKNOWN");
+ }
*** a/src/backend/access/rmgrdesc/xlogdesc.c
--- b/src/backend/access/rmgrdesc/xlogdesc.c
***************
*** 44,50 ****
  xlog_desc(StringInfo buf,
[HACKERS] matviews do not lock relations during refresh
Hi,

In a postgres build that has added instrumentation to detect cases where indexes are index_open()ed without any locks on the underlying relation, the matview code started to cry during the regression tests.

The problem seems to be that refresh_matview_datafill() uses QueryRewrite() without previously using AcquireRewriteLocks(), but as QueryRewrite states:

 * ...
 * NOTE: the parsetree must either have come straight from the parser,
 * or have been scanned by AcquireRewriteLocks to acquire suitable locks.
 */

I think the missed locks could lead to postgres crashing under concurrent DDL.

I've attached a quick patch that fixes the issue for me in HEAD, but I am not 100% sure I understand the details of refresh_matview_datafill(). This probably needs to go into 9.3 in an adapted form.

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

From ae34a1b4d59627360131bb795da75326deea40b3 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Wed, 23 Oct 2013 01:25:49 +0200
Subject: [PATCH] Acquire appropriate locks when rewriting during REFRESH
 MATERIALIZED VIEW

---
 src/backend/commands/matview.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index fcfc678..1c02eab 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -283,6 +283,7 @@ refresh_matview_datafill(DestReceiver *dest, Query *query,
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	Query	   *copied_query;
 
 	/*
 	 * Switch to the owner's userid, so that any functions are run as that
@@ -294,8 +295,14 @@ refresh_matview_datafill(DestReceiver *dest, Query *query,
 									save_sec_context | SECURITY_RESTRICTED_OPERATION);
 	save_nestlevel = NewGUCNestLevel();
 
-	/* Rewrite, copying the given Query to make sure it's not changed */
-	rewritten = QueryRewrite((Query *) copyObject(query));
+	/* copy the given Query to make sure it's not changed */
+	copied_query = copyObject(query);
+
+	/* acquire locks for rewriting */
+	AcquireRewriteLocks(copied_query, false);
+
+	/* Rewrite */
+	rewritten = QueryRewrite(copied_query);
 
 	/* SELECT should never rewrite to more or less than one SELECT query */
 	if (list_length(rewritten) != 1)
-- 
1.8.3.251.g1462b67

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why the asprintf patch is still breaking the buildfarm
On Tue, Oct 22, 2013 at 11:00:42AM -0400, Tom Lane wrote: Yeah. As a separate matter, it might be useful to revise stringinfo.c and the asprintf code so that *if* the returned value is larger than the given buffer size, we use it as a guide to resizing, avoiding the possible need to loop multiple times to make the buffer large enough. And we could also improve our own implementation of snprintf to follow the C99 spec. The point here is that we still need to cope with pre-C99 implementations that might return -1 or the given buffer size on overflow. The NetBSD implementation doesn't do that, which is reasonable in their context, but not workable for us. I would vote for choosing the standard we want vsnprintf() to follow (probably C99) and substituting a conforming implementation wherever configure detects that libc does not conform. We'll be shipping some replacement vsnprintf() in any case; we may as well use it to insulate the rest of our code from less-preferred variants. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] missing locking in at least INSERT INTO view WITH CHECK
Hi, Using the same debugging hack^Wpatch (0001) as in the matview patch (0002) an hour or so ago I noticed that INSERT INTO view WITH CHECK doesn't lock the underlying relations properly. I've attached a sort-of-working (0003) hack but I really doubt it's the correct approach, I don't really know enough about that area of the code. This looks like something that needs to be fixed. Also attached is 0004 which just adds a heap_lock() around a newly created temporary table in the matview code which shouldn't be required for correctness but gives warm and fuzzy feelings as well as less debugging noise. Wouldn't it be a good idea to tack such WARNINGs (in a proper and clean form) to index_open (checking the underlying relation is locked), relation_open(..., NoLock) (checking the relation has previously been locked) and maybe RelationIdGetRelation() when cassert is enabled? ISTM we frequently had bugs around this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] missing locking in at least INSERT INTO view WITH CHECK
On 2013-10-23 03:18:55 +0200, Andres Freund wrote:
> (000[1-4])

Attached.

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

From bf329af4eb6d839ae2a75c4f8a2d6867877510f4 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Wed, 23 Oct 2013 02:34:51 +0200
Subject: [PATCH 1/4] debug-lock-on-index-without-heap-lock

---
 src/backend/access/heap/heapam.c   | 10 ++++++++++
 src/backend/access/index/indexam.c | 14 ++++++++++++++
 src/backend/storage/lmgr/lmgr.c    |  9 +++++++++
 src/backend/storage/lmgr/lock.c    | 16 ++++++++++++++++
 src/backend/utils/cache/relcache.c | 10 ++++++++++
 src/include/storage/lmgr.h         |  2 ++
 src/include/storage/lock.h         |  2 ++
 7 files changed, 63 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a21f31b..3eecd5f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1203,6 +1203,16 @@ heap_open(Oid relationId, LOCKMODE lockmode)
 				 errmsg("\"%s\" is a composite type",
 						RelationGetRelationName(r))));
 
+	if (IsNormalProcessingMode() && lockmode == NoLock)
+	{
+		LOCKMODE	mode;
+
+		mode = StrongestLocalRelationLock(relationId);
+		if (mode == NoLock)
+			elog(WARNING, "no relation lock on rel %s",
+				 RelationGetRelationName(r));
+	}
+
 	return r;
 }
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b878155..10c3991 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -65,6 +65,8 @@
 
 #include "postgres.h"
 
+#include "miscadmin.h"
+
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "catalog/index.h"
@@ -142,6 +144,8 @@ static IndexScanDesc index_beginscan_internal(Relation indexRelation,
  * ----------------------------------------------------------------
  */
 
+#include "catalog/catalog.h"
+
 /* ----------------
  *		index_open - open an index relation by relation OID
  *
@@ -169,6 +173,16 @@ index_open(Oid relationId, LOCKMODE lockmode)
 				 errmsg("\"%s\" is not an index",
 						RelationGetRelationName(r))));
 
+	if (IsNormalProcessingMode())
+	{
+		LOCKMODE	mode;
+
+		mode = StrongestLocalRelationLock(r->rd_index->indrelid);
+		if (mode == NoLock)
+			elog(WARNING, "no relation lock on rel %u held while opening index %s",
+				 r->rd_index->indrelid,
+				 RelationGetRelationName(r));
+	}
 	return r;
 }
 
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index a978172..32d7bba 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -232,6 +232,15 @@ UnlockRelation(Relation relation, LOCKMODE lockmode)
 	LockRelease(&tag, lockmode, false);
 }
 
+LOCKMODE
+StrongestLocalRelationLock(Oid relid)
+{
+	LOCKTAG		tag;
+
+	SetLocktagRelationOid(&tag, relid);
+	return StrongestLocalLock(&tag);
+}
+
+
 /*
  *		LockHasWaitersRelation
  *
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index f4f32e9..2f068be 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2365,6 +2365,22 @@ LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
 	ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
 }
 
+LOCKMODE
+StrongestLocalLock(const LOCKTAG *locktag)
+{
+	HASH_SEQ_STATUS status;
+	LOCALLOCK  *locallock;
+	LOCKMODE	strongest = NoLock;
+
+	hash_seq_init(&status, LockMethodLocalHash);
+
+	while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+	{
+		if (memcmp(&locallock->tag.lock, locktag, sizeof(LOCKTAG)) == 0)
+			strongest = Max(strongest, locallock->tag.mode);
+	}
+	return strongest;
+}
+
 /*
  *		FastPathGrantRelationLock
  *			Grant lock using per-backend fast-path array, if there is space.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index b4cc6ad..f74c6af 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1577,6 +1577,16 @@ RelationIdGetRelation(Oid relationId)
 {
 	Relation	rd;
 
+	if (IsNormalProcessingMode())
+	{
+		LOCKMODE	mode;
+
+		mode = StrongestLocalRelationLock(relationId);
+		if (mode == NoLock)
+			elog(WARNING, "no relation lock for descriptor of oid %u",
+				 relationId);
+	}
+
 	/*
 	 * first try to find reldesc in the cache
 	 */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 1a8c018..c2c6026 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -40,6 +40,8 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
 
+extern LOCKMODE StrongestLocalRelationLock(Oid relid);
+
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
 extern bool
Re: [HACKERS] Compression of full-page-writes
(2013/10/22 12:52), Fujii Masao wrote: On Tue, Oct 22, 2013 at 12:47 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Mon, Oct 21, 2013 at 4:40 PM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: (2013/10/19 14:58), Amit Kapila wrote: On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: In general, my thinking is that we should prefer compression to reduce IO (WAL volume), because reducing WAL volume has other benefits as well, like sending it to subscriber nodes. I think it will help cases where, due to low network bandwidth, the disk allocated for WAL becomes full under high traffic on the master, and users then need some alternative method to handle such situations. Are you talking about archiving WAL files? One of the points I am talking about is sending data over the network to subscriber nodes for streaming replication, and another is WAL in pg_xlog. Both scenarios benefit if there is less WAL volume. It is easy to reduce the archived volume by adding a compression command to the copy command in archive_command. Okay. I think many users would like to use a method which can reduce WAL volume, and the users who don't find it useful enough in their environments, due to a decrease in TPS or no significant reduction in WAL, have the option to disable it. I favor selecting a compression algorithm for higher performance. If we need to compress WAL files more, at the cost of some performance, we can change the archive copy command to a high-compression algorithm and document how to compress archived WAL files in archive_command. Is that wrong? No, it is not wrong, but there are scenarios, as mentioned above, where less WAL volume can be beneficial. In fact, many NoSQL systems use snappy for higher performance. Okay, you can also check the results with the snappy algorithm, but don't rely completely on snappy for this patch; you might want to think of another alternative for this patch.
So, our consensus is to introduce the hooks for FPW compression so that users can freely select their own best compression algorithm? Yes, and it will also be good for future improvement. But I think WAL compression for a disaster recovery system should be done in the walsender and walreceiver processes, which is the proper architecture for a DR system. A high-compression-ratio but CPU-intensive algorithm applied to FPWs might hurt performance on the master server. If we can instead set the compression algorithm in walsender and walreceiver, performance is the same as before or better, and WAL send performance will improve. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist
On Mon, Oct 14, 2013 at 5:38 PM, Josh Kupershmidt schmi...@gmail.com wrote: Also, Pavel, this patch is still listed as 'Needs Review' in the CF app, but I haven't seen a response to the concerns in my last message. It looks like this patch has been imported into the 2013-11 CF [1] and marked Needs Review. I looked at the version of the patch pointed to in that CF entry back in July, and noted [2] several problems that still seemed to be present in the patch, for which I never saw a followup from Pavel. IMO this patch should have gotten marked Returned with Feedback pending a response from Pavel. Josh [1] https://commitfest.postgresql.org/action/patch_view?id=1174 [2] http://www.postgresql.org/message-id/CAK3UJREth9DVL5U7ewOLQYhXF7EcV5BABFE+pzPQjkPfqbW=v...@mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why the asprintf patch is still breaking the buildfarm
Noah Misch n...@leadboat.com writes: On Tue, Oct 22, 2013 at 11:00:42AM -0400, Tom Lane wrote: Yeah. As a separate matter, it might be useful to revise stringinfo.c and the asprintf code so that *if* the returned value is larger than the given buffer size, we use it as a guide to resizing, avoiding the possible need to loop multiple times to make the buffer large enough. And we could also improve our own implementation of snprintf to follow the C99 spec. The point here is that we still need to cope with pre-C99 implementations that might return -1 or the given buffer size on overflow. The NetBSD implementation doesn't do that, which is reasonable in their context, but not workable for us. I would vote for choosing the standard we want vsnprintf() to follow (probably C99) and substituting a conforming implementation wherever configure detects that libc does not conform. We'll be shipping some replacement vsnprintf() in any case; we may as well use it to insulate the rest of our code from less-preferred variants. The problem is that we can't tell whether vsnprintf is standard-conforming without a run-time test. That's bad for cross-compiled builds, and it's pretty hazardous even for normal cases, since conceivably an executable built on one machine could be used on another one with different run-time behavior. I'd be willing to take those risks if we got a significant benefit from it, but in this case I don't see much advantage to be had. The code in stringinfo/psprintf wouldn't get very much simpler if we assumed C99 behavior, and we've pretty well isolated the number of places that care to those. (I see a couple places in pg_dump that could be modified to use psprintf instead of direct vsnprintf calls; will go fix.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Sigh, my old HPUX box is totally broken by DSM patch
initdb.c quoth: * ... but the fact that a platform has shm_open * doesn't guarantee that that call will succeed when attempted. Indeed: $ initdb The files belonging to this database system will be owned by user postgres. This user must also own the server process. The database cluster will be initialized with locale C. The default database encoding has accordingly been set to SQL_ASCII. The default text search configuration will be set to english. Data page checksums are disabled. creating directory /home/postgres/testversion/data ... ok creating subdirectories ... ok selecting default max_connections ... 100 selecting default shared_buffers ... 128MB selecting dynamic shared memory implementation ... Bad system call(coredump) $ gdb shows: Core was generated by `initdb'. Program terminated with signal 12, Bad system call. (gdb) bt #0 0xc0143fb0 in ?? () from /usr/lib/libc.1 #1 0xa890 in choose_dsm_implementation () at initdb.c:1098 #2 0xab10 in test_config_settings () at initdb.c:1217 #3 0xe310 in initialize_data_directory () at initdb.c:3412 #4 0xed0c in main (argc=1, argv=0x7b03ac68) at initdb.c:3691 #5 0xc0065784 in ?? () from /usr/lib/libc.1 I'm not entirely sure what to do about this. Conceivably we could have initdb catch SIGSYS, but that seems rather ugly. Maybe configure needs a run-time test to see if shm_open will work, rather than just probing to see if such a function exists? I'm not thrilled with run-time tests in configure though. Another possibility is for initdb to execute the probe in a forked subprocess instead of risking doing it itself. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Hi All, (2013/10/22 22:26), Stephen Frost wrote: * Dimitri Fontaine (dimi...@2ndquadrant.fr) wrote: In our case, what I keep experiencing with tuning queries is that we have like 99% of them running under an acceptable threshold and 1% of them taking more and more time. This is usually described (at least where I come from) as 'rare events', which goes to Tom's point that averages, stddev, etc., are not ideal (though they are still better than nothing). Is there good reason to suppose that query runtime is Gaussian? (I'd bet not; in particular, multimodal behavior seems very likely due to things like plan changes.) If not, how much does that affect the usefulness of a standard-deviation calculation? Oscillating plan changes may fit multimodal, but I don't feel that's typical. My experience has been that it's either an extremely rare plan difference or a shift from one plan to another over time. After all, avg, min, max and stddev are only numerical inputs to a predictive model; they don't have the robustness and strictness of something like Write-Ahead Logging. It resembles a weather forecast. They are still better than nothing, but human judgment is ultimately needed to infer a cause from the numbers. By the way, we can estimate the probability of a value from the stddev, so we can easily judge whether an extreme value in min/max is normal or not. What I've been gathering from my quick chat this morning is that either you know how to characterize the distribution, and then the min, max and average are useful on their own, or you need to keep track of a histogram where all the bins are of the same size to be able to learn what the distribution actually is. A histogram would certainly be useful. We may also wish to look into outlier/rare-event detection methods and increase the logging we do in those cases (if possible). Of course, keeping enough stats seems to always begin with keeping the min, max and average, so we can just begin there. We would just be unable to answer interesting questions with just that. It would probably be good to do some research into techniques for outlier detection which minimize CPU and storage cost. pg_stat_statements is often used on operating database systems, so I don't like a high-CPU-usage implementation. Software that performs worse merely because a monitoring tool is installed is too unpleasant to accept. And if we need more detailed information for SQL tuning, it would be better to develop other useful performance tuning and monitoring contrib modules that are not meant for use on an operating database system. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compression of full-page-writes
On Wed, Oct 23, 2013 at 7:05 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: (2013/10/22 12:52), Fujii Masao wrote: On Tue, Oct 22, 2013 at 12:47 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Mon, Oct 21, 2013 at 4:40 PM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: (2013/10/19 14:58), Amit Kapila wrote: On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: In fact, many NoSQL systems use snappy for higher performance. Okay, you can also check the results with the snappy algorithm, but don't rely completely on snappy for this patch; you might want to think of another alternative for this patch. So, our consensus is to introduce the hooks for FPW compression so that users can freely select their own best compression algorithm? Yes, and it will also be good for future improvement. But I think WAL compression for a disaster recovery system should be done in the walsender and walreceiver processes, which is the proper architecture for a DR system. A high-compression-ratio but CPU-intensive algorithm applied to FPWs might hurt performance on the master server. This is true; that's why there is a discussion of a pluggable API for compression of WAL. We should try to choose the best algorithm from the available choices. Even then, I am not sure it will work the same for all kinds of loads, so the user will have the option to disable it completely as well. If we can instead set the compression algorithm in walsender and walreceiver, performance is the same as before or better, and WAL send performance will improve. Do you mean to say that the walsender should compress the data before sending and the walreceiver will then decompress it? If so, won't it add extra overhead on the standby — or do you think that because the walreceiver has to read less data from the socket, that will compensate for it? I think maybe we should consider this if the test results are good, but let's not try to do it until the current patch proves that such a mechanism is good for WAL compression.
With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add \i option to bring in the specified file as a quoted literal
On Tue, Oct 22, 2013 at 3:04 AM, Piotr Marcinczyk pmarc...@gmail.com wrote: Hi, I would like to implement an item from the TODO list marked as easy: Add \i option to bring in the specified file as a quoted literal. I understand the intent of this item: to be able to have parts of a query written in separate files (currently this is impossible, because \i tries to execute the content of the file as a separate command, via the function process_file). For the use case discussed in the mail chain of that TODO item, Robert Haas provided an alternative way to achieve it; please check the link below: http://www.postgresql.org/message-id/AANLkTi=7c8xfyf7uqw0y+si8ondkoy2nx8jc4bu0g...@mail.gmail.com If you think that alternative is not sufficient for the use case, then adding a new option/syntax is worthwhile; otherwise it would merely be a shortcut for an existing method, whose usefulness depends on how frequently users need this syntax. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
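[For reference, the existing workaround referred to above combines psql's backtick variable assignment with `:'var'` literal-quoting interpolation. A sketch of a session, roughly along those lines — the file name and contents here are invented for illustration, and this assumes a psql recent enough to support `:'variable'`:]

```psql
-- Suppose query_fragment.txt contains the text to embed as a literal
\set frag `cat query_fragment.txt`
-- :frag would interpolate the raw text; :'frag' wraps it as a
-- properly escaped single-quoted literal
SELECT :'frag' AS file_contents;
```

This covers the "file as quoted literal" case without new \i syntax, at the cost of requiring a shell command and a variable.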