Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On Tue, Feb 4, 2014 at 11:56 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Since, as I mentioned, _bt_finish_split() ultimately unlocks *and unpins*, it may not be the same buffer as before, so even with the refactoring there are race conditions. Care to elaborate? Or are you just referring to the missing buf = ? Yes, that's all I meant. Attached is a new version of the patch, with those issues fixed. btree-incomplete-split-4.patch is a complete patch against the latest fix-btree-page-deletion patch, and moveright-assign-fix.patch is just the changes to _bt_moveright, if you want to review just the changes since the previous patch I posted. Cool. I'll take a good look at it tomorrow morning PST. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD
On 2014-02-04 16:24:02 -0800, Peter Geoghegan wrote: On Mon, Feb 3, 2014 at 3:38 PM, Andres Freund and...@2ndquadrant.com wrote: A quick hack (attached) making BufferDescriptor 64byte aligned indeed restored performance across all max_connections settings. It's not surprising that a misaligned buffer descriptor causes problems - there'll be plenty of false sharing of the spinlocks otherwise. Curious that the Intel machine isn't hurt much by this. What fiddling are you thinking of? Basically always doing a TYPEALIGN(CACHELINE_SIZE, addr) before returning from ShmemAlloc() (and thereby ShmemInitStruct). There is something you have not drawn explicit attention to that is very interesting. If we take REL9_3_STABLE tip to be representative (built with full -O2 optimization, no assertions just debugging symbols), setting max_connections to 91 from 90 does not have the effect of making the BufferDescriptors array aligned; it has the effect of making it *misaligned*. You reported that 91 was much better than 90. I think that the problem actually occurs when the array *is* aligned! I don't think you can learn much from the alignment in 9.3 vs. HEAD. A lot has changed since, most prominently and recently Robert's LWLock work. That has certainly changed allocation patterns. It will also depend on some other parameters; e.g. changing max_wal_senders or max_background_workers will also change the offset. It's not that 91 is intrinsically better, it just happened to give an aligned BufferDescriptors array when the other parameters weren't changed at the same time. I suspect that the scenario described in this article accounts for the quite noticeable effect reported: http://danluu.com/3c-conflict I don't think that's applicable here. What's described there is relevant for access patterns that are a larger multiple of the cacheline size - but ours is exactly cacheline sized. 
What can happen in such scenarios is that all your accesses map to the same set of cachelines, so instead of using most of the cache, you end up using only 8 or so cachelines (8 is a common associativity for set-associative caches these days). Theoretically we could see something like that for shared_buffers itself, but I *think* our accesses are too far spread around in them for that to be a significant issue. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] CacheInvalidateRelcache in btree is a crummy idea
On 02/02/2014 11:45 PM, Tom Lane wrote: So I'm thinking my commit d2896a9ed, which introduced this mechanism, was poorly thought out and we should just remove the relcache invals as per the attached patch. Letting _bt_getroot() update the cached metapage at next use should be a lot cheaper than a full relcache rebuild for the index. Looks good to me. - Heikki
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 02/04/2014 11:58 PM, Andres Freund wrote: On February 4, 2014 10:50:10 PM CET, Peter Geoghegan p...@heroku.com wrote: On Tue, Feb 4, 2014 at 11:11 AM, Andres Freund and...@2ndquadrant.com wrote: Does this feature relate to compression of WAL page images at all? No. So the obvious question is: where, if anywhere, do the two efforts (this patch, and Fujii's patch) overlap? Does Fujii have any concerns about this patch as it relates to his effort to compress FPIs? I think there's zero overlap. They're completely complementary features. It's not like normal WAL records have an irrelevant volume. Correct. Compressing a full-page image happens on the first update after a checkpoint, and the diff between old and new tuple is not used in that case. Compressing full page images makes a difference if you're doing random updates across a large table, so that you only update each buffer 1-2 times. This patch will have no effect in that case. And when you update the same page many times between checkpoints, the full-page image is insignificant, and this patch has a big effect. - Heikki
Re: [HACKERS] GIN improvements part2: fast scan
On Wed, Feb 5, 2014 at 1:23 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Feb 3, 2014 at 6:31 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Jan 27, 2014 at 7:30 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Jan 27, 2014 at 2:32 PM, Alexander Korotkov aekorot...@gmail.com wrote: Every single change you did in fast scan seems to be reasonable, but testing shows that something went wrong. Simple test with 3 words of different selectivities. After applying your patches:

# select count(*) from fts_test where fti @@ plainto_tsquery('english', 'gin index select');
 count
───────
   627
(1 row)
Time: 21,252 ms

In original fast-scan:

# select count(*) from fts_test where fti @@ plainto_tsquery('english', 'gin index select');
 count
───────
   627
(1 row)
Time: 3,382 ms

I'm trying to get deeper into it. I had two guesses about why it became so much slower than my original fast-scan: 1) Not using the native consistent function 2) Not sorting entries. I attach two patches which roll back these two features (sorry for the awful quality of the second). The native consistent function accelerates things significantly, as expected. It seems that sorting entries has almost no effect. However it's still not as fast as the initial fast-scan:

# select count(*) from fts_test where fti @@ plainto_tsquery('english', 'gin index select');
 count
───────
   627
(1 row)
Time: 5,381 ms

Tomas, could you rerun your tests with the first and with both of these patches applied against the patches by Heikki? I found my patch 0005-Ternary-consistent-implementation.patch to be completely wrong. It introduces the ternary consistent function to the opclass, but doesn't use it, because I forgot to include the ginlogic.c change in the patch. So, it shouldn't make any impact on performance. However, the testing results with that patch differ significantly. That makes me very uneasy. Can we now reproduce the exact same results? The right version of these two patches in one, against current head, is attached. 
I've rerun tests with it, results are /mnt/sas-raid10/gin-testing/queries/9.4-fast-scan-10. Could you rerun postprocessing including graph drawing? Sometimes test cases are not what we expect. For example:

=# explain SELECT id FROM messages WHERE body_tsvector @@ to_tsquery('english','(5alpha1-initdb''d)');
                                              QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan on messages  (cost=84.00..88.01 rows=1 width=4)
   Recheck Cond: (body_tsvector @@ '''5alpha1-initdb'' ''5alpha1'' ''initdb'' ''d'''::tsquery)
   ->  Bitmap Index Scan on messages_body_tsvector_idx  (cost=0.00..84.00 rows=1 width=0)
         Index Cond: (body_tsvector @@ '''5alpha1-initdb'' ''5alpha1'' ''initdb'' ''d'''::tsquery)
 Planning time: 0.257 ms
(5 rows)

5alpha1-initdb'd is 3 gin entries with different frequencies. Also, these patches are not intended to change relevance ordering speed. When the number of results is high, most of the time goes to relevance calculation and sorting. I propose to remove the ORDER BY clause from the test cases to see scan speed more clearly. I've got a dump of postgresql.org search queries from Magnus. We can add them to our test case. It turns out that version 10 contained a bug in the ternary consistent function implementation for tsvector. Fixed in attached version. The attached patch is a light version of fast scan. It does extra consistent function calls only on startScanKey, no extra calls during the scan of the index. It finds a subset of the rarest entries whose absence guarantees a false consistent function result. I've run real-life tests based on postgresql.org logs (33318 queries). Here is a table with the summary time of running the whole test case. 
=# select method, sum(time) from test_result group by method order by method;
       method        |     sum
─────────────────────+──────────────
 fast-scan-11        | 126916.11199
 fast-scan-light     | 137321.211
 heikki              | 466751.693
 heikki-true-ternary | 454113.62397
 master              | 446976.288
(6 rows)

where 'heikki' is gin-ternary-logic binary-heap preconsistent-only-on-new-page.patch and 'heikki-true-ternary' is the version with my catalog changes promoting the ternary consistent function to the opclass. We can see fast-scan-light gives almost the same performance benefit on these queries. However, I tested only the CPU effect, not the IO effect. Any thoughts? -- With best regards, Alexander Korotkov. gin-fast-scan-light.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] Fix picksplit with nan values
On Sat, Feb 1, 2014 at 7:50 AM, Bruce Momjian br...@momjian.us wrote: Where are we on this? I found that I have an empty draft letter from November with a new version of the patch attached. I'll return here when we have some solution to the gin fast scan challenge. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 02/05/2014 07:54 AM, Amit Kapila wrote: On Tue, Feb 4, 2014 at 11:58 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila amit.kapil...@gmail.com wrote: Now there is approximately 1.4~5% CPU gain for hundred tiny fields, half nulled case Assuming that the logic isn't buggy, a point in need of further study, I'm starting to feel like we want to have this. And I might even be tempted to remove the table-level off switch. I have tried to stress on worst case more, as you are thinking to remove table-level switch and found that even if we increase the data by approx. 8 times (ten long fields, all changed, each field contains 80 byte data), the CPU overhead is still 5% which clearly shows that the overhead doesn't increase much even if the length of unmatched data is increased by much larger factor. So the data for worst case adds more weight to your statement (remove table-level switch), however there is no harm in keeping table-level option with default as 'true' and if some users are really sure the updates in their system will have nothing in common, then they can make this new option as 'false'. Below is data for the new case ten long fields, all changed added in attached script file: That's not the worst case, by far. First, note that the skipping while scanning new tuple is only performed in the first loop. That means that as soon as you have a single match, you fall back to hashing every byte. So for the worst case, put one 4-byte field as the first column, and don't update it. Also, I suspect the runtimes in your test were dominated by I/O. When I scale down the number of rows involved so that the whole test fits in RAM, I get much bigger differences with and without the patch. You might also want to turn off full_page_writes, to make the effect clear with less data. So, I came up with the attached worst case test, modified from your latest test suite. 
unpatched:

               testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
 ten long fields, all but one changed |     343385312 | 2.20806908607483
 ten long fields, all but one changed |     336263592 | 2.18997097015381
 ten long fields, all but one changed |     336264504 | 2.17843413352966
(3 rows)

pgrb_delta_encoding_v8.patch:

               testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
 ten long fields, all but one changed |     338356944 | 3.33501315116882
 ten long fields, all but one changed |     344059272 | 3.37364101409912
 ten long fields, all but one changed |     336257840 | 3.36244201660156
(3 rows)

So with this test, the overhead is very significant. With the skipping logic, another kind of worst case is that you have a lot of similarity between the old and new tuple, but you miss it because you skip. For example, if you change the first few columns, but leave a large text column at the end of the tuple unchanged. - Heikki wal-update-testsuite.sh Description: Bourne shell script
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 01/30/2014 08:53 AM, Amit Kapila wrote: On Wed, Jan 29, 2014 at 8:13 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/29/2014 02:21 PM, Amit Kapila wrote: The main reason to process in chunks as much as possible is to save cpu cycles. For example if we build hash table byte-by-byte, then even for best case where most of tuple has a match, it will have reasonable overhead due to formation of hash table. Hmm. One very simple optimization we could do is to just compare the two strings byte by byte, before doing anything else, to find any common prefix they might have. Then output a tag for the common prefix, and run the normal algorithm on the rest of the strings. In many real-world tables, the 1-2 first columns are a key that never changes, so that might work pretty well in practice. Maybe it would also be worthwhile to do the same for any common suffix the tuples might have. Is it possible to do for both prefix and suffix together, basically the question I have in mind is what will be deciding factor for switching from hash table mechanism to string comparison mode for suffix. Do we switch when we find long enough match? I think you got it backwards. You don't switch from hash table mechanism to string comparison. You do the prefix/suffix comparison *first*, and run the hash table algorithm only on the middle part, between the common prefix and suffix. Can we do this optimization after the basic version is acceptable? I would actually suggest doing that first. Perhaps even ditch the whole history table approach and do *only* the scan for prefix and suffix. That's very cheap, and already covers a large fraction of UPDATEs that real applications do. In particular, it's optimal for the case that you update only a single column, something like UPDATE foo SET bar = bar + 1. I'm pretty sure the overhead of that would be negligible, so we could always enable it. 
There are certainly a lot of scenarios where prefix/suffix detection alone wouldn't help, but so what. Attached is a quick patch for that, if you want to test it. - Heikki

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index e0b8a4e..c4ac2bd 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1014,6 +1014,22 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
    </listitem>
   </varlistentry>
 
+   <varlistentry>
+    <term><literal>wal_compress_update</> (<type>boolean</>)</term>
+    <listitem>
+     <para>
+      Enables or disables the WAL tuple compression for <command>UPDATE</>
+      on this table. Default value of this option is false to maintain
+      backward compatibility for the command. If true, all the update
+      operations on this table which will place the new tuple on the same page
+      as its original tuple will compress the WAL for the new tuple and
+      subsequently reduce the WAL volume. It is recommended to enable
+      this option for tables where <command>UPDATE</> changes less than
+      50 percent of tuple data.
+     </para>
+    </listitem>
+   </varlistentry>
+
  </variablelist>
 </refsect2>

diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index aea9d40..3bf5728 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,6 +60,7 @@
 #include "access/sysattr.h"
 #include "access/tuptoaster.h"
 #include "executor/tuptable.h"
+#include "utils/pg_rbcompress.h"
 
 /* Does att's datatype allow packing into the 1-byte-header varlena format? */
@@ -617,6 +618,44 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
 	memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
 }
 
+/*
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples and generate
+ * encoded wal tuple (EWT), using pgrb. The result is stored
+ * in *encdata.
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+				  char *encdata, uint32 *enclen)
+{
+	return pgrb_delta_encode(
+		(char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+		newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+		(char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+		oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+		encdata, enclen, NULL
+	);
+}
+
+/*
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+	pgrb_delta_decode(encdata, enclen,
+		(char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+		MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+		newtup->t_len,
+		(char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+		oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
 /*
  * heap_form_tuple
  *		construct a tuple from the
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 5:29 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/30/2014 08:53 AM, Amit Kapila wrote: Is it possible to do for both prefix and suffix together, basically the question I have in mind is what will be deciding factor for switching from hash table mechanism to string comparison mode for suffix. Do we switch when we find long enough match? I think you got it backwards. You don't switch from hash table mechanism to string comparison. You do the prefix/suffix comparison *first*, and run the hash table algorithm only on the middle part, between the common prefix and suffix. Can we do this optimization after the basic version is acceptable? I would actually suggest doing that first. Perhaps even ditch the whole history table approach and do *only* the scan for prefix and suffix. That's very cheap, and already covers a large fraction of UPDATEs that real applications do. In particular, it's optimal for the case that you update only a single column, something like UPDATE foo SET bar = bar + 1. I'm pretty sure the overhead of that would be negligible, so we could always enable it. There are certainly a lot of scenarios where prefix/suffix detection alone wouldn't help, but so what. Attached is a quick patch for that, if you want to test it. I have done one test where there is a large suffix match, but not large enough that it can compress more than 75% of the string; the CPU overhead with wal-update-prefix-suffix-encode-1.patch is not much, but there is no I/O reduction either. However, for the same case there is both significant WAL reduction and CPU gain with pgrb_delta_encoding_v8.patch. I have updated ten long fields, all changed such that there is a large suffix match. The updated script is attached. 
Unpatched

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |    1760986528 | 28.3700430393219
 ten long fields, all changed |    1760981320 | 28.53244805336
 ten long fields, all changed |    1764294992 | 28.6722140312195
(3 rows)

wal-update-prefix-suffix-encode-1.patch

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |    1760986016 | 29.4183659553528
 ten long fields, all changed |    1760981904 | 29.7636449337006
 ten long fields, all changed |    1762436104 | 29.508908033371
(3 rows)

pgrb_delta_encoding_v8.patch

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |     733969304 | 23.916286945343
 ten long fields, all changed |     733977040 | 23.6019561290741
 ten long fields, all changed |     737384632 | 24.2645490169525

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com wal-update-testsuite.sh Description: Bourne shell script
Re: [HACKERS] CacheInvalidateRelcache in btree is a crummy idea
Heikki Linnakangas hlinnakan...@vmware.com writes: On 02/02/2014 11:45 PM, Tom Lane wrote: So I'm thinking my commit d2896a9ed, which introduced this mechanism, was poorly thought out and we should just remove the relcache invals as per the attached patch. Letting _bt_getroot() update the cached metapage at next use should be a lot cheaper than a full relcache rebuild for the index. Looks good to me. I did think of one possible objection to this idea. Although _bt_getroot() checks that the page it arrives at is a usable fast root, it could still fail if the page number is off the end of the relation; that is, we have a cache entry that predates an index truncation. Currently, the only way for a btree to get physically shorter is a REINDEX, which implies a relfilenode change and hence a (transactional) relcache flush, so this couldn't happen. And if we tried to allow VACUUM to truncate an index without a full lock, we'd probably have the same type of issue for any concurrent process that's following a stale cross-page link, not just a link to the root. So I'm not particularly impressed by this objection, but it could be made. If we ever did need to make that work, a possible solution would be to refactor things so that the metapage cache lives at the smgr level not the relcache level, where it'd get blown away by a cross-backend smgr inval (CacheInvalidateSmgr) --- which is nontransactional, thereby fixing the basic problem with the way it's being done now. There would still be some issues about locking, but at least the cache per se wouldn't be a hazard anymore. Since I'm not aware of any plans to make on-the-fly btree truncation work, I won't bother to implement that now, but I thought I'd mention it for the archives. regards, tom lane
Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD
Andres Freund and...@2ndquadrant.com writes: On 2014-02-04 16:24:02 -0800, Peter Geoghegan wrote: I suspect that the scenario described in this article accounts for the quite noticeable effect reported: http://danluu.com/3c-conflict I don't think that's applicable here. Maybe, or maybe not, but I think it does say that we should be very wary of proposals to force data structure alignment without any testing of the consequences. regards, tom lane
Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD
On 2014-02-05 09:57:11 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-02-04 16:24:02 -0800, Peter Geoghegan wrote: I suspect that the scenario described in this article accounts for the quite noticeable effect reported: http://danluu.com/3c-conflict I don't think that's applicable here. Maybe, or maybe not, but I think it does say that we should be very wary of proposals to force data structure alignment without any testing of the consequences. I agree it needs testing, but what the page is talking about really, really doesn't have *anything* to do with this. What the author is talking about is storing things *page* aligned (i.e. at 4k or a multiple). The problem is that the beginning of each such page falls into the same cacheline set and thus doesn't utilize the entire L1/L2/L3 but only the single set they map into. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD
On Wed, Feb 5, 2014 at 3:57 PM, Tom Lane t...@sss.pgh.pa.us wrote: Maybe, or maybe not, but I think it does say that we should be very wary of proposals to force data structure alignment without any testing of the consequences. Sure. See for instance http://igoro.com/archive/gallery-of-processor-cache-effects/ From what I understand, what Andres is suggesting is ensuring that each BufferDescriptor is sitting entirely in one cache line. That seems very unlikely to be worse than having a BufferDescriptor spanning two cache lines and being on the same cache line as the adjacent BufferDescriptors. But this all depends on knowing the length of the cache lines. I see a lot of confusion online over whether cache lines are 64 bytes, 128 bytes, or other lengths even just on Intel architectures, let alone others. I wonder if there are any generic tools to optimize array/structure layouts based on cachegrind profiling or something like that. Then we wouldn't need to know the oddities ourselves and optimize manually. We could maybe even do it on the build farm and select the right profile at build time by matching build target information. -- greg
[HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD
On 2014-02-05 16:14:01 +0100, Greg Stark wrote: I see a lot of confusion online over whether cache lines are 64 bytes, 128 bytes, or other lengths even just on Intel architectures, let alone others. All current x86 processors use 64. But even if it were bigger/smaller, it would be either 32 or 128. Neither benefits from touching more cachelines than necessary; e.g. in the 128 case, we could still touch two with the current code. The effects referred to upthread only affect code with larger multiples of the cacheline size, not what we have here. I wonder if there are any generic tools to optimize array/structure layouts based on cachegrind profiling or something like that. Then we wouldn't need to know the oddities ourselves and optimize manually. We could maybe even do it on the build farm and select the right profile at build time by matching build target information. There are profiling tools (e.g. perf's -e stalled-cycles-(frontend|backend)), but I don't think there's more than that. And I think somebody already thought about it (cf. ALIGNOF_BUFFER), it just wasn't updated in the last 10 years. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 5:13 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/05/2014 07:54 AM, Amit Kapila wrote: On Tue, Feb 4, 2014 at 11:58 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila amit.kapil...@gmail.com wrote: Now there is approximately 1.4~5% CPU gain for hundred tiny fields, half nulled case Assuming that the logic isn't buggy, a point in need of further study, I'm starting to feel like we want to have this. And I might even be tempted to remove the table-level off switch. I have tried to stress on worst case more, as you are thinking to remove table-level switch and found that even if we increase the data by approx. 8 times (ten long fields, all changed, each field contains 80 byte data), the CPU overhead is still 5% which clearly shows that the overhead doesn't increase much even if the length of unmatched data is increased by much larger factor. So the data for worst case adds more weight to your statement (remove table-level switch), however there is no harm in keeping table-level option with default as 'true' and if some users are really sure the updates in their system will have nothing in common, then they can make this new option as 'false'. Below is data for the new case ten long fields, all changed added in attached script file: That's not the worst case, by far. First, note that the skipping while scanning new tuple is only performed in the first loop. That means that as soon as you have a single match, you fall back to hashing every byte. So for the worst case, put one 4-byte field as the first column, and don't update it. Also, I suspect the runtimes in your test were dominated by I/O. When I scale down the number of rows involved so that the whole test fits in RAM, I get much bigger differences with and without the patch. You might also want to turn off full_page_writes, to make the effect clear with less data. So with this test, the overhead is very significant. 
With the skipping logic, another kind of worst case is that you have a lot of similarity between the old and new tuple, but you miss it because you skip. This is exactly the reason why I have not kept the skipping logic in the second pass (loop), but I think maybe it would have been better to keep it, just not as aggressive as in the first pass. The basic idea I had in mind is that if we get a match, then there is a high chance that we get matches in consecutive positions. I think we should see this patch as an I/O reduction feature rather than as CPU gain/overhead, because the I/O reduction in WAL has some other benefits, like transfer for replication, archiving, recovery - basically wherever there is a disk read operation, the I/O reduction will amount to less data read, which can be beneficial in many ways. Some time back, I was reading an article on the benefits of compression in databases where the benefits are shown something like what I said above (at least that is what I understood from it). Link to that article: http://db2guys.wordpress.com/2013/08/23/compression/ Another thing is that I think it might be difficult to get negligible overhead for data which is barely compressible or non-compressible; that's why it is preferred to have compression for a table enabled with a switch. Is it viable to see here what is the best way to get I/O reduction for most cases, and provide a switch so that for worst cases users can turn it off? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 02/05/2014 04:48 PM, Amit Kapila wrote: I have done one test where there is a large suffix match, but not large enough that it can compress more than 75% of string, the CPU overhead with wal-update-prefix-suffix-encode-1.patch is not much, but there is no I/O reduction as well. Hmm, it's supposed to compress if you save at least 25%, not 75%. Apparently I got that backwards in the patch... - Heikki
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 8:50 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/05/2014 04:48 PM, Amit Kapila wrote: I have done one test where there is a large suffix match, but not large enough that it can compress more than 75% of the string; the CPU overhead with wal-update-prefix-suffix-encode-1.patch is not much, but there is no I/O reduction either. Hmm, it's supposed to compress if you save at least 25%, not 75%. Apparently I got that backwards in the patch... Okay, I think that is right; maybe I can change that check to see the difference. But in general, isn't it going to lose compression in many more cases, e.g. if there is less than a 25% match in the prefix/suffix, but more than a 50% match in the middle of the string? While debugging, I noticed that it compresses less than the history table approach in general cases where the update is done internally, like for Truncate table. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
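[Editor's note: the decision being debated here can be sketched standalone. This is a hedged illustration, not the actual wal-update-prefix-suffix-encode patch code; the function name and the exact "save at least 25% of the new tuple" test are assumptions for illustration.]

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch: find the common prefix and suffix of the old and
 * new tuple data, and report whether encoding them as copy references
 * would save at least 25% of the new tuple's length.
 */
static bool
prefix_suffix_worthwhile(const char *oldp, size_t oldlen,
                         const char *newp, size_t newlen)
{
    size_t      prefix = 0;
    size_t      suffix = 0;
    size_t      maxsuffix;

    while (prefix < oldlen && prefix < newlen &&
           oldp[prefix] == newp[prefix])
        prefix++;

    /* make sure the suffix doesn't overlap the prefix */
    maxsuffix = ((oldlen < newlen) ? oldlen : newlen) - prefix;
    while (suffix < maxsuffix &&
           oldp[oldlen - 1 - suffix] == newp[newlen - 1 - suffix])
        suffix++;

    /* compress only if the matched bytes save at least 25% of newlen */
    return (prefix + suffix) >= newlen / 4;
}
```

Whatever the threshold is set to, a match in the middle of the string is invisible to this scheme; only a history-table (LZ-style) approach can find it, which is the trade-off Amit is pointing at.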
Re: [HACKERS] jsonb and nested hstore
+static void +recvJsonbValue(StringInfo buf, JsonbValue *v, uint32 level, int c) + v->size = sizeof(JEntry) * 2 + VARSIZE_ANY(v->numeric); What's the *2 here? Reservation for alignment. v->size is allowed to be greater than what is actually needed. Fixed. This function and recvJsonbValue call each other recursively, afaics without any limit; shouldn't they check for the stack depth? Added a check_stack_depth(). *3? JEntry + header + reservation for alignment. + v->hash.pairs = palloc(sizeof(*v->hash.pairs) * v->hash.npairs); + if (v->hash.npairs > (buf->len - buf->cursor) / (2 * sizeof(uint32))) ereport(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE) 2 * sizeof(uint32) is the minimal size of an object element (a key plus its value). Shouldn't that be an ereport(ERRCODE_DATATYPE_MISMATCH)? Similar in a few other places. Fixed. +char * +JsonbToCString(StringInfo out, char *in, int estimated_len) Such a behaviour certainly deserves a documentary comment. Generally some more functions could use that. Added a comment. + while ((type = JsonbIteratorGet(it, v, false)) != 0) +reout: + goto reout; Hrmpf. :) Commented. +Datum +jsonb_typeof(PG_FUNCTION_ARGS) +{ ... +} Hm, shouldn't that be in jsonfuncs.c? No idea, I don't have an objection. send/recv for hstore is fixed too. Should I make a new version of the patch? Right now it's placed on github. Maybe Andrew wants to change something? -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
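[Editor's note: the unbounded mutual recursion flagged above is the classic recv-function stack hazard, since hostile binary input can nest containers arbitrarily deep. A minimal standalone sketch of the guard follows; the bracket grammar, function name, and fixed depth cap are all hypothetical, whereas the real fix calls PostgreSQL's check_stack_depth(), which measures actual stack consumption and raises an error on overflow.]

```c
/*
 * Hypothetical recursive-descent reader: '[' ... ']' nests a container,
 * any other byte is a scalar.  Returns bytes consumed, or -1 on error.
 */
#define MAX_RECURSE_DEPTH 8     /* assumed cap; stands in for check_stack_depth() */

static long
recv_value(const char *buf, long len, int depth)
{
    long        pos;

    if (depth > MAX_RECURSE_DEPTH)
        return -1;              /* real code would ereport(ERROR, ...) */

    if (len > 0 && buf[0] == '[')
    {
        long        sub;

        pos = 1;                /* consume '[' */
        sub = recv_value(buf + pos, len - pos, depth + 1);
        if (sub < 0)
            return -1;
        pos += sub;
        if (pos >= len || buf[pos] != ']')
            return -1;
        return pos + 1;         /* consume ']' */
    }
    return (len > 0) ? 1 : -1;  /* scalar: one byte */
}
```

Without the depth check, input consisting of nothing but opening brackets drives the recursion until the process stack overflows.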
Re: [HACKERS] jsonb and nested hstore
On Wed, Feb 5, 2014 at 12:44 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: send/recv functions are also needed for binary-format COPY. IMHO jsonb must have send/recv functions. All other built-in types have them, except for types like 'smgr', 'aclitem' and 'any*' that no-one should be using as column types. Yes -- completely agree. I also consider the hstore functionality (in particular, searching and access operators) to be essential functionality. I'm actually surprised we have an alternate binary wire format for jsonb at all; json is explicitly text and I'm not sure what the use case of sending the internal structure is. Meaning, maybe jsonb send/recv should be a thin wrapper to sending the json string. The hstore send/recv, I think, properly covers the case where client-side binary wire format actors would want to manage performance-critical cases that need to avoid parsing. On Wed, Feb 5, 2014 at 1:21 AM, Oleg Bartunov obartu...@gmail.com wrote: Andrew provided us more information and we'll work on recv. What do people think about testing this stuff? btw, we don't have any regression tests on this. I'm intensely interested in this work; I consider it to be transformative. I've *lightly* tested the jsonb/hstore functionality and so far everything is working. I still have concerns about the API. Aside from the stuff I mentioned upthread, I find the API split between jsonb and hstore to be a little odd; a lot of useful bits (for example, the @ operator) come via the hstore type only. So these types are joined at the hip for real work, which makes the diverging incomplete behaviors in functions like populate_record() disconcerting. Another point I'm struggling with is what jsonb brings to the table that isn't covered by either hstore or json; working through a couple of cases, I find myself not using the jsonb functionality except as an 'hstore json formatter', which the json type covers. 
I'm probably being obtuse, but we have to be cautious before plonking a couple of dozen extra functions in the public schema. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb and nested hstore
On 02/05/2014 10:48 AM, Merlin Moncure wrote: On Wed, Feb 5, 2014 at 12:44 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: send/recv functions are also needed for binary-format COPY. IMHO jsonb must have send/recv functions. All other built-in types have them, except for types like 'smgr', 'aclitem' and 'any*' that no-one should be using as column types. Yes -- completely agree. I also consider the hstore functionality (in particular, searching and access operators) to be essential functionality. I'm actually surprised we have an alternate binary wire format for jsonb at all; json is explicitly text and I'm not sure what the use case of sending the internal structure is. Meaning, maybe jsonb send/recv should be a thin wrapper to sending the json string. The hstore send/recv I think properly covers the case where client side binary wire format actors would want to manage performance critical cases that want to avoid parsing. The whole reason we have jsonb is to avoid reparsing where possible. In fact, I'd rather have the send and recv functions in the jsonb code and have hstore's functions call them, so we don't duplicate code. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb and nested hstore
On 02/05/2014 10:36 AM, Teodor Sigaev wrote: +Datum +jsonb_typeof(PG_FUNCTION_ARGS) +{ ... +} Hm, shouldn't that be in jsonfuncs.c? No idea, i don't have an objection No it shouldn't. The json equivalent function is in json.c, and needs to be because it uses the parser internals that aren't exposed outside that code. send/recv for hstore is fixed too. Should I make new version of patch? Right now it's placed on github. May be Andrew wants to change something? I'll take a look, but I think we need to unify this so we use one set of send/recv code for the two types if possible, as I just said to Merlin. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD
Andres Freund and...@2ndquadrant.com writes: And I think somebody already thought about it (c.f. ALIGNOF_BUFFER), it just wasn't updated in the last 10 years. No, ALIGNOF_BUFFER is there because we read something that said that I/O transfers between userspace and kernel disk cache would be faster with aligned buffers. There's been no particular thought given to alignment of other data structures, AFAIR. It may well be that your proposal is spot on. But I'd like to see some data-structure-by-data-structure measurements, rather than assuming that alignment must be a good thing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb and nested hstore
On Wed, Feb 5, 2014 at 10:22 AM, Andrew Dunstan and...@dunslane.net wrote: I'm actually surprised we have an alternate binary wire format for jsonb at all; json is explicitly text and I'm not sure what the use case of sending the internal structure is. Meaning, maybe jsonb send/recv should be a thin wrapper to sending the json string. The hstore send/recv I think properly covers the case where client side binary wire format actors would want to manage performance critical cases that want to avoid parsing. The whole reason we have jsonb is to avoid reparsing where possible Sure; but on the server side. The wire format is for handling client concerns. For example, the case you're arguing for would be for a libpq client to extract jsonb as binary, manipulate it on a binary level, then send it back as binary. I find this case to be something of a stretch. That being said, for binary dump/restore perhaps there's a performance case to be made. In fact, I'd rather have the send and recv functions in the jsonb code and have hstore's functions call them, so we don't duplicate code. Yeah. I agree that there need to be two sets of routines, not three. I think a case could be made that the jsonb type could take either json's or hstore's, depending on the above. FWIW, either way is fine. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD
On 2014-02-05 11:23:29 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: And I think somebody already thought about it (c.f. ALIGNOF_BUFFER), it just wasn't updated in the last 10 years. No, ALIGNOF_BUFFER is there because we read something that said that I/O transfers between userspace and kernel disk cache would be faster with aligned buffers. There's been no particular thought given to alignment of other data structures, AFAIR. But it's not aligned anymore on at least half a decade's worth of hardware, and it's what we already align *all* bigger ShmemAlloc() values with. And BufferDescriptors surely counts as larger in its entirety. It may well be that your proposal is spot on. But I'd like to see some data-structure-by-data-structure measurements, rather than assuming that alignment must be a good thing. I am fine with just aligning BufferDescriptors properly. That has clearly shown massive improvements. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb and nested hstore
Merlin Moncure mmonc...@gmail.com writes: On Wed, Feb 5, 2014 at 10:22 AM, Andrew Dunstan and...@dunslane.net wrote: The whole reason we have jsonb is to avoid reparsing where possible Sure; but on the server side. The wire format is for handling client concerns. For example, the case you're arguing for would be for a libpq client to extract jsonb as binary, manipulate it on a binary level, then send it back as binary. I find this case to be something of a stretch. I'm with Merlin in thinking that the case for exposing a binary format to clients is pretty weak, or at least a convincing use-case has not been shown. Given the concerns upthread about security hazards in the patch's existing recv code, and the fact that it's already February, switching to "binary is the same as text" may well be the most prudent path here. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb and nested hstore
On 02/05/2014 11:40 AM, Tom Lane wrote: Merlin Moncure mmonc...@gmail.com writes: On Wed, Feb 5, 2014 at 10:22 AM, Andrew Dunstan and...@dunslane.net wrote: The whole reason we have jsonb is to avoid reparsing where possible Sure; but on the server side. The wire format is for handling client concerns. For example, the case you're arguing for would be for a libpq client to extract jsonb as binary, manipulate it on a binary level, then send it back as binary. I find this case to be something of a stretch. I'm with Merlin in thinking that the case for exposing a binary format to clients is pretty weak, or at least a convincing use-case has not been shown. Given the concerns upthread about security hazards in the patch's existing recv code, and the fact that it's already February, switching to "binary is the same as text" may well be the most prudent path here. If we do that we're going to have to live with that forever, aren't we? I don't see why there should be a convincing case for binary format for nested hstore but not for jsonb. If it were only for arbitrary libpq clients, I wouldn't bother so much. To me the main case for binary format is that some people use COPY BINARY for efficiency reasons, and I heard tell recently of someone working on that as an option for pg_dump, which seems to me worth considering. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: show xid and xmin in pg_stat_activity and pg_stat_replication
On Mon, Feb 3, 2014 at 6:47 AM, Christian Kruse christ...@2ndquadrant.com wrote: [ new patch ] Is there some compelling reason not to write the documentation link as <xref linkend="guc-hot-standby-feedback"> rather than using <link>? It feels weird to me that the new columns are called transactionid and xmin. Why not xid and xmin? If I understand correctly, modifying PgBackendStatus adds additional fields to the shared memory data structure that are never used and will be returned by functions like pgstat_fetch_stat_beentry() uninitialized. That seems both inefficient and a pitfall for the unwary. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb and nested hstore
Andrew Dunstan and...@dunslane.net writes: On 02/05/2014 11:40 AM, Tom Lane wrote: switching to "binary is the same as text" may well be the most prudent path here. If we do that we're going to have to live with that forever, aren't we? Yeah, but the other side of that coin is that we'll have to live forever with whatever binary format we pick, too. If it turns out to be badly designed, that could be much worse than eating some parsing costs during dump/restore. If we had infinite time/manpower, this wouldn't really be an issue. We don't, though, and so I suggest that this may be one of the better things to toss overboard. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Prohibit row-security + inheritance in 9.4?
On Mon, Feb 3, 2014 at 10:23 PM, Craig Ringer cr...@2ndquadrant.com wrote: Then during expansion of the range table, no code is needed to ignore child rls quals and copy parent rels to child rels. This is what's already implemented and isn't a huge amount of code to begin with, so I don't see this as being an argument against having the flexibility. It would seem to me that any additional logic that can be avoided during planning is a good thing. Also, the argument that something is already implemented does not itself make it good to commit. The implementation with the minimum of required logic and complexity is: apply the parent's row-security quals to the children; don't check children for quals. Any special handling of child rels creates exciting challenges because inheritance expansion happens _after_ a bunch of the query planner and optimizer has run. Injecting new expression trees at this point is messy, especially if you want those expression trees to in turn contain row-security qualified tables, inheritance, etc. As previously discussed, applying the parent qual to children ensures that what's visible when querying a relation that has children is consistent with its row-security qual. If the parent qual is applied only on the parent rel directly, not children, then querying the parent could emit rows not permitted by the parent's row-security qual. I'm not happy with that, and as Stephen pointed out upthread, it isn't really consistent with how permissions work with inheritance otherwise. If instead the parent qual is applied to the parent and all children, and you *also* add the quals of each child, you get a complex, hard to debug, hard to optimize mess. You also run back into the problem mentioned above, that adding quals after inheritance expansion is messy and problematic. It's also really ugly in what's going to be the most common case, where the child quals are the same as the parent quals, as you'll get nested identical quals. 
Despite the challenges with it, I think that's the least insane way to respect child quals on parents. It has pretty much zero chance of happening in 9.4, though; the discussed approach of building row-security on top of updatable security barrier views doesn't play well with adding inheritance on inheritance-expanded children. One answer for that would be to keep inheritance as it is for 9.4 (if I can get the remaining issues sorted out) and in 9.5, if possible, allow the addition of a row-security qual that, if it appears on a child rel during inheritance expansion, _is_ expanded. At least, if it proves necessary, which I'm not entirely convinced of. Me, neither. When you first described the scheme you're currently pursuing, I thought it sounded kind of nuts. But as I've thought about it more, it's grown on me. Today, the way to do row-level security is: 1. Create a security barrier view over the table. 2. Grant rights to the view instead of the table, and tell people to go through that. Now, if you did that, as far as I can see, it would have exactly the same semantics as what you're proposing to do here, with the sole exception that you wouldn't need to tell people to access the view rather than the table. So, yeah, it's kind of funky, but I have a feeling any rule we come up with here will seem odd in some scenarios, and this one at least has the virtue of being relatively easy to implement and consistent with how similar things work today. I can't knock that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb and nested hstore
On 02/05/2014 12:48 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 02/05/2014 11:40 AM, Tom Lane wrote: switching to "binary is the same as text" may well be the most prudent path here. If we do that we're going to have to live with that forever, aren't we? Yeah, but the other side of that coin is that we'll have to live forever with whatever binary format we pick, too. If it turns out to be badly designed, that could be much worse than eating some parsing costs during dump/restore. If we had infinite time/manpower, this wouldn't really be an issue. We don't, though, and so I suggest that this may be one of the better things to toss overboard. The main reason I'm prepared to consider this is that the JSON parser seems to be fairly efficient (see Oleg's recent stats), and in fact we'd more or less be parsing the binary format on input anyway, so there's no proof that a binary format is going to be hugely faster (or possibly even that it will be faster at all). If anyone else has opinions on this, sing out pretty darn soon (like in the next 24 hours or so, before I begin). cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD
On Wed, Feb 5, 2014 at 7:21 AM, Andres Freund and...@2ndquadrant.com wrote: All current x86 processors use 64. But even if it were bigger/smaller, they will be either 32, or 128. Neither benefits from touching more cachelines than necessary. E.g. in the 128 case, we could still touch two with the current code. That's true, but I believe that a hardware optimization called Adjacent Cache Line Prefetch is widely supported by recent microarchitectures, and is sometimes enabled by default. I'm not suggesting that that would necessarily radically alter the outcome, but it is a consideration. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb and nested hstore
On Wed, Feb 5, 2014 at 11:48 AM, Tom Lane t...@sss.pgh.pa.us wrote: If we had infinite time/manpower, this wouldn't really be an issue. We don't, though, and so I suggest that this may be one of the better things to toss overboard. The hstore send/recv functions have basically the same (copy/pasted/name adjusted) implementation. Since hstore will presumably remain (as the current hstore is) 'deep binary' and all of Andres's gripes apply to the hstore as well, this change buys us precisely zap from a time perspective; it comes down to which is intrinsically the better choice. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add CREATE support to event triggers
On Tue, Feb 4, 2014 at 12:11 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I have run into some issues, though: 1. certain types, particularly timestamp/timestamptz but really this could happen for any type, have unusual typmod output behavior. For those one cannot just use the schema-qualified catalog names and then append the typmod at the end; because what you end up with is something like pg_catalog.timestamptz(4) with time zone because, for whatever reason, the "with time zone" is part of the typmod output. But this doesn't work at all for input. I'm not sure how to solve this. How about doing whatever pg_dump does? 2. I have been emitting object definitions complete; in particular, sequences have OWNED BY clauses when they have an owning column. But this doesn't work with a SERIAL column, because we get output like this: alvherre=# CREATE TABLE public.hijo (b serial); NOTICE: expanded: CREATE SEQUENCE public.hijo_b_seq INCREMENT BY 1 MINVALUE 1 MAXVALUE 9223372036854775807 START WITH 1 CACHE 1 NO CYCLE OWNED BY public.hijo.b NOTICE: expanded: CREATE TABLE public.hijo (b pg_catalog.int4 DEFAULT nextval('hijo_b_seq'::regclass) NOT NULL ) which is all nice, except that the sequence is using the column name as owner before the column has been created in the first place. Both these commands will, of course, fail, because both depend on the other to have been executed first. The tie is easy to break in this case: just don't emit the OWNED BY clause ... but it makes me a bit nervous to be hardcoding the decision of parts that might depend on others. OTOH pg_dump already knows how to split objects in constituent parts as necessary; maybe it's not so bad. Well, the sequence can't depend on a table column that doesn't exist yet, so if it's in effect doing what you've shown there, it's cheating by virtue of knowing that nobody can observe the intermediate state. 
Strictly speaking, there's nothing wrong with emitting those commands just as you have them there; they won't run, but if what you want to do is log what's happened rather than replay it, that's OK. Producing output that is actually executable is a strictly harder problem than producing output that accurately describes what happened. As you say, pg_dump already splits things and getting executable output out of this facility will require the same kinds of tricks here. This gets back to my worry about maintaining two or three copies of the code that solve many of the same problems in quite different ways... 3. It is possible to coerce ruleutils.c to emit always-qualified names by using PushOverrideSearchPath() facility; but actually this doesn't always work, because some places in namespace.c believe that PG_CATALOG_NAMESPACE is always visible and so certain objects are not qualified. In particular, text columns using default collation will be emitted as having collation default and not pg_catalog.default as I would have initially expected. Right now it doesn't seem like this is a problem, but it is unusual. We have a quote_all_identifiers flag. We could have a schema_qualify_all_identifiers flag, too. Then again, why is the behavior of schema-qualifying absolutely everything even desirable? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Viability of text HISTORY/INSTALL/regression README files (was Re: [COMMITTERS] pgsql: Document a few more regression test hazards.)
On Tue, Feb 4, 2014 at 11:43 PM, Noah Misch n...@leadboat.com wrote: Right. I mean, a lot of the links say things like Section 26.2 which obviously makes no sense in a standalone text file. For xrefs normally displayed that way, text output could emit a URL, either inline or in the form of a footnote. For link targets (e.g. SQL commands) having a friendly text fragment for xref sites, use the normal fragment. True. If we're going to keep these things around, something like that would avoid some annoyances for documentation authors. But I still think we ought to just nuke them, because who cares? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD
On Wed, Feb 5, 2014 at 11:37 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-02-05 11:23:29 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: And I think somebody already thought about it (c.f. ALIGNOF_BUFFER), it just wasn't updated in the last 10 years. No, ALIGNOF_BUFFER is there because we read something that said that I/O transfers between userspace and kernel disk cache would be faster with aligned buffers. There's been no particular thought given to alignment of other data structures, AFAIR. But it's not aligned anymore on at last half a decade's hardware, and it's what we already align *all* bigger ShmemAlloc() values with. And BufferDescriptors surely counts as larger in its entirety. It may well be that your proposal is spot on. But I'd like to see some data-structure-by-data-structure measurements, rather than assuming that alignment must be a good thing. I am fine with just aligning BufferDescriptors properly. That has clearly shown massive improvements. I thought your previous idea of increasing BUFFERALIGN to 64 bytes had a lot to recommend it. But that doesn't mean it doesn't need testing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
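[Editor's note: for concreteness, the rounding in question is what PostgreSQL's TYPEALIGN macro (in c.h) already does; cache-line-aligning ShmemAlloc() results just means applying it with a 64-byte alignment value before handing each address out. A self-contained sketch follows; the CACHELINE_SIZE of 64 is an assumption about the target hardware, per the discussion above.]

```c
#include <stdint.h>

/* Mirrors PostgreSQL's TYPEALIGN: round LEN up to the next multiple of
 * ALIGNVAL, which must be a power of two. */
#define TYPEALIGN(ALIGNVAL, LEN) \
    (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

#define CACHELINE_SIZE 64       /* assumed; 64 bytes on current x86 */

/* What a cache-line-aware ShmemAlloc() would do to each returned address,
 * so that no spinlock-bearing struct straddles two cache lines. */
static uintptr_t
cacheline_align(uintptr_t addr)
{
    return TYPEALIGN(CACHELINE_SIZE, addr);
}
```

Because sizeof(BufferDescriptor)-sized entries then start on cache-line boundaries, false sharing of adjacent descriptors' spinlocks goes away regardless of how max_connections shifts earlier allocations.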
Re: [HACKERS] Patch: show xid and xmin in pg_stat_activity and pg_stat_replication
Robert Haas robertmh...@gmail.com writes: It feels weird to me that the new columns are called transactionid and xmin. Why not xid and xmin? Actually the part of that that bothers me is xmin, which conflicts with a reserved system column name. While you can legally pick such conflicting names for view columns, it's not going to be so much fun when you try to join that view against some regular table. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?
On Tue, Feb 4, 2014 at 3:26 PM, Peter Eisentraut pete...@gmx.net wrote: Perhaps this type should be called pglsn, since it's an implementation-specific detail and not a universal concept like int, point, or uuid. If we're going to do that, I suggest pg_lsn rather than pglsn. We already have pg_node_tree, so using underscores for separation would be more consistent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: show xid and xmin in pg_stat_activity and pg_stat_replication
On Wed, Feb 5, 2014 at 1:21 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: It feels weird to me that the new columns are called transactionid and xmin. Why not xid and xmin? Actually the part of that that bothers me is xmin, which conflicts with a reserved system column name. While you can legally pick such conflicting names for view columns, it's not going to be so much fun when you try to join that view against some regular table. That's a fair point, too. So maybe we should go with something like backend_xid and backend_xmin or some other prefix that we come up with. My concern is more that I think they should be consistent somehow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/03/2014 07:04 AM, Robert Haas wrote: On Mon, Feb 3, 2014 at 9:57 AM, Andrew Dunstan and...@dunslane.net wrote: It's not so long ago that they were saying they would no longer publish free-as-in-beer command line compilers at all. The outrage made them change their minds, but we really can't rely on only Microsoft compilers for Windows. +1. It appears that LLVM supports Windows: http://llvm.org/docs/GettingStartedVS.html Sincerely, JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD
On Tue, Feb 4, 2014 at 4:24 PM, Peter Geoghegan p...@heroku.com wrote: There is something you have not drawn explicit attention to that is very interesting. If we take REL9_3_STABLE tip to be representative (built with full -O2 optimization, no assertions just debugging symbols), setting max_connections to 91 from 90 does not have the effect of making the BufferDescriptors array aligned; it has the effect of making it *misaligned*. I spoke too soon; the effect is indeed reversed on master (i.e. bad max_connection settings are misaligned, and vice-versa). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?
On 2/5/14, 1:31 PM, Robert Haas wrote: On Tue, Feb 4, 2014 at 3:26 PM, Peter Eisentraut pete...@gmx.net wrote: Perhaps this type should be called pglsn, since it's an implementation-specific detail and not a universal concept like int, point, or uuid. If we're going to do that, I suggest pg_lsn rather than pglsn. We already have pg_node_tree, so using underscores for separation would be more consistent. Yes, that's a good precedent in multiple ways.
Re: [HACKERS] [doc patch] extra_float_digits and casting from real to numeric
On Tue, Feb 4, 2014 at 11:25 AM, Christoph Berg christoph.b...@credativ.de wrote: Re: To Tom Lane 2014-01-08 20140108094017.ga20...@msgid.df7cb.de What about this patch to mention this gotcha more explicitly in the documentation? diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml new file mode 100644 index 0386330..968f4a7 *** a/doc/src/sgml/datatype.sgml --- b/doc/src/sgml/datatype.sgml *** NUMERIC *** 689,694 --- 689,697 literal0/literal, the output is the same on every platform supported by PostgreSQL. Increasing it will produce output that more accurately represents the stored value, but may be unportable. + Casts to other numeric datatypes and the literalto_char/literal + function are not affected by this setting, it affects only the text + representation. /para /note Anyone for that patch? Well, the new text kinda recapitulates what the existing text already says. If we're going to clarify, I'd do it like this: The xref linkend=guc-extra-float-digits setting controls the number of extra significant digits included when a floating point value is converted to text for output. It does not affect the results when a floating point number is converted to some other data type or formatted using literalto_char/literal. But frankly I'm inclined to just leave it alone. It says that it affects what happens when the value is converted to text for output. That's specific and accurate. Granted, someone could misunderstand, but that's true of almost anything we might write, and being too long-winded has costs of its own. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/05/2014 01:41 PM, Joshua D. Drake wrote: On 02/03/2014 07:04 AM, Robert Haas wrote: On Mon, Feb 3, 2014 at 9:57 AM, Andrew Dunstan and...@dunslane.net wrote: It's not so long ago that they were saying they would no longer publish free-as-in-beer command line compilers at all. The outrage made them change their minds, but we really can't rely on only Microsoft compilers for Windows. +1. It appears that LLVM supports Windows: http://llvm.org/docs/GettingStartedVS.html If someone wants to work on getting a Windows build working with llvm then go for it. But until then the Mingw tools are the only thing we have that we know works other than Microsoft tools. Having a compiler that's alleged to work is only one step of many in getting there. cheers andrew
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 12:50 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I think there's zero overlap. They're completely complementary features. It's not like normal WAL records have an irrelevant volume. Correct. Compressing a full-page image happens on the first update after a checkpoint, and the diff between old and new tuple is not used in that case. Uh, I really just meant that one thing that might overlap is considerations around the choice of compression algorithm. I think that there was some useful discussion of that on the other thread as well. -- Peter Geoghegan
Re: [HACKERS] jsonb and nested hstore
On 02/05/2014 07:48 AM, Merlin Moncure wrote: Another point I'm struggling with is what jsonb brings to the table that isn't covered by either hstore or json; working through a couple of cases I find myself not using the jsonb functionality except as a 'hstore json formatter' which the json type covers. I'm probably being obtuse, but we have to be cautious before plonking a couple of dozen extra functions in the public schema. There are three reasons why it's worthwhile: 1) user-friendliness: telling users they need to do ::JSON and ::HSTORE2 all the time is sufficiently annoying -- and prone to causing errors -- to be a blocker to adoption by a certain, very numerous, class of user. 2) performance: to the extent that we can operate entirely in JSONB and not transform back and forth to JSON and HSTORE, function calls (and index lookups) will be much faster. And given the competition, speed is important. 3) growth: 9.4's JSONB functions are a prerequisite to developing richer JSON querying capabilities in 9.5 and later, which will go beyond JSON formatting for HSTORE. Frankly, if it were entirely up to me HSTORE2 would be part of core and its only interface would be JSONB. But it's not. So this is a compromise. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] mvcc catalog snapshots and TopTransactionContext
Noah Misch n...@leadboat.com writes: The following test case reliably hits this new assertion: create table t (c int); begin; create index on t (c); savepoint q; insert into t values (1); select 1/0; As of commit ac8bc3b6e4, this test case no longer triggers the assertion, because it depends on the INSERT issuing a relcache invalidation request against t's index. btree used to do that when changing the index metapage, but no longer does. The underlying problem is certainly still there, though, and can be triggered with this slightly more complex test: create or replace function inverse(int) returns float8 as $$ begin analyze t1; return 1::float8/$1; exception when division_by_zero then return 0; end$$ language plpgsql volatile; drop table if exists t1; create table t1 (c float8 unique); insert into t1 values (1); insert into t1 values (inverse(0)); Here, the ANALYZE triggers a relcache inval within the subtransaction established by the function's BEGIN/EXCEPTION block, and then we abort that subtransaction with a zero-divide, so you end up at the same place as with the older example. Of note is that we still have to have an index on t1; remove that, no assert. This is a bit surprising since the ANALYZE certainly causes a relcache flush on the table not just the index. The reason turns out to be that for a simple table like this, relcache entry rebuild does not involve consulting any syscache: we load the pg_class row with a systable scan on pg_class, and build the tuple descriptor using a systable scan on pg_attribute, and we're done. IIRC this was done this way intentionally to avoid duplicative caching of the pg_class and pg_attribute rows. However, RelationReloadIndexInfo uses the syscaches with enthusiasm, so you will hit the Assert in question if you're trying to rebuild an index; and you possibly could hit it for a table if you have a more complicated table definition, such as one with rules. 
Of course, a direct system catalog scan is certainly no safer in an aborted transaction than the one that catcache.c is refusing to do. Therefore, in my opinion, relcache.c ought also to be doing an Assert(IsTransactionState()), at least in ScanPgRelation and perhaps everywhere that it does catalog scans. I stuck such an Assert in ScanPgRelation, and verified that it doesn't break any existing regression tests --- although of course the above test case still fails, and now even without declaring an index. Barring objections I'll go commit that. It's a bit pointless to be Asserting that catcache.c does nothing unsafe when relcache.c does the same things without any such test. (Alternatively, maybe we should centralize the asserting in systable_beginscan or some such place?) regards, tom lane
Re: [HACKERS] Patch: show xid and xmin in pg_stat_activity and pg_stat_replication
Robert Haas robertmh...@gmail.com writes: On Wed, Feb 5, 2014 at 1:21 PM, Tom Lane t...@sss.pgh.pa.us wrote: Actually the part of that that bothers me is xmin, which conflicts with a reserved system column name. While you can legally pick such conflicting names for view columns, it's not going to be so much fun when you try to join that view against some regular table. That's a fair point, too. So maybe we should go with something like backend_xid and backend_xmin or some other prefix that we come up with. My concern is more that I think they should be consistent somehow. Works for me. regards, tom lane
Re: [HACKERS] jsonb and nested hstore
On 02/05/2014 02:03 PM, Josh Berkus wrote: Frankly, if it were entirely up to me HSTORE2 would be part of core and its only interface would be JSONB. But it's not. So this is a compromise. You could only do that by inventing a new type. But hstore2 isn't a new type, it's meant to be the existing hstore type with new capabilities. Incidentally, some work is being done by one of my colleagues on an extension of gin/gist operators for indexing jsonb similarly to hstore2. Now that will possibly be something we can bring into 9.4, although we'll have to check how we go about pg_upgrade for that case. cheers andrew
Re: [HACKERS] Add CREATE support to event triggers
Robert Haas escribió: On Tue, Feb 4, 2014 at 12:11 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I have run into some issues, though: 1. certain types, particularly timestamp/timestamptz but really this could happen for any type, have unusual typmod output behavior. For those one cannot just use the schema-qualified catalog names and then append the typmod at the end; because what you end up with is something like pg_catalog.timestamptz(4) with time zone because, for whatever reason, the with time zone is part of typmod output. But this doesn't work at all for input. I'm not sure how to solve this. How about doing whatever pg_dump does? We use format_type() for that as far as I know. What it does differently is use undecorated names defined by the standard for some types, which are never schema qualified and are never ambiguous because they are reserved words that would require quoting if used by user-defined type names. We can't use that here: somewhere upthread we noticed issues when using those which is why we're now trying to use catalog names instead of those special names. (I don't think it's impossible to use such names: we just need to ensure we handle quoting correctly for the funny cases such as char and bit.) One idea is to chop the typmod output string at the closing parens. That seems to work well for the types that come with core code ... and the rule with the funny string at the end after the parenthesized part of the typmod works only for type names with hardcoded productions in gram.y, so there is no danger that outside code will have a problem with that strategy. (I verified PostGIS types with typmods just to see an example of out-of-core code at work, and unsurprisingly it only emits stuff inside parens.) 2. I have been having object definitions be emitted complete; in particular, sequences have OWNED BY clauses when they have an owning column. 
But this doesn't work with a SERIAL column, because we get output like this: Well, the sequence can't depend on a table column that doesn't exist yet, so if it's in effect doing what you've shown there, it's cheating by virtue of knowing that nobody can observe the intermediate state. Strictly speaking, there's nothing wrong with emitting those commands just as you have them there; they won't run, but if what you want to do is log what's happened rather than replay it, that's OK. Producing output that is actually executable is a strictly harder problem than producing output that accurately describes what happened. As you say, pg_dump already splits things and getting executable output out of this facility will require the same kinds of tricks here. Yeah. Moreover this will require that this new event trigger code is able to handle (at least certain kinds of) ALTER, which expands this patch in scope by a wide margin. Producing executable commands is an important goal. This gets back to my worry about maintaining two or three copies of the code that solve many of the same problems in quite different ways... Agreed. It would be real good to be able to use this code for pg_dump too, but it seems dubious. The two environments seem just too different to be able to reuse anything. But if you see a way, by all means shout. 3. It is possible to coerce ruleutils.c to emit always-qualified names by using PushOverrideSearchPath() facility; but actually this doesn't always work, because some places in namespace.c believe that PG_CATALOG_NAMESPACE is always visible and so certain objects are not qualified. In particular, text columns using default collation will be emitted as having collation default and not pg_catalog.default as I would have initially expected. Right now it doesn't seem like this is a problem, but it is unusual. We have a quote_all_identifiers flag. We could have a schema_qualify_all_identifiers flag, too. 
Actually the bit about collations was just a garden variety bug: turns out I was using a %{}I specifier (and correspondingly only the collation name as a string) instead of %{}D and get_catalog_object_by_oid() to match. I'm not yet sure if this new flag you suggest will really be needed, but thanks for the idea. Then again, why is the behavior of schema-qualifying absolutely everything even desirable? Well, someone could create a collation in another schema with the same name as a system collation and the command would become ambiguous. For example, this command create table f (a text collate POSIX); creates a column whose collation is either public.POSIX or the system POSIX collation, depending on whether public appears before pg_catalog in search_path. Having someone create a POSIX collation in a schema that appears earlier than pg_catalog is pretty far-fetched, but ISTM that we really need to cover that scenario. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
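The chop-at-the-closing-paren idea can be illustrated with a toy helper (chop_typmod is hypothetical, not proposed backend code): given typmodout text such as "(4) with time zone" as produced for timestamptz, keeping only the parenthesized prefix yields something that can be fed back in as typmod input.

```c
#include <assert.h>
#include <string.h>

/* Truncate a typmod output string after the first closing paren, discarding
 * trailing decoration like " with time zone" that only gram.y's hardcoded
 * type productions can parse.  Strings without a ')' are left untouched. */
static void
chop_typmod(char *typmod)
{
    char *close = strchr(typmod, ')');

    if (close != NULL)
        close[1] = '\0';
}
```

Per the observation above, this is safe for out-of-core types too, since user-defined typmodout functions only emit material inside the parens.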
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/05/2014 10:56 AM, Andrew Dunstan wrote: It appears that LLVM supports Windows: http://llvm.org/docs/GettingStartedVS.html If someone wants to work on getting a Windows build working with llvm then go for it. But until then the Mingw tools are the only thing we have that we know works other than Microsoft tools. Having a compiler that's alleged to work is only one step of many in getting there. I was just trying to provide another resource in case people wanted to take a look. JD
Re: [HACKERS] Add CREATE support to event triggers
Alvaro Herrera alvhe...@2ndquadrant.com writes: Robert Haas escribió: How about doing whatever pg_dump does? We use format_type() for that as far as I know. What it does differently is use undecorated names defined by the standard for some types, which are never schema qualified and are never ambiguous because they are reserved words that would require quoting if used by user-defined type names. We can't use that here: somewhere upthread we noticed issues when using those which is why we're now trying to use catalog names instead of those special names. (I don't think it's impossible to use such names: we just need to ensure we handle quoting correctly for the funny cases such as char and bit.) Yeah, but wouldn't that complexity also bubble into user code within the event triggers? Since there's no real need for SQL standard compliance in this context, I think minimizing the number of weird formats is a win. One idea is to chop the typmod output string at the closing parens. +1. The only reason timestamptypmodout works like that is that we're trying to match the SQL standard's spelling of the type names, and that committee apparently considers it an off day whenever they can't invent some randomly-incompatible-with-everything syntax. regards, tom lane
Re: [HACKERS] jsonb and nested hstore
On Wed, Feb 5, 2014 at 1:03 PM, Josh Berkus j...@agliodbs.com wrote: On 02/05/2014 07:48 AM, Merlin Moncure wrote: Another point I'm struggling with is what jsonb brings to the table that isn't covered by either hstore or json; working through a couple of cases I find myself not using the jsonb functionality except as a 'hstore json formatter' which the json type covers. I'm probably being obtuse, but we have to be cautious before plonking a couple of dozen extra functions in the public schema. There are three reasons why it's worthwhile: 1) user-friendliness: telling users they need to do ::JSON and ::HSTORE2 all the time is sufficiently annoying -- and prone to causing errors -- to be a blocker to adoption by a certain, very numerous, class of user. That's a legitimate point of concern. But in and of itself I'm not sure it warrants exposing a separate API. 2) performance: to the extent that we can operate entirely in JSONB and not transform back and forth to JSON and HSTORE, function calls (and index lookups) will be much faster. And given the competition, speed is important. Not following this. I do not see how the presence of jsonb helps at all. Client to server communication will be text-binary (and vice versa) and handling within the server itself will be in binary. This is the crux of my point. 3) growth: 9.4's JSONB functions are a prerequisite to developing richer JSON querying capabilities in 9.5 and later, which will go beyond JSON formatting for HSTORE. I kind of get this point. But in lieu of a practical use case today, what's the rush to implement? I fully anticipate I'm out on left field on this one (I have a cot and mini fridge there). The question on the table is: what use cases (performance included) does jsonb solve that can't be solved without it? With the possible limited exception of andrew's yet to be delivered enhanced deserialization routines, I can't think of any. If presented with reasonable evidence I'll shut my yap, pronto. 
Frankly, if it were entirely up to me HSTORE2 would be part of core and its only interface would be JSONB. But it's not. So this is a compromise. I don't. To be pedantic: hstore is in core, but packaged as an extension. That's a very important distinction. In fact, I'll go further and say it seems wise for all SQL standard type work to happen in extensions. As long as it's an in core contrib extension, I see no stigma to that whatsoever. It's not clear at all to me why the json type was put in the public schema and now we're about to double down with jsonb. Having things extension packaged greatly eases concerns about future API changes because if problems emerge it's not impossible to imagine compatibility extensions to appear to bridge the gap if certain critical functions change. That's exactly the sort of thing that we may want to happen here, I think. merlin
Re: [HACKERS] jsonb and nested hstore
On 02/05/2014 03:15 PM, Merlin Moncure wrote: On Wed, Feb 5, 2014 at 1:03 PM, Josh Berkus j...@agliodbs.com wrote: On 02/05/2014 07:48 AM, Merlin Moncure wrote: Another point I'm struggling with is what jsonb brings to the table that isn't covered by either hstore or json; working through a couple of cases I find myself not using the jsonb functionality except as a 'hstore json formatter' which the json type covers. I'm probably being obtuse, but we have to be cautious before plonking a couple of dozen extra functions in the public schema. There are three reasons why it's worthwhile: 1) user-friendliness: telling users they need to do ::JSON and ::HSTORE2 all the time is sufficiently annoying -- and prone to causing errors -- to be a blocker to adoption by a certain, very numerous, class of user. That's a legitimate point of concern. But in and of itself I'm not sure it warrants exposing a separate API. 2) performance: to the extent that we can operate entirely in JSONB and not transform back and forth to JSON and HSTORE, function calls (and index lookups) will be much faster. And given the competition, speed is important. Not following this. I do not see how the presence of jsonb helps at all. Client to server communication will be text-binary (and vice versa) and handling within the server itself will be in binary. This is the crux of my point. 3) growth: 9.4's JSONB functions are a prerequisite to developing richer JSON querying capabilities in 9.5 and later, which will go beyond JSON formatting for HSTORE. I kind of get this point. But in lieu of a practical use case today, what's the rush to implement? I fully anticipate I'm out on left field on this one (I have a cot and mini fridge there). The question on the table is: what use cases (performance included) does jsonb solve that can't be solved without it? With the possible limited exception of andrew's yet to be delivered enhanced deserialization routines, I can't think of any. 
If presented with reasonable evidence I'll shut my yap, pronto. Frankly, if it were entirely up to me HSTORE2 would be part of core and its only interface would be JSONB. But it's not. So this is a compromise. I don't. To be pedantic: hstore is in core, but packaged as an extension. That's a very important distinction. In fact, I'll go further and say it seems wise for all SQL standard type work to happen in extensions. As long as it's an in core contrib extension, I see no stigma to that whatsoever. It's not clear at all to me why the json type was put in the public schema and now we're about to double down with jsonb. Having things extension packaged greatly eases concerns about future API changes because if problems emerge it's not impossible to imagine compatibility extensions to appear to bridge the gap if certain critical functions change. That's exactly the sort of thing that we may want to happen here, I think. The time for this discussion was months ago. I would not have spent many many hours of my time if I thought it was going to be thrown away. I find this attitude puzzling, to say the least. You were a major part of the discussion when we said OK, we'll leave json as it is (text based) and add jsonb. That's exactly what we're doing. And no, hstore is NOT in core. In core for a type means to me it's builtin, with a fixed Oid. cheers andrew
Re: [HACKERS] jsonb and nested hstore
On Wed, Feb 5, 2014 at 2:37 PM, Andrew Dunstan and...@dunslane.net wrote: The time for this discussion was months ago. I would not have spent many many hours of my time if I thought it was going to be thrown away. I find this attitude puzzling, to say the least. You were a major part of the discussion when we said OK, we'll leave json as it is (text based) and add jsonb. That's exactly what we're doing. certainly. I'll shut my yap; I understand your puzzlement. At the time though, I had assumed the API was going to incorporate more of the hstore feature set than it did. merlin
Re: [HACKERS] jsonb and nested hstore
On 02/05/2014 03:45 PM, Merlin Moncure wrote: On Wed, Feb 5, 2014 at 2:37 PM, Andrew Dunstan and...@dunslane.net wrote: The time for this discussion was months ago. I would not have spent many many hours of my time if I thought it was going to be thrown away. I find this attitude puzzling, to say the least. You were a major part of the discussion when we said OK, we'll leave json as it is (text based) and add jsonb. That's exactly what we're doing. certainly. I'll shut my yap; I understand your puzzlement. At the time though, I had assumed the API was going to incorporate more of the hstore feature set than it did. And we will. Specifically the indexing ops I mentioned upthread. We've got done as much as could be done this cycle. That's how Postgres development works. One of the major complaints about json in 9.3 is that almost all the functions and operators involve reparsing the json. The equivalent operations for jsonb do not, and should accordingly be significantly faster. That's what I have been spending my time on. I don't think that's an inconsiderable advance. cheers andrew
Re: [HACKERS] jsonb and nested hstore
Merlin, Not following this. I do not see how the presence of jsonb helps at all. Client to server communication will be text-binary (and vice versa) and handling within the server itself will be in binary. This is the crux of my point. Except that handling it on the server, in binary, would require using the HSTORE syntax. Otherwise you're converting from text JSON and back whenever you want to nest functions. I kind of get this point. But in lieu of a practical use case today, what's the rush to implement? I fully anticipate I'm out on left field on this one (I have a cot and mini fridge there). The question on the table is: what use cases (performance included) does jsonb solve that can't be solved without it? Indexed element extraction. JSON path queries. JSON manipulation. If JSONB is in 9.4, then these are things we can build as extensions and have available long before September 2015 -- in fact, we've already started on a couple. If JSONB isn't in core as a data type, then we have to wait for the 9.5 dev cycle to do anything. In fact, I'll go further and say it seems wise for all SQL standard type work to happen in extensions. As long as it's an in core contrib extension, I see no stigma to that whatsoever. It's not clear at all to me why the json type was put in the public schema and now we're about to double down with jsonb. I'll agree that having hstore in contrib and json in core has been a significant source of issues. On 02/05/2014 12:45 PM, Merlin Moncure wrote: certainly. I'll shut my yap; I understand your puzzlement. At the time though, I had assumed the API was going to incorporate more of the hstore feature set than it did. That was the original goal. However, Oleg and Teodor's late delivery of Hstore2 limited what Andrew could do for JSONB before CF4 started. -- Josh Berkus PostgreSQL Experts Inc. 
http://pgexperts.com
Re: [HACKERS] jsonb and nested hstore
On Wed, Feb 5, 2014 at 3:03 PM, Josh Berkus j...@agliodbs.com wrote: That was the original goal. However, Oleg and Teodor's late delivery of Hstore2 limited what Andrew could do for JSONB before CF4 started. yeah. anyways, I'm good on this point. merlin
Re: [HACKERS] [PERFORM] encouraging index-only scans
First, thanks for this thoughtful email. On Tue, Feb 4, 2014 at 7:14 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Mon, Feb 3, 2014 at 8:55 AM, Robert Haas robertmh...@gmail.com wrote: I've also had some further thoughts about the right way to drive vacuum scheduling. I think what we need to do is tightly couple the rate at which we're willing to do vacuuming to the rate at which we're incurring vacuum debt. That is, if we're creating 100kB/s of pages needing vacuum, we vacuum at 2-3MB/s (with default settings). If we can tolerate 2-3MB/s without adverse impact on other work, then we can tolerate it. Do we gain anything substantial by sand-bagging it? No. The problem is the other direction. If we're creating 10MB/s of pages needing vacuum, we *still* vacuum at 2-3MB/s. Not shockingly, vacuum gets behind, the database bloats, and everything goes to heck. (Your reference to bloat made me think your comments here are about vacuuming in general, not specific to IOS. If that isn't the case, then please ignore.) If we can only vacuum at 2-3MB/s without adversely impacting other activity, but we are creating 10MB/s of future vacuum need, then there are basically two possibilities I can think of. Either the 10MB/s represents a spike, and vacuum should tolerate it and hope to catch up on the debt later. Or it represents a new permanent condition, in which case I bought too few hard drives for the work load, and no scheduling decision that autovacuum can make will save me from my folly. Perhaps there is some middle ground between those possibilities, but I don't see room for much middle ground. 
I guess there might be entirely different possibilities not between those two; for example, I don't realize I'm doing something that is generating 10MB/s of vacuum debt, and would like to have this thing I'm doing be automatically throttled to the point it doesn't interfere with other processes (either directly, or indirectly by bloat) The underlying issue here is that, in order for there not to be a problem, a user needs to configure their autovacuum processes to vacuum at a rate which is greater than or equal to the average rate at which vacuum debt is being created. If they don't, they get runaway bloat. But to do that, they need to know at what rate they are creating vacuum debt, which is almost impossible to figure out right now; and even if they did know it, they'd then need to figure out what vacuum cost delay settings would allow vacuuming at a rate sufficient to keep up, which isn't quite as hard to estimate but certainly involves nontrivial math. So a lot of people have this set wrong, and it's not easy to get it right except by frobbing the settings until you find something that works well in practice. Also, a whole *lot* of problems in this area are caused by cases where the rate at which vacuum debt is being created *changes*. Autovacuum is keeping up, but then you have either a load spike or just a gradual increase in activity and it doesn't keep up any more. You don't necessarily notice right away, and by the time you do there's no easy way to recover. If you've got a table with lots of dead tuples in it, but it's also got enough internal freespace to satisfy as many inserts and updates as are happening, then it's possibly reasonable to put off vacuuming in the hopes that system load will be lower at some time in the future. But if you've got a table with lots of dead tuples in it, and you're extending it to create internal freespace instead of vacuuming it, it is highly likely that you are not doing what will make the user most happy. 
Even if vacuuming that table slows down foreground activity quite badly, it is probably better than accumulating an arbitrary amount of bloat. The rate of vacuuming needs to be tied somehow to the rate at which we're creating stuff that needs to be vacuumed. Right now we don't even have a way to measure that, let alone auto-regulate the aggressiveness of autovacuum on that basis. There is the formula used to decide when a table gets vacuumed. Isn't the time delta in this formula a measure of how fast we are creating stuff that needs to be vacuumed for bloat reasons? Is your objection that it doesn't include other reasons we might want to vacuum, or that it just doesn't work very well, or that it is not explicitly exposed? AFAICT, the problem isn't when the table gets vacuumed so much as *how fast* it gets vacuumed. The autovacuum algorithm does a fine job selecting tables for vacuuming, for the most part. There are problems with insert-only tables and sometimes for large tables the default threshold (0.20) is too high, but it's not terrible. However, the limit on the overall rate of vacuuming activity to 2-3MB/s regardless of how fast we're creating vacuum debt is a big problem. Similarly, for marking of pages as all-visible, we currently make the same decision whether the relation is getting index-scanned
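The self-tuning idea in this exchange — vacuum at least as fast as debt accrues, never slower than the configured floor — can be illustrated with a toy rate calculation. This is a Python sketch for illustration only; the rate values and the headroom factor are assumptions, not anything PostgreSQL implements:

```python
def autovacuum_rate_mb_s(debt_rate_mb_s, base_rate_mb_s=3.0, headroom=1.2):
    """Vacuum no slower than the configured base rate; when debt accrues
    faster than that, track the debt rate plus some headroom so any
    backlog shrinks instead of growing without bound."""
    return max(base_rate_mb_s, debt_rate_mb_s * headroom)

# A 100kB/s debt rate is comfortably covered by the 2-3MB/s default...
print(autovacuum_rate_mb_s(0.1))   # 3.0
# ...but a 10MB/s debt rate needs the vacuum rate raised to match.
print(autovacuum_rate_mb_s(10.0))  # 12.0
```

The point of the sketch is only the coupling: the output rate is a function of the measured input rate, which is exactly the measurement the thread says is missing today.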
Re: [HACKERS] jsonb and nested hstore
On 02/05/2014 04:06 PM, Merlin Moncure wrote: On Wed, Feb 5, 2014 at 3:03 PM, Josh Berkus j...@agliodbs.com wrote: That was the original goal. However, Oleg and Teodor's late delivery of Hstore2 limited what Andrew could do for JSONB before CF4 started. I also had issues. But this is the sort of thing that happens. We get done as much as we can. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minor performance improvement in transition to external sort
On Tue, Feb 4, 2014 at 5:22 PM, Jeremy Harris j...@wizmail.org wrote: The attached patch replaces the existing siftup method for heapify with a siftdown method. Tested with random integers it does 18% fewer compares and takes 10% less time for the heapify, over the work_mem range 1024 to 1048576. Both algorithms appear to be O(n) (contradicting Wikipedia's claim that a siftup heapify is O(n log n)). I think Wikipedia is right. Inserting a tuple into a heap is O(lg n); doing that n times is therefore O(n lg n). Your patch likewise implements an outer loop which runs O(n) times and an inner loop which runs at most O(lg n) times, so a naive analysis of that algorithm also comes out to O(n lg n). Wikipedia's article on http://en.wikipedia.org/wiki/Heapsort explains why a tighter bound is possible for the siftdown case. I think I tested something like this at some point and concluded that it didn't really help much, because building the initial heap was a pretty small part of the runtime of a large sort. It may still be worth doing, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
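The distinction Robert draws is easiest to see by counting compares directly. The following Python sketch (purely illustrative; not tuplesort.c's code) builds a min-heap both ways. On reverse-ordered input, the repeated-insertion (siftup) build does Θ(n lg n) compares, while the bottom-up siftdown build (Floyd's method) stays within a small constant times n — the tighter bound the Wikipedia article derives:

```python
def heapify_siftup(a):
    """Build a min-heap by inserting one element at a time (sift-up).
    Returns the number of comparisons performed."""
    compares = 0
    for end in range(1, len(a)):
        i = end
        while i > 0:
            parent = (i - 1) // 2
            compares += 1
            if a[i] < a[parent]:
                a[i], a[parent] = a[parent], a[i]
                i = parent
            else:
                break
    return compares

def heapify_siftdown(a):
    """Build a min-heap bottom-up (Floyd's method): sift each internal
    node down, starting from the last internal node."""
    compares = 0
    n = len(a)
    for start in range(n // 2 - 1, -1, -1):
        i = start
        while 2 * i + 1 < n:
            child = 2 * i + 1
            if child + 1 < n:           # pick the smaller child
                compares += 1
                if a[child + 1] < a[child]:
                    child += 1
            compares += 1
            if a[i] <= a[child]:
                break
            a[i], a[child] = a[child], a[i]
            i = child
    return compares

n = 4096
up = heapify_siftup(list(range(n, 0, -1)))
down = heapify_siftdown(list(range(n, 0, -1)))
# siftup degenerates on descending input; siftdown stays within ~2n
print(up, down)
```

On random input the gap is much smaller (consistent with Jeremy's measured 18%), which is why the worst-case analysis and the benchmark numbers can both be right.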
Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext
On Wed, Feb 05, 2014 at 02:07:29PM -0500, Tom Lane wrote: Of course, a direct system catalog scan is certainly no safer in an aborted transaction than the one that catcache.c is refusing to do. Therefore, in my opinion, relcache.c ought also to be doing an Assert(IsTransactionState()), at least in ScanPgRelation and perhaps everywhere that it does catalog scans. I stuck such an Assert in ScanPgRelation, and verified that it doesn't break any existing regression tests --- although of course the above test case still fails, and now even without declaring an index. Barring objections I'll go commit that. It's a bit pointless to be Asserting that catcache.c does nothing unsafe when relcache.c does the same things without any such test. (Alternatively, maybe we should centralize the asserting in systable_beginscan or some such place?) I'd put it in LockAcquireExtended() and perhaps also heap_beginscan_internal() and/or index_beginscan_internal(). ScanPgRelation() isn't special. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext
On 2014-02-05 14:07:29 -0500, Tom Lane wrote: I stuck such an Assert in ScanPgRelation, and verified that it doesn't break any existing regression tests --- although of course the above test case still fails, and now even without declaring an index. Barring objections I'll go commit that. It's a bit pointless to be Asserting that catcache.c does nothing unsafe when relcache.c does the same things without any such test. (Alternatively, maybe we should centralize the asserting in systable_beginscan or some such place?) I don't have a problem with sticking an additional assert elsewhere, but I think ScanPgRelation, systable_beginscan are a bit late, because they frequently won't be hit during testing because the lookups will be cached... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext
Andres Freund and...@2ndquadrant.com writes: On 2014-02-05 14:07:29 -0500, Tom Lane wrote: I stuck such an Assert in ScanPgRelation, and verified that it doesn't break any existing regression tests --- although of course the above test case still fails, and now even without declaring an index. Barring objections I'll go commit that. It's a bit pointless to be Asserting that catcache.c does nothing unsafe when relcache.c does the same things without any such test. (Alternatively, maybe we should centralize the asserting in systable_beginscan or some such place?) I don't have a problem with sticking an additional assert elsewhere, but I think ScanPgRelation, systable_beginscan are a bit late, because they frequently won't be hit during testing because the lookups will be cached... Oh, good point. By analogy to the placement of the existing Assert in SearchCatCache, the one for relcache should be in RelationIdGetRelation. [ experiments... ] OK, that passes regression tests too. Good... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On Thu, Jan 23, 2014 at 1:36 PM, Peter Geoghegan p...@heroku.com wrote: So while post-recovery callbacks no longer exist for any rmgr-managed-resource, 100% of remaining startup and cleanup callbacks concern the simple management of memory of AM-specific recovery contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't a better abstraction than that, such as a generic recovery memory context, allowing you to retire all 3 callbacks. I mean, StartupXLOG() just calls those callbacks for each resource at exactly the same time anyway, just as it initializes resource managers in precisely the same manner earlier on. Plus if you look at what those AM-local memory management routines do, it all seems very simple. What are your thoughts on this, as someone that has a broader perspective here? Are you inclined to keep the startup and cleanup callbacks in anticipation of a day when that degree of generality is useful? That would be pretty well-precedented of course, but I would like to hear your opinion. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb and nested hstore
On 02/05/2014 01:10 PM, Andrew Dunstan wrote: On 02/05/2014 12:48 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 02/05/2014 11:40 AM, Tom Lane wrote: switching to binary is the same as text may well be the most prudent path here. If we do that we're going to have to live with that forever, aren't we? Yeah, but the other side of that coin is that we'll have to live forever with whatever binary format we pick, too. If it turns out to be badly designed, that could be much worse than eating some parsing costs during dump/restore. If we had infinite time/manpower, this wouldn't really be an issue. We don't, though, and so I suggest that this may be one of the better things to toss overboard. The main reason I'm prepared to consider this is the JSON parser seems to be fairly efficient (See Oleg's recent stats) and in fact we'd more or less be parsing the binary format on input anyway, so there's no proof that a binary format is going to be hugely faster (or possibly even that it will be faster at all). If anyone else has opinions on this sing out pretty darn soon (like the next 24 hours or so, before I begin.) I got a slightly earlier start ;-) For people wanting to play along, here's what this change looks like: https://github.com/feodor/postgres/commit/3fe899b3d7e8f806b14878da4a4e2331b0eb58e8 I have a bit more cleanup to do and then I'll try to make new patches. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Partial sort
On Tue, Jan 28, 2014 at 7:51 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Tue, Jan 28, 2014 at 7:41 AM, Marti Raudsepp ma...@juffo.org wrote: But some benchmarks of planning performance are certainly warranted. I didn't test it, but I worry that overhead might be high. If it's true then it could be like constraint_exclusion option which is off by default because of planning overhead. Sorry I didn't get around to this before. I ran some synthetic benchmarks with single-column inner joins between 5 tables, with indexes on both joined columns, using only EXPLAIN (so measuring planning time, not execution) in 9 scenarios to exercise different code paths. According to these measurements, the overhead ranges between 1.0 and 4.5% depending on the scenario. Merge join with partial sort children seems like a fairly obscure use case (though I'm sure it can help a lot in those cases). The default should definitely allow partial sort in normal ORDER BY queries. What's under question here is whether to enable partial sort for mergejoin. So I see 3 possible resolutions: 1. The overhead is deemed acceptable to enable by default, in which case we're done here. 2. Add a three-value runtime setting like: enable_partialsort = [ off | no_mergejoin | on ], defaulting to no_mergejoin (just to get the point across, clearly we need better naming). This is how constraint_exclusion works. 3. Remove the partialsort mergejoin code entirely, keeping the rest of the cases. What do you think? 
All the tests are available here: https://github.com/intgr/benchjunk/tree/master/partial_sort (using script run2.sh) Overhead by test (partial-sort-7.patch.gz): join5.sql 2.9% (all joins on the same column) star5.sql 1.7% (star schema kind of join) line5.sql 1.9% (joins chained to each other) lim_join5.sql 4.5% (same as above, with LIMIT 1) lim_star5.sql 2.8% lim_line5.sql 1.8% limord_join5.sql 4.3% (same as above, with ORDER BY LIMIT 1) limord_star5.sql 3.9% limord_line5.sql 1.0% Full data: PostgreSQL @ git ac8bc3b join5.sql tps = 499.490173 (excluding connections establishing) join5.sql tps = 503.756335 (excluding connections establishing) join5.sql tps = 504.814072 (excluding connections establishing) star5.sql tps = 492.799230 (excluding connections establishing) star5.sql tps = 492.570615 (excluding connections establishing) star5.sql tps = 491.949985 (excluding connections establishing) line5.sql tps = 773.945050 (excluding connections establishing) line5.sql tps = 773.858068 (excluding connections establishing) line5.sql tps = 774.551240 (excluding connections establishing) lim_join5.sql tps = 392.539745 (excluding connections establishing) lim_join5.sql tps = 391.867549 (excluding connections establishing) lim_join5.sql tps = 393.361655 (excluding connections establishing) lim_star5.sql tps = 418.431804 (excluding connections establishing) lim_star5.sql tps = 419.258985 (excluding connections establishing) lim_star5.sql tps = 419.434697 (excluding connections establishing) lim_line5.sql tps = 713.852506 (excluding connections establishing) lim_line5.sql tps = 713.636694 (excluding connections establishing) lim_line5.sql tps = 712.971719 (excluding connections establishing) limord_join5.sql tps = 381.068465 (excluding connections establishing) limord_join5.sql tps = 380.379359 (excluding connections establishing) limord_join5.sql tps = 381.182385 (excluding connections establishing) limord_star5.sql tps = 412.997935 (excluding connections establishing) 
limord_star5.sql tps = 411.401352 (excluding connections establishing) limord_star5.sql tps = 413.209784 (excluding connections establishing) limord_line5.sql tps = 688.906406 (excluding connections establishing) limord_line5.sql tps = 689.445483 (excluding connections establishing) limord_line5.sql tps = 688.758042 (excluding connections establishing) partial-sort-7.patch.gz join5.sql tps = 479.508034 (excluding connections establishing) join5.sql tps = 488.263674 (excluding connections establishing) join5.sql tps = 490.127433 (excluding connections establishing) star5.sql tps = 482.106063 (excluding connections establishing) star5.sql tps = 484.179687 (excluding connections establishing) star5.sql tps = 483.027372 (excluding connections establishing) line5.sql tps = 758.092993 (excluding connections establishing) line5.sql tps = 759.697814 (excluding connections establishing) line5.sql tps = 759.792792 (excluding connections establishing) lim_join5.sql tps = 375.517211 (excluding connections establishing) lim_join5.sql tps = 375.539109 (excluding connections establishing) lim_join5.sql tps = 375.841645 (excluding connections establishing) lim_star5.sql tps = 407.683110 (excluding connections establishing) lim_star5.sql tps = 407.414409 (excluding connections establishing) lim_star5.sql tps = 407.526613 (excluding connections establishing) lim_line5.sql tps = 699.905101 (excluding connections establishing) lim_line5.sql tps = 700.349675 (excluding
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
On Tue, Feb 4, 2014 at 11:56 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I also changed _bt_moveright to never return a write-locked buffer, when the caller asked for a read-lock (an issue you pointed out earlier in this thread). I think that _bt_moveright() looks good now. There is now bitrot, caused by commit ac8bc3. I rebased myself, but that was non-trivial, surprisingly; I had to tweak by hand. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/05/2014 01:52 PM, Tom Lane wrote: Craig Ringer cr...@2ndquadrant.com writes: On 02/05/2014 06:29 AM, Tom Lane wrote: I had been okay with the manual PGDLLIMPORT-sprinkling approach (not happy with it, of course, but prepared to tolerate it) as long as I believed the buildfarm would reliably tell us of the need for it. That assumption has now been conclusively disproven, though. I'm kind of horrified that the dynamic linker doesn't throw its toys when it sees this. Indeed :-(. The truly strange part of this is that it seems that the one Windows buildfarm member that's telling the truth (or most nearly so, anyway) is narwhal, which appears to have the oldest and cruftiest toolchain of the lot. I'd really like to come out the other end of this investigation with a clear understanding of why the newer toolchains are failing to report a link problem, and yet not building working executables. I've done some digging and asked for some input from people with appropriate knowledge. Getting a hard answer's going to require some quality time with a debugger that I won't have until after the CF. Here's what I've found so far: In a .DEF file, symbol exports default to code exports if unspecified. http://msdn.microsoft.com/en-us/library/hyx1zcd3.aspx (I'm told that .DEF files take precedence over annotations in code, but haven't found a reference to confirm that yet, and I'm not convinced it's true since data annotated correctly with __declspec(dllexport) seems to work. You'd think places like this would mention it, but they don't: http://msdn.microsoft.com/en-us/library/7k30y2k5.aspx) So we're probably generating code thunk functions for our global variables in the import library. There seem to be two possibilities for what's happening when you access an extern variable defined in another module without __declspec(dllimport): 1. Our extern variable points to a thunk function's definition; or 2. 
Our extern variable points to the indirected pointer to the real variable, which is what you get if you link to a CONSTANT export without using __declspec(dllimport). I'd need to go poking in disassemblies with the debugger to confirm which it is, and don't have time right now. After the CF, maybe. I suspect #1, since we're probably generating thunk functions for our data exports. Either way, the key thing is: http://msdn.microsoft.com/en-us/library/hyx1zcd3.aspx Note that when you export a variable from a DLL with a .def file, you do not need to specify __declspec(dllexport) on the variable. However, in any file that uses the DLL, you must still use __declspec(dllimport) on the declaration of data. See also rules and limitations for dllimport/dllexport: http://msdn.microsoft.com/en-us/library/da6zd0a4.aspx This page notes that you should annotate .DEF entries for data with DATA to reduce the likelihood that incorrect coding will cause a problem. http://msdn.microsoft.com/en-us/library/54xsd65y.aspx That will prevent generation of the linker symbol that's a pointer to the real data, so we should get linker errors when we fail to specify __declspec(dllimport). To do this, src/tools/msvc/gendefs.pl needs to be modified to correctly annotate globals with DATA based on the dumpbin output. I'll have a crack at that after the CF is closed. However, *this modification will not get rid of the need for PGDLLIMPORT macros in headers*, because it's still necessary to explicitly tag them with __declspec(dllimport) for correct results. In fact, I'm now wondering if we're using inefficient thunk based function calls from modules because we're not specifying __declspec(dllimport) on external functions, too. Again, that's a post-CF thing to check out using the MSVC debugger. 
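Concretely, the gendefs.pl change being proposed here would make the generated module-definition file tag data exports with the DATA keyword, roughly along these lines (the symbol names below are merely illustrative of the pattern, not gendefs.pl output):

```text
EXPORTS
    InterruptPending        DATA    ; global variable: importers must use
                                    ; __declspec(dllimport) / PGDLLIMPORT
    pg_detoast_datum                ; function export: no annotation needed
```

With the DATA annotation in place, the linker stops emitting the extra code-style symbol for the variable, so a module that forgets __declspec(dllimport) fails at link time instead of silently reading through a thunk.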
BTW, there's a general reference for __declspec(dllexport) here: http://msdn.microsoft.com/en-us/library/a90k134d.aspx and higher level concepts here: http://msdn.microsoft.com/en-us/library/z4zxe9k8.aspx http://msdn.microsoft.com/en-us/library/900axts6.aspx Other references: http://msdn.microsoft.com/en-us/library/da6zd0a4.aspx http://msdn.microsoft.com/en-us/library/zw3za17w.aspx http://msdn.microsoft.com/en-us/library/619w14ds.aspx http://msdn.microsoft.com/en-us/library/54xsd65y.aspx http://blogs.msdn.com/b/oldnewthing/archive/2006/07/20/672695.aspx http://blogs.msdn.com/b/oldnewthing/archive/2006/07/18/669668.aspx -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/06/2014 10:14 AM, Craig Ringer wrote: On 02/05/2014 01:52 PM, Tom Lane wrote: Craig Ringer cr...@2ndquadrant.com writes: On 02/05/2014 06:29 AM, Tom Lane wrote: I had been okay with the manual PGDLLIMPORT-sprinkling approach (not happy with it, of course, but prepared to tolerate it) as long as I believed the buildfarm would reliably tell us of the need for it. That assumption has now been conclusively disproven, though. I'm kind of horrified that the dynamic linker doesn't throw its toys when it sees this. Indeed :-(. The truly strange part of this is that it seems that the one Windows buildfarm member that's telling the truth (or most nearly so, anyway) is narwhal, which appears to have the oldest and cruftiest toolchain of the lot. I'd really like to come out the other end of this investigation with a clear understanding of why the newer toolchains are failing to report a link problem, and yet not building working executables. I've done some digging and asked for some input from people with appropriate knowledge. Getting a hard answer's going to require some quality time with a debugger that I won't have until after the CF. ... though this link may shed some light, it turns out: http://msdn.microsoft.com/en-us/library/aa271769(v=vs.60).aspx So, yeah. It looks like we're probably linking to thunk functions as data (ouch). To produce the desired error if __declspec(dllimport) is missing we need to change gendefs.pl to emit DATA for exported globals. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add CREATE support to event triggers
On Wed, Feb 5, 2014 at 2:26 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Then again, why is the behavior of schema-qualifying absolutely everything even desirable? Well, someone could create a collation in another schema with the same name as a system collation and the command would become ambiguous. For example, this command create table f (a text collate POSIX); creates a column whose collation is either public.POSIX or the system POSIX collation, depending on whether public appears before pg_catalog in search_path. Having someone create a POSIX collation in a schema that appears earlier than pg_catalog is pretty far-fetched, but ISTM that we really need to cover that scenario. Hmm, good point. I guess we don't worry much about this with pg_dump because we assume that we're restoring into an empty database (and if not, the user gets to keep both pieces). You're applying a higher standard here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Partial sort
On Wed, Feb 5, 2014 at 6:58 PM, Marti Raudsepp ma...@juffo.org wrote: I ran some synthetic benchmarks with single-column inner joins between 5 tables, with indexes on both joined columns, using only EXPLAIN (so measuring planning time, not execution) in 9 scenarios to excercise different code paths. According to these measurements, the overhead ranges between 1.0 and 4.5% depending on the scenario. Hmm, sounds a little steep. Why is it so expensive? I'm probably missing something here, because I would have thought that planner support for partial sorts would consist mostly of considering the same sorts we consider today, but with the costs reduced by the batching. Changing the cost estimation that way can't be that much more expensive than what we're already doing, so the overhead should be minimal. What the patch is actually doing seems to be something quite a bit more invasive than that, but I'm not sure what it is exactly, or why. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
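For readers following along, the optimization under discussion is easy to state: if the input is already ordered by a leading prefix of the requested sort keys, only each run of equal prefix values needs sorting, not the whole input. A hedged Python illustration of the idea (not the planner/executor code itself):

```python
from itertools import groupby

def partial_sort(rows, n_presorted, key):
    """rows are already sorted by key(row)[:n_presorted]; emit them fully
    sorted by key(row) by sorting one prefix-group (batch) at a time."""
    for _, batch in groupby(rows, key=lambda r: key(r)[:n_presorted]):
        yield from sorted(batch, key=key)

# Input already sorted by the first column; only the second needs work.
rows = [(1, 'b'), (1, 'a'), (2, 'c'), (2, 'a'), (2, 'b')]
print(list(partial_sort(rows, 1, lambda r: r)))
# [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (2, 'c')]
```

Each batch can be far smaller than the full input, which is where the cost reduction Robert mentions comes from: the sort bound drops from n·lg(n) to roughly n·lg(largest batch).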
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 6:59 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Attached is a quick patch for that, if you want to test it. But if we really just want to do prefix/suffix compression, this is a crappy and expensive way to do it. We needn't force everything through the pglz tag format just because we elide a common prefix or suffix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 8:50 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/05/2014 04:48 PM, Amit Kapila wrote: I have done one test where there is a large suffix match, but not large enough that it can compress more than 75% of string, the CPU overhead with wal-update-prefix-suffix-encode-1.patch is not much, but there is no I/O reduction as well. Hmm, it's supposed to compress if you save at least 25%, not 75%. Apparently I got that backwards in the patch... So if I understand the code correctly, the new check should be if (prefixlen + suffixlen < (slen * need_rate) / 100) return false; rather than if (slen - prefixlen - suffixlen < (slen * need_rate) / 100) return false; Please confirm, else any validation for this might not be useful? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
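The direction of the check is easier to verify with a toy model of the prefix/suffix encoder. The following Python sketch (an illustration of the test being debated, not the patch's C code) accepts the encoding only when the matched prefix plus suffix covers at least need_rate percent of the new tuple, i.e. the corrected orientation:

```python
def worth_encoding(old, new, need_rate=25):
    """Find the common prefix and suffix of old and new; return their
    lengths if encoding only the middle saves at least need_rate percent
    of the new tuple, else None."""
    minlen = min(len(old), len(new))
    prefixlen = 0
    while prefixlen < minlen and old[prefixlen] == new[prefixlen]:
        prefixlen += 1
    suffixlen = 0
    while (suffixlen < minlen - prefixlen and
           old[-1 - suffixlen] == new[-1 - suffixlen]):
        suffixlen += 1
    slen = len(new)
    # the corrected check: bail out when savings fall below need_rate%
    if prefixlen + suffixlen < (slen * need_rate) // 100:
        return None
    return prefixlen, suffixlen

print(worth_encoding(b"aaaaXXXXbbbb", b"aaaaYYYYbbbb"))  # (4, 4)
print(worth_encoding(b"abcdefgh", b"stuvwxyz"))          # None
```

With the check written the other way around (slen - prefixlen - suffixlen on the left), the encoder would instead demand that 75% of the tuple match before bothering, which matches the symptom Amit measured.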
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 6:43 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: So, I came up with the attached worst case test, modified from your latest test suite. unpatched: testname | wal_generated | duration --+---+-- ten long fields, all but one changed | 343385312 | 2.20806908607483 ten long fields, all but one changed | 336263592 | 2.18997097015381 ten long fields, all but one changed | 336264504 | 2.17843413352966 (3 rows) pgrb_delta_encoding_v8.patch: testname | wal_generated | duration --+---+-- ten long fields, all but one changed | 338356944 | 3.33501315116882 ten long fields, all but one changed | 344059272 | 3.37364101409912 ten long fields, all but one changed | 336257840 | 3.36244201660156 (3 rows) So with this test, the overhead is very significant. Yuck. Well that sucks. With the skipping logic, another kind of worst case is that you have a lot of similarity between the old and new tuple, but you miss it because you skip. For example, if you change the first few columns, but leave a large text column at the end of the tuple unchanged. I suspect there's no way to have our cake and eat it, too. Most of the work that Amit has done on this patch in the last few revs is to cut back CPU overhead in the cases where the patch can't help because the tuple has been radically modified. If we're trying to get maximum compression, we need to go the other way: for example, we could just feed both the old and new tuples through pglz (or snappy, or whatever). That would allow us to take advantage not only of similarity between the old and new tuples but also internal duplication within either the old or the new tuple, but it would also cost more CPU. The concern with minimizing overhead in cases where the compression doesn't help has thus far pushed us in the opposite direction, namely passing over compression opportunities that a more aggressive algorithm could find in order to keep the overhead low. 
Off-hand, I'm wondering why we shouldn't apply the same skipping algorithm that Amit is using at the beginning of the string for the rest of it as well. It might be a little too aggressive (maybe the skip distance shouldn't increase by quite as much as doubling every time, or not beyond 16/32 bytes?) but I don't see why the general principle isn't sound wherever we are in the tuple. Unfortunately, despite changing things to make a history entry only every 4th character, building the history is still pretty expensive. By the time we even begin looking at the tuple we're gonna compress, we've already spent something like half the total effort, and of course we have to go further than that before we know whether our attempt to compress is actually going anywhere. I think that's the central problem here. pglz has several safeguards to ensure that it doesn't do too much work in vain: we abort if we find nothing compressible within first_success_by bytes, or if we emit enough total output to be certain that we won't meet the need_rate threshold. Those safeguards are a lot less effective here because they can't be applied until *after* we've already paid the cost of building the history. If we could figure out some way to apply those guards, or other guards, earlier in the algorithm, we could do a better job mitigating the worst-case scenarios, but I don't have a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
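The doubling-skip heuristic Robert floats can be sketched in a few lines. This Python illustration is an assumption-laden model of the idea (the stride cap and the probing policy are made up for the example; this is not Amit's patch):

```python
def skip_scan_matches(old, new, max_skip=32):
    """Probe old vs. new at the same offsets.  While bytes match, probe
    every byte; after a mismatch, double the stride (capped at max_skip)
    so dissimilar regions are crossed cheaply.  Returns bytes seen to
    match, an estimate of old/new similarity."""
    i, skip, matched = 0, 1, 0
    n = min(len(old), len(new))
    while i < n:
        if old[i] == new[i]:
            matched += 1
            skip = 1                        # similar region: probe densely
        else:
            skip = min(skip * 2, max_skip)  # dissimilar: probe sparsely
        i += skip
    return matched
```

The trade-off discussed in the thread is visible here: a radically modified tuple is abandoned after only O(n / max_skip) probes, but a tuple whose similarity starts after a long modified region can be stepped over entirely — the "miss it because you skip" worst case.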
Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun
Some more thoughts: Please add comments above _bt_mark_page_halfdead(), a new routine from the dependency patch. I realize that this is substantially similar to part of how _bt_pagedel() used to work, but it's still incongruous. ! Our approach is to create any missing downlinks on-they-fly, when ! searching the tree for a new insertion. It could be done during searches, ! too, but it seems best not to put any extra updates in what would otherwise s/on-they-fly/on-the-fly/ I'm not sure about this: *** _bt_findinsertloc(Relation rel, *** 675,680 --- 701,707 static void _bt_insertonpg(Relation rel, Buffer buf, +Buffer cbuf, BTStack stack, IndexTuple itup, OffsetNumber newitemoff, Wouldn't lcbuf be a better name for the new argument? After all, in relevant contexts 'buf' is the parent of both of the pages post-split, but there are two children in play here. You say: + * When inserting to a non-leaf page, 'cbuf' is the left-sibling of the + * page we're inserting the downlink for. This function will clear the + * INCOMPLETE_SPLIT flag on it, and release the buffer. I suggest also noting in comments at this point that cbuf/the left-sibling is the old half from the split. Even having vented about cbuf's name, I can kind of see why you didn't call it lcbuf: we already have an BTPageOpaque lpageop variable for the special 'buf' metadata within the _bt_insertonpg() function; so there might be an ambiguity between the possibly-soon-to-be-left page (the target page) and the *child* left page that needs to have its flag cleared. Although I still think you should change the name of lpageop (further details follow). Consider this: /* release buffers */ + if (!P_ISLEAF(lpageop)) + _bt_relbuf(rel, cbuf); (Rebased, so this looks a bit different to your original version IIRC). This is checking that the _bt_insertonpg() target page (whose special data lpageop points to) is not a leaf page, and releasing the *child* left page if it isn't. This is pretty confusing. 
So I suggest naming lpageop tpageop ('t' for target, a terminology already used in the comments above the function). Also, I suggest you change this existing code comment: * On entry, we must have the right buffer in which to do the * insertion, and the buffer must be pinned and write-locked. On return, * we will have dropped both the pin and the lock on the buffer. Change right to correct here. Minor point, but right is a loaded word. But why are you doing new things while checking P_ISLEAF(lpageop) in _bt_insertonpg() to begin with? Would you not be better off doing things when there is a child buffer passed (e.g. if (BufferIsValid(cbuf))...), and only then asserting !P_ISLEAF(lpageop)? That seems to me to more naturally establish the context of _bt_insertonpg() is called to insert downlink after split. Although maybe that conflates things within XLOG stuff. Hmm. Perhaps some of these observations could equally well apply to _bt_split(), which is similarly charged with releasing a left child page on inserting a downlink. Particularly around reconsidering the left vs left child vs target terminology. Consider what happens when _bt_split() is passed a (left) child buffer (a 'cbuf' argument). We are: 1. Splitting a page (first phase, which may only be finished lazily some time later). 2. Iff this is a non-leaf page split, simultaneously unmark the flag from some *other* split which we have a child buffer to unmark. Needs to be part of same critical section. Unlock + release cbuf, again only iff target/right split page contains a leaf page. So, at the risk of belaboring the point: * Some callers to _bt_split() and _bt_insertonpg() pass a 'cbuf' that is not invalid. * If either of those 2 functions don't unlock + release those buffers, no one ever will. * Therefore, at the very least we should unlock + release those buffers **just because they're not invalid**. That's a sufficient condition to do so, and attaching that to level is unnecessarily unclear. 
As I said, assert non-leafness. Incidentally, I think that in general we could be clearer on the fact that a root page may also be a leaf page. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Row-security on updatable s.b. views
On 02/04/2014 02:43 PM, Craig Ringer wrote: On 01/30/2014 04:05 PM, Craig Ringer wrote: On 01/30/2014 01:25 PM, Craig Ringer wrote: On 01/29/2014 09:47 PM, Craig Ringer wrote: https://github.com/ringerc/postgres/compare/rls-9.4-upd-sb-views i.e. https://github.com/ringerc/postgres.git , branch rls-9.4-upd-sb-views (subject to rebasing) or the non-rebased tag rls-9.4-upd-sb-views-v2 Pushed an update to the branch. New update tagged rls-9.4-upd-sb-views-v3 . Fixes an issue with rowmarking that stems from the underlying updatable s.b. views patch. Other tests continue to fail, this isn't ready yet. Specifically: - row-security rule recursion detection isn't solved yet, it just overflows the stack. This is now fixed in the newly tagged rls-9.4-upd-sb-views-v4 in g...@github.com:ringerc/postgres.git . Based on Tom's objections, another approach is presented in rls-9.4-upd-sb-views-v5 on g...@github.com:ringerc/postgres.git . The Query node is used to record the recursive expansion parent list instead, and copying is avoided. However, I've separately tracked down the cause of the test failures like: ERROR: could not open file base/16384/30135: No such file or directory This occurs where a row-security qual is declared to use a view. Row-security quals get stored without rewriting (which is necessary, see below). The qual is injected into securityQuals and expanded, but *not rewritten*. So the executor sees an unexpected view in the tree. Because a view RTE has its relid field set to the view's oid, this doesn't get picked up until we try to actually scan the view relation in the executor. (I'd like to add Asserts to make the executor fail a bit more informatively when you try to scan a view, but that's separate.) So, that's clearly broken. There are really two possible solutions: 1. Try (again) to do row-security in the rewriter. 
This was previously impossible because of the definition of row-security behaviour around inheritance, but with the simplified inheritance model now proposed I think it's possible. 2. Invoke RewriteQuery from within expand_security_quals, rewriting the query after security qual expansion. This is only needed for row-security; for updatable s.b. views rewriting has happened with recursion into securityQuals during the original rewrite pass. I suspect that (2) will result in a resounding yuck. So I have to see if I can now turn around *again* and plug row-security into the rewriter after all. That's a pretty frustrating thing to discover in mid-late CF4. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Add CREATE support to event triggers
Robert Haas robertmh...@gmail.com writes: On Wed, Feb 5, 2014 at 2:26 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Then again, why is the behavior of schema-qualifying absolutely everything even desirable? Well, someone could create a collation in another schema with the same name as a system collation and the command would become ambiguous. Hmm, good point. I guess we don't worry much about this with pg_dump because we assume that we're restoring into an empty database (and if not, the user gets to keep both pieces). You're applying a higher standard here. Robert, that's just horsepucky. pg_dump is very careful about schemas. It's also careful to not schema-qualify names unnecessarily, which is an intentional tradeoff to improve readability of the dump --- at the cost that the dump might break if restored into a nonempty database with conflicting objects. In the case of data passed to event triggers, there's a different tradeoff to be made: people will probably value consistency over readability, so always-qualify is probably the right choice here. But in neither case are we being sloppy. regards, tom lane
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 06/02/14 16:59, Robert Haas wrote: On Wed, Feb 5, 2014 at 6:43 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: So, I came up with the attached worst case test, modified from your latest test suite.

unpatched:

               testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
 ten long fields, all but one changed |     343385312 | 2.20806908607483
 ten long fields, all but one changed |     336263592 | 2.18997097015381
 ten long fields, all but one changed |     336264504 | 2.17843413352966
(3 rows)

pgrb_delta_encoding_v8.patch:

               testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
 ten long fields, all but one changed |     338356944 | 3.33501315116882
 ten long fields, all but one changed |     344059272 | 3.37364101409912
 ten long fields, all but one changed |     336257840 | 3.36244201660156
(3 rows)

So with this test, the overhead is very significant. Yuck. Well that sucks. With the skipping logic, another kind of worst case is that you have a lot of similarity between the old and new tuple, but you miss it because you skip. For example, if you change the first few columns, but leave a large text column at the end of the tuple unchanged. I suspect there's no way to have our cake and eat it, too. Most of the work that Amit has done on this patch in the last few revs is to cut back CPU overhead in the cases where the patch can't help because the tuple has been radically modified. If we're trying to get maximum compression, we need to go the other way: for example, we could just feed both the old and new tuples through pglz (or snappy, or whatever). That would allow us to take advantage not only of similarity between the old and new tuples but also internal duplication within either the old or the new tuple, but it would also cost more CPU. The concern with minimizing overhead in cases where the compression doesn't help has thus far pushed us in the opposite direction, namely passing over compression opportunities that a more aggressive algorithm could find in order to keep the overhead low.
Off-hand, I'm wondering why we shouldn't apply the same skipping algorithm that Amit is using at the beginning of the string for the rest of it as well. It might be a little too aggressive (maybe the skip distance shouldn't increase by quite as much as doubling every time, or not beyond 16/32 bytes?) but I don't see why the general principle isn't sound wherever we are in the tuple. Unfortunately, despite changing things to make a history entry only every 4th character, building the history is still pretty expensive. By the time we even begin looking at the tuple we're gonna compress, we've already spent something like half the total effort, and of course we have to go further than that before we know whether our attempt to compress is actually going anywhere. I think that's the central problem here. pglz has several safeguards to ensure that it doesn't do too much work in vain: we abort if we find nothing compressible within first_success_by bytes, or if we emit enough total output to be certain that we won't meet the need_rate threshold. Those safeguards are a lot less effective here because they can't be applied until *after* we've already paid the cost of building the history. If we could figure out some way to apply those guards, or other guards, earlier in the algorithm, we could do a better job mitigating the worst-case scenarios, but I don't have a good idea. Surely the weighting should be done according to the relative scarcity of processing power vs I/O bandwidth? I get the impression that different workloads and hardware configurations may favour conserving one of processor or I/O resources. Would it be feasible to have different logic, depending on the trade-offs identified? Cheers, Gavin