[HACKERS] Support for N synchronous standby servers
Hi all,

Please find attached a patch to add support of synchronous replication for multiple standby servers. This is controlled by the addition of a new GUC parameter called synchronous_standby_num, which makes the server wait for transaction commit on the first N standbys defined in synchronous_standby_names. The implementation is really straightforward, needing just a couple of modifications in walsender.c for pg_stat_get_wal_senders, and in syncrep.c.

When a commit is cancelled manually by the user, or when ProcDiePending shows up, the message returned to the user does not show the list of walsenders on which the commit has not been confirmed, even though it has been partially confirmed. I have not done anything for that, but let me know if that would be useful. It would need a scan of the walsenders to get their application_name.

Thanks,
--
Michael

From 3dfff90032c38daba43e1e0c4d3221053d6386ac Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Sat, 9 Aug 2014 14:49:24 +0900
Subject: [PATCH] Add parameter synchronous_standby_num

This makes possible support of synchronous replication on a number of standby nodes equal to the new parameter. The synchronous standbys are chosen in the order they are listed in synchronous_standby_names.
---
 doc/src/sgml/config.sgml            | 32 ---
 doc/src/sgml/high-availability.sgml | 18 -
 src/backend/replication/syncrep.c   | 81 ++---
 src/backend/replication/walsender.c | 74 -
 src/backend/utils/misc/guc.c        | 10 +
 src/include/replication/syncrep.h   |  1 +
 6 files changed, 175 insertions(+), 41 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index be5c25b..c40de16 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2586,12 +2586,13 @@ include_dir 'conf.d'
         Specifies a comma-separated list of standby names that can support
         firsttermsynchronous replication/, as described in
         xref linkend=synchronous-replication.
-At any one time there will be at most one active synchronous standby;
-transactions waiting for commit will be allowed to proceed after
-this standby server confirms receipt of their data.
-The synchronous standby will be the first standby named in this list
-that is both currently connected and streaming data in real-time
-(as shown by a state of literalstreaming/literal in the
+At any one time there will be at most a number of active synchronous standbys
+defined by varnamesynchronous_standby_num/; transactions waiting
+for commit will be allowed to proceed after those standby servers
+confirm receipt of their data. The synchronous standbys will be
+the first entries named in this list that are both currently connected
+and streaming data in real-time (as shown by a state of
+literalstreaming/literal in the
 link linkend=monitoring-stats-views-table
 literalpg_stat_replication//link view).
 Other standby servers appearing later in this list represent potential
@@ -2627,6 +2628,25 @@ include_dir 'conf.d'
     /listitem
    /varlistentry

+   varlistentry id=guc-synchronous-standby-num xreflabel=synchronous_standby_num
+    termvarnamesynchronous_standby_num/varname (typeinteger/type)
+     indexterm
+      primaryvarnamesynchronous_standby_num/ configuration parameter/primary
+     /indexterm
+    /term
+    listitem
+     para
+      Specifies the number of standbys that support
+      firsttermsynchronous replication/, as described in
+      xref linkend=synchronous-replication, and listed as the first
+      elements of xref linkend=guc-synchronous-standby-names.
+     /para
+     para
+      Default value is 1.
+     /para
+    /listitem
+   /varlistentry
+
    varlistentry id=guc-vacuum-defer-cleanup-age xreflabel=vacuum_defer_cleanup_age
     termvarnamevacuum_defer_cleanup_age/varname (typeinteger/type)
     indexterm

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index d249959..085d51b 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1081,12 +1081,12 @@ primary_slot_name = 'node_a_slot'
    WAL record is then sent to the standby. The standby sends reply
    messages each time a new batch of WAL data is written to disk, unless
    varnamewal_receiver_status_interval/ is set to zero on the standby.
-If the standby is the first matching standby, as specified in
-varnamesynchronous_standby_names/ on the primary, the reply
-messages from that standby will be used to wake users waiting for
-confirmation that the commit record has been received. These parameters
-allow the administrator to specify which standby servers should be
-synchronous standbys. Note that the configuration of synchronous
+If the
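For reference, a configuration using the proposed parameter would look something along these lines (a sketch based on the patch description; the standby names are invented):

```
# postgresql.conf on the primary (sketch; node names are made up)
synchronous_standby_names = 'node_a, node_b, node_c'
synchronous_standby_num = 2    # commits wait on the first two connected standbys
```

With this setting, node_a and node_b would normally be the synchronous standbys, and node_c a potential synchronous standby that takes over if one of them disconnects.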
Re: [HACKERS] PostgreSQL vs oracle doing 1 million sqrts am I doing it wrong?
Hi,

On 9 August 2014 at 05:57, Ramirez, Danilo danilo.rami...@hmhco.com wrote: Thanks to all for the great info. We are new to PostgreSQL and this discussion has both instructed us and increased our respect for the database and the community. I am seeing a behavior that I don’t understand and hopefully you guys can clear it up. I am using AWS PostgreSQL db.m3.2xlarge with pgAdmin III 1.18, comparing against AWS Oracle on db.m3.2xlarge using SQL Developer and TOAD. I am running a query with 30 tables in the FROM clause, getting 137 columns back (this is our most basic query; they get a lot more complex). It returns 4800 rows. In Oracle the 1st run takes 3.92 seconds, the 2nd 0.38 seconds. Scrolling to the end takes an extra 1.5 seconds, for a total of 5.5. Using pgAdmin, I run the query. Looking at the lower right hand corner I can see the time going up. It stops at 8200 ms or close to it every time, then it takes an extra 6 seconds before it displays the rows on the screen. 2nd, 3rd, etc. runs all take about the same amount of time: 8 sec plus 6 sec. I then changed it to return only 1 column back. In Oracle/SQL Developer: identical behavior as before, same time. In PostgreSQL it now goes down to 1.8 seconds for the 1st, 2nd, etc. runs. I then changed it so that I am asking for the sum of 1 column. In Oracle the time goes down to 0.2 seconds, and PostgreSQL now also goes down to 0.2 seconds. I then changed it back to get the full result set and the behavior goes back to the original: Oracle 0.38 since it's cached, PostgreSQL 8 seconds.

Are you sure this is PostgreSQL taking 8 seconds? I'd believe it's more something like PostgreSQL taking a really short time and pgAdmin spending around 8 seconds displaying the result. What I mean is, pgAdmin uses really slow UI components, and the time it shows is the time to execute the query and display the data. IOW, you shouldn't use it to benchmark. You'd be better off using psql.
Or, much better, you should set log_min_duration_statement to 0 and see exactly how much time postgresql needs to execute it.
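Concretely, the server-side time can be isolated from pgAdmin's rendering along these lines (a sketch; the log line shown is only an illustration of the format, not real output from this workload):

```
-- In the session (or in postgresql.conf), log the duration of every statement:
SET log_min_duration_statement = 0;

-- After running the query, the server log shows the pure execution time,
-- with none of the client's display overhead, e.g.:
--   LOG:  duration: 123.456 ms  statement: SELECT ...
```

If that logged duration is far below 8 seconds, the difference is client rendering time, not PostgreSQL.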
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 20.7.2014 18:29, Tomas Vondra wrote: Attached v9 of the patch. Aside from a few minor fixes, the main change is that this is assumed to be combined with the dense allocation patch. It also rewrites ExecHashIncreaseNumBuckets to follow the same pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly, instead of buckets). It's cleaner and more consistent. hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need to grab the hashjoin-alloc-v4.patch from a different thread and apply it first. hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined.

I just noticed that the changes made to ExecHashGetBucketAndBatch may not be perfectly fine - they do not break the hash join (i.e. the results are still correct), but I believe they may cause more batches. I'm not entirely sure about that, though, or how to fix it.

First, an intro into the ExecHashGetBucketAndBatch internals, and how the patch changes them. If not interested, skip to the problem section below.

The old ExecHashGetBucketAndBatch did this:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

i.e. given the 32-bit hash value, the lowest log2_nbuckets bits are used to determine the bucket number. The rest of the hash (after removing the bits used for buckets) is used for the batch, i.e. something like this:

    31          23          15          7           0
    |           |           |           |           |
    |   <- batches     |         buckets           |

This worked fine, because the number of bits for buckets was fixed, and only the number of batches was growing. This was done by adding most-significant bits (as illustrated by the tiny arrow). So when there were 4 bits for the batch number, after adding another bit (doubling nbatches) batch '' would be split either into '0' or '1'.

With a dynamic bucket count this does not work, because the batches and buckets need to be independent (i.e. derived from non-overlapping parts of the hash).
The simplest approach of 'moving batches around' does not work, because that would break this assert:

    Assert(batchno >= curbatch);

In other words, don't move tuples to already-processed batches. So the batch number for a tuple needs to be sufficiently stable, and may only ever increase (never decrease). So what I did is this:

    31          23          15          7           0
    |           |           |           |           |
    |   batches ->      |        <- buckets        |

and this is what happens in ExecHashGetBucketAndBatch:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

So the bucketno expression is exactly the same (but now nbuckets may change dynamically), while batchno is now computed from the other end of the hash (the highest bits), and grows by adding a least-significant bit.

Problem
-------

The original implementation had the nice feature that each increase in the number of batches (nbatches *= 2) split each batch in half. Half the tuples stayed in the batch, half the tuples moved to one of the newly created batches, thanks to adding a most-significant bit. The tuples getting a 0 bit stayed in the same batch, tuples getting a 1 moved to the new one (and it was guaranteed to be one of the new ones). It's slightly more complicated because of the lazy behavior when repeatedly incrementing the number of batches, but this is the principle: keep 1/2 the tuples, move 1/2 to another batch.

The new implementation changes this, because the batch number grows in the opposite direction - it adds a least-significant bit, so for example batch '' gets split into '1' and '0'. The new batch number is roughly equal to (batchno << 1) + K, where K is either 0 or 1, which is always >= the old batchno. So it does not violate the Assert (which is why I haven't noticed this earlier). But it breaks the desirable behavior that 1/2 the tuples remain in the current batch, and instead moves a lot of them to batches further down the road, needlessly increasing the number of batches. At least that's how I understand it ...
Possible solutions
------------------

Adding a least-significant bit does not work; we need to get back to adding the most-significant one. I'm not sure what the least complex way to do that is, though. I'm thinking about computing the nbuckets limit (how many buckets may fit into work_mem):

    tupsize = HJTUPLE_OVERHEAD +
              MAXALIGN(sizeof(MinimalTupleData)) +
              MAXALIGN(tupwidth);

    nbuckets_bits_max = my_log2(work_mem / tupsize)

and then starting nbatches from this bit, like this:

    31          23   max    15          7           0
    |           |     |     |           |           |
    |  <- batches    | free |        buckets       |
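To make the two layouts concrete, here is a small standalone Python sketch (not the PostgreSQL C code; the function names and example hash value are made up) of how bucket and batch numbers are carved out of the 32-bit hash in each scheme, and of what happens to a tuple's batch number when nbatches doubles:

```python
def old_scheme(hashvalue, log2_nbuckets, log2_nbatch):
    # buckets come from the lowest bits; batches from the bits just above them
    bucketno = hashvalue & ((1 << log2_nbuckets) - 1)
    batchno = (hashvalue >> log2_nbuckets) & ((1 << log2_nbatch) - 1)
    return bucketno, batchno

def new_scheme(hashvalue, log2_nbuckets, log2_nbatch):
    # buckets still come from the lowest bits, but batches come from the top
    # of the hash, so the bucket count can grow independently
    bucketno = hashvalue & ((1 << log2_nbuckets) - 1)
    batchno = hashvalue >> (32 - log2_nbatch) if log2_nbatch else 0
    return bucketno, batchno

h = 0xDEADBEEF  # an arbitrary 32-bit hash value

# Old scheme: doubling nbatch adds a most-significant bit, so the new batchno
# is either the old one (new bit 0) or old + 2**log2_nbatch (new bit 1).
_, b4 = old_scheme(h, 10, 4)
_, b5 = old_scheme(h, 10, 5)
assert b5 in (b4, b4 + (1 << 4))

# New scheme: doubling nbatch adds a least-significant bit, so the new batchno
# is (old << 1) + K with K in {0, 1} -- always >= the old batchno.
_, n4 = new_scheme(h, 10, 4)
_, n5 = new_scheme(h, 10, 5)
assert n5 in (2 * n4, 2 * n4 + 1)
print(b4, b5, n4, n5)
```

Note how the old scheme keeps the batch number or adds a single high bit (half the tuples stay put), while the new scheme shifts every batch number left, which is what moves tuples to later batches.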
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
On 2014-08-04 10:54:25 -0400, Robert Haas wrote: On Thu, Jul 31, 2014 at 9:51 PM, Tom Lane t...@sss.pgh.pa.us wrote: Even without that issue, there's no consensus that forcibly making orphan backends exit would be a good thing. (Some people would like to have such an option, but the key word in that sentence is option.) I believe that multiple people have said multiple times that we should change the behavior so that orphaned backends exit immediately; I think you are the only one defending the current behavior. There are several problems with the status quo: +1. I think the current behaviour is a seriously bad idea. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reporting the commit LSN at commit time
On 2014-08-07 21:02:54 -0400, Tom Lane wrote: Craig Ringer cr...@2ndquadrant.com writes: On 08/08/2014 03:54 AM, Tom Lane wrote: FWIW, I think it's a seriously bad idea to expose LSNs in the protocol at all. What happens five years from now when we switch to some other implementation that doesn't have LSNs? Everyone who's relying on them already via pg_stat_replication, etc, breaks. They're _already_ exposed to users. That ship has sailed. They're exposed to replication tools, yeah, but embedding them in the wire protocol would be moving the goalposts a long way past that. As an example of something that doubtless seemed like a good idea at the time, consider the business about how an INSERT command completion tag includes the OID of the inserted row. We're stuck with that obsolete idea *forever* because it's embedded in the protocol for all clients. I don't think we really need to embed it at that level. And it doesn't have to be always on - only clients that ask for it need to get the answer. Something like COMMIT WITH (report_commit_lsn ON); or similar might do the trick? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Defining a foreign key with a duplicate column is broken
David Rowley dgrowle...@gmail.com writes: On Sat, Aug 9, 2014 at 3:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: There's no need for a new error message I think, because we should just ignore such indexes. After all, there might be a valid matching index later on. hmm, but if the user attempts to define the foreign key that contains a duplicate column in the REFERENCES part, then we'll never find any indexes, so there's no point in looking at all. OK, now that I'm a bit more awake, I agree with that. I've attached a version of the patch that's a little smarter when it comes to doing the duplicate checks in the attnums array... Applied with some cosmetic adjustments. I didn't bother with the regression test either --- this doesn't seem like something that needs permanent memorialization as a regression test. regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
Bruce,

* Bruce Momjian (br...@momjian.us) wrote: On Fri, Aug 8, 2014 at 08:25:04PM -0400, Stephen Frost wrote: * Bruce Momjian (br...@momjian.us) wrote: FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4 betas and report the problem JSONB columns. That is *not* a good solution.. If you change the JSONB binary format, and we can't read the old format, that is the only option.

Apologies - I had thought you were suggesting this for a 9.4 -> 9.5 conversion, not for just 9.4beta to 9.4. Adding that to pg_upgrade to address folks upgrading from betas would certainly be fine.

Thanks,

Stephen
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
On 2014-08-09 14:00:49 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-08-04 10:54:25 -0400, Robert Haas wrote: I believe that multiple people have said multiple times that we should change the behavior so that orphaned backends exit immediately; I think you are the only one defending the current behavior. There are several problems with the status quo: +1. I think the current behaviour is a seriously bad idea. I don't think it's anywhere near as black-and-white as you guys claim. What it comes down to is whether allowing existing transactions/sessions to finish is more important than allowing new sessions to start. Depending on the application, either could be more important. Nah. The current behaviour circumvents security measures we normally consider absolutely essential. If the postmaster died some bad shit went on. The likelihood of hitting corner case bugs where it's important that we react to a segfault/panic with a restart/crash replay is rather high. Greetings, Andres Freund
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
Andres Freund and...@2ndquadrant.com writes: On 2014-08-04 10:54:25 -0400, Robert Haas wrote: I believe that multiple people have said multiple times that we should change the behavior so that orphaned backends exit immediately; I think you are the only one defending the current behavior. There are several problems with the status quo: +1. I think the current behaviour is a seriously bad idea. I don't think it's anywhere near as black-and-white as you guys claim. What it comes down to is whether allowing existing transactions/sessions to finish is more important than allowing new sessions to start. Depending on the application, either could be more important. Ideally we'd have some way to configure the behavior appropriately for a given installation; but short of that, it's unclear to me that unilaterally changing the system's bias is something our users would thank us for. I've not noticed a large groundswell of complaints about it (though this may just reflect that we've made the postmaster pretty darn robust, so that the case seldom comes up). regards, tom lane
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
Andres Freund and...@2ndquadrant.com writes: On 2014-08-09 14:00:49 -0400, Tom Lane wrote: I don't think it's anywhere near as black-and-white as you guys claim. What it comes down to is whether allowing existing transactions/sessions to finish is more important than allowing new sessions to start. Depending on the application, either could be more important. Nah. The current behaviour circumvents security measures we normally consider absolutely essential. If the postmaster died some bad shit went on. The likelihood of hitting corner case bugs where it's important that we react to a segfault/panic with a restart/crash replay is rather high. What's your point? Once a new postmaster starts, it *will* do a crash restart, because certainly no shutdown checkpoint ever happened. The only issue here is what grace period existing orphaned backends are given to finish their work --- and it's not possible for the answer to that to be zero, so you don't get to assume that nothing happens in backend-land after the instant of postmaster crash. regards, tom lane
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
On 2014-08-09 14:09:36 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-08-09 14:00:49 -0400, Tom Lane wrote: I don't think it's anywhere near as black-and-white as you guys claim. What it comes down to is whether allowing existing transactions/sessions to finish is more important than allowing new sessions to start. Depending on the application, either could be more important. Nah. The current behaviour circumvents security measures we normally consider absolutely essential. If the postmaster died some bad shit went on. The likelihood of hitting corner case bugs where it's important that we react to a segfault/panic with a restart/crash replay is rather high. What's your point? Once a new postmaster starts, it *will* do a crash restart, because certainly no shutdown checkpoint ever happened. That's not saying much. For one, there can be online checkpoints in that time. So it's certainly not guaranteed (or even all that likely) that all the WAL since the incident is replayed. For another, it can be *hours* before all the backends finish. IIRC we'll continue to happily write WAL and everything after the postmaster (and possibly some backends, corrupting shmem) have crashed. The bgwriter, checkpointer, backends will continue to write dirty buffers to disk. We'll IIRC continue to write checkpoints. Those are simply not things we should be doing after the postmaster has crashed, if we can avoid it at all. The only issue here is what grace period existing orphaned backends are given to finish their work --- and it's not possible for the answer to that to be zero, so you don't get to assume that nothing happens in backend-land after the instant of postmaster crash. Sure. But I don't think a window in the range of seconds comes close to being the same as a window that can easily be hours.
Greetings, Andres Freund
Re: [HACKERS] jsonb format is pessimal for toast compression
Tom Lane t...@sss.pgh.pa.us wrote: Stephen Frost sfr...@snowman.net writes: Trying to move the header to the end just for the sake of this doesn't strike me as a good solution as it'll make things quite a bit more complicated. Why is that? How much harder would it be to add a single offset field to the front to point to the part we're shifting to the end? It is not all that unusual to put a directory at the end, like in the .zip file format. Is there a way we could interleave the likely-compressible user data in with the header instead? Yeah, I was wondering about that too, but I don't immediately see how to do it without some sort of preprocessing step when we read the object (which'd be morally equivalent to converting a series of lengths into a pointer array). That sounds far more complex and fragile than just moving the indexes to the end. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] jsonb format is pessimal for toast compression
Kevin Grittner kgri...@ymail.com writes: Stephen Frost sfr...@snowman.net writes: Trying to move the header to the end just for the sake of this doesn't strike me as a good solution as it'll make things quite a bit more complicated. Why is that? How much harder would it be to add a single offset field to the front to point to the part we're shifting to the end? It is not all that unusual to put a directory at the end, like in the .zip file format. Yeah, I was wondering that too. Arguably, directory-at-the-end would be easier to work with for on-the-fly creation, not that we do any such thing at the moment. I think the main thing that's bugging Stephen is that doing that just to make pglz_compress happy seems like a kluge (and I have to agree). Here's a possibly more concrete thing to think about: we may very well someday want to support JSONB object field or array element extraction without reading all blocks of a large toasted JSONB value, if the value is stored external without compression. We already went to the trouble of creating analogous logic for substring extraction from a long uncompressed text or bytea value, so I think this is a plausible future desire. With the current format you could imagine grabbing the first TOAST chunk, and then if you see the header is longer than that you can grab the remainder of the header without any wasted I/O, and for the array-subscripting case you'd now have enough info to fetch the element value from the body of the JSONB without any wasted I/O. With directory-at-the-end you'd have to read the first chunk just to get the directory pointer, and this would most likely not give you any of the directory proper; but at least you'd know exactly how big the directory is before you go to read it in. The former case is probably slightly better. 
However, if you're doing an object key lookup rather than an array element fetch, neither of these formats is really friendly at all, because each binary-search probe probably requires bringing in one or two toast chunks from the body of the JSONB value so you can look at the key text. I'm not sure if there's a way to redesign the format to make that less painful/expensive --- but certainly, having the key texts scattered through the JSONB value doesn't seem like a great thing from this standpoint. regards, tom lane
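The I/O argument above can be made concrete with a toy sketch of the chunk arithmetic (not PostgreSQL code; CHUNK stands in for the real build-time constant, TOAST_MAX_CHUNK_SIZE, which is roughly 2 kB with 8 kB pages):

```python
# Sketch: which TOAST chunks must be read to fetch bytes [start, end) of a
# value stored external and uncompressed. CHUNK is a stand-in for the real
# build-time constant; the exact value doesn't matter for the argument.
CHUNK = 1996

def chunks_needed(start, end):
    first = start // CHUNK
    last = (end - 1) // CHUNK
    return list(range(first, last + 1))

# A header that fits in the first chunk touches only chunk 0, so
# header-at-the-front costs no wasted I/O for an element fetch...
print(chunks_needed(0, 512))

# ...while a directory in the last 4 kB of a 1 MB value lives in the final
# chunks, which must be fetched before anything else can be located.
print(chunks_needed(1_000_000 - 4096, 1_000_000))
```

Either way, a key lookup whose key texts are scattered through the body still pulls in extra chunks per binary-search probe, which is the pain point described above.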
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Aug 7, 2014 at 1:09 AM, Peter Geoghegan p...@heroku.com wrote: On Wed, Aug 6, 2014 at 10:36 PM, Peter Geoghegan p...@heroku.com wrote: This *almost* applies to patched Postgres if you pick a benchmark that is very sympathetic to my patch. To my surprise, work_mem = '10MB' (which results in an external tape sort) is sometimes snapping at the heels of a work_mem = '5GB' setting (which results in an in-memory quicksort). Note that this was with a default temp_tablespaces setting that wrote temp files on my home partition/SSD. With a /dev/shm/ temp tablespace, tape sort edges ahead, and has a couple of hundred milliseconds on quicksort for this test case. It's actually faster. I decided to do a benchmark of a very large, sympathetic sort. Again, this was with 8 byte random text strings, but this time with a total of 500 million tuples. This single-column table is about 21 GB. This ran on a dedicated server with 32 GB of ram. I took all the usual precautions (the cache was warmed, tuples were frozen, etc). Master took 02:02:20.5561 to sort the data with a 10 MB work_mem setting, without a ramdisk in temp_tablespaces. With a temp_tablespaces /dev/shm ramdisk, there was only a very small improvement that left total execution time at 02:00:58.51878 (while a later repeat attempt took 02:00:41.385937) -- an improvement hardly worth bothering with. Patched Postgres took 00:16:13.228053, again with a work_mem of 10 MB, but no ramdisk. When I changed temp_tablespaces to use the same ramdisk, this went down to 00:11:58.77301, a significant improvement. This is clearly because the data directory was on a filesystem on spinning disks, and more I/O bandwidth (real or simulated) helps external sorts. 
Since this machine only has 32GB of ram, and a significant proportion of that must be used for shared_buffers (8GB) and the OS cache, I think there is a fair chance that a more capable I/O subsystem could be used to get appreciably better results using otherwise identical hardware. Comparing like with like, the ramdisk patched run was over 10 times faster than master with the same ramdisk. While that disparity is obviously in itself very significant, I think the disparity in how much faster things were with a ramdisk for patched, but not for master is also significant. I'm done with sympathetic cases now. I welcome unsympathetic ones, or more broadly representative large tests. It's hard to come up with a benchmark that isn't either sympathetic or very sympathetic, or a pathological bad case. There is actually a simple enough C program for generating test input, gensort, which is used by sortbenchmark.org: http://www.ordinal.com/gensort.html (I suggest building without support for the SUMP Pump library, by modifying the Makefile before building) What's interesting about gensort is that there is a skew option. Without it, I can generate totally random ASCII keys. But with it, there is a tendency for there to be a certain amount of redundancy between keys in their first few bytes. This is intended to limit the effectiveness of abbreviation-type optimizations for sortbenchmark.org Daytona Sort entrants: http://sortbenchmark.org/FAQ-2014.html (Indy vs. Daytona) Daytona sort is basically a benchmark that focuses on somewhat commercially representative data (often ASCII data), with sneaky data-aware tricks forbidden, as opposed to totally artificial uniformly distributed random binary data, which is acceptable for their Indy sort benchmark that gets to use every trick in the book. They note here that Daytona entrants should not be overly dependent on the uniform and random distribution of key values in the sort input. 
They are allowed to be somewhat dependent on it, though - for one thing, keys are always exactly 10 bytes. They must merely be able to sort the alternate, skewed-keys input data set in an elapsed time of no more than twice the elapsed time of the benchmark entry. It might be useful for Robert or other reviewers to hold the abbreviated keys patch to a similar standard, possibly by using gensort, or their own modified version. I've shown a sympathetic case that is over 10 times faster, with some other less sympathetic cases that were still pretty good, since tie-breakers were generally able to get away with a cheap memcmp(). There have also been some tests showing pathological bad cases for the optimization. The middle ground has now become interesting, and gensort might offer a half-reasonable way to generate tests that are balanced. I've looked at the gensort code, and it seems easy enough to understand and modify for our purposes. You might want to look at multiple cases with this constant modified, for example:

    #define SKEW_BYTES 6

since this controls the number of leading bytes that come from a table of skew bytes.

-- Peter Geoghegan
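The effect of the skew table on abbreviated keys can be illustrated with a toy generator (this is not gensort's actual algorithm, just a sketch of the idea: leading bytes drawn from a small table, so comparisons on key prefixes frequently tie and a full comparison is needed):

```python
import random

random.seed(42)

SKEW_BYTES = 6  # number of leading bytes drawn from a small skew table
SKEW_TABLE = [bytes(random.choices(range(65, 91), k=SKEW_BYTES))
              for _ in range(16)]

def make_key(skewed):
    # 10-byte uppercase ASCII keys, as in the sort benchmark
    if skewed:
        prefix = random.choice(SKEW_TABLE)
    else:
        prefix = bytes(random.choices(range(65, 91), k=SKEW_BYTES))
    suffix = bytes(random.choices(range(65, 91), k=10 - SKEW_BYTES))
    return prefix + suffix

skewed = [make_key(True) for _ in range(1000)]

# With only 16 possible prefixes, the first SKEW_BYTES bytes collide
# massively, so an abbreviated-keys comparison usually cannot resolve
# the order and has to fall back to comparing the full keys.
distinct_prefixes = len({k[:SKEW_BYTES] for k in skewed})
print(distinct_prefixes)
```

Varying the size of the skew table (or SKEW_BYTES itself) sweeps between the sympathetic fully-random case and the pathological all-ties case, which is exactly the middle ground worth benchmarking.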
Re: [HACKERS] Reporting the commit LSN at commit time
On 08/10/2014 12:54 AM, Andres Freund wrote: On 2014-08-07 21:02:54 -0400, Tom Lane wrote: Craig Ringer cr...@2ndquadrant.com writes: On 08/08/2014 03:54 AM, Tom Lane wrote: FWIW, I think it's a seriously bad idea to expose LSNs in the protocol at all. What happens five years from now when we switch to some other implementation that doesn't have LSNs? Everyone who's relying on them already via pg_stat_replication, etc, breaks. They're _already_ exposed to users. That ship has sailed. They're exposed to replication tools, yeah, but embedding them in the wire protocol would be moving the goalposts a long way past that. As an example of something that doubtless seemed like a good idea at the time, consider the business about how an INSERT command completion tag includes the OID of the inserted row. We're stuck with that obsolete idea *forever* because it's embedded in the protocol for all clients. I don't think we really need to embed it at that level. And it doesn't have to be always on - only clients that ask for it need to get the answer. Something like COMMIT WITH (report_commit_lsn ON); or similar might do the trick? Wouldn't that force client drivers - libpq, psqlODBC, PgJDBC, etc - to all watch for explicit COMMITs sent by the application and rewrite them? Applications could also then request the commit option via a driver that couldn't cope with it - which I think was one of Tom's concerns re using a GUC, too. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services