[HACKERS] Support for N synchronous standby servers

2014-08-09 Thread Michael Paquier
Hi all,

Please find attached a patch to add support of synchronous replication
for multiple standby servers. This is controlled by the addition of a
new GUC parameter called synchronous_standby_num, which makes the server
wait for transaction commit to be confirmed on the first N standbys defined
in synchronous_standby_names. The implementation is really
straightforward, and only needed a couple of modifications in
walsender.c (for pg_stat_get_wal_senders) and in syncrep.c.
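
As a quick illustration (my example here, not part of the patch itself),
with the patch applied a configuration like

    synchronous_standby_names = 'node_a, node_b, node_c'
    synchronous_standby_num = 2

makes each commit wait until the first two connected, streaming standbys
from that list have confirmed receipt of its WAL.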

When a commit is cancelled manually by the user, or when ProcDiePending
shows up, the message returned to the user does not list the walsenders
on which the commit has not been confirmed, i.e. on which it is only
partially confirmed. I have not done anything about that, but let me know
if it would be useful. It would need a scan of the walsenders to get
their application_name.

Thanks,
-- 
Michael
From 3dfff90032c38daba43e1e0c4d3221053d6386ac Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Sat, 9 Aug 2014 14:49:24 +0900
Subject: [PATCH] Add parameter synchronous_standby_num

This makes it possible to support synchronous replication on a number of
standby nodes equal to the new parameter. The synchronous standbys are
chosen in the order they are listed in synchronous_standby_names.
---
 doc/src/sgml/config.sgml| 32 ---
 doc/src/sgml/high-availability.sgml | 18 -
 src/backend/replication/syncrep.c   | 81 ++---
 src/backend/replication/walsender.c | 74 -
 src/backend/utils/misc/guc.c| 10 +
 src/include/replication/syncrep.h   |  1 +
 6 files changed, 175 insertions(+), 41 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index be5c25b..c40de16 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2586,12 +2586,13 @@ include_dir 'conf.d'
         Specifies a comma-separated list of standby names that can support
         <firstterm>synchronous replication</>, as described in
         <xref linkend="synchronous-replication">.
-        At any one time there will be at most one active synchronous standby;
-        transactions waiting for commit will be allowed to proceed after
-        this standby server confirms receipt of their data.
-        The synchronous standby will be the first standby named in this list
-        that is both currently connected and streaming data in real-time
-        (as shown by a state of <literal>streaming</literal> in the
+        At any one time there will be at most a number of active synchronous
+        standbys defined by <varname>synchronous_standby_num</>; transactions
+        waiting for commit will be allowed to proceed after those standby
+        servers confirm receipt of their data. The synchronous standbys will
+        be the first entries named in this list that are both currently
+        connected and streaming data in real-time (as shown by a state of
+        <literal>streaming</literal> in the
         <link linkend="monitoring-stats-views-table">
         <literal>pg_stat_replication</></link> view).
         Other standby servers appearing later in this list represent potential
@@ -2627,6 +2628,25 @@ include_dir 'conf.d'
   </listitem>
  </varlistentry>
 
+ <varlistentry id="guc-synchronous-standby-num" xreflabel="synchronous_standby_num">
+  <term><varname>synchronous_standby_num</varname> (<type>integer</type>)
+  <indexterm>
+   <primary><varname>synchronous_standby_num</> configuration parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    Specifies the number of standbys that support
+    <firstterm>synchronous replication</>, as described in
+    <xref linkend="synchronous-replication">, and listed as the first
+    elements of <xref linkend="guc-synchronous-standby-names">.
+   </para>
+   <para>
+    Default value is 1.
+   </para>
+  </listitem>
+ </varlistentry>
+
 <varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
  <term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)
  <indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index d249959..085d51b 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1081,12 +1081,12 @@ primary_slot_name = 'node_a_slot'
     WAL record is then sent to the standby. The standby sends reply
     messages each time a new batch of WAL data is written to disk, unless
     <varname>wal_receiver_status_interval</> is set to zero on the standby.
-    If the standby is the first matching standby, as specified in
-    <varname>synchronous_standby_names</> on the primary, the reply
-    messages from that standby will be used to wake users waiting for
-    confirmation that the commit record has been received. These parameters
-    allow the administrator to specify which standby servers should be
-    synchronous standbys. Note that the configuration of synchronous
+If the 

Re: [HACKERS] PostgreSQL vs oracle doing 1 million sqrts am I doing it wrong?

2014-08-09 Thread Guillaume Lelarge
Hi,

On 9 August 2014 05:57, Ramirez, Danilo danilo.rami...@hmhco.com wrote:

 Thanks to all for the great info.  We are new to postgresql and this
discussion has both instructed us and increased our respect for the
database and the community.

 I am seeing a behavior that I don’t understand and hopefully you guys can
clear it up.

 I am using AWS postgresql db.m3.2xlarge and using pgadmin III 1.18
comparing against AWS oracle on db.m3.2xlarge using sql developer and TOAD.

 I am running a query with 30 tables in the from clause, getting 137
columns back (this is our most basic query, they get a lot more
complex).  It returns 4800 rows.

 In oracle 1st run takes 3.92 seconds, 2nd .38 seconds.  Scrolling to the end
takes an extra 1.5 seconds for a total of 5.5.

 Using pgadmin, I run the query.  Looking at the lower right hand corner I can
see the time going up.  It stops at 8200 ms or close to it every time, then
it takes an extra 6 seconds before it displays the rows on the screen.
 2nd, 3rd, etc. runs all take about the same amount of time: 8 sec plus 6 sec.

 I then changed it to return only 1 column back.   In oracle/sqldeveloper
identical behavior as before, same time.  In postgresql it now goes down to
1.8 seconds for 1st, 2nd, etc. runs.

 I then change it so that I am asking for the sum of 1 column.  In oracle
time goes down to .2 seconds and postgresql now goes down to .2 seconds
also.

 I then change it back to get the full result set and the behavior goes back
to the original: oracle .38 seconds since it's cached, postgresql 8 seconds.


Are you sure those 8 seconds are really postgresql? I'd believe it is more
something like postgresql taking something really low and PgAdmin spending
around 8 seconds displaying it. What I mean is, PgAdmin uses really slow UI
components, and the time it shows is the time to execute the query plus the
time to display the data. IOW, you shouldn't use it to benchmark. You'd be
better off using psql. Or, much better, set log_min_duration_statement to 0
and see exactly how much time postgresql needs to execute the query.
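
For example (the logging parameter needs superuser), something like:

  -- in psql: report client-side elapsed time for each query
  \timing on

  -- on the server: log the execution time of every statement
  SET log_min_duration_statement = 0;

will show how much of those 8 seconds is spent executing the query versus
rendering the result in pgAdmin.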


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-08-09 Thread Tomas Vondra
On 20.7.2014 18:29, Tomas Vondra wrote:
 Attached v9 of the patch. Aside from a few minor fixes, the main  change
 is that this is assumed to be combined with the dense allocation patch.
 
 It also rewrites the ExecHashIncreaseNumBuckets to follow the same
 pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly,
 instead of buckets). It's cleaner and more consistent.
 
 hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need
 to grab the hashjoin-alloc-v4.patch from a different thread and apply it
 first)
 
 hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined

I just noticed that the changes made to ExecHashGetBucketAndBatch may
not be perfectly fine - they do not break the hash join (i.e. the
results are still correct), but I believe they may cause more batches
than necessary. I'm not entirely sure about that, though, or how to fix it.

First, an intro into ExecHashGetBucketAndBatch internals, and how the
patch changes that. If not interested, skip to the problem section below.

The old ExecHashGetBucketAndBatch did this:

  *bucketno = hashvalue & (nbuckets - 1);
  *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

i.e. given the 32-bit hash value, the lowest log2_nbuckets bits are used
to determine the bucket number, and the rest of the hash (after removing
the bits used for buckets) is used for the batch number. Something like this:


 31           23           15            7            0
  |            |            |            |            |
  |            | <- batches |          buckets        |


This worked fine, because the number of bits for buckets was fixed, and
only the number of batches was growing. This was done by adding
most-significant bits (as illustrated by the tiny arrow). So when there
were 4 bits for the batch number, after adding another bit (doubling
nbatches) a batch 'bbbb' would be split either into '0bbbb' or '1bbbb'.

With a dynamic bucket count this does not work, because the batches and
buckets need to be independent (i.e. derived from non-overlapping parts
of the hash). The simplest approach of 'moving batches around' does not
work, because that would break this assert:

   Assert(batchno > curbatch);

In other words, don't move tuples to already-processed batches. So the
batch number for a tuple needs to be sufficiently stable, and may only
ever increase (never decrease).

So what I did is this:


 31           23           15            7            0
  |            |            |            |            |
  | batches -> |            |        <- buckets       |


and this is what happens in ExecHashGetBucketAndBatch:

  *bucketno = hashvalue & (nbuckets - 1);
  *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

So the bucketno expression is exactly the same (but now the nbuckets
may change dynamically), but batchno is now computed from the other
end of the hash (highest bits), and grows by adding a least-significant bit.
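
To spell that out, here is a compact, self-contained restatement of the two
computations (my sketch, not the actual code, which works on the
HashJoinTable struct):

  #include <stdint.h>

  /* Old scheme: buckets come from the lowest log2_nbuckets bits, batches
   * from the bits immediately above them; nbatch grows by adding
   * most-significant bits to batchno. */
  static inline void
  bucket_and_batch_old(uint32_t hashvalue, int log2_nbuckets,
                       uint32_t nbuckets, uint32_t nbatch,
                       uint32_t *bucketno, uint32_t *batchno)
  {
      *bucketno = hashvalue & (nbuckets - 1);
      *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
  }

  /* Patched scheme: buckets still come from the low bits (nbuckets may now
   * change at runtime), batches from the topmost log2_nbatch bits; nbatch
   * grows by adding least-significant bits to batchno. */
  static inline void
  bucket_and_batch_new(uint32_t hashvalue, int log2_nbatch,
                       uint32_t nbuckets,
                       uint32_t *bucketno, uint32_t *batchno)
  {
      *bucketno = hashvalue & (nbuckets - 1);
      *batchno = (log2_nbatch == 0) ? 0 : hashvalue >> (32 - log2_nbatch);
  }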


Problem
---

The original implementation had the nice feature that each increase in the
number of batches (nbatches *= 2) split each batch in half. Half the
tuples stayed in the batch, half the tuples moved to one of the newly
created batches, thanks to adding a most-significant bit. Tuples getting
a 0 bit stayed in the same batch, tuples getting a 1 moved to the new one
(and it was guaranteed to be one of the new ones).

It's slightly more complicated because of the lazy behavior when
repeatedly incrementing the number of batches, but this is the
principle: keep 1/2 of the tuples, move 1/2 to another batch.

The new implementation changes this, because the batch number grows in
the opposite direction - it adds a least-significant bit, so for example
a batch 'bbbb' gets split into 'bbbb1' and 'bbbb0'.

It's roughly equal to

   (batchno << 1) + K

where K is either 0 or 1, which is always >= the old batchno. So it
does not violate the Assert (which is why I haven't noticed this earlier).

But it breaks the desirable behavior that 1/2 the tuples remain in the
current batch, and instead moves a lot of them to batches further down
the road, and needlessly increases the number of batches.

At least that's how I understand it ...
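
A toy example (mine, not from the patch) makes the difference concrete - it
prints where a few hash values land before and after doubling nbatch under
both schemes:

  #include <stdio.h>
  #include <stdint.h>

  int
  main(void)
  {
      /* a few arbitrary 32-bit hash values */
      uint32_t    h[] = {0x12345678u, 0x9abcdef0u, 0x0f0f0f0fu, 0xdeadbeefu};
      int         log2_nbuckets = 10;    /* fixed, as in the old scheme */

      for (int log2_nbatch = 3; log2_nbatch <= 4; log2_nbatch++)
      {
          uint32_t    nbatch = 1u << log2_nbatch;

          printf("nbatch = %u\n", (unsigned) nbatch);
          for (int i = 0; i < 4; i++)
              printf("  h = %08x   old batchno = %u   new batchno = %u\n",
                     (unsigned) h[i],
                     (unsigned) ((h[i] >> log2_nbuckets) & (nbatch - 1)),
                     (unsigned) (h[i] >> (32 - log2_nbatch)));
      }

      /* old scheme: each batch b splits into b and b + old_nbatch
       * new scheme: each batch b splits into 2b and 2b + 1 */
      return 0;
  }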


Possible solutions
--

Adding a least-significant bit does not work, so we need to get back to
adding the most-significant one. I'm not sure what the least complex way
to do that is, though.

I'm thinking about computing the nbuckets limit (how many buckets may
fit into work_mem):

   tupsize = HJTUPLE_OVERHEAD +
             MAXALIGN(sizeof(MinimalTupleData)) +
             MAXALIGN(tupwidth);

   nbuckets_bits_max = my_log2(work_mem / tupsize)

and then start with nbatches from this bit, like this:

 31           23   max      15            7            0
  |            |            |            |            |
  |            | <- batches |    free    |  <- buckets
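
If I'm reading that correctly, the computation would become something like
this (my guess at the intent, not an actual patch):

  #include <stdint.h>

  /* Buckets use the low bits and may grow up to nbuckets_bits_max; batches
   * are taken from the bits starting at that fixed boundary and grow by
   * adding most-significant bits again, restoring the "keep half, move
   * half" split behaviour when nbatch doubles. */
  static inline void
  bucket_and_batch_proposed(uint32_t hashvalue, int nbuckets_bits_max,
                            uint32_t nbuckets, uint32_t nbatch,
                            uint32_t *bucketno, uint32_t *batchno)
  {
      *bucketno = hashvalue & (nbuckets - 1);
      *batchno = (hashvalue >> nbuckets_bits_max) & (nbatch - 1);
  }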

Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-08-09 Thread Andres Freund
On 2014-08-04 10:54:25 -0400, Robert Haas wrote:
 On Thu, Jul 31, 2014 at 9:51 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Even without that issue, there's no consensus that forcibly making
  orphan backends exit would be a good thing.  (Some people would
  like to have such an option, but the key word in that sentence is
  option.)
 
 I believe that multiple people have said multiple times that we should
 change the behavior so that orphaned backends exit immediately; I
 think you are the only one defending the current behavior.  There are
 several problems with the status quo:

+1. I think the current behaviour is a seriously bad idea.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reporting the commit LSN at commit time

2014-08-09 Thread Andres Freund
On 2014-08-07 21:02:54 -0400, Tom Lane wrote:
 Craig Ringer cr...@2ndquadrant.com writes:
  On 08/08/2014 03:54 AM, Tom Lane wrote:
  FWIW, I think it's a seriously bad idea to expose LSNs in the protocol
  at all.   What happens five years from now when we switch to some other
  implementation that doesn't have LSNs?
 
  Everyone who's relying on them already via pg_stat_replication, etc, breaks.
  They're _already_ exposed to users. That ship has sailed.
 
 They're exposed to replication tools, yeah, but embedding them in the
 wire protocol would be moving the goalposts a long way past that.  As an
 example of something that doubtless seemed like a good idea at the time,
 consider the business about how an INSERT command completion tag includes
 the OID of the inserted row.  We're stuck with that obsolete idea
 *forever* because it's embedded in the protocol for all clients.

I don't think we really need to embed it at that level. And it doesn't
have to be always on - only clients that ask for it need to get the
answer. Something like COMMIT WITH (report_commit_lsn ON); or similar
might do the trick?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Defining a foreign key with a duplicate column is broken

2014-08-09 Thread Tom Lane
David Rowley dgrowle...@gmail.com writes:
 On Sat, Aug 9, 2014 at 3:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 There's no need for a new error message I think, because we should just
 ignore such indexes.  After all, there might be a valid matching index
 later on.

 hmm, but if the user attempts to define the foreign key that contains a
 duplicate column in the REFERENCES part, then we'll never find any
 indexes, so there's no point in looking at all.

OK, now that I'm a bit more awake, I agree with that.

 I've attached a version of the patch that's a little smarter when it comes
 to doing the duplicate checks in the attnums array...

Applied with some cosmetic adjustments.  I didn't bother with the
regression test either --- this doesn't seem like something that needs
permanent memorialization as a regression test.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-09 Thread Stephen Frost
Bruce,

* Bruce Momjian (br...@momjian.us) wrote:
 On Fri, Aug  8, 2014 at 08:25:04PM -0400, Stephen Frost wrote:
  * Bruce Momjian (br...@momjian.us) wrote:
   FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4
   betas and report the problem JSONB columns.
  
  That is *not* a good solution..
 
 If you change the JSONB binary format, and we can't read the old format,
 that is the only option.

Apologies - I had thought you were suggesting this for a 9.4 -> 9.5
conversion, not just for 9.4beta to 9.4.  Adding that to pg_upgrade to
address folks upgrading from betas would certainly be fine.

Thanks,

Stephen




Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-08-09 Thread Andres Freund
On 2014-08-09 14:00:49 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-08-04 10:54:25 -0400, Robert Haas wrote:
  I believe that multiple people have said multiple times that we should
  change the behavior so that orphaned backends exit immediately; I
  think you are the only one defending the current behavior.  There are
  several problems with the status quo:
 
  +1. I think the current behaviour is a seriously bad idea.
 
 I don't think it's anywhere near as black-and-white as you guys claim.
 What it comes down to is whether allowing existing transactions/sessions
 to finish is more important than allowing new sessions to start.
 Depending on the application, either could be more important.

Nah. The current behaviour circumvents security measures we normally
consider absolutely essential. If the postmaster died some bad shit went
on. The likelihood of hitting corner case bugs where it's important that
we react to a segfault/panic with a restart/crash replay is rather high.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-08-09 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-08-04 10:54:25 -0400, Robert Haas wrote:
 I believe that multiple people have said multiple times that we should
 change the behavior so that orphaned backends exit immediately; I
 think you are the only one defending the current behavior.  There are
 several problems with the status quo:

 +1. I think the current behaviour is a seriously bad idea.

I don't think it's anywhere near as black-and-white as you guys claim.
What it comes down to is whether allowing existing transactions/sessions
to finish is more important than allowing new sessions to start.
Depending on the application, either could be more important.

Ideally we'd have some way to configure the behavior appropriately for
a given installation; but short of that, it's unclear to me that
unilaterally changing the system's bias is something our users would
thank us for.  I've not noticed a large groundswell of complaints about
it (though this may just reflect that we've made the postmaster pretty
darn robust, so that the case seldom comes up).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-08-09 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-08-09 14:00:49 -0400, Tom Lane wrote:
 I don't think it's anywhere near as black-and-white as you guys claim.
 What it comes down to is whether allowing existing transactions/sessions
 to finish is more important than allowing new sessions to start.
 Depending on the application, either could be more important.

 Nah. The current behaviour circumvents security measures we normally
 consider absolutely essential. If the postmaster died some bad shit went
 on. The likelihood of hitting corner case bugs where it's important that
 we react to a segfault/panic with a restart/crash replay is rather high.

What's your point?  Once a new postmaster starts, it *will* do a crash
restart, because certainly no shutdown checkpoint ever happened.  The
only issue here is what grace period existing orphaned backends are given
to finish their work --- and it's not possible for the answer to that
to be zero, so you don't get to assume that nothing happens in
backend-land after the instant of postmaster crash.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-08-09 Thread Andres Freund
On 2014-08-09 14:09:36 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-08-09 14:00:49 -0400, Tom Lane wrote:
  I don't think it's anywhere near as black-and-white as you guys claim.
  What it comes down to is whether allowing existing transactions/sessions
  to finish is more important than allowing new sessions to start.
  Depending on the application, either could be more important.
 
  Nah. The current behaviour circumvents security measures we normally
  consider absolutely essential. If the postmaster died some bad shit went
  on. The likelihood of hitting corner case bugs where it's important that
  we react to a segfault/panic with a restart/crash replay is rather high.
 
 What's your point?  Once a new postmaster starts, it *will* do a crash
 restart, because certainly no shutdown checkpoint ever happened.

That's not saying much. For one, there can be online checkpoints in that
time. So it's certainly not guaranteed (or even all that likely) that
all the WAL since the incident is replayed.  For another, it can be
*hours* before all the backends finish.

IIRC we'll continue to happily write WAL and everything after postmaster
(and possibly some backends, corrupting shmem) have crashed. The
bgwriter, checkpointer, backends will continue to write dirty buffers to
disk. We'll IIRC continue to write checkpoints.  Those are simply not
things we should be doing after the postmaster has crashed, if we can
avoid it at all.

 The
 only issue here is what grace period existing orphaned backends are given
 to finish their work --- and it's not possible for the answer to that
 to be zero, so you don't get to assume that nothing happens in
 backend-land after the instant of postmaster crash.

Sure. But I don't think a window in the range of seconds comes close to
being the same as a window that easily can be hours.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-09 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 Stephen Frost sfr...@snowman.net writes:

 Trying to move the header to the end just for the sake of this
 doesn't strike me as a good solution as it'll make things quite
 a bit more complicated.

Why is that?  How much harder would it be to add a single offset
field to the front to point to the part we're shifting to the end?
It is not all that unusual to put a directory at the end, like in
the .zip file format.



 Is there a way we could interleave the likely-compressible user
 data in with the header instead?

 Yeah, I was wondering about that too, but I don't immediately see
 how to do it without some sort of preprocessing step when we read
 the object (which'd be morally equivalent to converting a series
 of lengths into a pointer array).

That sounds far more complex and fragile than just moving the 
indexes to the end.
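
Purely to illustrate the shape of that idea (my sketch, not a proposed
on-disk format):

  #include <stdint.h>

  /* A fixed-size field up front records where the directory starts, so the
   * likely-compressible key/value data comes first and the array of
   * offsets/lengths is moved to the end, as in a .zip archive. */
  typedef struct JsonbDirAtEndSketch
  {
      uint32_t    vl_header;      /* varlena length word, as today */
      uint32_t    dir_offset;     /* byte offset of the trailing directory */
      /* ... key and value data, back to back ... */
      /* ... directory: array of JEntry-style offsets/lengths ... */
  } JsonbDirAtEndSketch;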

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-09 Thread Tom Lane
Kevin Grittner kgri...@ymail.com writes:
 Stephen Frost sfr...@snowman.net writes:
 Trying to move the header to the end just for the sake of this
 doesn't strike me as a good solution as it'll make things quite
 a bit more complicated.

 Why is that?  How much harder would it be to add a single offset
 field to the front to point to the part we're shifting to the end?
 It is not all that unusual to put a directory at the end, like in
 the .zip file format.

Yeah, I was wondering that too.  Arguably, directory-at-the-end would
be easier to work with for on-the-fly creation, not that we do any
such thing at the moment.  I think the main thing that's bugging Stephen
is that doing that just to make pglz_compress happy seems like a kluge
(and I have to agree).

Here's a possibly more concrete thing to think about: we may very well
someday want to support JSONB object field or array element extraction
without reading all blocks of a large toasted JSONB value, if the value is
stored external without compression.  We already went to the trouble of
creating analogous logic for substring extraction from a long uncompressed
text or bytea value, so I think this is a plausible future desire.  With
the current format you could imagine grabbing the first TOAST chunk, and
then if you see the header is longer than that you can grab the remainder
of the header without any wasted I/O, and for the array-subscripting case
you'd now have enough info to fetch the element value from the body of
the JSONB without any wasted I/O.  With directory-at-the-end you'd
have to read the first chunk just to get the directory pointer, and this
would most likely not give you any of the directory proper; but at least
you'd know exactly how big the directory is before you go to read it in.
The former case is probably slightly better.  However, if you're doing an
object key lookup rather than an array element fetch, neither of these
formats is really friendly at all, because each binary-search probe
probably requires
bringing in one or two toast chunks from the body of the JSONB value so
you can look at the key text.  I'm not sure if there's a way to redesign
the format to make that less painful/expensive --- but certainly, having
the key texts scattered through the JSONB value doesn't seem like a great
thing from this standpoint.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-08-09 Thread Peter Geoghegan
On Thu, Aug 7, 2014 at 1:09 AM, Peter Geoghegan p...@heroku.com wrote:
 On Wed, Aug 6, 2014 at 10:36 PM, Peter Geoghegan p...@heroku.com wrote:
 This *almost* applies to patched Postgres if you pick a benchmark that
 is very sympathetic to my patch. To my surprise, work_mem = '10MB'
 (which results in an external tape sort) is sometimes snapping at the
 heels of a work_mem = '5GB' setting (which results in an in-memory
 quicksort).

 Note that this was with a default temp_tablespaces setting that wrote
 temp files on my home partition/SSD. With a /dev/shm/ temp tablespace,
 tape sort edges ahead, and has a couple of hundred milliseconds on
 quicksort for this test case. It's actually faster.

I decided to do a benchmark of a very large, sympathetic sort. Again,
this was with 8 byte random text strings, but this time with a total
of 500 million tuples. This single-column table is about 21 GB. This
ran on a dedicated server with 32 GB of ram. I took all the usual
precautions (the cache was warmed, tuples were frozen, etc).

Master took 02:02:20.5561 to sort the data with a 10 MB work_mem
setting, without a ramdisk in temp_tablespaces. With a
temp_tablespaces /dev/shm ramdisk, there was only a very small
improvement that left total execution time at 02:00:58.51878 (while a
later repeat attempt took 02:00:41.385937) -- an improvement hardly
worth bothering with.

Patched Postgres took 00:16:13.228053, again with a work_mem of 10 MB,
but no ramdisk. When I changed temp_tablespaces to use the same
ramdisk, this went down to  00:11:58.77301, a significant improvement.
This is clearly because the data directory was on a filesystem on
spinning disks, and more I/O bandwidth (real or simulated) helps
external sorts. Since this machine only has 32GB of ram, and a
significant proportion of that must be used for shared_buffers (8GB)
and the OS cache, I think there is a fair chance that a more capable
I/O subsystem could be used to get appreciably better results using
otherwise identical hardware.

Comparing like with like, the ramdisk patched run was over 10 times
faster than master with the same ramdisk. While that disparity is
obviously in itself very significant, I think the disparity in how much
faster things got with a ramdisk for the patched run, but not for master,
is also significant.

I'm done with sympathetic cases now. I welcome unsympathetic ones, or
more broadly representative large tests. It's hard to come up with a
benchmark that isn't either sympathetic or very sympathetic, or a
pathological bad case. There is actually a simple enough C program for
generating test input, gensort, which is used by sortbenchmark.org:

http://www.ordinal.com/gensort.html

(I suggest building without support for the SUMP Pump library, by
modifying the Makefile before building)

What's interesting about gensort is that there is a skew option.
Without it, I can generate totally random ASCII keys. But with it,
there is a tendency for there to be a certain amount of redundancy
between keys in their first few bytes. This is intended to limit the
effectiveness of abbreviation-type optimizations for sortbenchmark.org
Daytona Sort entrants:

http://sortbenchmark.org/FAQ-2014.html (Indy vs. Daytona)

Daytona sort is basically a benchmark that focuses on somewhat
commercially representative data (often ASCII data), with sneaky
data-aware tricks forbidden, as opposed to totally artificial
uniformly distributed random binary data, which is acceptable for
their Indy sort benchmark that gets to use every trick in the book.
They note here that Daytona entrants should not be overly dependent
on the uniform and random distribution of key values in the sort
input. They are allowed to be somewhat dependent on it, though - for
one thing, keys are always exactly 10 bytes. They must merely be able
to sort the alternate, skewed-keys input data set in an elapsed time
of no more than twice the elapsed time of the benchmark entry.

It might be useful for Robert or other reviewers to hold the
abbreviated keys patch to a similar standard, possibly by using
gensort, or their own modified version. I've shown a sympathetic case
that is over 10 times faster, with some other less sympathetic cases
that were still pretty good, since tie-breakers were generally able to
get away with a cheap memcmp(). There has also been some tests showing
pathological bad cases for the optimization. The middle ground has now
become interesting, and gensort might offer a half-reasonable way to
generate tests that are balanced. I've looked at the gensort code, and
it seems easy enough to understand and modify for our purposes. You
might want to look at multiple cases with this constant modified, for
example:

#define SKEW_BYTES 6

This constant controls the number of leading bytes that come from a table
of skew bytes.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: [HACKERS] Reporting the commit LSN at commit time

2014-08-09 Thread Craig Ringer
On 08/10/2014 12:54 AM, Andres Freund wrote:
 On 2014-08-07 21:02:54 -0400, Tom Lane wrote:
 Craig Ringer cr...@2ndquadrant.com writes:
 On 08/08/2014 03:54 AM, Tom Lane wrote:
 FWIW, I think it's a seriously bad idea to expose LSNs in the protocol
 at all.   What happens five years from now when we switch to some other
 implementation that doesn't have LSNs?

 Everyone who's relying on them already via pg_stat_replication, etc, breaks.
 They're _already_ exposed to users. That ship has sailed.

 They're exposed to replication tools, yeah, but embedding them in the
 wire protocol would be moving the goalposts a long way past that.  As an
 example of something that doubtless seemed like a good idea at the time,
 consider the business about how an INSERT command completion tag includes
 the OID of the inserted row.  We're stuck with that obsolete idea
 *forever* because it's embedded in the protocol for all clients.
 
 I don't think we really need to embed it at that level. And it doesn't
 have to be always on - only clients that ask for it need to get the
 answer. Something like COMMIT WITH (report_commit_lsn ON); or similar
 might do the trick?

Wouldn't that force client drivers - libpq, psqlODBC, PgJDBC, etc - to
all watch for explicit COMMITs sent by the application and rewrite them?

Applications could also then request the commit option via a driver that
couldn't cope with it - which I think was one of Tom's concerns re using
a GUC, too.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers