Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Andres Freund
Hi,

On 2014-01-02 05:26:26 +, Greg Stark wrote:
 2) refetching a row could conceivably end up retrieving different data than
 was present when the row was originally read. (In some cases that might
 actually be the intended behaviour)

That's possible with system columns as well. In the normal cases we'll
only have copied the HeapTuple, not the HeapTupleHeader, so it will be
re-fetched from the (pinned) buffer.

 If this came up earlier I'm sorry but I suppose it's too hard to have a
 function like foo(tab.*) which magically can tell that the record is a heap
 tuple and look in the header? And presumably throw an error if passed a non
 heap tuple.

I don't see how that could be a good API. What happens if you have two
relations in a query?
Even if that weren't a problem, why would this be helpful? Seems
like a poor reinvention of system columns.

Andres

PS: Could you not always include the full quoted message below your
"--" signature?

Greetings,

Andres Freund


-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] more psprintf() use

2014-01-02 Thread Andres Freund
On 2014-01-02 09:49:48 +0200, Heikki Linnakangas wrote:
 On 01/02/2014 05:14 AM, Peter Eisentraut wrote:
 diff --git a/contrib/hstore/hstore_io.c b/contrib/hstore/hstore_io.c
 index 772a5ca..8331a56 100644
 --- a/contrib/hstore/hstore_io.c
 +++ b/contrib/hstore/hstore_io.c
 @@ -1114,11 +1114,7 @@
  HEntry *entries = ARRPTR(in);
 
  if (count == 0)
 -{
 -out = palloc(1);
 -*out = '\0';
 -PG_RETURN_CSTRING(out);
 -}
 +PG_RETURN_CSTRING("");
 
  buflen = 0;
 
 Is it legal to return a constant with PG_RETURN_CSTRING? Grepping around, I
 don't see that being done anywhere else, but there are places that do
 PG_RETURN_CSTRING(pstrdup(constant))...

I don't see why it wouldn't be legal - constant strings have static
storage duration, i.e. the program lifetime. And I can see nothing that
would allow pfree()ing the return value of cstring returning functions
in the general case.
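
To make the two idioms concrete, here is a minimal sketch (foo_out is a
hypothetical cstring-returning function, not code from the patch):

Datum
foo_out(PG_FUNCTION_ARGS)
{
	if (PG_GETARG_BOOL(0))
		PG_RETURN_CSTRING(pstrdup(""));	/* palloc'd copy: always safe to pfree */

	/*
	 * A string literal has static storage duration, so returning it is
	 * valid C - but any caller that assumes the result is palloc'd and
	 * pfree()s it will crash.
	 */
	PG_RETURN_CSTRING("");
}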

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch: show relation and tuple infos of a lock to acquire

2014-01-02 Thread Andres Freund
On 2013-12-31 13:56:53 -0800, Peter Geoghegan wrote:
 ISTM that you should be printing just the value and the unique index
 there, and not any information about the tuple proper. Doing any less
 could be highly confusing.

I agree that the message needs improvement, but I don't agree that we
shouldn't log the tuple's location. If you manually investigate the
situation that's where you'll find the conflicting tuple - I don't see
what we gain by not logging the ctid.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch: show relation and tuple infos of a lock to acquire

2014-01-02 Thread Andres Freund
On 2013-12-31 11:36:36 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  On 31 December 2013 09:12, Christian Kruse christ...@2ndquadrant.com 
  wrote:
  Output with patch:
  
  LOG:  process 24774 acquired ShareLock on transaction 696 after 11688.720 
  ms
  CONTEXT:  relation name: foo (OID 16385)
  tuple (ctid (0,1)): (1)
 
  That is useful info.
 
  I think the message should be changed to say this only, without a context 
  line
 
  LOG:  process 24774 acquired ShareLock on relation foo (OID 16385)
  tuple (0,1) after 11688.720 ms
 
  My reason is that pid 24774 was waiting for a *tuple lock* and it was
  eventually granted, so that's what it should say.
 
 No, that wasn't what it was waiting for, and lying to the user like that
 isn't going to make things better.

Agreed.

 Christian's idea of a context line seems plausible to me.  I don't
 care for this implementation too much --- a global variable?  Ick.

Yea, the data should be stored in ErrorContextCallback.arg instead.

 Make a wrapper function for XactLockTableWait instead, please.

I don't think that'd work out all too nicely - we do the
XactLockTableWait() inside other functions like MultiXactIdWait(),
passing all the detail along for those would end up being ugly. So using
error context callbacks properly seems like the best way in the end.
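
For readers following along, the callback shape being suggested looks
roughly like this (a sketch with hypothetical names, not the patch;
relation and tp come from the caller's context):

typedef struct
{
	Relation	rel;
	ItemPointerData tid;
} XLTWaitInfo;

static void
xact_lock_wait_cb(void *arg)
{
	XLTWaitInfo *info = (XLTWaitInfo *) arg;

	errcontext("while waiting to lock tuple (%u,%u) in relation \"%s\"",
			   ItemPointerGetBlockNumber(&info->tid),
			   ItemPointerGetOffsetNumber(&info->tid),
			   RelationGetRelationName(info->rel));
}

/* at the wait site, with both structs living on the stack: */
XLTWaitInfo info;
ErrorContextCallback cb;

info.rel = relation;
info.tid = tp.t_self;
cb.callback = xact_lock_wait_cb;
cb.arg = &info;
cb.previous = error_context_stack;
error_context_stack = &cb;
XactLockTableWait(xwait);
error_context_stack = cb.previous;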

 And I'm not real sure that showing the whole tuple contents is a good
 thing; I can see people calling that a security leak, not to mention
 that the performance consequences could be dire.

I don't think that the security argument has too much merit given
today's PG - we already log the full tuple for various constraint
violations. And we also accept the performance penalty there.  I don't
see any easy way to select a sensible subset of columns to print here,
and printing the columns is what would make investigating issues around
this so much easier. At the very least we'd need to print the pkey
columns and the columns of the unique key that might have been violated.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] What exactly is drop-index-concurrently-1.spec trying to test?

2014-01-02 Thread Andres Freund
On 2013-12-31 17:14:11 -0500, Tom Lane wrote:
 Peter pointed out in
 http://www.postgresql.org/message-id/527c0fe9.7000...@gmx.net
 that Kyotaro-san's patch to treat unique indexes as satisfying any sort
 condition that they are a prefix of broke the drop-index-concurrently-1
 isolation test.  The latest iterations of the patch respond to that by
 changing the expected output.  However, that seems rather wrongheaded,
 because AFAICT the intent of that part of the test is to demonstrate that
 a seqscan has particular behavior; so if the planner starts generating an
 indexscan instead, the test no longer proves anything of the kind.
 
 What I'm wondering though is what's the point of testing that a concurrent
 DROP INDEX doesn't affect a seqscan?  That seems kinda silly, so it's
 tempting to address the patch's problem by just removing the steps
 involving the getrow_seq query, rather than hacking it up enough so we'd
 still get a seqscan plan.

The point is to show that an index scan returns the same rows a
sequential scan would, even though the index is in the process of being
dropped and has been updated *after* the DROP started. That was broken
at some point.
Now, you could argue that that would also be shown without the
sequential scan, but I think that would make understanding the faulty
output harder.

 I'd have thought the test would be designed to allow
 the DROP to complete and then re-test that the results of the prepared
 query are still sane, but it does no such thing.

We could add a permutation like this, but ISTM that it would just test
plan invalidation?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] proposal: multiple read-write masters in a cluster with wal-streaming synchronization

2014-01-02 Thread Andres Freund
On 2013-12-31 13:51:08 -0800, Mark Dilger wrote:
 The BDR documentation 
 http://wiki.postgresql.org/images/7/75/BDR_Presentation_PGCon2012.pdf
 says,
 
     Physical replication forces us to use just one
  node: multi-master required for write scalability
 
     Physical replication provides best read scalability
 
 I am inclined to agree with the second statement, but
 I think my proposal invalidates the first statement, at
 least for a particular rigorous partitioning over which
 server owns which data.

I think you *massively* underestimate the amount of work implementing
this would require.
For one, you'd need a catalog that is written to on only one server;
you cannot have several nodes writing to the same table, even if it's
to disparate oid ranges. So you'd need to partition the whole catalog
by oid ranges - which would be a major efficiency loss for many, many
cases.

Not to speak of breaking pg_upgrade and noticeably increasing the size
of the catalog due to bigger oids and additional relations.

 So for me, multi-master with physical replication seems
 possible, and would presumably provide the best
 read scalability.

What you describe isn't really multi master though, as every row can
only be written to by a single node (the owner).

Also, why would this have a better read scalability? Whether a row is
written by streaming rep or not doesn't influence read speed.

 Or I can use logical replication such as BDR, but then the servers
 are spending more effort than with physical replication,
 so I get less bang for the buck when I purchase more
 servers to add to the cluster.

The efficiency difference really doesn't have to be big if done right. If
you're so write-heavy that the difference is becoming a problem you
wouldn't implement a shared-everything architecture anyway.

 Am I missing something here?  Does BDR really provide
 an equivalent solution?

Not yet, but the plan is to get there.
 
 Second, it seems that BDR leaves to the client the responsibility
 for making schemas the same everywhere.  Perhaps this is just
 a limitation of the implementation so far, which will be resolved
 in the future?

Hopefully something that's going to get lifted.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] Store Extension Options

2014-01-02 Thread Andres Freund
On 2013-12-31 13:37:59 +0100, Pavel Stehule wrote:
   We use the namespace ext so the internal code
  (src/backend/access/common/reloptions.c) skips some validations and stores
  the custom GUC.
 
  Do you think we don't need to use the ext namespace?
 
 
 yes - there should be the same mechanism as we use for GUCs

There is no existing mechanism to handle conflicts for GUCs. The
difference is that nearly no namespaced GUCs exist (plperl and plpgsql
have some), but postgres defines at least the autovacuum. and toast.
namespaces for relation options.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Show lossy heap block info in EXPLAIN ANALYZE for bitmap heap scan

2014-01-02 Thread Andres Freund
On 2014-01-01 21:15:46 -0500, Robert Haas wrote:
 [ sensible reasoning ] However, I'm not sure it's really worth it.
 I think what people really care about is knowing whether the bitmap
 lossified or not, and generally how much got lossified.  The counts of
 exact and lossy pages are sufficient for that, without anything
 additional

Showing the amount of memory currently required could tell you how soon
accurate bitmap scans will turn into lossy scans though. Which is not a
bad thing to know, some kinds of scans (e.g. tsearch over expression
indexes, postgis) can get ridiculously slow once lossy.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch: show relation and tuple infos of a lock to acquire

2014-01-02 Thread Peter Geoghegan
On Thu, Jan 2, 2014 at 12:56 AM, Andres Freund and...@2ndquadrant.com wrote:
 I agree that the message needs improvement, but I don't agree that we
 shouldn't log the tuple's location. If you manually investigate the
 situation that's where you'll find the conflicting tuple - I don't see
 what we gain by not logging the ctid.

I suppose so, but the tuple probably isn't going to be visible anyway,
at least when the message is initially logged.


-- 
Peter Geoghegan




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-02 Thread Andres Freund
On 2013-12-27 14:11:44 -0800, Peter Geoghegan wrote:
 On Fri, Dec 27, 2013 at 12:57 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 I don't think the current syntax the feature implements can be used as
 the sole argument for what the feature should be able to support.
 
 If you think from the angle of an async MM replication solution
 replicating a table with multiple unique keys, not having to specify a
 single index we expect conflicts from is surely helpful.
 
 Well, you're not totally on your own for something like that with this
 feature. You can project the conflicter's tid, and possibly do a more
 sophisticated recovery, like inspecting the locked row and iterating.

Yea, but in that case I *do* conflict with more than one index and old
values need to stay locked. Otherwise anything resembling
forward-progress guarantee is lost.

 That's probably not at all ideal, but then I can't even imagine what
 the best interface for what you describe here looks like. If there are
 multiple conflicts, do you delete or update some or all of them? How
 do you express that concept from a DML statement?

For my use cases just getting the tid back is fine - it's in C
anyway. But I'd rather be in a position to do it from SQL as well...

If there are multiple conflicts the conflicting row should be
updated. If we didn't release the value locks on the individual indexes,
we can know beforehand whether only one row is going to be affected. If
there really are more than one, error out.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch: show relation and tuple infos of a lock to acquire

2014-01-02 Thread Andres Freund
On 2014-01-02 01:40:38 -0800, Peter Geoghegan wrote:
 On Thu, Jan 2, 2014 at 12:56 AM, Andres Freund and...@2ndquadrant.com wrote:
  I agree that the message needs improvement, but I don't agree that we
  shouldn't log the tuple's location. If you manually investigate the
  situation that's where you'll find the conflicting tuple - I don't see
  what we gain by not logging the ctid.
 
 I suppose so, but the tuple probably isn't going to be visible anyway,
 at least when the message is initially logged.

But that's the case for all these, otherwise we wouldn't wait on the
row.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-02 Thread Peter Geoghegan
On Thu, Jan 2, 2014 at 1:49 AM, Andres Freund and...@2ndquadrant.com wrote:
 Well, you're not totally on your own for something like that with this
 feature. You can project the conflicter's tid, and possibly do a more
 sophisticated recovery, like inspecting the locked row and iterating.

 Yea, but in that case I *do* conflict with more than one index and old
 values need to stay locked. Otherwise anything resembling
 forward-progress guarantee is lost.

I'm not sure I understand. In a very real sense they do stay locked.
What is insufficient about locking the definitively visible row with
the value, rather than the value itself? What distinction are you
making? On the first conflict you can delete the row you locked, and
then re-try, possibly further merging some stuff from the just-deleted
row when you next upsert.

It's possible that an earlier unique index value that is unlocked
before row locking proceeds will get a new would-be duplicate after
you're returned a locked row, but it's not obvious that that's a
problem for your use-case (a problem that can't be worked around), or
that promise tuples get you anything better.

 That's probably not at all ideal, but then I can't even imagine what
 the best interface for what you describe here looks like. If there are
 multiple conflicts, do you delete or update some or all of them? How
 do you express that concept from a DML statement?

  For my use cases just getting the tid back is fine - it's in C
 anyway. But I'd rather be in a position to do it from SQL as well...

I believe you can.

-- 
Peter Geoghegan




Re: [HACKERS] [PATCH] Store Extension Options

2014-01-02 Thread Fabrízio de Royes Mello
On Thu, Jan 2, 2014 at 7:19 AM, Andres Freund and...@2ndquadrant.com
wrote:

 On 2013-12-31 13:37:59 +0100, Pavel Stehule wrote:
We use the namespace ext to the internal code
   (src/backend/access/common/reloptions.c) skip some validations and
store
   the custom GUC.
  
   Do you think we don't need to use the ext namespace?
  
 
  yes - there be same mechanism as we use for GUC

 There is no existing mechanism to handle conflicts for GUCs. The
 difference is that for GUCs nearly no namespaced GUCs exist (plperl,
 plpgsql have some), but postgres defines at least autovacuum. and
 toast. namespaces for relation options.


autovacuum. namespace ???

The HEAP_RELOPT_NAMESPACES constant (src/include/access/reloptions.h)
defines only toast and null as valid relation option namespaces.

Am I missing something?
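
For reference, the definition being cited reads:

#define HEAP_RELOPT_NAMESPACES { "toast", NULL }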

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog sobre TI: http://fabriziomello.blogspot.com
 Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello


Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-02 Thread Andres Freund
On 2014-01-02 02:20:02 -0800, Peter Geoghegan wrote:
 On Thu, Jan 2, 2014 at 1:49 AM, Andres Freund and...@2ndquadrant.com wrote:
  Well, you're not totally on your own for something like that with this
  feature. You can project the conflicter's tid, and possibly do a more
  sophisticated recovery, like inspecting the locked row and iterating.
 
  Yea, but in that case I *do* conflict with more than one index and old
  values need to stay locked. Otherwise anything resembling
  forward-progress guarantee is lost.
 
 I'm not sure I understand. In a very real sense they do stay locked.
 What is insufficient about locking the definitively visible row with
 the value, rather than the value itself?

Locking the definitely visible row only works if there's a row matching
the index's columns. If the values of the new row don't have
corresponding values in all the indexes you have the same old race
conditions again.
I think to be useful for many cases you really need to be able to ask
for a potentially conflicting row and be sure that if there's none you
are able to insert the row separately.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] Store Extension Options

2014-01-02 Thread Andres Freund
On 2014-01-02 08:26:20 -0200, Fabrízio de Royes Mello wrote:
 On Thu, Jan 2, 2014 at 7:19 AM, Andres Freund and...@2ndquadrant.com
 wrote:
 
  On 2013-12-31 13:37:59 +0100, Pavel Stehule wrote:
 We use the namespace ext to the internal code
(src/backend/access/common/reloptions.c) skip some validations and
 store
the custom GUC.
   
Do you think we don't need to use the ext namespace?
   
  
   yes - there be same mechanism as we use for GUC
 
  There is no existing mechanism to handle conflicts for GUCs. The
  difference is that for GUCs nearly no namespaced GUCs exist (plperl,
  plpgsql have some), but postgres defines at least autovacuum. and
  toast. namespaces for relation options.
 
 
 autovacuum. namespace ???

Yea, right, it's autovacuum_...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch: show relation and tuple infos of a lock to acquire

2014-01-02 Thread Christian Kruse
Hi,

On 02/01/14 10:02, Andres Freund wrote:
  Christian's idea of a context line seems plausible to me.  I don't
  care for this implementation too much --- a global variable?  Ick.
 
 Yea, the data should be stored in ErrorContextCallback.arg instead.

Fixed.

I also palloc() the ErrorContextCallback itself. But it doesn't feel
right since I could not find a piece of code doing so. What do you
think?

  Make a wrapper function for XactLockTableWait instead, please.
 
 I don't think that'd work out all too nicely - we do the
 XactLockTableWait() inside other functions like MultiXactIdWait(),
 passing all the detail along for those would end up being ugly. So using
 error context callbacks properly seems like the best way in the end.

Wrapping XactLockTableWait()/MultiXactIdWait() at least allows the
ErrorContextCallback and ErrorContextCallback.arg to live on the
stack. So what we could do is wrap this in a macro instead of a
function (like e.g. PG_TRY) or write two different functions. And I
don't like the two functions approach since it means duplication of
code.

While writing this response I've come to think that a PG_TRY-style
macro for setting up and cleaning up the error context callback would
be the best way to go.
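
Roughly like this, reusing the callback/arg pair sketched earlier in
this digest (hypothetical names; a sketch, not the patch):

#define XACT_LOCK_TABLE_WAIT_WITH_INFO(rel, tup, xid) \
	do { \
		XLTWaitInfo _info; \
		ErrorContextCallback _cb; \
		_info.rel = (rel); \
		_info.tid = (tup)->t_self; \
		_cb.callback = xact_lock_wait_cb; \
		_cb.arg = &_info; \
		_cb.previous = error_context_stack; \
		error_context_stack = &_cb; \
		XactLockTableWait(xid); \
		error_context_stack = _cb.previous; \
	} while (0)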

  And I'm not real sure that showing the whole tuple contents is a good
  thing; I can see people calling that a security leak, not to mention
  that the performance consequences could be dire.
 
 I don't think that the security argument has too much merit given
 today's PG - we already log the full tuple for various constraint
 violations. And we also accept the performance penalty there.  I don't
 see any easy way to select a sensible subset of columns to print here,
 and printing the columns is what would make investigating issues around
 this so much easier. At the very least we'd need to print the pkey
 columns and the columns of the unique key that might have been violated.

I agree. Even printing only the PK values would be much more
complex. As far as I can see we would even have to gain another lock
for this (see comment for relationHasPrimaryKey() in
src/backend/catalog/index.c).

Regards,

-- 
 Christian Kruse   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f8545c1..cfa49c2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2702,9 +2702,11 @@ l1:
 
  		if (infomask & HEAP_XMAX_IS_MULTI)
 		{
+			XactLockTableWaitSetupErrorContextCallback(relation, tp);
 			/* wait for multixact */
 			MultiXactIdWait((MultiXactId) xwait, MultiXactStatusUpdate,
 			NULL, infomask);
+			XactLockTableWaitCleanupErrorContextCallback();
 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 			/*
@@ -2730,7 +2732,10 @@ l1:
 		else
 		{
 			/* wait for regular transaction to end */
+			XactLockTableWaitSetupErrorContextCallback(relation, tp);
+
 			XactLockTableWait(xwait);
+			XactLockTableWaitCleanupErrorContextCallback();
 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 			/*
@@ -3254,9 +3259,11 @@ l2:
 			TransactionId update_xact;
 			int			remain;
 
+			XactLockTableWaitSetupErrorContextCallback(relation, oldtup);
 			/* wait for multixact */
  			MultiXactIdWait((MultiXactId) xwait, mxact_status, &remain,
 			infomask);
+			XactLockTableWaitCleanupErrorContextCallback();
 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 			/*
@@ -3330,7 +3337,10 @@ l2:
 			else
 			{
 /* wait for regular transaction to end */
+XactLockTableWaitSetupErrorContextCallback(relation, oldtup);
+
 XactLockTableWait(xwait);
+XactLockTableWaitCleanupErrorContextCallback();
 LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 /*
@@ -4398,7 +4408,11 @@ l3:
  		RelationGetRelationName(relation))));
 }
 else
+{
+	XactLockTableWaitSetupErrorContextCallback(relation, tuple);
 	MultiXactIdWait((MultiXactId) xwait, status, NULL, infomask);
+	XactLockTableWaitCleanupErrorContextCallback();
+}
 
 /* if there are updates, follow the update chain */
  if (follow_updates &&
@@ -4453,7 +4467,11 @@ l3:
  		RelationGetRelationName(relation))));
 }
 else
+{
+	XactLockTableWaitSetupErrorContextCallback(relation, tuple);
 	XactLockTableWait(xwait);
+	XactLockTableWaitCleanupErrorContextCallback();
+}
 
 /* if there are updates, follow the update chain */
  if (follow_updates &&
@@ -5139,9 +5157,14 @@ l4:
 
 	if (needwait)
 	{
+		XactLockTableWaitSetupErrorContextCallback(rel, mytup);
+
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		XactLockTableWait(members[i].xid);
 		pfree(members);
+
+		XactLockTableWaitCleanupErrorContextCallback();
+
 		goto l4;
 	}
 	if (res != HeapTupleMayBeUpdated)
@@ -5199,8 +5222,13 @@ l4:
   &needwait);
 if (needwait)
 {
+	

Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-02 Thread David Rowley
On Fri, Dec 27, 2013 at 1:36 AM, David Rowley dgrowle...@gmail.com wrote:

 From what I can tell adding an inverse transition function to support AVG
 for numeric does not affect the number of trailing zeros in the results, so
 I've attached a patch which now has inverse transition functions to support
 numeric types in AVG and all of the STDDEV* aggregates.


here's a slightly updated patch in which I've fixed up a comment that had
become outdated because of the patch. I also changed the wording in quite
a few of my comments and made some changes to the docs. The only other
change was an extra error check just in case window_gettupleslot was to
fail to get a tuple that should always be in the tuple store.

I'm no longer classing the patch as WIP. I think it's ready for the
commitfest.

Regards

David Rowley


inverse_transition_functions_v1.8.patch.gz
Description: GNU Zip compressed data



[HACKERS] [PATCH] Remove some duplicate if conditions

2014-01-02 Thread David Rowley
I've attached a simple patch which removes some duplicate if conditions
that seemed to have found their way into the code.

These are per PVS-Studio's warnings.

Regards

David Rowley


duplicate_if_test.patch
Description: Binary data



Re: [HACKERS] Fixing pg_basebackup with tablespaces found in $PGDATA

2014-01-02 Thread Magnus Hagander
On Wed, Jan 1, 2014 at 11:53 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:

 Hi,

 As much as I've seen people frown upon $subject, it still happens in the
 wild, and Magnus seems to agree that the current failure mode of our
 pg_basebackup tool when confronted to the situation is a bug.

 So here's a fix, attached.

 To reproduce, mkdir -p $PGDATA/tbs/foo then CREATE TABLESPACE there, and
 then pg_basebackup your server. If doing so from the same server, as I
 did, then pick the tar format, as here:

   pg_basebackup -Ft -z -c fast -v -X fetch -D /tmp/backup

 Then use tar to see that the base backup contains the whole content of
 your foo tablespace, and if you did create another tablespace within
 $PGDATA/pg_tblspc (which is the other common way to trigger that issue)
 then add it to what you want to see:

   tar tzvf /tmp/backup/base.tar.gz pg_tblspc tbs/foo pg_tblspc/bar

 Note that empty directories are expected, so tar should output their
 entries. Those directories are where you need to be restoring the
 tablespace tarballs.

 When using pg_basebackup in plain mode, the error is that you get a copy
 of all your tablespaces first, then the main PGDATA is copied over, and
 as the destination directories already exist (and are not empty) the
 whole backup fails there.

 The bug should be fixed against all revisions of pg_basebackup, though I
 didn't try to apply this very patch on all target branches.


I had a second round of thought about this, and I don't think this one is
going to work. Unfortunately, it's part of the advice I gave you yesterday
that was bad, I think :)

We can't get away with just comparing the relative part of the pathname,
because it will fail if there is another path with exactly the same
length containing the tablespace.

I think we might want to store a value in the tablespaceinfo struct
indicating whether it's actually inside PGDATA (since we have the full path
at that point), and then skip it based on that instead. Or store and pass
the value of getcwd() perhaps.

Or am I thinking wrong now instead? :)

I've attached a slightly updated patch - I changed around a bit of logic
order and updated some comments during my review. And added error-checking.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
*** a/src/backend/replication/basebackup.c
--- b/src/backend/replication/basebackup.c
***
*** 45,51  typedef struct
  } basebackup_options;
  
  
! static int64 sendDir(char *path, int basepathlen, bool sizeonly);
  static int64 sendTablespace(char *path, bool sizeonly);
  static bool sendFile(char *readfilename, char *tarfilename,
  		 struct stat * statbuf, bool missing_ok);
--- 45,52 
  } basebackup_options;
  
  
! static int64 sendDir(char *path, int basepathlen, int rootpathlen,
! 	 bool sizeonly, List *tablespaces);
  static int64 sendTablespace(char *path, bool sizeonly);
  static bool sendFile(char *readfilename, char *tarfilename,
  		 struct stat * statbuf, bool missing_ok);
***
*** 100,105  perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
--- 101,119 
  	XLogRecPtr	endptr;
  	TimeLineID	endtli;
  	char	   *labelfile;
+ 	char	cwd[MAXPGPATH];
+ 	int rootpathlen;
+ 
+ 	/*
+ 	 * We need to compute rootpathlen to allow for skipping tablespaces
+ 	 * installed within PGDATA.
+ 	 */
+ 	if (!getcwd(cwd, MAXPGPATH))
+ 		ereport(ERROR,
+ (errcode_for_file_access(),
+  errmsg("could not determine current directory: %m")));
+ 
+ 	rootpathlen = strlen(cwd) + 1;
  
  	backup_started_in_recovery = RecoveryInProgress();
  
***
*** 165,171  perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
  
  		/* Add a node for the base directory at the end */
  		ti = palloc0(sizeof(tablespaceinfo));
! 		ti->size = opt->progress ? sendDir(".", 1, true) : -1;
  		tablespaces = lappend(tablespaces, ti);
  
  		/* Send tablespace header */
--- 179,186 
  
  		/* Add a node for the base directory at the end */
  		ti = palloc0(sizeof(tablespaceinfo));
! 		ti->size = opt->progress ?
! 			sendDir(".", 1, rootpathlen, true, tablespaces) : -1;
  		tablespaces = lappend(tablespaces, ti);
  
  		/* Send tablespace header */
***
*** 191,197  perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
  sendFileWithContent(BACKUP_LABEL_FILE, labelfile);
  
  /* ... then the bulk of the files ... */
! sendDir(".", 1, false);
  
  /* ... and pg_control after everything else. */
  if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
--- 206,212 
  sendFileWithContent(BACKUP_LABEL_FILE, labelfile);
  
  /* ... then the bulk of the files ... */
! sendDir(".", 1, rootpathlen, false, tablespaces);
  
  /* ... and pg_control after everything else. */
  if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
***
*** 778,785  sendTablespace(char *path, bool sizeonly)
  		_tarWriteHeader(TABLESPACE_VERSION_DIRECTORY, NULL, 

[HACKERS] ERROR: missing chunk number 0 for toast value

2014-01-02 Thread Rushabh Lathia
Hi All,

Test case:

drop table if exists t;
create table t(c text);
insert into t values ('x'), (repeat(md5('abcdefghijklmnop'), 10000));
select pg_column_size(c), pg_column_size(c || '') FROM t;

CREATE OR REPLACE FUNCTION copy_toast_out() RETURNS VOID AS $$
declare
v text;
BEGIN
SELECT c INTO v FROM t WHERE c <> 'x';
Select 1/0;
Exception
When Others Then
PERFORM pg_sleep(30); -- go run TRUNCATE t in a 2nd session


raise notice 'length :%', length(v || ''); -- force detoast


END;
$$ language plpgsql;

postgres=# select copy_toast_out();
ERROR:  missing chunk number 0 for toast value 16390 in pg_toast_16384
CONTEXT:  PL/pgSQL function copy_toast_out() line 10 at RAISE

Analysis:

The basic problem here is that if the lock is released on the table before
extracting the toasted value, and in the meantime someone truncates the
table, this error can occur.  Here the error shows up when the PL block
contains an exception block, because in that case plpgsql calls
RollbackAndReleaseCurrentSubTransaction.

Do you think we should detoast the local variable before
RollbackAndReleaseCurrentSubTransaction? Or are there other options?
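
To frame the first option, the fix being asked about might look roughly
like this in plpgsql's subtransaction cleanup path (a sketch only; the
placement and the surrounding loop over the function's variables are
assumed, and var is a PLpgSQL_var):

/*
 * Before calling RollbackAndReleaseCurrentSubTransaction(), flatten
 * any toast pointer held in a local variable so it doesn't dangle
 * once the table can be truncated underneath us.
 */
if (!var->isnull && var->datatype->typlen == -1 &&
	VARATT_IS_EXTERNAL(DatumGetPointer(var->value)))
	var->value = PointerGetDatum(PG_DETOAST_DATUM_COPY(var->value));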

Regards,
Rushabh Lathia
www.EnterpriseDB.com


Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Robert Haas
On Thu, Jan 2, 2014 at 12:26 AM, Greg Stark st...@mit.edu wrote:
 I fail to see why.  What is so ugly about this:

 select x.* from pgbench_accounts a, pg_tuple_header(a.tableoid, a.ctid) x;

 Two points:

 1) it's a bit weird to go to this effort to eliminate system columns by
 using a scheme that depends on having a system column -- ctid

 2) refetching a row could conceivably end up retrieving different data than
 was present when the row was originally read. (In some cases that might
 actually be the intended behaviour)

 If this came up earlier I'm sorry but I suppose it's too hard to have a
 function like foo(tab.*) which magically can tell that the record is a heap
 tuple and look in the header? And presumably throw an error if passed a non
 heap tuple.

Yeah, that was tried.  Doesn't work:

http://www.postgresql.org/message-id/ca+tgmozwzzw_wr8adhwc8iejweihxlpxranqyrfb8mw0sch...@mail.gmail.com

At any rate, my goal isn't really to get rid of system columns, but to
address Jim Nasby's gripe that the changes thus far committed make it
difficult to use system columns to determine whether a given tuple has
been frozen.  It sounds like the consensus is that a system column is
a better way of doing that than a function, so I suppose the next step
is to try to hammer out the details.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Andres Freund
On 2014-01-02 07:35:11 -0500, Robert Haas wrote:
 On Thu, Jan 2, 2014 at 12:26 AM, Greg Stark st...@mit.edu wrote:
  I fail to see why.  What is so ugly about this:
 
  select x.* from pgbench_accounts a, pg_tuple_header(a.tableoid, a.ctid) x;
 
  Two points:
 
  1) it's a bit weird to go to this effort to eliminate system columns by
  using a scheme that depends on having a system column -- ctid
 
  2) refetching a row could conceivably end up retrieving different data than
  was present when the row was originally read. (In some cases that might
  actually be the intended behaviour)
 
  If this came up earlier I'm sorry but I suppose it's too hard to have a
  function like foo(tab.*) which magically can tell that the record is a heap
  tuple and look in the header? And presumably throw an error if passed a non
  heap tuple.
 
 Yeah, that was tried.  Doesn't work:
 
 http://www.postgresql.org/message-id/ca+tgmozwzzw_wr8adhwc8iejweihxlpxranqyrfb8mw0sch...@mail.gmail.com
 
 At any rate, my goal isn't really to get rid of system columns, but to
 address Jim Nasby's gripe that the changes thus far committed make it
 difficult to use system columns to determine whether a given tuple has
 been frozen.  It sounds like the consensus is that a system column is
 a better way of doing that than a function, so I suppose the next step
 is to try to hammer out the details.

So, I think something like my POC patch would need to get in before we
can go there, right? I won't be able to work on it in the next 10 days or
so, there's a bunch of stuff that's more pressing unfortunately. There's
a fair bit of work left there...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Show lossy heap block info in EXPLAIN ANALYZE for bitmap heap scan

2014-01-02 Thread Robert Haas
On Thu, Jan 2, 2014 at 4:27 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-01-01 21:15:46 -0500, Robert Haas wrote:
 [ sensible reasoning ] However, I'm not sure it's really worth it.
 I think what people really care about is knowing whether the bitmap
 lossified or not, and generally how much got lossified.  The counts of
 exact and lossy pages are sufficient for that, without anything
 additional

 Showing the amount of memory currently required could tell you how soon
 accurate bitmap scans will turn into lossy scans though. Which is not a
 bad thing to know, some kinds of scans (e.g. tsearch over expression
 indexes, postgis) can get ridiculously slow once lossy.

Hmm, interesting.  I have not encountered that myself.  If we want
that, I'm tempted to think that we should display statistics for each
bitmap index scan - but I'd be somewhat inclined to see if we could
get by with the values that are already stored in a TIDBitmap rather
than adding new ones - e.g. show npages (the number of exact entries),
nchunks (the number of lossy entries), and maxentries.  From that, you
can work out the percentage of available entries that were actually
used.  The only thing that's a bit annoying about that is that we'd
probably have to copy those values out of the tid bitmap and into an
executor state node, because the tid bitmap will subsequently get
modified destructively.  But I think that's probably OK.
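
A sketch of that copy-out (the tbm_* accessors are hypothetical - the
npages/nchunks/maxentries counters are currently private to
tidbitmap.c - and the stats_* fields are assumed additions to the
executor state node):

/* snapshot the counters before the bitmap is consumed/modified */
node->stats_exact_pages  = tbm_num_pages(tbm);
node->stats_lossy_chunks = tbm_num_chunks(tbm);
node->stats_max_entries  = tbm_max_entries(tbm);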

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-02 Thread Robert Haas
On Tue, Dec 31, 2013 at 4:12 AM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 1. PromiseTupleInsertionLockAcquire(my xid)
 2. Insert heap tuple
 3. Insert index tuples
 4. Check if conflict happened. Kill the already-inserted tuple on conflict.
 5. PromiseTupleInsertionLockRelease(my xid)

 IOW, the only change to the current patch is that you acquire the new kind
 of lock before starting the insertion, and you release it after you've
 killed the tuple, or you know you're not going to kill it.

 Where does row locking fit in there? - you may need to retry when that
 part is incorporated, of course. What if you have multiple promise
 tuples from a contended attempt to insert a single slot, or multiple
 broken promise tuples across multiple slots or even multiple commands
 in the same xact?

Yeah, it seems like PromiseTupleInsertionLockAcquire should be locking
the tuple, rather than the XID.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-02 Thread Nicolas Barbier
2013/12/15 David Rowley dgrowle...@gmail.com:

 I've been working on speeding up aggregate functions when used in the
 context of windows with non-fixed frame heads.

 1. Fully implement negative transition functions for SUM and AVG.

I would like to mention that this functionality is also extremely
useful to have for the incremental maintenance of materialized views
that use aggregation (which IMHO is one of the most useful kinds).

(Simply imagine a view of the form “SELECT a, agg_function(b) FROM
table GROUP BY a”, a lot of rows in the table, a lot of rows in each
group, and changes that both remove and add new rows.)

For this to work, two things are needed:

(1) A way to apply a value normally (already supported) and inversely
(i.e., this patch) to the current “internal state” of an aggregation.

(2) A way to store the “internal state” of an aggregation in the
materialized view’s “extent” (i.e., the physical rows that represent
the view’s contents, which may or may not be slightly different from
what you get when you do SELECT * FROM matview). As (AFAIK) that state
is stored as a normal value, the maintenance code could just take the
value, store it in the extent, and next time retrieve it again and
perform normal or inverse transitions. When selecting from the
matview, the state could be retrieved, and the final function applied
to it to yield the value to be returned.

To understand (2), assume that one wants to store an AVG() in a
materialized view; To be able to update the value incrementally, one
needs to actually store the SUM() and COUNT(), and perform the
division when selecting from the materialized view. Or it could
(initially) be decided to define AVG() as “not supporting fast
incremental maintenance,” and require the user (if he/she wants fast
incremental maintenance) to put SUM() and COUNT() in the materialized
view manually, and perform the division manually when wanting to
retrieve the average.
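
As a concrete illustration of (1) above, an inverse transition function
for a simple aggregate could look like this (int4_sum_inv is a
hypothetical name, assumed to be declared non-strict so NULL inputs
reach it; a sketch of the concept, not code from the patch):

Datum
int4_sum_inv(PG_FUNCTION_ARGS)
{
	int64		state = PG_GETARG_INT64(0);

	/* undo what the forward transition function did for this input */
	if (!PG_ARGISNULL(1))
		state -= (int64) PG_GETARG_INT32(1);

	PG_RETURN_INT64(state);
}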

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?




Re: [HACKERS] Fixing pg_basebackup with tablespaces found in $PGDATA

2014-01-02 Thread Dimitri Fontaine
Magnus Hagander mag...@hagander.net writes:
 We can't get away with just comparing the relative part of the pathname.
 Because it will fail if there is another path with exactly the same length,
 containing the tablespace.

Actually… yeah.

 I think we might want to store a value in the tablespaceinfo struct
 indicating whether it's actually inside PGDATA (since we have the full path
 at that point), and then skip it based on that instead. Or store and pass
 the value of getcwd() perhaps.

I think it's best to stuff in the tablespaceinfo struct either NIL or
the relative path of the tablespace when found in $PGDATA, as done in
the attached.

 I've attached a slightly updated patch - I changed around a bit of logic
 order and updated some comments during my review. And added error-checking.

Thanks! I started again from your version for v3.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

*** a/src/backend/replication/basebackup.c
--- b/src/backend/replication/basebackup.c
***
*** 45,51  typedef struct
  } basebackup_options;
  
  
! static int64 sendDir(char *path, int basepathlen, bool sizeonly);
  static int64 sendTablespace(char *path, bool sizeonly);
  static bool sendFile(char *readfilename, char *tarfilename,
  		 struct stat * statbuf, bool missing_ok);
--- 45,51 
  } basebackup_options;
  
  
! static int64 sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces);
  static int64 sendTablespace(char *path, bool sizeonly);
  static bool sendFile(char *readfilename, char *tarfilename,
  		 struct stat * statbuf, bool missing_ok);
***
*** 72,77  typedef struct
--- 72,78 
  {
  	char	   *oid;
  	char	   *path;
+ 	char   *rpath;			/* relative path within PGDATA, or nil. */
  	int64		size;
  } tablespaceinfo;
  
***
*** 100,105  perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
--- 101,119 
  	XLogRecPtr	endptr;
  	TimeLineID	endtli;
  	char	   *labelfile;
+ 	char	cwd[MAXPGPATH];
+ 	int rootpathlen;
+ 
+ 	/*
+ 	 * We need to compute rootpathlen to allow for skipping tablespaces
+ 	 * installed within PGDATA.
+ 	 */
+ 	if (!getcwd(cwd, MAXPGPATH))
+ 		ereport(ERROR,
+ (errcode_for_file_access(),
+  errmsg("could not determine current directory: %m")));
+ 
+ 	rootpathlen = strlen(cwd);
  
  	backup_started_in_recovery = RecoveryInProgress();
  
***
*** 119,124  perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
--- 133,139 
  		{
  			char		fullpath[MAXPGPATH];
  			char		linkpath[MAXPGPATH];
+ 			char		*relpath = NULL;
  			int			rllen;
  
  			/* Skip special stuff */
***
*** 145,153  perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
--- 160,178 
  			}
  			linkpath[rllen] = '\0';
  
+ 			/*
+ 			 * Relpath is the relative path of the tablespace linkpath when
+ 			 * the realname is found within PGDATA, or NULL.
+ 			 */
+ 			if (rllen > rootpathlen &&
+  strncmp(linkpath, cwd, rootpathlen) == 0 &&
+  linkpath[rootpathlen] == '/')
+ relpath = linkpath + rootpathlen + 1;
+ 
  			ti = palloc(sizeof(tablespaceinfo));
  			ti->oid = pstrdup(de->d_name);
  			ti->path = pstrdup(linkpath);
+ 			ti->rpath = relpath ? pstrdup(relpath) : NULL;
  			ti->size = opt->progress ? sendTablespace(fullpath, true) : -1;
  			tablespaces = lappend(tablespaces, ti);
  #else
***
*** 165,171  perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
  
  		/* Add a node for the base directory at the end */
  		ti = palloc0(sizeof(tablespaceinfo));
! 		ti->size = opt->progress ? sendDir(".", 1, true) : -1;
  		tablespaces = lappend(tablespaces, ti);
  
  		/* Send tablespace header */
--- 190,196 
  
  		/* Add a node for the base directory at the end */
  		ti = palloc0(sizeof(tablespaceinfo));
! 		ti->size = opt->progress ? sendDir(".", 1, true, tablespaces) : -1;
  		tablespaces = lappend(tablespaces, ti);
  
  		/* Send tablespace header */
***
*** 191,197  perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
  sendFileWithContent(BACKUP_LABEL_FILE, labelfile);
  
  /* ... then the bulk of the files ... */
! sendDir(".", 1, false);
  
  /* ... and pg_control after everything else. */
  if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
--- 216,222 
  sendFileWithContent(BACKUP_LABEL_FILE, labelfile);
  
  /* ... then the bulk of the files ... */
! sendDir(".", 1, false, tablespaces);
  
  /* ... and pg_control after everything else. */
  if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
***
*** 778,785  sendTablespace(char *path, bool sizeonly)
  		_tarWriteHeader(TABLESPACE_VERSION_DIRECTORY, NULL, &statbuf);
  	size = 512;	/* Size of the header just added */
  
! 	/* Send all the files in the tablespace version directory */
! 	size += sendDir(pathbuf, strlen(path), sizeonly);
  
  	return size;
  }
--- 803,815 
  		

Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Robert Haas
On Thu, Jan 2, 2014 at 7:40 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-01-02 07:35:11 -0500, Robert Haas wrote:
 On Thu, Jan 2, 2014 at 12:26 AM, Greg Stark st...@mit.edu wrote:
  I fail to see why.  What is so ugly about this:
 
  select x.* from pgbench_accounts a, pg_tuple_header(a.tableoid, a.ctid) x;
 
  Two points:
 
  1) it's a bit weird to go to this effort to eliminate system columns by
  using a scheme that depends on having a system column -- ctid
 
  2) refetching a row could conceivably end up retrieving different data than
  was present when the row was originally read. (In some cases that might
  actually be the intended behaviour)
 
  If this came up earlier I'm sorry but I suppose it's too hard to have a
  function like foo(tab.*) which magically can tell that the record is a heap
  tuple and look in the header? And presumably throw an error if passed a non
  heap tuple.

 Yeah, that was tried.  Doesn't work:

 http://www.postgresql.org/message-id/ca+tgmozwzzw_wr8adhwc8iejweihxlpxranqyrfb8mw0sch...@mail.gmail.com

 At any rate, my goal isn't really to get rid of system columns, but to
 address Jim Nasby's gripe that the changes thus far committed make it
 difficult to use system columns to determine whether a given tuple has
 been frozen.  It sounds like the consensus is that a system column is
 a better way of doing that than a function, so I suppose the next step
 is to try to hammer out the details.

 So, I think something like my POC patch would need to get in before we
 can go there, right? I won't be able to work on it in the next 10days or
 so, there's a bunch of stuff that's more pressing unfortunately. There's
 a fair bit of work left there...

Well, that's the question, I guess.  I think the options are:

1. Add pg_infomask now.  Leave your patch for a future optimization.
Suck up the catalog bloat.
2. Wait for your patch to be done.  Then add pg_infomask afterwards.
Accept that it may not make 9.4, or else accept that we may have to
accept that patch after the nominal deadline.
3. Decide that exposing the frozen status of tuples is not a priority
and just don't worry about it.

One concern I have is that if we spend too much time now worrying
about these system column problems, it may reduce the amount of
optimization that gets done on top of this optimization in the 9.4
time frame.  And that, I think, would be sad.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] more psprintf() use

2014-01-02 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-01-02 09:49:48 +0200, Heikki Linnakangas wrote:
 Is it legal to return a constant with PG_RETURN_CSTRING? Grepping around, I
 don't see that being done anywhere else, but there are places that do
 PG_RETURN_CSTRING(pstrdup(constant))...

 I don't see why it wouldn't be legal - constant strings have static
 storage duration, i.e. the program lifetime. And I can see nothing that
 would allow pfree()ing the return value of cstring returning functions
 in the general case.

Heikki is right and you are wrong.  There is an ancient supposition that
datatype output functions, in particular, always return palloc'd strings.

I recently got rid of the pfree's in the main output path, cf commit
b006f4ddb988568081f8290fac77f9402b137120, which might explain why this
patch passes regression tests; but there are still places in the code (and
even more likely in third-party code) that will try to pfree the results.
So -1 for this particular change.  The pstrdup that Heikki suggests would
be safer practice.

regards, tom lane




Re: [HACKERS] Fixing pg_basebackup with tablespaces found in $PGDATA

2014-01-02 Thread Bernd Helmle



--On 1 January 2014 23:53:46 +0100 Dimitri Fontaine 
dimi...@2ndquadrant.fr wrote:



Hi,

As much as I've seen people frown upon $subject, it still happens in the
wild, and Magnus seems to agree that the current failure mode of our
pg_basebackup tool when confronted to the situation is a bug.

So here's a fix, attached.


I've seen having tablespaces under PGDATA as a policy within several data 
centres in the past. The main reasoning seems to be that they strictly 
separate platform and database administration, plus database inventory 
concerns. They are regularly surprised if you tell them not to use 
tablespaces in such a way, since they absorbed this practice over the years 
from other database systems. So +1 for fixing this.


--
Thanks

Bernd




Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jan 2, 2014 at 12:26 AM, Greg Stark st...@mit.edu wrote:
 1) it's a bit weird to go to this effort to eliminate system columns by
 using a scheme that depends on having a system column -- ctid

 At any rate, my goal isn't really to get rid of system columns, but to
 address Jim Nasby's gripe that the changes thus far committed make it
 difficult to use system columns to determine whether a given tuple has
 been frozen.  It sounds like the consensus is that a system column is
 a better way of doing that than a function, so I suppose the next step
 is to try to hammer out the details.

Actually, I thought the function approach was a good proposal.  You are
right that func(tab.*) isn't going to work, because it's going to get a
Datum-ified tuple not a pointer to raw on-disk storage.  But an inspection
function that's handed a ctid could work.
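
A sketch of such a function (pg_tuple_header is the name from Robert's
example upthread; error checks and a proper composite result are
elided, returning just the raw infomask):

Datum
pg_tuple_header(PG_FUNCTION_ARGS)
{
	Oid			relid = PG_GETARG_OID(0);
	ItemPointer tid = (ItemPointer) PG_GETARG_POINTER(1);
	Relation	rel = heap_open(relid, AccessShareLock);
	HeapTupleData tup;
	Buffer		buf;

	tup.t_self = *tid;
	if (heap_fetch(rel, SnapshotAny, &tup, &buf, false, NULL))
	{
		uint16		infomask = tup.t_data->t_infomask;

		ReleaseBuffer(buf);
		heap_close(rel, AccessShareLock);
		PG_RETURN_INT32((int32) infomask);
	}
	heap_close(rel, AccessShareLock);
	PG_RETURN_NULL();
}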

I don't think that Greg's objection above has much force.  The fact of the
matter is that we are never going to be able to get rid of the exposed
tableoid, oid, or ctid columns, because there are too many situations
where those are useful, and are being used, in actual client-application
queries.  This is less true however of xmin/xmax/cmin/cmax; we might hope
to deprecate those at the SQL level someday, especially in view of the
fact that we keep whacking around their detailed meanings, with few user
complaints.

And I really really *don't* want to see us bloating pg_attribute, nor
infringing further on application column namespace, by adding even more
exposed columns.  So -1 for just blindly doing that.

regards, tom lane




Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Andres Freund
On 2014-01-02 09:40:54 -0500, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Thu, Jan 2, 2014 at 12:26 AM, Greg Stark st...@mit.edu wrote:
  1) it's a bit weird to go to this effort to eliminate system columns by
  using a scheme that depends on having a system column -- ctid
 
  At any rate, my goal isn't really to get rid of system columns, but to
  address Jim Nasby's gripe that the changes thus far committed make it
  difficult to use system columns to determine whether a given tuple has
  been frozen.  It sounds like the consensus is that a system column is
  a better way of doing that than a function, so I suppose the next step
  is to try to hammer out the details.
 
 Actually, I thought the function approach was a good proposal.  You are
 right that func(tab.*) isn't going to work, because it's going to get a
 Datum-ified tuple not a pointer to raw on-disk storage.  But an inspection
 function that's handed a ctid could work.

Well, we discussed that upthread, and the overhead of going through a
function is quite noticeable because the tuple needs to be fetched from
the heap again.
Check 
http://archives.postgresql.org/message-id/CA%2BTgmoYdWOZa_UiewbqE73AQCM2etpMjukkcmYPkGatH8kPTow%40mail.gmail.com

To get rid of part of the performance problem, we could possibly invent
a way to pass a HeapTuple instead of a blessed HeapTupleHeader to
functions, but that might not be so easy.

 And I really really *don't* want to see us bloating pg_attribute, nor
 infringing further on application column namespace, by adding even more
 exposed columns.  So -1 for just blindly doing that.

Upthread there's a POC patch of mine, that started to explore what's
necessary to simply never store system columns (except maybe oid) in
pg_attribute. While it passes the regression tests it's not complete,
but the amount of work looks reasonable. Check
http://archives.postgresql.org/message-id/20131223124504.GB19795%40alap2.anarazel.de

Now, that obviously doesn't get rid of the namespace problem. I'd
suggested a "system:" prefix, Robert "_pg_". Neither is pretty, but imo
should work.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-02 Thread Heikki Linnakangas

On 01/02/2014 02:53 PM, Robert Haas wrote:

On Tue, Dec 31, 2013 at 4:12 AM, Peter Geoghegan p...@heroku.com wrote:

On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

1. PromiseTupleInsertionLockAcquire(my xid)
2. Insert heap tuple
3. Insert index tuples
4. Check if conflict happened. Kill the already-inserted tuple on conflict.
5. PromiseTupleInsertionLockRelease(my xid)

IOW, the only change to the current patch is that you acquire the new kind
of lock before starting the insertion, and you release it after you've
killed the tuple, or you know you're not going to kill it.


Where does row locking fit in there? - you may need to retry when that
part is incorporated, of course. What if you have multiple promise
tuples from a contended attempt to insert a single slot, or multiple
broken promise tuples across multiple slots or even multiple commands
in the same xact?


You can only have one speculative insertion in progress at a time. After 
you've done all the index insertions and checked that you really didn't 
conflict with anyone, you're not going to go back and kill the tuple 
anymore. After that point, the insertion is not speculation anymore.



Yeah, it seems like PromiseTupleInsertionLockAcquire should be locking
the tuple, rather than the XID.


Well, that would be ideal, because we already have tuple locks. It would 
be nice to use the same concept for this. It's a bit tricky, however. I 
guess the most straightforward way to do it would be to grab a 
heavy-weight lock after you've inserted the tuple, but before releasing 
the buffer lock. I don't immediately see a problem with that, although 
it's a bit scary to acquire a heavy-weight lock while holding a buffer lock.


- Heikki




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-02 Thread Erik Rijkers
On Thu, January 2, 2014 13:36, Erik Rijkers wrote:
 On Thu, January 2, 2014 13:05, David Rowley wrote:
 here's a slightly updated patch
 [inverse_transition_functions_v1.8.patch.gz ]

 patch applies, and compiles (although with new warnings).
 But make check complains loudly: see attached.

 warnings:

The TRACE_POSTGRESQL_SORT_DONE warnings were not from your patch; sorry about 
that. They occur on HEAD too (with a debug
compile).

tuplesort.c:935:44: warning: comparison between pointer and integer [enabled by 
default]
  TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);
                                            ^
tuplesort.c:935:2: note: in expansion of macro 'TRACE_POSTGRESQL_SORT_DONE'
  TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);



The 'make check' failure remains a problem.

The output I sent earlier today was for this configure:

./configure --prefix=/var/data1/pg_stuff/pg_installations/pgsql.inverse 
--with-pgport=6594 \
--bindir=/var/data1/pg_stuff/pg_installations/pgsql.inverse/bin \
--libdir=/var/data1/pg_stuff/pg_installations/pgsql.inverse/lib \
--quiet --enable-depend --enable-cassert --enable-debug --with-perl \
--with-openssl --with-libxml --enable-dtrace

(and that's still repeatable)


Perhaps this helps:

with another configure:

./configure --prefix=/var/data1/pg_stuff/pg_installations/pgsql.inverse 
--with-pgport=6594
--bindir=/var/data1/pg_stuff/pg_installations/pgsql.inverse/bin.fast
--libdir=/var/data1/pg_stuff/pg_installations/pgsql.inverse/lib.fast --quiet 
--enable-depend --with-perl --with-openssl
--with-libxml

I get only this single 'make check' error:


*** /var/data1/pg_stuff/pg_sandbox/pgsql.inverse/src/test/regress/expected/window.out  2014-01-02 16:19:48.0 +0100
--- /var/data1/pg_stuff/pg_sandbox/pgsql.inverse/src/test/regress/results/window.out   2014-01-02 16:21:43.0 +0100
***
*** 1188,1195 
   sum
  --
   6.01
! 5
! 3
  (3 rows)

  SELECT i,COUNT(v) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND UNBOUNDED 
FOLLOWING)
--- 1188,1195 
   sum
  --
   6.01
!  5.00
!  3.00
  (3 rows)

  SELECT i,COUNT(v) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND UNBOUNDED 
FOLLOWING)

==


Centos 5.7
gcc 4.8.2



Thanks; and Happy New Year

Erik Rijkers





Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-02 Thread Robert Haas
On Thu, Jan 2, 2014 at 11:08 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 01/02/2014 02:53 PM, Robert Haas wrote:
 On Tue, Dec 31, 2013 at 4:12 AM, Peter Geoghegan p...@heroku.com wrote:

 On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:

 1. PromiseTupleInsertionLockAcquire(my xid)
 2. Insert heap tuple
 3. Insert index tuples
 4. Check if conflict happened. Kill the already-inserted tuple on
 conflict.
 5. PromiseTupleInsertionLockRelease(my xid)

 IOW, the only change to the current patch is that you acquire the new
 kind
 of lock before starting the insertion, and you release it after you've
 killed the tuple, or you know you're not going to kill it.


 Where does row locking fit in there? - you may need to retry when that
 part is incorporated, of course. What if you have multiple promise
 tuples from a contended attempt to insert a single slot, or multiple
 broken promise tuples across multiple slots or even multiple commands
 in the same xact?

 You can only have one speculative insertion in progress at a time. After
 you've done all the index insertions and checked that you really didn't
 conflict with anyone, you're not going to go back and kill the tuple
 anymore. After that point, the insertion is not speculation anymore.

Yeah... but how does someone examining the tuple know that?  We need
to avoid having them block on the promise-tuple insertion lock if
we've reacquired it meanwhile for a new speculative insertion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-01-02 09:40:54 -0500, Tom Lane wrote:
 Actually, I thought the function approach was a good proposal.  You are
 right that func(tab.*) isn't going to work, because it's going to get a
 Datum-ified tuple not a pointer to raw on-disk storage.  But an inspection
 function that's handed a ctid could work.

 Well, we discussed that upthread, and the overhead of going through a
 function is quite noticeable because the tuple needs to be fetched from
 the heap again.

Yeah, I read those results, but that seems like it could probably be
optimized.  I'm guessing the function was doing a new heap_open every
time?  There's probably a way around that.
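
(For instance, the usual fn_extra caching idiom, sketched below.  The
function and type names here are illustrative, not anything that exists,
and note that nothing in the sketch closes the relation when the query
ends, which is exactly where such schemes get fragile.)

#include "postgres.h"
#include "fmgr.h"
#include "access/heapam.h"
#include "storage/lock.h"
#include "utils/rel.h"

/* Illustrative per-query cache, kept in the fmgr info. */
typedef struct HeaderFnCache
{
    Oid      relid;
    Relation rel;
} HeaderFnCache;

static Relation
get_cached_rel(FunctionCallInfo fcinfo, Oid relid)
{
    HeaderFnCache *cache = (HeaderFnCache *) fcinfo->flinfo->fn_extra;

    if (cache == NULL)
    {
        /* allocate in fn_mcxt so the cache survives across calls */
        cache = (HeaderFnCache *)
            MemoryContextAllocZero(fcinfo->flinfo->fn_mcxt,
                                   sizeof(HeaderFnCache));
        fcinfo->flinfo->fn_extra = cache;
    }
    if (cache->relid != relid)
    {
        if (cache->rel != NULL)
            heap_close(cache->rel, AccessShareLock);
        cache->rel = heap_open(relid, AccessShareLock);
        cache->relid = relid;
    }
    return cache->rel;
}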

In any case, upon further reflection I'm not convinced that doing this
with a SELECT-based query is the right thing, no matter whether the query
looks at a function or a system column; because by definition, you'll only
be able to see tuples that are visible to your current snapshot.  For real
forensics work, you need to be able to see all tuples, which makes me
think that something akin to pgstattuple is the right API; that is return
a set of the header info for all tuples on such-and-such pages of this
relation.  That should dodge any performance problem, because the
heap_open overhead could be amortized across lots of tuples, and it also
sidesteps all problems with adding new system columns.

 Upthread there's a POC patch of mine, that started to explore what's
 necessary to simply never store system columns (except maybe oid) in
 pg_attribute. While it passes the regression tests it's not complete,
 but the amount of work looks reasonable.

I think this will inevitably break a lot of code, not all of it ours,
so I'm not in favor of pursuing that direction.

regards, tom lane




Re: [HACKERS] [PATCH] Remove some duplicate if conditions

2014-01-02 Thread Tom Lane
David Rowley dgrowle...@gmail.com writes:
 I've attached a simple patch which removes some duplicate if conditions
 that seemed to have found their way into the code.

 These are per PVS-Studio's warnings.

-1.  If PVS-Studio is complaining about this type of coding, to hell with
it; it should just optimize away the extra tests and be quiet about it,
not opinionate about coding style.  What you propose would cause logically
independent pieces of code to become tied together, and I don't find that
to be an improvement.  It's the compiler's job to micro-optimize where
possible, and most compilers that I've seen will do that rather than tell
the programmer he ought to do it.

regards, tom lane





Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch

2014-01-02 Thread Robert Haas
On Tue, Dec 31, 2013 at 12:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Mark Dilger markdil...@yahoo.com writes:
 In src/include/pgstat.h, the PGSTAT_NUM_TABENTRIES macro
 attempts to subtract off the size of the PgStat_MsgTabstat
 struct up to the m_entry[] field.  This macro was correct up
 until the fields m_block_read_time and m_block_write_time
 were added to that struct, as the macro was not changed to
 include their size.

 Yeah, that's a bug.

Ick.

 This patch fixes the macro.

 I'm inclined to think that we should look for a less breakable way to
 manage the message size limit; if Robert missed this issue in that patch,
 other people are going to miss it in future patches.  The existing coding
 isn't really right anyway when you consider possible alignment padding.
 (I think the present struct definitions probably don't involve any
 padding, but that's not a very future-proof assumption either.)  It'd be
 better to somehow drive the calculation from offsetof(m_entry).  I'm not
 seeing any way to do that that doesn't require some notational changes
 in the code, though.  One way would be to rejigger it like this:

 #define PGSTAT_MAX_MSG_SIZE 1000

 typedef union PgStat_MsgTabstat
 {
 struct {
 PgStat_MsgHdr hdr;
 Oid   databaseid;
 int   nentries;
 int   xact_commit;
 int   xact_rollback;
 PgStat_Counter block_read_time;/* times in microseconds */
 PgStat_Counter block_write_time;
 PgStat_TableEntry entry[FLEXIBLE_ARRAY_MEMBER];
 } m;
 char padding[PGSTAT_MAX_MSG_SIZE];  /* pad sizeof this union to target */
 } PgStat_MsgTabstat;

 #define PGSTAT_NUM_TABENTRIES ((PGSTAT_MAX_MSG_SIZE - 
 offsetof(PgStat_MsgTabstat, m.entry)) / sizeof(PgStat_TableEntry))

 but then we'd have to run around and replace m_hdr with m.hdr etc
 in the code referencing the message types that use this mechanism.
 Kind of annoying.

Rather than using a union, I'd be inclined to declare one type that's
just the header (PgStat_MsgTabstat_Hdr), and then work out
PGSTAT_NUM_TABENTRIES based on sizeof(PgStat_MsgTabstat_Hdr), and then
declare PgStat_MsgTabstat as a two-element struct, the header struct
followed by an array of entries.  That is indeed a bit of notational
churn but it seems worth it to me.
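
Concretely, something like this -- a sketch only, with stand-in field
types rather than the real pgstat.h definitions, since the point is just
the sizeof-based arithmetic:

typedef struct PgStat_MsgHdr { int m_type; int m_size; } PgStat_MsgHdr;  /* stand-in */
typedef long PgStat_Counter;                                             /* stand-in */
typedef unsigned int Oid;                                                /* stand-in */
typedef struct PgStat_TableEntry
{
    Oid            t_id;
    PgStat_Counter t_counts[8];                                          /* stand-in */
} PgStat_TableEntry;

#define PGSTAT_MAX_MSG_SIZE 1000

typedef struct PgStat_MsgTabstat_Hdr
{
    PgStat_MsgHdr  m_hdr;
    Oid            m_databaseid;
    int            m_nentries;
    int            m_xact_commit;
    int            m_xact_rollback;
    PgStat_Counter m_block_read_time;   /* times in microseconds */
    PgStat_Counter m_block_write_time;
} PgStat_MsgTabstat_Hdr;

/* how many entries still fit under the message-size cap */
#define PGSTAT_NUM_TABENTRIES \
    ((PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgTabstat_Hdr)) / \
     sizeof(PgStat_TableEntry))

typedef struct PgStat_MsgTabstat
{
    PgStat_MsgTabstat_Hdr m_tabhdr;
    PgStat_TableEntry     m_entry[PGSTAT_NUM_TABENTRIES];
} PgStat_MsgTabstat;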

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch

2014-01-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Dec 31, 2013 at 12:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm inclined to think that we should look for a less breakable way to
 manage the message size limit; if Robert missed this issue in that patch,
 other people are going to miss it in future patches.  The existing coding
 isn't really right anyway when you consider possible alignment padding.
 (I think the present struct definitions probably don't involve any
 padding, but that's not a very future-proof assumption either.)  It'd be
 better to somehow drive the calculation from offsetof(m_entry).  I'm not
 seeing any way to do that that doesn't require some notational changes
 in the code, though.  One way would be to rejigger it like this:
 ...

 Rather than using a union, I'd be inclined to declare one type that's
 just the header (PgStat_MsgTabstat_Hdr), and then work out
 PGSTAT_NUM_TABENTRIES based on sizeof(PgStat_MsgTabstat_Hdr), and then
 declare PgStat_MsgTabstat as a two-element struct, the header struct
 followed by an array of entries.  That is indeed a bit of notational
 churn but it seems worth it to me.

That approach would fail to account for any alignment padding occurring
just before the m_entry array, though.  That could happen if the array
members contain some 8-byte fields while there are none in the header
part.  I think it would net out to the same amount of notational change
in the code proper as my approach, since either way you end up with a
nested struct containing the header fields.

It occurs to me that, rather than trying to improve the struct definition
methodology, maybe we should just add static asserts to catch any
inconsistency here.  It wouldn't be that hard:

#define PGSTAT_MAX_MSG_SIZE 1000
#define PGSTAT_MSG_PAYLOAD  (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
... all else in pgstat.h as before ...

StaticAssertStmt(sizeof(PgStat_MsgTabstat) <= PGSTAT_MAX_MSG_SIZE,
                 "bad PgStat_MsgTabstat size");
... and similarly for other pgstat message structs ...

This would possibly lead to machine-dependent results if alignment comes
into play, but it's reasonable to suppose that we'd find out about any
mistakes as soon as they hit the buildfarm.  So this might be an
acceptable tradeoff for not having to change source code.

regards, tom lane




Re: [HACKERS] proposal: multiple read-write masters in a cluster with wal-streaming synchronization

2014-01-02 Thread Mark Dilger
My original email was mostly a question about whether WAL data
could be merged from multiple servers, or whether I was overlooking
some unsolvable difficulty.  I'm still mostly curious about that
question.


I anticipated that my proposal would require partitioning the catalogs.
For instance, autovacuum could only run on locally owned tables, and
would need to store the analyze stats data in a catalog partition belonging
to the local server, but that doesn't seem like a fundamental barrier to
it working.  The partitioned catalog tables would get replicated like
everything else.  The code that needs to open catalogs and look things
up could open the specific catalog partition needed if it already knew the
Oid of the table/index/whatever that it was interested in, as the catalog
partition desired would have the same modulus as the Oid of the object
being researched. 

Your point about increasing the runtime of pg_upgrade is taken.  I will
need to think about that some more.

Your claim that what I describe is not multi-master is at least partially
correct, depending on how you think about the word master.  Certainly
every server is the master of its own chunk.  I see that as a downside
for some people, who want to be able to insert/update/delete any data
on any server.  But the ability to modify *anything anywhere* brings
performance problems with it.  Either the servers have to wait for each
other before commits go through, in order to avoid incompatible data
changes being committed on both ends, or the servers have to reject
commits after they have already been reported to the client as successful.
I expect my proposal to have better read scalability in a write-heavy
environment, because the less work it takes to integrate data changes
from other workers, the more resources remain per server to answer
read queries.

Your claim that BDR doesn't have to be much slower than what I am
proposing is quite interesting, as if that is true I can ditch this idea and
use BDR instead.  It is hard to empirically test, though, as I don't have
the alternate implementation on hand.


I think the expectation that performance will be harmed if postgres
uses 8 byte Oids is not quite correct.

Several years ago I ported postgresql sources to use 64bit everything.
Oids, varlena headers, variables tracking offsets, etc.  It was a fair
amount of work, but all the doom and gloom predictions that I have
heard over the years about how 8-byte varlena headers would kill
performance, 8-byte Oids would kill performance, etc, turned out to
be quite inaccurate.  The performance impact was ok for me.  The
disk space impact wasn't much either, as with 8-byte varlena headers,
anything under 127 bytes had a 1-byte header, and anything under
16383 had a 2-byte header, with 8-bytes only used after that, which
pretty much meant that disk usage for representing varlena data
shrunk slightly rather than growing.  Tom Lane had mentioned in a
thread that he didn't want to make the #define for processing
varlena headers any more complicated than it was, because it gets
executed quite a lot.  So I tried the 1,2,8 byte vs 1,8 byte varlena
design both ways and found it made little difference to me which I
chose.  Of course, my analysis was based on my own usage patterns,
my own schemas, my own data, and might not apply to everyone
else.  I tend to conflate the 8-byte Oid change with all these other
changes from 4-byte to 8-byte, because that's what I did and what
I have experience with.
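
To give a concrete flavor of the encoding (a sketch; the thresholds match
what I describe above, but my fork's actual bit layout is not shown, and
the function name is just illustrative):

#include <stdint.h>
#include <stddef.h>

/* 1/2/8-byte varlena header scheme: short values pay 1 byte, medium
 * values 2 bytes, and only large values the full 8 bytes. */
static size_t
varlena_header_size(uint64_t datalen)
{
    if (datalen < 127)
        return 1;               /* 1-byte header */
    if (datalen < 16383)
        return 2;               /* 2-byte header */
    return 8;                   /* full 8-byte header */
}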


Having 8-byte everything with everything aligned allowed me to use
SSE functions on some stuff that postgres was (at least at the time)
doing less efficiently.  Since then, I have noticed that the hash function
for disk blocks is implemented with SSE in mind.  With 8-byte aligned
datums, SSE based hashing can be used without all the calls to
realign the data.  I was experimenting with forcing data to be 16-byte
aligned to take advantage of newer SSE functions, but this was years
ago and I didn't own any hardware with the newer SSE capabilities,
so I never got to benchmark that.


All this is to say that increasing to 8 bytes is not a pure performance
loss.  It is a trade-off, and one that I did not find particularly problematic.
On the up side, I didn't need to worry about Oid exhaustion anymore,
which allows removing the code that checks for it (though I left that
code in place.)  It allows using varlena objects instead of the large object
interface, so I could yank that interface and make my code size
smaller.  (I never much used the LO interface to begin with, so I might
not be the right person to ask about this.)  It allows not worrying about
accidentally bumping into the 1GB limit on varlenas, which means you
don't have to code for that error condition in applications.


mark






Re: [HACKERS] RFC: Async query processing

2014-01-02 Thread Claudio Freire
On Wed, Dec 18, 2013 at 1:50 PM, Florian Weimer fwei...@redhat.com wrote:
 On 11/04/2013 02:51 AM, Claudio Freire wrote:

 On Sun, Nov 3, 2013 at 3:58 PM, Florian Weimer fwei...@redhat.com wrote:

 I would like to add truly asynchronous query processing to libpq,
 enabling
 command pipelining.  The idea is to to allow applications to auto-tune to
 the bandwidth-delay product and reduce the number of context switches
 when
 running against a local server.

 ...

 If the application is not interested in intermediate query results, it
 would
 use something like this:

 ...

 If there is no need to exit from the loop early (say, because errors are
 expected to be extremely rare), the PQgetResultNoWait call can be left
 out.


 It doesn't seem wise to me making such a distinction. It sounds like
 you're oversimplifying, and that's why you need modes, to overcome
 the evidently restrictive limits of the simplified interface, and that
 it would only be a matter of (a short) time when some other limitation
 requires some other mode.


 I need modes because I want to avoid unbound buffering, which means that
 result data has to be consumed in the order queries are issued.
...
 In any case, I don't want to change the wire protocol, I just want to enable
 libpq clients to use more of its capabilities.

I believe you will at least need to use TCP_CORK or some advanced
socket options if you intend to decrease the number of packets without
changing the protocol.

Due to the interactive and synchronized nature of the protocol, TCP
will immediately send the first query in a packet since it's already
ready to do so. Buffering will only happen from the second query
onwards, and this won't benefit a two-query loop as the one in your
sample.
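
For illustration, corking on Linux looks roughly like this (a sketch; the
helper name is mine and error handling is omitted):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* While corked, the kernel holds back partially filled frames, so several
 * small protocol messages written back-to-back can share one packet. */
static int
set_cork(int sockfd, int on)
{
    return setsockopt(sockfd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* usage: set_cork(fd, 1); send query 1; send query 2; set_cork(fd, 0); */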

As for expectations, they can be part of the connection object and not
the wire protocol if you wish. The point I was making, is that the
expectation should be part of the query call, since that's less error
prone than setting a discard results mode. Think of it as
PQsendQueryParams with an extra async argument that defaults to
PQASYNC_NOT (ie, sync). There you can tell libpq to expect either no
results, expect and discard them, or whatever. The benefit here is a
simplified usage: your example code will be part of libpq and thus all
this complexity will be hidden from users. Furthermore, libpq will do
the small sanity check of actually checking that the server returns no
results when expecting no result.

PGAsyncMode oldMode = PQsetsendAsyncMode(conn, PQASYNC_RESULT);
bool more_data;
do {
    more_data = ...;
    if (more_data) {
        int ret = PQsendQueryParams(conn,
            "INSERT ... RETURNING ...", ...);
        if (ret == 0) {
            // handle low-level error
        }
    }
    // Consume all pending results.
    while (1) {
        PGresult *res;
        if (more_data) {
            res = PQgetResultNoWait(conn);
        } else {
            res = PQgetResult(conn);
        }


 Somehow, that code looks backwards. I mean, really backwards. Wouldn't
 that be !more_data?

 No, if more data is available to transfer to the server, the no-wait variant
 has to be used to avoid a needless synchronization with the server.

Ok, yeah. Now I get it. It's client-side more_data.

 In any case, pipelining like that, without a clear distinction, in the
 wire protocol, of which results pertain to which query, could be a
 recipe for trouble when subtle bugs, either in lib usage or
 implementation, mistakenly treat one query's result as another's.


 We already use pipelining in libpq (see pqFlush, PQsendQueryGuts and
 pqParseInput3), the server is supposed to support it, and there is a lack of
 a clear tit-for-tat response mechanism anyway because of NOTIFY/LISTEN and
 the way certain errors are reported.

pqFlush doesn't seem overly related, since the API specifically states
that you cannot queue multiple PQsendQuery. It looks more like
low-level buffering. Ie: when the command itself is larger than the os
buffer and nonblocking operation requires multiple send() calls for
one PQsendQuery. Am I wrong?
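
For reference, the kind of low-level flushing I mean looks like this (a
sketch; the helper name is mine).  PQflush() returns 1 while data remains
queued, 0 once everything is sent, and -1 on error:

#include <libpq-fe.h>
#include <sys/select.h>

static int
drain_send_queue(PGconn *conn)
{
    int rc;

    while ((rc = PQflush(conn)) == 1)
    {
        /* wait until the socket is writable, then flush again */
        fd_set wfds;
        int sock = PQsocket(conn);

        FD_ZERO(&wfds);
        FD_SET(sock, &wfds);
        if (select(sock + 1, NULL, &wfds, NULL, NULL) < 0)
            return -1;
    }
    return rc;                  /* 0 = all sent, -1 = error */
}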

 Instead of buffering the results, we could buffer the encoded command
 messages in PQASYNC_RESULT mode.  This means that PQsendQueryParams would
 not block when it cannot send the (complete) command message, but store
 in
 the connection object so that the subsequent PQgetResultNoWait and
 PQgetResult would send it.  This might work better with single-tuple
 result
 mode.  We cannot avoid buffering either multiple queries or multiple
 responses if we want to utilize the link bandwidth, or we'd risk
 deadlocks.


 This is a non-solution. Such an implementation, at least as described,
 would not remove neither network latency nor context switches, it
 would be a purely API change with no externally visible behavior
 change.


 Ugh, why?

Oh, sorry. I had this elaborate answer prepared, but I just noticed
it's wrong: you do say if it cannot send it right away.

Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Robert Haas
On Thu, Jan 2, 2014 at 12:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andres Freund and...@2ndquadrant.com writes:
 On 2014-01-02 09:40:54 -0500, Tom Lane wrote:
 Actually, I thought the function approach was a good proposal.  You are
 right that func(tab.*) isn't going to work, because it's going to get a
 Datum-ified tuple not a pointer to raw on-disk storage.  But an inspection
 function that's handed a ctid could work.

 Well, we discussed that upthread, and the overhead of going through a
 function is quite noticeable because the tuple needs to be fetched from
 the heap again.

 Yeah, I read those results, but that seems like it could probably be
 optimized.  I'm guessing the function was doing a new heap_open every
 time?  There's probably a way around that.

Yeah, it was.  I don't see any plausible way around that, but I'm all ears.

 In any case, upon further reflection I'm not convinced that doing this
 with a SELECT-based query is the right thing, no matter whether the query
 looks at a function or a system column; because by definition, you'll only
 be able to see tuples that are visible to your current snapshot.  For real
 forensics work, you need to be able to see all tuples, which makes me
 think that something akin to pgstattuple is the right API; that is return
 a set of the header info for all tuples on such-and-such pages of this
 relation.  That should dodge any performance problem, because the
 heap_open overhead could be amortized across lots of tuples, and it also
 sidesteps all problems with adding new system columns.

I both agree and disagree with this.  I think that pgstattuple is
useful, and I further agree that adding a stat to it about how much of
the heap is frozen would be worthwhile.  However, an aggregate number
isn't always what you want, and being able to scrutinize specific
tuples is, I think, a useful thing.  I'm not sure that it needs to be
fast, which is why I think a function rather than a system column
might be OK, but I think it ought to be possible.

 Upthread there's a POC patch of mine, that started to explore what's
 necessary to simply never store system columns (except maybe oid) in
 pg_attribute. While it passes the regression tests it's not complete,
 but the amount of work looks reasonable.

 I think this will inevitably break a lot of code, not all of it ours,
 so I'm not in favor of pursuing that direction.

I thought that that approach had been discussed previously and found
desirable on the grounds that it would cut down on the size of
pg_attribute, but it's not something I want to rush through when we
have a release to get out the door.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jan 2, 2014 at 12:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 In any case, upon further reflection I'm not convinced that doing this
 with a SELECT-based query is the right thing, no matter whether the query
 looks at a function or a system column; because by definition, you'll only
 be able to see tuples that are visible to your current snapshot.  For real
 forensics work, you need to be able to see all tuples, which makes me
 think that something akin to pgstattuple is the right API; that is return
 a set of the header info for all tuples on such-and-such pages of this
 relation.  That should dodge any performance problem, because the
 heap_open overhead could be amortized across lots of tuples, and it also
 sidesteps all problems with adding new system columns.

 I both agree and disagree with this.  I think that pgstattuple is
 useful, and I further agree that adding a stat to it about how much of
 the heap is frozen would be worthwhile.  However, an aggregate number
 isn't always what you want, and being able to scrutinize specific
 tuples is, I think, a useful thing.

Oh, I guess I referenced the wrong contrib module, because what I was
trying to propose is a function that returns a setof record, one row for
each physical tuple it finds.  There are some functions in
contrib/pageinspect that work like that, but not pgstattuple.  I don't
think pageinspect's API is ideal because it's tightly tied to processing
one page at a time, but it does show information a bit like what we need
here.

regards, tom lane




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-02 Thread Tom Lane
Erik Rijkers e...@xs4all.nl writes:
 The TRACE_POSTGRESQL_SORT_DONE warnings were not from your patch; sorry about 
 that. They occur on HEAD too (with a debug
 compile).

 tuplesort.c:935:44: warning: comparison between pointer and integer [enabled 
 by default]
   TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);
                                             ^
 tuplesort.c:935:2: note: in expansion of macro 'TRACE_POSTGRESQL_SORT_DONE'
   TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);


FWIW, I don't see any such warnings on either of the machines I have that
will accept --enable-dtrace.  state->tapeset is a pointer, so these
warnings look a tad bogus.

regards, tom lane




Re: [HACKERS] more psprintf() use

2014-01-02 Thread Alvaro Herrera
Peter Eisentraut wrote:

 psprintf() in place of hardcoded palloc(N) + sprintf() and the like.
 

 + values[j++] = psprintf("%d", stat.blkno);
 + values[j++] = psprintf("%c", stat.type);
 + values[j++] = psprintf("%d", stat.live_items);
 + values[j++] = psprintf("%d", stat.dead_items);
 + values[j++] = psprintf("%d", stat.avg_item_size);
 + values[j++] = psprintf("%d", stat.page_size);
 + values[j++] = psprintf("%d", stat.free_size);
 + values[j++] = psprintf("%d", stat.btpo_prev);
 + values[j++] = psprintf("%d", stat.btpo_next);
 + values[j++] = psprintf("%d", (stat.type == 'd') ? stat.btpo.xact : 
 stat.btpo.level);
 + values[j++] = psprintf("%d", stat.btpo_flags);
  
   tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
  values);

In cases such as this one, I have often wondered whether it'd be better
to write this as DatumGetSometype() plus heap_form_tuple, instead of
printing to strings and then building a tuple from those.
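
Something like this, I mean -- a sketch with an illustrative function
name, assuming the blessed tuple descriptor has three int4 columns in
this order:

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "access/htup_details.h"

static Datum
build_row_from_datums(TupleDesc tupleDesc, int32 blkno,
                      int32 live_items, int32 dead_items)
{
    Datum     values[3];
    bool      nulls[3] = {false, false, false};
    HeapTuple tuple;

    /* form the tuple from Datums directly; no string round-trip */
    values[0] = Int32GetDatum(blkno);
    values[1] = Int32GetDatum(live_items);
    values[2] = Int32GetDatum(dead_items);

    tuple = heap_form_tuple(tupleDesc, values, nulls);
    return HeapTupleGetDatum(tuple);
}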

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] proposal: multiple read-write masters in a cluster with wal-streaming synchronization

2014-01-02 Thread Andres Freund
On 2014-01-02 10:18:52 -0800, Mark Dilger wrote:
 I anticipated that my proposal would require partitioning the catalogs.
 For instance, autovacuum could only run on locally owned tables, and
 would need to store the analyze stats data in a catalog partition belonging
 to the local server, but that doesn't seem like a fundamental barrier to
 it working.

It would make every catalog lookup noticeably more expensive.

  The partitioned catalog tables would get replicated like
 everything else.  The code that needs to open catalogs and look things
 up could open the specific catalog partition needed if it already knew the
 Oid of the table/index/whatever that it was interested in, as the catalog
 partition desired would have the same modulus as the Oid of the object
 being researched. 

Far, far, far from every lookup is by oid. Most prominently the names of
database objects. Those will have to scan every catalog partition. Not
fun.

 Your point about increasing the runtime of pg_upgrade is taken.  I will
 need to think about that some more.

It's not about increasing the runtime, it's about simply breaking
it. pg_upgrade relies on binary compatibility of user relation's files
and you're breaking that if you change the width of datatypes.

 Your claim that what I describe is not multi-master is at least partially
 correct, depending on how you think about the word master.  Certainly
 every server is the master of its own chunk.

Well, you're essentially just describing a sharded system - that's not
usually coined multimaster.

 Your claim that BDR doesn't have to be much slower than what I am
 proposing is quite interesting, as if that is true I can ditch this idea and
 use BDR instead.  It is hard to empirically test, though, as I don't have
 the alternate implementation on hand.

Well, I can tell you that for the changeset extraction stuff (which is the
basis for BDR) the biggest bottleneck so far seems to be the CRC
computation when reading the WAL - and that's something plain WAL apply
has to do as well. And it is optimizable.
When actually testing decoding & apply, for workloads fitting into
memory I had to try very hard to construe situations where apply was a
big bottleneck. It is easier for seek bound workloads, where the standby
is less powerful than the primary, since there's more random reads for
those due to full page writes removing the need for reads in many cases.

 I think the expectation that performance will be harmed if postgres
 uses 8 byte Oids is not quite correct.
 
 Several years ago I ported postgresql sources to use 64bit everything.
 Oids, varlena headers, variables tracking offsets, etc.  It was a fair
 amount of work, but all the doom and gloom predictions that I have
 heard over the years about how 8-byte varlena headers would kill
 performance, 8-byte Oids would kill performance, etc, turned out to
 be quite inaccurate.

Well, it can increase the size of the database, turning a system where
the hot set fits into memory into one where it doesn't anymore. But
really, the performance concerns were more about the catalog lookups.

Fundamentally, I think there's nothing I see preventing such a scheme
from being implemented - but I think there's about zap chance of it ever
getting integrated, it's just far too invasive with very high costs in
scenarios where it's not used for not all that much gain. Not to speak
about the amount of engineering it would require to implement.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ERROR: missing chunk number 0 for toast value

2014-01-02 Thread Heikki Linnakangas

On 01/02/2014 02:24 PM, Rushabh Lathia wrote:

Hi All,

Test case:

drop table if exists t;
create table t(c text);
insert into t values ('x'), (repeat(md5('abcdefghijklmnop'), 1));
select pg_column_size(c), pg_column_size(c || '') FROM t;

CREATE OR REPLACE FUNCTION copy_toast_out() RETURNS VOID AS $$
declare
 v text;
BEGIN
 SELECT c INTO v FROM t WHERE c <> 'x';
 Select 1/0;
Exception
 When Others Then
 PERFORM pg_sleep(30); -- go run TRUNCATE t in a 2nd session


 raise notice 'length :%', length(v || ''); -- force detoast


END;
$$ language plpgsql;

postgres=# select copy_toast_out();
ERROR:  missing chunk number 0 for toast value 16390 in pg_toast_16384
CONTEXT:  PL/pgSQL function copy_toast_out() line 10 at RAISE

Analysis:

The basic problem here is that if the lock is released on table before
extracting toasted value, and in meantime someone truncates the table,
this error can occur.  Here error coming with PL block contains an Exception
block (as incase there is an exception block, it calls
RollbackAndReleaseCurrentSubTransaction).


This is another variant of the bug discussed here: 
http://www.postgresql.org/message-id/0c41674c-fa02-4768-9e1b-548e56887...@quarantainenet.nl.



Do you think we should detoast the local variable before
  RollbackAndReleaseCurrentSubTransaction? Or any other options?


Hmm, that would fix this particular test case, but not the other case 
where you DROP or TRUNCATE the table in the same transaction.


The simplest fix would be to just detoast everything on assignment but 
that was rejected on performance grounds in that previous thread. I 
don't see any other realistic way to fix this, however, so maybe we 
should just bite the bullet and do it anyway. For a simple variable like 
the one in your test case, it's a good bet to detoast the value immediately; 
it'll be detoasted as soon as you try to do anything with it anyway. But 
it's not a good bet for record or row variables, because you often fetch 
the whole row into a variable but only access a field or two. Then 
again, if you run into that, at least you can work around it by changing 
your plpgsql code to only fetch the fields you need.
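
The mechanism itself would be small -- something like this on assignment
(a sketch, not the actual plpgsql executor code; the function name is
illustrative):

#include "postgres.h"
#include "access/tuptoaster.h"

/* On assignment, replace an external TOAST pointer with a local copy so
 * the value no longer depends on the source table staying around. */
static Datum
detoast_on_assign(Datum newvalue, bool isnull, bool typbyval)
{
    if (!typbyval && !isnull &&
        VARATT_IS_EXTERNAL(DatumGetPointer(newvalue)))
        newvalue = PointerGetDatum(
            heap_tuple_fetch_attr((struct varlena *) DatumGetPointer(newvalue)));
    return newvalue;
}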


- Heikki




Re: [HACKERS] proposal: multiple read-write masters in a cluster with wal-streaming synchronization

2014-01-02 Thread Merlin Moncure
On Tue, Dec 31, 2013 at 3:51 PM, Mark Dilger markdil...@yahoo.com wrote:
 The BDR documentation
 http://wiki.postgresql.org/images/7/75/BDR_Presentation_PGCon2012.pdf
 says,

 Physical replication forces us to use just one
  node: multi-master required for write scalability

 Physical replication provides best read scalability

 I am inclined to agree with the second statement, but
 I think my proposal invalidates the first statement, at
 least for a particular rigorous partitioning over which
 server owns which data.

 In my own workflow, I load lots of data from different
 sources.  The partition the data loads into depends on
 which source it came from, and it is never mixed or
 cross referenced in any operation that writes the data.
 It is only mixed in the sense that applications query
 data from multiple sources.

 So for me, multi-master with physical replication seems
 possible, and would presumably provide the best
 read scalability.  I doubt that I am in the only database
 user who has this kind of workflow.

 The alternatives are ugly.  I can load data from separate
 sources into separate database servers *without* replication
 between them, but then the application layer has to
 emulate queries across the data.  (Yuck.)  Or I can use
 logical replication such as BDR, but then the servers
 are spending more effort than with physical replication,
 so I get less bang for the buck when I purchase more
 servers to add to the cluster.  Or I can use FDW to access
 data from other servers, but that means the same data
 may be pulled across the wire arbitrarily many times, with
 corresponding impact on the bandwidth.

 Am I missing something here?  Does BDR really provide
 an equivalent solution?

I think BDR is better: while it only supports schema-equivalent
replication, that is the typical case for distributed write systems
like this.  Also, there are a lot fewer assumptions about the network
architecture in the actual data itself (for example, what happens when
you want to change owner/merge/split data?).  IMNSHO, it's better that
each node is managing WAL for itself, not the other way around except
in the very special case you want an exact replica of the database on
each node at all times as with the current HS/SR.

A **huge** amount of work has been and is being put into wal based logical
replication support (LLSR in the BDR docs) that should mostly combine
the flexibility of trigger based logical replication with the
robustness of wal based replication that we have in core now.  LLSR is a
low level data transmission framework that can be wrapped by higher
level user facing stuff like BDR.  LLSR, by the way, does not come
attached with the assumption that all databases have the same schema.
If I were you, I'd be studying up on LLSR and seeing how it could be
molded into the use cases you're talking about.  From a development
point of view, the replication train hasn't just left the station,
it's a space shuttle that just broke out of earth's orbit.  By my
reckoning a new 'from the ground up' implementation of replication
requiring in-core changes has an exactly zero percent chance of being
adopted.

merlin




Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Andres Freund
On 2014-01-02 12:46:34 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-01-02 09:40:54 -0500, Tom Lane wrote:
  Actually, I thought the function approach was a good proposal.  You are
  right that func(tab.*) isn't going to work, because it's going to get a
  Datum-ified tuple not a pointer to raw on-disk storage.  But an inspection
  function that's handed a ctid could work.
 
  Well, we discussed that upthread, and the overhead of going through a
  function is quite noticeable because the tuple needs to be fetched from
  the heap again.
 
 Yeah, I read those results, but that seems like it could probably be
 optimized.  I'm guessing the function was doing a new heap_open every
 time?  There's probably a way around that.

How? Everything I can think of is either fragile or too ugly.

 In any case, upon further reflection I'm not convinced that doing this
 with a SELECT-based query is the right thing, no matter whether the query
 looks at a function or a system column; because by definition, you'll only
 be able to see tuples that are visible to your current snapshot.

I think it really depends - quite often you really want to just
investigate tuples you actually can see, because you can easily write
normal queries and such. It sure wouldn't replace pageinspect.

  For real
 forensics work, you need to be able to see all tuples, which makes me
 think that something akin to pgstattuple is the right API; that is return
 a set of the header info for all tuples on such-and-such pages of this
 relation.  That should dodge any performance problem, because the
 heap_open overhead could be amortized across lots of tuples, and it also
 sidesteps all problems with adding new system columns.

The biggest problem with such an API is that it's painful to use - I've
used pageinspect a fair bit, and not being able to easily get the
content of the rows you're looking at makes it far less useful in
many scenarios. That could partially be improved by a neater API
(returning t_self would help greatly, accepting a page without a wrapper
function as well), but it still pretty fundamentally sucks.

And I really don't see any page-at-a-time access that's going to be
convenient. A big step would be returning the tuple's contents in a
normal fashion, but that's not easy without big notational overhead.

  Upthread there's a POC patch of mine, that started to explore what's
  necessary to simply never store system columns (except maybe oid) in
  pg_attribute. While it passes the regression tests it's not complete,
  but the amount of work looks reasonable.
 
 I think this will inevitably break a lot of code, not all of it ours,
 so I'm not in favor of pursuing that direction.

Are you thinking of client or extension code? From what I've looked at I
don't think it's all that likely to break much of either. Extension
code isn't likely to do its own catalog lookups and thus doesn't really
care. Most client code isn't interested in system catalog attribute
numbers - the few examples of client code looking at pg_attribute I
remember all explicitly excluded attnum  0.

It would make pg_attribute a fair bit smaller, especially on systems
with lots of narrow relations. There's also the possibility of only
going that route for new system catalog attributes, and leave the old
ones where they are and remove them in a release or two.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] ERROR: missing chunk number 0 for toast value

2014-01-02 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 The simplest fix would be to just detoast everything on assignment but 
 that was rejected on performance grounds in that previous thread. I 
 don't see any other realistic way to fix this, however, so maybe we 
 should just bite the bullet and do it anyway.

Or just say "don't do that".  TRUNCATE on a table that's in use by open
transactions has all sorts of issues besides this one.  The given example
is a pretty narrow corner case anyway --- with a less contorted coding
pattern, we'd still have AccessShareLock on the table, blocking the
TRUNCATE from removing data.  I'd still not want to blow up performance
in order to make this example work.

regards, tom lane




Re: [HACKERS] proposal: multiple read-write masters in a cluster with wal-streaming synchronization

2014-01-02 Thread Mark Dilger
Thanks to both of you for all the feedback.  Your reasoning
about why it is not worth implementing, what the problems
with it would be, etc., are helpful.


Sorry about using the word multimaster where it might
have been better to say sharded.

BTW, since the space shuttle has already left orbit, as you
metaphorically put it, maybe there should be more
visibility to the wider world about this?  You can go to
postgresql.org and find diddly squat about it.  I grant you
that it is not a completed project yet, and so maybe you
want to wait before making major announcements, but
the sort of people who would use this feature are probably
the sort of people who would not mind hearing about it
early.


mark





Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Robert Haas
On Thu, Jan 2, 2014 at 2:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I both agree and disagree with this.  I think that pgstattuple is
 useful, and I further agree that adding a stat to it about how much of
 the heap is frozen would be worthwhile.  However, an aggregate number
 isn't always what you want, and being able to scrutinize specific
 tuples is, I think, a useful thing.

 Oh, I guess I referenced the wrong contrib module, because what I was
 trying to propose is a function that returns a setof record, one row for
 each physical tuple it finds.  There are some functions in
 contrib/pageinspect that work like that, but not pgstattuple.  I don't
 think pageinspect's API is ideal because it's tightly tied to processing
 one page at a time, but it does show information a bit like what we need
 here.

Sure, Álvaro mentioned the same thing upthread.  Also, if you notice,
the function I was proposing to add does basically the same thing as
pageinspect, except one tuple at a time rather than one page at a time.
I think both are useful, though.  pageinspect is superuser-only, and
needing to work by pages isn't always convenient.  I thought about making
the pg_tuple_header() function I proposed scan using SnapshotAny
rather than the active snapshot, but then I think it'd need to be
superuser-only.  I also thought about making it use SnapshotAny for
superusers and the active snapshot for everybody else, but that seemed
like it could be confusing.

We could certainly add a function that returns SETOF record, taking
e.g. regclass as an argument, but it doesn't seem a stretch to me to
think that you might want to get tuple header information for some but
not all tuples in the relation, and I don't see any real good way to
tell the function exactly what tuples you want except by invoking it
once per TID.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ERROR: missing chunk number 0 for toast value

2014-01-02 Thread Andres Freund
On 2014-01-02 21:21:15 +0200, Heikki Linnakangas wrote:
 I don't see any other realistic way to fix this, however, so maybe we
 should just bite the bullet and do it anyway.

We could remember the subtransaction a variable was created in and error
out if the creating subtransaction aborted and it's not a
pass-by-value datum or similar.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] proposal: multiple read-write masters in a cluster with wal-streaming synchronization

2014-01-02 Thread Andres Freund
On 2014-01-02 11:35:57 -0800, Mark Dilger wrote:
 BTW, since the space shuttle has already left orbit, as you
 metaphorically put it, maybe there should be more
 visibility to the wider world about this?  You can go to
 postgresql.org and find diddly squat about it.  I grant you
 that it is not a completed project yet, and so maybe you
 want to wait before making major announcements, but
 the sort of people who would use this feature are probably
 the sort of people who would not mind hearing about it
 early.

Well, changeset extraction isn't committed yet, so it's not surprising
that you don't find anything there ;). Parts of the patches (notably
wal_level=logical, enriching the WAL to allow decoding) have landed, the
others are being worked over in response to review comments of Robert.

I am pretty sure there will be more publicity once it's committed ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-01-02 12:46:34 -0500, Tom Lane wrote:
 For real
 forensics work, you need to be able to see all tuples, which makes me
 think that something akin to pgstattuple is the right API; that is return
 a set of the header info for all tuples on such-and-such pages of this
 relation.  That should dodge any performance problem, because the
 heap_open overhead could be amortized across lots of tuples, and it also
 sidesteps all problems with adding new system columns.

 The biggest problem with such an API is that it's painful to use - I've
 used pageinspect a fair bit, and not being able to easily get the
 content of the rows you're looking at makes it really far less useful in
 many scenarios. That could partially be improved by a neater API

Surely.  Why couldn't you join against the table on ctid?

 And I really don't see any page-at-a-time access that's going to be
 convenient.

As I commented to Robert, the page-at-a-time behavior of pageinspect
is not an API detail we'd want to copy for this.  I envision something
like

   select hdr.*, foo.*
 from tuple_header_details('foo'::regclass) as hdr
  left join foo on hdr.ctid = foo.ctid;

On a large table you might want a version that restricts its scan
to pages M through N, but that's just optimization.  More useful
would be to improve the planner's intelligence about joins on ctid ...
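As a quick illustration of the current state (a sketch, assuming a second
table bar; the planner today only produces a TID Scan for a comparison of
ctid against a constant, not for a join):

    EXPLAIN SELECT * FROM foo WHERE ctid = '(0,1)'::tid;
    -- uses a TID Scan

    EXPLAIN SELECT * FROM foo f JOIN bar b ON f.ctid = b.ctid;
    -- no TID-based join strategy; falls back to a generic join plan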

 [ removing system columns from pg_attribute ]
 I think this will inevitably break a lot of code, not all of it ours,
 so I'm not in favor of pursuing that direction.

 Are you thinking of client or extension code? From what I've looked at I
 don't think it's all that likely to break much of either.

It will break anything that assumes that every column is represented in
pg_attribute.  I think if you think this assumption is easily removed,
you've not looked hard enough.

 It would make pg_attribute a fair bit smaller, especially on systems
 with lots of narrow relations.

I'd like to do that too, but I think getting rid of xmin/xmax/cmin/cmax
would be enough to get most of the benefit, and we could do that without
any inconsistency if we stopped exposing those as system columns.

regards, tom lane




Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Robert Haas
On Thu, Jan 2, 2014 at 2:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 As I commented to Robert, the page-at-a-time behavior of pageinspect
 is not an API detail we'd want to copy for this.  I envision something
 like

select hdr.*, foo.*
  from tuple_header_details('foo'::regclass) as hdr
   left join foo on hdr.ctid = foo.ctid;

 On a large table you might want a version that restricts its scan
 to pages M through N, but that's just optimization.  More useful
 would be to improve the planner's intelligence about joins on ctid ...

/me blinks.

Surely you're not serious.  That's going to scan the whole darn table
even if we only want the details for one row.  And making the planner
smarter about joins on CTID doesn't help a bit, unless you can push
the join qual down *inside the tuple_header_details() function call*.
That'd be a pretty sweet thing to allow for SRFs in general,
but it doesn't sound like something we're going to conjure up at the
drop of a hat.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 We could certainly add a function that returns SETOF record, taking
 e.g. regclass as an argument, but it doesn't seem a stretch to me to
 think that you might want to get tuple header information for some but
 not all tuples in the relation, and I don't see any real good way to
 tell the function exactly what tuples you want except by invoking it
 once per TID.

I have no objection to having a function that retrieves the details for
a given TID alongside one that does it for a whole relation.  The point
here is just that we should be headed in the direction of removing as
many system columns as we can, not adding more; especially not ones that
(a) have no purpose except forensics and (b) are virtually certain to
change across system versions.
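A sketch of how the pair could look side by side (the function names and
signatures here are hypothetical; neither exists in core):

    -- whole relation, invisible tuples included
    SELECT * FROM tuple_header_details('foo'::regclass);

    -- a single tuple, identified by TID
    SELECT * FROM tuple_header_details('foo'::regclass, '(0,1)'::tid);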

regards, tom lane




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-02 Thread David Rowley
On Fri, Jan 3, 2014 at 5:33 AM, Erik Rijkers e...@xs4all.nl wrote:

 *** /var/data1/pg_stuff/pg_sandbox/pgsql.inverse/src/test/regress/expected/window.out  2014-01-02 16:19:48.0 +0100
 --- /var/data1/pg_stuff/pg_sandbox/pgsql.inverse/src/test/regress/results/window.out   2014-01-02 16:21:43.0 +0100
 ***
 *** 1188,1195 
sum
   --
6.01
 ! 5
 ! 3
   (3 rows)

   SELECT i,COUNT(v) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND
 UNBOUNDED FOLLOWING)
 --- 1188,1195 
sum
   --
6.01
 !  5.00
 !  3.00
   (3 rows)


I've left those failures in for now in the hope of generating some discussion
on whether we can use an inverse transition for sum(numeric). Please see my
email before the previous one for details. To fix it, pg_aggregate.h just
needs to be changed to remove the inverse transition for sum(numeric).
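For reference, the failing queries boil down to a numeric sum over a
shrinking window; the difference comes from numeric's display scale, which
an inverse transition cannot ratchet back down once a higher-scale value has
passed through the aggregate state. A minimal sketch:

    SELECT i,
           sum(v) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW
                                            AND UNBOUNDED FOLLOWING)
      FROM (VALUES (1, 1.01::numeric), (2, 2), (3, 3)) t(i, v);
    -- expected: 6.01, 5, 3; with the inverse transition: 6.01, 5.00, 3.00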


Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jan 2, 2014 at 2:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 As I commented to Robert, the page-at-a-time behavior of pageinspect
 is not an API detail we'd want to copy for this.  I envision something
 like
 
 select hdr.*, foo.*
 from tuple_header_details('foo'::regclass) as hdr
 left join foo on hdr.ctid = foo.ctid;
 
 On a large table you might want a version that restricts its scan
 to pages M through N, but that's just optimization.  More useful
 would be to improve the planner's intelligence about joins on ctid ...

 /me blinks.

 Surely you're not serious.  That's going to scan the whole darn table
 even if we only want the details for one row.

And?  The complaint about the other function was that it was inefficient
when getting the details for a whole table, so I don't think you can
complain about an approach that handles that case well.  Use the other
function if you just want info for one row (that you can see).

regards, tom lane




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-02 Thread Peter Geoghegan
I decided to make at least a cursory attempt to measure or
characterize the performance of each of our approaches to value
locking. Being fair here is a non-trivial matter, because of the fact
that upserts can behave quite differently based on the need to insert
or update, lock contention and so on. Also, I knew that anything I
came up with would not be comparing like with like: as things stand,
the btree locking code is more or less correct, and the alternative
exclusion constraint supporting implementation is more or less
incorrect (of course, you may yet describe a way to fix the
unprincipled deadlocking previously revealed by my testcase [1], but
it is far from clear what impact this fix will have on performance).
Still, something is better than nothing.

This was run on a Linux server (Linux 3.8.0-31-generic
#46~precise1-Ubuntu) with these specifications:
https://www.hetzner.de/en/hosting/produkte_rootserver/ex40 .
Everything fits in shared_buffers, but the I/O system is probably the
weakest link here.

To be 100% clear, I am comparing
btreelock_insert_on_dup.v5.2013_12_28.patch.gz [2] with
exclusion_insert_on_dup.2013_12_19.patch.gz [3]. I'm also testing a
third approach, involving avoidance of exclusive buffer locks and
heavyweight locks for upserts in the first phase of speculative
insertion. That patch is unposted, but shows a modest improvement over
[2].

I ran this against the table foo:

pgbench=# \d+ foo
                     Table "public.foo"
 Column |  Type   | Modifiers | Storage  | Stats target | Description
--------+---------+-----------+----------+--------------+-------------
 a      | integer | not null  | plain    |              |
 b      | integer |           | plain    |              |
 c      | text    |           | extended |              |
Indexes:
    "foo_pkey" PRIMARY KEY, btree (a)
Has OIDs: no

My custom script was:

\setrandom rec 1 :scale
with rej as (insert into foo(a, b, c) values(:rec, :rec, 'insert') on duplicate key lock for update returning rejects *) update foo set c = 'update' from rej where foo.a = rej.a;

I specified that each pgbench client in each run should last for 200k
upserts (with 100k possible distinct key values), not that it should
last some number of seconds. The idea is that there is a reasonable
mix of inserts and updates initially, for lower client counts, but
exactly the same number of queries are run for each patch, so as to
normalize the effects of contention across both runs (this sure is
hand-wavy, but likely better than nothing). I'm just looking for
approximate numbers here, and I'm sure that you could find more than
one way to benchmark this feature, with varying degrees of sympathy
towards each of our two approaches to value locking. This benchmark
isn't sympathetic to btree locking at all, because there is a huge
amount of contention for the higher client counts, with 100% of
possible rows updated by the time we're done at 16 clients, for
example.

To compensate somewhat for the relatively low duration of each run, I
take an average-of-5, rather than an average-of-3 as representative
for each client count + run/patch combination.

Full report of results are here:
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/upsert-cmp/

My executive summary is that the exclusion patch performs about the
same on lower client counts, presumably due to not having the
additional window of btree lock contention. By 8 clients, the
exclusion patch does noticeably better, but it's a fairly modest
improvement.

Forgive me if I'm belaboring the point, but even though I'm
benchmarking the simplest possible upsert statements, had I chosen
small pgbench scale factors (e.g. scales that would see 100 - 1000
possible distinct key values in total) the btree locking
implementation would surely win very convincingly, just because the
alternative implementation would spend almost all of its time
deadlocked, waiting for the deadlock detector to free clients in one
second deadlock_timeout cycles. My goal here was just to put a rough
number on how these two approaches compare, while trying to be as fair
as possible.

I have to wonder about the extent to which the exclusion approach
benefits from holding old value locks. So even if the unprincipled
deadlocking issue can be fixed without much additional cost, it might
be that the simple fact that that approach holds those pseudo value
locks (i.e. old dead rows from previous iterations on the same tuple
slot) indefinitely helps performance, and losing that property alone
will hurt performance, even though it's necessary.

For those that wonder what the effect on multiple unique index would
be, that isn't really all that relevant, since contention on multiple
unique indexes isn't expected with idiomatic usage (though I suppose
an upsert's non-HOT update would have to compete).

[1] 
http://www.postgresql.org/message-id/cam3swzshbe29kpod44cvc3vpzjgmder6k_6fghiszeozgmt...@mail.gmail.com

[2] 

[HACKERS] Streaming replication bug in 9.3.2, WAL contains references to invalid pages

2014-01-02 Thread Christophe Pettus
Greetings,

We've had two clients experience a crash on the secondary of a streaming 
replication pair, running PostgreSQL 9.3.2.  In both cases, the messages were 
close to this example:

2013-12-30 18:08:00.464 PST,,,23869,,52ab4839.5d3d,16,,2013-12-13 09:47:37 
PST,1/0,0,WARNING,01000,page 45785 of relation base/236971/365951 is 
uninitialized,xlog redo vacuum: rel 1663/236971/365951; blk 45794, 
lastBlockVacuumed 45784
2013-12-30 18:08:00.465 PST,,,23869,,52ab4839.5d3d,17,,2013-12-13 09:47:37 
PST,1/0,0,PANIC,XX000,WAL contains references to invalid pages,xlog redo 
vacuum: rel 1663/236971/365951; blk 45794, lastBlockVacuumed 45784
2013-12-30 18:08:00.950 PST,,,23866,,52ab4838.5d3a,8,,2013-12-13 09:47:36 
PST,,0,LOG,0,startup process (PID 23869) was terminated by signal 6: 
Aborted,

In both cases, the indicated relation was a primary key index.  In one case, 
rebuilding the primary key index caused the problem to go away permanently (to 
date).  In the second case, the problem returned even after a full dump / 
restore of the master database (that is, after a dump / restore of the master, 
and reimaging the secondary, the problem returned at the same primary key 
index, although of course with a different OID value).

It looks like this has been experienced on 9.2.6, as well:


http://www.postgresql.org/message-id/flat/CAL_0b1s4QCkFy_55kk_8XWcJPs7wsgVWf8vn4=jxe6v4r7h...@mail.gmail.com

Let me know if there's any further information I can provide.

Best,
--
-- Christophe Pettus
   x...@thebuild.com





Re: [HACKERS] ERROR: missing chunk number 0 for toast value

2014-01-02 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-01-02 21:21:15 +0200, Heikki Linnakangas wrote:
 I don't see any other realistic way to fix this, however, so maybe we
 should just bite the bullet and do it anyway.

 We could remember the subtransaction a variable was created in and error
 out if the creating subtransaction aborted and it's not a
 pass-by-value datum or similar.

That would still result in throwing an error, though, so it isn't likely
to make the OP happy.  I was wondering if we could somehow arrange to not
release the subtransaction's AccessShareLock on the table, as long as it
was protecting toasted references someplace.

regards, tom lane




Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Robert Haas
On Thu, Jan 2, 2014 at 2:56 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Jan 2, 2014 at 2:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 As I commented to Robert, the page-at-a-time behavior of pageinspect
 is not an API detail we'd want to copy for this.  I envision something
 like

 select hdr.*, foo.*
 from tuple_header_details('foo'::regclass) as hdr
 left join foo on hdr.ctid = foo.ctid;

 On a large table you might want a version that restricts its scan
 to pages M through N, but that's just optimization.  More useful
 would be to improve the planner's intelligence about joins on ctid ...

 /me blinks.

 Surely you're not serious.  That's going to scan the whole darn table
 even if we only want the details for one row.

 And?  The complaint about the other function was that it was inefficient
 when getting the details for a whole table, so I don't think you can
 complain about an approach that handles that case well.  Use the other
 function if you just want info for one row (that you can see).

Well, that's fair enough.  I don't mind having two functions.  Should
the whole-table function also include invisible tuples?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Well, that's fair enough.  I don't mind having two functions.  Should
 the whole-table function also include invisible tuples?

Certainly, that's exactly why I was proposing it.  You can do a join
if you want to suppress them.

regards, tom lane




Re: [HACKERS] preserving forensic information when we freeze

2014-01-02 Thread Andres Freund
On 2014-01-02 14:44:34 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-01-02 12:46:34 -0500, Tom Lane wrote:
  For real
  forensics work, you need to be able to see all tuples, which makes me
  think that something akin to pgstattuple is the right API; that is return
  a set of the header info for all tuples on such-and-such pages of this
  relation.  That should dodge any performance problem, because the
  heap_open overhead could be amortized across lots of tuples, and it also
  sidesteps all problems with adding new system columns.
 
  The biggest problem with such an API is that it's painful to use - I've
  used pageinspect a fair bit, and not being able to easily get the
  content of the rows you're looking at makes it really far less useful in
  many scenarios. That could partially be improved by a neater API
 
 Surely.  Why couldn't you join against the table on ctid?

For the case of pageinspect it's because pageinspect doesn't return the
ctid of a tuple in a useful way - its t_ctid is HeapTupleHeader->t_ctid,
not HeapTuple->t_self...
In many cases bulk access really isn't all that useful - you do a SELECT
searching for data that's looking strange and then need the forensic
data for those. That's just painful with any of the proposed fast APIs
afaics.
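For comparison, the closest thing possible today is a sketch like the
following; note the join is only reliable for tuples that were never
updated, because t_ctid is the header field rather than the tuple's own
address:

    SELECT p.t_xmin, p.t_xmax, p.t_infomask, f.*
      FROM heap_page_items(get_raw_page('foo', 0)) p
      JOIN foo f ON f.ctid = p.t_ctid;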

  And I really don't see any page-at-a-time access that's going to be
  convenient.
 
 As I commented to Robert, the page-at-a-time behavior of pageinspect
 is not an API detail we'd want to copy for this.  I envision something
 like
 
select hdr.*, foo.*
  from tuple_header_details('foo'::regclass) as hdr
   left join foo on hdr.ctid = foo.ctid;
 
 On a large table you might want a version that restricts its scan
 to pages M through N, but that's just optimization.  More useful
 would be to improve the planner's intelligence about joins on ctid ...

That really makes for butt-ugly queries. E.g. the database I found the
multixact bugs on was ~300GB and I had to look about 80GB of it. So I
would have had to write chunking code for individual tables. Not what
you want to do when shit has hit the fan.

  [ removing system columns from pg_attribute ]
  I think this will inevitably break a lot of code, not all of it ours,
  so I'm not in favor of pursuing that direction.
 
  Are you thinking of client or extension code? From what I've looked at I
  don't think it's all that likely to break much of either.
 
 It will break anything that assumes that every column is represented in
 pg_attribute.  I think if you think this assumption is easily removed,
 you've not looked hard enough.

Uh. And how much code actually is that? Note that system columns already
aren't in a Relation's TupleDesc. So it's not that they would be missing from
that - and if that were the issue we could easily add them there when
the cache entry is built.

There really isn't much code in postgres itself that iterates over all
columns including system columns. Some bits around heap.c, tablecmd.c,
lsyscache.c do lookups by name, but they are easily converted to
SystemAttributeByName()/SystemAttributeDefinition().

Most non DDL code doesn't care at all.

  It would make pg_attribute a fair bit smaller, especially on systems
  with lots of narrow relations.
 
 I'd like to do that too, but I think getting rid of xmin/xmax/cmin/cmax
 would be enough to get most of the benefit, and we could do that without
 any inconsistency if we stopped exposing those as system columns.

Well, I have yet to see any realistic proposal to get rid of
them. Having to write significantly more complex and/or significantly more
expensive queries doesn't qualify.

And there really is code out there using xmin/xmax as part of their
code, so I think the rumor of nobody crying about their near death is
just that, a rumor.
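(One such pattern, as a sketch: optimistic locking that re-checks xmin
before applying a change, with '12345' standing in for the previously read
value.)

    SELECT id, xmin FROM foo WHERE id = 42;
    -- later: only update if nothing has changed the row in between
    UPDATE foo SET val = 'new' WHERE id = 42 AND xmin = '12345'::xid;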

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ERROR: missing chunk number 0 for toast value

2014-01-02 Thread Andres Freund
On 2014-01-02 15:00:58 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-01-02 21:21:15 +0200, Heikki Linnakangas wrote:
  I don't see any other realistic way to fix this, however, so maybe we
  should just bite the bullet and do it anyway.
 
  We could remember the subtransaction a variable was created in and error
  out if the creating subtransaction aborted and it's not a
  pass-by-value datum or similar.
 
 That would still result in throwing an error, though, so it isn't likely
 to make the OP happy.

Yea, it would give a better error message which might help diagnose the
issue, but not more. We could disallow accessing such variables
generally unless they explicitly had been detoasted, that would make
people notice the problem more easily.

I shortly wondered if we couldn't just iterate over plpgsql variables
and detoast them on subabort if created in the aborted xact, but that
doesn't really work because we're in an aborted transaction where it
might not be safe to access relations... Theoretically the subabort
could be split into two phases allowing it by only releasing the lock
after safely switching to the upper transaction but that sounds like a
hammer too big for the problem.

 I was wondering if we could somehow arrange to not
 release the subtransaction's AccessShareLock on the table, as long as it
 was protecting toasted references someplace.

Sounds fairly ugly...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] CLUSTER FREEZE

2014-01-02 Thread Robert Haas
On Mon, Dec 23, 2013 at 6:53 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-12-22 20:45:02 -0500, Robert Haas wrote:
 I suspect we ought to extend this to rewriting variants of ALTER TABLE
 as well, but a little thought is needed there.  ATRewriteTables()
 appears to just call heap_insert() for each updated row, which if I'm
 not mistaken is an MVCC violation - offhand, it seems like a
 transaction with an older MVCC snapshot could access the table for
 this first time after the rewriter commits, and fail to see rows which
 would have appeared to be there before the rewrite. (I haven't
 actually tested this, so watch me be wrong.)  If we're OK with an MVCC
 violation here, we could just pass HEAP_INSERT_FROZEN and have a
 slightly different flavor of violation; if not, this needs some kind
 of more extensive surgery quite apart from what we do about freezing.

 Yes, rewriting ALTER TABLEs certainly aren't MVCC safe. I thought that
 was documented, but apparently not.
 I am not sure it can be fixed easily using the tricks CLUSTER plays,
 there might be nasty edge-cases because of the changes in the column
 definition. Certainly not a trivial project.

 I think we should leave ALTER TABLE as a completely separate project and
 just improve CLUSTER for now. The practical impact of rewriting ALTER
 TABLEs not freezing is far smaller, because they are very seldomly
 performed in bigger databases.

OK, I have committed my patch to make CLUSTER and VACUUM FULL freeze
aggressively, and have left ALTER TABLE alone for now.

It would be nice to get to the point where a database-wide VACUUM FULL
would serve to bump datfrozenxid, so as to avoid having to give this
sort of advice:

http://www.postgresql.org/message-id/CA+Tgmobth=aqkwmwtcsqlaenv59gt4g3oqpqs45cb+fvg9m...@mail.gmail.com

However, it doesn't, quite: a bare VACUUM FULL now bumps relfrozenxid
for every table in the database *except pg_class*.  It does call
vac_update_datfrozenxid() afterwards, but that only helps if pg_class
is not among the things holding back datfrozenxid.  I haven't dug into
this much yet, but I think it might be worth fixing.
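(A quick way to see which tables are holding datfrozenxid back, pg_class
included, as a sketch:)

    SELECT c.relname, age(c.relfrozenxid)
      FROM pg_class c
     WHERE c.relkind IN ('r', 't')
     ORDER BY 2 DESC
     LIMIT 5;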

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ERROR: missing chunk number 0 for toast value

2014-01-02 Thread Robert Haas
On Thu, Jan 2, 2014 at 3:19 PM, Andres Freund and...@2ndquadrant.com wrote:
 I was wondering if we could somehow arrange to not
 release the subtransaction's AccessShareLock on the table, as long as it
 was protecting toasted references someplace.

 Sounds fairly ugly...

I think the only principled fixes are to either retain the lock or
forcibly detoast before releasing it.  The main problem I see with
retaining the lock is that you'd need a way of finding out the
relation OIDs of all toast pointers you might later decide to expand.
I don't have an amazingly good idea about how to figure that out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ERROR: missing chunk number 0 for toast value

2014-01-02 Thread Andres Freund
On 2014-01-02 16:05:09 -0500, Robert Haas wrote:
 On Thu, Jan 2, 2014 at 3:19 PM, Andres Freund and...@2ndquadrant.com wrote:
  I was wondering if we could somehow arrange to not
  release the subtransaction's AccessShareLock on the table, as long as it
  was protecting toasted references someplace.
 
  Sounds fairly ugly...
 
 I think the only principled fixes are to either retain the lock or
 forcibly detoast before releasing it.

I don't think that's sufficient. Unless I miss something the problem
isn't restricted to TRUNCATE and such at all. I think a plain VACUUM
should be sufficient? I haven't tested it, but inserting a row with INSERT
... RETURNING toasted_col, storing the result in a record, and then aborting
the subtransaction will allow the inserted row to be VACUUMed by a
concurrent transaction.
So I don't think anything along those lines will be sufficient.
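A sketch of that scenario, assuming the table t(c text) from the test case
earlier in this thread:

    DO $$
    DECLARE
      r record;
    BEGIN
      BEGIN
        INSERT INTO t(c) VALUES (repeat(md5('x'), 10000)) RETURNING * INTO r;
        RAISE EXCEPTION 'force subtransaction abort';
      EXCEPTION WHEN OTHERS THEN
        NULL;  -- r still holds a toast pointer into the now-dead row
      END;
      -- the aborted subtransaction's locks are gone; run VACUUM t from
      -- another session during this sleep and the toast rows disappear
      PERFORM pg_sleep(30);
      RAISE NOTICE '%', length(r.c);  -- this detoast hits the missing chunks
    END $$;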

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-02 Thread Peter Geoghegan
On Thu, Jan 2, 2014 at 8:08 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Yeah, it seems like PromiseTupleInsertionLockAcquire should be locking
 the tuple, rather than the XID.

 Well, that would be ideal, because we already have tuple locks. It would be
 nice to use the same concept for this. It's a bit tricky, however. I guess
 the most straightforward way to do it would be to grab a heavy-weight lock
 after you've inserted the tuple, but before releasing the buffer lock. I
 don't immediately see a problem with that, although it's a bit scary to
 acquire a heavy-weight lock while holding a buffer lock.

That's a really big modularity violation. Everything after
RelationPutHeapTuple() but before the buffer unlock in heap_insert()
is currently a critical section. I'm not saying that it can't be done,
but it certainly is scary.

We also have heavyweight page locks, currently used by hash indexes.
That approach does not require us to contort the row locking code, and
certainly does not require us to acquire heavyweight locks with buffer
locks already held. I could understand your initial disinclination to
doing things this way, particularly when the unprincipled deadlocking
problem was not well understood, but I think that this must tip the
balance in favor of the approach I advocate. What I've done with
heavyweight locks is a modest, localized, logical expansion on the
existing mechanism, that is easy to reason about, with room for
further optimization in the future, that still has reasonable
performance characteristics today, including I believe better
worst-case latency. Heavyweight locks on btree pages are very well
precedented, if you look beyond Postgres.

-- 
Peter Geoghegan




Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch

2014-01-02 Thread Tom Lane
I wrote:
 It occurs to me that, rather than trying to improve the struct definition
 methodology, maybe we should just add static asserts to catch any
 inconsistency here.  It wouldn't be that hard:

 #define PGSTAT_MAX_MSG_SIZE   1000
 #define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
 ... all else in pgstat.h as before ...

 StaticAssertStmt(sizeof(PgStat_MsgTabstat) <= PGSTAT_MAX_MSG_SIZE,
                  "bad PgStat_MsgTabstat size");
 ... and similarly for other pgstat message structs ...

After further thought it seems to me that this is a desirable approach,
because it is a direct check of the property we want, and will complain
about *any* mistake that results in too-large struct sizes.  The other
ideas that were kicked around upthread still left a lot of ways to mess up
the array size calculations, for instance referencing the wrong array
element type in the sizeof calculation.  So unless anyone has a concrete
objection, I'll go put in something like this along with Mark's fix.

regards, tom lane




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-02 Thread Peter Geoghegan
On Thu, Jan 2, 2014 at 2:37 AM, Andres Freund and...@2ndquadrant.com wrote:
 Locking the definitely visible row only works if there's a row matching
 the index's columns. If the values of the new row don't have
 corresponding values in all the indexes you have the same old race
 conditions again.

I still don't get it - perhaps you should break down exactly what you
mean with an example. I'm talking about potentially doing multiple
upserts per row proposed for insertion to handle multiple conflicts,
perhaps with some deletes between upserts, not just one upsert with a
single update part.

 I think to be useful for many cases you really need to be able to ask
 for a potentially conflicting row and be sure that if there's none you
 are able to insert the row separately.

Why? What work do you need to perform after reserving the right to
insert but before inserting? Can't you just upsert resulting in
insert, and then perform that work, potentially deleting the row
inserted if and when you change your mind? Is there any real
difference between what that does for you, and what any particular
variety of promise tuple might do for you?
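(For concreteness, the kind of statement under discussion, using the
patch's proposed syntax as exercised in the benchmark upthread:)

    WITH rej AS (
      INSERT INTO foo(a, b, c) VALUES (1, 1, 'insert')
      ON DUPLICATE KEY LOCK FOR UPDATE
      RETURNING REJECTS *
    )
    UPDATE foo SET c = 'update' FROM rej WHERE foo.a = rej.a;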

-- 
Peter Geoghegan




Re: [HACKERS] Planning time in explain/explain analyze

2014-01-02 Thread Andreas Karlsson

On 01/02/2014 04:08 AM, Robert Haas wrote:

I'm wondering whether the time should be stored inside the PlannedStmt
node instead of passing it around separately. One possible problem
with the way you've done things here is that, in the case of a
prepared statement, EXPLAIN ANALYZE will emit the time needed to call
GetCachedPlan(), even if that function didn't do any replanning.  Now
you could argue that this is the correct behavior, but I think there's
a decent argument that what we ought to show there is the amount of
time that was required to create the plan that we're displaying at the
time it was created, rather than the amount of time that was required
to figure out that we didn't need to replan.


Since we support re-planning of prepared statements I would argue the 
most useful thing is to print the time it took to fetch the old plan or 
to create a new one as the planning time, but I am not a heavy user of 
prepared statements.



Also, I am inclined to think we ought to update the examples, rather
than say this:

+rewriting and parsing. We will not include this line in later examples in
+this section.
+   </para>


Ok, I will fix this in the next version of the patch.

--
Andreas Karlsson




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-02 Thread Erik Rijkers
On Thu, January 2, 2014 17:33, Erik Rijkers wrote:
 On Thu, January 2, 2014 13:36, Erik Rijkers wrote:
 On Thu, January 2, 2014 13:05, David Rowley wrote:
 here's a slightly updated patch
 [inverse_transition_functions_v1.8.patch.gz ]

 patch applies, and compiles (although with new warnings).
 But make check complains loudly

To figure out where this 'make check' failed, I lifted a few statements from 
the offending sql:
src/test/regress/sql/window.sql (see snafu.sh below).

That reliably crashes the server.  It is caused by a SUM, but only when 
configured like this (i.e. *not* configured for
speed):

$ pg_config --configure
'--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.inverse'
'--bindir=/home/aardvark/pg_stuff/pg_installations/pgsql.inverse/bin'
'--libdir=/home/aardvark/pg_stuff/pg_installations/pgsql.inverse/lib' 
'--with-pgport=6554' '--enable-depend'
'--enable-cassert' '--enable-debug' '--with-openssl' '--with-perl' 
'--with-libxml' '--with-libxslt' '--with-ossp-uuid'
'--with-zlib'


$ cat snafu.sh
#!/bin/sh

echo "
--
-- WINDOW FUNCTIONS
--

   drop TABLE if exists empsalary ;

-- CREATE TEMPORARY TABLE empsalary (
   CREATE  TABLE empsalary (
depname varchar,
empno bigint,
salary int,
enroll_date date
);

INSERT INTO empsalary VALUES
('develop', 10, 5200, '2007-08-01'),
('sales', 1, 5000, '2006-10-01'),
('personnel', 5, 3500, '2007-12-10'),
('sales', 4, 4800, '2007-08-08'),
('personnel', 2, 3900, '2006-12-23'),
('develop', 7, 4200, '2008-01-01'),
('develop', 9, 4500, '2008-01-01'),
('sales', 3, 4800, '2007-08-01'),
('develop', 8, 6000, '2006-10-01'),
('develop', 11, 5200, '2007-08-15');

-- SELECT depname, empno, salary, sum(salary) OVER (PARTITION BY depname) FROM empsalary ORDER BY depname, salary;

" | psql

echo "select * from empsalary;" | psql

echo "
   SELECT depname, empno, salary, sum(salary) OVER (PARTITION BY depname) FROM 
empsalary ORDER BY depname, salary;
" | psql

$ ./snafu.sh
Timing is on.
DROP TABLE
Time: 1.093 ms
CREATE TABLE
Time: 80.161 ms
INSERT 0 10
Time: 11.964 ms
Timing is on.
  depname  | empno | salary | enroll_date
-----------+-------+--------+-------------
 develop   |    10 |   5200 | 2007-08-01
 sales     |     1 |   5000 | 2006-10-01
 personnel |     5 |   3500 | 2007-12-10
 sales     |     4 |   4800 | 2007-08-08
 personnel |     2 |   3900 | 2006-12-23
 develop   |     7 |   4200 | 2008-01-01
 develop   |     9 |   4500 | 2008-01-01
 sales     |     3 |   4800 | 2007-08-01
 develop   |     8 |   6000 | 2006-10-01
 develop   |    11 |   5200 | 2007-08-15
(10 rows)

Time: 1.854 ms
Timing is on.
connection to server was lost

So, to repeat, this runs fine on a server compiled for speed.

I haven't looked any further (whether perhaps more statements are faulty)









Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch

2014-01-02 Thread Mark Dilger
I still don't understand why this case in src/include/pgstat.h
is different from cases elsewhere in the code.  Taken from
src/include/access/heapam_xlog.h:


typedef struct xl_heap_header
{
    uint16  t_infomask2;
    uint16  t_infomask;
    uint8   t_hoff;
} xl_heap_header;

#define SizeOfHeapHeader    (offsetof(xl_heap_header, t_hoff) + sizeof(uint8))



Now, if somebody changed t_hoff to be a uint16, that SizeOfHeapHeader
macro would be wrong.  Should we put a static assert in the code for that?
I have no objection, but it seems you like the static assert in one place
and not the other, and (perhaps due to some incredible ignorance on my
part) I cannot see why.  I tried looking for an assert of this kind already in
the code.  The use of this macro is in src/backend/access/heap/heapam.c,
but I didn't see any asserts for it, though there are lots of asserts for other
stuff.  Maybe I just didn't recognize it?


mark




On Thursday, January 2, 2014 2:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 
I wrote:
 It occurs to me that, rather than trying to improve the struct definition
 methodology, maybe we should just add static asserts to catch any
 inconsistency here.  It wouldn't be that hard:

 #define PGSTAT_MAX_MSG_SIZE    1000
 #define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
 ... all else in pgstat.h as before ...

 StaticAssertStmt(sizeof(PgStat_MsgTabstat) <= PGSTAT_MAX_MSG_SIZE,
          "bad PgStat_MsgTabstat size");
 ... and similarly for other pgstat message structs ...

After further thought it seems to me that this is a desirable approach,
because it is a direct check of the property we want, and will complain
about *any* mistake that results in too-large struct sizes.  The other
ideas that were kicked around upthread still left a lot of ways to mess up
the array size calculations, for instance referencing the wrong array
element type in the sizeof calculation.  So unless anyone has a concrete
objection, I'll go put in something like this along with Mark's fix.


            regards, tom lane



Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-02 Thread Erik Rijkers
On Fri, January 3, 2014 00:09, Erik Rijkers wrote:


 connection to server was lost

 So, to repeat, this runs fine on a server compiled for speed.


I forgot to append the log messages:

2014-01-03 00:19:17.073 CET 14054 LOG:  database system is ready to accept 
connections
TRAP: FailedAssertion("!(!((bool) ((invtransfn_oid) != ((Oid) 0))))", File: 
"parse_agg.c", Line: 1255)
2014-01-03 00:19:29.605 CET 14054 LOG:  server process (PID 14143) was 
terminated by signal 6: Aborted
2014-01-03 00:19:29.605 CET 14054 DETAIL:  Failed process was running: SELECT 
depname, empno, salary, sum(salary) OVER
(PARTITION BY depname) FROM empsalary ORDER BY depname, salary;
2014-01-03 00:19:29.605 CET 14054 LOG:  terminating any other active server 
processes
2014-01-03 00:19:29.607 CET 14054 LOG:  all server processes terminated; 
reinitializing
etc. etc.






Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch

2014-01-02 Thread Andres Freund
On 2014-01-02 15:15:58 -0800, Mark Dilger wrote:
 I still don't understand why this case in src/include/pgstat.h
 is different from cases elsewhere in the code.  Taken from
 src/include/access/heapam_xlog.h:
 
 
 typedef struct xl_heap_header
 {
     uint16  t_infomask2;
     uint16  t_infomask;
     uint8   t_hoff;
 } xl_heap_header;
 
 #define SizeOfHeapHeader    (offsetof(xl_heap_header, t_hoff) + sizeof(uint8))
 
 
 
 Now, if somebody changed t_hoff to be a uint16, that SizeOfHeapHeader
 macro would be wrong.  Should we put a static assert in the code for that?

The reason the various SizeOf* macros like SizeOfHeapHeader are written that
way is that we do not want to uselessly store trailing padding in the
WAL. E.g. sizeof(xl_heap_header) will be 6 bytes, but SizeOfHeapHeader
will be 5.
I don't see an easy way to guarantee this with asserts and I think you'd
notice pretty fast if you got things wrong there because WAL replay will
just have incomplete data.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch

2014-01-02 Thread Tom Lane
Mark Dilger markdil...@yahoo.com writes:
 I still don't understand why this case in src/include/pgstat.h
 is different from cases elsewhere in the code.

The reason why I'm exercised about it is that (a) somebody actually made a
mistake of this type, and (b) it wasn't caught by any automated testing.

The catalog and WAL-related examples you cite would probably crash
and burn in fairly obvious ways if somebody broke them --- for instance,
the most likely way to break SizeOfHeapHeader would be by adding another
field after t_hoff, but we'd notice that before long because said field
would be corrupted on arrival at a replication slave.

In contrast, messing up the pgstats message sizes would have no
consequences worse than a hard-to-detect, and probably platform-specific,
performance penalty for stats transmission.  So unless we think that's
of absolutely zero concern, adding a mechanism to make such bugs more
apparent seems useful.

I'm not against adding more assertions elsewhere, but it's a bit hard to
see what those asserts should test.  I don't see any practical way to
assert that field X is the last one in its struct, for instance.

regards, tom lane




Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch

2014-01-02 Thread Mark Dilger
I completely understand the padding issues that
you are dealing with.  I was mostly curious why
Tom wanted to use asserts to double-check the
code in one place, while happily not doing so in
what seemed the same kind of situation elsewhere.
He has since made the reason for that clear.





On Thursday, January 2, 2014 3:27 PM, Andres Freund and...@2ndquadrant.com 
wrote:
 
On 2014-01-02 15:15:58 -0800, Mark Dilger wrote:

 I still don't understand why this case in src/include/pgstat.h
 is different from cases elsewhere in the code.  Taken from
 src/include/access/heapam_xlog.h:
 
 
 typedef struct xl_heap_header
 {
     uint16  t_infomask2;
     uint16  t_infomask;
     uint8   t_hoff;
 } xl_heap_header;
 
 #define SizeOfHeapHeader    (offsetof(xl_heap_header, t_hoff) + sizeof(uint8))
 
 
 
 Now, if somebody changed t_hoff to be a uint16, that SizeOfHeapHeader
 macro would be wrong.  Should we put a static assert in the code for that?

The reason the various SizeOf* macros like SizeOfHeapHeader are written that
way is that we do not want to uselessly store trailing padding in the
WAL. E.g. sizeof(xl_heap_header) will be 6 bytes, but SizeOfHeapHeader
will be 5.
I don't see an easy way to guarantee this with asserts and I think you'd
notice pretty fast if you got things wrong there because WAL replay will
just have incomplete data.

Greetings,

Andres Freund

-- 
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Streaming replication bug in 9.3.2, WAL contains references to invalid pages

2014-01-02 Thread MauMau

From: Christophe Pettus x...@thebuild.com
We've had two clients experience a crash on the secondary of a streaming 
replication pair, running PostgreSQL 9.3.2.  In both cases, the messages 
were close to this example:


2013-12-30 18:08:00.464 PST,,,23869,,52ab4839.5d3d,16,,2013-12-13 09:47:37 
PST,1/0,0,WARNING,01000,page 45785 of relation base/236971/365951 is 
uninitialized,xlog redo vacuum: rel 1663/236971/365951; blk 45794, 
lastBlockVacuumed 45784
2013-12-30 18:08:00.465 PST,,,23869,,52ab4839.5d3d,17,,2013-12-13 09:47:37 
PST,1/0,0,PANIC,XX000,WAL contains references to invalid pages,xlog 
redo vacuum: rel 1663/236971/365951; blk 45794, lastBlockVacuumed 
45784
2013-12-30 18:08:00.950 PST,,,23866,,52ab4838.5d3a,8,,2013-12-13 09:47:36 
PST,,0,LOG,0,startup process (PID 23869) was terminated by signal 6: 
Aborted,


In both cases, the indicated relation was a primary key index.  In one case, 
rebuilding the primary key index caused the problem to go away permanently 
(to date).  In the second case, the problem returned even after a full dump 
/ restore of the master database (that is, after a dump / restore of the 
master, and reimaging the secondary, the problem returned at the same 
primary key index, although of course with a different OID value).


It looks like this has been experienced on 9.2.6, as well:


I've experienced this problem with 9.2.4 once at the end of last year, too. 
The messages were the same except the relation and page numbers.  In 
addition, I encountered a similar (possibly the same) problem with 9.1.6 
about a year ago.  At that time, I found several people in the pgsql-* MLs 
reporting similar problems over the past several years, but those were not 
solved.  There seems to be a big dangerous bug hiding somewhere.


Regards
MauMau





Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-02 Thread Peter Geoghegan
On Thu, Jan 2, 2014 at 11:58 AM, Peter Geoghegan p...@heroku.com wrote:
 My executive summary is that the exclusion patch performs about the
 same on lower client counts, presumably due to not having the
 additional window of btree lock contention. By 8 clients, the
 exclusion patch does noticeably better, but it's a fairly modest
 improvement.

I forgot to mention that synchronous_commit was turned off, so as to
eliminate noise that might have been added by commit latency, while
still obligating btree to WAL log everything with an exclusive buffer
lock held.


-- 
Peter Geoghegan




Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch

2014-01-02 Thread Mark Dilger
The mechanism that occurs to me (and I'm not wedded to
this idea) is:

typedef uint8 T_HOFF_TYPE;
typedef struct xl_heap_header
{
    uint16  t_infomask2;
    uint16  t_infomask;
    T_HOFF_TYPE t_hoff;
} xl_heap_header;

#define SizeOfHeapHeader    (offsetof(xl_heap_header, t_hoff) + 
sizeof(T_HOFF_TYPE))






On Thursday, January 2, 2014 3:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 
Mark Dilger markdil...@yahoo.com writes:
 I still don't understand why this case in src/include/pgstat.h
 is different from cases elsewhere in the code.

The reason why I'm exercised about it is that (a) somebody
 actually made a
mistake of this type, and (b) it wasn't caught by any automated testing.

The catalog and WAL-related examples you cite would probably crash
and burn in fairly obvious ways if somebody broke them --- for instance,
the most likely way to break SizeOfHeapHeader would be by adding another
field after t_hoff, but we'd notice that before long because said field
would be corrupted on arrival at a replication slave.

In contrast, messing up the pgstats message sizes would have no
consequences worse than a hard-to-detect, and probably platform-specific,
performance penalty for stats transmission.  So unless we think that's
of absolutely zero concern, adding a mechanism to make such bugs more
apparent seems useful.

I'm not against adding more assertions elsewhere, but it's a bit hard to
see what those asserts should test.  I don't see any practical way to
assert that field X is the last one in its struct, for instance.


            regards, tom lane



Re: [HACKERS] [PATCH] Store Extension Options

2014-01-02 Thread Robert Haas
On Thu, Jan 2, 2014 at 4:19 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-12-31 13:37:59 +0100, Pavel Stehule wrote:
   We use the namespace "ext" so that the internal code
  (src/backend/access/common/reloptions.c) skips some validations and stores
  the custom GUC.
  
  Do you think we don't need to use the "ext" namespace?
 

 yes - there be same mechanism as we use for GUC

 There is no existing mechanism to handle conflicts for GUCs. The
 difference is that for GUCs nearly no namespaced GUCs exist (plperl,
 plpgsql have some), but postgres defines at least autovacuum. and
 toast. namespaces for relation options.

I continue to think that the case for having this feature at all has
not been well-made.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch

2014-01-02 Thread Tom Lane
Mark Dilger markdil...@yahoo.com writes:
 The mechanism that occurs to me (and I'm not wedded to
 this idea) is:

 typedef uint8 T_HOFF_TYPE;
 typedef struct xl_heap_header
 {
     uint16  t_infomask2;
     uint16  t_infomask;
     T_HOFF_TYPE t_hoff;
 } xl_heap_header;

 #define SizeOfHeapHeader    (offsetof(xl_heap_header, t_hoff) + 
 sizeof(T_HOFF_TYPE))

Meh.  That does nothing for the "add a field in the wrong place" risk.
Yes, it would prevent people from changing the type of t_hoff without
updating the macro --- but I'm not convinced that defending against that
alone is worth any notational pain.  If you're changing t_hoff's type
without looking fairly closely at every reference to t_hoff, you're
practicing unsafe programming to begin with.

I wonder though if we could invent a macro that produces the sizeof
a struct field, and then use that in macros like this.  Something like

#define field_sizeof(typename, fieldname) \
sizeof(((typename *) NULL)->fieldname)

Compare the default definition of offsetof ...

regards, tom lane




Re: [HACKERS] truncating pg_multixact/members

2014-01-02 Thread Robert Haas
On Mon, Dec 30, 2013 at 10:59 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 One problem I see is length of time before freezing multis: they live
 for far too long, causing the SLRU files to eat way too much disk space.
 I ran burnmulti in a loop, creating multis of 3 members each, with a min
 freeze age of 50 million, and this leads to ~770 files in
 pg_multixact/offsets and ~2900 files in pg_multixact/members.  Each file
 is 32 pages long. 256kB apiece.  Probably enough to be bothersome.

 I think for computing the freezing point for multis, we should slash
 min_freeze_age by 10 or something like that.  Or just set a hardcoded
 one million.

Yeah.  Since we expect mxids to be composed at a much lower rate than
xids, we can keep pg_multixact small without needing to increase the
rate of full table scans.  However, it seems to me that we ought to
have GUCs for mxid_freeze_table_age and mxid_freeze_min_age.  There's
no principled way to derive those values from the corresponding values
for XIDs, and I can't see any reason to suppose that we know how to
auto-tune brand new values better than we know how to auto-tune their
XID equivalents that we've had for years.

One million is probably a reasonable default for mxid_freeze_min_age, though.
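(For reference, multixact wraparound state is already tracked alongside xid
state in 9.3, so the new GUCs would feed into the same kind of bookkeeping;
a sketch of how to inspect it:)

    SELECT datname, datfrozenxid, datminmxid FROM pg_database;
    SELECT relname, relfrozenxid, relminmxid FROM pg_class WHERE relkind = 'r';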

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Make various variables read-only (const)

2014-01-02 Thread Wim Lewis
By an odd coincidence, I also decided to try to const-ify libpq
recently, and noticed this thread as I was cleaning up my patch for
submission. For what it's worth, I've attached my patch to this message.
It doesn't move as much data into the text segment as Oskari Saarenmaa's
patch does, but it is less intrusive (simply adding const modifiers here
and there). I just went after the low-hanging fruit in libpq.

As an aside, it might make sense to make pg_encname_tbl and
pg_encname_tbl_sz static, since as far as I can tell those symbols are
never used outside of encnames.c nor are they likely to be. I didn't
make that change in this patch though.

On Mon, Dec 23, 2013, Robert Haas robertmh...@gmail.com wrote:
 And how much does this really affect data sharing?  Doesn't copy-on-write do 
 the same thing for writable data?

It can have a surprisingly large effect if read-only data gets
intermixed on pages with actual read-write data and can get COW'd
unnecessarily.

My motivation, though, was more about code correctness than memory
sharing, though sharing is a nice benefit--- I was examining unexpected
symbols in .data/.bss in case they represented a thread-safety problem
for my program linked against libpq. (They didn't, FWIW, except for the
known and documented issue with PQoidStatus(). Not that I really
expected to find a problem in libpq, but marking those structures const
makes it nice and clear that they're not mutable.)

diff --git a/src/backend/utils/mb/encnames.c b/src/backend/utils/mb/encnames.c
index 772d4a5..3f8c592 100644
--- a/src/backend/utils/mb/encnames.c
+++ b/src/backend/utils/mb/encnames.c
@@ -29,7 +29,7 @@
  * Karel Zak, Aug 2001
  * --
  */
-pg_encname pg_encname_tbl[] =
+const pg_encname   pg_encname_tbl[] =
 {
{
"abc", PG_WIN1258
@@ -291,7 +291,7 @@ pg_encname  pg_encname_tbl[] =
}   /* last */
 };
 
-unsigned int pg_encname_tbl_sz = \
+const unsigned int pg_encname_tbl_sz = \
 sizeof(pg_encname_tbl) / sizeof(pg_encname_tbl[0]) - 1;
 
 /* --
@@ -304,7 +304,7 @@ sizeof(pg_encname_tbl) / sizeof(pg_encname_tbl[0]) - 1;
 #else
 #define DEF_ENC2NAME(name, codepage) { #name, PG_##name, codepage }
 #endif
-pg_enc2name pg_enc2name_tbl[] =
+const pg_enc2name pg_enc2name_tbl[] =
 {
DEF_ENC2NAME(SQL_ASCII, 0),
DEF_ENC2NAME(EUC_JP, 20932),
@@ -356,7 +356,7 @@ pg_enc2name pg_enc2name_tbl[] =
  * This covers all encodings except MULE_INTERNAL, which is alien to gettext.
  * --
  */
-pg_enc2gettext pg_enc2gettext_tbl[] =
+const pg_enc2gettext pg_enc2gettext_tbl[] =
 {
{PG_SQL_ASCII, "US-ASCII"},
{PG_UTF8, "UTF-8"},
@@ -469,11 +469,11 @@ clean_encoding_name(const char *key, char *newkey)
  * Search encoding by encoding name
  * --
  */
-pg_encname *
+const pg_encname *
 pg_char_to_encname_struct(const char *name)
 {
unsigned int nel = pg_encname_tbl_sz;
-   pg_encname *base = pg_encname_tbl,
+   const pg_encname *base = pg_encname_tbl,
   *last = base + nel - 1,
   *position;
int result;
@@ -521,7 +521,7 @@ pg_char_to_encname_struct(const char *name)
 int
 pg_char_to_encoding(const char *name)
 {
-   pg_encname *p;
+   const pg_encname *p;
 
if (!name)
return -1;
@@ -545,7 +545,7 @@ pg_encoding_to_char(int encoding)
 {
if (PG_VALID_ENCODING(encoding))
{
-   pg_enc2name *p = &pg_enc2name_tbl[encoding];
+   const pg_enc2name *p = &pg_enc2name_tbl[encoding];
 
Assert(encoding == p->encoding);
return p->name;
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 6d1cd8e..08440e9 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -55,9 +55,9 @@ static FmgrInfo *ToClientConvProc = NULL;
 /*
  * These variables track the currently-selected encodings.
  */
-static pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
-static pg_enc2name *DatabaseEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
-static pg_enc2name *MessageEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
+static const pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
+static const pg_enc2name *DatabaseEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
+static const pg_enc2name *MessageEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
 
 /*
  * During backend startup we can't set client encoding because we (a)
diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index 45bc3c1..6d03a10 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -1720,7 +1720,7 @@ pg_eucjp_increment(unsigned char *charptr, int length)
  * XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h)
  *---
  */
-pg_wchar_tbl pg_wchar_table[] = {
+const pg_wchar_tbl 

Re: [HACKERS] [bug fix] connection service file doesn't take effect with ECPG apps

2014-01-02 Thread MauMau

From: Michael Meskes mes...@postgresql.org
However, I'd prefer to solve the problem slightly differently by not creating
an empty host variable instead of checking for it after the fact. But I take
it you don't mind that.

Fixed in HEAD and all back branches. Thanks for the report.


Thank you for committing the fix.  I'm comfortable with your change.

Regards
MauMau




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ERROR: missing chunk number 0 for toast value

2014-01-02 Thread Amit Kapila
On Fri, Jan 3, 2014 at 12:51 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 01/02/2014 02:24 PM, Rushabh Lathia wrote:

 Hi All,

 Test case:

 drop table if exists t;
 create table t(c text);
 insert into t values ('x'), (repeat(md5('abcdefghijklmnop'), 10000));
 select pg_column_size(c), pg_column_size(c || '') FROM t;

 CREATE OR REPLACE FUNCTION copy_toast_out() RETURNS VOID AS $$
 declare
  v text;
 BEGIN
  SELECT c INTO v FROM t WHERE c <> 'x';
  Select 1/0;
 Exception
  When Others Then
  PERFORM pg_sleep(30); -- go run TRUNCATE t in a 2nd session


  raise notice 'length :%', length(v || ''); -- force detoast


 END;
 $$ language plpgsql;

 postgres=# select copy_toast_out();
 ERROR:  missing chunk number 0 for toast value 16390 in pg_toast_16384
 CONTEXT:  PL/pgSQL function copy_toast_out() line 10 at RAISE

 Analysis:

 The basic problem here is that if the lock on the table is released
 before extracting the toasted value, and in the meantime someone
 truncates the table, this error can occur.  Here the error arises when
 the PL block contains an Exception block (because in that case it
 calls RollbackAndReleaseCurrentSubTransaction).


 This is another variant of the bug discussed here:
 http://www.postgresql.org/message-id/0c41674c-fa02-4768-9e1b-548e56887...@quarantainenet.nl.


 Do you think we should detoast the local variable before
 RollbackAndReleaseCurrentSubTransaction? Or any other options?


 Hmm, that would fix this particular test case, but not the other case where
 you DROP or TRUNCATE the table in the same transaction.

 The simplest fix would be to just detoast everything on assignment but that
 was rejected on performance grounds in that previous thread. I don't see any
 other realistic way to fix this, however, so maybe we should just bite the
 bullet and do it anyway. For a simple variable like the one in your test
 case, it's a good bet to detoast the value immediately; it'll be detoasted
 as soon as
 you try to do anything with it anyway. But it's not a good bet for record or
 row variables, because you often fetch the whole row into a variable but
 only access a field or two.

Yeah, this is exactly what came to my mind as well the first time I saw this
problem: for row and record variables, detoasting on assignment can impose a
penalty the user might not expect, since he might not even be using the
toasted values.

However, it might be possible to do the detoasting on assignment only when
the variable or function is declared with some specific construct.
For example, we already detoast at commit time for holdable portals
(see the code below):

/*
 * Change the destination to output to the tuplestore.  Note we tell
 * the tuplestore receiver to detoast all data passed through it.
 */
queryDesc->dest = CreateDestReceiver(DestTuplestore);
SetTuplestoreDestReceiverParams(..);

When the HOLD option is specified for a cursor, we perform the detoast
at commit time; along similar lines, if a specific variable or function
is declared with some particular construct, we could detoast on
assignment.
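
For illustration, here is a minimal sketch of what such an eager
detoast-on-assignment helper could look like. flatten_on_assign is a
hypothetical name, not an existing function; PG_DETOAST_DATUM_COPY and
VARATT_IS_EXTERNAL are existing backend macros (fmgr.h / postgres.h):

#include "postgres.h"
#include "fmgr.h"

/*
 * Force a pass-by-reference varlena value out of the toast table at
 * assignment time, so the variable no longer depends on the source
 * relation (and its toast table) remaining accessible.
 */
static Datum
flatten_on_assign(Datum val, bool typByVal, int16 typLen)
{
	/* only variable-length values can carry toast pointers */
	if (!typByVal && typLen == -1 &&
		VARATT_IS_EXTERNAL(DatumGetPointer(val)))
		return PointerGetDatum(PG_DETOAST_DATUM_COPY(val));

	return val;
}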

Another option is to give a more meaningful error, with a HINT suggesting
the possible cause of the error. This could be combined with the above
option for variables/functions that are not declared with the particular
construct.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-02 Thread David Rowley
On Fri, Jan 3, 2014 at 12:23 PM, Erik Rijkers e...@xs4all.nl wrote:

 On Fri, January 3, 2014 00:09, Erik Rijkers wrote:


  connection to server was lost
 
  So, to repeat, this runs fine on a server compiled for speed.
 

 I forgot to append the log messages:

 2014-01-03 00:19:17.073 CET 14054 LOG:  database system is ready to accept
 connections
 TRAP: FailedAssertion("!(!((bool) ((invtransfn_oid) != ((Oid) 0))))",
 File: "parse_agg.c", Line: 1255)
 2014-01-03 00:19:29.605 CET 14054 LOG:  server process (PID 14143) was
 terminated by signal 6: Aborted
 2014-01-03 00:19:29.605 CET 14054 DETAIL:  Failed process was running:
 SELECT depname, empno, salary, sum(salary) OVER
 (PARTITION BY depname) FROM empsalary ORDER BY depname, salary;
 2014-01-03 00:19:29.605 CET 14054 LOG:  terminating any other active
 server processes
 2014-01-03 00:19:29.607 CET 14054 LOG:  all server processes terminated;
 reinitializing
 etc. etc.



hmm, yeah, compiling and testing a build with asserts enabled... That's a
good idea!
I probably should have tried that :)
I've attached another patch which should fix this problem.

The single failing SUM(numeric) regression test is still in there; it is a
known failure caused by the extra trailing zeros that can now be produced,
which would not be present in an unpatched version. It's there purely in the
hope of generating some discussion, to find out whether we can use inverse
transitions for sum(numeric) or not.
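
To make the forward/inverse idea concrete, here is a toy standalone
illustration for an integer sum (where numeric's trailing-zero issue
cannot arise); the function names are hypothetical, but the contract
matches the patch's: the inverse transition undoes the forward one for
the row leaving the window frame, so an n-row scan with a w-row frame
costs O(n) instead of O(n * w):

#include <stdio.h>

static long sum_fwd(long state, int newval) { return state + newval; }
static long sum_inv(long state, int oldval) { return state - oldval; }

int
main(void)
{
	int		vals[] = {4, 7, 1, 9, 3};
	int		nvals = 5;
	int		width = 3;			/* frame: current row and 2 preceding */
	long	state = 0;

	for (int i = 0; i < nvals; i++)
	{
		state = sum_fwd(state, vals[i]);	/* row enters the frame */
		if (i >= width)
			state = sum_inv(state, vals[i - width]);	/* row leaves */
		printf("frame ending at row %d: sum = %ld\n", i, state);
	}
	return 0;
}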

Regards

David Rowley


inverse_transition_functions_v1.9.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Remove some duplicate if conditions

2014-01-02 Thread David Rowley
On Fri, Jan 3, 2014 at 6:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 David Rowley dgrowle...@gmail.com writes:
  I've attached a simple patch which removes some duplicate if conditions
  that seemed to have found their way into the code.

  These are per PVS-Studio's warnings.

 -1.  If PVS-Studio is complaining about this type of coding, to hell with
 it; it should just optimize away the extra tests and be quiet about it,
 not opinionate about coding style.  What you propose would cause logically
 independent pieces of code to become tied together, and I don't find that
 to be an improvement.  It's the compiler's job to micro-optimize where
 possible, and most compilers that I've seen will do that rather than tell
 the programmer he ought to do it.


I had imagined that PVS-Studio highlighted these because it considers them
possible bugs rather than missed optimisations. Here are some real-world
examples of bugs it has found:
http://www.viva64.com/en/examples/v581/
I understand that the compiler will likely optimise some of these cases,
most likely the ones that test simple local variables. I've not tested any
compiler, but I imagined that the duplicate check in fe-protocol3.c would
be the least likely to be optimized, as it may be hard for the compiler to
determine that conn->verbosity can't be changed from within, say,
appendPQExpBuffer. I sent the patch in because, after looking at the
fe-protocol3.c case, I found it weird and had to spend a bit of time to
determine that the code was not meant to check conn->verbosity against
some other constant. A duplicate if condition there just seems plain weird
to me.
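
As a hedged illustration (this is not code from the patch, and all the
names are made up), the kind of copy-paste slip such a warning is meant
to catch looks like this:

#include <stdio.h>

enum verbosity { TERSE, NORMAL, VERBOSE };

static void
report(enum verbosity v)
{
	if (v == VERBOSE)
		printf("detail line\n");
	if (v == VERBOSE)			/* copy-paste slip: was TERSE intended? */
		printf("hint line\n");
}

int
main(void)
{
	report(VERBOSE);
	return 0;
}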

The other case that's in the patch I don't really care so much for either,
I only added it since I had already written the fix for the fe-protocol3.c
one.

Regards

David Rowley