Re: [HACKERS] Row security violation error is misleading

2015-04-22 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
 On 21 April 2015 at 22:21, Dean Rasheed dean.a.rash...@gmail.com wrote:
  On 21 April 2015 at 20:50, Stephen Frost sfr...@snowman.net wrote:
  Thanks a lot for this.  Please take a look at the attached.
 
  I've given this a quick read-through, and it looks good to me. The
  interaction of permissive and restrictive policies from hooks matches
  my expectations, and it's a definite improvement having tests for RLS
  hooks.
 
  The only thing I spotted was that the file comment for
  test_rls_hooks.c needs updating.
 
 So re-reading this, I spotted what looks like another (pre-existing)
 bug. In process_policies() there's a loop over all the policies,
 collecting quals and with_check_quals, then a test at the end to use
 the USING quals for the WITH CHECK quals if there were no
 with_check_quals. I think we want to instead do that test inside the
 loop -- i.e., for each policy, if there is no with_check_qual *for
 that policy*, use its USING qual instead.

Pushed with those changes, please take a look and test!

Thanks again for all of your help with this.  I'm going to be looking
over that second patch with an eye towards getting it in very soon, it's
been kicking around for far longer than it should have been and that's
my fault, apologies about that.

Stephen




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 11:09 AM, Kevin Grittner kgri...@ymail.com wrote:
 Robert Haas robertmh...@gmail.com wrote:
 It's possible that we could use this infrastructure to freeze
 more aggressively in other circumstances.  For example, perhaps
 VACUUM should freeze any page it intends to mark all-visible.
 That's not a guaranteed win, because it might increase WAL
 volume: setting a page all-visible does not emit an FPI for that
 page, but freezing any tuple on it would, if the page hasn't
 otherwise been modified since the last checkpoint.  Even if that
 were no issue, the freezing itself must be WAL-logged.  But if we
 could somehow get to a place where all-visible = frozen, then
 autovacuum would never need to visit all-visible pages, a huge
 win.

 That would eliminate full-table scan vacuums, right?  It would do
 that by adding incremental effort and WAL to the normal
 autovacuum run to eliminate the full table scan and the associated
 mass freeze WAL-logging?  It's hard to see how that would not be an
 overall win.

Yes and yes.

In terms of an overall win, this design loses when the tuples that
have been recently marked all-visible are going to get updated again
in the near future. In that case, the effort we spend to freeze them
is wasted.  I just tested pgbench -i -s 40 -n followed by VACUUM
or alternatively followed by VACUUM FREEZE.  The VACUUM generated
4641kB of WAL.  The VACUUM FREEZE generated 515MB of WAL - that is,
113 times more.  So changing every VACUUM to act like VACUUM FREEZE
would be quite expensive.  We'll still come out ahead if those tuples
are going to stick around long enough that they would have eventually
gotten frozen anyway, but if they get deleted again the loss is pretty
significant.
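
For anyone who wants to reproduce that comparison, here is a rough psql
sketch (not what was actually run; it assumes a scratch cluster that is
re-initialized with pgbench -i -s 40 -n before each measurement, and uses
the 9.4-era xlog function names):

    -- capture the WAL insert position, vacuum, then measure the difference
    SELECT pg_current_xlog_location() AS start_lsn \gset
    VACUUM;                               -- or: VACUUM FREEZE;
    SELECT pg_size_pretty(
             pg_xlog_location_diff(pg_current_xlog_location(),
                                   :'start_lsn')) AS wal_generated;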

Incidentally, the reason for the large difference is that when Heikki
created the visibility map, it wasn't necessary for the WAL records
that set the visibility map bits to bump the page LSN, because it was
just a hint anyway.  When I made the visibility-map crash-safe, I went
to some pains to preserve that property.  Therefore, a regular VACUUM
does not emit full page images for the heap pages - it does for the
visibility map pages themselves, but there aren't very many of those.
In this example, the relation itself was 512MB, so you can see that
adding freezing to the mix roughly doubles the I/O cost.  Either way
we have to write half a gig of dirty data pages, but in one case we
also have to write an additional half a gig of WAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Bruce Momjian
On Tue, Apr 21, 2015 at 04:36:53PM -0400, Robert Haas wrote:
  Keep in mind there's a disconnect between dirtying a page and writing it
  to storage.  A page could remain dirty for a long time in the buffer
  cache.  This writing of sequential pages would occur at checkpoint time
  only, which seems the wrong thing to optimize.  If some other process
  needs to evict pages to make room to read some other page in, surely
  it's going to try one page at a time, not write many sequential dirty
  pages.
 
 Well, for a big sequential scan, we use a ring buffer, so we will
 typically be evicting the pages that we ourselves read in moments
 before.  So in this case we would do a lot of sequential writes of
 dirty pages.

Ah, yes, this again supports the prune-then-skip approach, rather than
doing the first X% pruneable pages seen.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Allow SQL/plpgsql functions to accept record

2015-04-22 Thread Andrew Dunstan


On 04/22/2015 11:29 AM, Jim Nasby wrote:

On 4/20/15 2:04 PM, David G. Johnston wrote:


SELECT (src.v).* FROM ( VALUES (ROW(1,2,3)) ) src (v);
ERROR: record type has not been registered

While it may not be necessary to solve both problems I suspect they have
the same underlying root cause - specifically the separation of concerns
between the planner and the executor.


I don't think they're related at all. C functions have the ability to 
accept a record, so the executor must be able to support it. It's just 
that SQL and plpgsql functions don't have that support. I suspect 
that's just because no one has gotten around to it.



Well, that's assuming everyone else thinks it would be a good idea. 
Maybe they do, but I wouldn't assume it.


The answer in the past has been to use more dynamically typed languages 
such as perl for which this problem is well suited.


There are actually several problems: first, given an arbitrary record 
plpgsql has no easy and efficient way of finding out what field names it 
has. Second, even if it has such knowledge it has no way of using it - 
it's not like JavaScript where you can use a text value as a field name. 
And third, it has no way of creating variables of the right type to hold 
extracted values.


All of these could possibly be overcome, but it would not be a small 
piece of work, I suspect. Given that plperl buys you all of that 
already (just try this, for example), people might think it not worth 
the extra trouble.


   create function rkeys(record) returns text[] language plperl as $$
   my $rec = shift; return [ keys %$rec ]; $$;
   select unnest(rkeys(r)) from (select * from pg_class limit 1) r;


cheers

andrew




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-22 Thread Robert Haas
On Tue, Apr 21, 2015 at 10:33 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 [ new patch ]

A little more nitpicking:

ExecInitForeignScan() and ExecInitCustomScan() could declare
currentRelation inside the if (scanrelid > 0) block instead of in the
outer scope.

I'm not too excited about the addition of GetFdwHandlerForRelation,
which is a one-line function used in one place.  It seems like we
don't really need that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] moving from contrib to bin

2015-04-22 Thread Bruce Momjian
On Wed, Apr 22, 2015 at 09:18:40AM +0200, Andres Freund wrote:
 Peter, Michael,
 
 On 2015-04-22 16:13:15 +0900, Michael Paquier wrote:
  All the patches have been committed, finishing the work on this thread.
 
 Many thanks for that effort!

And pg_upgrade thanks you.  ;-)

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Heikki Linnakangas

On 04/22/2015 05:33 PM, Robert Haas wrote:

On Tue, Apr 21, 2015 at 7:24 PM, Jim Nasby jim.na...@bluetreble.com wrote:

On 4/21/15 3:21 PM, Robert Haas wrote:

I'm not saying those ideas don't have problems, because they do.  But
I think they are worth further exploring.  The main reason I gave up
on that is because Heikki was working on the XID-to-LSN mapping stuff.
That seemed like a better approach than either of the above, so as
long as Heikki was working on that, there wasn't much reason to pursue
more lowbrow approaches.  Clearly, though, we need to do something
about this.  Freezing is a big problem for lots of users.


Did XID-LSN die? I see at the bottom of the thread it was returned with
feedback; I guess Heikki just hasn't had time and there's no major blockers?
 From what I remember this is probably a better solution, but if it's not
going to make it into 9.6 then we should probably at least look further into
a FM.


Heikki said he'd lost enthusiasm for it, but he wasn't too specific
about his reasons, IIRC.  I guess maybe just that it got complicated,
and he wasn't sure it was correct.


I'd like to continue working on that when I get around to it. Or even 
better if someone else continues it :-).


The thing that made me nervous about that approach is that it made the 
LSN of each page critical information. If you somehow zeroed out the 
LSN, you could no longer tell which pages are frozen and which are not. 
I'm sure it could be made to work - and I got it working to some degree 
anyway - but it's a bit scary. It's similar to the multixid changes in 
9.3: multixids also used to be data that you can just zap at restart, 
and when we changed the rules so that you lose data if you lose 
multixids, we got trouble. Now, LSNs are much simpler, and there 
wouldn't be anything like the multioffset/member SLRUs that you'd have 
to keep around forever or vacuum, but still..


I would feel safer if we added a completely new epoch counter to the 
page header, instead of reusing LSNs. But as we all know, changing the 
page format is a problem for in-place upgrade, and takes some space too.


- Heikki





Re: [HACKERS] Allow SQL/plpgsql functions to accept record

2015-04-22 Thread Jim Nasby

On 4/20/15 2:04 PM, David G. Johnston wrote:


SELECT (src.v).* FROM ( VALUES (ROW(1,2,3)) ) src (v);
ERROR: record type has not been registered

While it may not be necessary to solve both problems I suspect they have
the same underlying root cause - specifically the separation of concerns
between the planner and the executor.


I don't think they're related at all. C functions have the ability to 
accept a record, so the executor must be able to support it. It's just 
that SQL and plpgsql functions don't have that support. I suspect that's 
just because no one has gotten around to it.
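
For instance, row_to_json() is a C function declared to take record, and it
happily accepts an anonymous row from the executor, which is exactly what
can't be declared in SQL or plpgsql today (a quick illustration, not part of
any patch):

    -- works: a C function whose argument type is record
    SELECT row_to_json(r) FROM (SELECT 1 AS a, 2 AS b) r;

    -- rejected at CREATE time: SQL functions cannot accept type record
    CREATE FUNCTION f(r record) RETURNS text LANGUAGE sql
        AS $$ SELECT 'got a row' $$;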

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Authenticating from SSL certificates

2015-04-22 Thread Stephen Frost
Keenan,

* kee...@thebrocks.net (kee...@thebrocks.net) wrote:
 I'm looking into connection to postgres using authentication from client
 certificates. [1]

Nice!  Glad to hear of more users of that capability. :)

 The documentation states that the common name (aka CN) is read from the
 certificate and used as the user's login (aka auth_user).
 The problem is the common name is typically the user's full name. A field
 like email address would contain a more computer friendly identifier.

This is why we have the pg_ident mapping capability..  I realize that
file has to be generated, but at that point it's really just a string,
no?

That said, I'm not against this capability in general, but we'd need to
make sure it doesn't lock us into OpenSSL.  Heikki's been working on
changing the SSL code to allow other libraries to be used, which is
great, and I'm slightly worried this might make that more difficult.

The other issue is that we'd need to be very clear in the documentation
that any users of this capability have to verify with their CA that they
aren't going to end up with the same value in whichever field is used
for distinct individuals; otherwise, the CA might unknowingly issue two
certs with the same value and you would then be unable to distinguish
between those two certs, and both certs would have access to the account.

That's already an issue in the SSL world when using real CAs (that is,
ones outside of your own organization) and, really, we would do better
to support including *more* fields than just the CN to address that
issue.  As such, perhaps we should support having a *list* of fields to
use and then we combine them in some way in the mapping file.  That
would allow users to, say, include the issuer and the CN, and perhaps
the serial number if they want.

Thanks!

Stephen




Re: [HACKERS] Row security violation error is misleading

2015-04-22 Thread Dean Rasheed
On 21 April 2015 at 22:21, Dean Rasheed dean.a.rash...@gmail.com wrote:
 On 21 April 2015 at 20:50, Stephen Frost sfr...@snowman.net wrote:
 Thanks a lot for this.  Please take a look at the attached.

 I've given this a quick read-through, and it looks good to me. The
 interaction of permissive and restrictive policies from hooks matches
 my expectations, and it's a definite improvement having tests for RLS
 hooks.

 The only thing I spotted was that the file comment for
 test_rls_hooks.c needs updating.


So re-reading this, I spotted what looks like another (pre-existing)
bug. In process_policies() there's a loop over all the policies,
collecting quals and with_check_quals, then a test at the end to use
the USING quals for the WITH CHECK quals if there were no
with_check_quals. I think we want to instead do that test inside the
loop -- i.e., for each policy, if there is no with_check_qual *for
that policy*, use its USING qual instead.
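
To illustrate the intended per-policy semantics with a made-up example (table
and policy names invented): a policy that specifies only USING should have
that same expression enforced as its WITH CHECK for writes going through that
policy, independently of any other policy's quals:

    CREATE TABLE docs (id int, owner text);
    ALTER TABLE docs ENABLE ROW LEVEL SECURITY;

    -- no WITH CHECK given, so this policy's USING expression should
    -- also be applied as its WITH CHECK expression
    CREATE POLICY p_owner ON docs
        USING (owner = current_user);

    -- run as some role other than the table owner (RLS isn't applied
    -- to the owner); this should fail the implied WITH CHECK derived
    -- from p_owner's USING clause
    INSERT INTO docs VALUES (1, 'somebody_else');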

Regards,
Dean




Re: [HACKERS] Idea: closing the loop for pg_ctl reload

2015-04-22 Thread Payal Singh
Ah sorry, didn't realize I top posted. I'll test this new one.

Payal.
On Apr 21, 2015 10:23 PM, Jan de Visser j...@de-visser.net wrote:

 On April 21, 2015 09:34:51 PM Jan de Visser wrote:
  On April 21, 2015 09:01:14 PM Jan de Visser wrote:
   On April 21, 2015 07:32:05 PM Payal Singh wrote:
 ... snip ...
 
  Urgh. It appears you are right. Will fix.
 
  jan

 Attached a new attempt. This was one from the category 'I have no idea how
 that ever worked, but whatever'. For reference, this is how it looks for me
 (magic man-behind-the-curtain postgresql.conf editing omitted):

 jan@wolverine:~/Projects/postgresql$ initdb -D data
 ... Bla bla bla ...
 jan@wolverine:~/Projects/postgresql$ pg_ctl -D data -l logfile start
 server starting
 jan@wolverine:~/Projects/postgresql$ tail -5 logfile
 LOG:  database system was shut down at 2015-04-21 22:03:33 EDT
 LOG:  database system is ready to accept connections
 LOG:  autovacuum launcher started
 jan@wolverine:~/Projects/postgresql$ pg_ctl -D data reload
 server signaled
 jan@wolverine:~/Projects/postgresql$ tail -5 logfile
 LOG:  database system was shut down at 2015-04-21 22:03:33 EDT
 LOG:  database system is ready to accept connections
 LOG:  autovacuum launcher started
 LOG:  received SIGHUP, reloading configuration files
 jan@wolverine:~/Projects/postgresql$ pg_ctl -D data reload
 server signaled
 pg_ctl: Reload of server with PID 14656 FAILED
 Consult the server log for details.
 jan@wolverine:~/Projects/postgresql$ tail -5 logfile
 LOG:  autovacuum launcher started
 LOG:  received SIGHUP, reloading configuration files
 LOG:  received SIGHUP, reloading configuration files
 LOG:  syntax error in file
 /home/jan/Projects/postgresql/data/postgresql.conf
 line 1, near end of line
 LOG:  configuration file
 /home/jan/Projects/postgresql/data/postgresql.conf
 contains errors; no changes were applied
 jan@wolverine:~/Projects/postgresql$


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Robert Haas
On Tue, Apr 21, 2015 at 7:24 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 On 4/21/15 3:21 PM, Robert Haas wrote:
 It's possible that we could use this infrastructure to freeze more
 aggressively in other circumstances.  For example, perhaps VACUUM
 should freeze any page it intends to mark all-visible.  That's not a
 guaranteed win, because it might increase WAL volume: setting a page
 all-visible does not emit an FPI for that page, but freezing any tuple
 on it would, if the page hasn't otherwise been modified since the last
 checkpoint.  Even if that were no issue, the freezing itself must be
 WAL-logged.  But if we could somehow get to a place where all-visible
 = frozen, then autovacuum would never need to visit all-visible
 pages, a huge win.

 I don't know how bad the extra WAL traffic would be; we'd obviously need to
 incur it eventually, so it's a question of how common it is for a page to go
 all-visible but then go not-all-visible again before freezing. It would
 presumably be far more traffic than some form of a FrozenMap though...

Yeah, maybe.  The freeze record contains details for each TID, while
the freeze map bit would only need to be set once for the whole page.
I wonder if the format of that record could be optimized somehow.

 We could also attack the problem from the other end.  Instead of
 trying to set the bits on the individual tuples, we could decide that
 whenever a page is marked all-visible, we regard it as frozen
 regardless of the bits set or not set on the individual tuples.
 Anybody who wants to modify the page must freeze any unfrozen tuples
 for real before clearing the visibility map bit.  This would have
 the same end result as the previous idea: all-visible would
 essentially imply frozen, and autovacuum could ignore those pages
 categorically.

 Pushing what's currently background work onto foreground processes doesn't
 seem like a good idea...

When you phrase it that way, no, but pushing work that otherwise would
need to be done right now off to a future time that may never arrive
sounds like a good idea.  Today, we freeze the page -- rewriting it --
and then keep scanning those all-frozen pages every X number of
transactions to make sure they are really all-frozen.  In this system,
we'd eliminate the repeated scanning and defer the freeze work until
the page actually gets modified again.  But that might never happen,
in which case we never have to do the work at all.

 I'm not saying those ideas don't have problems, because they do.  But
 I think they are worth further exploring.  The main reason I gave up
 on that is because Heikki was working on the XID-to-LSN mapping stuff.
 That seemed like a better approach than either of the above, so as
 long as Heikki was working on that, there wasn't much reason to pursue
 more lowbrow approaches.  Clearly, though, we need to do something
 about this.  Freezing is a big problem for lots of users.

 Did XID-LSN die? I see at the bottom of the thread it was returned with
 feedback; I guess Heikki just hasn't had time and there's no major blockers?
 From what I remember this is probably a better solution, but if it's not
 going to make it into 9.6 then we should probably at least look further into
 a FM.

Heikki said he'd lost enthusiasm for it, but he wasn't too specific
about his reasons, IIRC.  I guess maybe just that it got complicated,
and he wasn't sure it was correct.

 All that having been said, I don't think adding a new fork is a good
 approach.  We already have problems pretty commonly where our
 customers complain about running out of inodes.  Adding another fork
 for every table would exacerbate that problem considerably.

 Andres idea of adding this to the VM may work well to handle that. It would
 double the size of the VM, but it would still be a ratio of 32,000-1
 compared to heap size, or 2MB for a 64GB table.

Yes, that's got some potential.  It would mean pg_upgrade would have
to remove all existing visibility maps when upgrading to the new
version, or rewrite them into the new format.  But it otherwise seems
promising.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Rounding to even for numeric data type

2015-04-22 Thread Michael Paquier
On Wed, Apr 22, 2015 at 9:30 PM, Pedro Gimeno
pgsql-...@personal.formauri.es wrote:
 Dean Rasheed wrote, On 2015-03-28 10:01:
 On 28 March 2015 at 05:16, Andrew Gierth and...@tao11.riddles.org.uk wrote:
 Tom == Tom Lane t...@sss.pgh.pa.us writes:

  Tom I think the concern over backwards compatibility here is probably
  Tom overblown; but if we're sufficiently worried about it, a possible
  Tom compromise is to invent a numeric_rounding_mode GUC, so that
  Tom people could get back the old behavior if they really care.

 I only see one issue with this, but it's a nasty one: do we really want
 to make all numeric operations that might do rounding stable rather than
 immutable?


 Yeah, making all numeric functions non-immutable seems like a really bad 
 idea.

 Would it be possible to make it an unchangeable per-cluster or
 per-database setting, kinda like how encoding behaves? Wouldn't that
 allow the functions to be kept immutable?

Rounding is not something that can be enforced at the database or
server level but at data type level, see for example the differences
already present for double precision and numeric as mentioned
upthread. In short, you could keep rounding functions immutable by
having one data type with a different rounding method. At least that's
an idea.
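
For instance, the split is already visible today in how ties are handled; the
float result follows the platform's C library, so it could in principle
differ:

    SELECT round(2.5::numeric)          AS numeric_round,  -- 3
           round(2.5::double precision) AS float8_round;   -- 2 on most platforms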
-- 
Michael




Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-04-22 Thread Abhijit Menon-Sen
At 2015-04-18 12:28:36 +0530, amit.kapil...@gmail.com wrote:

 I think you have missed addressing the point below:
 
  4) prefetch

Updated patch attached, as well as the diff against the earlier version
just to make it easier to see the exact change I made (which is to copy
the skip logic from lazy_scan_heap, as you suggested).

I'm not entirely convinced that this change is worthwhile, but it's easy
to make and difficult to argue against, so here it is. (I did test, and
it seems to work OK after the change.)

Amit, Tomas, thanks again for your review comments.

-- Abhijit
From 3edb5426292d6097cb66339b865e99bf4f766646 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatbloat to pgstattuple

---
 contrib/pgstattuple/Makefile   |   4 +-
 contrib/pgstattuple/pgstatbloat.c  | 387 +
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql  |  18 +
 .../{pgstattuple--1.2.sql => pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control|   2 +-
 doc/src/sgml/pgstattuple.sgml  | 135 +++
 6 files changed, 560 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatbloat.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql => pgstattuple--1.3.sql} (73%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..d7d27a5 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = pgstattuple - tuple-level statistics
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
new file mode 100644
index 000..b19459a
--- /dev/null
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -0,0 +1,387 @@
+/*
+ * contrib/pgstattuple/pgstatbloat.c
+ *
+ * Abhijit Menon-Sen a...@2ndquadrant.com
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pgstattuple)
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
+ * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/multixact.h"
+#include "access/htup_details.h"
+#include "catalog/namespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/freespace.h"
+#include "storage/procarray.h"
+#include "storage/lmgr.h"
+#include "utils/builtins.h"
+#include "utils/tqual.h"
+#include "commands/vacuum.h"
+
+PG_FUNCTION_INFO_V1(pgstatbloat);
+
+/*
+ * tuple_percent, dead_tuple_percent and free_percent are computable,
+ * so not defined here.
+ */
+typedef struct pgstatbloat_output_type
+{
+	uint64		table_len;
+	uint64		tuple_count;
+	uint64		misc_count;
+	uint64		tuple_len;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	uint64		free_space;
+	uint64		total_pages;
+	uint64		scanned_pages;
+} pgstatbloat_output_type;
+
+static Datum build_output_type(pgstatbloat_output_type *stat,
+			   FunctionCallInfo fcinfo);
+static Datum pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo);
+
+/*
+ * build a pgstatbloat_output_type tuple
+ */
+static Datum
+build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+#define NCOLUMNS	10
+#define NCHARS		32
+
+	HeapTuple	tuple;
+	char	   *values[NCOLUMNS];
+	char		values_buf[NCOLUMNS][NCHARS];
+	int			i;
+	double		tuple_percent;
+	double		dead_tuple_percent;
+	double		free_percent;	/* free/reusable space in % */
+	double		scanned_percent;
+	TupleDesc	tupdesc;
+	AttInMetadata *attinmeta;
+
+	/* Build a 

Re: [HACKERS] inherit support for foreign tables

2015-04-22 Thread Stephen Frost
Etsuro,

* Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote:
 postgres=# select * from ft1 for update;
 ERROR:  could not find junk tableoid1 column
 
 I think this is a bug.  Attached is a patch fixing this issue.

Pushed, thanks!

Stephen




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Jim Nasby

On 4/21/15 4:07 PM, Peter Eisentraut wrote:

On 4/21/15 4:45 PM, Jim Nasby wrote:
In order for a background worker to keep up with some of the workloads
that have been presented as counterexamples, you'd need multiple
background workers operating in parallel and preferring to work on
certain parts of a table.  That would require a lot more sophisticated
job management than we currently have for, say, autovacuum.


My thought was that the foreground queries would send page IDs to the 
bgworker via a shmq. If the queries have to do much waiting at all on IO 
then I'd expect the bgworker to be able to keep pace with a bunch of 
them since it's just grabbing buffers that are already in the pool (and 
only those in the pool; it wouldn't make sense for it to pull it back 
from the kernel, let alone disk).


We'd need to code this so that if a queue fills up the query doesn't 
block; we just skip that opportunity to prune. I think that'd be fine.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-22 Thread Robert Haas
On Wed, Mar 25, 2015 at 9:51 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 The attached patch adds GetForeignJoinPaths call on make_join_rel() only when
 'joinrel' is actually built and both of child relations are managed by same
 FDW driver, prior to any other built-in join paths.
 I adjusted the hook definition a little bit, because jointype can be 
 reproduced
 using SpecialJoinInfo. Right?

 Probably, it will solve the original concern towards multiple calls of FDW
 handler in case when it tries to replace an entire join subtree with a 
 foreign-
 scan on the result of remote join query.

 How about your opinion?

A few random cosmetic problems:

- The hunk in allpaths.c is useless.
- The first hunk in fdwapi.h contains an extra space before the
closing parenthesis.

And then:

+   else if (scan->scanrelid == 0 &&
+            (IsA(scan, ForeignScan) || IsA(scan, CustomScan)))
+   varno = INDEX_VAR;

Suppose scan->scanrelid == 0 but the scan type is something else?  Is
that legal?  Is varno == 0 the correct outcome in that case?

More later.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:

 It's possible that we could use this infrastructure to freeze
 more aggressively in other circumstances.  For example, perhaps
 VACUUM should freeze any page it intends to mark all-visible.
 That's not a guaranteed win, because it might increase WAL
 volume: setting a page all-visible does not emit an FPI for that
 page, but freezing any tuple on it would, if the page hasn't
 otherwise been modified since the last checkpoint.  Even if that
 were no issue, the freezing itself must be WAL-logged.  But if we
 could somehow get to a place where all-visible = frozen, then
 autovacuum would never need to visit all-visible pages, a huge
 win.

That would eliminate full-table scan vacuums, right?  It would do
that by adding incremental effort and WAL to the normal
autovacuum run to eliminate the full table scan and the associated
mass freeze WAL-logging?  It's hard to see how that would not be an
overall win.

 We could also attack the problem from the other end.  Instead of
 trying to set the bits on the individual tuples, we could decide
 that whenever a page is marked all-visible, we regard it as
 frozen regardless of the bits set or not set on the individual
 tuples.  Anybody who wants to modify the page must freeze any
 unfrozen tuples for real before clearing the visibility map
 bit.  This would have the same end result as the previous idea:
 all-visible would essentially imply frozen, and autovacuum could
 ignore those pages categorically.

Besides putting work into the foreground that could be done in the
background, that sounds more complicated.  Also, there is no
ability to pace the freeze load or use scheduled jobs to shift
the work to off-peak hours.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Greg Stark
On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian br...@momjian.us wrote:

 But if the entire table is very hot, I think that that is just another of way
 of saying that autovacuum is horribly misconfigured.  I think the purpose of

 Well, we have to assume there are many misconfigured configurations ---
 autovacuum isn't super-easy to configure, so we can't just blame the
 user if this makes things worse.  In fact, page pruning was designed
 specifically for cases where autovacuum wasn't running or couldn't keep
 up.

Well autovacuum isn't currently considering HOT pruning part of its
job at all. It's hard to call it misconfigured when there's
literally *no* way to configure it correctly.

If you update less than autovacuum_vacuum_scale_factor fraction of the
table and then never update another row, autovacuum will never run.
Ever. Every select will forevermore need to follow HOT chains on that
table, until eventually transaction wraparound forces a vacuum on that
table, if that ever happens.
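
For reference, the trigger condition is roughly dead tuples exceeding
autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples,
which can be eyeballed per table with something like this (a rough sketch
assuming the default values of 50 and 0.2):

    -- tables currently past the (default) autovacuum trigger point
    SELECT s.relname, s.n_dead_tup,
           50 + 0.2 * c.reltuples AS approx_trigger_point
    FROM pg_stat_user_tables s
    JOIN pg_class c ON c.oid = s.relid
    WHERE s.n_dead_tup > 50 + 0.2 * c.reltuples;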

Possibly autovacuum could be adjusted to count how many selects are
happening on the table and decide to vacuum it when the cost of the
selects following the dead tuples is balanced by the cost of doing a
vacuum. But that's not something included in the design of autovacuum
today.

The original design of tuple storage was aimed at optimizing the
steady state where most tuples were not recently updated. It
guaranteed that except for tuples that were in the process of being
updated or were recently updated a tuple read didn't have to read the
CLOG, didn't have to follow any chains, didn't have to do any I/O or
other work other than to read the bits on the tuple itself. When a
tuple is updated it's put into a state where everyone who comes along
has to do extra work but as soon as practical the hint bits get set
and that extra work stops.

We had similar discussions about setting hint bits in the past. I'm
not sure why HOT pruning is the focus now because I actually think
hint bit setting is a larger source of I/O in innocent looking selects
even today. And it's a major headache: people are always being
surprised that their selects cause lots of I/O and slow down
dramatically after a big update or data load has finished. It's
characterized as "why is the database writing everything twice" (and
saying it's actually writing everything three times doesn't make
people feel better). In the new age of checksums with hint bit logging,
I wonder if it's an even bigger issue.

It occurs to me that generating these dirty pages isn't really that
expensive individually. It's only that there's a sudden influx of a
large number of dirty pages that causes them to get translated
immediately into filesystem I/O. Perhaps we should dirty pages on hint
bit updates and do HOT pruning only to the extent it can be done
without causing I/O. Of course it's hard to tell that in advance, but
maybe something like "if the current buffer had to be fetched and
caused a dirty buffer to be evicted, then skip HOT pruning and don't
dirty it for any hint bit updates" would at least mean that once the
select fills up its share of buffers with dirty buffers it stops
dirtying more. It would dirty pages only as fast as bgwriter or
checkpoints manage to write them out.

That sounds a bit weird but I think the right solution should have
that combination of properties. It should guarantee that hint bits get
set and hot chains pruned within some length of time but that no one
select causes a storm of dirty buffers that then need to be flushed to
disk.


-- 
greg




Re: [HACKERS] Row security violation error is misleading

2015-04-22 Thread Dean Rasheed
On 22 April 2015 at 17:02, Stephen Frost sfr...@snowman.net wrote:
 Pushed with those changes, please take a look and test!


Excellent, thanks! Will test.

Regards,
Dean




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Peter Eisentraut
On 4/22/15 11:37 AM, Jim Nasby wrote:
 On 4/21/15 4:07 PM, Peter Eisentraut wrote:
 On 4/21/15 4:45 PM, Jim Nasby wrote:
 In order for a background worker to keep up with some of the workloads
 that have been presented as counterexamples, you'd need multiple
 background workers operating in parallel and preferring to work on
 certain parts of a table.  That would require a lot more sophisticated
 job management than we currently have for, say, autovacuum.
 
 My thought was that the foreground queries would send page IDs to the
 bgworker via a shmq. If the queries have to do much waiting at all on IO
 then I'd expect the bgworker to be able to keep pace with a bunch of
 them since it's just grabbing buffers that are already in the pool (and
 only those in the pool; it wouldn't make sense for it to pull it back
 from the kernel, let alone disk).
 
 We'd need to code this so that if a queue fills up the query doesn't
 block; we just skip that opportunity to prune. I think that'd be fine.

I think a to-clean-up map would work better.  But basically we need a
way to remember where to clean up later if we're not going to do it in
the foreground.





Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Bruce Momjian
On Tue, Apr 21, 2015 at 05:07:52PM -0400, Peter Eisentraut wrote:
 On 4/21/15 4:45 PM, Jim Nasby wrote:
  This comment made me wonder... has anyone considered handing the pruning
  work off to a bgworker, at least for SELECTs? That means the selects
  themselves wouldn't be burdened by the actual prune work, only in
  notifying the bgworker. While that's not going to be free, presumably
  it's a lot cheaper...
 
 The nice thing about having foreground queries to the light cleanup is
 that they can work in parallel and naturally hit the interesting parts
 of the table first.
 
 In order for a background worker to keep up with some of the workloads
 that have been presented as counterexamples, you'd need multiple
 background workers operating in parallel and preferring to work on
 certain parts of a table.  That would require a lot more sophisticated
 job management than we currently have for, say, autovacuum.

Well, the visibility map tells us where _not_ to clean up, so using
another map to tell use _where_ to cleanup might make sense.  However,
the density of the map might be low enough that a list makes more sense,
as you suggested.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Idea: closing the loop for pg_ctl reload

2015-04-22 Thread Payal Singh
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   tested, failed
Spec compliant:   not tested
Documentation:tested, failed

An error in postgresql.conf gives the expected result on pg_ctl reload, although
errors in the pg_hba.conf file don't. As before, reload completes without any
indication that pg_hba.conf failed to load; that information only appears in
the log file. I'm assuming pg_ctl reload should alert the user if a file fails
to load, irrespective of which file it is: postgresql.conf or pg_hba.conf.

There is no documentation change so far, but I guess that's not yet necessary.

gmake check passed all tests.

The new status of this patch is: Waiting on Author




Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-22 Thread Jan de Visser
On April 22, 2015 11:14:08 AM Heikki Linnakangas wrote:
 On 04/16/2015 06:51 AM, Alvaro Herrera wrote:
  Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use?  I
  thought the -w would wait until promotion has taken effect, so there's
  no need to sleep additional time.
 
 -w is not supported with pg_ctl promote. Only start, stop and restart.
 It's accepted, but it doesn't do anything. Which isn't very nice...

I'm futzing with the pg_ctl cmd line parameters for my patch notifying pg_ctl 
that restart didn't work. I'll fix this as well.

 
 - Heikki

jan





Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-22 Thread Robert Haas
On Fri, Mar 13, 2015 at 8:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Mar 13, 2015 at 2:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't object to the concept, but I think that is a pretty bad place
 to put the hook call: add_paths_to_joinrel is typically called multiple
 (perhaps *many*) times per joinrel and thus this placement would force
 any user of the hook to do a lot of repetitive work.

 Interesting point.  I guess the question is whether a some or all
 callers are going to actually *want* a separate call for each
 invocation of add_paths_to_joinrel(), or whether they'll be happy to
 operate on the otherwise-complete path list.

 Hmm.  You're right, it's certainly possible that some users would like to
 operate on each possible pair of input relations, rather than considering
 the joinrel as a whole.  Maybe we need two hooks, one like your patch
 and one like I suggested.

Let me attempt to summarize subsequent discussion on this thread by
saying the hook location that you proposed (just before set_cheapest)
has not elicited any enthusiasm from anyone else.  In a nutshell, the
problem is that a single callback for a large join problem is just
fine if there are no special joins involved, but in any other
scenario, nobody knows how to use a hook at that location for anything
useful.  To push down a join to the remote server, you've got to
figure out how to emit an SQL query for it.  To execute it with a
custom join strategy, you've got to know which of those joins should
have inner join semantics vs. left join semantics.  A hook/callback in
make_join_rel() or in add_paths_to_joinrel() makes that relatively
straightforward. Otherwise, it's not clear what to do, short of
copy-and-pasting join_search_one_level().  If you have a suggestion,
I'd like to hear it.

If not, I'm going to press forward with the idea of putting the
relevant logic in either add_paths_to_joinrel(), as previously
proposed, or perhaps up one level in make_one_rel().  Either way, if
you don't need to be called multiple times per joinrel, you can stash
a flag inside whatever you hang off of the joinrel's fdw_private and
return immediately on every call after the first.  I think that's
cheap enough that we shouldn't get too stressed about it: for FDWs, we
only call the hook at all if everything in the joinrel uses the same
FDW, so it won't get called at all except for joinrels where it's
likely to win big; for custom joins, multiple calls are quite likely
to be useful and necessary, and if the hook burns too much CPU time
for the query performance you get out of it, that's the custom-join
provider's fault, not ours.  The current patch takes this approach one
step further and attempts FDW pushdown only once per joinrel.  It does
that because, while postgres_fdw DOES need the jointype and a valid
innerrel/outerrel breakdown to figure out what query to generate, it
does NOT need every possible breakdown; rather, the first one is as good as
any other. But this might not be true for a non-PostgreSQL remote
database.  So I think it's better to call the hook every time and let
the hook return without doing anything if it wants.

I'm still not totally sure whether make_one_rel() is better than
add_paths_to_joinrel().  The current patch attempts to split the
difference by doing FDW pushdown from make_one_rel() and custom joins
from add_paths_to_joinrel().  I dunno why; if possible, those two
things should happen in the same place.  Doing it in make_one_rel()
makes for fewer arguments and fewer repetitive calls, but that's not
much good if you would have had a use for the extra arguments that
aren't computed until we get down to add_paths_to_joinrel().  I'm not
sure whether that's the case or not.  The latest version of the
postgres_fdw patch doesn't seem to mind not having extra_lateral_rels,
but I'm wondering if that's busted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 2:23 PM, Kevin Grittner kgri...@ymail.com wrote:
 Robert Haas robertmh...@gmail.com wrote:
 I just tested pgbench -i -s 40 -n followed by VACUUM or
 alternatively followed by VACUUM FREEZE.  The VACUUM generated
 4641kB of WAL.  The VACUUM FREEZE generated 515MB of WAL - that
 is, 113 times more.

 Essentially a bulk load.  OK, so if you bulk load data and then
 vacuum it before updating 100% of it, this approach will generate a
 lot more WAL than we currently do.  Of course, if you don't VACUUM
 FREEZE after a bulk load and then are engaged in a fairly normal
 OLTP workload with peak and off-peak cycles, you are currently
 almost certain to hit a point during peak OLTP load where you begin
 to sequentially scan all tables, rewriting them in place, with WAL
 logging.  Incidentally, this tends to flush a lot of your hot
 data out of cache, increasing disk reads.  The first time I hit
 this interesting experience in production it was so devastating,
 and generated so many user complaints, that I never again
 considered a bulk load complete until I had run VACUUM FREEZE on it
 -- although I was sometimes able to defer that to an off-peak
 window of time.

 In other words, for the production environments I managed, the only
 value of that number is in demonstrating the importance of using
 unlogged COPY followed by VACUUM FREEZE for bulk-loading and
 capturing a fresh base backup upon completion.  A better way to use
 pgbench to measure WAL size cost might be to initialize, VACUUM
 FREEZE to set a long term baseline, and do a reasonable length
 run with crontab running VACUUM FREEZE periodically (including
 after the run was complete) versus doing the same with plain VACUUM
 (followed by a VACUUM FREEZE at the end?).  Comparing the total WAL
 sizes generated following the initial load and VACUUM FREEZE would
 give a more accurate picture of the impact on an OLTP load, I
 think.

Sure, that would be a better test.  But I'm pretty sure the impact
will still be fairly substantial.

 We'll still come out ahead if those tuples are going to stick
 around long enough that they would have eventually gotten frozen
 anyway, but if they get deleted again the loss is pretty
 significant.

 Perhaps my perception is biased by having worked in an environment
 where the vast majority of tuples (both in terms of tuple count and
 byte count) were never updated and were only eligible for deletion
 after a period of years.  Our current approach is pretty bad in
 such an environment, at least if you try to leave all vacuuming to
 autovacuum.  I'll admit that we were able to work around the
 problems by running VACUUM FREEZE every night for most databases.

Yeah.  And that breaks down when you have very big databases with a
high XID consumption rate, because the mostly-no-op VACUUM FREEZE runs
for longer than you can tolerate.  I'm not saying we don't need to fix
this problem; we clearly do.  I'm just saying that we've got to be
careful not to harm other scenarios in the process.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Heikki Linnakangas

On 04/22/2015 09:30 PM, Robert Haas wrote:

On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas hlinn...@iki.fi wrote:

Note that it's a bit complicated to set up that scenario today. Archiving is
never enabled in recovery mode, so you'll need to use a custom cron job or
something to maintain the archive that C uses. The files will not
automatically flow from B to the second archive. With the patch we're
discussing, however, it would be easy: just set archive_mode='always' in B.


Hmm, I see.  But if C never replays the last, partial segment from the
old timeline, how does it follow the timeline switch?


At timeline switch, we copy the old segment to the new timeline, and 
start writing where we left off. So the WAL from the old timeline is 
found in the segment nominally belonging to the new timeline.


For example, imagine that we perform point-in-time recovery to WAL position 
0/1237E568, on timeline 1. That falls within segment 
000000010000000000000012. Then we end recovery, and switch to timeline 
2. After the switch, and some more WAL-logged actions, we'll have these 
files in pg_xlog:


000000010000000000000011
000000010000000000000012
000000020000000000000012
000000020000000000000013
000000020000000000000014

Note that there are two segments ending in 12. They both have the same 
contents up to offset 0x37E568, corresponding to the switch point 
0/1237E568. After that, the contents diverge: the segment on the new 
timeline contains a checkpoint/end-of-recovery record at that point, 
followed by new WAL belonging to the new timeline.


Recovery knows about that, so that if you set recovery target to 
timeline 2, and it needs the WAL at the beginning of segment 12 (still 
belonging to timeline 1), it will try restoring both 
000000010000000000000012 and 000000020000000000000012.
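
A quick way to see which segment file a given LSN maps to, on whatever
timeline the server is currently on (assuming default 16MB segments):

    SELECT pg_xlogfile_name('0/1237E568');
    -- on timeline 1 this returns 000000010000000000000012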


- Heikki





Re: [HACKERS] Allow SQL/plpgsql functions to accept record

2015-04-22 Thread Merlin Moncure
On Wed, Apr 22, 2015 at 11:20 AM, Andrew Dunstan and...@dunslane.net wrote:

 On 04/22/2015 11:29 AM, Jim Nasby wrote:

 On 4/20/15 2:04 PM, David G. Johnston wrote:


 SELECT (src.v).* FROM ( VALUES (ROW(1,2,3)) ) src (v);
 ERROR: record type has not been registered

 While it may not be necessary to solve both problems I suspect they have
 the same underlying root cause - specifically the separation of concerns
 between the planner and the executor.

 I don't think they're related at all. C functions have the ability to
 accept a record, so the executor must be able to support it. It's just that
 SQL and plpgsql functions don't have that support. I suspect that's just
 because no one has gotten around to it.

 Well, that's assuming everyone else thinks it would be a good idea. Maybe
 they do, but I wouldn't assume it.

 The answer in the past has been to use more dynamically typed languages such
 as perl for which this problem is well suited.

I've never really been satisfied with this answer.  The two languages
with really good core support are perl and python, neither of which
are my cup of tea.   Also, there is no chance of inlining any of the
dynamic languages which has serious performance ramifications.  In a
perfect world, pl/v8 would be a good choice for a general purpose
database support language as javascript has a number of properties
that make it attractive for integration.  Even if we had that though
(and it's unlikely), a large percentage of postgres devs, including
myself, dislike coding in any language except sql plus extensions.

That being said, I think json types with their associated API, given
that they are core types, will ultimately handle these types of
problems.  That way, at least, we can avoid adding syntax and
functionality that will basically do the same thing.  This reminds me
a little bit of the json_build() vs enhanced row() syntax we discussed
some time back.  I didn't say so at the time, but for posterity, I
think you were right...json_build() is working fine for building
arbitrary record types and moving a record to json and deconstructing
it should work just as well.
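
For example, building an arbitrary record as json and then pulling it apart
again, which is the part that's awkward with a bare record in plpgsql (9.4
function names; a sketch, not tied to any particular proposal):

    -- build an ad-hoc record-like value
    SELECT json_build_object('a', 1, 'b', 'two'::text) AS built;

    -- move an existing row into json and enumerate its fields
    SELECT j.key, j.value
    FROM (SELECT 1 AS a, 'two'::text AS b) AS r,
         LATERAL json_each(row_to_json(r)) AS j;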

merlin





Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:

 I just tested pgbench -i -s 40 -n followed by VACUUM or
 alternatively followed by VACUUM FREEZE.  The VACUUM generated
 4641kB of WAL.  The VACUUM FREEZE generated 515MB of WAL - that
 is, 113 times more.

Essentially a bulk load.  OK, so if you bulk load data and then
vacuum it before updating 100% of it, this approach will generate a
lot more WAL than we currently do.  Of course, if you don't VACUUM
FREEZE after a bulk load and then are engaged in a fairly normal
OLTP workload with peak and off-peak cycles, you are currently
almost certain to hit a point during peak OLTP load where you begin
to sequentially scan all tables, rewriting them in place, with WAL
logging.  Incidentally, this tends to flush a lot of your hot
data out of cache, increasing disk reads.  The first time I hit
this interesting experience in production it was so devastating,
and generated so many user complaints, that I never again
considered a bulk load complete until I had run VACUUM FREEZE on it
-- although I was sometimes able to defer that to an off-peak
window of time.

In other words, for the production environments I managed, the only
value of that number is in demonstrating the importance of using
unlogged COPY followed by VACUUM FREEZE for bulk-loading and
capturing a fresh base backup upon completion.  A better way to use
pgbench to measure WAL size cost might be to initialize, VACUUM
FREEZE to set a long term baseline, and do a reasonable length
run with crontab running VACUUM FREEZE periodically (including
after the run was complete) versus doing the same with plain VACUUM
(followed by a VACUUM FREEZE at the end?).  Comparing the total WAL
sizes generated following the initial load and VACUUM FREEZE would
give a more accurate picture of the impact on an OLTP load, I
think.

 We'll still come out ahead if those tuples are going to stick
 around long enough that they would have eventually gotten frozen
 anyway, but if they get deleted again the loss is pretty
 significant.

Perhaps my perception is biased by having worked in an environment
where the vast majority of tuples (both in terms of tuple count and
byte count) were never updated and were only eligible for deletion
after a period of years.  Our current approach is pretty bad in
such an environment, at least if you try to leave all vacuuming to
autovacuum.  I'll admit that we were able to work around the
problems by running VACUUM FREEZE every night for most databases.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 12:39 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 The thing that made me nervous about that approach is that it made the LSN
 of each page critical information. If you somehow zeroed out the LSN, you
 could no longer tell which pages are frozen and which are not. I'm sure it
 could be made to work - and I got it working to some degree anyway - but
 it's a bit scary. It's similar to the multixid changes in 9.3: multixids
 also used to be data that you can just zap at restart, and when we changed
 the rules so that you lose data if you lose multixids, we got trouble. Now,
 LSNs are much simpler, and there wouldn't be anything like the
 multioffset/member SLRUs that you'd have to keep around forever or vacuum,
 but still..

LSNs are already pretty critical.  If they're in the future, you can't
flush those pages.  Ever.  And if they're wrong in either direction,
crash recovery is broken.  But it's still worth thinking about ways
that we could make this more robust.

I keep coming back to the idea of treating any page that is marked as
all-visible as frozen, and deferring freezing until the page is again
modified.  The big downside of this is that if the page is set as
all-visible and then immediately thereafter modified, it sucks to have
to freeze when the XIDs in the page are still present in CLOG.  But if
we could determine from the LSN that the XIDs in the page are new
enough to still be considered valid, then we could skip freezing in
those cases and only do it when the page is old.  That way, if
somebody zeroed out the LSN (why, oh why?) the worst that would happen
is that we'd do some extra freezing when the page was next modified.
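
Just to make the shape of that check concrete, a sketch could look like the
following (the function name and the horizon value are made up for
illustration; only PageGetLSN() and XLogRecPtrIsInvalid() are existing calls):

/*
 * Sketch only, not a patch: decide whether the tuples on an all-visible
 * page must be frozen before the page is modified.  xid_safe_lsn_horizon
 * is an assumed bookkeeping value: pages whose LSN is newer than it are
 * known to contain XIDs that can still be looked up in clog, so freezing
 * can be deferred.
 */
static bool
needs_freeze_before_modify(Page page, XLogRecPtr xid_safe_lsn_horizon)
{
	XLogRecPtr	page_lsn = PageGetLSN(page);

	/*
	 * A zeroed LSN is treated conservatively: freeze.  The worst case is
	 * some unnecessary freezing, never a wrong all-visible-implies-frozen
	 * conclusion.
	 */
	if (XLogRecPtrIsInvalid(page_lsn))
		return true;

	/* Older than the horizon: its XIDs may no longer be safe to consult. */
	return page_lsn < xid_safe_lsn_horizon;
}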

 I would feel safer if we added a completely new epoch counter to the page
 header, instead of reusing LSNs. But as we all know, changing the page
 format is a problem for in-place upgrade, and takes some space too.

Yeah.  We have a serious need to reduce the size of our on-disk
format.  On a TPC-C-like workload Jan Wieck recently tested, our data
set was 34% larger than another database at the beginning of the test,
and 80% larger by the end of the test.  And we did twice the disk
writes.  See "The Elephants in the Room.pdf" at
https://sites.google.com/site/robertmhaas/presentations

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
 Note that it's a bit complicated to set up that scenario today. Archiving is
 never enabled in recovery mode, so you'll need to use a custom cron job or
 something to maintain the archive that C uses. The files will not
 automatically flow from B to the second archive. With the patch we're
 discussing, however, it would be easy: just set archive_mode='always' in B.

Hmm, I see.  But if C never replays the last, partial segment from the
old timeline, how does it follow the timeline switch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Add CINE for ALTER TABLE ... ADD COLUMN

2015-04-22 Thread Payal Singh
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Seeing this when trying to apply the patch:

Patching file src/backend/commands/tablecmds.c using Plan A...
Hunk #1 FAILED at 328.
Hunk #2 succeeded at 2294 (offset 11 lines).
Hunk #3 FAILED at 3399.
Hunk #4 FAILED at 3500.
Hunk #5 succeeded at 4658 with fuzz 1 (offset 65 lines).
Hunk #6 succeeded at 4753 (offset 66 lines).
Hunk #7 succeeded at 4989 with fuzz 2 (offset 66 lines).
Hunk #8 succeeded at 5003 (offset 69 lines).
Hunk #9 succeeded at 5017 (offset 69 lines).
Hunk #10 succeeded at 5033 (offset 69 lines).

The new status of this patch is: Waiting on Author




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Kevin Grittner
Greg Stark st...@mit.edu wrote:

 And it's a major headache, people are always being surprised that
 their selects cause lots of I/O and slow down dramatically after
 a big update or data load has finished. It's characterized as
 why is the database writing everything twice (and saying it's
 actually writing everything three times doesn't make people feel
 better).

When I looked at the life-cycle of a heap tuple in a database I was
using, I found that (ignoring related index access and ignoring
WAL-file copying, etc., for our backups), each tuple that existed
long enough to freeze and be eventually deleted caused a lot of
writes.

(1) WAL log the insert.
(2) Write the tuple.
(3) Hint and rewrite the tuple.
(4) WAL log the freeze of the tuple.
(5) Rewrite the frozen tuple.
(6) WAL-log the delete.
(7) Rewrite the deleted tuple.
(8) Prune and rewrite the page.
(9) Free line pointers and rewrite the page.

If I was lucky some of the writes could be combined in cache
because they happened close enough together.  Also, one could hope
that not too much of the WAL-logging involved full page writes to
the WAL -- again, keeping steps close together in time helps with
that.  If all of (1) through (5) are done in quick succession, you
save two physical writes of the heap page and save one full page
write to WAL.  If steps (7) through (9) are done in quick
succession, you save two more physical writes to the heap.  This is
part of what makes the aggressive incremental freezing being
discussed on a nearby thread appealing -- at least for some
workloads.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Robert Haas
On Tue, Apr 21, 2015 at 8:30 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 This .partial segment renaming is something that we
 should let the archive_command manage with its internal logic.

This strikes me as equivalent to saying we don't know how to make
this work right, but maybe our users will know.  That never works
out.  As things stand, we have a situation where the archive_command
examples in our documentation are known to be flawed.  They don't
fsync the file, and they'll write a partial file and then, when rerun,
fail to copy the full file because there's already something there.
Efforts have been made to fix these problems (see the pg_copy thread),
but they haven't been completed yet, nor have we even documented the
issues with the commands recommended by the documentation.  Let's
please not throw anything else on the pile of things we're expecting
users to somehow get right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] cache lookup error for shell type creation with incompatible output function (DDL deparsing bug)

2015-04-22 Thread Alvaro Herrera
Michael Paquier wrote:
 Hi all,
 
 I just bumped into the following problem in HEAD (1c41e2a):
 =# create type my_array_float (INPUT = array_in, OUTPUT = array_out,
 ELEMENT = float4, INTERNALLENGTH = 32);
 ERROR:  XX000: cache lookup failed for type 0
 LOCATION:  format_type_internal, format_type.c:135

Argh.

 The fix consists in being sure that typoid uses the OID of the type
 shell created, aka the OID stored in address.ObjectId. Attached is a
 patch with a regression test checking for shell creation with
 incompatible input/output functions (failure caused by output function
 here though) able to check this code path.

Thanks, pushed.  I changed the line to be just below TypeShellMake,
which seems slightly better aligned to the comment just below, and in
fact it matches what DefineRange already uses.  I also modified a couple
of other places involving TypeCreate: one did not have the typoid =
addr.objectId assignment at all; while the other did, it actually seems
misplaced because at that spot we expect that typoid is already correct.
So I added an assert to the other place and changed that assignment to
an assert, too.

I scanned the rest of the bogus commit and couldn't find any other place
on which I made the same mistake.

Thanks for reporting,

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] TABLESAMPLE patch

2015-04-22 Thread Petr Jelinek

On 19/04/15 01:24, Michael Paquier wrote:

On Sat, Apr 18, 2015 at 8:38 PM, Michael Paquier
michael.paqu...@gmail.com wrote:

On Fri, Apr 17, 2015 at 10:54 PM, Petr Jelinek wrote:

On 10/04/15 06:46, Michael Paquier wrote:

13) Some regression tests with pg_tablesample_method would be welcome.


Not sure what you mean by that.


I meant a sanity check on pg_tablesample_method to be sure that
tsminit, tsmnextblock and tsmnexttuple are always defined as they are
mandatory functions. So the idea is to add a query like this and to be
sure that it returns no rows:
SELECT tsmname FROM pg_tablesample_method WHERE tsminit IS NOT NULL OR
tsmnextblock IS NOT NULL OR tsmnexttuple IS NOT NULL;


Yesterday was a long day. I meant IS NULL and not IS NOT NULL, but I
am sure you guessed it that way already..



Yes, I guessed that, and it's a very reasonable request. I guess it should 
look like the attached (I don't want to send a new version of everything 
just for this).


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From d059c792a1e864e38f321cea51169ea0b3c5caab Mon Sep 17 00:00:00 2001
From: Petr Jelinek pjmo...@pjmodos.net
Date: Wed, 22 Apr 2015 21:28:29 +0200
Subject: [PATCH 7/7] tablesample: add catalog regression test

---
 src/test/regress/expected/tablesample.out | 15 +++
 src/test/regress/sql/tablesample.sql  | 13 +
 2 files changed, 28 insertions(+)

diff --git a/src/test/regress/expected/tablesample.out b/src/test/regress/expected/tablesample.out
index 271638d..04e5eb8 100644
--- a/src/test/regress/expected/tablesample.out
+++ b/src/test/regress/expected/tablesample.out
@@ -209,6 +209,21 @@ SELECT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPLE BERNOULLI (5);
 ERROR:  syntax error at or near TABLESAMPLE
 LINE 1: ...CT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPL...
  ^
+-- catalog sanity
+SELECT *
+FROM pg_tablesample_method
+WHERE tsminit IS NULL
+   OR tsmseqscan IS NULL
+   OR tsmpagemode IS NULL
+   OR tsmnextblock IS NULL
+   OR tsmnexttuple IS NULL
+   OR tsmend IS NULL
+   OR tsmreset IS NULL
+   OR tsmcost IS NULL;
+ tsmname | tsmseqscan | tsmpagemode | tsminit | tsmnextblock | tsmnexttuple | tsmexaminetuple | tsmend | tsmreset | tsmcost 
+-++-+-+--+--+-++--+-
+(0 rows)
+
 -- done
 DROP TABLE test_tablesample CASCADE;
 NOTICE:  drop cascades to 2 other objects
diff --git a/src/test/regress/sql/tablesample.sql b/src/test/regress/sql/tablesample.sql
index 2f4b7de..7b3eb9b 100644
--- a/src/test/regress/sql/tablesample.sql
+++ b/src/test/regress/sql/tablesample.sql
@@ -57,5 +57,18 @@ SELECT * FROM query_select TABLESAMPLE BERNOULLI (5.5) REPEATABLE (1);
 
 SELECT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPLE BERNOULLI (5);
 
+-- catalog sanity
+
+SELECT *
+FROM pg_tablesample_method
+WHERE tsminit IS NULL
+   OR tsmseqscan IS NULL
+   OR tsmpagemode IS NULL
+   OR tsmnextblock IS NULL
+   OR tsmnexttuple IS NULL
+   OR tsmend IS NULL
+   OR tsmreset IS NULL
+   OR tsmcost IS NULL;
+
 -- done
 DROP TABLE test_tablesample CASCADE;
-- 
1.9.1




Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Heikki Linnakangas

On 04/22/2015 10:21 PM, Robert Haas wrote:

On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

For example, imagine that perform point-in-time recovery to WAL position
0/1237E568, on timeline 1. That falls within segment
00010012. Then we end recovery, and switch to timeline 2.
After the switch, and some more WAL-logged actions, we'll have these files
in pg_xlog:

00010011
00010012
00020012
00020013
00020014


Is the 00010012 file a partial segment of the sort
you're proposing to no longer archive?


If you did pure archive recovery, with no streaming replication 
involved, then no. If it was created by streaming replication, and the 
replication had not filled the whole segment yet, then yes, it would be 
a partial segment.



Note that there are two segments ending in 12. They both have the same
point up to offset 0x37E568, corresponding to the switch point 0/1237E568.
After that, the contents diverge: the segment on the new timeline contains a
checkpoint/end-of-recovery record at that point, followed by new WAL
belonging to the new timeline.


Check.


Recovery knows about that, so that if you set recovery target to timeline 2,
and it needs the WAL at the beginning of segment 12 (still belonging to
timeline 1), it will try to restoring both 00010012 and
00020012.


What if you set the recovery target to timeline 3?


It depends how timeline 3 was created. If timeline 3 was forked off from 
timeline 2, then recovery would find it. If it was forked off directly 
from timeline 1, then no.


- Heikki





Re: [HACKERS] Sequence Access Method WIP

2015-04-22 Thread Petr Jelinek

On 20/04/15 17:50, Heikki Linnakangas wrote:

On 03/15/2015 09:07 PM, Petr Jelinek wrote:

Slightly updated version of the patch.

Mainly rebased against current master (there were several conflicts) and
fixed some typos, no real functional change.

I also attached initial version of the API sgml doc.


I finally got around to having another round of review on this. I fixed a
couple of little bugs and did some minor editing of comments etc. See
attached. It is also available in my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, branch seqam,
if you want to look at individual changes. It combines your patches 1
and 4, I think those need to be applied together. I haven't looked at
the DDL changes yet.


Thanks!



I'm fairly happy with the alloc API now. I'm not sure it's a good idea
for the AM to access the sequence tuple directly, though. I would seem
cleaner if it only manipulated the amdata Datum. But it's not too bad as
it is.


Yeah, I was thinking about this myself; I just liked sending 10 
parameters to the function even less than this.




The division of labour between sequence.c and the AM, in the init and
the get/set_state functions, is a bit more foggy:

* Do we really need a separate amoptions() method and an init() method,
when the only caller to amoptions() is just before the init() method?
The changes in extractRelOptions suggest that it would call
am_reloptions for sequences too, but that doesn't actually seem to be
happening.


Hmm, yes, it should, and I am sure it did at some point; I must have messed 
it up during one of the rebases :(


And it's the reason why we need a separate API function.



postgres=# create sequence fooseq using local with (garbage=option);
CREATE SEQUENCE

Invalid options probably should throw an error.

* Currently, the AM's init function is responsible for basic sanity
checking, like min < max. It also has to extract the RESTART value from
the list of options. I think it would make more sense to move that to
sequence.c. The AM should only be responsible for initializing the
'amdata' portion, and for handling any AM-specific options. If the AM
doesn't support some options, like MIN and MAX value, it can check them
and throw an error, but it shouldn't be responsible for just passing
them through from the arguments to the sequence tuple.


Well, then we need to send restart as an additional parameter to the init 
function, as restart is normally not stored anywhere.


The check is actually whether the new value is within min/max, but yes, that 
can be done inside sequence.c, I guess.




* It might be better to form the sequence tuple before calling the init
function, and pass the ready-made tuple to it. All the other API
functions deal with the tuple (after calling sequence_read_tuple), so
that would be more consistent. The init function would need to
deconstruct it to modify it, but we're not concerned about performance
here.


Right, this is actually more of a relic of the custom columns 
implementation where I didn't want to build the tuple with NULLs for 
columns that might have been specified as NOT NULL, but with the amdata 
approach it can create the tuple safely.




* The transformations of the arrays in get_state() and set_state()
functions are a bit complicated. The seqam_get_state() function returns
two C arrays, and pg_sequence_get_state() turns them into a text[]
array. Why not construct the text[] array directly in the AM? I guess
it's a bit more convenient to the AM, if the pg_sequence_get_state() do
that, but if that's an issue, you could create generic helper function
to turn two C string arrays into text[], and call that from the AM.


Yeah that was exactly the reasoning. Helper function works for me (will 
check what Álvaro's suggested, maybe it can be moved somewhere and reused).
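
For illustration only, such a helper could look roughly like this (the name
and the alternating key/value layout are my assumptions, not something that
exists in the patch):

/*
 * Hypothetical helper: build a text[] of alternating key/value pairs from
 * two C string arrays, so individual sequence AMs don't have to repeat the
 * array-construction boilerplate.
 */
static ArrayType *
seqam_state_to_text_array(int count, const char **keys, const char **values)
{
	Datum	   *elems = palloc(sizeof(Datum) * count * 2);
	int			i;

	for (i = 0; i < count; i++)
	{
		elems[i * 2] = CStringGetTextDatum(keys[i]);
		elems[i * 2 + 1] = CStringGetTextDatum(values[i]);
	}

	/* text is a varlena type: typlen -1, not by-value, 'i' alignment */
	return construct_array(elems, count * 2, TEXTOID, -1, false, 'i');
}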





seq = (FormData_pg_sequence *) GETSTRUCT(sequence_read_tuple(seqh));

for (i = 0; i < count; i++)
{
    if (pg_strcasecmp(keys[i], "last_value") == 0)
        seq->last_value = DatumGetInt64(DirectFunctionCall1(int8in,
                                            CStringGetDatum(values[i])));
    else if (pg_strcasecmp(keys[i], "is_called") == 0)
        seq->is_called = DatumGetBool(DirectFunctionCall1(boolin,
                                            CStringGetDatum(values[i])));
    else
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("invalid state key \"%s\" for local sequence",
                        keys[i])));
}

sequence_save_tuple(seqh, NULL, true);


If that error happens after having already processed a last_value or
is_called entry, you have already modified the on-disk tuple. AFAICS
that's the only instance of that bug, but sequence_read_tuple -> modify
tuple in place -> sequence_save_tuple pattern is quite unsafe in general.
If you modify a tuple directly in a Buffer, you should have a critical
section around it. It would make sense to start a critical section in
sequence_read_tuple(), except that not all callers want to modify the
tuple in place. 
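
To illustrate the safe pattern (a sketch only; the function and variable
names here are placeholders, not code from the patch): do all parsing and
validation first, and only then touch the buffer inside a critical section.

static void
local_seqam_apply_state(Form_pg_sequence seq, Buffer buf,
						int64 new_last_value, bool new_is_called)
{
	/*
	 * All validation of the incoming state must already have happened;
	 * nothing between START_ and END_CRIT_SECTION() may elog(ERROR).
	 */
	START_CRIT_SECTION();

	seq->last_value = new_last_value;
	seq->is_called = new_is_called;
	MarkBufferDirty(buf);
	/* a real implementation would also WAL-log the change here */

	END_CRIT_SECTION();
}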

Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 On 04/22/2015 09:30 PM, Robert Haas wrote:
 On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas hlinn...@iki.fi
 wrote:

 Note that it's a bit complicated to set up that scenario today. Archiving
 is
 never enabled in recovery mode, so you'll need to use a custom cron job
 or
 something to maintain the archive that C uses. The files will not
 automatically flow from B to the second archive. With the patch we're
 discussing, however, it would be easy: just set archive_mode='always' in
 B.


 Hmm, I see.  But if C never replays the last, partial segment from the
 old timeline, how does it follow the timeline switch?

 At timeline switch, we copy the old segment to the new timeline, and start
 writing where we left off. So the WAL from the old timeline is found in the
 segment nominally belonging to the new timeline.

Check.

 For example, imagine that perform point-in-time recovery to WAL position
 0/1237E568, on timeline 1. That falls within segment
 00010012. Then we end recovery, and switch to timeline 2.
 After the switch, and some more WAL-logged actions, we'll have these files
 in pg_xlog:

 00010011
 00010012
 00020012
 00020013
 00020014

Is the 00010012 file a partial segment of the sort
you're proposing to no longer archive?

 Note that there are two segments ending in 12. They both have the same
 point up to offset 0x37E568, corresponding to the switch point 0/1237E568.
 After that, the contents diverge: the segment on the new timeline contains a
 checkpoint/end-of-recovery record at that point, followed by new WAL
 belonging to the new timeline.

Check.

 Recovery knows about that, so that if you set recovery target to timeline 2,
 and it needs the WAL at the beginning of segment 12 (still belonging to
 timeline 1), it will try to restoring both 00010012 and
 00020012.

What if you set the recovery target to timeline 3?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] cache lookup error for shell type creation with incompatible output function (DDL deparsing bug)

2015-04-22 Thread Alvaro Herrera
Alvaro Herrera wrote:
 Michael Paquier wrote:
  Hi all,
  
  I just bumped into the following problem in HEAD (1c41e2a):
  =# create type my_array_float (INPUT = array_in, OUTPUT = array_out,
  ELEMENT = float4, INTERNALLENGTH = 32);
  ERROR:  XX000: cache lookup failed for type 0
  LOCATION:  format_type_internal, format_type.c:135

I also wanted to point out, but forgot, that this command is not really
creating a shell type -- it's creating a full-blown type, because there
are args.  Shell types are created when no args are given.  This happens
to fail due to the internal creation of the shell type because of the
failure to look up the i/o functions, but as far as I can see the
original code should fail in any type creation in pretty much the same
way (didn't actually test that); not completely sure why this wasn't
more visible in regression tests.

I simply removed the word shell in the provided test case in the
committed version, anyway.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-22 Thread Kouhei Kaigai
Hi Robert,

Thanks for your comments.

 A few random cosmetic problems:
 
 - The hunk in allpaths.c is useless.
 - The first hunk in fdwapi.h contains an extra space before the
 closing parenthesis.

OK, it's my oversight.

 And then:
 
 +   else if (scan->scanrelid == 0 &&
 +            (IsA(scan, ForeignScan) || IsA(scan, CustomScan)))
 +       varno = INDEX_VAR;
 
 Suppose scan->scanrelid == 0 but the scan type is something else?  Is
 that legal?  Is varno == 0 the correct outcome in that case?

Right now, no other scan type has the capability to return tuples
with a flexible type/attributes beyond its static definition.
I think it is a valid restriction that only foreign/custom-scan
can have scanrelid == 0.

I checked the overall code again. One doubtful point was ExecScanFetch():
if estate->es_epqTuple is not NULL, it tries to save a tuple from
a particular scanrelid (larger than zero).
IIUC, es_epqTuple is used only when a fetched tuple is updated and the
visibility checks are then applied again on the write operation.
So it should work for a CSP with an underlying actual scan node on base
relations; however, I need to investigate the code for the case where the
FDW/CSP replaced an entire join subtree with an alternative relation scan
(like a materialized view).


  [ new patch ]
 
 A little more nitpicking:
 
 ExecInitForeignScan() and ExecInitCustomScan() could declare
 currentRelation inside the if (scanrelid  0) block instead of in the
 outer scope.

OK,

 I'm not too excited about the addition of GetFdwHandlerForRelation,
 which is a one-line function used in one place.  It seems like we
 don't really need that.

OK,

 On Fri, Mar 13, 2015 at 8:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  On Fri, Mar 13, 2015 at 2:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  I don't object to the concept, but I think that is a pretty bad place
  to put the hook call: add_paths_to_joinrel is typically called multiple
  (perhaps *many*) times per joinrel and thus this placement would force
  any user of the hook to do a lot of repetitive work.
 
  Interesting point.  I guess the question is whether a some or all
  callers are going to actually *want* a separate call for each
  invocation of add_paths_to_joinrel(), or whether they'll be happy to
  operate on the otherwise-complete path list.
 
  Hmm.  You're right, it's certainly possible that some users would like to
  operate on each possible pair of input relations, rather than considering
  the joinrel as a whole.  Maybe we need two hooks, one like your patch
  and one like I suggested.
 
 Let me attempt to summarize subsequent discussion on this thread by
 saying the hook location that you proposed (just before set_cheapest)
 has not elicited any enthusiasm from anyone else.  In a nutshell, the
 problem is that a single callback for a large join problem is just
 fine if there are no special joins involved, but in any other
 scenario, nobody knows how to use a hook at that location for anything
 useful.  To push down a join to the remote server, you've got to
 figure out how to emit an SQL query for it.  To execute it with a
 custom join strategy, you've got to know which of those joins should
 have inner join semantics vs. left join semantics.  A hook/callback in
 make_join_rel() or in add_paths_to_joinrel() makes that relatively
 straightforward. Otherwise, it's not clear what to do, short of
 copy-and-pasting join_search_one_level().  If you have a suggestion,
 I'd like to hear it.

I have nothing. When I once tried to put a hook just after set_cheapest(),
the largest problem was that we cannot extract the set of left and right
relations back out of a set of joined relations, much like extracting apple
and orange back out of mixed juice.

 If not, I'm going to press forward with the idea of putting the
 relevant logic in either add_paths_to_joinrel(), as previously
 proposed, or perhaps up one level in make_one_rel().  Either way, if
 you don't need to be called multiple times per joinrel, you can stash
 a flag inside whatever you hang off of the joinrel's fdw_private and
 return immediately on every call after the first.  I think that's
 cheap enough that we shouldn't get too stressed about it: for FDWs, we
 only call the hook at all if everything in the joinrel uses the same
 FDW, so it won't get called at all except for joinrels where it's
 likely to win big; for custom joins, multiple calls are quite likely
 to be useful and necessary, and if the hook burns too much CPU time
 for the query performance you get out of it, that's the custom-join
 provider's fault, not ours.  The current patch takes this approach one
 step further and attempts FDW pushdown only once per joinrel.  It does
 that because, while postgres_fdw DOES need the jointype and a valid
 innerrel/outerrel breakdown to figure out what query to generate, it
 does NOT need every possible breakdown; rather, the first one is as good as
 any other. But this might not be true for a non-PostgreSQL remote
 

Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Bruce Momjian
On Wed, Apr 22, 2015 at 06:36:23PM -0500, Jim Nasby wrote:
 On 4/22/15 6:12 PM, Bruce Momjian wrote:
 My point is that for the life of 200M transactions, you would have the
 overhead of an additional file per table in the file system, and updates
 of that.  I just don't know if the overhead over the long time period
 would be smaller than the VACUUM FREEZE.  It might be fine --- I don't
 know.  People seem to focus on the big activities, while many small
 activities can lead to larger slowdowns.
 
 Ahh. This wouldn't be for the life of 200M transactions; it would be
 a permanent fork, just like the VM is.

Right.  My point is that either you do X 200M times to maintain that fork
and the overhead of the file's existence, or you do one VACUUM FREEZE.  I
am saying that 200M is a large number and adding all those X's might
exceed the cost of a VACUUM FREEZE.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type

2015-04-22 Thread Jeff Davis
Moving thread to -hackers.

On Wed, Apr 8, 2015 at 11:18 PM, Jeff Davis pg...@j-davis.com wrote:
 That example was just for illustration. My other example didn't require
 creating a table at all:

   SELECT a=b FROM (SELECT ''::text, '  ') x(a,b);

 it's fine with me if we want that to fail, but I don't think we're
 failing in the right place, or with the right error message.

 I'm not clear on what rules we're applying such that the above query
 should fail, but:

   SELECT ''::text='  ';

 should succeed. Unknown literals are OK, but unknown column references
 are not? If that's the rule, can we catch that earlier, and throw an
 error like 'column reference b has unknown type'?

Is the behavior of unknown literals vs. unknown column references
documented anywhere? I tried looking here:
http://www.postgresql.org/docs/devel/static/typeconv.html, but it
doesn't seem to make the distinction between how unknown literals vs.
unknown column references are handled.

My understanding until now has been that unknown types are a
placeholder while still inferring the types. But that leaves a few
questions:

1. Why do we care whether the unknown is a literal or a column reference?
2. Unknown column references seem basically useless, because we will
almost always encounter the failure to find conversion error, so why
do we allow them?
3. If unknowns are just a placeholder, why do we return them to the
client? What do we expect the client to do with it?

Regards,
Jeff Davis




Re: [HACKERS] TABLESAMPLE patch

2015-04-22 Thread Michael Paquier
On Thu, Apr 23, 2015 at 4:31 AM, Petr Jelinek p...@2ndquadrant.com wrote:
 On 19/04/15 01:24, Michael Paquier wrote:

 On Sat, Apr 18, 2015 at 8:38 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:

 On Fri, Apr 17, 2015 at 10:54 PM, Petr Jelinek wrote:

 On 10/04/15 06:46, Michael Paquier wrote:

 13) Some regression tests with pg_tablesample_method would be welcome.


 Not sure what you mean by that.


 I meant a sanity check on pg_tablesample_method to be sure that
 tsminit, tsmnextblock and tsmnexttuple are always defined as they are
  mandatory functions. So the idea is to add a query like this and to be
  sure that it returns no rows:
 SELECT tsmname FROM pg_tablesample_method WHERE tsminit IS NOT NULL OR
 tsmnextblock IS NOT NULL OR tsmnexttuple IS NOT NULL;


 Yesterday was a long day. I meant IS NULL and not IS NOT NULL, but I
 am sure you guessed it that way already..


 Yes, I guessed that, and it's a very reasonable request. I guess it should look
 like the attached (I don't want to send a new version of everything just for
 this).

Thanks. That's exactly the idea.
-- 
Michael




Re: [HACKERS] tablespaces inside $PGDATA considered harmful

2015-04-22 Thread Bruce Momjian
On Fri, Jan 30, 2015 at 01:26:22PM -0800, Josh Berkus wrote:
 Robert, Stephen, etc.:
 
 Apparently you can create a tablespace in the tablespace directory:
 
 josh=# create tablespace tbl location '/home/josh/pg94/data/pg_tblspc/';
 CREATE TABLESPACE
 josh=# create table test_tbl ( test text ) tablespace tbl;
 CREATE TABLE
 josh=# \q
 josh@Radegast:~/pg94/data/pg_tblspc$ ls
 17656  PG_9.4_201409291
 josh@Radegast:~/pg94/data/pg_tblspc$ ls -l
 total 4
 lrwxrwxrwx 1 josh josh   30 Jan 30 13:02 17656 ->
 /home/josh/pg94/data/pg_tblspc
 drwx-- 3 josh josh 4096 Jan 30 13:02 PG_9.4_201409291
 josh@Radegast:~/pg94/data/pg_tblspc$
 
 In theory if I could guess the next OID, I could cause a failure there,
 but that appears to be obscure enough to be not worth bothering about.
 
 What is a real problem is that we don't block creating tablespaces
 anywhere at all, including in obviously problematic places like the
 transaction log directory:
 
 josh=# create tablespace tbl2 location '/home/josh/pg94/data/pg_xlog/';
 CREATE TABLESPACE
 
 It really seems like we ought to block *THAT*.  Of course, if we block
 tablespace creation in PGDATA generally, then that's covered.

I have developed the attached patch to warn about creating tablespaces
inside the data directory.  The case this doesn't catch is referencing a
symbolic link that points to the same directory.  We can't make it an
error, so that people can still pg_upgrade these setups.  This would be for
9.5 only.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
new file mode 100644
index fd22612..4ec1aff
*** a/src/backend/commands/tablespace.c
--- b/src/backend/commands/tablespace.c
*** CreateTableSpace(CreateTableSpaceStmt *s
*** 288,293 
--- 288,299 
   				 errmsg("tablespace location \"%s\" is too long",
  		location)));
  
+ 	/* Warn if the tablespace is in the data directory. */
+ 	if (path_is_prefix_of_path(DataDir, location))
+ 		ereport(WARNING,
+ 				(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ 				 errmsg("tablespace location should not be inside the data directory")));
+ 
  	/*
  	 * Disallow creation of tablespaces named pg_xxx; we reserve this
  	 * namespace for system purposes.



Re: [HACKERS] inherit support for foreign tables

2015-04-22 Thread Etsuro Fujita

On 2015/04/23 0:34, Stephen Frost wrote:

* Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote:

postgres=# select * from ft1 for update;
ERROR:  could not find junk tableoid1 column

I think this is a bug.  Attached is a patch fixing this issue.


Pushed, thanks!


Thank you.

Best regards,
Etsuro Fujita




Re: [HACKERS] inherit support for foreign tables

2015-04-22 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 16 Apr 2015 14:43:33 -0700, David Fetter da...@fetter.org wrote in 
20150416214333.ga...@fetter.org
 On Wed, Apr 15, 2015 at 09:35:05AM +0900, Kyotaro HORIGUCHI wrote:
  Hi,
  
  Before suppressing the symptom, I doubt the necessity and/or
  validity of giving foreign tables an ability to be a parent. Is
  there any reasonable usage for the ability?
...
 I have a use case for having foreign tables as non-leaf nodes in a
 partitioning hierarchy, namely geographic.

Ah, I see. I understood the case of intermediate nodes. I agree
that it is quite natural.

  One might have a table at
 HQ called foo_world, then partitions under it called foo_jp, foo_us,
 etc., in one level, foo_us_ca, foo_us_pa, etc. in the next level, and
 on down, each in general in a separate data center.
 
 Is there something essential about having non-leaf nodes as foreign
 tables that's a problem here?

No. I'm convinced of the necessity. Sorry for the noise.


At Wed, 22 Apr 2015 17:00:10 -0400, Robert Haas robertmh...@gmail.com wrote 
in CA+TgmobZVHp3D9wWCV8QJc+qGDu7=tekncbxowijzkhjucm...@mail.gmail.com
 Gee, I don't see why that would be unreasonable or invalid

Hmm. Yes, as mentioned above, there's no reason to refuse
non-leaf foreign tables. I didn't understand the real cause of
the problem and thought that not allowing foreign *root* tables
seemed better than tweaking elsewhere. But that thought turned out to be
total garbage :(

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?

2015-04-22 Thread Bruce Momjian
On Tue, Apr 21, 2015 at 05:36:41PM -0400, Robert Haas wrote:
 On Mon, Apr 20, 2015 at 5:41 PM, Bruce Momjian br...@momjian.us wrote:
  On Mon, Apr 20, 2015 at 05:04:14PM -0400, Robert Haas wrote:
  On Mon, Apr 20, 2015 at 4:11 PM, Bruce Momjian br...@momjian.us wrote:
   Slightly improved patch applied.
 
  Is it?
 
  The patch has a slightly modified 'if' statement to check a constant
  before calling a function, and use elseif:
 
  +		if (!interpretOidsOption(stmt->options, true) && cxt.hasoids)
 ---
  +		if (cxt.hasoids && !interpretOidsOption(stmt->options, true))
 47c57
  +		if (interpretOidsOption(stmt->options, true) && !cxt.hasoids)
 ---
  +		else if (!cxt.hasoids && interpretOidsOption(stmt->options, true))
 
  I realize the change is subtle.
 
 What I meant was - I didn't see an attachment on that message.

I didn't attach it as people have told me they can just as easily see
the patch via git, and since it was so similar, I didn't repost it. 
Should I have?  I can easily do that.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type

2015-04-22 Thread David G. Johnston
My apologies if much of this is already assumed knowledge by most
-hackers...I'm trying to learn from observation instead of, largely,
reading code in a foreign language.

On Wed, Apr 22, 2015 at 6:40 PM, Jeff Davis pg...@j-davis.com wrote:

 Moving thread to -hackers.

 On Wed, Apr 8, 2015 at 11:18 PM, Jeff Davis pg...@j-davis.com wrote:
  That example was just for illustration. My other example didn't require
  creating a table at all:
 
 
 SELECT a=b FROM (SELECT ''::text, '  ') x(a,b);
 
  it's fine with me if we want that to fail, but I don't think we're
  failing in the right place, or with the right error message.
 
  I'm not clear on what rules we're applying such that the above query
  should fail, but:
 
 
 SELECT ''::text='  ';


  should succeed. Unknown literals are OK, but unknown column references
  are not? If that's the rule, can we catch that earlier, and throw an
  error like 'column reference b has unknown type'?


But the fact that column b has the data type unknown is only a warning
- not an error.

This seems to be a case of the common problem (or, at least recently
mentioned) where type conversion only deals with data and not context.

http://www.postgresql.org/message-id/CADx9qBmVPQvSH3+2cH4cwwPmphW1mE18e=wumlfuc-qz-t7...@mail.gmail.com

Additional hinting regarding the column containing the offending data would
be welcomed by the community - but I suspect it is a non-trivial endeavor.


 Is the behavior of unknown literals vs. unknown column references
 documented anywhere? I tried looking here:
 http://www.postgresql.org/docs/devel/static/typeconv.html, but it
 doesn't seem to make the distinction between how unknown literals vs.
 unknown column references are handled.

 My understanding until now has been that unknown types are a
 placeholder while still inferring the types. But that leaves a few
 questions:

 1. Why do we care whether the unknown is a literal or a column reference?


Apparently the difference is in when non-implicit casts can be used for
coercion - or, rather, when input functions can be used instead of casting
functions.

In SELECT '  '::text = 'a', the explicit cast between the implicit unknown
and text is used, while going through the subquery forces the planner to
locate an implicit cast between the explicit unknown and text.

The following fails no matter what you try because no casts exist from
unknown to integer:

SELECT a::int=b FROM (SELECT '1', 1) x(a,b);

but this too works - which is why the implicit cast concept above fails
(I'm leaving it since the thought process may help in understanding):

SELECT 1 = '1';

From which I infer that an unknown literal is allowed to be fed directly
into a type's input function to facilitate a direct coercion.
Writing this makes me wish for more precise terminology...is there
something already established here?  untyped versus unknown makes sense
to me.  untyped literals only exist within the confines of a single node
and can be passed through a type's input function to make them take on a
type.  If the untyped reference passes through the node without having been
assigned an explicit type it is assigned the unknown type.

2. Unknown column references seem basically useless, because we will
 almost always encounter the failure to find conversion error, so why
 do we allow them?


At this point...backward compatibility?

I do get a warning in psql (9.3.6) from your original -bugs example

create table a(u) as select '1';

WARNING:  column "u" has type "unknown"
DETAIL:  Proceeding with relation creation anyway.

Related question: was there ever a time when the above failed instead of
just supplying a warning?

My git-fu is not super strong but the above warning was last edited by Tom
Lane back in 2003 (d8528630) though it was just a refactor - the warning
was already present.  I suppose after 12 years the why doesn't matter so
much...

create table b(i int);
insert into b select u from a;
ERROR:  failed to find conversion function from unknown to integer

Text appears to have a cast defined:

SELECT u::text FROM a;


 3. If unknowns are just a placeholder, why do we return them to the
 client? What do we expect the client to do with it?


We do?  I suspect that it is effectively treated as if it were text by
client libraries.

My gut reaction is that if you feel strongly enough to add some additional
documentation or warnings/hints/details related to this topic, they would
probably get put in; but disallowing unknown as a first-class type is likely
to fail a cost-benefit evaluation.

Distinguishing between untyped literals and unknown-type literals seems
like a promising concept for understanding the difference, in the face of not
being able (or not wanting) to actually change the behavior.

David J.


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Alvaro Herrera
Bruce Momjian wrote:
 On Wed, Apr 22, 2015 at 04:36:17PM +0100, Greg Stark wrote:
  On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian br...@momjian.us wrote:
   Well, we have to assume there are many misconfigured configurations ---
   autovacuum isn't super-easy to configure, so we can't just blame the
   user if this makes things worse.  In fact, page pruning was designed
   spefically for cases where autovacuum wasn't running our couldn't keep
   up.
  
  Well autovacuum isn't currently considering HOT pruning part of its
  job at all. It's hard to call it misconfigured when there's
  literally *no* way to configure it correctly.
 
 Good point, but doesn't vacuum remove the need for pruning as it removes
 all the old rows?

Sure.  The point, I think, is to make autovacuum runs of some sort that
don't actually vacuum but only do HOT-pruning.  Maybe this is a
reasonable solution to the problem that queries don't prune anymore
after Simon's patch.  If we made autovac HOT-prune periodically, we
could have read-only queries prune only already-dirty pages.  Of course,
that would need further adjustments to default number of autovac
workers, I/O allocation, etc.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Parallel Seq Scan

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 I have implemented this idea (note that I have to expose a new API
 shm_mq_from_handle as TupleQueueFunnel stores shm_mq_handle* and
 we need shm_mq* to call shm_mq_detach), and apart from this I have fixed other
 problems reported on this thread:

 1. Execution of initPlan by master backend and then pass the
 required PARAM_EXEC parameter values to workers.
 2. Avoid consuming dsm's by freeing the parallel context after
 the last tuple is fetched.
 3. Allow execution of Result node in worker backend as that can
 be added as a gating filter on top of PartialSeqScan.
 4. Merged parallel heap scan descriptor patch

 To apply the patch, please follow below sequence:

 HEAD Commit-Id: 4d930eee
 parallel-mode-v9.patch [1]
 assess-parallel-safety-v4.patch [2]  (don't forget to run fixpgproc.pl in
 the patch)
 parallel_seqscan_v14.patch (Attached with this mail)

Thanks, this version looks like an improvement.  However, I still see
some problems:

- I believe the separation of concerns between ExecFunnel() and
ExecEndFunnel() is not quite right.  If the scan is shut down before
it runs to completion (e.g. because of LIMIT), then I think we'll call
ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path.  I
think you probably need to create a static subroutine that is called
both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
case cleaning up whatever resources remain.
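
Roughly the shape I have in mind is something like this (FunnelState and its
fields are placeholders for whatever the patch ends up defining; only
DestroyParallelContext() comes from the parallel-mode patch itself):

/*
 * Sketch of a single idempotent shutdown routine, called both from the
 * TupIsNull(slot) path in ExecFunnel() and from ExecEndFunnel(), so the
 * parallel resources are released exactly once no matter which runs first.
 */
static void
ExecShutdownFunnel(FunnelState *node)
{
	if (node->pcxt != NULL)
	{
		/* waits for the workers to finish and releases the DSM segment */
		DestroyParallelContext(node->pcxt);
		node->pcxt = NULL;
	}
}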

- InitializeParallelWorkers() still mixes together general parallel
executor concerns with concerns specific to parallel sequential scan
(e.g. EstimatePartialSeqScanSpace).   We have to eliminate everything
that assumes that what's under a funnel will be, specifically, a
partial sequential scan. To make this work properly, I think we should
introduce a new function that recurses over the plan tree and invokes
some callback for each plan node.  I think this could be modeled on
this code from ExplainNode(), beginning around line 1593:

    /* initPlan-s */
    if (planstate->initPlan)
        ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es);

    /* lefttree */
    if (outerPlanState(planstate))
        ExplainNode(outerPlanState(planstate), ancestors,
                    "Outer", NULL, es);

    /* righttree */
    if (innerPlanState(planstate))
        ExplainNode(innerPlanState(planstate), ancestors,
                    "Inner", NULL, es);

    /* special child plans */
    switch (nodeTag(plan))
    {
            /* a bunch of special cases */
    }

    /* subPlan-s */
    if (planstate->subPlan)
        ExplainSubPlans(planstate->subPlan, ancestors, "SubPlan", es);

The new function would do the same sort of thing, but instead of
explaining each node, it would invoke a callback for each node.
Possibly explain.c could use it instead of having hard-coded logic.
Possibly it should use the same sort of return-true convention as
expression_tree_walker, query_tree_walker, and friends.  So let's call
it planstate_tree_walker.

Now, instead of directly invoking logic specific to parallel
sequential scan, it should call planstate_tree_walker() on its
lefttree and pass a new function ExecParallelEstimate() as the
callback.  That function ignores any node that's not parallel aware,
but when it sees a partial sequential scan (or, in the future, a
parallel bitmap scan, parallel sort, or what have you) it does the
appropriate estimation work.  When ExecParallelEstimate() finishes, we
InitializeParallelDSM().  Then, we call planstate_tree_walker() on the
lefttree again, and this time we pass another new function
ExecParallelInitializeDSM().  Like the previous one, that ignores the
callbacks from non-parallel nodes, but if it hits a parallel node,
then it fills in the parallel bits (i.e. ParallelHeapScanDesc for a
partial sequential scan).
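
To make the planstate_tree_walker() idea concrete, a rough sketch might be
(special child plans and subPlan-s are omitted here, and the exact signature
is certainly open for discussion):

/*
 * Walk the initPlans and the left/right children of a PlanState, invoking
 * the callback for each; modeled on the ExplainNode() recursion quoted
 * above.  Following the expression_tree_walker convention, returning true
 * terminates the walk.
 */
static bool
planstate_tree_walker(PlanState *planstate,
					  bool (*walker) (PlanState *, void *),
					  void *context)
{
	ListCell   *lc;

	/* initPlan-s */
	foreach(lc, planstate->initPlan)
	{
		SubPlanState *sps = (SubPlanState *) lfirst(lc);

		if (walker(sps->planstate, context))
			return true;
	}

	/* lefttree */
	if (outerPlanState(planstate) &&
		walker(outerPlanState(planstate), context))
		return true;

	/* righttree */
	if (innerPlanState(planstate) &&
		walker(innerPlanState(planstate), context))
		return true;

	return false;
}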

- shm_mq_from_handle() is probably reasonable, but can we rename it
shm_mq_get_queue()?

- It's hard to believe this is right:

+   if (parallelstmt->inst_options)
+       receiver = None_Receiver;

Really?  Flush the tuples if there are *any instrumentation options
whatsoever*?  At the very least, that doesn't look too future-proof,
but I'm suspicious that it's outright incorrect.

- I think ParallelStmt probably shouldn't be defined in parsenodes.h.
That file is included in a lot of places, and adding all of those
extra #includes there doesn't seem like a good idea for modularity
reasons even if you don't care about partial rebuilds.  Something that
includes a shm_mq obviously isn't a parse node in any meaningful
sense anyway.

- I don't think you need both setup cost and startup cost.  Starting
up more workers isn't particularly more expensive than starting up
fewer of them, because most of the overhead is in waiting for them to
actually start, and if the number of workers is reasonable, then they'll
all be doing that in parallel with each other.  I suggest removing
parallel_startup_cost and keeping 

Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Bruce Momjian
On Wed, Apr 22, 2015 at 06:07:00PM -0300, Alvaro Herrera wrote:
  Good point, but doesn't vacuum remove the need for pruning as it removes
  all the old rows?
 
 Sure.  The point, I think, is to make autovacuum runs of some sort that
 don't actually vacuum but only do HOT-pruning.  Maybe this is a
 reasonable solution to the problem that queries don't prune anymore
 after Simon's patch.  If we made autovac HOT-prune periodically, we
 could have read-only queries prune only already-dirty pages.  Of course,
 that would need further adjustments to default number of autovac
 workers, I/O allocation, etc.

Do we really want to make vacuum more complex for this?  vacuum does
have the delay settings we would need though.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-22 Thread Peter Geoghegan
On Wed, Apr 22, 2015 at 3:23 PM, Peter Geoghegan p...@heroku.com wrote:
 * We need to sort out those issues with the grammar, since that only
 really applies to the inference specification. Maybe the WHERE clause
 that the inference specification accepts can be broken out. No ON
 CONFLICT UPDATE specific issues left there, AFAICT though.

I pushed some code that deals with the predicate being within parentheses:

https://github.com/petergeoghegan/postgres/commit/358854645279523310f998dfc9cb3fe3e165ce1e

(it now follows the attributes/expressions indexes, in the style of
CREATE INDEX).

We still need to reconcile these changes to the grammar with your own,
Andres. I'm going to wait to hear back on what you think about that.
Note that this removal:

-%nonassoc DISTINCT
-%nonassoc ON

was incidental to the commit (this is the code you could have removed
when you modified the grammar, adding a new production, but didn't).
-- 
Peter Geoghegan




Re: [HACKERS] Allow SQL/plpgsql functions to accept record

2015-04-22 Thread Jim Nasby

On 4/22/15 2:12 PM, Merlin Moncure wrote:

That being said, I think json types with their associated API, given
that they are core types, will ultimately handle these types of
problems.  That way, at least, we can avoid adding syntax and
functionality that will basically do the same thing.  This reminds me
a little bit of the json_build() vs enhanced row() syntax we discussed
some time back.  I didn't say so at the time, but for posterity, I
think you were right...json_build() is working fine for building
arbitrary record types and moving a record to json and deconstructing
it should work just as well.


The one part I don't care for in that is it seems rather inefficient to 
cast something to JSON just so we can do things we really should be able 
to do with a record. But perhaps it's not all that costly.


As for allowing SQL and plpgsql functions to accept a record, I think 
our JSON functionality just provided plenty of reason we should allow 
accepting them, even if you can't do much with it: you *can* hand it to 
row_to_json(), which does allow you to do something useful with it. So 
it seems reasonable to me that we should at least accept it as a 
function argument.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Bruce Momjian
On Tue, Apr 21, 2015 at 08:39:37AM +0200, Andres Freund wrote:
 On 2015-04-20 17:13:29 -0400, Bruce Momjian wrote:
  Didn't you think any of the TODO threads had workable solutions?  And
  don't expect adding an additional file per relation will be zero cost
  --- added over the lifetime of 200M transactions, I question if this
  approach would be a win.
 
 Note that normally you'd not run with a 200M transaction freeze max age
 on a busy server. Rather around a magnitude more.
 
 Think about this being used on a time-partitioned table. Right now all
 the partitions have to be fully rescanned on a regular basis - quite
 painful. With something like this normally only the newest partitions
 will have to be.

My point is that for the life of 200M transactions, you would have the
overhead of an additional file per table in the file system, and updates
of that.  I just don't know if the overhead over the long time period
would be smaller than the VACUUM FREEZE.  It might be fine --- I don't
know.  People seem to focus on the big activities, while many small
activities can lead to larger slowdowns.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Jim Nasby

On 4/22/15 6:12 PM, Bruce Momjian wrote:

My point is that for the life of 200M transactions, you would have the
overhead of an additional file per table in the file system, and updates
of that.  I just don't know if the overhead over the long time period
would be smaller than the VACUUM FREEZE.  It might be fine --- I don't
know.  People seem to focus on the big activities, while many small
activities can lead to larger slowdowns.


Ahh. This wouldn't be for the life of 200M transactions; it would be a 
permanent fork, just like the VM is.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 3:34 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 On 04/22/2015 10:21 PM, Robert Haas wrote:
 On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas hlinn...@iki.fi
 wrote:
 For example, imagine that we perform point-in-time recovery to WAL position
 0/1237E568, on timeline 1. That falls within segment
 00010012. Then we end recovery, and switch to timeline 2.
 After the switch, and some more WAL-logged actions, we'll have these
 files
 in pg_xlog:

 00010011
 00010012
 00020012
 00020013
 00020014


 Is the 00010012 file a partial segment of the sort
 you're proposing to no longer archive?

 If you did pure archive recovery, with no streaming replication involved,
 then no. If it was created by streaming replication, and the replication had
 not filled the whole segment yet, then yes, it would be a partial segment.

Why the difference?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-22 Thread Peter Geoghegan
On Tue, Apr 21, 2015 at 7:57 AM, Andres Freund and...@anarazel.de wrote:
 I'm not 100% sure Heikki and I are on exactly the same page here :P

 I'm looking at git diff $(git merge-base upstream/master HEAD).. where
 HEAD is e1a5822d164db0.

Merged your stuff into my Github branch. Heikki is pushing changes
there directly now.

 * The logical stuff looks much saner.

Cool.

 * Please add tests for the logical decoding stuff. Probably both a plain
   regression test and an isolationtester test in
   contrib/test_decoding. Including one that does spooling to disk.

Working on it...hope to push that to Github soon.

 * I don't like REORDER_BUFFER_CHANGE_INTERNAL_INSERT/DELETE as names. Why not
   _SPECINSERT and _SPECDELETE or such?

Changed that on Github.

 * Iff we're going to have the XLOG_HEAP_AFFIRM record, I'd rather have
   that guide the logical decoding code. Seems slightly cleaner.

I thought that you didn't think that would always work out? That in
some possible scenario it could break?

 * Still not a fan of the name 'arbiter' for the OC indexes.

What do you prefer? Seems to describe what we're talking about
reasonably well to me.

 * Gram.y needs a bit more discussion:
   * Can anybody see a problem with changing the precedence of DISTINCT 
 ON to nonassoc? Right now I don't see a problem given both are
 reserved keywords already.

Why did you push code that solved this in a roundabout way, without
actually reverting my original nonassoc changes (which would now not
result in shift/reduce conflicts)? What should we do about that? It
seems you're unsure (otherwise you'd have reverted my change). I don't
like that you seem to have regressed the diagnostic output [1]. Surely
it's simpler to use the nonassoc approach? I think this works by giving
the relevant keywords an explicit priority lower than '(', so that a
rule with ON CONFLICT '(' will shift rather than reducing a conflicting
rule (CONFLICT is an unreserved keyword). I wrote the code so long ago
that I can't really remember why I thought it was the right thing,
though.

   * UpdateInsertStmt is a horrible name. OnConflictUpdateStmt maybe?

Done on Github.

   * '(' index_params where_clause ')' is imo rather strange. The where
 clause is inside the parens? That's quite different from the
 original index clause.

I don't know. Maybe I was lazy about fixing shift/reduce conflicts.   :-)

I'll look at this some more.

 * "SPEC_IGNORE,  /* INSERT of ON CONFLICT IGNORE */" looks like
   a wrongly copied comment.

Not sure what you mean here. Please clarify.

 * The indentation in ReorderBufferCommit is clearly getting out of hand,
   I've queued up a commit cleaning this up in my repo, feel free to merge.

Done on Github.

 * I don't think we use the term 'ordinary table' in error messages so
   far.

Fixed on Github.

 * I still think it's unacceptable to redefine
   XLOG_HEAP_LAST_MULTI_INSERT as XLOG_HEAP_SPECULATIVE_TUPLE like you
   did. I'll try to find something better.

I did what you suggested in your follow-up e-mail (changes are on Github [2]).

 * I wonder if we now couldn't avoid changing heap_delete's API - we can
   always super delete if we find a speculative insertion now. It'd be
   nice not to break out-of-core callers if not necessary.

Maybe, but if there is a useful way to break out only a small subset
of heap_delete(), I'm not seeing it. Most of the callers that need a
new NULL argument are heap_insert() callers, actually. There are now 3
heap_delete() callers, up from 2.

 * breinbaas on IRC just mentioned that it'd be nice to have upsert as a
   link in the insert. Given that that's the pervasive term that doesn't
   seem absurd.

Not sure what you mean. Where would the link appear? It is kind of
hard to categorize that text so that we're strictly either talking
about INSERT or UPSERT. Might be possible, though.

 I think this is getting closer to a commit. Let's get this done.

Great!

The blockers to committing the IGNORE patch I see are:

* We need to tweak some of the logical decoding stuff a bit more, I
feel. In particular, firm up the details of whether we treat a
confirmation record as a logical decoding change that is involved in
the new dance during transaction reassembly.

* We need to sort out those issues with the grammar, since that only
really applies to the inference specification. Maybe the WHERE clause
that the inference specification accepts can be broken out. No ON
CONFLICT UPDATE specific issues left there, AFAICT though.

That really seems totally doable in just a couple of days.

The blockers for committing the ON CONFLICT UPDATE patch are larger, but doable:

* We need to figure out the tuple lock strength details. I think this
is doable, but it is the greatest challenge to committing ON CONFLICT
UPDATE at this point. Andres feels that we should require no greater
lock strength than an equivalent UPDATE. I suggest we get to this
after committing the IGNORE variant. We probably need to discuss this
some more.

Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Jim Nasby

On 4/22/15 1:51 PM, Kevin Grittner wrote:

(1) WAL log the insert.
(2) Write the tuple.
(3) Hint and rewrite the tuple.
(4) WAL log the freeze of the tuple.
(5) Rewrite the frozen tuple.
(6) WAL-log the delete.
(7) Rewrite the deleted tuple.
(8) Prune and rewrite the page.
(9) Free line pointers and rewrite the page.

If I was lucky some of the writes could be combined in cache
because they happened close enough together. Also, one could hope
that not too much of the WAL-logging involved full page writes to
the WAL -- again, keeping steps close together in time helps with
that.


This is why I like the idea of methods that tell us where we need to do 
cleanup... they provide us with a rough ability to track what tuples are 
in what part of their lifecycle. The VM helps with this a small amount, 
but really it only applies after 1 and 6; it doesn't help us with any 
other portions.


Having a way to track recently created tuples would allow us to be much 
more efficient with steps 1-3, and with aggressive freezing, steps 1-5. 
A way to track recently deleted tuples would help with steps 6-7, 
possibly 6-9 if there are no indexes.


If we doubled the size of the VM, that would let us track 4 states for 
each page:


- Page has newly inserted tuples
- Page has newly deleted tuples
- Page is all visible
- Page is frozen

though as discussed elsewhere, we could probably combine all visible and 
frozen.


The win from doing this would be easily knowing what pages need hinting 
(newly inserted) and pruning (newly deleted). Unfortunately we still 
wouldn't know whether we could do real work without visiting the page 
itself, but I suspect that for many workloads just having newly 
inserted/deleted would be a serious win.
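
As a concrete (purely illustrative) sketch of what a doubled map might 
look like at the C level: the state names and values are invented here, 
and the arithmetic simply halves the number of heap pages each 8 kB map 
page can cover, ignoring the map page header that the real visibility 
map subtracts.

#include "postgres.h"	/* BLCKSZ, BITS_PER_BYTE, BlockNumber, uint8 */

/* Invented state values -- one possible encoding of the four states above */
#define VM2_NEWLY_INSERTED		0	/* page has newly inserted tuples (needs hinting) */
#define VM2_NEWLY_DELETED		1	/* page has newly deleted tuples (needs pruning) */
#define VM2_ALL_VISIBLE			2
#define VM2_FROZEN				3

#define VM2_BITS_PER_HEAPBLOCK	2
#define VM2_HEAPBLOCKS_PER_BYTE	(BITS_PER_BYTE / VM2_BITS_PER_HEAPBLOCK)	/* 4 */
#define VM2_HEAPBLOCKS_PER_PAGE	(BLCKSZ * VM2_HEAPBLOCKS_PER_BYTE)

#define VM2_MAPBLOCK(heapBlk)	((heapBlk) / VM2_HEAPBLOCKS_PER_PAGE)
#define VM2_MAPBYTE(heapBlk)	(((heapBlk) % VM2_HEAPBLOCKS_PER_PAGE) / VM2_HEAPBLOCKS_PER_BYTE)
#define VM2_MAPSHIFT(heapBlk)	(((heapBlk) % VM2_HEAPBLOCKS_PER_BYTE) * VM2_BITS_PER_HEAPBLOCK)

/* Extract the two state bits for a heap block from its map page contents */
static inline uint8
vm2_get_state(const unsigned char *mapcontents, BlockNumber heapBlk)
{
	return (mapcontents[VM2_MAPBYTE(heapBlk)] >> VM2_MAPSHIFT(heapBlk)) & 0x03;
}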

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inherit support for foreign tables

2015-04-22 Thread Robert Haas
On Tue, Apr 14, 2015 at 8:35 PM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 Before suppressing the symptom, I doubt the necessity and/or
 validity of giving foreign tables an ability to be a parent. Is
 there any reasonable usage for the ability?

Gee, I don't see why that would be unreasonable or invalid

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bogus WAL segments archived after promotion

2015-04-22 Thread Michael Paquier
On Mon, Apr 13, 2015 at 11:57 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 On 04/01/2015 07:12 PM, Bruce Momjian wrote:

 On Fri, Dec 19, 2014 at 10:26:34PM +0200, Heikki Linnakangas wrote:

 On 12/19/2014 02:55 PM, Heikki Linnakangas wrote:

 I'm thinking that we should add a step to promotion, where we scan
 pg_xlog for any segments higher than the timeline switch point, and
 remove them, or mark them with .done so that they are not archived.
 There might be some real WAL that was streamed from the primary, but not
 yet applied, but such WAL is of no interest to that server anyway, after
 it's been promoted. It's a bit disconcerting to zap WAL that's valid,
 even if it doesn't belong to the current server's timeline history, because
 as a general rule it's good to avoid destroying evidence that might be
 useful in debugging. There isn't much difference between removing them
 immediately and marking them as .done, though, because they will
 eventually be removed/recycled anyway if they're marked as .done.


 This is what I came up with. This patch removes the suspect segments
 at timeline switch. The alternative of creating .done files for them
 would preserve more evidence for debugging, but OTOH it would also
 be very confusing to have valid-looking WAL segments in pg_xlog,
 with .done files, that in fact contain garbage.

 The patch is a bit longer than it otherwise would be, because I
 moved the code to remove a single file from RemoveOldXlogFiles() to
 a new function. I think that makes it more readable in any case,
 simply because it was so deeply indented in RemoveOldXlogFiles.


 Where are we on this?


 I didn't hear any better ideas, so committed this now.

Finally looking at that... The commit log of b2a5545 is a bit
misleading. Segment files that were recycled during archive recovery
are not necessarily removed; they could instead be recycled during
promotion on the new timeline, in line with what RemoveOldXlogFiles()
does. Hence I think that the comment on top of
RemoveNonParentXlogFiles() should be updated to reflect that, as in
the patch attached.

Something minor: perhaps we could refactor xlogarchive.c to have
XLogArchiveCheckDone() and XLogArchiveIsBusy() use the new
XLogArchiveIsReady().
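
For what it's worth, a rough sketch of that refactoring; the function
body is paraphrased from memory, so treat it as illustrative rather than
an exact copy of xlogarchive.c:

/* Sketch: XLogArchiveCheckDone() reusing the existing XLogArchiveIsReady() */
bool
XLogArchiveCheckDone(const char *xlog)
{
	char		archiveStatusPath[MAXPGPATH];
	struct stat stat_buf;

	/* Always deletable if archiving is not enabled */
	if (!XLogArchivingActive())
		return true;

	/* .done present: the archiver is finished with this segment */
	StatusFilePath(archiveStatusPath, xlog, ".done");
	if (stat(archiveStatusPath, &stat_buf) == 0)
		return true;

	/* .ready present: the archiver still has work to do, keep the file */
	if (XLogArchiveIsReady(xlog))
		return false;

	/* Neither marker exists: recreate .done so the segment can be recycled */
	XLogArchiveNotify(xlog);
	return true;
}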
Regards,
-- 
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2580996..e0a8b81 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3596,7 +3596,8 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 }
 
 /*
- * Remove WAL files that are not part of the given timeline's history.
+ * Recycle or remove WAL files that are not part of the given timeline's
+ * history.
  *
  * This is called during recovery, whenever we switch to follow a new
  * timeline, and at the end of recovery when we create a new timeline. We

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Code paths where LWLock should be released on failure

2015-04-22 Thread Michael Paquier
 Hi all,

After looking at bug #13128, I have been looking at the code around
LWLockAcquire/Release to see if there are similar issues elsewhere.
Here are my findings:

1) SimpleLruReadPage() holds a control lock at entry that will be held
at exit as well. However SlruReportIOError() can report an error,
leaving the lock held. Shouldn't we release the control lock when a
failure happens?

2) The patch attached to #13128 fixes MarkAsPreparing(), but actually
twophase.c also does not release some locks in LockGXact().

3) PreCommit_Notify@async.c should release AsyncQueueLock on failure I
guess because it is called at transaction commit. At least it looks
safer this way.

4) TablespaceCreateDbspace does not release TablespaceCreateLock but
LWLockReleaseAll would do it when aborting its transaction, so no
changes done there (?).

5) In ReplicationSlotCreate@slot.c, I would think that
ReplicationSlotAllocationLock should be released when all the slots
are in use. Similarly, in ReplicationSlotDropAcquired,
ReplicationSlotAllocationLock should be released when !fail_softly.
SaveSlotToPath could also be made safer when a file could not be
created.

6) In dsm.c, dsm_create does not release
DynamicSharedMemoryControlLock when Error'ing when there are too many
shared memory segments.

7) In shmem.c, ShmemInitStruct does not release ShmemIndexLock on OOM.
I guess that's fine in bootstrap mode, still...

8) In lock.c, LockRelease() does not release partitionLock when a
shared lock cannot be found. Similar thing with
FastPathGetRelationLockEntry().

9) In predicate.c, CreatePredicateLock() forgets to release
SerializablePredicateLockListLock and partitionLock in case of an OOM.
There is a similar issue with ReleaseOneSerializableXact(),
CheckForSerializableConflictOut() and
predicatelock_twophase_recover().

10) In relcache.c, RelCacheInitLock is not released in
RelationCacheInitFilePreInvalidate() after unlink() failure.

11) In AlterSystemSetConfigFile(), AutoFileLock should be released on failure.

All those things give the patch attached. Comments are welcome.
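
The general shape of the change at each of those call sites is the
same; roughly (the names below are placeholders, not any specific lock
or condition from the list):

	LWLockAcquire(SomeLock, LW_EXCLUSIVE);

	if (failure_condition)
	{
		/* release before erroring out instead of leaving the lock held */
		LWLockRelease(SomeLock);
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("cannot proceed")));
	}

	/* ... normal path ... */
	LWLockRelease(SomeLock);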
Regards,
-- 
Michael
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b85a666..e8d6754 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -350,6 +350,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 		gxact = TwoPhaseState->prepXacts[i];
 		if (strcmp(gxact->gid, gid) == 0)
 		{
+			LWLockRelease(TwoPhaseStateLock);
 			ereport(ERROR,
 	(errcode(ERRCODE_DUPLICATE_OBJECT),
 	 errmsg("transaction identifier \"%s\" is already in use",
@@ -359,11 +360,14 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 
 	/* Get a free gxact from the freelist */
 	if (TwoPhaseState->freeGXacts == NULL)
+	{
+		LWLockRelease(TwoPhaseStateLock);
 		ereport(ERROR,
 (errcode(ERRCODE_OUT_OF_MEMORY),
  errmsg("maximum number of prepared transactions reached"),
  errhint("Increase max_prepared_transactions (currently %d).",
 		 max_prepared_xacts)));
+	}
 	gxact = TwoPhaseState->freeGXacts;
 	TwoPhaseState->freeGXacts = gxact->next;
 
@@ -497,16 +501,22 @@ LockGXact(const char *gid, Oid user)
 
 		/* Found it, but has someone else got it locked? */
 		if (gxact->locking_backend != InvalidBackendId)
+		{
+			LWLockRelease(TwoPhaseStateLock);
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 	 errmsg("prepared transaction with identifier \"%s\" is busy",
 			gid)));
+		}
 
 		if (user != gxact->owner && !superuser_arg(user))
+		{
+			LWLockRelease(TwoPhaseStateLock);
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
   errmsg("permission denied to finish prepared transaction"),
 	 errhint("Must be superuser or the user that prepared the transaction.")));
+		}
 
 		/*
 		 * Note: it probably would be possible to allow committing from
@@ -515,10 +525,13 @@ LockGXact(const char *gid, Oid user)
 		 * someone gets motivated to make it work.
 		 */
 		if (MyDatabaseId != proc->databaseId)
+		{
+			LWLockRelease(TwoPhaseStateLock);
 			ereport(ERROR,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
   errmsg("prepared transaction belongs to another database"),
 	 errhint("Connect to the database where the transaction was prepared to finish it.")));
+		}
 
 		/* OK for me to lock it */
 		gxact->locking_backend = MyBackendId;
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 2826b7e..1224e4d 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -839,9 +839,12 @@ PreCommit_Notify(void)
 			LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
 			asyncQueueFillWarning();
 			if (asyncQueueIsFull())
+			{
+				LWLockRelease(AsyncQueueLock);
 				ereport(ERROR,
 						(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 					  errmsg("too many notifications in the NOTIFY queue")));
+			}
 			nextNotify = asyncQueueAddEntries(nextNotify);
 			LWLockRelease(AsyncQueueLock);
 		}
diff --git 

Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Bruce Momjian
On Wed, Apr 22, 2015 at 04:36:17PM +0100, Greg Stark wrote:
 On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian br...@momjian.us wrote:
  Well, we have to assume there are many misconfigured configurations ---
  autovacuum isn't super-easy to configure, so we can't just blame the
  user if this makes things worse.  In fact, page pruning was designed
  specifically for cases where autovacuum wasn't running or couldn't keep
  up.
 
 Well autovacuum isn't currently considering HOT pruning part of its
 job at all. It's hard to call it misconfigured when there's
 literally *no* way to configure it correctly.

Good point, but doesn't vacuum remove the need for pruning as it removes
all the old rows?

 If you update less than autovacuum_vacuum_scale_factor fraction of the
 table and then never update another row autovacuum will never run.
 Ever. Every select will forevermore need to follow hot chains on that
 table. Until eventually transaction wraparound forces a vacuum on that
 table if that ever happens.

Yes, that is a very good point, and it matches my concerns.  Of course,
Simon's concern is to avoid overly-aggressive pruning where the row is
being pruned but will soon be modified, making the prune, and its WAL
volume, undesirable.  We have to consider both cases in any final
solution.

 Possibly autovacuum could be adjusted to count how many selects are
 happening on the table and decide to vacuum it when the cost of the
 selects following the dead tuples is balanced by the cost of doing a
 vacuum. But that's not something included in the design of autovacuum
 today.

Well, autovacuum is also going to clean indexes, which seems like
overkill for pruning HOT updates.

 The original design of tuple storage was aimed at optimizing the
 steady state where most tuples were not recently updated. It
 guaranteed that except for tuples that were in the process of being
 updated or were recently updated a tuple read didn't have to read the
 CLOG, didn't have to follow any chains, didn't have to do any I/O or
 other work other than to read the bits on the tuple itself. When a
 tuple is updated it's put into a state where everyone who comes along
 has to do extra work but as soon as practical the hint bits get set
 and that extra work stops.

Yes, Simon is right that doing everything as-soon-as-possible is not
optimal.  I think the trick is knowing when we should give up waiting
for something else to dirty the page and prune it.

 We had similar discussions about setting hint bits in the past. I'm
 not sure why HOT pruning is the focus now because I actually think
 hint bit setting is a larger source of I/O in innocent looking selects
 even today. And it's a major headache, people are always being
 surprised that their selects cause lots of I/O and slow down
 dramatically after a big update or data load has finished. It's
 characterized as "why is the database writing everything twice" (and
 saying it's actually writing everything three times doesn't make
 people feel better). In the new age of checksums with hint bit logging,
 I wonder if it's an even bigger issue.

What would be the downside of only doing pruning during SELECT hint bit
setting?  Hinting is delayed by long-running transactions, but so is
pruning.  I assume you can do more pruning than setting all_visible
hints because the old prunable rows are older by definition, but I am
unclear how much older they are.

FYI, while hint bit setting causes page writes, it does not cause WAL
writes unless you have wal_log_hints set or page-level checksums are
enabled.  By doing pruning at the same time as hint bit setting, you are
sharing the same page write, but are generating more WAL.  Of course, if
you are setting all-visible, then you are by definition waiting longer
to prune than before, and this might be enough to make it a win for all
use cases.  You wouldn't end up never pruning in a read-only workload,
because the hint bit writes would eventually trigger the pruning.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Code paths where LWLock should be released on failure

2015-04-22 Thread Tom Lane
Michael Paquier michael.paqu...@gmail.com writes:
 After looking at bug #13128, I have been looking at the code around
 LWLockAcquire/Release to see if there are similar issues elsewhere.
 Here are my findings:

IIRC, we automatically release all LWLocks at the start of error recovery,
so I rather doubt that any of this is necessary.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Row security violation error is misleading

2015-04-22 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
 On 21 April 2015 at 22:21, Dean Rasheed dean.a.rash...@gmail.com wrote:
  On 21 April 2015 at 20:50, Stephen Frost sfr...@snowman.net wrote:
  Thanks a lot for this.  Please take a look at the attached.
 
  I've given this a quick read-through, and it looks good to me. The
  interaction of permissive and restrictive policies from hooks matches
  my expectations, and it's a definite improvement having tests for RLS
  hooks.
 
  The only thing I spotted was that the file comment for
  test_rls_hooks.c needs updating.
 
 So re-reading this, I spotted what looks like another (pre-existing)
 bug. In process_policies() there's a loop over all the policies,
 collecting quals and with_check_quals, then a test at the end to use
 the USING quals for the WITH CHECK quals if there were no
 with_check_quals. I think we want to instead do that test inside the
 loop -- i.e., for each policy, if there is no with_check_qual *for
 that policy*, use its USING qual instead.

Agreed, the USING -> WITH CHECK copy should be happening for all
policies independently, not wholesale at the end.

I've updated my tree and am testing.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Rounding to even for numeric data type

2015-04-22 Thread Pedro Gimeno
Dean Rasheed wrote, On 2015-03-28 10:01:
 On 28 March 2015 at 05:16, Andrew Gierth and...@tao11.riddles.org.uk wrote:
 Tom == Tom Lane t...@sss.pgh.pa.us writes:

  Tom I think the concern over backwards compatibility here is probably
  Tom overblown; but if we're sufficiently worried about it, a possible
  Tom compromise is to invent a numeric_rounding_mode GUC, so that
  Tom people could get back the old behavior if they really care.

 I only see one issue with this, but it's a nasty one: do we really want
 to make all numeric operations that might do rounding stable rather than
 immutable?

 
 Yeah, making all numeric functions non-immutable seems like a really bad idea.

Would it be possible to make it an unchangeable per-cluster or
per-database setting, kinda like how encoding behaves? Wouldn't that
allow us to keep the functions immutable?



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Jim Nasby

On 4/22/15 1:24 PM, Robert Haas wrote:

I keep coming back to the idea of treating any page that is marked as
all-visible as frozen, and deferring freezing until the page is again
modified.  The big downside of this is that if the page is set as
all-visible and then immediately thereafter modified, it sucks to have
to freeze when the XIDs in the page are still present in CLOG.  But if
we could determine from the LSN that the XIDs in the page are new
enough to still be considered valid, then we could skip freezing in
those cases and only do it when the page is old.  That way, if
somebody zeroed out the LSN (why, oh why?) the worst that would happen
is that we'd do some extra freezing when the page was next modified.


Maybe freezing a page as part of making it not all-visible wouldn't be 
that horrible, even without the LSN check.


For one, we already know that every tuple is visible, so no MVCC checks 
needed. That's probably a significant savings over current freezing.


If we're marking a page as no longer all-visible, that means we're 
already dirtying it and generating WAL for it (likely including a FPI). 
We may be able to consolidate all of this into a new WAL record that's a 
lot more efficient than what we currently do for freezing. I suspect we 
wouldn't need to log each TID we're freezing, for starters. Even if we 
did though, we could at least combine all that into one WAL message that 
just contains an array of TIDs or LPs.
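
Purely as an illustration of that idea (nothing like this exists today; 
xl_heap_freeze_page is the closest current analogue), such a combined 
record might carry little more than a cutoff XID and the affected line 
pointer numbers:

/* Hypothetical WAL record: clear all-visible and freeze the whole page */
typedef struct xl_heap_unsetvisible_freeze
{
	TransactionId cutoff_xid;	/* XIDs older than this were frozen */
	uint16		ntuples;		/* number of line pointer numbers that follow */
	/* OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER] follows */
} xl_heap_unsetvisible_freeze;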


ponders... I think we could actually proactively freeze tuples during 
vacuum too, at least if we're about to mark the page as all-visible. 
Though, with Robert's HEAP_XMIN_FROZEN change we could be a lot more 
aggressive about freezing during VACUUM, certainly for pages we're 
already dirtying, especially if we can keep the WAL cost of that down.
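
A sketch of the LSN test mentioned upthread, just to make the idea 
concrete; insert_cutoff_lsn is hypothetical -- some threshold below 
which the page's XIDs are considered old enough to be worth freezing:

	XLogRecPtr	page_lsn = PageGetLSN(page);

	if (!XLogRecPtrIsInvalid(page_lsn) && page_lsn >= insert_cutoff_lsn)
	{
		/* XIDs on this page are recent enough; skip freezing for now */
	}
	else
	{
		/*
		 * Page is old, or its LSN was zeroed (in which case we err on the
		 * side of doing extra work): freeze its tuples while we are
		 * already dirtying the page to clear the all-visible bit.
		 */
	}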

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] moving from contrib to bin

2015-04-22 Thread Andres Freund
Peter, Michael,

On 2015-04-22 16:13:15 +0900, Michael Paquier wrote:
 All the patches have been committed, finishing the work on this thread.

Many thanks for that effort!

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-22 Thread Heikki Linnakangas

On 04/16/2015 06:51 AM, Alvaro Herrera wrote:

Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use?  I
thought the -w would wait until promotion has taken effect, so there's
no need to sleep additional time.


-w is not supported with pg_ctl promote. Only start, stop and restart. 
It's accepted, but it doesn't do anything. Which isn't very nice...


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] moving from contrib to bin

2015-04-22 Thread Michael Paquier
On Sun, Apr 12, 2015 at 12:36 PM, Peter Eisentraut pete...@gmx.net wrote:
 On 3/11/15 8:21 PM, Michael Paquier wrote:
 Attached is a series of patch rebased on current HEAD, there were some
 conflicts after perl-tidying the refactoring patch for MSVC. Note that
 this series still uses PGXS in the Makefiles, I am fine to update them
 if necessary once this matter is set (already did this stuff upthread
 with a previous version).

 OK, here we go.  I have committed the pg_archivecleanup move, with a
 complete makefile and your Windows help.  Let's see how it builds.

All the patches have been committed, finishing the work on this thread.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-22 Thread Petr Jelinek

On 21/04/15 22:36, Andres Freund wrote:

On 2015-04-21 16:26:08 -0400, Robert Haas wrote:

On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund and...@anarazel.de wrote:

I've now named the functions:

* pg_replication_origin_create
* pg_replication_origin_drop
* pg_replication_origin_get (map from name to id)
* pg_replication_progress_setup_origin : configure session to replicate
   from a specific origin
* pg_replication_progress_reset_origin
* pg_replication_progress_setup_tx_details : configure per transaction
   details (LSN and timestamp currently)
* pg_replication_progress_is_replaying : Is an origin configured for the session
* pg_replication_progress_advance : manually set the replication
   progress to a value. Primarily useful for copying values from other
   systems and such.
* pg_replication_progress_get : How far has replay progressed for a
   certain origin
* pg_get_replication_progress : SRF returning the replay progress for
   all origins.

Any comments?


Why are we using functions for this rather than DDL?


Unless I miss something, the only two we really could use DDL for are
pg_replication_origin_create/pg_replication_origin_drop. We could use
DDL for them if we really want, but I'm not really seeing the advantage.



I think the only value of having DDL for this would be consistency 
(catalog objects are created via DDL), as it looks like something that 
will be called only by extensions, not by users, during normal operation.



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-22 Thread Heikki Linnakangas

On 04/20/2015 05:21 AM, Michael Paquier wrote:

I have just made a run of the TAP tests of pg_rewind on my raspberry
PI 1 (hamster), where the tests are very slow, and I noticed that it
takes up to 10s to get confirmation from standby that it has caught up
with the changes from master, and 5s to get confirmation that standby
has been promoted. So the tests in HEAD would be broken without this
patch, and 30s gives visibly enough room.


Ok, applied, with some refactoring: I put the sleep-retry loop into a 
separate function.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-22 Thread Michael Paquier
On Wed, Apr 22, 2015 at 8:40 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 On 04/20/2015 05:21 AM, Michael Paquier wrote:

 I have just made a run of the TAP tests of pg_rewind on my raspberry
 PI 1 (hamster), where the tests are very slow, and I noticed that it
 takes up to 10s to get confirmation from standby that it has caught up
 with the changes from master, and 5s to get confirmation that standby
 has been promoted. So the tests in HEAD would be broken without this
 patch, and 30s gives visibly enough room.


 Ok, applied, with some refactoring: I put the sleep-retry loop into a
 separate function.

Thanks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers