Re: [HACKERS] preprocess_targetlist and inheritance

2015-04-22 Thread Etsuro Fujita

On 2015/04/22 0:43, Stephen Frost wrote:

On Tuesday, April 21, 2015, Alvaro Herrera  wrote:



Stephen Frost wrote:
 > Tom, all,
 >
 >   Looks like preprocess_targetlist() should have been adjusted
with the
 >   changes to ExecBuildAuxRowMark() to support foreign tables
being part
 >   of inheritance trees (cb1ca4d800621dcae67ca6c799006de99fa4f0a5) to
 >   also include the tableoid regardless of the rowMark type, if the
 >   relation is the parent of an inheritance tree.

Uh, this patch was already submitted separately:
https://www.postgresql.org/message-id/552cf0b6.8010...@lab.ntt.co.jp



Oh, excellent. It looks like the discussion agrees that it makes sense to
do, and the RLS case didn't involve any foreign tables yet still ran into
the issue.


IIUC, the issue would arise when performing SELECT FOR UPDATE/SHARE on an 
inheritance set whose parent and children are all foreign tables, or when 
performing UPDATE/DELETE involving such an inheritance set as a source 
table.  I think your new RLS-with-inheritance regression tests probably 
include such a query.
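For illustration, queries of roughly this shape would hit it (the table 
names are made up; parent_ft and all of its children are assumed to be 
foreign tables):

    -- row locking on an all-foreign-table inheritance set
    SELECT * FROM parent_ft FOR UPDATE;

    -- the same inheritance set used only as a source relation
    UPDATE local_tbl SET val = s.val
      FROM parent_ft s
     WHERE local_tbl.id = s.id;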


Best regards,
Etsuro Fujita




Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Michael Paquier
On Wed, Apr 22, 2015 at 3:38 PM, Heikki Linnakangas  wrote:
> On 04/22/2015 03:30 AM, Michael Paquier wrote:
>>
>> This is going to change a behavior that people have been used to for a
>> couple of releases. I would not mind having this patch make
>> "archive_mode = on during recovery" mean: archive only segments generated
>> by this node, plus the last partial segment on the old timeline at
>> promotion, without renaming, to preserve backward-compatible behavior.
>> If master and standby point to separate archive locations, the operator
>> can then sort out later which one he wants to use. If they
>> point to the same location, archive_command scripts already do
>> such renaming internally; at least that's what I suspect an
>> experienced user would do. Now, it is true that not many people are
>> experienced in this area; I see mistakes regarding such things on a
>> weekly basis... This .partial segment renaming is something that we
>> should let the archive_command manage with its internal logic.
>
>
> Currently, the archive command doesn't know if the segment it's archiving is
> partial or not, so you can't put any logic there to manage it. But if we
> archive it with the .partial suffix, then you can put logic in the
> restore_command to check for .partial files, if you really want to.

Well, you could also check whether a file with the same name has
already been archived and, if so, append a suffix to the newly copied
file, keep both files, and then let restore_command sort things out as
it wants with the two segment files it finds.

> I feel that the best approach is to archive the last, partial segment, but
> with the .partial suffix. I don't see any plausible real-world setup where
> the current behavior would be better. I don't really see much need to
> archive the partial segment at all, but there's also no harm in doing it, as
> long as it's clearly marked with the .partial suffix.

Well, as long as the segment is clearly archived at promotion, even with a
suffix, I guess I am fine... This will need some tweaking of
restore_command for existing setups, but as long as it is
clearly documented I am fine with it. Shouldn't this be a separate patch,
though?

> BTW, pg_receivexlog also uses a ".partial" file while it's streaming WAL
> from the server. The .partial suffix is removed when the segment is
> complete. So there's some precedent for this. pg_receivexlog just adds
> ".partial" to the filename; it doesn't add any information about what portion
> of the file is valid, as I suggested here, though. Perhaps we should follow
> pg_receivexlog's example at promotion too, for consistency.

Consistency here sounds good to me.
-- 
Michael




Re: [HACKERS] Update docs in fdwhandler.sgml

2015-04-22 Thread Etsuro Fujita

On 2015/04/22 6:50, Robert Haas wrote:

On Tue, Apr 21, 2015 at 5:45 AM, Etsuro Fujita
 wrote:

Since we now allow CHECK constraints to be placed on foreign tables, not
only NOT NULL, I think it'd be better to update docs on considerations
about constraints on foreign tables in fdwhandler.sgml, so as to provide
more general considerations.  Please find attached a patch.


Looks good to me, so committed.


Thanks!

Best regards,
Etsuro Fujita




Re: [HACKERS] moving from contrib to bin

2015-04-22 Thread Michael Paquier
On Sun, Apr 12, 2015 at 12:36 PM, Peter Eisentraut  wrote:
> On 3/11/15 8:21 PM, Michael Paquier wrote:
>> Attached is a series of patch rebased on current HEAD, there were some
>> conflicts after perl-tidying the refactoring patch for MSVC. Note that
>> this series still uses PGXS in the Makefiles, I am fine to update them
>> if necessary once this matter is set (already did this stuff upthread
>> with a previous version).
>
> OK, here we go.  I have committed the pg_archivecleanup move, with a
> complete makefile and your Windows help.  Let's see how it builds.

All the patches have been committed, finishing the work on this thread.
-- 
Michael




Re: [HACKERS] moving from contrib to bin

2015-04-22 Thread Andres Freund
Peter, Michael,

On 2015-04-22 16:13:15 +0900, Michael Paquier wrote:
> All the patches have been committed, finishing the work on this thread.

Many thanks for that effort!

Andres




Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-22 Thread Heikki Linnakangas

On 04/16/2015 06:51 AM, Alvaro Herrera wrote:

Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use?  I
thought the -w would wait until promotion has taken effect, so there's
no need to sleep additional time.


-w is not supported with pg_ctl promote. Only start, stop and restart. 
It's accepted, but it doesn't do anything. Which isn't very nice...


- Heikki





Re: [HACKERS] Replication identifiers, take 4

2015-04-22 Thread Petr Jelinek

On 21/04/15 22:36, Andres Freund wrote:

On 2015-04-21 16:26:08 -0400, Robert Haas wrote:

On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund  wrote:

I've now named the functions:

* pg_replication_origin_create
* pg_replication_origin_drop
* pg_replication_origin_get (map from name to id)
* pg_replication_progress_setup_origin : configure session to replicate
   from a specific origin
* pg_replication_progress_reset_origin
* pg_replication_progress_setup_tx_details : configure per transaction
   details (LSN and timestamp currently)
* pg_replication_progress_is_replaying : Is an origin configured for the session?
* pg_replication_progress_advance : "manually" set the replication
   progress to a value. Primarily useful for copying values from other
   systems and such.
* pg_replication_progress_get : How far replay has progressed for a
   certain origin
* pg_get_replication_progress : SRF returning the replay progress for
   all origins.

Any comments?


Why are we using functions for this rather than DDL?


Unless I'm missing something, the only two we could really use DDL for are
pg_replication_origin_create/pg_replication_origin_drop. We could use
DDL for them if we really want, but I'm not really seeing the advantage.



I think the only value of having DDL for this would be consistency 
(catalog objects are created via DDL), since it looks like something that 
will be called only by extensions, and not by users, during normal operation.
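For illustration, a minimal sketch of the function-based workflow an 
extension would follow, using the names listed above (the argument forms 
shown are assumptions, not a final API):

    -- register an origin and bind the current session to it
    SELECT pg_replication_origin_create('node_a');
    SELECT pg_replication_progress_setup_origin('node_a');
    -- ... apply replicated changes ...
    SELECT pg_replication_progress_get('node_a');  -- how far replay got

A hypothetical DDL spelling would only replace the create/drop calls, 
which is why the consistency gain seems fairly small.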



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-22 Thread Heikki Linnakangas

On 04/20/2015 05:21 AM, Michael Paquier wrote:

I have just made a run of the TAP tests of pg_rewind on my raspberry
PI 1 (hamster), where the tests are very slow, and I noticed that it
takes up to 10s to get confirmation from standby that it has caught up
with the changes from master, and 5s to get confirmation that standby
has been promoted. So the tests in HEAD would be broken without this
patch, and 30s gives visibly enough room.


Ok, applied, with some refactoring: I put the sleep-retry loop into a 
separate function.


- Heikki





Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-22 Thread Michael Paquier
On Wed, Apr 22, 2015 at 8:40 PM, Heikki Linnakangas  wrote:
> On 04/20/2015 05:21 AM, Michael Paquier wrote:
>>
>> I have just made a run of the TAP tests of pg_rewind on my raspberry
>> PI 1 (hamster), where the tests are very slow, and I noticed that it
>> takes up to 10s to get confirmation from standby that it has caught up
>> with the changes from master, and 5s to get confirmation that standby
>> has been promoted. So the tests in HEAD would be broken without this
>> patch, and 30s gives visibly enough room.
>
>
> Ok, applied, with some refactoring: I put the sleep-retry loop into a
> separate function.

Thanks.
-- 
Michael




Re: [HACKERS] Rounding to even for numeric data type

2015-04-22 Thread Michael Paquier
On Wed, Apr 22, 2015 at 9:30 PM, Pedro Gimeno
 wrote:
> Dean Rasheed wrote, On 2015-03-28 10:01:
>> On 28 March 2015 at 05:16, Andrew Gierth  wrote:
 "Tom" == Tom Lane  writes:
>>>
>>>  Tom> I think the concern over backwards compatibility here is probably
>>>  Tom> overblown; but if we're sufficiently worried about it, a possible
>>>  Tom> compromise is to invent a numeric_rounding_mode GUC, so that
>>>  Tom> people could get back the old behavior if they really care.
>>>
>>> I only see one issue with this, but it's a nasty one: do we really want
>>> to make all numeric operations that might do rounding stable rather than
>>> immutable?
>>>
>>
>> Yeah, making all numeric functions non-immutable seems like a really bad 
>> idea.
>
> Would it be possible to make it an unchangeable per-cluster or
> per-database setting, kinda like how encoding behaves? Wouldn't that
> allow to keep the functions immutable?

Rounding is not something that can be enforced at the database or
server level; it is a property of the data type. See for example the
differences already present between double precision and numeric, as
mentioned upthread. In short, you could keep rounding functions
immutable by having a separate data type with a different rounding
method. At least that's an idea.
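For reference, the difference that already exists is easy to see from 
psql; on most platforms double precision rounds halves to even via the C 
library, while numeric rounds halves away from zero:

    SELECT round(0.5::numeric) AS numeric_0_5,   -- 1
           round(2.5::numeric) AS numeric_2_5,   -- 3
           round(0.5::float8)  AS float8_0_5,    -- 0 on most platforms
           round(2.5::float8)  AS float8_2_5;    -- 2 on most platforms

A separate type (or numeric variant) with round-to-even semantics would 
keep each function's result fully determined by its input type, and hence 
immutable.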
-- 
Michael




Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-04-22 Thread Abhijit Menon-Sen
At 2015-04-18 12:28:36 +0530, amit.kapil...@gmail.com wrote:
>
> I think you have missed addressing the point below:
> 
> >> 4) prefetch

Updated patch attached, as well as the diff against the earlier version
just to make it easier to see the exact change I made (which is to copy
the skip logic from lazy_scan_heap, as you suggested).

I'm not entirely convinced that this change is worthwhile, but it's easy
to make and difficult to argue against, so here it is. (I did test, and
it seems to work OK after the change.)

Amit, Tomas, thanks again for your review comments.

-- Abhijit
>From 3edb5426292d6097cb66339b865e99bf4f766646 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen 
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatbloat to pgstattuple

---
 contrib/pgstattuple/Makefile   |   4 +-
 contrib/pgstattuple/pgstatbloat.c  | 387 +
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql  |  18 +
 .../{pgstattuple--1.2.sql => pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control|   2 +-
 doc/src/sgml/pgstattuple.sgml  | 135 +++
 6 files changed, 560 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatbloat.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql => pgstattuple--1.3.sql} (73%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..d7d27a5 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = "pgstattuple - tuple-level statistics"
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
new file mode 100644
index 000..b19459a
--- /dev/null
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -0,0 +1,387 @@
+/*
+ * contrib/pgstattuple/pgstatbloat.c
+ *
+ * Abhijit Menon-Sen 
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pgstattuple)
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
+ * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/multixact.h"
+#include "access/htup_details.h"
+#include "catalog/namespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/freespace.h"
+#include "storage/procarray.h"
+#include "storage/lmgr.h"
+#include "utils/builtins.h"
+#include "utils/tqual.h"
+#include "commands/vacuum.h"
+
+PG_FUNCTION_INFO_V1(pgstatbloat);
+
+/*
+ * tuple_percent, dead_tuple_percent and free_percent are computable,
+ * so not defined here.
+ */
+typedef struct pgstatbloat_output_type
+{
+	uint64		table_len;
+	uint64		tuple_count;
+	uint64		misc_count;
+	uint64		tuple_len;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	uint64		free_space;
+	uint64		total_pages;
+	uint64		scanned_pages;
+} pgstatbloat_output_type;
+
+static Datum build_output_type(pgstatbloat_output_type *stat,
+			   FunctionCallInfo fcinfo);
+static Datum pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo);
+
+/*
+ * build a pgstatbloat_output_type tuple
+ */
+static Datum
+build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+#define NCOLUMNS	10
+#define NCHARS		32
+
+	HeapTuple	tuple;
+	char	   *values[NCOLUMNS];
+	char		values_buf[NCOLUMNS][NCHARS];
+	int			i;
+	double		tuple_percent;
+	double		dead_tuple_percent;
+	double		free_percent;	/* free/reusable space in % */
+	double		scanned_percent;
+	TupleDesc	tupdesc;
+	AttInMetadata *attinmeta;
+
+	/* Build a 

Re: [HACKERS] Idea: closing the loop for "pg_ctl reload"

2015-04-22 Thread Payal Singh
Ah sorry, didn't realize I top posted. I'll test this new one.

Payal.
On Apr 21, 2015 10:23 PM, "Jan de Visser"  wrote:

> On April 21, 2015 09:34:51 PM Jan de Visser wrote:
> > On April 21, 2015 09:01:14 PM Jan de Visser wrote:
> > > On April 21, 2015 07:32:05 PM Payal Singh wrote:
> ... snip ...
> >
> > Urgh. It appears you are right. Will fix.
> >
> > jan
>
> Attached is a new attempt. This was one from the category "I have no idea how
> that ever worked", but whatever. For reference, this is how it looks for me
> (magic man-behind-the-curtain postgresql.conf editing omitted):
>
> jan@wolverine:~/Projects/postgresql$ initdb -D data
> ... Bla bla bla ...
> jan@wolverine:~/Projects/postgresql$ pg_ctl -D data -l logfile start
> server starting
> jan@wolverine:~/Projects/postgresql$ tail -5 logfile
> LOG:  database system was shut down at 2015-04-21 22:03:33 EDT
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> jan@wolverine:~/Projects/postgresql$ pg_ctl -D data reload
> server signaled
> jan@wolverine:~/Projects/postgresql$ tail -5 logfile
> LOG:  database system was shut down at 2015-04-21 22:03:33 EDT
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> LOG:  received SIGHUP, reloading configuration files
> jan@wolverine:~/Projects/postgresql$ pg_ctl -D data reload
> server signaled
> pg_ctl: Reload of server with PID 14656 FAILED
> Consult the server log for details.
> jan@wolverine:~/Projects/postgresql$ tail -5 logfile
> LOG:  autovacuum launcher started
> LOG:  received SIGHUP, reloading configuration files
> LOG:  received SIGHUP, reloading configuration files
> LOG:  syntax error in file
> "/home/jan/Projects/postgresql/data/postgresql.conf"
> line 1, near end of line
> LOG:  configuration file
> "/home/jan/Projects/postgresql/data/postgresql.conf"
> contains errors; no changes were applied
> jan@wolverine:~/Projects/postgresql$


Re: [HACKERS] Row security violation error is misleading

2015-04-22 Thread Dean Rasheed
On 21 April 2015 at 22:21, Dean Rasheed  wrote:
> On 21 April 2015 at 20:50, Stephen Frost  wrote:
>> Thanks a lot for this.  Please take a look at the attached.
>
> I've given this a quick read-through, and it looks good to me. The
> interaction of permissive and restrictive policies from hooks matches
> my expections, and it's a definite improvement having tests for RLS
> hooks.
>
> The only thing I spotted was that the file comment for
> test_rls_hooks.c needs updating.
>

So re-reading this, I spotted what looks like another (pre-existing)
bug. In process_policies() there's a loop over all the policies,
collecting quals and with_check_quals, then a test at the end to use
the USING quals for the WITH CHECK quals if there were no
with_check_quals. I think we want to instead do that test inside the
loop -- i.e., for each policy, if there is no with_check_qual *for
that policy*, use its USING qual instead.
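For illustration, the documented per-policy behavior (the policy and table 
names below are made up): a policy that omits WITH CHECK is supposed to 
apply its USING expression to newly written rows as well, independently of 
what any other policy specifies:

    CREATE POLICY p_owner ON documents
        USING (owner = current_user);   -- no WITH CHECK: USING should be reused
    CREATE POLICY p_published ON documents
        WITH CHECK (NOT is_draft);      -- supplies its own WITH CHECK qual

If I read that description correctly, the presence of p_published's
with_check_qual currently means p_owner never gets its USING qual copied,
which is the bug.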

Regards,
Dean




Re: [HACKERS] Row security violation error is misleading

2015-04-22 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
> On 21 April 2015 at 22:21, Dean Rasheed  wrote:
> > On 21 April 2015 at 20:50, Stephen Frost  wrote:
> >> Thanks a lot for this.  Please take a look at the attached.
> >
> > I've given this a quick read-through, and it looks good to me. The
> > interaction of permissive and restrictive policies from hooks matches
> > my expections, and it's a definite improvement having tests for RLS
> > hooks.
> >
> > The only thing I spotted was that the file comment for
> > test_rls_hooks.c needs updating.
> 
> So re-reading this, I spotted what looks like another (pre-existing)
> bug. In process_policies() there's a loop over all the policies,
> collecting quals and with_check_quals, then a test at the end to use
> the USING quals for the WITH CHECK quals if there were no
> with_check_quals. I think we want to instead do that test inside the
> loop -- i.e., for each policy, if there is no with_check_qual *for
> that policy*, use its USING qual instead.

Agreed, the USING -> WITH CHECK copy should be happening for all
policies independently, not wholesale at the end.

I've updated my tree and am testing.

Thanks!

Stephen




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Robert Haas
On Tue, Apr 21, 2015 at 7:24 PM, Jim Nasby  wrote:
> On 4/21/15 3:21 PM, Robert Haas wrote:
>> It's possible that we could use this infrastructure to freeze more
>> aggressively in other circumstances.  For example, perhaps VACUUM
>> should freeze any page it intends to mark all-visible.  That's not a
>> guaranteed win, because it might increase WAL volume: setting a page
>> all-visible does not emit an FPI for that page, but freezing any tuple
>> on it would, if the page hasn't otherwise been modified since the last
>> checkpoint.  Even if that were no issue, the freezing itself must be
>> WAL-logged.  But if we could somehow get to a place where all-visible
>> => frozen, then autovacuum would never need to visit all-visible
>> pages, a huge win.
>
> I don't know how bad the extra WAL traffic would be; we'd obviously need to
> incur it eventually, so it's a question of how common it is for a page to go
> all-visible but then go not-all-visible again before freezing. It would
> presumably be far more traffic than some form of a FrozenMap though...

Yeah, maybe.  The freeze record contains details for each TID, while
the freeze map bit would only need to be set once for the whole page.
I wonder if the format of that record could be optimized somehow.

>> We could also attack the problem from the other end.  Instead of
>> trying to set the bits on the individual tuples, we could decide that
>> whenever a page is marked all-visible, we regard it as frozen
>> regardless of the bits set or not set on the individual tuples.
>> Anybody who wants to modify the page must freeze any unfrozen tuples
>> "for real" before clearing the visibility map bit.  This would have
>> the same end result as the previous idea: all-visible would
>> essentially imply frozen, and autovacuum could ignore those pages
>> categorically.
>
> Pushing what's currently background work onto foreground processes doesn't
> seem like a good idea...

When you phrase it that way, no, but pushing work that otherwise would
need to be done right now off to a future time that may never arrive
sounds like a good idea.  Today, we freeze the page -- rewriting it --
and then keep scanning those all-frozen pages every X number of
transactions to make sure they are really all-frozen.  In this system,
we'd eliminate the repeated scanning and defer the freeze work until
the page actually gets modified again.  But that might never happen,
in which case we never have to do the work at all.

>> I'm not saying those ideas don't have problems, because they do.  But
>> I think they are worth further exploring.  The main reason I gave up
>> on that is because Heikki was working on the XID-to-LSN mapping stuff.
>> That seemed like a better approach than either of the above, so as
>> long as Heikki was working on that, there wasn't much reason to pursue
>> more lowbrow approaches.  Clearly, though, we need to do something
>> about this.  Freezing is a big problem for lots of users.
>
> Did XID-LSN die? I see at the bottom of the thread it was returned with
> feedback; I guess Heikki just hasn't had time and there are no major blockers?
> From what I remember this is probably a better solution, but if it's not
> going to make it into 9.6 then we should probably at least look further into
> a FM.

Heikki said he'd lost enthusiasm for it, but he wasn't too specific
about his reasons, IIRC.  I guess maybe just that it got complicated,
and he wasn't sure it was correct.

>> All that having been said, I don't think adding a new fork is a good
>> approach.  We already have problems pretty commonly where our
>> customers complain about running out of inodes.  Adding another fork
>> for every table would exacerbate that problem considerably.
>
> Andres' idea of adding this to the VM may work well to handle that. It would
> double the size of the VM, but it would still be a ratio of about 32,000:1
> compared to heap size, or 2MB for a 64GB table.

Yes, that's got some potential.  It would mean pg_upgrade would have
to remove all existing visibility maps when upgrading to the new
version, or rewrite them into the new format.  But it otherwise seems
promising.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-22 Thread Robert Haas
On Wed, Mar 25, 2015 at 9:51 PM, Kouhei Kaigai  wrote:
> The attached patch adds a GetForeignJoinPaths call in make_join_rel(), only
> when 'joinrel' is actually built and both child relations are managed by the
> same FDW driver, prior to any other built-in join paths.
> I adjusted the hook definition a little bit, because the jointype can be
> reproduced using SpecialJoinInfo. Right?
>
> It will probably resolve the original concern about multiple calls of the
> FDW handler when it tries to replace an entire join subtree with a foreign
> scan on the result of a remote join query.
>
> What is your opinion?

A few random cosmetic problems:

- The hunk in allpaths.c is useless.
- The first hunk in fdwapi.h contains an extra space before the
closing parenthesis.

And then:

+   else if (scan->scanrelid == 0 &&
+            (IsA(scan, ForeignScan) || IsA(scan, CustomScan)))
+       varno = INDEX_VAR;

Suppose scan->scanrelid == 0 but the scan type is something else?  Is
that legal?  Is varno == 0 the correct outcome in that case?

More later.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Kevin Grittner
Robert Haas  wrote:

> It's possible that we could use this infrastructure to freeze
> more aggressively in other circumstances.  For example, perhaps
> VACUUM should freeze any page it intends to mark all-visible.
> That's not a guaranteed win, because it might increase WAL
> volume: setting a page all-visible does not emit an FPI for that
> page, but freezing any tuple on it would, if the page hasn't
> otherwise been modified since the last checkpoint.  Even if that
> were no issue, the freezing itself must be WAL-logged.  But if we
> could somehow get to a place where all-visible => frozen, then
> autovacuum would never need to visit all-visible pages, a huge
> win.

That would eliminate full-table scan vacuums, right?  It would do
that by adding incremental effort and WAL to the "normal"
autovacuum run to eliminate the full table scan and the associated
mass freeze WAL-logging?  It's hard to see how that would not be an
overall win.

> We could also attack the problem from the other end.  Instead of
> trying to set the bits on the individual tuples, we could decide
> that whenever a page is marked all-visible, we regard it as
> frozen regardless of the bits set or not set on the individual
> tuples.  Anybody who wants to modify the page must freeze any
> unfrozen tuples "for real" before clearing the visibility map
> bit.  This would have the same end result as the previous idea:
> all-visible would essentially imply frozen, and autovacuum could
> ignore those pages categorically.

Besides putting work into the foreground that could be done in the
background, that sounds more complicated.  Also, there is no
ability to "pace" the freeze load or use scheduled jobs to shift
the work to off-peak hours.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Allow SQL/plpgsql functions to accept record

2015-04-22 Thread Jim Nasby

On 4/20/15 2:04 PM, David G. Johnston wrote:


SELECT (src.v).* FROM ( VALUES (ROW(1,2,3)) ) src (v);
ERROR: record type has not been registered

While it may not be necessary to solve both problems I suspect they have
the same underlying root cause - specifically the separation of concerns
between the planner and the executor.


I don't think they're related at all. C functions have the ability to 
accept a record, so the executor must be able to support it. It's just 
that SQL and plpgsql functions don't have that support. I suspect that's 
just because no one has gotten around to it.
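For illustration (a rough sketch; the error text is approximate and 
keys_of is a made-up name), a C-language function such as row_to_json() 
already takes an anonymous record, while the equivalent plpgsql 
declaration is rejected outright:

    -- works: row_to_json() is a C function declared to take type record
    SELECT row_to_json(t) FROM (SELECT 1 AS a, 2 AS b) t;

    -- rejected at creation time for plpgsql (and similarly for SQL functions)
    CREATE FUNCTION keys_of(r record) RETURNS text[]
        LANGUAGE plpgsql AS $$ BEGIN RETURN NULL; END $$;
    -- ERROR:  PL/pgSQL functions cannot accept type record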

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] inherit support for foreign tables

2015-04-22 Thread Stephen Frost
Etsuro,

* Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote:
> postgres=# select * from ft1 for update;
> ERROR:  could not find junk tableoid1 column
> 
> I think this is a bug.  Attached is a patch fixing this issue.

Pushed, thanks!

Stephen




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Greg Stark
On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian  wrote:
>
>> But if the entire table is very hot, I think that that is just another way
>> of saying that autovacuum is horribly misconfigured.  I think the purpose of
>
> Well, we have to assume there are many misconfigured installations ---
> autovacuum isn't super-easy to configure, so we can't just blame the
> user if this makes things worse.  In fact, page pruning was designed
> specifically for cases where autovacuum wasn't running or couldn't keep
> up.

Well, autovacuum doesn't currently consider HOT pruning part of its
job at all. It's hard to call it "misconfigured" when there's
literally *no* way to configure it "correctly".

If you update less than an autovacuum_vacuum_scale_factor fraction of the
table and then never update another row, autovacuum will never run.
Ever. Every SELECT will forevermore need to follow HOT chains on that
table, until transaction wraparound eventually forces a vacuum on that
table, if that ever happens.
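(For concreteness: the documented trigger condition is n_dead_tup >
autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples,
so with the defaults of 50 and 0.2 a one-off update of 5% of a 10M-row
table leaves ~500k dead tuples, far below the ~2M trigger point. A rough
query to see how far a table sits below that line, assuming default
settings:)

    SELECT s.relname,
           s.n_dead_tup,
           50 + 0.2 * c.reltuples AS autovacuum_trigger
      FROM pg_stat_user_tables s
      JOIN pg_class c ON c.oid = s.relid
     ORDER BY s.n_dead_tup DESC;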

Possibly autovacuum could be adjusted to count how many selects are
happening on the table and decide to vacuum it when the cost of the
selects following the dead tuples is balanced by the cost of doing a
vacuum. But that's not something included in the design of autovacuum
today.

The original design of tuple storage was aimed at optimizing the
steady state where most tuples were not recently updated. It
guaranteed that except for tuples that were in the process of being
updated or were recently updated a tuple read didn't have to read the
CLOG, didn't have to follow any chains, didn't have to do any I/O or
other work other than to read the bits on the tuple itself. When a
tuple is updated it's put into a state where everyone who comes along
has to do extra work but as soon as practical the hint bits get set
and that extra work stops.

We had similar discussions about setting hint bits in the past. I'm
not sure why HOT pruning is the focus now, because I actually think
hint bit setting is a larger source of I/O in innocent-looking SELECTs
even today. And it's a major headache: people are always surprised
that their SELECTs cause lots of I/O and slow down dramatically after
a big update or data load has finished. It's characterized as "why is
the database writing everything twice" (and saying it's actually
writing everything three times doesn't make people feel better). In
the new age of checksums with hint-bit logging, I wonder if it's an
even bigger issue.

It occurs to me that generating these dirty pages isn't really that
expensive individually. It's only that there's a sudden influx of a
large number of dirty pages that causes them to get translated
immediately into filesystem I/O. Perhaps we should dirty pages on hint
bit updates and do HOT pruning only to the extent it can be done
without causing I/O. Of course it's hard to tell that in advance, but
maybe something like "if the current buffer had to be fetched and
caused a dirty buffer to be evicted then skip hot pruning and don't
dirty it for any hint bit updates" would at least mean that once the
select fills up its share of buffers with dirty buffers it stops
dirtying more. It would dirty pages only as fast as bgwriter or
checkpoints manage to write them out.

That sounds a bit weird but I think the right solution should have
that combination of properties. It should guarantee that hint bits get
set and HOT chains pruned within some length of time, but that no single
SELECT causes a storm of dirty buffers that then need to be flushed to
disk.


-- 
greg




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Jim Nasby

On 4/21/15 4:07 PM, Peter Eisentraut wrote:

On 4/21/15 4:45 PM, Jim Nasby wrote:
In order for a background worker to keep up with some of the workloads
that have been presented as counterexamples, you'd need multiple
background workers operating in parallel and preferring to work on
certain parts of a table.  That would require a lot more sophisticated
job management than we currently have for, say, autovacuum.


My thought was that the foreground queries would send page IDs to the 
bgworker via a shm_mq. If the queries have to do much waiting at all on IO, 
then I'd expect the bgworker to be able to keep pace with a bunch of 
them, since it's just grabbing buffers that are already in the pool (and 
only those in the pool; it wouldn't make sense for it to pull a page back 
from the kernel, let alone disk).


We'd need to code this so that if a queue fills up the query doesn't 
block; we just skip that opportunity to prune. I think that'd be fine.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Authenticating from SSL certificates

2015-04-22 Thread Stephen Frost
Keenan,

* kee...@thebrocks.net (kee...@thebrocks.net) wrote:
> I'm looking into connecting to postgres using authentication from client
> certificates. [1]

Nice!  Glad to hear of more users of that capability. :)

> The documentation states that the common name (aka CN) is read from the
> certificate and used as the user's login (aka auth_user).
> The problem is the common name is typically the user's full name. A field
> like email address would contain a more computer friendly identifier.

This is why we have the pg_ident mapping capability..  I realize that
file has to be generated, but at that point it's really just a string,
no?

That said, I'm not against this capability in general, but we'd need to
make sure it doesn't lock us into OpenSSL.  Heikki's been working on
changing the SSL code to allow other libraries to be used, which is
great, and I'm slightly worried this might make that more difficult.

The other issue is that we'd need to be very clear in the documentation
that any users of this capability have to verify with their CA that they
aren't going to end up with the same value in whichever field is used
for distinct individuals; otherwise, the CA might unknowingly issue two
certs with the same value, and you would then be unable to distinguish
between those two certs, and both certs would have access to the account.

That's already an issue in the SSL world when using "real" CAs (that is,
ones outside of your own organization) and, really, we would do better
to support including *more* fields than just the CN to address that
issue.  As such, perhaps we should support having a *list* of fields to
use and then we combine them in some way in the mapping file.  That
would allow users to, say, include the issuer and the CN, and perhaps
the serial number if they want.

Thanks!

Stephen




Re: [HACKERS] moving from contrib to bin

2015-04-22 Thread Bruce Momjian
On Wed, Apr 22, 2015 at 09:18:40AM +0200, Andres Freund wrote:
> Peter, Michael,
> 
> On 2015-04-22 16:13:15 +0900, Michael Paquier wrote:
> > All the patches have been committed, finishing the work on this thread.
> 
> Many thanks for that effort!

And pg_upgrade thanks you.  ;-)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Row security violation error is misleading

2015-04-22 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
> On 21 April 2015 at 22:21, Dean Rasheed  wrote:
> > On 21 April 2015 at 20:50, Stephen Frost  wrote:
> >> Thanks a lot for this.  Please take a look at the attached.
> >
> > I've given this a quick read-through, and it looks good to me. The
> > interaction of permissive and restrictive policies from hooks matches
> > my expections, and it's a definite improvement having tests for RLS
> > hooks.
> >
> > The only thing I spotted was that the file comment for
> > test_rls_hooks.c needs updating.
> 
> So re-reading this, I spotted what looks like another (pre-existing)
> bug. In process_policies() there's a loop over all the policies,
> collecting quals and with_check_quals, then a test at the end to use
> the USING quals for the WITH CHECK quals if there were no
> with_check_quals. I think we want to instead do that test inside the
> loop -- i.e., for each policy, if there is no with_check_qual *for
> > that policy*, use its USING qual instead.

Pushed with those changes, please take a look and test!

Thanks again for all of your help with this.  I'm going to be looking
over that second patch with an eye towards getting it in very soon; it's
been kicking around for far longer than it should have been, and that's
my fault. Apologies about that.

Stephen




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 11:09 AM, Kevin Grittner  wrote:
> Robert Haas  wrote:
>> It's possible that we could use this infrastructure to freeze
>> more aggressively in other circumstances.  For example, perhaps
>> VACUUM should freeze any page it intends to mark all-visible.
>> That's not a guaranteed win, because it might increase WAL
>> volume: setting a page all-visible does not emit an FPI for that
>> page, but freezing any tuple on it would, if the page hasn't
>> otherwise been modified since the last checkpoint.  Even if that
>> were no issue, the freezing itself must be WAL-logged.  But if we
>> could somehow get to a place where all-visible => frozen, then
>> autovacuum would never need to visit all-visible pages, a huge
>> win.
>
> That would eliminate full-table scan vacuums, right?  It would do
> that by adding incremental effort and WAL to the "normal"
> autovacuum run to eliminate the full table scan and the associated
> mass freeze WAL-logging?  It's hard to see how that would not be an
> overall win.

Yes and yes.

In terms of an overall win, this design loses when the tuples that
have been recently marked all-visible are going to get updated again
in the near future. In that case, the effort we spend to freeze them
is wasted.  I just tested "pgbench -i -s 40 -n" followed by "VACUUM"
or alternatively followed by "VACUUM FREEZE".  The VACUUM generated
4641kB of WAL.  The VACUUM FREEZE generated 515MB of WAL - that is,
113 times more.  So changing every VACUUM to act like VACUUM FREEZE
would be quite expensive.  We'll still come out ahead if those tuples
are going to stick around long enough that they would have eventually
gotten frozen anyway, but if they get deleted again the loss is pretty
significant.
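(For anyone wanting to reproduce that, a rough psql sketch on a scratch
cluster freshly loaded with "pgbench -i -s 40 -n"; the 9.x function names
are assumed:)

    SELECT pg_current_xlog_insert_location() AS start_lsn \gset
    VACUUM FREEZE;
    SELECT pg_size_pretty(
               pg_xlog_location_diff(pg_current_xlog_insert_location(),
                                     :'start_lsn')) AS wal_generated;

Running it once with plain VACUUM and once with VACUUM FREEZE on two
freshly loaded clusters gives the two numbers being compared.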

Incidentally, the reason for the large difference is that when Heikki
created the visibility map, it wasn't necessary for the WAL records
that set the visibility map bits to bump the page LSN, because it was
just a hint anyway.  When I made the visibility-map crash-safe, I went
to some pains to preserve that property.  Therefore, a regular VACUUM
does not emit full page images for the heap pages - it does for the
visibility map pages themselves, but there aren't very many of those.
In this example, the relation itself was 512MB, so you can see that
adding freezing to the mix roughly doubles the I/O cost.  Either way
we have to write half a gig of dirty data pages, but in one case we
also have to write an additional half a gig of WAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Bruce Momjian
On Tue, Apr 21, 2015 at 04:36:53PM -0400, Robert Haas wrote:
> > Keep in mind there's a disconnect between dirtying a page and writing it
> > to storage.  A page could remain dirty for a long time in the buffer
> > cache.  This writing of sequential pages would occur at checkpoint time
> > only, which seems the wrong thing to optimize.  If some other process
> > needs to evict pages to make room to read some other page in, surely
> > it's going to try one page at a time, not write "many sequential dirty
> > pages."
> 
> Well, for a big sequential scan, we use a ring buffer, so we will
> typically be evicting the pages that we ourselves read in moments
> before.  So in this case we would do a lot of sequential writes of
> dirty pages.

Ah, yes, this again supports the prune-then-skip approach, rather than
doing the first X% pruneable pages seen.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Allow SQL/plpgsql functions to accept record

2015-04-22 Thread Andrew Dunstan


On 04/22/2015 11:29 AM, Jim Nasby wrote:

On 4/20/15 2:04 PM, David G. Johnston wrote:


SELECT (src.v).* FROM ( VALUES (ROW(1,2,3)) ) src (v);
ERROR: record type has not been registered

While it may not be necessary to solve both problems I suspect they have
the same underlying root cause - specifically the separation of concerns
between the planner and the executor.


I don't think they're related at all. C functions have the ability to 
accept a record, so the executor must be able to support it. It's just 
that SQL and plpgsql functions don't have that support. I suspect 
that's just because no one has gotten around to it.



Well, that's assuming everyone else thinks it would be a good idea. 
Maybe they do, but I wouldn't assume it.


The answer in the past has been to use more dynamically typed languages 
such as Perl, which are well suited to this problem.


There are actually several problems: first, given an arbitrary record, 
plpgsql has no easy and efficient way of finding out what field names it 
has. Second, even if it has such knowledge, it has no way of using it; 
it's not like JavaScript, where you can use a text value as a field name. 
And third, it has no way of creating variables of the right type to hold 
the extracted values.


All of these could possibly be overcome, but it would not be a small 
piece of work, I suspect. Given that plperl buys you all of that 
already (just try the example below), people might think it not worth 
the extra trouble.


   create function rkeys(record) returns text[] language plperl as $$
   my $rec = shift; return [ keys %$rec ]; $$;
   select unnest(rkeys(r)) from (select * from pg_class limit 1) r;


cheers

andrew




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-22 Thread Robert Haas
On Tue, Apr 21, 2015 at 10:33 PM, Kouhei Kaigai  wrote:
> [ new patch ]

A little more nitpicking:

ExecInitForeignScan() and ExecInitCustomScan() could declare
currentRelation inside the if (scanrelid > 0) block instead of in the
outer scope.

I'm not too excited about the addition of GetFdwHandlerForRelation,
which is a one-line function used in one place.  It seems like we
don't really need that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Heikki Linnakangas

On 04/22/2015 05:33 PM, Robert Haas wrote:

On Tue, Apr 21, 2015 at 7:24 PM, Jim Nasby  wrote:

On 4/21/15 3:21 PM, Robert Haas wrote:

I'm not saying those ideas don't have problems, because they do.  But
I think they are worth further exploring.  The main reason I gave up
on that is because Heikki was working on the XID-to-LSN mapping stuff.
That seemed like a better approach than either of the above, so as
long as Heikki was working on that, there wasn't much reason to pursue
more lowbrow approaches.  Clearly, though, we need to do something
about this.  Freezing is a big problem for lots of users.


Did XID-LSN die? I see at the bottom of the thread it was returned with
feedback; I guess Heikki just hasn't had time and there's no major blockers?
 From what I remember this is probably a better solution, but if it's not
going to make it into 9.6 then we should probably at least look further into
a FM.


Heikki said he'd lost enthusiasm for it, but he wasn't too specific
about his reasons, IIRC.  I guess maybe just that it got complicated,
and he wasn't sure it was correct.


I'd like to continue working on that when I get around to it. Or even 
better if someone else continues it :-).


The thing that made me nervous about that approach is that it made the 
LSN of each page critical information. If you somehow zeroed out the 
LSN, you could no longer tell which pages are frozen and which are not. 
I'm sure it could be made to work - and I got it working to some degree 
anyway - but it's a bit scary. It's similar to the multixid changes in 
9.3: multixids also used to be data that you could just zap at restart, 
and when we changed the rules so that you lose data if you lose 
multixids, we got into trouble. Now, LSNs are much simpler, and there 
wouldn't be anything like the multioffset/member SLRUs that you'd have 
to keep around forever or vacuum, but still...


I would feel safer if we added a completely new "epoch" counter to the 
page header, instead of reusing LSNs. But as we all know, changing the 
page format is a problem for in-place upgrade, and takes some space too.


- Heikki





Re: [HACKERS] Row security violation error is misleading

2015-04-22 Thread Dean Rasheed
On 22 April 2015 at 17:02, Stephen Frost  wrote:
> Pushed with those changes, please take a look and test!
>

Excellent, thanks! Will test.

Regards,
Dean




Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-22 Thread Jan de Visser
On April 22, 2015 11:14:08 AM Heikki Linnakangas wrote:
> On 04/16/2015 06:51 AM, Alvaro Herrera wrote:
> > Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use?  I
> > thought the -w would wait until promotion has taken effect, so there's
> > no need to sleep additional time.
> 
> -w is not supported with pg_ctl promote. Only start, stop and restart.
> It's accepted, but it doesn't do anything. Which isn't very nice...

I'm futzing with the pg_ctl command-line parameters for my patch that notifies 
pg_ctl when a restart didn't work. I'll fix this as well.

> 
> - Heikki

jan





Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Peter Eisentraut
On 4/22/15 11:37 AM, Jim Nasby wrote:
> On 4/21/15 4:07 PM, Peter Eisentraut wrote:
>> On 4/21/15 4:45 PM, Jim Nasby wrote:
>> In order for a background worker to keep up with some of the workloads
>> that have been presented as counterexamples, you'd need multiple
>> background workers operating in parallel and preferring to work on
>> certain parts of a table.  That would require a lot more sophisticated
>> job management than we currently have for, say, autovacuum.
> 
> My thought was that the foreground queries would send page IDs to the
> bgworker via a shmq. If the queries have to do much waiting at all on IO
> then I'd expect the bgworker to be able to keep pace with a bunch of
> them since it's just grabbing buffers that are already in the pool (and
> only those in the pool; it wouldn't make sense for it to pull it back
> from the kernel, let alone disk).
> 
> We'd need to code this so that if a queue fills up the query doesn't
> block; we just skip that opportunity to prune. I think that'd be fine.

I think a "to-clean-up map" would work better.  But basically we need a
way to remember where to clean up later if we're not going to do it in
the foreground.





Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Bruce Momjian
On Tue, Apr 21, 2015 at 05:07:52PM -0400, Peter Eisentraut wrote:
> On 4/21/15 4:45 PM, Jim Nasby wrote:
> > This comment made me wonder... has anyone considered handing the pruning
> > work off to a bgworker, at least for SELECTs? That means the selects
> > themselves wouldn't be burdened by the actual prune work, only in
> > notifying the bgworker. While that's not going to be free, presumably
> > it's a lot cheaper...
> 
> The nice thing about having foreground queries do the light cleanup is
> that they can work in parallel and naturally hit the interesting parts
> of the table first.
> 
> In order for a background worker to keep up with some of the workloads
> that have been presented as counterexamples, you'd need multiple
> background workers operating in parallel and preferring to work on
> certain parts of a table.  That would require a lot more sophisticated
> job management than we currently have for, say, autovacuum.

Well, the visibility map tells us where _not_ to clean up, so using
another map to tell us _where_ to clean up might make sense.  However,
the density of the map might be low enough that a list makes more sense,
as you suggested.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Idea: closing the loop for "pg_ctl reload"

2015-04-22 Thread Payal Singh
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   tested, failed
Spec compliant:   not tested
Documentation:            tested, failed

An error in postgresql.conf gives the expected result on pg_ctl reload, 
although errors in the pg_hba.conf file don't. As before, reload completes 
without any indication that pg_hba.conf failed to load; the only information 
is in the log file. I'm assuming pg_ctl reload should alert the user if a 
file fails to load, irrespective of which file it is, postgresql.conf or 
pg_hba.conf. 

There is no documentation change so far, but I guess that's not yet necessary. 

gmake check passed all tests.

The new status of this patch is: Waiting on Author




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-22 Thread Robert Haas
On Fri, Mar 13, 2015 at 8:02 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Fri, Mar 13, 2015 at 2:31 PM, Tom Lane  wrote:
>>> I don't object to the concept, but I think that is a pretty bad place
>>> to put the hook call: add_paths_to_joinrel is typically called multiple
>>> (perhaps *many*) times per joinrel and thus this placement would force
>>> any user of the hook to do a lot of repetitive work.
>
>> Interesting point.  I guess the question is whether a some or all
>> callers are going to actually *want* a separate call for each
>> invocation of add_paths_to_joinrel(), or whether they'll be happy to
>> operate on the otherwise-complete path list.
>
> Hmm.  You're right, it's certainly possible that some users would like to
> operate on each possible pair of input relations, rather than considering
> the joinrel "as a whole".  Maybe we need two hooks, one like your patch
> and one like I suggested.

Let me attempt to summarize subsequent discussion on this thread by
saying the hook location that you proposed (just before set_cheapest)
has not elicited any enthusiasm from anyone else.  In a nutshell, the
problem is that a single callback for a large join problem is just
fine if there are no special joins involved, but in any other
scenario, nobody knows how to use a hook at that location for anything
useful.  To push down a join to the remote server, you've got to
figure out how to emit an SQL query for it.  To execute it with a
custom join strategy, you've got to know which of those joins should
have inner join semantics vs. left join semantics.  A hook/callback in
make_join_rel() or in add_paths_to_joinrel() makes that relatively
straightforward. Otherwise, it's not clear what to do, short of
copy-and-pasting join_search_one_level().  If you have a suggestion,
I'd like to hear it.

If not, I'm going to press forward with the idea of putting the
relevant logic in either add_paths_to_joinrel(), as previously
proposed, or perhaps up one level in make_one_rel().  Either way, if
you don't need to be called multiple times per joinrel, you can stash
a flag inside whatever you hang off of the joinrel's fdw_private and
return immediately on every call after the first.  I think that's
cheap enough that we shouldn't get too stressed about it: for FDWs, we
only call the hook at all if everything in the joinrel uses the same
FDW, so it won't get called at all except for joinrels where it's
likely to win big; for custom joins, multiple calls are quite likely
to be useful and necessary, and if the hook burns too much CPU time
for the query performance you get out of it, that's the custom-join
provider's fault, not ours.  The current patch takes this approach one
step further and attempts FDW pushdown only once per joinrel.  It does
that because, while postgres_fdw DOES need the jointype and a valid
innerrel/outerrel breakdown to figure out what query to generate, it
does NOT need every possible breakdown; rather, the first one is as good as
any other. But this might not be true for a non-PostgreSQL remote
database.  So I think it's better to call the hook every time and let
the hook return without doing anything if it wants.

I'm still not totally sure whether make_one_rel() is better than
add_paths_to_joinrel().  The current patch attempts to split the
difference by doing FDW pushdown from make_one_rel() and custom joins
from add_paths_to_joinrel().  I dunno why; if possible, those two
things should happen in the same place.  Doing it in make_one_rel()
makes for fewer arguments and fewer repetitive calls, but that's not
much good if you would have had a use for the extra arguments that
aren't computed until we get down to add_paths_to_joinrel().  I'm not
sure whether that's the case or not.  The latest version of the
postgres_fdw patch doesn't seem to mind not having extra_lateral_rels,
but I'm wondering if that's busted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Kevin Grittner
Robert Haas  wrote:

> I just tested "pgbench -i -s 40 -n" followed by "VACUUM" or
> alternatively followed by "VACUUM FREEZE".  The VACUUM generated
> 4641kB of WAL.  The VACUUM FREEZE generated 515MB of WAL - that
> is, 113 times more.

Essentially a bulk load.  OK, so if you bulk load data and then
vacuum it before updating 100% of it, this approach will generate a
lot more WAL than we currently do.  Of course, if you don't VACUUM
FREEZE after a bulk load and then are engaged in a fairly normal
OLTP workload with peak and off-peak cycles, you are currently
almost certain to hit a point during peak OLTP load where you begin
to sequentially scan all tables, rewriting them in place, with WAL
logging.  Incidentally, this tends to flush a lot of your "hot"
data out of cache, increasing disk reads.  The first time I hit
this "interesting" experience in production it was so devastating,
and generated so many user complaints, that I never again
considered a bulk load complete until I had run VACUUM FREEZE on it
-- although I was sometimes able to defer that to an off-peak
window of time.

In other words, for the production environments I managed, the only
value of that number is in demonstrating the importance of using
unlogged COPY followed by VACUUM FREEZE for bulk-loading and
capturing a fresh base backup upon completion.  A better way to use
pgbench to measure WAL size cost might be to initialize, VACUUM
FREEZE to set a "long term baseline", and do a reasonable length
run with crontab running VACUUM FREEZE periodically (including
after the run was complete) versus doing the same with plain VACUUM
(followed by a VACUUM FREEZE at the end?).  Comparing the total WAL
sizes generated following the initial load and VACUUM FREEZE would
give a more accurate picture of the impact on an OLTP load, I
think.
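
(For reference, a minimal sketch of how one might bracket a single pass to
measure its WAL volume, assuming the 9.4-era function names
pg_current_xlog_location() and pg_xlog_location_diff(); the LSN below is
only illustrative:

-- note the starting WAL position
SELECT pg_current_xlog_location();   -- e.g. 0/3000060
VACUUM FREEZE pgbench_accounts;
-- bytes of WAL generated since the noted position
SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/3000060');

The same bracketing around plain VACUUM gives the other side of the
comparison.)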

> We'll still come out ahead if those tuples are going to stick
> around long enough that they would have eventually gotten frozen
> anyway, but if they get deleted again the loss is pretty
> significant.

Perhaps my perception is biased by having worked in an environment
where the vast majority of tuples (both in terms of tuple count and
byte count) were never updated and were only eligible for deletion
after a period of years.  Our current approach is pretty bad in
such an environment, at least if you try to leave all vacuuming to
autovacuum.  I'll admit that we were able to work around the
problems by running VACUUM FREEZE every night for most databases.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 12:39 PM, Heikki Linnakangas  wrote:
> The thing that made me nervous about that approach is that it made the LSN
> of each page critical information. If you somehow zeroed out the LSN, you
> could no longer tell which pages are frozen and which are not. I'm sure it
> could be made to work - and I got it working to some degree anyway - but
> it's a bit scary. It's similar to the multixid changes in 9.3: multixids
> also used to be data that you can just zap at restart, and when we changed
> the rules so that you lose data if you lose multixids, we got trouble. Now,
> LSNs are much simpler, and there wouldn't be anything like the
> multioffset/member SLRUs that you'd have to keep around forever or vacuum,
> but still..

LSNs are already pretty critical.  If they're in the future, you can't
flush those pages.  Ever.  And if they're wrong in either direction,
crash recovery is broken.  But it's still worth thinking about ways
that we could make this more robust.

I keep coming back to the idea of treating any page that is marked as
all-visible as frozen, and deferring freezing until the page is again
modified.  The big downside of this is that if the page is set as
all-visible and then immediately thereafter modified, it sucks to have
to freeze when the XIDs in the page are still present in CLOG.  But if
we could determine from the LSN that the XIDs in the page are new
enough to still be considered valid, then we could skip freezing in
those cases and only do it when the page is "old".  That way, if
somebody zeroed out the LSN (why, oh why?) the worst that would happen
is that we'd do some extra freezing when the page was next modified.

> I would feel safer if we added a completely new "epoch" counter to the page
> header, instead of reusing LSNs. But as we all know, changing the page
> format is a problem for in-place upgrade, and takes some space too.

Yeah.  We have a serious need to reduce the size of our on-disk
format.  On a TPC-C-like workload Jan Wieck recently tested, our data
set was 34% larger than another database at the beginning of the test,
and 80% larger by the end of the test.  And we did twice the disk
writes.  See "The Elephants in the Room.pdf" at
https://sites.google.com/site/robertmhaas/presentations

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 2:23 PM, Kevin Grittner  wrote:
> Robert Haas  wrote:
>> I just tested "pgbench -i -s 40 -n" followed by "VACUUM" or
>> alternatively followed by "VACUUM FREEZE".  The VACUUM generated
>> 4641kB of WAL.  The VACUUM FREEZE generated 515MB of WAL - that
>> is, 113 times more.
>
> Essentially a bulk load.  OK, so if you bulk load data and then
> vacuum it before updating 100% of it, this approach will generate a
> lot more WAL than we currently do.  Of course, if you don't VACUUM
> FREEZE after a bulk load and then are engaged in a fairly normal
> OLTP workload with peak and off-peak cycles, you are currently
> almost certain to hit a point during peak OLTP load where you begin
> to sequentially scan all tables, rewriting them in place, with WAL
> logging.  Incidentally, this tends to flush a lot of your "hot"
> data out of cache, increasing disk reads.  The first time I hit
> this "interesting" experience in production it was so devastating,
> and generated so many user complaints, that I never again
> considered a bulk load complete until I had run VACUUM FREEZE on it
> -- although I was sometimes able to defer that to an off-peak
> window of time.
>
> In other words, for the production environments I managed, the only
> value of that number is in demonstrating the importance of using
> unlogged COPY followed by VACUUM FREEZE for bulk-loading and
> capturing a fresh base backup upon completion.  A better way to use
> pgbench to measure WAL size cost might be to initialize, VACUUM
> FREEZE to set a "long term baseline", and do a reasonable length
> run with crontab running VACUUM FREEZE periodically (including
> after the run was complete) versus doing the same with plain VACUUM
> (followed by a VACUUM FREEZE at the end?).  Comparing the total WAL
> sizes generated following the initial load and VACUUM FREEZE would
> give a more accurate picture of the impact on an OLTP load, I
> think.

Sure, that would be a better test.  But I'm pretty sure the impact
will still be fairly substantial.

>> We'll still come out ahead if those tuples are going to stick
>> around long enough that they would have eventually gotten frozen
>> anyway, but if they get deleted again the loss is pretty
>> significant.
>
> Perhaps my perception is biased by having worked in an environment
> where the vast majority of tuples (both in terms of tuple count and
> byte count) were never updated and were only eligible for deletion
> after a period of years.  Our current approach is pretty bad in
> such an environment, at least if you try to leave all vacuuming to
> autovacuum.  I'll admit that we were able to work around the
> problems by running VACUUM FREEZE every night for most databases.

Yeah.  And that breaks down when you have very big databases with a
high XID consumption rate, because the mostly-no-op VACUUM FREEZE runs
for longer than you can tolerate.  I'm not saying we don't need to fix
this problem; we clearly do.  I'm just saying that we've got to be
careful not to harm other scenarios in the process.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas  wrote:
> Note that it's a bit complicated to set up that scenario today. Archiving is
> never enabled in recovery mode, so you'll need to use a custom cron job or
> something to maintain the archive that C uses. The files will not
> automatically flow from B to the second archive. With the patch we're
> discussing, however, it would be easy: just set archive_mode='always' in B.

Hmm, I see.  But if C never replays the last, partial segment from the
old timeline, how does it follow the timeline switch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add CINE for ALTER TABLE ... ADD COLUMN

2015-04-22 Thread Payal Singh
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Seeing this when trying to apply the patch:

Patching file src/backend/commands/tablecmds.c using Plan A...
Hunk #1 FAILED at 328.
Hunk #2 succeeded at 2294 (offset 11 lines).
Hunk #3 FAILED at 3399.
Hunk #4 FAILED at 3500.
Hunk #5 succeeded at 4658 with fuzz 1 (offset 65 lines).
Hunk #6 succeeded at 4753 (offset 66 lines).
Hunk #7 succeeded at 4989 with fuzz 2 (offset 66 lines).
Hunk #8 succeeded at 5003 (offset 69 lines).
Hunk #9 succeeded at 5017 (offset 69 lines).
Hunk #10 succeeded at 5033 (offset 69 lines).
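
(For context, the patch under review adds an IF NOT EXISTS option to ADD
COLUMN. A hypothetical example of the syntax it is expected to accept,
skipped with a NOTICE rather than an error when the column already exists:

ALTER TABLE foo ADD COLUMN IF NOT EXISTS bar integer;

The table and column names here are made up for illustration.)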

The new status of this patch is: Waiting on Author


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Kevin Grittner
Greg Stark  wrote:

> And it's a major headache, people are always being surprised that
> their selects cause lots of I/O and slow down dramatically after
> a big update or data load has finished. It's characterized as
> "why is the database writing everything twice" (and saying it's
> actually writing everything three times doesn't make people feel
> better).

When I looked at the life-cycle of a heap tuple in a database I was
using, I found that (ignoring related index access and ignoring
WAL-file copying, etc., for our backups), each tuple that existed
long enough to freeze and be eventually deleted caused a lot of
writes.

(1) WAL log the insert.
(2) Write the tuple.
(3) Hint and rewrite the tuple.
(4) WAL log the freeze of the tuple.
(5) Rewrite the frozen tuple.
(6) WAL-log the delete.
(7) Rewrite the deleted tuple.
(8) Prune and rewrite the page.
(9) Free line pointers and rewrite the page.

If I was lucky some of the writes could be combined in cache
because they happened close enough together.  Also, one could hope
that not too much of the WAL-logging involved full page writes to
the WAL -- again, keeping steps close together in time helps with
that.  If all of (1) through (5) are done in quick succession, you
save two physical writes of the heap page and save one full page
write to WAL.  If steps (7) through (9) are done in quick
succession, you save two more physical writes to the heap.  This is
part of what makes the aggressive incremental freezing being
discussed on a nearby thread appealing -- at least for some
workloads.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Robert Haas
On Tue, Apr 21, 2015 at 8:30 PM, Michael Paquier
 wrote:
> This .partial segment renaming is something that we
> should let the archive_command manage with its internal logic.

This strikes me as equivalent to saying "we don't know how to make
this work right, but maybe our users will know".  That never works
out.  As things stand, we have a situation where the archive_command
examples in our documentation are known to be flawed.  They don't
fsync the file, and they'll write a partial file and then, when rerun,
fail to copy the full file because there's already something there.
Efforts have been made to fix these problems (see the pg_copy thread),
but they haven't been completed yet, nor have we even documented the
issues with the commands recommended by the documentation.  Let's
please not throw anything else on the pile of things we're expecting
users to somehow "get right".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Heikki Linnakangas

On 04/22/2015 09:30 PM, Robert Haas wrote:

On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas  wrote:

Note that it's a bit complicated to set up that scenario today. Archiving is
never enabled in recovery mode, so you'll need to use a custom cron job or
something to maintain the archive that C uses. The files will not
automatically flow from B to the second archive. With the patch we're
discussing, however, it would be easy: just set archive_mode='always' in B.


Hmm, I see.  But if C never replays the last, partial segment from the
old timeline, how does it follow the timeline switch?


At timeline switch, we copy the old segment to the new timeline, and 
start writing where we left off. So the WAL from the old timeline is 
found in the segment nominally belonging to the new timeline.


For example, imagine that we perform point-in-time recovery to WAL position 
0/1237E568, on timeline 1. That falls within segment 
000000010000000000000012. Then we end recovery, and switch to timeline 
2. After the switch, and some more WAL-logged actions, we'll have these 
files in pg_xlog:


000000010000000000000011
000000010000000000000012
000000020000000000000012
000000020000000000000013
000000020000000000000014

Note that there are two segments ending in "12". They both have the same 
point up to offset 0x37E568, corresponding to the switch point 
0/1237E568. After that, the contents diverge: the segment on the new 
timeline contains a checkpoint/end-of-recovery record at that point, 
followed by new WAL belonging to the new timeline.


Recovery knows about that, so that if you set recovery target to 
timeline 2, and it needs the WAL at the beginning of segment 12 (still 
belonging to timeline 1), it will try restoring both 
"000000010000000000000012" and "000000020000000000000012".


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow SQL/plpgsql functions to accept record

2015-04-22 Thread Merlin Moncure
On Wed, Apr 22, 2015 at 11:20 AM, Andrew Dunstan  wrote:
>
> On 04/22/2015 11:29 AM, Jim Nasby wrote:
>>
>> On 4/20/15 2:04 PM, David G. Johnston wrote:
>>>
>>>
>>> SELECT (src.v).* FROM ( VALUES (ROW(1,2,3)) ) src (v);
>>> ERROR: record type has not been registered
>>>
>>> While it may not be necessary to solve both problems I suspect they have
>>> the same underlying root cause - specifically the separation of concerns
>>> between the planner and the executor.
>>
>> I don't think they're related at all. C functions have the ability to
>> accept a record, so the executor must be able to support it. It's just that
>> SQL and plpgsql functions don't have that support. I suspect that's just
>> because no one has gotten around to it.
>
> Well, that's assuming everyone else thinks it would be a good idea. Maybe
> they do, but I wouldn't assume it.
>
> The answer in the past has been to use more dynamically typed languages such
> as perl for which this problem is well suited.

I've never really been satisfied with this answer.  The two languages
with really good core support are perl and python, neither of which
are my cup of tea.   Also, there is no chance of inlining any of the
dynamic languages which has serious performance ramifications.  In a
perfect world, pl/v8 would be a good choice for a general purpose
database support language as javascript has a number of properties
that make it attractive for integration.  Even if we had that though
(and it's unlikely), a large percentage of postgres devs, including
myself, dislike coding in any language except sql plus extensions.

That being said, I think json types with their associated API, given
that they are core types, will ultimately handle these types of
problems.  That way, at least, we can avoid adding syntax and
functionality that will basically do the same thing.  This reminds me
a little bit of the json_build() vs enhanced row() syntax we discussed
some time back.  I didn't say so at the time, but for posterity, I
think you were right...json_build() is working fine for building
arbitrary record types and moving a record to json and deconstructing
it should work just as well.
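
(A tiny sketch of that round trip, using the core row_to_json() and
json_build_object() functions on a made-up row:

SELECT row_to_json(t) FROM (SELECT 1 AS a, 'x'::text AS b) t;
-- {"a":1,"b":"x"}
SELECT json_build_object('a', 1, 'b', 'x'::text);

The record goes to json without losing its field names, and the json API
can take it apart again on the other side.)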

merlin



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas  wrote:
> On 04/22/2015 09:30 PM, Robert Haas wrote:
>> On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas 
>> wrote:
>>>
>>> Note that it's a bit complicated to set up that scenario today. Archiving
>>> is
>>> never enabled in recovery mode, so you'll need to use a custom cron job
>>> or
>>> something to maintain the archive that C uses. The files will not
>>> automatically flow from B to the second archive. With the patch we're
>>> discussing, however, it would be easy: just set archive_mode='always' in
>>> B.
>>
>>
>> Hmm, I see.  But if C never replays the last, partial segment from the
>> old timeline, how does it follow the timeline switch?
>
> At timeline switch, we copy the old segment to the new timeline, and start
> writing where we left off. So the WAL from the old timeline is found in the
> segment nominally belonging to the new timeline.

Check.

> For example, imagine that we perform point-in-time recovery to WAL position
> 0/1237E568, on timeline 1. That falls within segment
> 000000010000000000000012. Then we end recovery, and switch to timeline 2.
> After the switch, and some more WAL-logged actions, we'll have these files
> in pg_xlog:
>
> 000000010000000000000011
> 000000010000000000000012
> 000000020000000000000012
> 000000020000000000000013
> 000000020000000000000014

Is the 000000010000000000000012 file a "partial" segment of the sort
you're proposing to no longer archive?

> Note that there are two segments ending in "12". They both have the same
> point up to offset 0x37E568, corresponding to the switch point 0/1237E568.
> After that, the contents diverge: the segment on the new timeline contains a
> checkpoint/end-of-recovery record at that point, followed by new WAL
> belonging to the new timeline.

Check.

> Recovery knows about that, so that if you set recovery target to timeline 2,
> and it needs the WAL at the beginning of segment 12 (still belonging to
> timeline 1), it will try restoring both "000000010000000000000012" and
> "000000020000000000000012".

What if you set the recovery target to timeline 3?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cache lookup error for shell type creation with incompatible output function (DDL deparsing bug)

2015-04-22 Thread Alvaro Herrera
Michael Paquier wrote:
> Hi all,
> 
> I just bumped into the following problem in HEAD (1c41e2a):
> =# create type my_array_float (INPUT = array_in, OUTPUT = array_out,
> ELEMENT = float4, INTERNALLENGTH = 32);
> ERROR:  XX000: cache lookup failed for type 0
> LOCATION:  format_type_internal, format_type.c:135

Argh.

> The fix consists in being sure that typoid uses the OID of the type
> shell created, aka the OID stored in adress.ObjectID. Attached is a
> patch with a regression test checking for shell creation with
> incompatible input/output functions (failure caused by output function
> here though) able to check this code path.

Thanks, pushed.  I changed the line to be just below TypeShellMake,
which seems slightly better aligned to the comment just below, and in
fact it matches what DefineRange already uses.  I also modified a couple
of other places involving TypeCreate: one did not have the "typoid =
addr.objectId" assignment at all; while the other did, it actually seems
misplaced because at that spot we expect that typoid is already correct.
So I added an assert to the other place and changed that assignment to
an assert, too.

I scanned the rest of the bogus commit and couldn't find any other place
where I made the same mistake.

Thanks for reporting,

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TABLESAMPLE patch

2015-04-22 Thread Petr Jelinek

On 19/04/15 01:24, Michael Paquier wrote:

On Sat, Apr 18, 2015 at 8:38 PM, Michael Paquier
 wrote:

On Fri, Apr 17, 2015 at 10:54 PM, Petr Jelinek wrote:

On 10/04/15 06:46, Michael Paquier wrote:

13) Some regression tests with pg_tablesample_method would be welcome.


Not sure what you mean by that.


I meant a sanity check on pg_tablesample_method to be sure that
tsminit, tsmnextblock and tsmnexttuple are always defined as they are
mandatory functions. So the idea is to add a query like this and to be
sure that it returns no rows:
SELECT tsmname FROM pg_tablesample_method WHERE tsminit IS NOT NULL OR
tsmnextblock IS NOT NULL OR tsmnexttuple IS NOT NULL;


Yesterday was a long day. I meant IS NULL and not IS NOT NULL, but I
am sure you guessed it that way already..



Yes, I guessed that, and it's a very reasonable request. I guess it should 
look like the attached (I don't want to send a new version of everything 
just for this).


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From d059c792a1e864e38f321cea51169ea0b3c5caab Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Wed, 22 Apr 2015 21:28:29 +0200
Subject: [PATCH 7/7] tablesample: add catalog regression test

---
 src/test/regress/expected/tablesample.out | 15 +++
 src/test/regress/sql/tablesample.sql  | 13 +
 2 files changed, 28 insertions(+)

diff --git a/src/test/regress/expected/tablesample.out b/src/test/regress/expected/tablesample.out
index 271638d..04e5eb8 100644
--- a/src/test/regress/expected/tablesample.out
+++ b/src/test/regress/expected/tablesample.out
@@ -209,6 +209,21 @@ SELECT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPLE BERNOULLI (5);
 ERROR:  syntax error at or near "TABLESAMPLE"
 LINE 1: ...CT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPL...
  ^
+-- catalog sanity
+SELECT *
+FROM pg_tablesample_method
+WHERE tsminit IS NULL
+   OR tsmseqscan IS NULL
+   OR tsmpagemode IS NULL
+   OR tsmnextblock IS NULL
+   OR tsmnexttuple IS NULL
+   OR tsmend IS NULL
+   OR tsmreset IS NULL
+   OR tsmcost IS NULL;
+ tsmname | tsmseqscan | tsmpagemode | tsminit | tsmnextblock | tsmnexttuple | tsmexaminetuple | tsmend | tsmreset | tsmcost 
+-++-+-+--+--+-++--+-
+(0 rows)
+
 -- done
 DROP TABLE test_tablesample CASCADE;
 NOTICE:  drop cascades to 2 other objects
diff --git a/src/test/regress/sql/tablesample.sql b/src/test/regress/sql/tablesample.sql
index 2f4b7de..7b3eb9b 100644
--- a/src/test/regress/sql/tablesample.sql
+++ b/src/test/regress/sql/tablesample.sql
@@ -57,5 +57,18 @@ SELECT * FROM query_select TABLESAMPLE BERNOULLI (5.5) REPEATABLE (1);
 
 SELECT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPLE BERNOULLI (5);
 
+-- catalog sanity
+
+SELECT *
+FROM pg_tablesample_method
+WHERE tsminit IS NULL
+   OR tsmseqscan IS NULL
+   OR tsmpagemode IS NULL
+   OR tsmnextblock IS NULL
+   OR tsmnexttuple IS NULL
+   OR tsmend IS NULL
+   OR tsmreset IS NULL
+   OR tsmcost IS NULL;
+
 -- done
 DROP TABLE test_tablesample CASCADE;
-- 
1.9.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Heikki Linnakangas

On 04/22/2015 10:21 PM, Robert Haas wrote:

On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas  wrote:

For example, imagine that we perform point-in-time recovery to WAL position 
0/1237E568, on timeline 1. That falls within segment
000000010000000000000012. Then we end recovery, and switch to timeline 2.
After the switch, and some more WAL-logged actions, we'll have these files
in pg_xlog:

000000010000000000000011
000000010000000000000012
000000020000000000000012
000000020000000000000013
000000020000000000000014


Is the 000000010000000000000012 file a "partial" segment of the sort
you're proposing to no longer archive?


If you did pure archive recovery, with no streaming replication 
involved, then no. If it was created by streaming replication, and the 
replication had not filled the whole segment yet, then yes, it would be 
a partial segment.



Note that there are two segments ending in "12". They both have the same
point up to offset 0x37E568, corresponding to the switch point 0/1237E568.
After that, the contents diverge: the segment on the new timeline contains a
checkpoint/end-of-recovery record at that point, followed by new WAL
belonging to the new timeline.


Check.


Recovery knows about that, so that if you set recovery target to timeline 2,
and it needs the WAL at the beginning of segment 12 (still belonging to
timeline 1), it will try restoring both "000000010000000000000012" and
"000000020000000000000012".


What if you set the recovery target to timeline 3?


It depends how timeline 3 was created. If timeline 3 was forked off from 
timeline 2, then recovery would find it. If it was forked off directly 
from timeline 1, then no.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cache lookup error for shell type creation with incompatible output function (DDL deparsing bug)

2015-04-22 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Michael Paquier wrote:
> > Hi all,
> > 
> > I just bumped into the following problem in HEAD (1c41e2a):
> > =# create type my_array_float (INPUT = array_in, OUTPUT = array_out,
> > ELEMENT = float4, INTERNALLENGTH = 32);
> > ERROR:  XX000: cache lookup failed for type 0
> > LOCATION:  format_type_internal, format_type.c:135

I also wanted to point out, but forgot, that this command is not really
creating a shell type -- it's creating a full-blown type, because there
are args.  Shell types are created when no args are given.  This happens
to fail due to the internal creation of the shell type because of the
failure to look up the i/o functions, but as far as I can see the
original code should fail in any type creation in pretty much the same
way (didn't actually test that); not completely sure why this wasn't
more visible in regression tests.

I simply removed the word "shell" in the provided test case in the
committed version, anyway.
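
(For reference, the bare form is what actually creates a shell type:

-- creates only a shell (placeholder) type, to be filled in later
CREATE TYPE my_array_float;

whereas the command above, with I/O functions supplied, goes through the
full type-creation path.)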

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sequence Access Method WIP

2015-04-22 Thread Petr Jelinek

On 20/04/15 17:50, Heikki Linnakangas wrote:

On 03/15/2015 09:07 PM, Petr Jelinek wrote:

Slightly updated version of the patch.

Mainly rebased against current master (there were several conflicts) and
fixed some typos, no real functional change.

I also attached initial version of the API sgml doc.


I finally got around to have another round of review on this. I fixed a
couple of little bugs, did some minor edition on comments etc. See
attached. It is also available in my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, branch "seqam",
if you want to look at individual changes. It combines your patches 1
and 4, I think those need to be applied together. I haven't looked at
the DDL changes yet.


Thanks!



I'm fairly happy with the alloc API now. I'm not sure it's a good idea
for the AM to access the sequence tuple directly, though. I would seem
cleaner if it only manipulated the amdata Datum. But it's not too bad as
it is.


Yeah, I was thinking about this myself; I just liked sending 10 
parameters to the function even less than this.




The division of labour between sequence.c and the AM, in the init and
the get/set_state functions, is a bit more foggy:

* Do we really need a separate amoptions() method and an init() method,
when the only caller to amoptions() is just before the init() method?
The changes in extractRelOptions suggest that it would call
am_reloptions for sequences too, but that doesn't actually seem to be
happening.


Hmm, yes, it should, and I am sure it did at some point; I must have messed 
it up during one of the rebases :(


And that's the reason why we need a separate API function.



postgres=# create sequence fooseq using local with (garbage=option);
CREATE SEQUENCE

Invalid options probably should throw an error.

* Currently, the AM's init function is responsible for basic sanity
checking, like min < max. It also has to extract the RESTART value from
the list of options. I think it would make more sense to move that to
sequence.c. The AM should only be responsible for initializing the
'amdata' portion, and for handling any AM-specific options. If the AM
doesn't support some options, like MIN and MAX value, it can check them
and throw an error, but it shouldn't be responsible for just passing
them through from the arguments to the sequence tuple.


Well, then we need to send restart as an additional parameter to the init 
function, as restart is normally not stored anywhere.


The check is actually whether the new value is within min/max, but yes, that 
can be done inside sequence.c I guess.




* It might be better to form the sequence tuple before calling the init
function, and pass the ready-made tuple to it. All the other API
functions deal with the tuple (after calling sequence_read_tuple), so
that would be more consistent. The init function would need to
deconstruct it to modify it, but we're not concerned about performance
here.


Right, this is actually more of a relic of the custom columns 
implementation where I didn't want to build the tuple with NULLs for 
columns that might have been specified as NOT NULL, but with the amdata 
approach it can create the tuple safely.




* The transformations of the arrays in get_state() and set_state()
functions are a bit complicated. The seqam_get_state() function returns
two C arrays, and pg_sequence_get_state() turns them into a text[]
array. Why not construct the text[] array directly in the AM? I guess
it's a bit more convenient to the AM, if the pg_sequence_get_state() do
that, but if that's an issue, you could create generic helper function
to turn two C string arrays into text[], and call that from the AM.


Yeah, that was exactly the reasoning. A helper function works for me (will 
check what Álvaro suggested; maybe it can be moved somewhere and reused).





seq = (FormData_pg_sequence *) GETSTRUCT(sequence_read_tuple(seqh));

for (i = 0; i < count; i++)
{
    if (pg_strcasecmp(keys[i], "last_value") == 0)
        seq->last_value = DatumGetInt64(DirectFunctionCall1(int8in,
                                            CStringGetDatum(values[i])));
    else if (pg_strcasecmp(keys[i], "is_called") == 0)
        seq->is_called = DatumGetBool(DirectFunctionCall1(boolin,
                                            CStringGetDatum(values[i])));
    else
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("invalid state key \"%s\" for local sequence",
                        keys[i])));
}

sequence_save_tuple(seqh, NULL, true);


If that error happens after having already processed a "last_value" or
"is_called" entry, you have already modified the on-disk tuple. AFAICS
that's the only instance of that bug, but sequence_read_tuple - modify
tuple in place - sequence_save_tuple pattern is quite unsafe in general.
If you modify a tuple directly in a Buffer, you should have a critical
section around it. It would make sense to start a critical section in
sequence_read_tuple(), except that not all callers want to modify the

Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Bruce Momjian
On Wed, Apr 22, 2015 at 04:36:17PM +0100, Greg Stark wrote:
> On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian  wrote:
> > Well, we have to assume there are many misconfigured configurations ---
> > autovacuum isn't super-easy to configure, so we can't just blame the
> > user if this makes things worse.  In fact, page pruning was designed
> > specifically for cases where autovacuum wasn't running or couldn't keep
> > up.
> 
> Well autovacuum isn't currently considering HOT pruning part of its
> job at all. It's hard to call it "misconfigured" when there's
> literally *no* way to configure it "correctly".

Good point, but doesn't vacuum remove the need for pruning as it removes
all the old rows?

> If you update less than autovacuum_vacuum_scale_factor fraction of the
> table and then never update another row autovacuum will never run.
> Ever. Every select will forevermore need to follow hot chains on that
> table. Until eventually transaction wraparound forces a vacuum on that
> table if that ever happens.

Yes, that is a very good point, and it matches my concerns.  Of course,
Simon's concern is to avoid overly-aggressive pruning where the row is
being pruned but will soon be modified, making the prune, and its WAL
volume, undesirable.  We have to consider both cases in any final
solution.

> Possibly autovacuum could be adjusted to count how many selects are
> happening on the table and decide to vacuum it when the cost of the
> selects following the dead tuples is balanced by the cost of doing a
> vacuum. But that's not something included in the design of autovacuum
> today.

Well, autovacuum is also going to clean indexes, which seem like
overkill for pruning HOT updates.

> The original design of tuple storage was aimed at optimizing the
> steady state where most tuples were not recently updated. It
> guaranteed that except for tuples that were in the process of being
> updated or were recently updated a tuple read didn't have to read the
> CLOG, didn't have to follow any chains, didn't have to do any I/O or
> other work other than to read the bits on the tuple itself. When a
> tuple is updated it's put into a state where everyone who comes along
> has to do extra work but as soon as practical the hint bits get set
> and that extra work stops.

Yes, Simon is right that doing everything as-soon-as-possible is not
optimal.  I think the trick is knowing when we should give up waiting
for something else to dirty the page and prune it.

> We had similar discussions about setting hint bits in the past. I'm
> not sure why HOT pruning is the focus now because I actually think
> hint bit setting is a larger source of I/O in innocent looking selects
> even today. And it's a major headache, people are always being
> surprised that their selects cause lots of I/O and slow down
> dramatically after a big update or data load has finished. It's
> characterized as "why is the database writing everything twice" (and
> saying it's actually writing everything three times doesn't make
> people feel better). In the new age of checksums with hint bit logging
> I wonder if it's even a bigger issue.

What would be the downside of only doing pruning during SELECT hint bit
setting?  Hinting is delayed by long-running transactions, but so is
pruning.  I assume you can do more pruning than setting all_visible
hints because the old prunable rows are older by definition, but I am
unclear how much older they are.

FYI, while hint bit setting causes page writes, it does not cause WAL
writes unless you have wal_log_hints set or page-level checksums are
enabled.  By doing pruning at the same time as hint bit setting, you are
sharing the same page write, but are generating more WAL.  Of course, if
you are setting all-visible, then you are by definition waiting longer
to prune than before, and this might be enough to make it a win for all
use cases.  You wouldn't end up never pruning in a read-only workload,
because setting hint bits would eventually cause the pruning.
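
(Both of those conditions are easy to check on a given server, e.g. with:

SHOW wal_log_hints;     -- defaults to off
SHOW data_checksums;    -- fixed at initdb time

before deciding how costly the extra WAL from pruning-plus-hinting would be.)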

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila  wrote:
> I have implemented this idea (note that I have to expose a new API
> shm_mq_from_handle as TupleQueueFunnel stores shm_mq_handle* and
> we need shm_mq* to call shm_mq_detach) and apart from this I have fixed other
> problems reported on this thread:
>
> 1. Execution of initPlan by master backend and then pass the
> required PARAM_EXEC parameter values to workers.
> 2. Avoid consuming dsm's by freeing the parallel context after
> the last tuple is fetched.
> 3. Allow execution of Result node in worker backend as that can
> be added as a gating filter on top of PartialSeqScan.
> 4. Merged parallel heap scan descriptor patch
>
> To apply the patch, please follow below sequence:
>
> HEAD Commit-Id: 4d930eee
> parallel-mode-v9.patch [1]
> assess-parallel-safety-v4.patch [2]  (don't forget to run fixpgproc.pl in
> the patch)
> parallel_seqscan_v14.patch (Attached with this mail)

Thanks, this version looks like an improvement.  However, I still see
some problems:

- I believe the separation of concerns between ExecFunnel() and
ExecEndFunnel() is not quite right.  If the scan is shut down before
it runs to completion (e.g. because of LIMIT), then I think we'll call
ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path.  I
think you probably need to create a static subroutine that is called
both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
case cleaning up whatever resources remain.

- InitializeParallelWorkers() still mixes together general parallel
executor concerns with concerns specific to parallel sequential scan
(e.g. EstimatePartialSeqScanSpace).   We have to eliminate everything
that assumes that what's under a funnel will be, specifically, a
partial sequential scan. To make this work properly, I think we should
introduce a new function that recurses over the plan tree and invokes
some callback for each plan node.  I think this could be modeled on
this code from ExplainNode(), beginning around line 1593:

/* initPlan-s */
if (planstate->initPlan)
ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es);

/* lefttree */
if (outerPlanState(planstate))
ExplainNode(outerPlanState(planstate), ancestors,
"Outer", NULL, es);

/* righttree */
if (innerPlanState(planstate))
ExplainNode(innerPlanState(planstate), ancestors,
"Inner", NULL, es);

/* special child plans */
switch (nodeTag(plan))
{
/* a bunch of special cases */
}

/* subPlan-s */
if (planstate->subPlan)
ExplainSubPlans(planstate->subPlan, ancestors, "SubPlan", es);

The new function would do the same sort of thing, but instead of
explaining each node, it would invoke a callback for each node.
Possibly explain.c could use it instead of having hard-coded logic.
Possibly it should use the same sort of return-true convention as
expression_tree_walker, query_tree_walker, and friends.  So let's call
it planstate_tree_walker.

Now, instead of directly invoking logic specific to parallel
sequential scan, it should call planstate_tree_walker() on its
lefttree and pass a new function ExecParallelEstimate() as the
callback.  That function ignores any node that's not parallel aware,
but when it sees a partial sequential scan (or, in the future, a
parallel bitmap scan, parallel sort, or what have you) it does the
appropriate estimation work.  When ExecParallelEstimate() finishes, we
InitializeParallelDSM().  Then, we call planstate_tree_walker() on the
lefttree again, and this time we pass another new function
ExecParallelInitializeDSM().  Like the previous one, that ignores the
callbacks from non-parallel nodes, but if it hits a parallel node,
then it fills in the parallel bits (i.e. ParallelHeapScanDesc for a
partial sequential scan).

- shm_mq_from_handle() is probably reasonable, but can we rename it
shm_mq_get_queue()?

- It's hard to believe this is right:

+   if (parallelstmt->inst_options)
+   receiver = None_Receiver;

Really?  Flush the tuples if there are *any instrumentation options
whatsoever*?  At the very least, that doesn't look too future-proof,
but I'm suspicious that it's outright incorrect.

- I think ParallelStmt probably shouldn't be defined in parsenodes.h.
That file is included in a lot of places, and adding all of those
extra #includes there doesn't seem like a good idea for modularity
reasons even if you don't care about partial rebuilds.  Something that
includes a shm_mq obviously isn't a "parse" node in any meaningful
sense anyway.

- I don't think you need both setup cost and startup cost.  Starting
up more workers isn't particularly more expensive than starting up
fewer of them, because most of the overhead is in waiting for them to
actually start, and if the number of workers is reasonable, they'll
all be doing that in parallel with each other.  I suggest removing
parallel_startup_cost and keeping para

Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-22 Thread Robert Haas
On Wed, Apr 22, 2015 at 3:34 PM, Heikki Linnakangas  wrote:
> On 04/22/2015 10:21 PM, Robert Haas wrote:
>> On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas 
>> wrote:
>>> For example, imagine that we perform point-in-time recovery to WAL position
>>> 0/1237E568, on timeline 1. That falls within segment
>>> 000000010000000000000012. Then we end recovery, and switch to timeline 2.
>>> After the switch, and some more WAL-logged actions, we'll have these
>>> files
>>> in pg_xlog:
>>>
>>> 000000010000000000000011
>>> 000000010000000000000012
>>> 000000020000000000000012
>>> 000000020000000000000013
>>> 000000020000000000000014
>>
>>
>> Is the 000000010000000000000012 file a "partial" segment of the sort
>> you're proposing to no longer archive?
>
> If you did pure archive recovery, with no streaming replication involved,
> then no. If it was created by streaming replication, and the replication had
> not filled the whole segment yet, then yes, it would be a partial segment.

Why the difference?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inherit support for foreign tables

2015-04-22 Thread Robert Haas
On Tue, Apr 14, 2015 at 8:35 PM, Kyotaro HORIGUCHI
 wrote:
> Before suppressing the symptom, I doubt the necessity and/or
> validity of giving foreign tables an ability to be a parent. Is
> there any reasonable usage for the ability?

Gee, I don't see why that would be unreasonable or invalid

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Alvaro Herrera
Bruce Momjian wrote:
> On Wed, Apr 22, 2015 at 04:36:17PM +0100, Greg Stark wrote:
> > On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian  wrote:
> > > Well, we have to assume there are many misconfigured configurations ---
> > > autovacuum isn't super-easy to configure, so we can't just blame the
> > > user if this makes things worse.  In fact, page pruning was designed
> > > specifically for cases where autovacuum wasn't running or couldn't keep
> > > up.
> > 
> > Well autovacuum isn't currently considering HOT pruning part of its
> > job at all. It's hard to call it "misconfigured" when there's
> > literally *no* way to configure it "correctly".
> 
> Good point, but doesn't vacuum remove the need for pruning as it removes
> all the old rows?

Sure.  The point, I think, is to make autovacuum runs of some sort that
don't actually vacuum but only do HOT-pruning.  Maybe this is a
reasonable solution to the problem that queries don't prune anymore
after Simon's patch.  If we made autovac HOT-prune periodically, we
could have read-only queries prune only already-dirty pages.  Of course,
that would need further adjustments to default number of autovac
workers, I/O allocation, etc.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Rounding to even for numeric data type

2015-04-22 Thread Pedro Gimeno
Dean Rasheed wrote, On 2015-03-28 10:01:
> On 28 March 2015 at 05:16, Andrew Gierth  wrote:
>>> "Tom" == Tom Lane  writes:
>>
>>  Tom> I think the concern over backwards compatibility here is probably
>>  Tom> overblown; but if we're sufficiently worried about it, a possible
>>  Tom> compromise is to invent a numeric_rounding_mode GUC, so that
>>  Tom> people could get back the old behavior if they really care.
>>
>> I only see one issue with this, but it's a nasty one: do we really want
>> to make all numeric operations that might do rounding stable rather than
>> immutable?
>>
> 
> Yeah, making all numeric functions non-immutable seems like a really bad idea.

Would it be possible to make it an unchangeable per-cluster or
per-database setting, kinda like how encoding behaves? Wouldn't that
allow the functions to stay immutable?
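
(For anyone following along, this is the behaviour such a setting would
switch:

SELECT round(0.5::numeric), round(1.5::numeric), round(2.5::numeric);
-- currently 1, 2 and 3: numeric rounds ties away from zero

With round-half-to-even the same query would return 0, 2 and 2.)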



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Bruce Momjian
On Wed, Apr 22, 2015 at 06:07:00PM -0300, Alvaro Herrera wrote:
> > Good point, but doesn't vacuum remove the need for pruning as it removes
> > all the old rows?
> 
> Sure.  The point, I think, is to make autovacuum runs of some sort that
> don't actually vacuum but only do HOT-pruning.  Maybe this is a
> reasonable solution to the problem that queries don't prune anymore
> after Simon's patch.  If we made autovac HOT-prune periodically, we
> could have read-only queries prune only already-dirty pages.  Of course,
> that would need further adjustments to default number of autovac
> workers, I/O allocation, etc.

Do we really want to make vacuum more complex for this?  vacuum does
have the delay settings we would need though.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Jim Nasby

On 4/22/15 1:24 PM, Robert Haas wrote:

I keep coming back to the idea of treating any page that is marked as
all-visible as frozen, and deferring freezing until the page is again
modified.  The big downside of this is that if the page is set as
all-visible and then immediately thereafter modified, it sucks to have
to freeze when the XIDs in the page are still present in CLOG.  But if
we could determine from the LSN that the XIDs in the page are new
enough to still be considered valid, then we could skip freezing in
those cases and only do it when the page is "old".  That way, if
somebody zeroed out the LSN (why, oh why?) the worst that would happen
is that we'd do some extra freezing when the page was next modified.


Maybe freezing a page as part of making it not all-visible wouldn't be 
that horrible, even without LSN.


For one, we already know that every tuple is visible, so no MVCC checks 
needed. That's probably a significant savings over current freezing.


If we're marking a page as no longer all-visible, that means we're 
already dirtying it and generating WAL for it (likely including a FPI). 
We may be able to consolidate all of this into a new WAL record that's a 
lot more efficient than what we currently do for freezing. I suspect we 
wouldn't need to log each TID we're freezing, for starters. Even if we 
did though, we could at least combine all that into one WAL message that 
just contains an array of TIDs or LPs.


 I think we could actually proactively freeze tuples during 
vacuum too, at least if we're about to mark the page as all-visible. 
Though, with Robert's HEAP_XMIN_FROZEN change we could be a lot more 
aggressive about freezing during VACUUM, certainly for pages we're 
already dirtying, especially if we can keep the WAL cost of that down.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow SQL/plpgsql functions to accept record

2015-04-22 Thread Jim Nasby

On 4/22/15 2:12 PM, Merlin Moncure wrote:

That being said, I think json types with their associated API, given
that they are core types, will ultimately handle these types of
problems.  That way, at least, we can avoid adding syntax and
functionality that will basically do the same thing.  This reminds me
a little bit of the json_build() vs enhanced row() syntax we discussed
some time back.  I didn't say so at the time, but for posterity, I
think you were right...json_build() is working fine for building
arbitrary record types and moving a record to json and deconstructing
it should work just as well.


The one part I don't care for is that it seems rather inefficient to 
cast something to JSON just so we can do things we really should be able 
to do with a record. But perhaps it's not all that costly.


As for allowing SQL and plpgsql functions to accept a record, I think 
our JSON functionality just provided plenty of reason we should allow 
accepting them, even if you can't do much with it: you *can* hand it to 
row_to_json(), which does allow you to do something useful with it. So 
it seems reasonable to me that we should at least accept it as a 
function argument.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-22 Thread Peter Geoghegan
On Tue, Apr 21, 2015 at 7:57 AM, Andres Freund  wrote:
> I'm not 100% sure Heikki and I am on exactly the same page here :P
>
> I'm looking at git diff $(git merge-base upstream/master HEAD).. where
> HEAD is e1a5822d164db0.

Merged your stuff into my Github branch. Heikki is pushing changes
there directly now.

> * The logical stuff looks much saner.

Cool.

> * Please add tests for the logical decoding stuff. Probably both a plain
>   regression and an isolationtester test in
>   contrib/test_decoding. Including one that does spooling to disk.

Working on it...hope to push that to Github soon.

> * I don't like REORDER_BUFFER_CHANGE_INTERNAL_INSERT/DELETE as names. Why not
>   _SPECINSERT and _SPECDELETE or such?

Changed that on Github.

> * Iff we're going to have the XLOG_HEAP_AFFIRM record, I'd rather have
>   that guide the logical decoding code. Seems slightly cleaner.

I thought that you didn't think that would always work out? That in
some possible scenario it could break?

> * Still not a fan of the name 'arbiter' for the OC indexes.

What do you prefer? Seems to describe what we're talking about
reasonably well to me.

> * Gram.y needs a bit more discussion:
>   * Can anybody see a problem with changing the precedence of DISTINCT &
> ON to nonassoc? Right now I don't see a problem given both are
> reserved keywords already.

Why did you push code that solved this in a roundabout way, but
without actually reverting my original nonassoc changes (which would
now not result in shift/reduce conflicts?). What should we do about
that? Seems your unsure (otherwise you'd have reverted my thing). I
don't like that you seem to have regressed diagnostic output [1].
Surely it's simpler to use the nonassoc approach? I think this works
by giving the relevant keywords an explicit priority lower than '(',
so that a rule with ON CONFLICT '(' will shift rather than reducing a
conflicting rule (CONFLICT is an unreserved keyword). I wrote the code
so long ago that I can't really remember why I thought it was the
right thing, though.

>   * UpdateInsertStmt is a horrible name. OnConflictUpdateStmt maybe?

Done on Github.

>   * '(' index_params where_clause ')' is imo rather strange. The where
> clause is inside the parens? That's quite different from the
> original index clause.

I don't know. Maybe I was lazy about fixing shift/reduce conflicts.   :-)

I'll look at this some more.

> * SPEC_IGNORE,  /* INSERT of "ON CONFLICT IGNORE" */ looks like
>   a wrongly copied comment.

Not sure what you mean here. Please clarify.

> * The indentation in RoerderBufferCommit is clearly getting out of hand,
>   I've queued up a commit cleaning this up in my repo, feel free to merge.

Done on Github.

> * I don't think we use the term 'ordinary table' in error messages so
>   far.

Fixed on Github.

> * I still think it's unacceptable to redefine
>   XLOG_HEAP_LAST_MULTI_INSERT as XLOG_HEAP_SPECULATIVE_TUPLE like you
>   did. I'll try to find something better.

I did what you suggested in your follow-up e-mail (changes are on Github [2]).

> * I wonder if we now couldn't avoid changing heap_delete's API - we can
>   always super delete if we find a speculative insertion now. It'd be
>   nice not to break out of core callers if not necessary.

Maybe, but if there is a useful way to break out only a small subset
of heap_delete(), I'm not seeing it. Most of the callers that need a
new NULL argument are heap_insert() callers, actually. There are now 3
heap_delete() callers, up from 2.

> * breinbaas on IRC just mentioned that it'd be nice to have upsert as a
>   link in the insert. Given that that's the pervasive term, that doesn't
>   seem absurd.

Not sure what you mean. Where would the link appear? It is kind of
hard to categorize that text so that we're strictly either talking
about INSERT or UPSERT. Might be possible, though.

> I think this is getting closer to a commit. Let's get this done.

Great!

The blockers to committing the IGNORE patch I see are:

* We need to tweak some of the logical decoding stuff a bit more, I
feel, and firm up the details of whether we treat a confirmation record
as a logical decoding change that is involved in the new dance during
transaction reassembly.

* We need to sort out those issues with the grammar, since that only
really applies to the inference specification. Maybe the WHERE clause
that the inference specification accepts can be broken out. No ON
CONFLICT UPDATE specific issues left there, AFAICT though.

That really seems totally doable in just a couple of days.

The blockers for committing the ON CONFLICT UPDATE patch are larger, but doable:

* We need to figure out the tuple lock strength details. I think this
is doable, but it is the greatest challenge to committing ON CONFLICT
UPDATE at this point. Andres feels that we should require no greater
lock strength than an equivalent UPDATE. I suggest we get to this
after committing the IGNORE variant. We probably need to

Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-22 Thread Jim Nasby

On 4/22/15 1:51 PM, Kevin Grittner wrote:

(1) WAL log the insert.
(2) Write the tuple.
(3) Hint and rewrite the tuple.
(4) WAL log the freeze of the tuple.
(5) Rewrite the frozen tuple.
(6) WAL-log the delete.
(7) Rewrite the deleted tuple.
(8) Prune and rewrite the page.
(9) Free line pointers and rewrite the page.

If I was lucky some of the writes could be combined in cache
because they happened close enough together. Also, one could hope
that not too much of the WAL-logging involved full page writes to
the WAL -- again, keeping steps close together in time helps with
that.


This is why I like the idea of methods that tell us where we need to do 
cleanup... they provide us with a rough ability to track what tuples are 
in what part of their lifecycle. The VM helps with this a small amount, 
but really it only applies after 1 and 6; it doesn't help us with any 
other portions.


Having a way to track recently created tuples would allow us to be much 
more efficient with 1-3, and with aggressive freezing, 1-5. A way to 
track recently deleted tuples would help with 6-7, possibly 6-9 if no 
indexes.


If we doubled the size of the VM, that would let us track 4 states for 
each page:


- Page has newly inserted tuples
- Page has newly deleted tuples
- Page is all visible
- Page is frozen

though as discussed elsewhere, we could probably combine all visible and 
frozen.


The win from doing this would be easily knowing what pages need hinting 
(newly inserted) and pruning (newly deleted). Unfortunately we still 
wouldn't know whether we could do real work without visiting the page 
itself, but I suspect that for many workloads just having newly 
inserted/deleted would be a serious win.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Bruce Momjian
On Tue, Apr 21, 2015 at 08:39:37AM +0200, Andres Freund wrote:
> On 2015-04-20 17:13:29 -0400, Bruce Momjian wrote:
> > Didn't you think any of the TODO threads had workable solutions?  And
> > don't expect adding an additional file per relation will be zero cost
> > --- added over the lifetime of 200M transactions, I question if this
> > approach would be a win.
> 
> Note that normally you'd not run with a 200M transaction freeze max age
> on a busy server. Rather around a magnitude more.
> 
> Think about this being used on a time partitioned table. Right now all
> the partitions have to be fully rescanned on a regular basis - quite
> painful. With something like this normally only the newest partitions
> will have to be.

My point is that for the life of 200M transactions, you would have the
overhead of an additional file per table in the file system, and updates
of that.  I just don't know if the overhead over the long time period
would be smaller than the VACUUM FREEZE.  It might be fine --- I don't
know.  People seem to focus on the big activities, while many small
activities can lead to larger slowdowns.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Jim Nasby

On 4/22/15 6:12 PM, Bruce Momjian wrote:

My point is that for the life of 200M transactions, you would have the
overhead of an additional file per table in the file system, and updates
of that.  I just don't know if the overhead over the long time period
would be smaller than the VACUUM FREEZE.  It might be fine --- I don't
know.  People seem to focus on the big activities, while many small
activities can lead to larger slowdowns.


Ahh. This wouldn't be for the life of 200M transactions; it would be a 
permanent fork, just like the VM is.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-22 Thread Peter Geoghegan
On Wed, Apr 22, 2015 at 3:23 PM, Peter Geoghegan  wrote:
> * We need to sort out those issues with the grammar, since that only
> really applies to the inference specification. Maybe the WHERE clause
> that the inference specification accepts can be broken out. No ON
> CONFLICT UPDATE specific issues left there, AFAICT though.

I pushed some code that deals with the predicate being within parentheses:

https://github.com/petergeoghegan/postgres/commit/358854645279523310f998dfc9cb3fe3e165ce1e

(it now follows the attributes/expressions indexes, in the style of
CREATE INDEX).
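
For anyone skimming the commit, a rough sketch of the difference (table,
column, and predicate names are hypothetical, and the syntax here reflects
the in-flux patch on this thread, not any final committed form):

  -- Previously the inference predicate sat inside the parentheses:
  --   INSERT INTO tab (k, v) VALUES (1, 'x')
  --     ON CONFLICT (k WHERE NOT deleted) IGNORE;
  -- With the commit above it follows the column list, CREATE INDEX style:
  INSERT INTO tab (k, v) VALUES (1, 'x')
    ON CONFLICT (k) WHERE NOT deleted IGNORE;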

We still need to reconcile these changes to the grammar with your own,
Andres. I'm going to wait to hear back on what you think about that.
Note that this removal:

-%nonassoc DISTINCT
-%nonassoc ON

was incidental to the commit (this is the code you could have removed
when you modified the grammar, adding a new production, but didn't).
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inherit support for foreign tables

2015-04-22 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 16 Apr 2015 14:43:33 -0700, David Fetter  wrote in 
<20150416214333.ga...@fetter.org>
> On Wed, Apr 15, 2015 at 09:35:05AM +0900, Kyotaro HORIGUCHI wrote:
> > Hi,
> > 
> > Before suppressing the symptom, I doubt the necessity and/or
> > validity of giving foreign tables an ability to be a parent. Is
> > there any reasonable usage for the ability?
...
> I have a use case for having foreign tables as non-leaf nodes in a
> partitioning hierarchy, namely geographic.

Ah, I see. Now I understand the case of intermediate nodes. I agree
that it is quite natural.

>  One might have a table at
> HQ called foo_world, then partitions under it called foo_jp, foo_us,
> etc., in one level, foo_us_ca, foo_us_pa, etc. in the next level, and
> on down, each in general in a separate data center.
> 
> Is there something essential about having non-leaf nodes as foreign
> tables that's a problem here?

No. I'm convinced of the necessity. Sorry for the noise.


At Wed, 22 Apr 2015 17:00:10 -0400, Robert Haas  wrote 
in 
> Gee, I don't see why that would be unreasonable or invalid

Hmm. Yes, as mentioned above, there's no reason to refuse
non-leaf foreign tables. I didn't understand the real cause of
the problem and thought that not allowing foreign *root* tables
seemed better than tweaking elsewhere. But that thought turned out
to be totally garbage :(

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TABLESAMPLE patch

2015-04-22 Thread Michael Paquier
On Thu, Apr 23, 2015 at 4:31 AM, Petr Jelinek  wrote:
> On 19/04/15 01:24, Michael Paquier wrote:
>>
>> On Sat, Apr 18, 2015 at 8:38 PM, Michael Paquier
>>  wrote:
>>>
>>> On Fri, Apr 17, 2015 at 10:54 PM, Petr Jelinek wrote:

 On 10/04/15 06:46, Michael Paquier wrote:
>
> 13) Some regression tests with pg_tablesample_method would be welcome.


 Not sure what you mean by that.
>>>
>>>
>>> I meant a sanity check on pg_tablesample_method to be sure that
>>> tsminit, tsmnextblock and tsmnexttuple are always defined as they are
>>> mandatory functions. So the idea is to add a query like this and to be
>>> sure that it returns no rows:
>>> SELECT tsmname FROM pg_tablesample_method WHERE tsminit IS NOT NULL OR
>>> tsmnextblock IS NOT NULL OR tsmnexttuple IS NOT NULL;
>>
>>
>> Yesterday was a long day. I meant IS NULL and not IS NOT NULL, but I
>> am sure you guessed it that way already..
>>
>
> Yes, I guessed that and it's a very reasonable request. I guess it should look
> like the attached (I don't want to send a new version of everything just for
> this).

Thanks. That's exactly the idea.
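
For reference, the corrected sanity check (with IS NULL, per the
clarification above) would be expected to return zero rows:

  -- Expect no rows: the three mandatory support functions must always be set.
  SELECT tsmname FROM pg_tablesample_method
   WHERE tsminit IS NULL OR tsmnextblock IS NULL OR tsmnexttuple IS NULL;
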
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-22 Thread Bruce Momjian
On Wed, Apr 22, 2015 at 06:36:23PM -0500, Jim Nasby wrote:
> On 4/22/15 6:12 PM, Bruce Momjian wrote:
> >My point is that for the life of 200M transactions, you would have the
> >overhead of an additional file per table in the file system, and updates
> >of that.  I just don't know if the overhead over the long time period
> >would be smaller than the VACUUM FREEZE.  It might be fine --- I don't
> >know.  People seem to focus on the big activities, while many small
> >activities can lead to larger slowdowns.
> 
> Ahh. This wouldn't be for the life of 200M transactions; it would be
> a permanent fork, just like the VM is.

Right.  My point is that either you do X 2M times to maintain that fork
and the overhead of the file existence, or you do one VACUUM FREEZE.  I
am saying that 2M is a large number and adding all those X's might
exceed the cost of a VACUUM FREEZE.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?

2015-04-22 Thread Bruce Momjian
On Tue, Apr 21, 2015 at 05:36:41PM -0400, Robert Haas wrote:
> On Mon, Apr 20, 2015 at 5:41 PM, Bruce Momjian  wrote:
> > On Mon, Apr 20, 2015 at 05:04:14PM -0400, Robert Haas wrote:
> >> On Mon, Apr 20, 2015 at 4:11 PM, Bruce Momjian  wrote:
> >> > Slightly improved patch applied.
> >>
> >> Is it?
> >
> > The patch has a slightly modified 'if' statement to check a constant
> > before calling a function, and use elseif:
> >
> > < + if (!interpretOidsOption(stmt->options, true) && 
> > cxt.hasoids)
> > ---
> > > + if (cxt.hasoids && !interpretOidsOption(stmt->options, 
> > true))
> > 47c57
> > < + if (interpretOidsOption(stmt->options, true) && 
> > !cxt.hasoids)
> > ---
> > > + else if (!cxt.hasoids && interpretOidsOption(stmt->options, 
> > true))
> >
> > I realize the change is subtle.
> 
> What I meant was - I didn't see an attachment on that message.

I didn't attach it as people have told me they can just as easily see
the patch via git, and since it was so similar, I didn't repost it. 
Should I have?  I can easily do that.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type

2015-04-22 Thread Jeff Davis
Moving thread to -hackers.

On Wed, Apr 8, 2015 at 11:18 PM, Jeff Davis  wrote:
> That example was just for illustration. My other example didn't require
> creating a table at all:
>
>   SELECT a=b FROM (SELECT ''::text, '  ') x(a,b);
>
> it's fine with me if we want that to fail, but I don't think we're
> failing in the right place, or with the right error message.
>
> I'm not clear on what rules we're applying such that the above query
> should fail, but:
>
>   SELECT ''::text='  ';
>
> should succeed. Unknown literals are OK, but unknown column references
> are not? If that's the rule, can we catch that earlier, and throw an
> error like 'column reference "b" has unknown type'?

Is the behavior of unknown literals vs. unknown column references
documented anywhere? I tried looking here:
http://www.postgresql.org/docs/devel/static/typeconv.html, but it
doesn't seem to make the distinction between how unknown literals vs.
unknown column references are handled.

My understanding until now has been that unknown types are a
placeholder while still inferring the types. But that leaves a few
questions:

1. Why do we care whether the unknown is a literal or a column reference?
2. Unknown column references seem basically useless, because we will
almost always encounter the "failure to find conversion" error, so why
do we allow them?
3. If unknowns are just a placeholder, why do we return them to the
client? What do we expect the client to do with it?
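
On question 3, a minimal illustration (behavior as of the releases under
discussion; most client libraries seem to simply treat the value as text):

  -- A bare literal in the select list reaches the client still typed as
  -- "unknown" (type OID 705 in the RowDescription message):
  SELECT 'foo';
  --  ?column?
  -- ----------
  --  foo
  -- (1 row)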

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tablespaces inside $PGDATA considered harmful

2015-04-22 Thread Bruce Momjian
On Fri, Jan 30, 2015 at 01:26:22PM -0800, Josh Berkus wrote:
> Robert, Stephen, etc.:
> 
> Apparently you can create a tablespace in the tablespace directory:
> 
> josh=# create tablespace tbl location '/home/josh/pg94/data/pg_tblspc/';
> CREATE TABLESPACE
> josh=# create table test_tbl ( test text ) tablespace tbl;
> CREATE TABLE
> josh=# \q
> josh@Radegast:~/pg94/data/pg_tblspc$ ls
> 17656  PG_9.4_201409291
> josh@Radegast:~/pg94/data/pg_tblspc$ ls -l
> total 4
> lrwxrwxrwx 1 josh josh   30 Jan 30 13:02 17656 ->
> /home/josh/pg94/data/pg_tblspc
> drwx-- 3 josh josh 4096 Jan 30 13:02 PG_9.4_201409291
> josh@Radegast:~/pg94/data/pg_tblspc$
> 
> In theory if I could guess the next OID, I could cause a failure there,
> but that appears to be obscure enough to be not worth bothering about.
> 
> What is a real problem is that we don't block creating tablespaces
> anywhere at all, including in obviously problematic places like the
> transaction log directory:
> 
> josh=# create tablespace tbl2 location '/home/josh/pg94/data/pg_xlog/';
> CREATE TABLESPACE
> 
> It really seems like we ought to block *THAT*.  Of course, if we block
> tablespace creation in PGDATA generally, then that's covered.

I have developed the attached patch to warn about creating tablespaces
inside the data directory.  The case this doesn't catch is referencing a
symbolic link that points to the same directory.  We can't make it an
error because people need to be able to pg_upgrade such setups.  This
would be for 9.5 only.
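
As a quick illustration of the intended behavior (hypothetical path; the
warning text comes from the patch below):

  -- Assuming the data directory is /home/josh/pg94/data:
  CREATE TABLESPACE tbl LOCATION '/home/josh/pg94/data/inside';
  -- WARNING:  tablespace location should not be inside the data directory
  -- CREATE TABLESPACE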

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
new file mode 100644
index fd22612..4ec1aff
*** a/src/backend/commands/tablespace.c
--- b/src/backend/commands/tablespace.c
*** CreateTableSpace(CreateTableSpaceStmt *s
*** 288,293 
--- 288,299 
   errmsg("tablespace location \"%s\" is too long",
  		location)));
  
+ 	/* Warn if the tablespace is in the data directory. */
+ 	if (path_is_prefix_of_path(DataDir, location))
+ 		ereport(WARNING,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+  errmsg("tablespace location should not be inside the data directory")));
+ 
  	/*
  	 * Disallow creation of tablespaces named "pg_xxx"; we reserve this
  	 * namespace for system purposes.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inherit support for foreign tables

2015-04-22 Thread Etsuro Fujita

On 2015/04/23 0:34, Stephen Frost wrote:

* Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote:

postgres=# select * from ft1 for update;
ERROR:  could not find junk tableoid1 column

I think this is a bug.  Attached is a patch fixing this issue.


Pushed, thanks!


Thank you.

Best regards,
Etsuro Fujita


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-22 Thread Kouhei Kaigai
Hi Robert,

Thanks for your comments.

> A few random cosmetic problems:
> 
> - The hunk in allpaths.c is useless.
> - The first hunk in fdwapi.h contains an extra space before the
> closing parenthesis.
>
OK, it's my oversight.

> And then:
> 
> +   else if (scan->scanrelid == 0 &&
> +(IsA(scan, ForeignScan) || IsA(scan, CustomScan)))
> +   varno = INDEX_VAR;
> 
> Suppose scan->scanrelid == 0 but the scan type is something else?  Is
> that legal?  Is varno == 0 the correct outcome in that case?
>
Right now, no other scan type has the capability to return tuples
with a flexible type/attributes beyond its static definition.
I think it is a valid restriction that only foreign/custom-scan
can have scanrelid == 0.

I checked the overall code again. One doubtful point was ExecScanFetch().
If estate->es_epqTuple is not NULL, it tries to return the tuple saved for
a particular scanrelid (larger than zero).
IIUC, es_epqTuple is used only when the fetched tuple has been updated and
visibility checks are applied again for the write operation.
So it should work for a CSP with an underlying actual scan node on base
relations; however, I need to investigate the code for the case where an
FDW/CSP replaces an entire join subtree with an alternative relation scan
(like a materialized view).


> > [ new patch ]
> 
> A little more nitpicking:
> 
> ExecInitForeignScan() and ExecInitCustomScan() could declare
> currentRelation inside the if (scanrelid > 0) block instead of in the
> outer scope.
>
OK,

> I'm not too excited about the addition of GetFdwHandlerForRelation,
> which is a one-line function used in one place.  It seems like we
> don't really need that.
>
OK,

> On Fri, Mar 13, 2015 at 8:02 PM, Tom Lane  wrote:
> > Robert Haas  writes:
> >> On Fri, Mar 13, 2015 at 2:31 PM, Tom Lane  wrote:
> >>> I don't object to the concept, but I think that is a pretty bad place
> >>> to put the hook call: add_paths_to_joinrel is typically called multiple
> >>> (perhaps *many*) times per joinrel and thus this placement would force
> >>> any user of the hook to do a lot of repetitive work.
> >
> >> Interesting point.  I guess the question is whether a some or all
> >> callers are going to actually *want* a separate call for each
> >> invocation of add_paths_to_joinrel(), or whether they'll be happy to
> >> operate on the otherwise-complete path list.
> >
> > Hmm.  You're right, it's certainly possible that some users would like to
> > operate on each possible pair of input relations, rather than considering
> > the joinrel "as a whole".  Maybe we need two hooks, one like your patch
> > and one like I suggested.
> 
> Let me attempt to summarize subsequent discussion on this thread by
> saying the hook location that you proposed (just before set_cheapest)
> has not elicited any enthusiasm from anyone else.  In a nutshell, the
> problem is that a single callback for a large join problem is just
> fine if there are no special joins involved, but in any other
> scenario, nobody knows how to use a hook at that location for anything
> useful.  To push down a join to the remote server, you've got to
> figure out how to emit an SQL query for it.  To execute it with a
> custom join strategy, you've got to know which of those joins should
> have inner join semantics vs. left join semantics.  A hook/callback in
> make_join_rel() or in add_paths_to_joinrel() makes that relatively
> straightforward. Otherwise, it's not clear what to do, short of
> copy-and-pasting join_search_one_level().  If you have a suggestion,
> I'd like to hear it.
>
I have nothing. When I once tried to put a hook just after set_cheapest(),
the largest problem was that we cannot extract the set of left and right
relations from a set of joined relations, like extracting apple
and orange back out of mixed juice.

> If not, I'm going to press forward with the idea of putting the
> relevant logic in either add_paths_to_joinrel(), as previously
> proposed, or perhaps up one level in make_one_rel().  Either way, if
> you don't need to be called multiple times per joinrel, you can stash
> a flag inside whatever you hang off of the joinrel's fdw_private and
> return immediately on every call after the first.  I think that's
> cheap enough that we shouldn't get too stressed about it: for FDWs, we
> only call the hook at all if everything in the joinrel uses the same
> FDW, so it won't get called at all except for joinrels where it's
> likely to win big; for custom joins, multiple calls are quite likely
> to be useful and necessary, and if the hook burns too much CPU time
> for the query performance you get out of it, that's the custom-join
> provider's fault, not ours.  The current patch takes this approach one
> step further and attempts FDW pushdown only once per joinrel.  It does
> that because, while postgres_fdw DOES need the jointype and a valid
> innerrel/outerrel breakdown to figure out what query to generate, it
> does NOT need every possible breakdown; rather, the first one is as good as
> any other. But 

Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type

2015-04-22 Thread David G. Johnston
My apologies if much of this is already assumed knowledge by most
-hackers...I'm trying to learn from observation instead of, largely,
reading code in a foreign language.

On Wed, Apr 22, 2015 at 6:40 PM, Jeff Davis  wrote:

> Moving thread to -hackers.
>
> On Wed, Apr 8, 2015 at 11:18 PM, Jeff Davis  wrote:
> > That example was just for illustration. My other example didn't require
> > creating a table at all:
> >
> >
> SELECT a=b FROM (SELECT ''::text, '  ') x(a,b);
> >
> > it's fine with me if we want that to fail, but I don't think we're
> > failing in the right place, or with the right error message.
> >
> > I'm not clear on what rules we're applying such that the above query
> > should fail, but:
> >
> >
> SELECT ''::text='  ';

>
> > should succeed. Unknown literals are OK, but unknown column references
> > are not? If that's the rule, can we catch that earlier, and throw an
> > error like 'column reference "b" has unknown type'?
>

But the fact that column "b" has the data type "unknown" is only a warning
- not an error.

This seems to be a case of the common (or at least recently mentioned)
problem where type conversion only deals with data and not context.

http://www.postgresql.org/message-id/CADx9qBmVPQvSH3+2cH4cwwPmphW1mE18e=wumlfuc-qz-t7...@mail.gmail.com

Additional hinting regarding the column containing the offending data would
be welcomed by the community - but I suspect it is a non-trivial endeavor.


> Is the behavior of unknown literals vs. unknown column references
> documented anywhere? I tried looking here:
> http://www.postgresql.org/docs/devel/static/typeconv.html, but it
> doesn't seem to make the distinction between how unknown literals vs.
> unknown column references are handled.
>
> My understanding until now has been that unknown types are a
> placeholder while still inferring the types. But that leaves a few
> questions:
>
> 1. Why do we care whether the unknown is a literal or a column reference?
>

Apparently the difference is in when non-implicit casts can be used for
coercion - or, rather, when input functions can be used instead of casting
functions.

In SELECT '  '::text = 'a' the explicit cast between the implicit unknown
and text is used, while going through the subquery forces the planner to
locate an implicit cast between the explicit unknown and text.

The following fails no matter what you try because no casts exist from
unknown to integer:

SELECT a::int=b FROM (SELECT '1', 1) x(a,b);

but this too works - which is why the implicit cast concept above fails
(I'm leaving it since the thought process may help in understanding):

SELECT 1 = '1';

From which I infer that an unknown literal is allowed to be fed directly
into a type's input function to facilitate a direct coercion.

Writing this makes me wish for more precise terminology...is there
something already established here?  "Untyped" versus "unknown" makes sense
to me.  Untyped literals only exist within the confines of a single node
and can be passed through a type's input function to make them take on a
type.  If the untyped reference passes through the node without having been
assigned an explicit type, it is assigned the unknown type.

> 2. Unknown column references seem basically useless, because we will
> almost always encounter the "failure to find conversion" error, so why
> do we allow them?
>

At this point...backward compatibility?

I do get a warning in psql (9.3.6) from your original -bugs example

create table a(u) as select '1';

WARNING:  column "u" has type "unknown"
DETAIL:  Proceeding with relation creation anyway.

Related question: was there ever a time when the above failed instead of
just supplying a warning?

My git-fu is not super strong but the above warning was last edited by Tom
Lane back in 2003 (d8528630) though it was just a refactor - the warning
was already present.  I suppose after 12 years the "why" doesn't matter so
much...

create table b(i int);
insert into b select u from a;
ERROR:  failed to find conversion function from unknown to integer

Text appears to have a cast defined:

SELECT u::text FROM a;


> 3. If unknowns are just a placeholder, why do we return them to the
> client? What do we expect the client to do with it?


We do?  I suspect that it is effectively treated as if it were text by
client libraries.

My gut reaction is that if you feel strongly enough to add some additional
documentation or warnings/hints/details related to this topic, they would
probably get put in; but disallowing "unknown" as a first-class type is
likely to fail a cost-benefit evaluation.

Distinguishing between "untyped" literals and "unknown type" literals seems
a promising concept to aid in understanding the difference in the face of not
being able (or not wanting) to actually change the behavior.

David J.


[HACKERS] Code paths where LWLock should be released on failure

2015-04-22 Thread Michael Paquier
 Hi all,

After looking at bug #13128, I have been looking at the code around
LWLockAcquire/Release to see if there are similar issues elsewhere.
Here are my findings:

1) SimpleLruReadPage() holds a control lock at entry that will be held
at exit as well. However, SlruReportIOError() can report an error,
leaving the lock held. Shouldn't we release the control lock when a
failure happens?

2) The patch attached to #13128 fixes MarkAsPreparing(), but actually
twophase.c also does not release some locks in LockGXact().

3) PreCommit_Notify@async.c should release AsyncQueueLock on failure I
guess because it is called at transaction commit. At least it looks
safer this way.

4) TablespaceCreateDbspace does not release TablespaceCreateLock but
LWLockReleaseAll would do it when aborting its transaction, so no
changes done there (?).

5) In ReplicationSlotCreate@slot.c, I would think that
ReplicationSlotAllocationLock should be released when all the slots
are in use. Similarly, in ReplicationSlotDropAcquired,
ReplicationSlotAllocationLock should be released when !fail_softly.
SaveSlotToPath could also be made safer when a file could not be
created.

6) In dsm.c, dsm_create does not release
DynamicSharedMemoryControlLock when erroring out because there are too
many shared memory segments.

7) In shmem.c, ShmemInitStruct does not release ShmemIndexLock on OOM.
I guess that's fine in bootstrap mode, still...

8) In lock.c, LockRelease() does not release partitionLock when a
shared lock cannot be found. Similar thing with
FastPathGetRelationLockEntry().

9) In predicate.c, CreatePredicateLock() forgets to release
SerializablePredicateLockListLock and partitionLock in case of an OOM.
There is a similar issue with ReleaseOneSerializableXact(),
CheckForSerializableConflictOut() and
predicatelock_twophase_recover().

10) In relcache.c, RelCacheInitLock is not released in
RelationCacheInitFilePreInvalidate() after unlink() failure.

11) In AlterSystemSetConfigFile(), AutoFileLock should be released on failure.

All those findings add up to the attached patch. Comments are welcome.
Regards,
-- 
Michael
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b85a666..e8d6754 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -350,6 +350,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 		gxact = TwoPhaseState->prepXacts[i];
 		if (strcmp(gxact->gid, gid) == 0)
 		{
+			LWLockRelease(TwoPhaseStateLock);
 			ereport(ERROR,
 	(errcode(ERRCODE_DUPLICATE_OBJECT),
 	 errmsg("transaction identifier \"%s\" is already in use",
@@ -359,11 +360,14 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 
 	/* Get a free gxact from the freelist */
 	if (TwoPhaseState->freeGXacts == NULL)
+	{
+		LWLockRelease(TwoPhaseStateLock);
 		ereport(ERROR,
 (errcode(ERRCODE_OUT_OF_MEMORY),
  errmsg("maximum number of prepared transactions reached"),
  errhint("Increase max_prepared_transactions (currently %d).",
 		 max_prepared_xacts)));
+	}
 	gxact = TwoPhaseState->freeGXacts;
 	TwoPhaseState->freeGXacts = gxact->next;
 
@@ -497,16 +501,22 @@ LockGXact(const char *gid, Oid user)
 
 		/* Found it, but has someone else got it locked? */
 		if (gxact->locking_backend != InvalidBackendId)
+		{
+			LWLockRelease(TwoPhaseStateLock);
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 	 errmsg("prepared transaction with identifier \"%s\" is busy",
 			gid)));
+		}
 
 		if (user != gxact->owner && !superuser_arg(user))
+		{
+			LWLockRelease(TwoPhaseStateLock);
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
   errmsg("permission denied to finish prepared transaction"),
 	 errhint("Must be superuser or the user that prepared the transaction.")));
+		}
 
 		/*
 		 * Note: it probably would be possible to allow committing from
@@ -515,10 +525,13 @@ LockGXact(const char *gid, Oid user)
 		 * someone gets motivated to make it work.
 		 */
 		if (MyDatabaseId != proc->databaseId)
+		{
+			LWLockRelease(TwoPhaseStateLock);
 			ereport(ERROR,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
   errmsg("prepared transaction belongs to another database"),
 	 errhint("Connect to the database where the transaction was prepared to finish it.")));
+		}
 
 		/* OK for me to lock it */
 		gxact->locking_backend = MyBackendId;
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 2826b7e..1224e4d 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -839,9 +839,12 @@ PreCommit_Notify(void)
 			LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
 			asyncQueueFillWarning();
 			if (asyncQueueIsFull())
+			{
+LWLockRelease(AsyncQueueLock);
 ereport(ERROR,
 		(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 	  errmsg("too many notifications in the NOTIFY queue")));
+			}
 			nextNotify = asyncQueueAddEntries(nextNotify);
 			LWLockRelease(AsyncQueue

Re: [HACKERS] Code paths where LWLock should be released on failure

2015-04-22 Thread Tom Lane
Michael Paquier  writes:
> After looking at bug #13128, I have been looking at the code around
> LWLockAcquire/Release to see if there are similar issues elsewhere.
> Here are my findings:

IIRC, we automatically release all LWLocks at the start of error recovery,
so I rather doubt that any of this is necessary.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bogus WAL segments archived after promotion

2015-04-22 Thread Michael Paquier
On Mon, Apr 13, 2015 at 11:57 PM, Heikki Linnakangas  wrote:
> On 04/01/2015 07:12 PM, Bruce Momjian wrote:
>>
>> On Fri, Dec 19, 2014 at 10:26:34PM +0200, Heikki Linnakangas wrote:
>>>
>>> On 12/19/2014 02:55 PM, Heikki Linnakangas wrote:

 I'm thinking that we should add a step to promotion, where we scan
 pg_xlog for any segments higher than the timeline switch point, and
 remove them, or mark them with .done so that they are not archived.
 There might be some real WAL that was streamed from the primary, but not
 yet applied, but such WAL is of no interest to that server anyway, after
 it's been promoted. It's a bit disconcerting to zap WAL that's valid,
 even if doesn't belong to the current server's timeline history, because
 as a general rule it's good to avoid destroying evidence that might be
 useful in debugging. There isn't much difference between removing them
 immediately and marking them as .done, though, because they will
 eventually be removed/recycled anyway if they're marked as .done.
>>>
>>>
>>> This is what I came up with. This patch removes the suspect segments
>>> at timeline switch. The alternative of creating .done files for them
>>> would preserve more evidence for debugging, but OTOH it would also
>>> be very confusing to have valid-looking WAL segments in pg_xlog,
>>> with .done files, that in fact contain garbage.
>>>
>>> The patch is a bit longer than it otherwise would be, because I
>>> moved the code to remove a single file from RemoveOldXlogFiles() to
>>> a new function. I think that makes it more readable in any case,
>>> simply because it was so deeply indented in RemoveOldXlogFiles.
>>
>>
>> Where are we on this?
>
>
> I didn't hear any better ideas, so committed this now.

Finally looking at that... The commit log of b2a5545 is a bit
misleading. Segment files that were recycled during archive recovery
are not necessarily removed; they could be recycled as well during
promotion on the new timeline in line with what RemoveOldXlogFiles()
does. Hence I think that the comment on top of
RemoveNonParentXlogFiles() should be updated to reflect that like in
the patch attached.

Something minor: perhaps we could refactor xlogarchive.c to have
XLogArchiveCheckDone() and XLogArchiveIsBusy() use the new
XLogArchiveIsReady().
Regards,
-- 
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2580996..e0a8b81 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3596,7 +3596,8 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 }
 
 /*
- * Remove WAL files that are not part of the given timeline's history.
+ * Recycle or remove WAL files that are not part of the given timeline's
+ * history.
  *
  * This is called during recovery, whenever we switch to follow a new
  * timeline, and at the end of recovery when we create a new timeline. We

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Code paths where LWLock should be released on failure

2015-04-22 Thread Michael Paquier
On Thu, Apr 23, 2015 at 2:46 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> After looking at bug #13128, I have been looking at the code around
>> LWLockAcquire/Release to see if there are similar issues elsewhere.
>> Here are my findings:
>
> IIRC, we automatically release all LWLocks at the start of error recovery,
> so I rather doubt that any of this is necessary.

Perhaps this is purely cosmetic for most of this code, but not all... if
you have a look at #13128, not releasing TwoPhaseStateLock causes a
deadlock on startup when max_prepared_transactions does not have
enough slots. I have also been surprised by the inconsistencies
particularly in predicate.c or other places regarding LWLock releases.
Sometimes they are released on failure, sometimes not.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type

2015-04-22 Thread Jeff Davis
On Wed, 2015-04-22 at 20:35 -0700, David G. Johnston wrote:

> But the fact that column "b" has the data type "unknown" is only a
> warning - not an error.
> 
I get an error:

postgres=# SELECT '  '::text = 'a';
 ?column? 
----------
 f
(1 row)

postgres=# SELECT a=b FROM (SELECT ''::text, '  ') x(a,b);
ERROR:  failed to find conversion function from unknown to text

So that means the column reference "b" is treated differently than the
literal. Here I don't mean a reference to an actual column of a real
table, just an identifier ("b") that parses as a columnref.

Creating the table gives you a warning (not an error), but I think that
was a poor example for me to choose, and not important to my point.
> 
> This seems to be a case of the common problem (or, at least recently
> mentioned) where type conversion only deals with data and not context.
> 
> 
> http://www.postgresql.org/message-id/CADx9qBmVPQvSH3
> +2cH4cwwPmphW1mE18e=wumlfuc-qz-t7...@mail.gmail.com
> 
> 
I think that is a different problem. That's a runtime type conversion
error (execution time), and I'm talking about something happening at
parse analysis time.

> 
> but this too works - which is why the implicit cast concept above
> fails (I'm leaving it since the thought process may help in
> understanding):
> 
> 
> SELECT 1 = '1';
> 
> 
> From which I infer that an unknown literal is allowed to be fed
> directly into a type's input function to facilitate a direct coercion.

Yes, I believe that's what's happening. When we use an unknown literal,
it acts more like a value constructor and gets passed to the type's
input function. When it's a columnref, even if unknown, the system tries
to cast it and fails.

But that is very confusing. In the example at the top of this email, it
seems like the second query should be equivalent to the first, or even
that postgres should be able to rewrite the second into the first. But
the second query fails where the first succeeds.


> At this point...backward compatibility?

Backwards compatibility of what queries? I guess the ones that return
unknowns to the client or create tables with unknown columns?

> create table a(u) as select '1';
> 
> 
> WARNING: "column "u" has type "unknown"​
> DETAIL:  Proceeding with relation creation anyway.
> 
> 
> Related question: was there ever a time when the above failed instead
> of just supplying a warning?

Not that I recall.



> ​My gut reaction is if you feel strongly enough to add some additional
> documentation or warnings/hints/details related to this topic they
> probably would get put in; but disallowing "unknown" as first-class
> type is likely to fail to pass a cost-benefit evaluation.

I'm not proposing that we eliminate unknown. I just think columnrefs and
literals should behave consistently. If we really don't want unknown
columnrefs, it seems like we could at least throw a better error.

If we were starting from scratch, I'd also not return unknown to the
client, but we have to worry about the backwards compatibility.

> Distinguishing between "untyped" literals and "unknown type" literals
> seems promising concept to aid in understanding the difference in the
> face of not being able (or wanting) to actually change the behavior.

Not sure I understand that proposal, can you elaborate?

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers