Re: hot_standby_feedback vs excludeVacuum and snapshots

2018-06-07 Thread Simon Riggs
On 7 June 2018 at 22:19, Andres Freund  wrote:

> Wonder if the right thing here wouldn't be to instead transiently
> acquire an AEL lock during replay when truncating a relation?

The way AELs are replayed is generic; all AEL requests are handled that way.

So yes, you could invent a special case for this specific situation.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: why partition pruning doesn't work?

2018-06-07 Thread Ashutosh Bapat
On Fri, Jun 8, 2018 at 8:52 AM, Tom Lane  wrote:
> David Rowley  writes:
>> On 8 June 2018 at 03:43, Tom Lane  wrote:
>>> Maybe there's something I'm missing here, but I sort of hoped that this
>>> patch would nuke all the special-case code for Params in this area.
>>> Why is there any need to distinguish them from other stable expressions?
>
>> We need to know which Params exist in the Expr as if there are no
>> Params, or only external Params, then we can run-time prune during
>> startup of the executor.
>
> This does not refute my question.  Why doesn't the same logic apply
> to any stable expression?  That is, ISTM a Param is a special case
> of that.
>

+1.

I don't think we need to perform pruning at the start of execution,
but we could fold all the stable expressions to constants at that
time. The PARAM_EXEC parameters cannot be folded into constants at
execution start since they are not assigned any values yet. AFAIU,
expressions within a given node that use those parameters can be
folded into constants (if possible) during ExecRewind() on that node.
We have to perform pruning just before the (Merge)Append node scan starts.
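
For concreteness, the kind of stable-expression qual under discussion
would be something like this (schema invented purely for illustration):

    CREATE TABLE measurement (ts timestamptz, val int) PARTITION BY RANGE (ts);
    CREATE TABLE measurement_y2017 PARTITION OF measurement
        FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
    CREATE TABLE measurement_y2018 PARTITION OF measurement
        FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

    -- now() is stable, not immutable, so its value is only known at run time;
    -- the question is whether a qual like this can prune at executor startup,
    -- just as a qual on an external PARAM can.
    SELECT * FROM measurement WHERE ts >= now() - interval '1 day';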

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Performance regression with PostgreSQL 11 and partitioning

2018-06-07 Thread Tom Lane
David Rowley  writes:
> On 8 June 2018 at 08:22, Tom Lane  wrote:
>> I'm still of the opinion that find_appinfos_by_relids() needs to be
>> nuked from orbit.

> Yeah, I agree it's not nice that it pallocs an array then pfrees it
> again. adjust_appendrel_attrs and adjust_child_relids could probably
> just accept a RelIds of childrels and find the AppendRelInfos itself,
> similar to how I coded find_appinfos_by_relids.

Why do they need to accept any additional parameters at all?

The pre-v11 incarnation of those functions took a single AppendRelInfo,
specifying an exact translation from one parent relid to one child
relid.  The fundamental problem I've got with the current code, entirely
independently of any performance issues, is that it's completely unclear
-- or at least undocumented -- which translation(s) are supposed to occur.
If we assume that the code isn't 100% broken (which I'm hardly convinced
of, but we'll take it as read for the moment) then it must be that the
translations are constrained to be well-determined by the query structure
as represented by the totality of the AppendRelInfo data.  So I'm thinking
maybe we don't need those parameters at all.  In the pre-v11 situation,
we were saving lookups by passing the only applicable AppendRelInfo to
the translation functions.  But if the translation functions have to
look up the right translation anyway, let's just make them do that,
and create whatever infrastructure we have to have to make it fast.

> Do you see that as something for PG11?

We already broke the API of these functions for v11.  I don't much
want to break them again in v12.  Let's get it right the first time.

regards, tom lane



Re: assert in nested SQL procedure call in current HEAD

2018-06-07 Thread Andrew Gierth
> "Andrew" == Andrew Gierth  writes:

> "Joe" == Joe Conway  writes:

 Joe> My colleague Yogesh Sharma discovered an assert in nested SQL
 Joe> procedure calls after ROLLBACK is used. Minimal test case and
 Joe> backtrace below. I have not yet tried to figure out exactly what
 Joe> is going on beyond seeing that it occurs in pg_plan_query() where
 Joe> the comment says "Planner must have a snapshot in case it calls
 Joe> user-defined functions"...

 Andrew> https://www.postgresql.org/message-id/29608.1518533...@sss.pgh.pa.us

I added it to the open items list since nobody else seems to have taken
notice; from Tom's linked message it seems this should be Peter E's bag?

-- 
Andrew (irc:RhodiumToad)



Re: config.{guess,sub} updates

2018-06-07 Thread Tom Lane
Peter Eisentraut  writes:
> We didn't update those for beta1, as we usually do.  Should we do it for
> beta2?

Yes, if there are updates to be made.  Better late than never.

regards, tom lane



Re: Bug in either collation docs or code

2018-06-07 Thread Tom Lane
"David G. Johnston"  writes:
> On Thu, Jun 7, 2018 at 4:37 PM, Melanie Plageman 
> wrote:
>> I thought this would error out because the subquery's result is considered
>> implicit and, in this case, it seems you now have conflicting implicit
>> collations. However, this does not produce an error. What am I missing?

> Data, apparently...I got the same non-error result before inserting a
> record into test1; then I got the expected error.
> It's the function/operator that fails when faced with invalid input, not the
> planner, so the error requires data to provoke.

IIRC this was an intentional decision, made on the grounds that we
can't tell whether the function/operator actually cares about having
a determinate collation or not, so we have to leave it to execution of
that function/operator to complain or not.

If collation had been part of the system design to start with, we'd
probably have insisted on being able to determine this sooner.  But
it wasn't, and when we added it, the ability to throw an error sooner
did not seem worth breaking a lot of third-party code for.
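
To make that concrete, here is the behavior with the example schema from
the docs (assuming the de_DE, es_ES and fr_FR collations are available):

    CREATE TABLE test1 (a text COLLATE "de_DE", b text COLLATE "es_ES");

    -- With no rows, the comparison is never evaluated, so no error is raised:
    SELECT a < (SELECT 'foo' COLLATE "fr_FR") FROM test1;

    INSERT INTO test1 VALUES ('x', 'y');

    -- Now the comparison actually runs, and the operator complains that it
    -- cannot determine which collation to use:
    SELECT a < (SELECT 'foo' COLLATE "fr_FR") FROM test1;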

regards, tom lane



Re: Add RANGE with values and exclusions clauses to the Window Functions

2018-06-07 Thread Tom Lane
Peter Eisentraut  writes:
> On 2/4/18 13:10, Tom Lane wrote:
>> + 22013    E    ERRCODE_INVALID_PRECEDING_FOLLOWING_SIZE    invalid_preceding_following_size

> I was checking the new error codes in PostgreSQL 11 and came across
> this.  The original name in the SQL standard is
> INVALID_PRECEDING_OR_FOLLOWING_SIZE_IN_WINDOW_FUNCTION
> which is reasonable to abbreviate, but is there a reason why we lost the
> "or"?

It seemed like a reasonable abbreviation to me.  If you disagree,
feel free to change it.

regards, tom lane



Re: why partition pruning doesn't work?

2018-06-07 Thread Tom Lane
David Rowley  writes:
> On 8 June 2018 at 03:43, Tom Lane  wrote:
>> Maybe there's something I'm missing here, but I sort of hoped that this
>> patch would nuke all the special-case code for Params in this area.
>> Why is there any need to distinguish them from other stable expressions?

> We need to know which Params exist in the Expr as if there are no
> Params, or only external Params, then we can run-time prune during
> startup of the executor.

This does not refute my question.  Why doesn't the same logic apply
to any stable expression?  That is, ISTM a Param is a special case
of that.

regards, tom lane



Re: Loaded footgun open_datasync on Windows

2018-06-07 Thread Amit Kapila
On Fri, Jun 8, 2018 at 7:48 AM, Laurenz Albe 
wrote:

> Amit Kapila wrote:
> > On Wed, Jun 6, 2018 at 3:06 PM, Kuntal Ghosh 
> wrote:
> > > It seems the "#ifndef FRONTEND" restriction was added around
> > > pgwin32_open() for building libpq with Visual C++ or Borland C++.  The
> > > restriction was added in commit 422d4819 to build libpq with VC++[1].
> > > Later, in commit fd7c3f67e0bc4, the support for Borland C++ was also
> > > added.
> > >
> > > So, I'm not sure whether removing that restriction will work for all
> > > front-end modules.
> > >
> >
> > Thanks for doing investigation.  I agree with you that it appears from
> the old
> > discussion that we have added this restriction to build libpq/psql
> (basically frontend)
> > modules.  However, I tried to build those modules on windows by removing
> that
> > restriction and it doesn't give me any problem (except one extra
> argument error,
> > which can be easily resolved).  It might be that it gives problem with
> certain
> > version of MSVC.  I think if we want to pursue this direction, one needs
> to verify
> on various supported versions (of Windows) to see if things are okay.
>
> That's what the buildfarm is for, right?
>
> Speaking about the issue at hand, one way to make pg_test_fsync work is
> to just include c.h and remove the inclusion of postgres_fe.h, but I am
> not sure if that is the best way.
> > I am not sure if we have any other options to proceed in this case.
> >
> > Thoughts?
>
> I thought of a better way.
> pg_test_fsync is properly listed as a server application in the
> documentation,
> so I think it should be built as such, as per the attached patch.
>

-#include "postgres_fe.h"
+#include "postgres.h"

I don't think that a server application can use the backend environment unless
it is adapted to do so.  I think the application should have the capability to
deal with backend stuff like ereport, elog, etc. if we don't want to use the
FRONTEND environment.  The other server applications like pg_ctl.c,
initdb.c, etc. also use the FRONTEND environment.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Add RANGE with values and exclusions clauses to the Window Functions

2018-06-07 Thread Peter Eisentraut
On 2/4/18 13:10, Tom Lane wrote:
> diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
> index 1475bfe..9871d1e 100644
> *** a/src/backend/utils/errcodes.txt
> --- b/src/backend/utils/errcodes.txt
> *** Section: Class 22 - Data Exception
> *** 177,182 
> --- 177,183 
>   22P06    E    ERRCODE_NONSTANDARD_USE_OF_ESCAPE_CHARACTER          nonstandard_use_of_escape_character
>   22010    E    ERRCODE_INVALID_INDICATOR_PARAMETER_VALUE            invalid_indicator_parameter_value
>   22023    E    ERRCODE_INVALID_PARAMETER_VALUE                      invalid_parameter_value
> + 22013    E    ERRCODE_INVALID_PRECEDING_FOLLOWING_SIZE             invalid_preceding_following_size
>   2201B    E    ERRCODE_INVALID_REGULAR_EXPRESSION                   invalid_regular_expression
>   2201W    E    ERRCODE_INVALID_ROW_COUNT_IN_LIMIT_CLAUSE            invalid_row_count_in_limit_clause
>   2201X    E    ERRCODE_INVALID_ROW_COUNT_IN_RESULT_OFFSET_CLAUSE    invalid_row_count_in_result_offset_clause

I was checking the new error codes in PostgreSQL 11 and came across
this.  The original name in the SQL standard is

INVALID_PRECEDING_OR_FOLLOWING_SIZE_IN_WINDOW_FUNCTION

which is reasonable to abbreviate, but is there a reason why we lost the
"or"?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Loaded footgun open_datasync on Windows

2018-06-07 Thread Laurenz Albe
Amit Kapila wrote:
> On Wed, Jun 6, 2018 at 3:06 PM, Kuntal Ghosh  
> wrote:
> > It seems the "#ifndef FRONTEND" restriction was added around
> > pgwin32_open() for building libpq with Visual C++ or Borland C++.  The
> > restriction was added in commit 422d4819 to build libpq with VC++[1].
> > Later, in commit fd7c3f67e0bc4, the support for Borland C++ was also
> > added.
> > 
> > So, I'm not sure whether removing that restriction will work for all
> > front-end modules.
> > 
> 
> Thanks for doing investigation.  I agree with you that it appears from the old
> discussion that we have added this restriction to build libpq/psql (basically 
> frontend)
> modules.  However, I tried to build those modules on windows by removing that
> restriction and it doesn't give me any problem (except one extra argument 
> error,
> which can be easily resolved).  It might be that it gives problem with certain
> version of MSVC.  I think if we want to pursue this direction, one needs to 
> verify
> on various supported versions (of Windows) to see if things are okay.

That's what the buildfarm is for, right?

> Speaking about the issue at hand, one way to make pg_test_fsync work is to
> just include c.h and remove the inclusion of postgres_fe.h, but I am not sure
> if that is the best way.
> I am not sure if we have any other options to proceed in this case.
> 
> Thoughts?

I thought of a better way.
pg_test_fsync is properly listed as a server application in the documentation,
so I think it should be built as such, as per the attached patch.

Yours,
Laurenz Albe

From 032a0463072097e489f912337ea08f7373a4f107 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Fri, 8 Jun 2018 04:07:55 +0200
Subject: [PATCH] Build pg_test_fsync as backend code

Including "postgres.h" rather than "postgres_fe.h" makes
pg_test_fsync use pgwin32_open() on Windows, which is the
proper thing to do because it should test the behavior on
the backend side.
---
 src/bin/pg_test_fsync/pg_test_fsync.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index e6f7ef8557..c1e1d7ef38 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -3,7 +3,7 @@
  *		tests all supported fsync() methods
  */
 
-#include "postgres_fe.h"
+#include "postgres.h"
 
 #include 
 #include 
-- 
2.14.4



config.{guess,sub} updates

2018-06-07 Thread Peter Eisentraut
We didn't update those for beta1, as we usually do.  Should we do it for
beta2?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Column store in Greenplum

2018-06-07 Thread Haribabu Kommi
On Fri, Jun 8, 2018 at 7:34 AM, Asim R P  wrote:

> Hi,
>
> In the pluggable storage unconference session in Ottawa,
> column oriented storage was a key point of discussion.  We would like
> to share an overview of Greenplum's column store.  We hope that this
> will aid a discussion to carve out a pluggable storage interface.
> This is just an overview, source code is where the truth is:
> https://github.com/greenplum-db/gpdb
>
> Greenplum introduced "appendonly" as a table type to support row as
> well as column orientation.  Design goal for appendonly tables is to
> reduce I/O cost so that workloads involving bulk I/O (OLAP) run
> faster.  As the name suggests, tuples stored in the underlying files
> are immutable (no in-place updates / deletes).  The files can only be
> extended with new data.  We will focus only on column oriented
> appendonly tables in what follows.
>

Thanks for sharing the details of how Greenplum has implemented
columnar storage and its use cases. This will be helpful in creating
a better pluggable table access method.


> A CREATE TABLE SQL command to create a column oriented table:
>
> CREATE TABLE col_table (c1 int, c2 varchar) WITH (appendonly=true,
> orientation=column);
>

The current pluggable table access method provides the following syntax
to create a table with a specific access method [1].

CREATE TABLE ... [USING ACCESSMETHOD] ...

The access method must be created by the columnar storage extension.
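
Just for illustration, usage might eventually look like this (the extension
and access method names below are invented; no such extension exists yet):

    -- hypothetical columnar extension providing a table access method
    CREATE EXTENSION columnar_store;
    CREATE TABLE col_table (c1 int, c2 varchar) USING columnar_am;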


> Storage format:
>
> Each column of a column oriented table gets a dedicated file in the
> data directory.  A file contains a series of variable length blocks
> ("varblocks").  Each concurrent transaction writing to an appendonly
> table is allocated a dedicated set of files, one per column.  At the
> most, 128 concurrent transactions are allowed to write to the same
> appendonly table, leading to at the most 128 groups of files (one per
> column).  Let's assume two concurrent transactions inserted into the
> "col_table" in the above example.  The table will have two groups of
> files, each group containing two files (one per column).  A group of
> files is called "segment".  Segment 1 for "col_table" contains files
> ".1" (c1) and ".129" (c2).  Segment 2
> contains files ".2" (c1) and ".130" (c2).
>
> Appendonly tuples do not contain MVCC information (xmin/xmax).  An
> auxiliary table, pg_aoseg.pg_aoseg_, is created for each
> appendonly table.  It keeps track of the length of each segment file
> for the appendonly table.  The aoseg table is a heap table, normal
> MVCC rules apply.  This setup isolates transactions concurrently
> reading or writing to an appendonly table from each other.
>
> Similar to heap TID, appendonly TID is a 6-byte struct
> (appendonlytid.h).  It comprises a segment number and a row number.
> Appendonly TID is recorded in indexes without much distinction from
> heap TID.  Unlike heap TID, an appendonly TID is not sufficient to
> locate a row number within a column's file.  Another auxiliary table,
> block directory (pg_aoseg.pg_aoblckdir_), maps a row number to
> offsets in the segment.  The offset points to start of the varblock
> containing the row number.  Index lookups first obtain the appendonly
> TID from the index, then map the row number to its offset using block
> directory and then read the varblock from the file.  The aoblckdir
> table is heap, normal MVCC rules apply.
>
> UPDATE and DELETE:
>
> Updates and deletes are supported by maintaining a bitmap representing
> visibility of appendonly row numbers.  Note that visibility is an
> attribute of a row.  E.g. if row number 10 is invisible in
> "col_table", the column values corresponding to row number 10 in both
> ".1" as well as ".129" files should be
> considered invisible.  The bitmap is maintained in another auxiliary
> table - pg_aoseg.aovisimap, which is also a normal heap table.
> Deleting a row results in no change to the appendonly files.  The bit
> representing the deleted row number is set in the aovisimap table.
> UPDATE is delete followed by insert.  No update chains are maintained.
>
> Catalog changes:
>
> * pg_appendonly table contains OIDs of auxiliary tables for each
> appendonly table.
> * For each appendonly table, the auxiliary tables created are: aoseg,
> aovisimap and block directory.  Similar to toast tables, these have
> their own entries in pg_class with their own relfilenodes.
> * New pg_class.relstorage values 'a' for row oriented AO and 'c' for
> column oriented.
>

The current pluggable table access method interface provides a way to
create additional tables in pg_class with relkind 'E' (external table)
that can be accessed only by the storage access methods and not by the user.

Fujitsu has also developed its own columnar storage [2]. Currently it is
tightly coupled with a modified version of PostgreSQL, and we also want to
port our version to the pluggable table access method.

In our columnar storage, we also need to create additional relations to

Re: Performance regression with PostgreSQL 11 and partitioning

2018-06-07 Thread David Rowley
On 8 June 2018 at 08:22, Tom Lane  wrote:
> So that's basically what David's patch does, and it seems fine as far
> as it goes, although I disapprove of shoving the responsibility into
> setup_simple_rel_arrays() without so much as a comment change.
> I'd make a separate function for that, I think.  (BTW, perhaps instead
> we should do what the comment quoted above contemplates, and set up a
> link in the child's RelOptInfo?)

I had originally wanted to stuff these into RelOptInfo, but I
discovered it was not really possible because of how the inheritance
planner works. It replaces the parent with the child RelOptInfo for
each child and replans the query over and over, meaning we'd never
have a complete list of AppendRelInfos to work with.

> I'm still of the opinion that find_appinfos_by_relids() needs to be
> nuked from orbit.  It has nothing to recommend it either from the
> standpoint of performance or that of intellectual coherency (or maybe
> that problem is just inadequate documentation).  The places consuming
> its results are no better.

Yeah, I agree it's not nice that it pallocs an array then pfrees it
again. adjust_appendrel_attrs and adjust_child_relids could probably
just accept a RelIds of childrels and find the AppendRelInfos itself,
similar to how I coded find_appinfos_by_relids.

Do you see that as something for PG11?  Keep in mind, I did code the
patch in a way to minimise invasiveness having considered that PG11 is
now in beta.  I'm willing to write a patch that gets rid of
find_appinfos_by_relids completely. There are a few places, e.g.
create_partitionwise_grouping_paths that make use of the appinfos
multiple times, but the direct lookup is probably fast enough that
this would not matter.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Bug in either collation docs or code

2018-06-07 Thread David G. Johnston
On Thu, Jun 7, 2018 at 4:37 PM, Melanie Plageman 
wrote:

> CREATE TABLE test1 (
> a text COLLATE "de_DE",
> b text COLLATE "es_ES",
> ...
> );
>
> My thought was to add the following example:
>
> SELECT a < (select 'foo' COLLATE "fr_FR") FROM test1;
>
> I thought this would error out because the subquery's result is considered
> implicit and, in this case, it seems you now have conflicting implicit
> collations. However, this does not produce an error. What am I missing?
>

Data, apparently...I got the same non-error result before inserting a
record into test1; then I got the expected error.

It's the function/operator that fails when faced with invalid input, not the
planner, so the error requires data to provoke.

David J.


Re: Bug in either collation docs or code

2018-06-07 Thread Melanie Plageman
> You could mark the subquery's result with a collation like this:
>
> postgres=# SELECT 'c' COLLATE "de_DE" > (SELECT 'ç') COLLATE "es_ES";
> ERROR:  collation mismatch between explicit collations "de_DE" and "es_ES"
>
> I'm not sure if this behavior is considered a bug, but I also can't imagine
>> how it would be expected given the current documentation. It seems to me
>> one or the other should be updated.
>>
>
> It seems correct to me. It does say "An explicit collation derivation
> occurs when a COLLATE clause is used; all other collation derivations are
> implicit". A subquery falls under the "all other collation derivations"
> category. Perhaps we could make it more clear what the COLLATE clause binds
> to, especially with subqueries, but I'm not sure how exactly to phrase it.
> Perhaps an additional example with a subquery would help?
>

So, I tried coming up with an example with a subquery that explains what
the COLLATE clause binds to, and I realized I'm still not clear. I came up
with an example using the DDL that is in the docs:

CREATE TABLE test1 (
a text COLLATE "de_DE",
b text COLLATE "es_ES",
...
);

My thought was to add the following example:

SELECT a < (select 'foo' COLLATE "fr_FR") FROM test1;

I thought this would error out because the subquery's result is considered
implicit and, in this case, it seems you now have conflicting implicit
collations. However, this does not produce an error. What am I missing? The
result of the subquery has collation "fr_FR" and, if it's implicit, then I
shouldn't be able to compare it with test1.a, which has an implicit
collation of "de_DE".


Re: Spilling hashed SetOps and aggregates to disk

2018-06-07 Thread David Gershuni
As Serge mentioned, we’ve implemented spill-to-disk for SetOps and Aggregates 
at Salesforce. We were hitting OOMs often enough that this became a high 
priority for us. However, our current spill implementation is based on dynahash 
from 9.6, and we’re not happy with its performance (it was primarily an OOM 
stop-gap and was not focused on optimizing performance).  

Because of this, we’ve spec’d out a new spill-to-disk design for hash-based 
aggregates (and later SetOps), and we plan to begin implementation very soon.  
Since this is an important fix for the community as well, we would be happy 
(and would prefer) to share our spill-to-disk implementation. 

Our new spill approach is very similar to Jeff’s and Heikki’s and is designed 
to use simplehash.h. It’s based on an algorithm called “HybridHash with 
Pre-Partitioning” found in [1]. It may later make sense to implement the 
“HashSort” Algorithm from [1] as well, which works better for highly skewed 
grouping keys. The optimizer could eventually choose between the two based on 
the stats available. We also like Heikki’s suggestions to use logtape.c to 
reduce disk usage and a trie-based approach to control the size of partitions 
dynamically.

We’ve also been grappling with how to handle the implementation challenges 
pointed out in this thread. These include:
	1) tracking memory usage
	2) choosing a smart eviction policy (which is touched on in [2])
	3) serializing opaque user-defined transition values when eviction is
	   required

For 1), we plan to use our WithStats memory context, which Serge mentioned.
For 2), we plan to start with a simple non-eviction policy since we don’t have 
the stats to do anything smarter (i.e. insert until we fill the hash table, then 
spill until we finish processing the batch, with evictions only happening if a 
group’s transition value grows too large).
For 3), we don’t have a good solution yet. We could serialize/deserialize for 
built-in types and rely on users to provide serialize/deserialize functions for 
user-defined aggregates going forward.

But we’re open to suggestions :)

Regards,
David
Salesforce

[1] Revisiting Aggregation for Data Intensive Applications: A Performance Study
https://arxiv.org/pdf/1311.0059.pdf

[2] DB2 with BLU acceleration: so much more than just a column store
http://delivery.acm.org/10.1145/254/2536233/p1080-raman.pdf?ip=204.14.239.107=2536233=ACTIVE%20SERVICE=37B0A9F49C26EEFC%2E37B0A9F49C26EEFC%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1528414374_aeb9f862ae2acc26db305d591095e3f7


Re: [PATCH v16] GSSAPI encryption support

2018-06-07 Thread Nico Williams
On Fri, Jun 08, 2018 at 10:11:52AM +1200, Thomas Munro wrote:
> On Fri, Jun 8, 2018 at 9:00 AM, Nico Williams  wrote:
> > Cool!  Is there any reason that your patch for Travis and AppVeyor
> > integration is not just committed to master?
> 
> I think that's a good idea and I know that some others are in favour.
> The AppVeyor one is not good enough to propose just yet: it's
> cargo-culted and skips a test that doesn't pass and only runs the
> basic check tests (not contribs, isolation, TAP etc).  Fortunately
> Andrew Dunstan has recently figured out how to run the official build
> farm script inside AppVeyor[1], and it looks like we might be close to
> figuring out that one test that doesn't work.  That makes me wonder if
> we should get the Travis version to use the build farm scripts too.
> If we can get all of that sorted out, then yes, I would like to
> propose that we stick the .XXX.yml files in the tree.  Another thing
> not yet explored is Travis's macOS build support.

I use AppVeyor for Heimdal and for jq...  Maybe I can help out.

As for Travis' OS X support... the problem there is that their build
farm is very small, so using theirs means waiting and waiting.

> Someone might argue that we shouldn't depend on particular external
> services, or that there are other CI services out there and we should
> use those too/instead for some reason, or that we don't want all that
> junk at top level in the tree.  But it seems to me that as long as
> they're dot prefixed files, we could carry control files for any
> number of CI services without upsetting anyone.  Having them in the
> tree would allow anyone who has a publicly accessible git repo
> (bitbucket, GitHub, ...) to go to any CI service that interests them
> and enable it with a couple of clicks.

Carrying the .yml files causes no harm beyond dependence, but that's a
nice problem to have when the alternative is to not have a CI at all.

> Then cfbot would still need to create new branches automatically (that
> is fundamentally what it does: converts patches on the mailing list
> into branches on GitHub), but it wouldn't need to add those control
> files anymore, just the submitted patches.

You wouldn't need it to.  Instead the CF page could let submitters link
their CI status pages/buttons...

> > Is there a way to get my CF entry building in cfbot, or will it get
> > there when it gets there?
> 
> Urgh, due to a bug in my new rate limiting logic it stopped pushing
> new branches for a day or two.  Fixed, and I see it's just picked up
> your submission #1319.  Usually it picks things up within minutes (it
> rescans threads whenever the 'last mail' date changes on the
> Commitfest web page), and then also rechecks each submission every
> couple of days.

Thanks!

> In a nice demonstration of the complexity of these systems, I see that
> the build for your submission on Travis failed because apt couldn't
> update its package index because repo.mongodb.org's key has expired.
> Other recent builds are OK so that seems to be a weird transient
> failure; possibly out-of-date data in some cache, or some fraction of their
> repo server farm hasn't been updated yet or ... whatever.  Bleugh.

Oof.

> > I know I can just apply your patch, push to my fork, and have Travis and
> > AppVeyor build it.  And I just might, but cfbot is a neato service!
> 
> Thanks.  The next step is to show cfbot's results in the Commitfest
> app, and Magnus and I have started working on that.  I gave a talk
> about all this at PGCon last week, and the slides are up[2] in case
> anyone is interested:

OK.  I think that will be a huge improvement.  I find CF to be fantastic
as it is, but this will make it even better.

Thanks,

Nico
-- 



Re: why partition pruning doesn't work?

2018-06-07 Thread David Rowley
On 8 June 2018 at 03:43, Tom Lane  wrote:
> Maybe there's something I'm missing here, but I sort of hoped that this
> patch would nuke all the special-case code for Params in this area.
> Why is there any need to distinguish them from other stable expressions?
>
> IOW, I was hoping for the code to end up simpler, not more complicated.

We need to know which Params exist in the Expr as if there are no
Params, or only external Params, then we can run-time prune during
startup of the executor. Otherwise, we must leave the pruning until
during execution.

I really don't want to say goodbye to that optimisation as it's a
significant win to save having to initialise the subnodes for all the
useless partitions for OLTP type queries.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH v16] GSSAPI encryption support

2018-06-07 Thread Thomas Munro
On Fri, Jun 8, 2018 at 9:00 AM, Nico Williams  wrote:
> On Tue, Jun 05, 2018 at 12:16:31PM +1200, Thomas Munro wrote:
>> On Sat, May 26, 2018 at 6:58 AM, Robbie Harwood  wrote:
>> > Me and the bot are having an argument.  This should green Linux but I
>> > dunno about Windows.
>>
>> BTW if you're looking for a way to try stuff out on Windows exactly
>> the way cfbot does it without posting a new patch to the mailing list,
>> I put some instructions here:
>>
>> https://wiki.postgresql.org/wiki/Continuous_Integration
>>
>> The .patch there could certainly be improved (ideally, I think, it'd
>> run the whole build farm animal script) but it's a start.
>
> Cool!  Is there any reason that your patch for Travis and AppVeyor
> integration is not just committed to master?

I think that's a good idea and I know that some others are in favour.
The AppVeyor one is not good enough to propose just yet: it's
cargo-culted and skips a test that doesn't pass and only runs the
basic check tests (not contribs, isolation, TAP etc).  Fortunately
Andrew Dunstan has recently figured out how to run the official build
farm script inside AppVeyor[1], and it looks like we might be close to
figuring out that one test that doesn't work.  That makes me wonder if
we should get the Travis version to use the build farm scripts too.
If we can get all of that sorted out, then yes, I would like to
propose that we stick the .XXX.yml files in the tree.  Another thing
not yet explored is Travis's macOS build support.

Someone might argue that we shouldn't depend on particular external
services, or that there are other CI services out there and we should
use those too/instead for some reason, or that we don't want all that
junk at top level in the tree.  But it seems to me that as long as
they're dot prefixed files, we could carry control files for any
number of CI services without upsetting anyone.  Having them in the
tree would allow anyone who has a publicly accessible git repo
(bitbucket, GitHub, ...) to go to any CI service that interests them
and enable it with a couple of clicks.

Then cfbot would still need to create new branches automatically (that
is fundamentally what it does: converts patches on the mailing list
into branches on GitHub), but it wouldn't need to add those control
files anymore, just the submitted patches.

> Is there a way to get my CF entry building in cfbot, or will it get
> there when it gets there?

Urgh, due to a bug in my new rate limiting logic it stopped pushing
new branches for a day or two.  Fixed, and I see it's just picked up
your submission #1319.  Usually it picks things up within minutes (it
rescans threads whenever the 'last mail' date changes on the
Commitfest web page), and then also rechecks each submission every
couple of days.

In a nice demonstration of the complexity of these systems, I see that
the build for your submission on Travis failed because apt couldn't
update its package index because repo.mongodb.org's key has expired.
Other recent builds are OK so that seems to be a weird transient
failure; possibly out-of-date data in some cache, or some fraction of their
repo server farm hasn't been updated yet or ... whatever.  Bleugh.

> I know I can just apply your patch, push to my fork, and have Travis and
> AppVeyor build it.  And I just might, but cfbot is a neato service!

Thanks.  The next step is to show cfbot's results in the Commitfest
app, and Magnus and I have started working on that.  I gave a talk
about all this at PGCon last week, and the slides are up[2] in case
anyone is interested:

[1] 
https://www.postgresql.org/message-id/flat/0f3d44a1-1ac5-599c-3e15-16d058d54e9a%402ndQuadrant.com
[2] https://www.pgcon.org/2018/schedule/events/1234.en.html

-- 
Thomas Munro
http://www.enterprisedb.com



Column store in Greenplum

2018-06-07 Thread Asim R P
Hi,

In the pluggable storage unconference session in Ottawa,
column oriented storage was a key point of discussion.  We would like
to share an overview of Greenplum's column store.  We hope that this
will aid a discussion to carve out a pluggable storage interface.
This is just an overview, source code is where the truth is:
https://github.com/greenplum-db/gpdb

Greenplum introduced "appendonly" as a table type to support row as
well as column orientation.  Design goal for appendonly tables is to
reduce I/O cost so that workloads involving bulk I/O (OLAP) run
faster.  As the name suggests, tuples stored in the underlying files
are immutable (no in-place updates / deletes).  The files can only be
extended with new data.  We will focus only on column oriented
appendonly tables in what follows.

A CREATE TABLE SQL command to create a column oriented table:

CREATE TABLE col_table (c1 int, c2 varchar) WITH (appendonly=true,
orientation=column);

Storage format:

Each column of a column oriented table gets a dedicated file in the
data directory.  A file contains a series of variable length blocks
("varblocks").  Each concurrent transaction writing to an appendonly
table is allocated a dedicated set of files, one per column.  At the
most, 128 concurrent transactions are allowed to write to the same
appendonly table, leading to at the most 128 groups of files (one per
column).  Let's assume two concurrent transactions inserted into the
"col_table" in the above example.  The table will have two groups of
files, each group containing two files (one per column).  A group of
files is called "segment".  Segment 1 for "col_table" contains files
".1" (c1) and ".129" (c2).  Segment 2
contains files ".2" (c1) and ".130" (c2).

Appendonly tuples do not contain MVCC information (xmin/xmax).  An
auxiliary table, pg_aoseg.pg_aoseg_, is created for each
appendonly table.  It keeps track of the length of each segment file
for the appendonly table.  The aoseg table is a heap table, normal
MVCC rules apply.  This setup isolates transactions concurrently
reading or writing to an appendonly table from each other.

Similar to heap TID, appendonly TID is a 6-byte struct
(appendonlytid.h).  It comprises a segment number and a row number.
Appendonly TID is recorded in indexes without much distinction from
heap TID.  Unlike heap TID, an appendonly TID is not sufficient to
locate a row number within a column's file.  Another auxiliary table,
block directory (pg_aoseg.pg_aoblckdir_), maps a row number to
offsets in the segment.  The offset points to start of the varblock
containing the row number.  Index lookups first obtain the appendonly
TID from the index, then map the row number to its offset using block
directory and then read the varblock from the file.  The aoblckdir
table is heap, normal MVCC rules apply.

UPDATE and DELETE:

Updates and deletes are supported by maintaining a bitmap representing
visibility of appendonly row numbers.  Note that visibility is an
attribute of a row.  E.g. if row number 10 is invisible in
"col_table", the column values corresponding to row number 10 in both
".1" as well as ".129" files should be
considered invisible.  The bitmap is maintained in another auxiliary
table - pg_aoseg.aovisimap, which is also a normal heap table.
Deleting a row results in no change to the appendonly files.  The bit
representing the deleted row number is set in the aovisimap table.
UPDATE is delete followed by insert.  No update chains are maintained.

Catalog changes:

* pg_appendonly table contains OIDs of auxiliary tables for each
appendonly table.
* For each appendonly table, the auxiliary tables created are: aoseg,
aovisimap and block directory.  Similar to toast tables, these have
their own entries in pg_class with their own relfilenodes.
* New pg_class.relstorage values 'a' for row oriented AO and 'c' for
column oriented.

Compression:

Each column can be compressed.  Supported compression algorithms are
zlib, zstd, RLE (run length encoding), delta compression on top of
RLE.

Buffer management:

Backend private memory is used to buffer contents of appendonly files.
Block size (a multiple of 8K) can be specified per column.  The amount
of memory buffered per file is equal to the block size configured for
that column.

Executor touch points:

Column store offers access methods such as
aocs_beginscan(), aocs_getnext(), aocs_fetch(), aocs_rescan(),
aocs_insert().  One may come across a bunch of conditionals sprinkled
around executor code to invoke either heap, appendonly row or column
store access methods.  A better abstraction layer would have made the
code much cleaner but we don't have it today.

Scan of column oriented tables is performed by reading the projected
columns' files in a loop.  Note that the values stored in the files
contain only data, no MVCC information is stored.  Scan node returns a
tuple table slot, just like in case of heap tables.  The slot contains
virtual tuples.  The values and nulls arrays 

Re: hot_standby_feedback vs excludeVacuum and snapshots

2018-06-07 Thread Andres Freund
On 2018-06-07 14:19:18 -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-03-29 12:17:24 +0100, Greg Stark wrote:
> > I'm poking around to see debug a vacuuming problem and wondering if
> > I've found something more serious.
> > 
> > As far as I can tell the snapshots on HOT standby are built using a
> > list of running xids that the primary builds and puts in the WAL and
> > that seems to include all xids from transactions running in all
> > databases. The HOT standby would then build a snapshot and eventually
> > send the xmin of that snapshot back to the primary in the hot standby
> > feedback and that would block vacuuming tuples that might be visible
> > to the standby.
> 
> > Many ages ago Alvaro sweated blood to ensure vacuums could run for
> > long periods of time without holding back the xmin horizon and
> > blocking other vacuums from cleaning up tuples. That's the purpose of
> > the excludeVacuum flag in GetCurrentVirtualXIDs(). That's possible
> > because we know vacuums won't insert any tuples that queries might try
> > to view and also vacuums won't try to perform any sql queries on other
> > tables.
> 
> > I can't find anywhere that the standby snapshot building mechanism
> > gets this same information about which xids are actually vacuums that
> > can be ignored when building a snapshot. So I'm concerned that the hot
> > standby sending back its xmin would be effectively undermining this
> > mechanism and forcing vacuum xids to be included in the xmin horizon
> > and prevent vacuuming of tuples.
> 
> > Am I missing something obvious? Is this a known problem?
> 
> Maybe I'm missing something, but the running transaction data reported
> to the standby does *NOT* include anything about lazy vacuums - they
> don't have an xid. The reason there's PROC_IN_VACUUM etc isn't the xid,
> it's *xmin*, no?
> 
> We currently do acquire an xid when truncating the relation - but I
> think it'd be somewhat fair to argue that that's somewhat of a bug. The
> reason an xid is acquired is that we need to log AEL locks, and that
> currently means they have to be assigned to a transaction.
> 
> Given that the truncation happens at the end of VACUUM and it *NEEDS* to
> be present on the standby - otherwise the locking stuff is useless - I
> don't think the fix commited in this thread is correct.
> 
> Wonder if the right thing here wouldn't be to instead transiently
> acquire an AEL lock during replay when truncating a relation?

Isn't the fact that vacuum truncation requires an AEL, and that the
change committed today excludes those transactions from running xacts
records, flat out broken?

Look at:

void
ProcArrayApplyRecoveryInfo(RunningTransactions running)
...
/*
 * Remove stale locks, if any.
 *
 * Locks are always assigned to the toplevel xid so we don't need to care
 * about subxcnt/subxids (and by extension not about ->suboverflowed).
 */
StandbyReleaseOldLocks(running->xcnt, running->xids);

by excluding running transactions you have, as far as I can tell,
effectively removed the vacuum truncation AEL from the standby. Now that
only happens when a running xact record is logged, but as that happens
in the background...

I also don't understand why this change would be backpatched in the
first place. It's a relatively minor efficiency thing, no?


Greetings,

Andres Freund



Re: SHOW ALL does not honor pg_read_all_settings membership

2018-06-07 Thread Alvaro Herrera
On 2018-May-31, Michael Paquier wrote:

> On Thu, May 31, 2018 at 07:05:58PM +0100, Simon Riggs wrote:
> > Any objections to backpatch to v10?
> 
> A backpatch is acceptable in my opinion.

Agreed on backpatching.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH v16] GSSAPI encryption support

2018-06-07 Thread Nico Williams
On Tue, Jun 05, 2018 at 12:16:31PM +1200, Thomas Munro wrote:
> On Sat, May 26, 2018 at 6:58 AM, Robbie Harwood  wrote:
> > Me and the bot are having an argument.  This should green Linux but I
> > dunno about Windows.
> 
> BTW if you're looking for a way to try stuff out on Windows exactly
> the way cfbot does it without posting a new patch to the mailing list,
> I put some instructions here:
> 
> https://wiki.postgresql.org/wiki/Continuous_Integration
> 
> The .patch there could certainly be improved (ideally, I think, it'd
> run the whole build farm animal script) but it's a start.

Cool!  Is there any reason that your patch for Travis and AppVeyor
integration is not just committed to master?

Is there a way to get my CF entry building in cfbot, or will it get
there when it gets there?

I know I can just apply your patch, push to my fork, and have Travis and
AppVeyor build it.  And I just might, but cfbot is a neato service!

Nico
-- 



Code of Conduct committee: call for volunteers

2018-06-07 Thread Tom Lane
The proposed Postgres Code of Conduct [1] calls for an investigation
and enforcement committee, which is to be appointed by and ultimately
answerable to the core team, though no core members may sit on it.

The core team is pleased to announce that Stacey Haysler has agreed
to serve as the first Chair of the CoC committee.  Many of you know
Stacey from various conferences etc, and I don't feel that I need to
repeat her excellent qualifications for this task.

While we have a few names in mind for additional committee members,
we need more in order that the committee can represent a diverse
range of viewpoints and cultures.  Therefore we are calling for
volunteers, as indeed is required by the proposed CoC.  This call
will remain open until June 15; the committee membership will be
decided and announced before the end of June.

We encourage community members who, perhaps, have not previously
taken large roles in the community to step forward for this.
What is needed is not technical skills but empathy and the ability
to understand varying viewpoints.

If you are interested in serving, please write to c...@postgresql.org
(do not follow up to this message).  Please include answers to these
questions:

1. What interests you about being on the CoC Committee?

2. Have you been on another CoC Committee, or had a similar role at
another organization?  (Prior experience is not required, it's just
helpful to know everyone's background.)

3. What else would you like to tell us that is helpful for us to know
about your potential involvement with the CoC Committee?


regards, tom lane

[1] https://wiki.postgresql.org/wiki/Code_of_Conduct



Re: Performance regression with PostgreSQL 11 and partitioning

2018-06-07 Thread Tom Lane
Ashutosh Bapat  writes:
> On Wed, Jun 6, 2018 at 11:27 AM, David Rowley
>  wrote:
>> I was trying to be realistic for something we can do to fix v11. It's
>> probably better to minimise the risky surgery on this code while in
>> beta. What I proposed was intended to fix a performance regression new
>> in v11. I'm not sure what you're proposing has the same intentions.

> Agreed that we want a less risky fix at this stage. What I am worried
> is with your implementation there are two ways to get AppendRelInfo
> corresponding to a child, append_rel_list and append_rel_array. Right
> now we will change all the code to use append_rel_array but there is
> no guarantee that some code in future will use append_rel_list. Bugs
> would rise if these two get out of sync esp. considering
> append_rel_list is a list which can be easily modified. I think we
> should avoid that. See what we do to rt_fetch() and
> planner_rt_fetch(), but we still have two ways to get an RTE. That's a
> source of future bugs.

>> I'd prefer to get a committers thoughts on this before doing any further 
>> work.

> Yes. I think so too.

As the original author of the append_rel_list stuff, I do have a few
thoughts here.

The reason why append_rel_list is just a list, and not any more
complicated data structure, is alluded to in the comments for
find_childrel_appendrelinfo (which David's patch should have
changed, and did not):

 * This search could be eliminated by storing a link in child RelOptInfos,
 * but for now it doesn't seem performance-critical.  (Also, it might be
 * difficult to maintain such a link during mutation of the append_rel_list.)

There are a lot of places in prepjointree.c and planner.c that whack
around the append_rel_list contents more or less extensively.  It's
a lot easier in those places if the data is referenced by just one
list pointer.  I do not think we want to replace append_rel_list
with something that would make those functions' lives much harder.

However, so far as I can see, the append_rel_list contents don't change
anymore once we enter query_planner().  Therefore, it's safe to build
secondary indexing structure(s) to allow fast access beyond that point.
This is not much different from what we do with, eg, the rangetable
and simple_rte_array[].

So that's basically what David's patch does, and it seems fine as far
as it goes, although I disapprove of shoving the responsibility into
setup_simple_rel_arrays() without so much as a comment change.
I'd make a separate function for that, I think.  (BTW, perhaps instead
we should do what the comment quoted above contemplates, and set up a
link in the child's RelOptInfo?)

I'm still of the opinion that find_appinfos_by_relids() needs to be
nuked from orbit.  It has nothing to recommend it either from the
standpoint of performance or that of intellectual coherency (or maybe
that problem is just inadequate documentation).  The places consuming
its results are no better.

I was also pretty unhappy to discover, as I poked around in the code, that
recent partitioning patches have introduced various assumptions about the
ordering of the append_rel_list.  It's bad enough that those exist at all;
it's worse that they're documented, if at all, only beside the code that
will fail (usually silently) if they're violated.  I do not find this
acceptable.  If we cannot avoid these assumptions, they need to be
documented more centrally, like in the relation.h block comment for struct
AppendRelInfo.

regards, tom lane



Re: computing completion tag is expensive for pgbench -S -M prepared

2018-06-07 Thread Andres Freund
On 2018-06-07 20:34:39 +0100, Simon Riggs wrote:
> On 7 June 2018 at 20:27, Tom Lane  wrote:
> > Simon Riggs  writes:
> >> If we're going to compress the protocol, it seems sensible to remove
> >> extraneous information first.
> >
> > Breaking the wire protocol was nowhere in this thread.
> 
> No, it wasn't. But there is another thread on the subject of
> compressing libpq, which is what I was referring to.
> 
> Andres' patch is clearly very efficient at adding the SELECT tag. I am
> questioning if we can remove that need altogether.

That'd be a wire protocol break. We'd have to add compatibilities for
both things in the client, wait a couple years, and then change. Or we
could make it an optional thing based on a client option passed at
connect. Which'd also take a good while.  Those seem extremely
disproportionately complicated solutions for the problem.  Nor can I
believe that a "SELECT " in the resultset is a meaningful space issue,
making it even worthwhile to break compat in the first place.

Greetings,

Andres Freund



cursors with prepared statements

2018-06-07 Thread Peter Eisentraut
I have developed a patch that allows declaring cursors over prepared
statements:

DECLARE cursor_name CURSOR FOR prepared_statement_name
   [ USING param, param, ... ]

This is an SQL standard feature.  ECPG already supports it (with
different internals).

Internally, this just connects existing functionality in different ways,
so it doesn't really introduce anything new.

One point worth pondering is how to pass the parameters of the prepared
statements.  The actual SQL standard syntax would be

DECLARE cursor_name CURSOR FOR prepared_statement_name;
OPEN cursor_name USING param, param;

But since we don't have the OPEN statement in direct SQL, it made sense
to me to attach the USING clause directly to the DECLARE statement.
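
So a usage sketch, based on the example in the patch's documentation
(assuming the produce table exists), would be:

    PREPARE p1 AS SELECT name, price FROM produce WHERE color = $1;
    BEGIN;
    DECLARE c2 CURSOR FOR p1 USING 'green';
    FETCH ALL FROM c2;
    CLOSE c2;
    COMMIT;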

Curiously, the direct EXECUTE statement uses the non-standard syntax

EXECUTE prep_stmt (param, param);

instead of the standard

EXECUTE prep_stmt USING param, param;

I tried to consolidate this.  But using

DECLARE c CURSOR FOR p (foo, bar)

leads to parsing conflicts (and looks confusing?), and instead allowing
EXECUTE + USING leads to a mess in the ECPG parser that exhausted me.
So I'm leaving it as is for now and might give supporting EXECUTE +
USING another try later on.

When looking at the patch, some parts will look easier through git diff -w.

And the changes in the ECPG parser are needed because ECPG already
supported that syntax separately, but now it needs to override the rules
from the main parser instead.  That stuff has test coverage, fortunately.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From c1e8ecf95599a9085e5f16bcd4aab3f13a2d6800 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 7 Jun 2018 11:46:16 -0400
Subject: [PATCH] Cursors over prepared statements

Add command variant

DECLARE cursor_name CURSOR FOR prepared_statement_name [ USING param, 
param, ... ]

to open a cursor over a previously defined prepared statement.
---
 doc/src/sgml/ref/declare.sgml  |  37 +++
 src/backend/commands/portalcmds.c  | 109 +++--
 src/backend/commands/prepare.c |  16 ++-
 src/backend/parser/analyze.c   |   2 +-
 src/backend/parser/gram.y  |  24 +
 src/include/commands/prepare.h |   3 +
 src/interfaces/ecpg/preproc/check_rules.pl |   3 +
 src/interfaces/ecpg/preproc/ecpg.addons|  63 +++-
 src/interfaces/ecpg/preproc/ecpg.trailer   |  65 
 src/interfaces/ecpg/preproc/ecpg.type  |   1 -
 src/interfaces/ecpg/preproc/parse.pl   |   2 +
 src/test/regress/expected/portals.out  |  54 ++
 src/test/regress/sql/portals.sql   |  40 
 13 files changed, 309 insertions(+), 110 deletions(-)

diff --git a/doc/src/sgml/ref/declare.sgml b/doc/src/sgml/ref/declare.sgml
index 34ca9df243..2b127bdd6a 100644
--- a/doc/src/sgml/ref/declare.sgml
+++ b/doc/src/sgml/ref/declare.sgml
@@ -28,6 +28,9 @@
 
 DECLARE name [ BINARY ] [ 
INSENSITIVE ] [ [ NO ] SCROLL ]
 CURSOR [ { WITH | WITHOUT } HOLD ] FOR query
+
+DECLARE name [ BINARY ] [ 
INSENSITIVE ] [ [ NO ] SCROLL ]
+CURSOR [ { WITH | WITHOUT } HOLD ] FOR prepared_statement [ USING parameter [, ...] ]
 
  
 
@@ -130,6 +133,31 @@ Parameters
  
 

+
+   
+prepared_statement
+
+ 
+  The name of the prepared statement (created with ) which will provide the rows to be returned by
+  the cursor.  The prepared statement is restricted to contain the same
+  kinds of statements as mentioned under query above.
+ 
+
+   
+
+   
+parameter
+
+ 
+  The actual value of a parameter to the prepared statement.  This
+  must be an expression yielding a value that is compatible with
+  the data type of this parameter, as was determined when the
+  prepared statement was created.
+ 
+
+   
   
 
   
@@ -313,6 +341,14 @@ Examples
See  for more
examples of cursor usage.
   
+
+  
+   To declare a cursor via a prepared statement:
+
+PREPARE p1 AS SELECT name, price FROM produce WHERE color = $1;
+DECLARE c2 CURSOR FOR p1 USING 'green';
+
+  
  
 
  
@@ -343,6 +379,7 @@ See Also
 
   

+   


   
diff --git a/src/backend/commands/portalcmds.c 
b/src/backend/commands/portalcmds.c
index 568499761f..6c5b274b51 100644
--- a/src/backend/commands/portalcmds.c
+++ b/src/backend/commands/portalcmds.c
@@ -25,6 +25,7 @@
 
 #include "access/xact.h"
 #include "commands/portalcmds.h"
+#include "commands/prepare.h"
 #include "executor/executor.h"
 #include "executor/tstoreReceiver.h"
 #include "rewrite/rewriteHandler.h"
@@ -44,9 +45,13 @@ PerformCursorOpen(DeclareCursorStmt *cstmt, ParamListInfo 
params,
 {
Query  *query = castNode(Query, cstmt->query);
List   *rewritten;
-   PlannedStmt *plan;
+   PlannedStmt *plan = NULL;
+   

Re: computing completion tag is expensive for pgbench -S -M prepared

2018-06-07 Thread Simon Riggs
On 7 June 2018 at 20:27, Tom Lane  wrote:
> Simon Riggs  writes:
>> If we're going to compress the protocol, it seems sensible to remove
>> extraneous information first.
>
> Breaking the wire protocol was nowhere in this thread.

No, it wasn't. But there is another thread on the subject of
compressing libpq, which is what I was referring to.

Andres' patch is clearly very efficient at adding the SELECT tag. I am
questioning if we can remove that need altogether.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: computing completion tag is expensive for pgbench -S -M prepared

2018-06-07 Thread Tom Lane
Simon Riggs  writes:
> If we're going to compress the protocol, it seems sensible to remove
> extraneous information first.

Breaking the wire protocol was nowhere in this thread.

regards, tom lane



Re: computing completion tag is expensive for pgbench -S -M prepared

2018-06-07 Thread Simon Riggs
On 7 June 2018 at 19:20, Andres Freund  wrote:
> On 2018-06-07 11:40:48 +0100, Simon Riggs wrote:
>> On 7 June 2018 at 11:29, Pavel Stehule  wrote:
>>
>> >> Do we actually need the completion tag at all? In most cases??
>> >
>> >
>> > affected rows is taken from this value on protocol level
>>
>> I didn't mean we should remove the number of rows. Many things rely on that.
>
> How do you mean it then? We can't really easily change how we return
> that value on the protocol level, and the command tag is basically just
> returned as a string in the protocol.  If we were to design the protocol
> today I'm sure we just would declare the rowcount and affected rowcounts
> separate fields or something, but ...

I meant removing the pointless noise word at the start of the tag,
which few clients care about.

I was thinking of just returning "SQL" instead, but that wasn't after
much thought.

But now that I think about it more, returning any fixed string, "SQL" or
"SELECT", in the protocol seems like a waste of bandwidth and a
potential route in for anyone trying to decrypt the stream.

If we're going to compress the protocol, it seems sensible to remove
extraneous information first.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: computing completion tag is expensive for pgbench -S -M prepared

2018-06-07 Thread Andres Freund
On 2018-06-07 17:01:47 +1200, David Rowley wrote:
> On 7 June 2018 at 16:13, Andres Freund  wrote:
> > in PortalRun().  That's actually fairly trivial to optimize - we don't
> > need the full blown snprintf machinery here.  A quick benchmark
> > replacing it with:
> >
> >memcpy(completionTag, "SELECT ", sizeof("SELECT "));
> >pg_lltoa(nprocessed, completionTag + 7);
> 
> I'd also noticed something similar with some recent benchmarks I was
> doing for INSERTs into partitioned tables. In my case I saw as high as
> 0.7% of the time spent building the INSERT tag. So I think it's worth
> fixing this.

I'm kinda surprised that I never noticed this until recently. I wonder
if there's a glibc or compiler regression causing this. There are some
string optimization passes; it could be that they now do less.


> I think it would be better to invent a function that accepts a
> CmdType, int64 and Oid that copies the tag into the supplied buffer,
> then make a more generic change that also replaces the code in
> ProcessQuery() which builds the tag. I'm sure there must be some way
> to get the CmdType down to the place you've patched so we can get rid
> of the if (strcmp(portal->commandTag, "SELECT") == 0) line too.

Generally sounds reasonable, but I'm not sure it's worth doing the
surgery to avoid the strcmp(). That's a larger change that's somewhat
independent...
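
For illustration, a minimal sketch of the kind of helper David describes
could look like this (the name, the argument list and the reliance on
pg_lltoa() are assumptions, not the actual patch):

#define SELECT_TAG     "SELECT "
#define SELECT_TAG_LEN (sizeof(SELECT_TAG) - 1)

/* Hypothetical helper: build the command completion tag without snprintf(). */
static void
BuildCompletionTag(char *tag, CmdType cmdtype, uint64 nprocessed)
{
	switch (cmdtype)
	{
		case CMD_SELECT:
			/* copy the fixed prefix, then append the row count plus NUL */
			memcpy(tag, SELECT_TAG, SELECT_TAG_LEN);
			pg_lltoa((int64) nprocessed, tag + SELECT_TAG_LEN);
			break;
		default:
			/* INSERT/UPDATE/DELETE would get the same treatment */
			break;
	}
}

Spelling the prefix length only once via the #define also avoids writing
the same constant as both sizeof("SELECT ") and 7.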


Greetings,

Andres Freund



Re: computing completion tag is expensive for pgbench -S -M prepared

2018-06-07 Thread Andres Freund
On 2018-06-07 10:30:14 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > ...  That's actually fairly trivial to optimize - we don't
> > need the full blown snprintf machinery here.  A quick benchmark
> > replacing it with:
> 
> >memcpy(completionTag, "SELECT ", sizeof("SELECT "));
> >pg_lltoa(nprocessed, completionTag + 7);
> 
> While I don't have any objection to this change if the speedup is
> reproducible, I do object to spelling the same constant as
> 'sizeof("SELECT ")' and '7' on adjacent lines ...

Hah, yes. Nor would I want to keep the #if 0 around it ;).  I mostly
wanted to know whether others can reproduce the effect - the actual
patch would need to be bigger and affect more places.

Greetings,

Andres Freund



Re: Transform for pl/perl

2018-06-07 Thread Tom Lane
ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
> Peter Eisentraut  writes:
>> The way I understand it, it's only how things are passed around
>> internally.  Nothing is noticeable externally, and so there is no
>> backward compatibility issue.
>> 
>> At least that's how I understand it.  So far this is only a claim by one
>> person.  I haven't seen anything conclusive about whether there is an
>> actual issue.

> It's not just how things are passed internally, but how they are passed
> to pl/perl functions with a jsonb transform: JSON scalar values at the
> top level (strings, numbers, booleans, null) get passed as a reference
> to a reference to the value, e.g. \\42 instead of 42.  The reason the
> current tests don't pick this up is that they don't check the value
> inside the pl/perl functions, only that they roundtrip back to jsonb,
> and the plperl to jsonb transform recursively dereferences references.

Yeah, the reason this is important is that it affects what the plperl
function body sees.

> Another side effect of the recursive dereferencing is that returning
> undef from the perl function returns an SQL NULL while returning a
> reference to undef (\undef) returns a jsonb null.

Hm, I think you're blaming the wrong moving part there.  The way the
transform logic is set up (e.g., in plperl_sv_to_datum()), it's
impossible for a transform function to return a SQL null; the decision by
plperl_sv_to_datum as to whether or not the output will be a SQL null is
final.  (Perhaps that was a mistake, but changing the transform function
API seems like a rather Big Deal.)  So if we think that \undef ought to
produce a SQL null, the thing to do is move the dereferencing loop to
the beginning of plperl_sv_to_datum, and then \undef would produce NULL
whether this transform is installed or not.  I don't have a well-informed
opinion on whether that's a good idea, but I tend to the view that it is.
Right now the case produces an error, and not even a very sane one:

regression=# create function foo() returns int language plperlu 
regression-# as '\undef';
CREATE FUNCTION
regression=# select foo();
ERROR:  PL/Perl function must return reference to hash or array
CONTEXT:  PL/Perl function "foo"

so there's not really a compatibility break if we redefine it.
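
For concreteness, the dereferencing loop in question is tiny; hoisted to
the top of plperl_sv_to_datum() it could look roughly like this (a sketch
only; the helper name is made up):

/* Strip any number of reference layers so that \undef, \\42 and the like
 * are treated the same as their underlying values before the null check
 * and type dispatch.  Blessed references are left alone. */
static SV *
strip_references(SV *sv)
{
	while (sv != NULL && SvROK(sv) && !sv_isobject(sv))
		sv = SvRV(sv);
	return sv;
}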

regards, tom lane



Re: POC: GROUP BY optimization

2018-06-07 Thread Teodor Sigaev

Yes. But again, this description is a bit short. First, it works after
the first patch and might get some preordered leading pathkeys. Second,
it tries to match the ORDER BY clause order if there are no preordered
leading pathkeys from the first patch (that was introduced in v7). And
third, if there is a tail of pathkeys left unmatched by the previous
stages, it will reorder that tail.



OK, I haven't looked at v7 yet, but if I understand correctly it tries
to maintain the ordering as much as possible? Does that actually help? I
mean, the incremental sort patch allows the sorting to happen by pieces,
but here we still need to sort all the data, right?

Can you give an example demonstrating the benefit?


See tst.sql. Queries are marked with opt (optimization is on) and noopt.

Query 1: select count(*) from btg group by v, r;
Query 2: select count(*) from btg group by n, v, r order by n;

For both queries it's possible to reorder the v and r columns; the n
column has a single distinct value.


On my laptop:
Query 1opt vs 1noopt: 3177.500 ms vs 6604.493 ms
  2opt vs 2noopt: 5800.307 ms vs 7486.967 ms

So, what we see:
1) For query 1 the optimization gives 2 times better performance, for query 2
only 30%. If column 'n' were unique then the times for queries 1 and 2 should
be the same. We could add a check for preordered pathkeys in
get_cheapest_group_keys_order(), and if estimate_num_groups(reordered pathkeys)
is close to 1 then we could skip reordering the tail of the pathkeys.


2) The planning cost is the same for all queries. So cost_sort() doesn't even
take the number of columns into account.
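
A back-of-the-envelope sketch of a column-aware estimate (not what v8/v9
implements; ndistincts[] and funccosts[] are assumed inputs coming from
statistics and operator cost lookups) could look like:

static Cost
estimate_group_keys_comparison_cost(double tuples, int ncols,
									const double *ndistincts,
									const Cost *funccosts)
{
	Cost	per_cmp = 0.0;
	double	prob_reach = 1.0;	/* chance a comparison gets to this column */
	int		i;

	for (i = 0; i < ncols; i++)
	{
		per_cmp += prob_reach * (2.0 * cpu_operator_cost + funccosts[i]);
		/* later columns are compared only when all earlier ones are equal */
		prob_reach /= Max(ndistincts[i], 1.0);
	}
	return per_cmp * tuples * LOG2(tuples);
}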



FWIW I think it would be useful to have "development GUC" that would
allow us to enable/disable these options during development, because
that makes experiments much easier. But then remove them before commit.
Added in v9: debug_enable_group_by_match_order_by and
debug_enable_cheapest_group_by. I also checked compatibility with the
incremental sort patch, and everything works except a small merge conflict,
which could be resolved right before committing.


Next, I had a look at cost_incremental_sort() provided by the incremental sort
patch and, it's a pity, it doesn't solve our problem with the impact of the
per-column comparison function's cost and the number of its calls.



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index b22b36ec0e..c073eb061a 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -686,7 +686,18 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 
 			if (opcintype == cur_em->em_datatype &&
 equal(expr, cur_em->em_expr))
-return cur_ec;	/* Match! */
+			{
+/*
+ * Match!
+ *
+ * Copy sortref if it wasn't set yet, it's possible if ec was
+ * constructed from WHERE clause, ie it doesn't have target
+ * reference at all
+ */
+if (cur_ec->ec_sortref == 0 && sortref > 0)
+	cur_ec->ec_sortref = sortref;
+return cur_ec;
+			}
 		}
 	}
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ec66cb9c3c..4f93afdebc 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -26,6 +26,7 @@
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -327,6 +328,58 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+/*
+ * Reorder GROUP BY pathkeys and clauses to match order of pathkeys. Function
+ * returns new lists,  original GROUP BY lists stay untouched.
+ */
+int
+group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys,
+			   List **group_clauses)
+{
+	List		*new_group_pathkeys= NIL,
+*new_group_clauses = NIL;
+	ListCell	*key;
+	int			n;
+
+	if (pathkeys == NIL || *group_pathkeys == NIL)
+		return 0;
+
+	/*
+	 * For each pathkey it tries to find corresponding GROUP BY pathkey and
+	 * clause.
+	 */
+	foreach(key, pathkeys)
+	{
+		PathKey			*pathkey = (PathKey *) lfirst(key);
+		SortGroupClause	*sgc;
+
+		/*
+		 * Pathkey should use the same allocated struct, so, equiality of
+		 * pointers is enough
+		 */
+		if (!list_member_ptr(*group_pathkeys, pathkey))
+			break;
+
+		new_group_pathkeys = lappend(new_group_pathkeys, pathkey);
+
+		sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref,
+	  *group_clauses);
+		new_group_clauses = lappend(new_group_clauses, sgc);
+	}
+
+	n = list_length(new_group_pathkeys);
+
+	/*
+	 * Just append the rest of pathkeys and clauses
+	 */
+	*group_pathkeys = list_concat_unique_ptr(new_group_pathkeys,
+ *group_pathkeys);
+	*group_clauses = list_concat_unique_ptr(new_group_clauses,
+			*group_clauses);
+
+	return n;
+}
+
 /*
  * 

Re: why partition pruning doesn't work?

2018-06-07 Thread Tom Lane
David Rowley  writes:
> On 6 June 2018 at 18:05, Amit Langote  wrote:
>> I wonder why we need to create those Bitmapsets in the planner?  Why not
>> in ExecSetupPartitionPruneState()?  For example, like how
>> context->exprstates is initialized.

> That seems like a good idea.  Certainly much better than working them
> out each time we prune.

> v3 patch attached.

Maybe there's something I'm missing here, but I sort of hoped that this
patch would nuke all the special-case code for Params in this area.
Why is there any need to distinguish them from other stable expressions?

IOW, I was hoping for the code to end up simpler, not more complicated.

regards, tom lane



Re: POC: GROUP BY optimization

2018-06-07 Thread Teodor Sigaev
Again, agreed. If we have a fixed order of columns (ORDER BY) then we should
not try to reorder it. The current patch follows that, if I didn't make a
mistake.




This part seems to be more a misunderstanding between me and Claudio. I believe 
Claudio was referring to the column order in a GROUP BY, not ORDER BY. In which 
case we don't add any Sort, of course.

I hope so



I'm still opposed to adding arbitrary handicap to prioritize the order specified 
by user, for the reasons I explained before. We should aim to make the 
heuristics/costing reliable enough to make this unnecessary.

+1

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: POC: GROUP BY optimization

2018-06-07 Thread Tomas Vondra

On 06/07/2018 03:41 PM, Teodor Sigaev wrote:
>>

... snip ...

>>

Prioritization of the user-provided order can be as simple as giving
that comparison_cost a small handicap.


I see no point in doing that, and I don't recall a single place in the
planner where we do that. If the user specified ORDER BY, we'll slap an
explicit Sort on top when needed (which acts as the handicap, but in a
clear way). Otherwise we don't do such things - it'd be just plain
confusing (consider "ORDER BY a,b" vs. "ORDER BY b,c" with same data
types, ndistinct etc. but unexpectedly different costs). Also, what
would be a good value for the handicap?


Again, agreed. If we have a fixed order of columns (ORDER BY) then we should
not try to reorder it. The current patch follows that, if I didn't make a
mistake.




This part seems to be more a misunderstanding between me and Claudio. I 
believe Claudio was referring to the column order in a GROUP BY, not 
ORDER BY. In which case we don't add any Sort, of course.


I'm still opposed to adding arbitrary handicap to prioritize the order 
specified by user, for the reasons I explained before. We should aim to 
make the heuristics/costing reliable enough to make this unnecessary.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: processSQLNamePattern() analog

2018-06-07 Thread Sergey Cherkashin
Thanks for the answer.

On Ср, 2018-06-06 at 13:06 -0400, Tom Lane wrote:
> Sergey Cherkashin  writes:
> > 
> > The command "\dA" (as well as several commands that I write) accept
> > the access method name template. The resulting template is
> > processed by the processSQLNamePattern () function, which means
> > that a template with a schema can be fed to the input. But since
> > the access method does not have schema, it's needed to handle
> > somehow a command like "\dA foo. *". At this point, the command
> > will display a full list of access methods, not paying attention to
> > the presence of the schema name in the template.
> I don't see a particular problem with this.  The \d commands in
> general
> are meant to accept handwritten input, so they should err on the side
> of being forgiving.  I do not see how it would be an improvement to
> throw an error complaining that the pattern shouldn't have been
> schema-qualified for this particular type of name, nor would the
> alternative possibility that "*.*" silently refuses to match anything
> be a great idea.

OK, I leave it alone.

> > I also need a possibility to handle templates of type
> > "schema.table.column",
> Why?  I think you'd be best off not going there, because it will
> create confusion against the SQL-standard-mandated possibility of
> "database.schema.table".  We don't really support that notation
> today in most contexts, but it might be a problem in future.

As for the new function, I meant the possibility of getting back from
the function information about which parts ("schema" or "table") it
split the pattern into, how many of these parts there are, etc. That
is, to make it possible to tell outside the function whether we are
dealing with a schema name or a column name. Or maybe we should have a
way to limit which parts we can work with.
But at the moment I think it is
really enough to let users manage the patterns themselves.

Regards,
Sergey Cherkashin.




Re: computing completion tag is expensive for pgbench -S -M prepared

2018-06-07 Thread Tom Lane
Andres Freund  writes:
> ...  That's actually fairly trivial to optimize - we don't
> need the full blown snprintf machinery here.  A quick benchmark
> replacing it with:

>memcpy(completionTag, "SELECT ", sizeof("SELECT "));
>pg_lltoa(nprocessed, completionTag + 7);

While I don't have any objection to this change if the speedup is
reproducible, I do object to spelling the same constant as
'sizeof("SELECT ")' and '7' on adjacent lines ...

regards, tom lane



Re: Transform for pl/perl

2018-06-07 Thread Dagfinn Ilmari Mannsåker
Peter Eisentraut  writes:

> On 6/6/18 12:14, Alvaro Herrera wrote:
>> On 2018-May-17, Peter Eisentraut wrote:
>> 
>>> The items that are still open from the original email are:
>>>
>>> 2) jsonb scalar values are passed to the plperl function wrapped in not
>>>one, but _two_ layers of references
>>>
>>> 3) jsonb numeric values are passed as perl's NV (floating point) type,
>>>losing precision if they're integers that would fit in an IV or UV.
>>>
>>> #2 appears to be a quality of implementation issue without any
>>> user-visible effects.
>>>
>>> #3 is an opportunity for future improvement, but works as intended right
>>> now.
>>>
>>> I think patches for these issues could still be considered during beta,
>>> but they are not release blockers IMO.
>> 
>> It appears to me that item #2 definitely would need to be fixed before
>> release, so that it doesn't become a backwards-incompatibility later on.
>
> The way I understand it, it's only how things are passed around
> internally.  Nothing is noticeable externally, and so there is no
> backward compatibility issue.
>
> At least that's how I understand it.  So far this is only a claim by one
> person.  I haven't seen anything conclusive about whether there is an
> actual issue.

It's not just how things are passed internally, but how they are passed
to pl/perl functions with a jsonb transform: JSON scalar values at the
top level (strings, numbers, booleans, null) get passed as a reference
to a reference to the value, e.g. \\42 instead of 42.  The reason the
current tests don't pick this up is that they don't check the value
inside the pl/perl functions, only that they roundtrip back to jsonb,
and the plperl to jsonb transform recursively dereferences references.

Another side effect of the recursive dereferencing is that returning
undef from the perl function returns an SQL NULL while returning a
reference to undef (\undef) returns a jsonb null.

The attached patch fixes the excess enreferencing, but does not touch
the dereferencing part.

- ilmari
-- 
"I use RMS as a guide in the same way that a boat captain would use
 a lighthouse.  It's good to know where it is, but you generally
 don't want to find yourself in the same spot." - Tollef Fog Heen

>From e0948ca29055241874ba455d39728da66fdf3fd5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Tue, 17 Apr 2018 18:18:31 +0100
Subject: [PATCH] Fix excess enreferencing in plperl jsonb transform

---
 .../jsonb_plperl/expected/jsonb_plperl.out| 42 +--
 .../jsonb_plperl/expected/jsonb_plperlu.out   | 36 
 contrib/jsonb_plperl/jsonb_plperl.c   |  8 ++--
 contrib/jsonb_plperl/sql/jsonb_plperl.sql | 39 -
 contrib/jsonb_plperl/sql/jsonb_plperlu.sql| 38 +
 5 files changed, 82 insertions(+), 81 deletions(-)

diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl.out b/contrib/jsonb_plperl/expected/jsonb_plperl.out
index e4f3cdd41a..3364057a64 100644
--- a/contrib/jsonb_plperl/expected/jsonb_plperl.out
+++ b/contrib/jsonb_plperl/expected/jsonb_plperl.out
@@ -52,16 +52,18 @@ SELECT testRegexpResultToJsonb();
  0
 (1 row)
 
-CREATE FUNCTION roundtrip(val jsonb) RETURNS jsonb
+CREATE FUNCTION roundtrip(val jsonb, ref text = '') RETURNS jsonb
 LANGUAGE plperl
 TRANSFORM FOR TYPE jsonb
 AS $$
+die 'unexpected '.(ref($_[0]) || 'not a').' reference'
+if ref($_[0]) ne $_[1];
 return $_[0];
 $$;
 SELECT roundtrip('null');
  roundtrip 
 ---
- null
+ 
 (1 row)
 
 SELECT roundtrip('1');
@@ -97,12 +99,6 @@ SELECT roundtrip('"string"');
  "string"
 (1 row)
 
-SELECT roundtrip('"NaN"');
- roundtrip 

- "NaN"
-(1 row)
-
 SELECT roundtrip('true');
  roundtrip 
 ---
@@ -115,91 +111,91 @@ SELECT roundtrip('false');
  0
 (1 row)
 
-SELECT roundtrip('[]');
+SELECT roundtrip('[]', 'ARRAY');
  roundtrip 
 ---
  []
 (1 row)
 
-SELECT roundtrip('[null, null]');
+SELECT roundtrip('[null, null]', 'ARRAY');
   roundtrip   
 --
  [null, null]
 (1 row)
 
-SELECT roundtrip('[1, 2, 3]');
+SELECT roundtrip('[1, 2, 3]', 'ARRAY');
  roundtrip 
 ---
  [1, 2, 3]
 (1 row)
 
-SELECT roundtrip('[-1, 2, -3]');
+SELECT roundtrip('[-1, 2, -3]', 'ARRAY');
   roundtrip  
 -
  [-1, 2, -3]
 (1 row)
 
-SELECT roundtrip('[1.2, 2.3, 3.4]');
+SELECT roundtrip('[1.2, 2.3, 3.4]', 'ARRAY');
 roundtrip
 -
  [1.2, 2.3, 3.4]
 (1 row)
 
-SELECT roundtrip('[-1.2, 2.3, -3.4]');
+SELECT roundtrip('[-1.2, 2.3, -3.4]', 'ARRAY');
  roundtrip 
 ---
  [-1.2, 2.3, -3.4]
 (1 row)
 
-SELECT roundtrip('["string1", "string2"]');
+SELECT roundtrip('["string1", "string2"]', 'ARRAY');
roundtrip
 
  ["string1", "string2"]
 (1 row)
 
-SELECT roundtrip('{}');
+SELECT roundtrip('{}', 'HASH');
  roundtrip 
 ---
  {}
 (1 row)
 
-SELECT roundtrip('{"1": null}');
+SELECT roundtrip('{"1": null}', 'HASH');
   

Re: POC: GROUP BY optimization

2018-06-07 Thread Teodor Sigaev

I don't see why not to generate all possible orderings (keeping only the
cheapest path for each pathkey variant) and let the optimizer sort it
out.


I'm assuming costing the full N! possible orderings would be
prohibitively expensive.


That's true, but as a first step we need to improve cost_sort(), and only then
try to find the best pathkeys order in an optimal way.



- If the user requested that order, we assume it "somewhat
subjectively better" (by giving it a slightly reduced cost).
I don't think so. That's not the SQL way - the DB should determine the optimal
way to execute the query.



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Problem while updating a foreign table pointing to a partitioned table on foreign server

2018-06-07 Thread Ashutosh Bapat
On Tue, Jun 5, 2018 at 3:40 PM, Kyotaro HORIGUCHI
 wrote:
> Hello.
>
> At Mon, 04 Jun 2018 20:58:28 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20180604.205828.208262556.horiguchi.kyot...@lab.ntt.co.jp>
>> It fails on some join-pushdown cases since it doesn't add tid
>> columns to join tlist.  I suppose that build_tlist_to_deparse
>> needs something but I'll consider further tomorrow.
>
> I made it work with a few exceptions and bumped.  PARAM_EXEC
> doesn't work at all in a case where Sort exists between
> ForeignUpdate and ForeignScan.
>
> =
> explain (verbose, costs off)
> update bar set f2 = f2 + 100
> from
>   ( select f1 from foo union all select f1+3 from foo ) ss
> where bar.f1 = ss.f1;
>   QUERY PLAN
> -
>  Update on public.bar
>Update on public.bar
>Foreign Update on public.bar2
>  Remote SQL: UPDATE public.loct2 SET f2 = $3 WHERE tableoid = $1 AND ctid 
> = $2
> ...
>->  Merge Join
>  Output: bar2.f1, (bar2.f2 + 100), bar2.f3, (ROW(foo.f1))
>  Merge Cond: (bar2.f1 = foo.f1)
>  ->  Sort
>Output: bar2.f1, bar2.f2, bar2.f3, bar2.tableoid, bar2.ctid
>Sort Key: bar2.f1
>->  Foreign Scan on public.bar2
>  Output: bar2.f1, bar2.f2, bar2.f3, bar2.tableoid, 
> bar2.ctid
>  Remote SQL: SELECT f1, f2, f3, ctid, tableoid FROM 
> public.loct2 FOR UPDATE
> =

What's the problem that you faced?

>
> Even if this worked fine, it cannot be back-patched.  We need extra
> storage that moves together with tuples, or to prevent sorts or
> something like that from being inserted there.

I think your approach still has the same problem that it's abusing the
tableOid field in the heap tuple to store tableoid from the remote as
well as local table. That's what Robert and Tom objected to [1], [2]

>
>
> At Fri, 1 Jun 2018 10:21:39 -0400, Ashutosh Bapat 
>  wrote in 
> 
>> I am not suggesting to commit 0003 in my patch set, but just 0001 and
>> 0002 which just raise an error when multiple rows get updated when
>> only one row is expected to be updated.
>
> So I agree to commit the two at least, in order to prevent doing the
> wrong thing silently.

I haven't heard any committer's opinion on this one yet.

[1] 
https://www.postgresql.org/message-id/CA+TgmobUHHZiDR=hcu4n30yi9_pe175ittbfk6t8jxzwkra...@mail.gmail.com
[2] https://www.postgresql.org/message-id/7936.1526590932%40sss.pgh.pa.us

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: POC: GROUP BY optimization

2018-06-07 Thread Teodor Sigaev

So the costing was fairly trivial, we simply do something like

 comparison_cost = 2.0 * cpu_operator_cost;

 sort_cost = comparison_cost * tuples * LOG2(tuples);

which essentially ignores that there might be multiple columns, or that
the columns may have sort operator with different costs.

Agree. And distribution of keys.


The question is how reliable the heuristics can be. The current patch
uses just plain ndistinct, but that seems rather unreliable but I don't
have a clear idea how to improve that - we may have MCV for the columns
and perhaps some extended statistics, but I'm not sure how far we can
run with that.

v8 already uses another algorithm.



Essentially what we need to estimate the number of comparisons for each
column, to compute better comparison_cost.

Exactly


Prioritization of the user-provided order can be as simple as giving
that comparison_cost a small handicap.


I see no point in doing that, and I don't recall a single place in the
planner where we do that. If the user specified ORDER BY, we'll slap an
explicit Sort on top when needed (which acts as the handicap, but in a
clear way). Otherwise we don't do such things - it'd be just plain
confusing (consider "ORDER BY a,b" vs. "ORDER BY b,c" with same data
types, ndistinct etc. but unexpectedly different costs). Also, what
would be a good value for the handicap?


Again, agreed. If we have a fixed order of columns (ORDER BY) then we should
not try to reorder it. The current patch follows that, if I didn't make a
mistake.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-07 Thread Ashutosh Bapat
On Thu, Jun 7, 2018 at 10:58 AM, Amit Langote
 wrote:
> On 2018/06/07 14:17, Ashutosh Bapat wrote:
>>> that is, users can find out about that feature by themselves by
>>> trying it out?
>>
>> I didn't understand that part.
>>
>> Probably we just say that BEFORE ROW triggers are not supported on a
>> partitioned table. It's good enough not to suggest it ourselves. If
>> users find out that they can create triggers on partitions and use it
>> that way, they may or may not have to change their implementation
>> later when we start supporting those. But then we didn't suggest that
>> they do it that way.
>


> I don't understand why you think it's too troublesome to let the users
> know that there is some way to use BR triggers with partitioning.  We
> didn't do that for indexes, for example, before PG 11 introduced the
> ability to create them on partitioned tables.

By looking at the index keys it's easy to decide whether two indexes
are the same. When we add an index on a partitioned table in v11,
we skip creating an index on a partition if there already exists an
index similar to the one being created. So, a user can have indexes on
partitions created in v10, upgrade to v11 and create an index on the
partitioned table. Nothing changes. But that's not true for a trigger.
It's not easy to check whether two triggers are the same unless the
underlying function is the same. A user may or may not be using the
same trigger function for all the partitions, which is more likely if
the column order differs between partitions. So, if the user has
created triggers on the partitions in v10 and upgrades to v11, they may
have to drop them all and recreate the trigger on the partitioned table.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: POC: GROUP BY optimization

2018-06-07 Thread Teodor Sigaev





OK, I haven't looked at v7 yet, but if I understand correctly it tries
to maintain the ordering as much as possible? Does that actually help? I
mean, the incremental sort patch allows the sorting to happen by pieces,
but here we still need to sort all the data, right?

Can you give an example demonstrating the benefit?





SELECT a, SUM(x) FROM
(
 SELECT a, b, COUNT(c) AS x FROM t1 GROUP BY a, b
 UNION ALL
 SELECT a, b, COUNT(c) AS x FROM t2 GROUP BY a, b
) foo GROUP BY a;

and indexes on (a,b) and (b,a) for both relations. The "deduplication"
by pathkeys I suggested would mean we might keep only index (a,b) on t1
and (b,a) on t2, which means the grouping by "a" can't leverage index
scans on both relations. But if we keep paths for both indexes on each
relation, we can.

Yes, that's one option.


Isn't "estimation of cost of comparing function/number of unique values
in column could be not very accurate and so planner could make a wrong
choice" is more an argument against relying on it when doing these
optimizations?

FWIW it's one of the arguments Tom made in the incremental sort patch,
which relies on it too when computing cost of the incremental sort. I'm
sure it's going to be an obstacle there too. >

I saw a 2x difference in a real-world application. Again, improving
the sort cost estimation is a separate task.

Sure. But we also need to ask the other question, i.e. how many people
would be negatively affected by the optimization. And I admit I don't
know the answer to that, the next example is entirely made up.
Hm, it seems the best way here is improving the cost_sort estimation. Will
try, but I think that is a separate patch.




FWIW I think it would be useful to have "development GUC" that would
allow us to enable/disable these options during development, because
that makes experiments much easier. But then remove them before commit.


Will do

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Typo in planner README

2018-06-07 Thread Magnus Hagander
On Thu, Jun 7, 2018 at 12:28 PM, Daniel Gustafsson  wrote:

> The README didn’t get the memo when set_base_rel_pathlist was renamed in
> 6543d81d659f417.  Use the right name, set_base_rel_pathlists, as per the
> attached.
>

Applied, thanks.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-06-07 Thread Ashutosh Bapat
On Wed, Jun 6, 2018 at 5:00 PM, Etsuro Fujita
 wrote:
> (2018/05/18 16:33), Etsuro Fujita wrote:
>>
>> Other than pull_var_clause things,
>> the updated version looks good to me, so I'll mark this as Ready for
>> Committer.
>
>
> Since I'm not 100% sure that that is the right way to go, I've been
> rethinking how to fix this issue.  Yet another idea I came up with to fix
> this is to redesign the handling of the tlists for children in the
> partitioning case.  Currently, we build the reltarget for a child by
> applying adjust_appendrel_attrs to the reltarget for its parent in
> set_append_rel_size, which maps a wholerow Var referring to the parent rel
> to a ConvertRowtypeExpr that translates a wholerow of the child rel into a
> wholerow of the parent rel's rowtype.  This works well for the
> non-partitioned inheritance case, but makes complicated the code for
> handling the partitioning case especially when planning partitionwise-joins.
> And I think this would be the root cause of this issue.

Although true, this is not only about targetlist. Even the whole-row
expressions in the conditions, equivalence classes and other
planner/optimizer data structures are translated to
ConvertRowtypeExpr.

>  I don't think the
> tlists for the children need to have their wholerows transformed into the
> corresponding ConvertRowtypeExprs *at this point*, so what I'd like to
> propose is to 1) map a parent wholerow Var simply to a child wholerow Var,
> instead (otherwise, the same as adjust_appendrel_attrs), when building the
> reltarget for a child in set_append_rel_size, 2) build the tlists for child
> joins using those children's wholerow Vars at path creation time, and 3)
> adjust those wholerow Vars in the tlists for subpaths in the chosen
> AppendPath so that those are transformed into the corresponding
> ConvertRowtypeExprs, at plan creation time (ie, in
> create_append_plan/create_merge_append_plan).  IIUC, this would not require
> any changes to pull_var_clause as proposed by the patch.  This would not
> require any changes to postgres_fdw as proposed by the patch, either.  In
> addition, this would make unnecessary the code added to setrefs.c by the
> partitionwise-join patch.  Maybe I'm missing something though.

Not translating whole-row expressions to ConvertRowtypeExpr before
creating paths can lead to a number of anomalies. For example,

1. if there's an index on the whole-row expression of the child,
translating the parent's whole-row expression to the child's whole-row
expression as is will lead to using that index, when in reality it
cannot be used since the condition/ORDER BY clause (which originally
referred to the parent's whole-row expression) requires the child's
whole-row reassembled as the parent's whole-row.
2. Constraints on the child's whole-row expression will be used to prune
a child when they cannot be used, since the original condition was on
the parent's whole-row expression and not that of the child.
3. Equivalence classes will think that a child whole-row expression
(without ConvertRowtypeExpr) is equivalent to an expression which is
part of the same equivalence class as the parent's whole-row
expression.

Given these serious problems, I don't think we could afford not to
cover a child whole-row reference in ConvertRowtypeExpr.
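
For reference, the wrapping being discussed is roughly the following (a
simplified sketch; the helper name and variable names are made up):

static Expr *
wrap_child_wholerow(Var *child_wholerow, Oid parent_reltype)
{
	ConvertRowtypeExpr *cre = makeNode(ConvertRowtypeExpr);

	Assert(child_wholerow->varattno == 0);	/* whole-row reference */
	cre->arg = (Expr *) child_wholerow;
	cre->resulttype = parent_reltype;		/* parent's composite type OID */
	cre->convertformat = COERCE_IMPLICIT_CAST;
	cre->location = -1;
	return (Expr *) cre;
}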

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: computing completion tag is expensive for pgbench -S -M prepared

2018-06-07 Thread Simon Riggs
On 7 June 2018 at 11:29, Pavel Stehule  wrote:

>> Do we actually need the completion tag at all? In most cases??
>
>
> affected rows is taken from this value on protocol level

I didn't mean we should remove the number of rows. Many things rely on that.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: computing completion tag is expensive for pgbench -S -M prepared

2018-06-07 Thread Pavel Stehule
2018-06-07 12:01 GMT+02:00 Simon Riggs :

> On 7 June 2018 at 06:01, David Rowley 
> wrote:
> > On 7 June 2018 at 16:13, Andres Freund  wrote:
> >> in PortalRun().  That's actually fairly trivial to optimize - we don't
> >> need the full blown snprintf machinery here.  A quick benchmark
> >> replacing it with:
> >>
> >>memcpy(completionTag, "SELECT ", sizeof("SELECT
> "));
> >>pg_lltoa(nprocessed, completionTag + 7);
> >
> > I'd also noticed something similar with some recent benchmarks I was
> > doing for INSERTs into partitioned tables. In my case I saw as high as
> > 0.7% of the time spent building the INSERT tag. So I think it's worth
> > fixing this.
> >
> > I think it would be better to invent a function that accepts a
> > CmdType, int64 and Oid that copies the tag into the supplied buffer,
> > then make a more generic change that also replaces the code in
> > ProcessQuery() which builds the tag. I'm sure there must be some way
> > to get the CmdType down to the place you've patched so we can get rid
> > of the if (strcmp(portal->commandTag, "SELECT") == 0) line too.
>
> Sounds better
>
> Do we actually need the completion tag at all? In most cases??
>

affected rows is taken from this value on protocol level

Regards

Pavel


> Perhaps we should add a parameter to make it optional and turn it off
> by default, except for psql.
>
> --
> Simon Riggshttp://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>


Typo in planner README

2018-06-07 Thread Daniel Gustafsson
The README didn’t get the memo when set_base_rel_pathlist was renamed in
6543d81d659f417.  Use the right name, set_base_rel_pathlists, as per the
attached.

cheers ./daniel



typo-planner_README.patch
Description: Binary data


Re: computing completion tag is expensive for pgbench -S -M prepared

2018-06-07 Thread Simon Riggs
On 7 June 2018 at 06:01, David Rowley  wrote:
> On 7 June 2018 at 16:13, Andres Freund  wrote:
>> in PortalRun().  That's actually fairly trivial to optimize - we don't
>> need the full blown snprintf machinery here.  A quick benchmark
>> replacing it with:
>>
>>memcpy(completionTag, "SELECT ", sizeof("SELECT "));
>>pg_lltoa(nprocessed, completionTag + 7);
>
> I'd also noticed something similar with some recent benchmarks I was
> doing for INSERTs into partitioned tables. In my case I saw as high as
> 0.7% of the time spent building the INSERT tag. So I think it's worth
> fixing this.
>
> I think it would be better to invent a function that accepts a
> CmdType, int64 and Oid that copies the tag into the supplied buffer,
> then make a more generic change that also replaces the code in
> ProcessQuery() which builds the tag. I'm sure there must be some way
> to get the CmdType down to the place you've patched so we can get rid
> of the if (strcmp(portal->commandTag, "SELECT") == 0) line too.

Sounds better

Do we actually need the completion tag at all? In most cases??

Perhaps we should add a parameter to make it optional and turn it off
by default, except for psql.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Concurrency bug in UPDATE of partition-key

2018-06-07 Thread Amit Khandekar
On 7 June 2018 at 11:44, Amit Kapila  wrote:
> On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar 
> wrote:
>>
>> Attached is a rebased patch version. Also included it in the upcoming
>> commitfest :
>> https://commitfest.postgresql.org/18/1660/
>>
>
> Doesn't this belong to PostgreSQL 11 Open Items [1] or are you proposing it
> as a feature enhancement for next version?

Yes, it needs to be in the Open items. I will have it added in that list.

>
>>
>> In the rebased version, the new test cases are added in the existing
>> isolation/specs/partition-key-update-1.spec test.
>>
>
>   if (!tuple_deleted)
> - return NULL;
> + {
> + /*
> + * epqslot will be typically NULL. But when ExecDelete() finds
> + * that another transaction has concurrently updated the same
> + * row, it re-fetches the row, skips the delete, and epqslot is
> + * set to the re-fetched tuple slot. In that case, we need to
> + * do all the checks again.
> + */
> + if (TupIsNull(epqslot))
> + return NULL;
> + else
> + {
> + slot = ExecFilterJunk(resultRelInfo->ri_junkFilter, epqslot);
> + tuple = ExecMaterializeSlot(slot);
> + goto lreplace;
> + }
> + }
>
> I think this will allow before row delete triggers to be fired more than
> once.  Normally, if the EvalPlanQual testing generates a new tuple, we don't
> fire the triggers again.

If there are BR delete triggers, the tuple will be locked using
GetTupleForTrigger(). So the subsequent EvalPlanQual testing won't be
run, since the tuple is already locked due to triggers having run.

But that leads me to think : The same concurrency issue can occur in
GetTupleForTrigger() also. Say, a concurrent session has already
locked the tuple, and GetTupleForTrigger() would wait and then return
the updated tuple in its last parameter newSlot. In that case, we need
to pass this slot back through ExecBRDeleteTriggers(), and further
through epqslot parameter of ExecDelete(). But yes, in this case, we
should avoid calling this trigger function the second time.

If you agree on the above, I will send an updated patch.
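
To be concrete, the change I have in mind is roughly the following (the
extra output parameter and its name are tentative, not the final patch):

/* Sketch: let the caller get back the tuple re-fetched by
 * GetTupleForTrigger() after a concurrent update, so that
 * ExecDelete()/ExecUpdate() can redo their checks instead of firing
 * the BR delete triggers a second time. */
extern bool ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
								 ResultRelInfo *relinfo,
								 ItemPointer tupleid,
								 HeapTuple fdw_trigtuple,
								 TupleTableSlot **epqslot);	/* new */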


> Now, one possibility could be that we don't skip
> the delete after a concurrent update and still allow inserts to use a new
> tuple generated by EvalPlanQual testing of delete.  However, I think we need
> to perform rechecks for update to check if the modified tuple still requires
> to be moved to new partition, right or do you have some other reason in
> mind?

Yes, that's the reason we need to start again from lreplace; i.e., to
redo not just the partition constraint check but all the other checks
that were required earlier, because the tuple has been modified. It
may even end up moving to a different partition than the one chosen
earlier.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Spilling hashed SetOps and aggregates to disk

2018-06-07 Thread Tomas Vondra

On 06/07/2018 02:18 AM, Andres Freund wrote:

On 2018-06-06 17:17:52 -0700, Andres Freund wrote:

On 2018-06-07 12:11:37 +1200, David Rowley wrote:

On 7 June 2018 at 08:11, Tomas Vondra  wrote:

On 06/06/2018 04:11 PM, Andres Freund wrote:

Consider e.g. a scheme where we'd switch from hashed aggregation to
sorted aggregation due to memory limits, but already have a number of
transition values in the hash table. Whenever the size of the transition
values in the hashtable exceeds memory size, we write one of them to the
tuplesort (with serialized transition value). From then on further input
rows for that group would only be written to the tuplesort, as the group
isn't present in the hashtable anymore.



Ah, so you're suggesting that during the second pass we'd deserialize
the transition value and then add the tuples to it, instead of building
a new transition value. Got it.


Having to deserialize every time we add a new tuple sounds terrible
from a performance point of view.


I didn't mean that we do that, and I don't think David understood it as
that either. I was talking about the approach where the second pass is a
sort rather than hash based aggregation.  Then we would *not* need to
deserialize more than exactly once.


s/David/Tomas/, obviously. Sorry, it's been a long day.



Solution is simple: drink more coffee. ;-)

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Possible bug in logical replication.

2018-06-07 Thread Simon Riggs
On 6 June 2018 at 17:22, Alvaro Herrera  wrote:
> This thread seems to have died down without any fix being proposed.
> Simon, you own this open item.

Thanks, will look.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fix obsolete comment

2018-06-07 Thread Amit Langote
On 2018/06/07 16:03, Heikki Linnakangas wrote:
> On 07/06/18 09:43, Amit Langote wrote:
>> A comment in ExecUpdate refers to an argument of ExecConstraints that no
>> longer exists.  Attached fixes that, but I'm about over a year too late
>> [1] in sending this patch.
> 
> Applied, thanks!

Thanks Heikki.

Regards,
Amit




Re: Fix obsolete comment

2018-06-07 Thread Heikki Linnakangas

On 07/06/18 09:43, Amit Langote wrote:

A comment in ExecUpdate refers to an argument of ExecConstraints that no
longer exists.  Attached fixes that, but I'm about over a year too late
[1] in sending this patch.


Applied, thanks!

- Heikki



Fix obsolete comment

2018-06-07 Thread Amit Langote
Hi.

A comment in ExecUpdate refers to an argument of ExecConstraints that no
longer exists.  Attached fixes that, but I'm about over a year too late
[1] in sending this patch.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c0a8ae7be392
diff --git a/src/backend/executor/nodeModifyTable.c 
b/src/backend/executor/nodeModifyTable.c
index c4c841cdd7..2a4dfea151 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1162,13 +1162,10 @@ lreplace:;
}
 
/*
-* Check the constraints of the tuple.  Note that we pass the 
same
-* slot for the orig_slot argument, because unlike 
ExecInsert(), no
-* tuple-routing is performed here, hence the slot remains 
unchanged.
-* We've already checked the partition constraint above; 
however, we
-* must still ensure the tuple passes all other constraints, so 
we
-* will call ExecConstraints() and have it validate all 
remaining
-* checks.
+* Check the constraints of the tuple.  We've already checked 
the
+* partition constraint above; however, we must still ensure 
the tuple
+* passes all other constraints, so we will call 
ExecConstraints() and
+* have it validate all remaining checks.
 */
if (resultRelationDesc->rd_att->constr)
ExecConstraints(resultRelInfo, slot, estate, false);


RE: I'd like to discuss scaleout at PGCon

2018-06-07 Thread Tsunakawa, Takayuki
> From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]
> I can not completely agree with it. I have done a lot of benchmarking of
> PostgreSQL, CitusDB, SparkSQL and native C/Scala code generated for
> TPC-H queries.

Wow, you have an amazingly abundant experience.


> I do not want to say that it is not possible to implement a good analytic
> platform for OLAP on top of Postgres. But it is a very challenging task.
> And IMHO the choice of programming language is not so important. What is
> more important is the format in which the data is stored. The best systems
> for data analytics: Vertica, HyPer, KDB,...
> are using a vertical (columnar) data model. SparkSQL also uses the Parquet
> file format, which provides efficient extraction and processing of data.
> With the abstract storage API Postgres is also given a chance to implement
> efficient storage for OLAP data processing. But a huge amount of work has
> to be done here.

Agreed on huge amount of work... And I must admit my dream is a pipedream now.


Regards
Takayuki Tsunakawa



RE: I'd like to discuss scaleout at PGCon

2018-06-07 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> > On Wed, Jun 6, 2018 at 8:16 PM, MauMau  wrote:
> >> Looking at the XL source code, the following sequence of functions
> >> are called when the coordinator handles the Row Description message
> >> ('T') from the data node.  I guess the parsing is necessary to
> >> process type names combined with type modifiers, e.g. "char(100)".
> >>
> >> create_tuple_desc
> >> parseTypeString
> >> typeStringToTypeName
> >> raw_parser
> >
> > Hmm, strange. I had not seen that in Postgres-XC.
> 
> Where is that located in the XC code?  If you point out to the location
> it may revive some past memories from either Ashutosh or me.

The XL-specific code path is this in src/backend/pgxc/pool/execRemote.c (I
don't know whether this also exists in XC):
handle_response -> HandleRowDescription -> create_tuple_desc

I think the rest exists in PostgreSQL:
parseTypeString -> typeStringToTypeName -> raw_parser
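
For what it's worth, the reason raw_parser ends up involved is that turning a
type name string back into an OID/typmod pair goes through the SQL grammar,
roughly like this (simplified illustration, not the XL code itself):

Oid		typoid;
int32	typmod;

/* "character varying(100)" has to be parsed to recover the base type OID
 * and the typmod encoded by the modifier */
parseTypeString("character varying(100)", &typoid, &typmod, false);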

Regards
Takayuki Tsunakawa





Re: Needless additional partition check in INSERT?

2018-06-07 Thread David Rowley
On 7 June 2018 at 17:45, Amit Langote  wrote:
> On 2018/06/07 13:10, David Rowley wrote:
>> On 7 June 2018 at 16:05, Amit Langote  wrote:
>>> Or we could just not have a separate function and put the logic that
>>> determines whether or not to check the partition constraint right before
>>> the following piece of code in ExecConstraints
>>>
>>> if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
>>> !ExecPartitionCheck(resultRelInfo, slot, estate))
>>> ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
>>
>> I don't think so. The function Alvaro is talking about is specific to
>> inserting a tuple into a table. The place you're proposing to put the
>> call to it is not.
>
> Well, we need to determine whether or not to check the partition
> constraint exactly before inserting a tuple into a table and that's also
> when ExecConstraints is called, so this objection is not clear to me.
>
> I'm saying that instead of encapsulating that logic in a new function and
> calling it from a number of places, refactor things such that we call
> ExecConstraints unconditionally and decide there which constraints to
> check and which ones to not.  Having that logic separated from
> ExecConstraints is what caused us to have this discussion in the first place.
>
> I'm attaching a patch here to show what I'm saying.

I might not have fully explained what I meant. I'm guilty of thinking
it was pretty clear when I wrote my objection.

I meant:

ExecConstraints is not only called during INSERT and COPY FROM, it's
also used during UPDATE [1].

What you're proposing seems to be adding a check for
trig_insert_before_row into a function that will be called during
UPDATE.

Can you explain why you think that's a good thing to do? I don't
really see why UPDATE should care about the presence (or lack) of a
BEFORE ROW INSERT trigger.
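
For reference, the kind of insert-only condition being debated is roughly
this (a paraphrased sketch, not a quote of the patch):

static bool
partition_check_needed(ResultRelInfo *resultRelInfo)
{
	/* Tuples routed through the parent were already checked during tuple
	 * routing; only a BEFORE ROW INSERT trigger could have invalidated
	 * that since. */
	if (resultRelInfo->ri_PartitionRoot == NULL)
		return true;
	return resultRelInfo->ri_TrigDesc &&
		   resultRelInfo->ri_TrigDesc->trig_insert_before_row;
}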

[1] 
https://github.com/michaelpq/postgres/blob/master/src/backend/executor/nodeModifyTable.c#L1174

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services