Re: Avoiding Tablespace path collision for primary and standby

2018-05-25 Thread Michael Paquier
On Sat, May 26, 2018 at 02:10:52PM +1200, Thomas Munro wrote:
> I also wondered about this when trying to figure out how to write a
> TAP test for recovery testing with tablespaces, for my undo proposal.
> I was starting to wonder about either allowing relative paths or
> supporting some kind of variable in the tablespace path that could
> then be set differently in each cluster's .conf.

As of now, for tablespace creation with multiple nodes on the same host,
you really come down to just using the tablespace map within pg_basebackup.
I think that this is a difficult problem, as one may want to not use the
same partition space for both primary and standby, hence you would need
to associate a tablespace path with one node using for example a node
name set in postgresql.conf, while extending CREATE TABLESPACE to
support this grammar and register the paths for each node in WAL
records.  Using a path that varies depending on the time is not a good
idea in my opinion.
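
For reference, that existing workaround looks roughly like this when creating
a standby on the same host (the paths here are illustrative):

    pg_basebackup -D /path/to/standby -R -h localhost -p 5432 \
        --tablespace-mapping=/path/to/primary_tblspc=/path/to/standby_tblspc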
--
Michael




Re: Avoiding Tablespace path collision for primary and standby

2018-05-25 Thread Thomas Munro
On Sat, May 26, 2018 at 9:17 AM, Ashwin Agrawal  wrote:
> To generate a unique path between primary and standby we need to use
> something that is not represented within the database, so it will be random
> to some degree. For example, one could use the postmaster's port number,
> since we only need to generate a unique path while creating the link during
> CREATE TABLESPACE.

I also wondered about this when trying to figure out how to write a
TAP test for recovery testing with tablespaces, for my undo proposal.
I was starting to wonder about either allowing relative paths or
supporting some kind of variable in the tablespace path that could
then be set differently in each cluster's .conf.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: SPI/backend equivalent of extended-query Describe(statement)?

2018-05-25 Thread Chapman Flack
On 05/25/18 21:16, Andrew Gierth wrote:
>> "Tom" == Tom Lane  writes:
>  Tom> about that for v11, but I'd favor trying to improve the situation
>  Tom> in v12.
> 
> Yeah. Another issue I ran into is that if you use SPI_prepare_params,
> then you have to use SPI_execute_plan_with_paramlist, it's not possible
> to use SPI_execute_plan (or SPI_execute_snapshot) instead because
> plan->nargs was set to 0 by the prepare and never filled in with the
> actual parameter count.

Now *that* sounds arguably bug-like ...could *it*, at least, be a candidate
for back-patching?

If it's currently not possible to use SPI_execute_plan/SPI_execute_snapshot
after SPI_prepare_params, then there can't be anything currently doing so,
that a patch could conceivably disrupt, can there?

-Chap



Re: SPI/backend equivalent of extended-query Describe(statement)?

2018-05-25 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 >>> /*
 >>> * GAH. To do parameter type checking properly, we have to install our
 >>> * own global post-parse hook transiently.
 >>> */

 >> Gah, indeed. Thanks for the heads up. I would never have guessed it'd
 >> be that fiddly.

 Tom> Yikes.  That seems pretty unsafe :-(

I put in a recursion check out of paranoia, but even after considerable
thought wasn't able to figure out any scenario that would actually break
it. If it's actually unsafe I'd really like to know about it :-)

 Tom> Obviously, I missed a bet by not folding check_variable_parameters
 Tom> into the pstate hook mechanism. It's a bit late to do anything
 Tom> about that for v11, but I'd favor trying to improve the situation
 Tom> in v12.

Yeah. Another issue I ran into is that if you use SPI_prepare_params,
then you have to use SPI_execute_plan_with_paramlist, it's not possible
to use SPI_execute_plan (or SPI_execute_snapshot) instead because
plan->nargs was set to 0 by the prepare and never filled in with the
actual parameter count.

-- 
Andrew (irc:RhodiumToad)



Re: SPI/backend equivalent of extended-query Describe(statement)?

2018-05-25 Thread Tom Lane
Chapman Flack  writes:
> On 05/25/18 20:07, Andrew Gierth wrote:
>> /*
>> * GAH. To do parameter type checking properly, we have to install our
>> * own global post-parse hook transiently.
>> */
>> ...
>> PG_TRY();
>> {
>> pllua_spi_prev_parse_hook = post_parse_analyze_hook;
>> post_parse_analyze_hook = pllua_spi_prepare_checkparam_hook;
>> ...
>> PG_CATCH();
>> {
>> post_parse_analyze_hook = pllua_spi_prev_parse_hook;
>> --pllua_spi_prepare_recursion;
>> PG_RE_THROW();
>> ...

> Gah, indeed. Thanks for the heads up. I would never have guessed it'd
> be that fiddly.

Yikes.  That seems pretty unsafe :-(

Obviously, I missed a bet by not folding check_variable_parameters
into the pstate hook mechanism.  It's a bit late to do anything
about that for v11, but I'd favor trying to improve the situation
in v12.

regards, tom lane



Re: SPI/backend equivalent of extended-query Describe(statement)?

2018-05-25 Thread Chapman Flack
On 05/25/18 20:07, Andrew Gierth wrote:
>  >> with writing a ParserSetupHook that's just a thin wrapper around
>  >> parse_variable_parameters ?
> 
> That's what I did in pllua-ng. The tricky bit was in arranging to also
> call check_variable_parameters; I considered skipping that part, but
> that seemed like it could be potentially problematic.
> 
> https://github.com/pllua/pllua-ng/blob/master/src/spi.c#L266
>
>/*
> * GAH. To do parameter type checking properly, we have to install our
> * own global post-parse hook transiently.
> */
>...
>PG_TRY();
>{
>   pllua_spi_prev_parse_hook = post_parse_analyze_hook;
>   post_parse_analyze_hook = pllua_spi_prepare_checkparam_hook;
>...
>PG_CATCH();
>{
>   post_parse_analyze_hook = pllua_spi_prev_parse_hook;
>   --pllua_spi_prepare_recursion;
>   PG_RE_THROW();
>...

Gah, indeed. Thanks for the heads up. I would never have guessed it'd
be that fiddly.

-Chap



Re: SPI/backend equivalent of extended-query Describe(statement)?

2018-05-25 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 > Chapman Flack  writes:
 >> As for bringing it along to the modern API, am I on the right track
 >> with writing a ParserSetupHook that's just a thin wrapper around
 >> parse_variable_parameters ?

 Tom> Seems reasonable from here.

That's what I did in pllua-ng. The tricky bit was in arranging to also
call check_variable_parameters; I considered skipping that part, but
that seemed like it could be potentially problematic.

https://github.com/pllua/pllua-ng/blob/master/src/spi.c#L266

-- 
Andrew (irc:RhodiumToad)



Re: Timetz comparison

2018-05-25 Thread Tom Lane
"David G. Johnston"  writes:
> On Fri, May 25, 2018 at 3:33 PM, Tom Lane  wrote:
>> Even if you'd made a case why we should consider them equal,
>> those would be very good reasons not to change behavior that's
>> stood for 17 years.

> This is true, and the alternative doesn't have the supporting argument of
> being spec-compliant either...

I poked around in the standard to see what it has to say on the subject.
SQL:2011 section 8.2  general rule 6 says "the
comparison of two datetimes is determined according to the interval
resulting from their subtraction".  That's promising, except we don't
actually implement timetz subtraction:

# select '22:00+02'::timetz - '23:00+01'::timetz;
ERROR:  operator does not exist: time with time zone - time with time zone

But surely the spec defines it ... digging around, it's in 6.33  general rule 7.  That describes rotating both values to
the same time zone and then subtracting, which would seem to provide some
ammunition for Alexey's point of view.  But then they throw it all away:

The difference of two values of type TIME (with or without time zone)
is constrained to be between –24:00:00 and +24:00:00 (excluding each
end point); it is implementation-defined which of two non-zero values
in this range is the result, although the computation shall be
deterministic.

In other words, the implementation is actually free to choose the sign
of the subtraction result, which means that the spec fails to define
the result of timetz comparison: all that's required is that it be
consistent with your implementation of timetz subtraction.  Since we
don't have the latter (and I don't recall anyone asking for it...)
there's not much to argue from here.

> The notes in 8.5.3 Time Zone (v10 docs) seem to apply here overall - the
> type, while standard, is ill-conceived.

Yeah, this.  There are a *lot* of weirdnesses in the spec's treatment of
datetimes, and specifically their notion of timezones just has darn little
to do with anyone's reality.  So in general I'm not that excited about
getting closer to spec in this area.

regards, tom lane



jsonb iterator not fully initialized

2018-05-25 Thread Peter Eisentraut
I got this error message via -fsanitize=undefined:

jsonfuncs.c:5169:12: runtime error: load of value 127, which is not a
valid value for type '_Bool'

The query was

select ts_headline('{}'::jsonb, tsquery('aaa & bbb'));

This calls the C function ts_headline_jsonb_byid_opt(), which calls
transform_jsonb_string_values(), which calls

it = JsonbIteratorInit(&jsonb->root);
is_scalar = it->isScalar;

but it->isScalar is not always initialized by JsonbIteratorInit().  (So
the 127 is quite likely clobbered memory.)

It can be fixed this way:

--- a/src/backend/utils/adt/jsonb_util.c
+++ b/src/backend/utils/adt/jsonb_util.c
@@ -901,7 +901,7 @@ iteratorFromContainer(JsonbContainer *container, JsonbIterator *parent)
 {
JsonbIterator *it;

-   it = palloc(sizeof(JsonbIterator));
+   it = palloc0(sizeof(JsonbIterator));
it->container = container;
it->parent = parent;
it->nElems = JsonContainerSize(container);

It's probably not a problem in practice, since the isScalar business is
apparently only used in the array case, but it's dubious to leave things
uninitialized like this nonetheless.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: SCRAM with channel binding downgrade attack

2018-05-25 Thread Michael Paquier
On Fri, May 25, 2018 at 06:24:07PM +0300, Heikki Linnakangas wrote:
> On 25 May 2018 17:44:16 EEST, Robert Haas  wrote:
>> It seems to me that this is really another sort of thing altogether.
>> Whether or not you want to insist on channel binding is a completely
>> separate thing from which channel binding methods you're willing to
>> use.  It seems to me like the most logical thing would be to make
>> these two separate connection options. 
> 
> Works for me.

OK, I can live with that as well.  So we'll go in the direction of two
parameters then:
- scram_channel_binding, which can use "prefer" (default), "require" or
"disable".
- scram_channel_binding_name, developer option to choose the type of
channel binding, with "tls-unique" (default) and "tls-server-end-point".
We could also remove the prefix "scram_".  Ideas of names are welcome.

At the end, the core of the proposed patch relies on the fact that it
checks when receiving AUTH_REQ_OK that a full set of SCRAM messages has
happened with the server, up to the point where the client has checked
the server proof that both ends know the same password.  So do people
here think that this is a sane approach?  Are there other ideas?
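
For illustration, with the two-parameter scheme sketched above a client
connection could look like this (both parameter names and their values are
proposals from this thread, not an existing libpq API):

    psql "host=db.example.com dbname=postgres scram_channel_binding=require"
    psql "host=db.example.com scram_channel_binding=require scram_channel_binding_name=tls-server-end-point"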
--
Michael




Re: SPI/backend equivalent of extended-query Describe(statement)?

2018-05-25 Thread Tom Lane
Chapman Flack  writes:
> As for bringing it along to the modern API, am I on the right track
> with writing a ParserSetupHook that's just a thin wrapper around
> parse_variable_parameters ?

Seems reasonable from here.

regards, tom lane



Re: Timetz comparison

2018-05-25 Thread David G. Johnston
On Fri, May 25, 2018 at 3:33 PM, Tom Lane  wrote:

> Alexey Bashtanov  writes:
> > Comparison of timetz values looks a bit weird to me, as
> > '22:00+02'::timetz > '21:00+01'::timetz.
>
> Perhaps, but I don't think there's a reasonable case for considering
> them equal, either.  In the other places where obviously-different
> values compare equal, such as zero versus minus zero in IEEE floats,
> it's widely understood as a gotcha.
>

Except all we've done here is expose an implementation detail, since
timestamptz compares on a logical "point in time" notion of equality while
timetz does not.  Limitations of timetz aside, adding a random date and
changing the type to "timestamptz" would not obviously cause the result of
the above comparison to change, and the fact that it does could be
considered a gotcha, given that the behavior of timestamptz is likely to be
the widely understood and accepted one, with the behavior of timetz inferred
from it.


> > What's the problem with these values to be considered equal?
> > Backward compatibility? Hash algorithms?
>
> Even if you'd made a case why we should consider them equal,
> those would be very good reasons not to change behavior that's
> stood for 17 years.
>

This is true, and the alternative doesn't have the supporting argument of
being spec-compliant either...

The notes in 8.5.3 Time Zone (v10 docs) seem to apply here overall - the
type, while standard, is ill-conceived.  Those who cannot stop using it
would not appreciate us changing it and those dealing with the oddity now
should just use timestamptz.

https://www.postgresql.org/docs/10/static/datatype-datetime.html

David J.


Re: SPI/backend equivalent of extended-query Describe(statement)?

2018-05-25 Thread Chapman Flack
On 05/25/2018 06:01 PM, Tom Lane wrote:

> Offhand I don't believe SPI exposed a way to do that before 9bedd128d.
> Does it matter?  Pre-9.0 releases are long out of support by now,
> which means they're full of known data-loss hazards and security bugs.

Well, my exploration arose from discovering something in PL/Java's
implementation that surprised me, but it dated from pre-9.0 times,
so seems to have been doing the best that was possible then. Didn't
want to start changing it before understanding why it was that way.

As for bringing it along to the modern API, am I on the right track
with writing a ParserSetupHook that's just a thin wrapper around
parse_variable_parameters ?

-Chap



Re: Timetz comparison

2018-05-25 Thread Tom Lane
Alexey Bashtanov  writes:
> Comparison of timetz values looks a bit weird to me, as 
> '22:00+02'::timetz > '21:00+01'::timetz.

Perhaps, but I don't think there's a reasonable case for considering
them equal, either.  In the other places where obviously-different
values compare equal, such as zero versus minus zero in IEEE floats,
it's widely understood as a gotcha.

> What's the problem with these values to be considered equal?
> Backward compatibility? Hash algorithms?

Even if you'd made a case why we should consider them equal,
those would be very good reasons not to change behavior that's
stood for 17 years.

regards, tom lane



Timetz comparison

2018-05-25 Thread Alexey Bashtanov

Hello,

Comparison of timetz values looks a bit weird to me, as 
'22:00+02'::timetz > '21:00+01'::timetz.
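
In psql terms, a minimal illustration of the behavior just described (output
as observed on a stock installation):

    select '22:00+02'::timetz = '21:00+01'::timetz;  -- false, although both are 20:00 UTC
    select '22:00+02'::timetz > '21:00+01'::timetz;  -- true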

I can see this behavior introduced by commit 2792374c in 2001
with the comment "timetz was just plain broken (some possible pairs of 
values were neither < nor = nor >)".


The in-code comment is:
/*
 * If same GMT time, sort by timezone; we only want to say that two
 * timetz's are equal if both the time and zone parts are equal.
 */

However, I found no specification of how timetz values are compared, either
in the Postgres docs or in the standard (http://www.wiscorp.com/sql20nn.zip).

Was this decision made just to be definite?

What's the problem with these values to be considered equal?
Backward compatibility? Hash algorithms?

Thanks

Best regards,
  Alexey




Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-05-25 Thread Andres Freund
On 2018-05-25 17:47:37 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Moving discussion to -hackers.  Tom, I think you worked most with this
> > code, your input would be appreciated.
> 
> Yeah, the assumption in the relcache is that the only part of a nailed
> catalog's relcache entry that really needs to be updated intrasession is
> the relfilenode mapping.

Paging through the changes to relcache.c and vacuum[lazy].c it looks to
me like that hasn't been true in a long time, right?


> For nailed indexes, we allow updating of some additional fields, and I
> guess what has to happen here is that we teach the code to update some
> additional fields for nailed tables too.

Yea, it seems like we could just get a new version of the pg_class tuple
if in the right state, and memcpy() it into place. Not sure if there's
any other issues...


BTW, and I guess this mostly goes to Alvaro, I don't understand why that
code accepts relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX?
That seems like something we'll hopefully never support.

Greetings,

Andres Freund



Re: SPI/backend equivalent of extended-query Describe(statement)?

2018-05-25 Thread Tom Lane
Chapman Flack  writes:
> Am I on the right track here? Is what I'm looking to do something
> that became possible in SPI in 9.0 and wasn't before, or did I overlook
> a way it could have been done pre-9.0 ?

Offhand I don't believe SPI exposed a way to do that before 9bedd128d.
Does it matter?  Pre-9.0 releases are long out of support by now,
which means they're full of known data-loss hazards and security bugs.

regards, tom lane



Re: Performance regression with PostgreSQL 11 and partitioning

2018-05-25 Thread Justin Pryzby
On Fri, May 25, 2018 at 05:17:12PM -0400, Tom Lane wrote:
> Robert Haas  writes:
> > On Fri, May 25, 2018 at 1:53 PM, Amit Langote wrote:
> >> Seems here that we call find_appinfos_by_relids here for *all*
> >> partitions, even if all but one partition may have been pruned.  I
> >> haven't studied this code in detail, but I suspect it might be
> >> unnecessary, although I might be wrong.
> 
> > Uggh.  It might be possible to skip it for dummy children.  That would
> > leave the dummy child rels generating a different pathtarget than the
> > non-dummy children, but I guess if we never use them again maybe it
> > wouldn't matter.

> Maybe it's all right to decide that this rejiggering can be left
> for v12 ... did we promise anyone that it's now sane to use thousands
> of partitions?

https://www.postgresql.org/docs/devel/static/ddl-partitioning.html
|The following caveats apply to CONSTRAINT EXCLUSION:
|[...]
|All constraints on all partitions of the master table are examined during
|constraint exclusion, so large numbers of partitions are likely to increase
|query planning time considerably. So the legacy inheritance based partitioning
|will work well with up to perhaps a hundred partitions; DON'T TRY TO USE MANY
|THOUSANDS OF PARTITIONS.

It doesn't matter for us, as we're already using tooo many partitions, but I
would interpret that to mean that thousands of partitions are a job exclusively
for "partition pruning" of declarative partitions.

Justin



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-05-25 Thread Tom Lane
Andres Freund  writes:
> Moving discussion to -hackers.  Tom, I think you worked most with this
> code, your input would be appreciated.

Yeah, the assumption in the relcache is that the only part of a nailed
catalog's relcache entry that really needs to be updated intrasession is
the relfilenode mapping.  For nailed indexes, we allow updating of some
additional fields, and I guess what has to happen here is that we teach
the code to update some additional fields for nailed tables too.

regards, tom lane



Re: assert in nested SQL procedure call in current HEAD

2018-05-25 Thread Andrew Gierth
> "Joe" == Joe Conway  writes:

 Joe> My colleague Yogesh Sharma discovered an assert in nested SQL
 Joe> procedure calls after ROLLBACK is used. Minimal test case and
 Joe> backtrace below. I have not yet tried to figure out exactly what
 Joe> is going on beyond seeing that it occurs in pg_plan_query() where
 Joe> the comment says "Planner must have a snapshot in case it calls
 Joe> user-defined functions"...

https://www.postgresql.org/message-id/29608.1518533...@sss.pgh.pa.us

-- 
Andrew (irc:RhodiumToad)



Re: Performance regression with PostgreSQL 11 and partitioning

2018-05-25 Thread Jonathan S. Katz

> On May 25, 2018, at 5:17 PM, Tom Lane  wrote:
> 
> Maybe it's all right to decide that this rejiggering can be left
> for v12 ... did we promise anyone that it's now sane to use thousands
> of partitions?

Per beta release, we’ve only said “improved SELECT query performance due
to enhanced partition elimination during query processing and execution as
well as parallelized partition scans” with the disclaimers of “subject to change
due to beta testing” and “please test and report back.”

In other words, no promises on the above.

However, the question is how common is the above scenario? If you’re doing
partitions by day, it would take a while to get to 20K. But if you have
something where you have so much inbound data that you need to partition by
hour, it can occur a little bit more quickly (though it will still take a
couple of years, if you were to start partitioning today).
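
(A rough check on that estimate: 20,000 partitions at 24 per day is about
833 days, i.e. roughly 2.3 years of hourly partitioning.)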

Jonathan


Re: Redesigning the executor (async, JIT, memory efficiency)

2018-05-25 Thread Joe Conway
On 05/24/2018 11:26 PM, Heikki Linnakangas wrote:
> Cool stuff!
> 
> On 25/05/18 06:35, Andres Freund wrote:
>> For example, this converts TPCH's Q01:

+1
Wicked cool!


-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: Avoiding Tablespace path collision for primary and standby

2018-05-25 Thread Ashwin Agrawal
On Fri, May 25, 2018 at 7:33 AM, Tom Lane  wrote:

> Ashwin Agrawal  writes:
> > Proposing to create directory with timestamp at time of creating
> tablespace
> > and create symbolic link to it instead.
>
> I'm skeptical that this solves your problem.  What happens when the CREATE
> TABLESPACE command is replicated to the standby with sub-second delay?
>

I thought timestamps have microsecond precision. Are we expecting the
tablespace to be created, WAL-logged, streamed, and replayed on the mirror
within a microsecond?

> Clock skew is another reason to doubt that timestamp == unique identifier,
> which is essentially what you're assuming here.
>

It's only on the same machine that we care about generating uniqueness.
Across different machines the problem doesn't exist anyway, so it doesn't
matter whether the clock is skewed or not.


>
> Even if we fixed that, the general idea of including a quasi-random
> component in the directory name seems like it would have a lot of
> unpleasant side effects in terms of reproduceability, testability, etc.
>

Hmm... to some degree we already create directories/files with quasi-random
numbers as well, like tablespace OIDs, database OIDs, relfilenodes, etc.

To generate a unique path between primary and standby we need to use
something that is not represented within the database, so it will be random
to some degree. For example, one could use the postmaster's port number,
since we only need to generate a unique path while creating the link during
CREATE TABLESPACE.


Re: Performance regression with PostgreSQL 11 and partitioning

2018-05-25 Thread Tom Lane
Robert Haas  writes:
> On Fri, May 25, 2018 at 1:53 PM, Amit Langote  wrote:
>> Seems here that we call find_appinfos_by_relids here for *all*
>> partitions, even if all but one partition may have been pruned.  I
>> haven't studied this code in detail, but I suspect it might be
>> unnecessary, although I might be wrong.

> Uggh.  It might be possible to skip it for dummy children.  That would
> leave the dummy child rels generating a different pathtarget than the
> non-dummy children, but I guess if we never use them again maybe it
> wouldn't matter.

I think this line of thought is shooting the messenger.  The core of the
problem here is that the appinfo list, which is a data structure designed
with the expectation of having a few dozen entries at most, has got four
thousand entries (or twenty thousand if you try Thomas' bigger case).
What we need, if we're going to cope with that, is something smarter than
linear searches.

I'm not exactly impressed with the design concept of
find_appinfos_by_relids in itself, either, as what that does is to
linearly search the appinfo list to produce ... an array, which also
has to be linearly searched.  In this particular example, it seems
that all the output arrays have length 1; if they did not then we'd
be seeing the search loops in adjust_appendrel_attrs et al taking
up unreasonable amounts of time as well.  (Not to mention the palloc
cycles and unreclaimed space eaten by said arrays.)

I'm inclined to think that we should flush find_appinfos_by_relids
altogether, along with these inefficient intermediate arrays, and instead
have the relevant places in adjust_appendrel_attrs call some function
defined as "gimme the AppendRelInfo for child rel A and parent rel B,
working directly from the PlannerInfo root data".  That could use a hash
lookup when dealing with more than some-small-number of AppendRelInfos,
comparable to what we do with join_rel_list/join_rel_hash.

I also wonder whether it's really necessary to do a fresh search
for each individual Var, as I see is currently happening in
adjust_appendrel_attrs_mutator.  At the very least we could expect
that there would be runs of requests for the same parent/child pair,
and avoid fresh list searches/hash probes when the answer is the
same as last time.  But do we really have to leave that for the bottom
level of the recursion, rather than doing it once per expr at some higher
call level?

Maybe it's all right to decide that this rejiggering can be left
for v12 ... did we promise anyone that it's now sane to use thousands
of partitions?

regards, tom lane



assert in nested SQL procedure call in current HEAD

2018-05-25 Thread Joe Conway
My colleague Yogesh Sharma discovered an assert in nested SQL procedure
calls after ROLLBACK is used. Minimal test case and backtrace below. I
have not yet tried to figure out exactly what is going on beyond seeing
that it occurs in pg_plan_query() where the comment says "Planner must
have a snapshot in case it calls user-defined functions"...

Joe

--
DROP TABLE IF EXISTS tst_tbl;
CREATE TABLE tst_tbl (a int);
CREATE OR REPLACE PROCEDURE insert_data() AS $$
 INSERT INTO tst_tbl VALUES (42);
$$ LANGUAGE SQL;

CREATE OR REPLACE PROCEDURE transaction_test() AS $$
 BEGIN
  ROLLBACK;
  CALL insert_data();
 END
$$ LANGUAGE PLPGSQL;
CALL transaction_test();

#0  0x7fe343cf4428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x7fe343cf602a in __GI_abort () at abort.c:89
#2  0x00a71951 in ExceptionalCondition (conditionName=0xc9408b
"!(ActiveSnapshotSet())", errorType=0xc93d23 "FailedAssertion",
fileName=0xc93e22 "postgres.c", lineNumber=801) at assert.c:54
#3  0x008ec20f in pg_plan_query (querytree=0x2df1f00,
cursorOptions=256, boundParams=0x0) at postgres.c:801
#4  0x0070772e in init_execution_state
(queryTree_list=0x2df2c28, fcache=0x2df12d8, lazyEvalOK=true) at
functions.c:507
#5  0x00707e50 in init_sql_fcache (finfo=0x7fffcbb1fd10,
collation=0, lazyEvalOK=true) at functions.c:770
#6  0x007085d9 in fmgr_sql (fcinfo=0x7fffcbb1fd40) at
functions.c:1064
#7  0x00675c12 in ExecuteCallStmt (stmt=0x2d54b18, params=0x0,
atomic=false, dest=0xfc7380 ) at functioncmds.c:2294
#8  0x008f5067 in standard_ProcessUtility (pstmt=0x2d54a00,
queryString=0x2d4f4e8 "CALL insert_data()",
context=PROCESS_UTILITY_QUERY_NONATOMIC,
params=0x0, queryEnv=0x0, dest=0xfc7380 ,
completionTag=0x7fffcbb20390 "") at utility.c:649
#9  0x008f4852 in ProcessUtility (pstmt=0x2d54a00,
queryString=0x2d4f4e8 "CALL insert_data()",
context=PROCESS_UTILITY_QUERY_NONATOMIC, params=0x0,
queryEnv=0x0, dest=0xfc7380 ,
completionTag=0x7fffcbb20390 "") at utility.c:360
#10 0x0074523d in _SPI_execute_plan (plan=0x2d54558,
paramLI=0x0, snapshot=0x0, crosscheck_snapshot=0x0, read_only=false,
fire_triggers=true, tcount=0)
at spi.c:2225
#11 0x00741d9d in SPI_execute_plan_with_paramlist
(plan=0x2d54558, params=0x0, read_only=false, tcount=0) at spi.c:481
#12 0x7fe33491bcfd in exec_stmt_call (estate=0x7fffcbb20870,
stmt=0x2df6170) at pl_exec.c:2103
#13 0x7fe33491b7bd in exec_stmt (estate=0x7fffcbb20870,
stmt=0x2df6170) at pl_exec.c:1920
#14 0x7fe33491b66e in exec_stmts (estate=0x7fffcbb20870,
stmts=0x2df60f0) at pl_exec.c:1875
#15 0x7fe33491b508 in exec_stmt_block (estate=0x7fffcbb20870,
block=0x2df5c60) at pl_exec.c:1816
#16 0x7fe334918e50 in plpgsql_exec_function (func=0x2d5ec30,
fcinfo=0x7fffcbb20ba0, simple_eval_estate=0x0, atomic=false) at
pl_exec.c:586
#17 0x7fe334913119 in plpgsql_call_handler (fcinfo=0x7fffcbb20ba0)
at pl_handler.c:263
#18 0x00675c12 in ExecuteCallStmt (stmt=0x2d2cff8, params=0x0,
atomic=false, dest=0x2d2d420) at functioncmds.c:2294
#19 0x008f5067 in standard_ProcessUtility (pstmt=0x2d2d0c8,
queryString=0x2d2c4e8 "CALL transaction_test();",
context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, dest=0x2d2d420,
completionTag=0x7fffcbb213a0 "") at utility.c:649
#20 0x008f4852 in ProcessUtility (pstmt=0x2d2d0c8,
queryString=0x2d2c4e8 "CALL transaction_test();",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, dest=0x2d2d420, completionTag=0x7fffcbb213a0 "") at
utility.c:360


-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-05-25 Thread Andres Freund
Hi,

Moving discussion to -hackers.  Tom, I think you worked most with this
code, your input would be appreciated.

Original discussion is around:
http://archives.postgresql.org/message-id/20180524211311.tnswfnjwnii54htx%40alvherre.pgsql

On 2018-05-24 17:13:11 -0400, Alvaro Herrera wrote:
> On 2018-May-24, Andres Freund wrote:
> > Then there's also:
> > http://archives.postgresql.org/message-id/1527193504642.36340%40amazon.com
> 
> ah, so deleting the relcache file makes the problem to go away?  That's
> definitely pretty strange.  I see no reason for the value in relcache to
> become out of step with the catalogued value in the same database ... I
> don't think we transmit in any way values of one database to another.

I can reproduce the issue. As far as I can tell we just don't ever
actually update nailed relcache entries in the normal course, leaving
the "physical address" aside.  VACUUM will, via
vac_update_relstats() -> heap_inplace_update() -> CacheInvalidateHeapTuple(),
send out an invalidation. But invalidation, in my case another session,
will essentially ignore most of that due to:

static void
RelationClearRelation(Relation relation, bool rebuild)
...
/*
 * Never, never ever blow away a nailed-in system relation, because we'd
 * be unable to recover.  However, we must redo RelationInitPhysicalAddr
 * in case it is a mapped relation whose mapping changed.
 *
 * If it's a nailed-but-not-mapped index, then we need to re-read the
 * pg_class row to see if its relfilenode changed. We do that immediately
 * if we're inside a valid transaction and the relation is open (not
 * counting the nailed refcount).  Otherwise just mark the entry as
 * possibly invalid, and it'll be fixed when next opened.
 */
if (relation->rd_isnailed)
{
RelationInitPhysicalAddr(relation);

if (relation->rd_rel->relkind == RELKIND_INDEX ||
relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
{
relation->rd_isvalid = false;   /* needs to be revalidated */
if (relation->rd_refcnt > 1 && IsTransactionState())
RelationReloadIndexInfo(relation);
}
return;
}

Which basically means that once running we'll never update the relcache
data for nailed entries.  That's unproblematic for most relcache fields,
but not for things like RelationData->rd_rel->relfrozenxid / relminmxid.

This'll e.g. lead to lazy_vacuum_rel() wrongly not using aggressive
vacuums despite being required. And it'll lead, triggering this thread,
to wrong errors being raised during vacuum because relfrozenxid just is
some random value from the past.  I suspect this might also be
co-responsible for a bunch of planning issues for queries involving the
catalog, because the planner will use wrong relcache data until the next
time the init file is thrown away?

This looks like a very longstanding bug to me.  I'm not yet quite sure
what the best way to deal with this is.  I suspect we might get away
with just looking up a new version of the pg_class tuple and copying
rd_rel over?

Greetings,

Andres Freund



Re: SPI/backend equivalent of extended-query Describe(statement)?

2018-05-25 Thread Chapman Flack
On 05/24/2018 02:30 AM, Chapman Flack wrote:
> In 9.0, there's SPI_prepare_params, which seems promising; it accepts
> an arbitrary ParserSetupHook "to control the parsing of external parameter
> references." But its documentation doesn't suggest what to use as the
> ParserSetupHook to say "please just do the same stuff you would do if
> I were a client sending a Parse message with unspecified parameter types!"
> 
> Perhaps I just need something like
> 
> struct varparinfo { Oid *paramTypes; int numParams; } vpi = {palloc(0), 0};
> 
> static void inferringSetupHook(struct ParseState *pstate, void *arg)
> {
> struct varparinfo *vpi = (struct varparinfo *)arg;
> parse_variable_parameters(pstate, &vpi->paramTypes, &vpi->numParams);
> }
> 
> SPI_prepare_params("SELECT $1", inferringSetupHook, &vpi, 0);

Am I on the right track here? Is what I'm looking to do something
that became possible in SPI in 9.0 and wasn't before, or did I overlook
a way it could have been done pre-9.0 ?

Thanks,
-Chap



Re: Performance regression with PostgreSQL 11 and partitioning

2018-05-25 Thread Robert Haas
On Fri, May 25, 2018 at 1:53 PM, Amit Langote  wrote:
> Seems here that we call find_appinfos_by_relids here for *all*
> partitions, even if all but one partition may have been pruned.  I
> haven't studied this code in detail, but I suspect it might be
> unnecessary, although I might be wrong.

Uggh.  It might be possible to skip it for dummy children.  That would
leave the dummy child rels generating a different pathtarget than the
non-dummy children, but I guess if we never use them again maybe it
wouldn't matter.

> Fwiw, I'm not sure why the new pruning code would call here, at least
> git grep find_appinfos_by_relids doesn't turn up anything interesting
> in that regard.

Hmm, OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: perlcritic and perltidy

2018-05-25 Thread Bruce Momjian
On Sun, May  6, 2018 at 11:53:34AM -0400, Tom Lane wrote:
> What sort of changes do we get if we remove those two flags as you prefer?
> It'd help to see some examples.
> 
> Since we just went to a new perltidy version, and made some other
> policy changes for it, in HEAD, it'd make sense to make any further
> changes in this same release cycle rather than drip drip drip over
> multiple cycles.  We just need to get some consensus about what
> style we like.

I saw you looking for feedback so I wanted to give mine.  Also, Andrew,
thanks for working on this --- it is a big help to have limited Perl
critic reports and good tidiness.

I am using the src/tools/pgindent/perltidyrc setting for my own Perl
code, but needed to add these two:

--noblanks-before-comments
--break-after-all-operators

The first one fixes odd blank lines when I put comments inside
conditional tests, e.g.:

if (!$options{args_supplied}   &&
!$is_debug &&
defined($stat_main)&&
defined($stat_cache)   &&
$stat_main->mtime < $stat_cache->mtime &&
# is local time zone?
(!defined($ENV{TZ}) || $ENV{TZ} =~ m/^E.T$/))

Without the first option, I get:

if (!$options{args_supplied}   &&
!$is_debug &&
defined($stat_main)&&
defined($stat_cache)   &&
$stat_main->mtime < $stat_cache->mtime &&
-->
# is local time zone?
(!defined($ENV{TZ}) || $ENV{TZ} =~ m/^E.T$/))

which just looks odd to me.  Am I the only person who often does this?

The second option, --break-after-all-operators, is more of a personal
taste, but it does match how our C code works, and people have said I
write C code in Perl.  ;-)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: [PATCH v16] GSSAPI encryption support

2018-05-25 Thread Robbie Harwood
Robbie Harwood  writes:

> Thomas Munro  writes:
>
>> On Thu, May 24, 2018 at 8:00 AM, Robbie Harwood  wrote:
>>
>>> Zombie patch is back from the dead.
>>
>> Hi Robbie,
>>
>> Robots[1] vs zombies:
>>
>> + $postgres->RemoveFile('src/backennd/libpq/be-gssapi-common.c');
>>
>> Typo, breaks on Windows.
>>
>> runtime.sgml:2032: parser error : Opening and ending tag mismatch:
>> para line 2026 and sect1
>>  
>>  ^
>>
>> Docs malformed.
>>
>> [1] http://cfbot.cputube.org/robbie-harwood.html
>
> Hah, this is great!  Looks like I failed to build the docs.
>
> Here's an updated version that should fix those.

Me and the bot are having an argument.  This should green Linux but I
dunno about Windows.

Thanks,
--Robbie
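
For context, the pg_hba.conf entry types and the libpq "gssmode" parameter
described in the attached patch's commit message would be used roughly like
this (a sketch based on that description; none of it is in a released
version):

    # pg_hba.conf: only accept GSSAPI-encrypted connections from this network
    hostgss    all    all    192.0.2.0/24    gss
    # reject anything arriving without GSSAPI encryption
    hostnogss  all    all    0.0.0.0/0       reject

    # libpq side: insist on GSSAPI encryption
    psql "host=db.example.com dbname=postgres gssmode=require"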


From 1bf21be9d9cd05bf2bcb37c474888b4d8ff6fb63 Mon Sep 17 00:00:00 2001
From: Robbie Harwood 
Date: Thu, 10 May 2018 16:12:03 -0400
Subject: [PATCH] libpq GSSAPI encryption support

On both the frontend and backend, prepare for GSSAPI encryption
support by moving common code for error handling into a separate file.
Fix a TODO for handling multiple status messages in the process.
Eliminate the OIDs, which have not been needed for some time.

Add frontend and backend encryption support functions.  Keep the
context initiation for authentication-only separate on both the
frontend and backend in order to avoid concerns about changing the
requested flags to include encryption support.

In postmaster, pull GSSAPI authorization checking into a shared
function.  Also share the initiator name between the encryption and
non-encryption codepaths.

Modify pqsecure_write() to take a non-const pointer.  The pointer will
not be modified, but this change (or a const-discarding cast, or a
malloc()+memcpy()) is necessary for GSSAPI due to const/struct
interactions in C.

For HBA, add "hostgss" and "hostnogss" entries that behave similarly
to their SSL counterparts.  "hostgss" requires either "gss", "trust",
or "reject" for its authentication.

Similarly, add a "gssmode" parameter to libpq.  Supported values are
"disable", "require", and "prefer".  Notably, negotiation will only be
attempted if credentials can be acquired.  Move credential acquisition
into its own function to support this behavior.

Finally, add documentation for everything new, and update existing
documentation on connection security.

Thanks to Michael Paquier for the Windows fixes.
---
 doc/src/sgml/client-auth.sgml   |  75 --
 doc/src/sgml/libpq.sgml |  57 +++-
 doc/src/sgml/runtime.sgml   |  77 +-
 src/backend/libpq/Makefile  |   4 +
 src/backend/libpq/auth.c| 103 +++
 src/backend/libpq/be-gssapi-common.c|  64 +
 src/backend/libpq/be-gssapi-common.h|  26 ++
 src/backend/libpq/be-secure-gssapi.c| 319 ++
 src/backend/libpq/be-secure.c   |  16 ++
 src/backend/libpq/hba.c |  51 +++-
 src/backend/postmaster/pgstat.c |   3 +
 src/backend/postmaster/postmaster.c |  64 -
 src/include/libpq/hba.h |   4 +-
 src/include/libpq/libpq-be.h|  11 +-
 src/include/libpq/libpq.h   |   3 +
 src/include/libpq/pqcomm.h  |   5 +-
 src/include/pgstat.h|   3 +-
 src/interfaces/libpq/Makefile   |   4 +
 src/interfaces/libpq/fe-auth.c  |  90 +--
 src/interfaces/libpq/fe-connect.c   | 232 +++-
 src/interfaces/libpq/fe-gssapi-common.c | 128 +
 src/interfaces/libpq/fe-gssapi-common.h |  23 ++
 src/interfaces/libpq/fe-secure-gssapi.c | 343 
 src/interfaces/libpq/fe-secure.c|  16 +-
 src/interfaces/libpq/libpq-fe.h |   3 +-
 src/interfaces/libpq/libpq-int.h|  30 ++-
 src/tools/msvc/Mkvcbuild.pm |  24 +-
 27 files changed, 1576 insertions(+), 202 deletions(-)
 create mode 100644 src/backend/libpq/be-gssapi-common.c
 create mode 100644 src/backend/libpq/be-gssapi-common.h
 create mode 100644 src/backend/libpq/be-secure-gssapi.c
 create mode 100644 src/interfaces/libpq/fe-gssapi-common.c
 create mode 100644 src/interfaces/libpq/fe-gssapi-common.h
 create mode 100644 src/interfaces/libpq/fe-secure-gssapi.c

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index 656d5f9417..38cf32e3be 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -108,6 +108,8 @@ hostnossl  database  user
 host   database  user  IP-address  IP-mask  auth-method  auth-options
 hostssldatabase  user  IP-address  IP-mask  auth-method  auth-options
 hostnossl  database  user  IP-address  IP-mask  auth-method  auth-options
+hostgssdatabase  user  IP-address  IP-mask  auth-method  auth-options
+hostnogss  database  user  IP-address  IP-mask  auth-method  auth-options
 
The meaning of 

Re: Performance regression with PostgreSQL 11 and partitioning

2018-05-25 Thread Amit Langote
On Fri, May 25, 2018 at 11:49 PM, Robert Haas  wrote:
> On Fri, May 25, 2018 at 10:30 AM, Thomas Reiss wrote:
>> Then I used the following to compare the planning time :
>> explain (analyze) SELECT * FROM t1 WHERE dt = '2018-05-25';
>>
>> With PostgreSQL 10, planning time is 66ms, in v11, planning rise to
>> 143ms. I also did a little test with more than 20k partitions, and while
>> the planning time was reasonable with PG10 (287.453 ms), it exploded
>> with v11 with 4578.054 ms.
>>
>> Perf showed that these functions find_appinfos_by_relids and
>> bms_is_member consume most of the CPU time with v11. With v10, these
>> functions don't appear. It seems that find_appinfos_by_relids was
>> introduced by commit 480f1f4329f.
>
> Hmm.  Have you verified whether that commit is actually the one that
> caused the regression?  It's certainly possible, but I wouldn't expect
> calling find_appinfos_by_relids() with 1 AppendRelInfo to be too much
> more expensive than calling find_childrel_appendrelinfo() as the
> previous code did.  I wonder if some later change, perhaps related to
> pruning, just caused this code path to be hit more often.

One more possibility is that find_appinfos_by_relids shows up high in
profiles because it is called from apply_scanjoin_target_to_paths, which is
new in PG 11 and is in turn called (unconditionally) from grouping_planner.
Especially the following bit in it:

/*
 * If the relation is partitioned, recurseively apply the same changes to
 * all partitions and generate new Append paths. Since Append is not
 * projection-capable, that might save a separate Result node, and it also
 * is important for partitionwise aggregate.
 */
if (rel->part_scheme && rel->part_rels)
{
int partition_idx;
List   *live_children = NIL;

/* Adjust each partition. */
for (partition_idx = 0; partition_idx < rel->nparts; partition_idx++)
{
RelOptInfo *child_rel = rel->part_rels[partition_idx];
ListCell   *lc;
AppendRelInfo **appinfos;
int nappinfos;
List   *child_scanjoin_targets = NIL;

/* Translate scan/join targets for this child. */
appinfos = find_appinfos_by_relids(root, child_rel->relids,
                                               &nappinfos);


Seems here that we call find_appinfos_by_relids here for *all*
partitions, even if all but one partition may have been pruned.  I
haven't studied this code in detail, but I suspect it might be
unnecessary, although I might be wrong.

Fwiw, I'm not sure why the new pruning code would call here, at least
git grep find_appinfos_by_relids doesn't turn up anything interesting
in that regard.

Thanks,
Amit



Re: Performance regression with PostgreSQL 11 and partitioning

2018-05-25 Thread Thomas Reiss


On 25/05/2018 at 16:49, Robert Haas wrote:
> On Fri, May 25, 2018 at 10:30 AM, Thomas Reiss wrote:
>> Then I used the following to compare the planning time :
>> explain (analyze) SELECT * FROM t1 WHERE dt = '2018-05-25';
>>
>> With PostgreSQL 10, planning time is 66ms, in v11, planning rise to
>> 143ms. I also did a little test with more than 20k partitions, and while
>> the planning time was reasonable with PG10 (287.453 ms), it exploded
>> with v11 with 4578.054 ms.
>>
>> Perf showed that these functions find_appinfos_by_relids and
>> bms_is_member consume most of the CPU time with v11. With v10, these
>> functions don't appear. It seems that find_appinfos_by_relids was
>> introduced by commit 480f1f4329f.
> 
> Hmm.  Have you verified whether that commit is actually the one that
> caused the regression?  It's certainly possible, but I wouldn't expect
> calling find_appinfos_by_relids() with 1 AppendRelInfo to be too much
> more expensive than calling find_childrel_appendrelinfo() as the
> previous code did.  I wonder if some later change, perhaps related to
> pruning, just caused this code path to be hit more often.

I didn't because I didn't have enough time. I'll take another look next week.



Re: Enhancement Idea - Expose the active value of a parameter in pg_settings

2018-05-25 Thread Robert Haas
On Fri, May 25, 2018 at 12:14 PM, Greg Clough  wrote:
> Many thanks for the quick consideration, even if it's ultimately a rejection. 
>  Figuring out some SQL that will work across all platforms, versions, and 
> compile-time options will be "fun", but I'm up for a challenge.

Why would you need different SQL for this on different platforms or
with different compile-time options?

I agree that there could be some variation across major versions, but
I don't think there's a lot.

BTW, posting to a public mailing list with a giant blob of legalese
about confidential information in your signature is kind of silly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



We still claim "cannot begin/end transactions in PL/pgSQL"

2018-05-25 Thread Tom Lane
I notice there are still several places in pl_exec.c like this:

case SPI_ERROR_TRANSACTION:
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("cannot begin/end transactions in PL/pgSQL"),
 errhint("Use a BEGIN block with an EXCEPTION clause 
instead.")));
break;

At best, the wording of these error messages is now obsolete.  I'm not
sure if we expect them to be reachable at all.  If they should be
can't-happen cases, I'd suggest just deleting them and letting control
fall to the generic default: cases in each switch.  If they are reachable,
the messages need work.

regards, tom lane



RE: Enhancement Idea - Expose the active value of a parameter in pg_settings

2018-05-25 Thread Greg Clough
>> On Fri, May 25, 2018 at 10:11 AM, Andrew Dunstan wrote:
>>> He's proposing an extra column to show the actual value used, so
>>> distinguishing them shouldn't be a problem.
>
>> For most settings, that column would just be a duplicate.  For a
>> handful, it would pull in the value of some other GUC.  If somebody
>> finds that useful, cool, they can write a view that does it and
>> install it on their own cluster.  I don't think that it makes a lot of
>> sense to put it in core, though.  My guess would be that more people
>> would be annoyed or confused by the extra column than would be pleased
>> or enlightened by it.  I could of course be wrong.
>
> Yeah, that's pretty much my evaluation --- given the tiny number of
> GUCs that have behaviors like this, an extra column seems like it
> would mostly be confusing.  Plus, pg_settings is too darn wide already.
>
> Personally, what I'd rather do is try to get rid of GUC behaviors like
> "the effective value depends on something else".  But convenience and
> backwards compatibility may be arguments against that.
>
> regards, tom lane

Many thanks for the quick consideration, even if it's ultimately a rejection.  
Figuring out some SQL that will work across all platforms, versions, and 
compile-time options will be "fun", but I'm up for a challenge.

It came about because I was scraping the entire cluster with "pgmetrics", and I 
was trying to reduce all "size" numeric settings down to bytes for them. I 
figured it would be nice if PostgreSQL could expose the data it's already got 
rather than forcing all external applications that want that data to do the 
same thing.
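
One possible starting point for that normalization, done purely in SQL
against pg_settings, is a sketch like the following (it only covers the
common size units; time units and any version-specific units would still
need handling):

    SELECT name, setting, unit,
           setting::numeric * CASE unit
               WHEN 'B'   THEN 1
               WHEN 'kB'  THEN 1024
               WHEN '8kB' THEN 8192
               WHEN 'MB'  THEN 1024 * 1024
           END AS bytes
      FROM pg_settings
     WHERE unit IN ('B', 'kB', '8kB', 'MB');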

I'll deal with it externally, but I hope it was a reasonable proposal and not 
completely off-the-wall.

Regards,
Greg Clough.


* Confidential Disclaimer *

This e-mail message and any attachments are confidential. Dissemination, 
distribution or copying of this e-mail or any attachments by anyone other than 
the intended recipient is prohibited. If you are not the intended recipient, 
please notify Ipreo immediately by replying to this e-mail, and destroy all 
copies of this e-mail and any attachments. If you have received this e-mail as 
part of a marketing communication and you would like to unsubscribe from future 
marketing communications, please review our privacy 
policy for more information.



Re: rule-related crash in v11

2018-05-25 Thread Tom Lane
Robert Haas  writes:
> On Fri, May 25, 2018 at 11:21 AM, Tom Lane  wrote:
>> Looking at what mod_stmt is used for, we've got
>> (1) the Assert that's crashing, and its siblings, which are just meant
>> to cross-check that mod_stmt is consistent with the SPI return code.
>> (2) two places that decide to enforce STRICT behavior if mod_stmt
>> is true.
>> 
>> I think we could just drop those Asserts.  As for the other things,
>> maybe we should force STRICT on the basis of examining the raw
>> parsetree (which really is immutable) rather than what came out of
>> the planner.  It doesn't seem like a great idea that INSERT ... INTO
>> should stop being considered strict if there's currently a rewrite
>> rule that changes it into something else.

> Yes, that does sound like surprising behavior.

On closer inspection, the specific Assert you're hitting is the only one
of the bunch that's really bogus.  It's actually almost backwards: if we
have a statement that got rewritten into some other kind of statement by a
rule, it almost certainly *was* an INSERT/UPDATE/DELETE to start with.
But I think there are corner cases where spi.c can return SPI_OK_REWRITTEN
where we'd not have set mod_stmt (e.g., empty statement list), so I'm not
comfortable with asserting mod_stmt==true either.  So the attached patch
just drops it.

Not sure if this is worth a regression test case.

regards, tom lane

diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 1eb421b..ef013bc 100644
*** a/src/pl/plpgsql/src/pl_exec.c
--- b/src/pl/plpgsql/src/pl_exec.c
*** exec_stmt_execsql(PLpgSQL_execstate *est
*** 4031,4049 
  		foreach(l, SPI_plan_get_plan_sources(expr->plan))
  		{
  			CachedPlanSource *plansource = (CachedPlanSource *) lfirst(l);
- 			ListCell   *l2;
  
! 			foreach(l2, plansource->query_list)
  			{
! Query	   *q = lfirst_node(Query, l2);
! 
! if (q->canSetTag)
! {
! 	if (q->commandType == CMD_INSERT ||
! 		q->commandType == CMD_UPDATE ||
! 		q->commandType == CMD_DELETE)
! 		stmt->mod_stmt = true;
! }
  			}
  		}
  	}
--- 4031,4050 
  		foreach(l, SPI_plan_get_plan_sources(expr->plan))
  		{
  			CachedPlanSource *plansource = (CachedPlanSource *) lfirst(l);
  
! 			/*
! 			 * We could look at the raw_parse_tree, but it seems simpler to
! 			 * check the command tag.  Note we should *not* look at the Query
! 			 * tree(s), since those are the result of rewriting and could have
! 			 * been transmogrified into something else entirely.
! 			 */
! 			if (plansource->commandTag &&
! (strcmp(plansource->commandTag, "INSERT") == 0 ||
!  strcmp(plansource->commandTag, "UPDATE") == 0 ||
!  strcmp(plansource->commandTag, "DELETE") == 0))
  			{
! stmt->mod_stmt = true;
! break;
  			}
  		}
  	}
*** exec_stmt_execsql(PLpgSQL_execstate *est
*** 4108,4119 
  			break;
  
  		case SPI_OK_REWRITTEN:
- 			Assert(!stmt->mod_stmt);
  
  			/*
  			 * The command was rewritten into another kind of command. It's
  			 * not clear what FOUND would mean in that case (and SPI doesn't
! 			 * return the row count either), so just set it to false.
  			 */
  			exec_set_found(estate, false);
  			break;
--- 4109,4120 
  			break;
  
  		case SPI_OK_REWRITTEN:
  
  			/*
  			 * The command was rewritten into another kind of command. It's
  			 * not clear what FOUND would mean in that case (and SPI doesn't
! 			 * return the row count either), so just set it to false.  Note
! 			 * that we can't assert anything about mod_stmt here.
  			 */
  			exec_set_found(estate, false);
  			break;


Re: rule-related crash in v11

2018-05-25 Thread Robert Haas
On Fri, May 25, 2018 at 11:21 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> For reasons that I'm not quite sure about, the following test case
>> crashes in v11, but not earlier versions:
>
> Crashes just fine in prior versions for me, at least as far back as 9.3,
> but probably much further.  Note that I was doing an extra select fun()
> right after creating the function --- I don't think that should affect
> the behavior, but maybe it does?  Or maybe you were testing non-assert
> builds?

Oops, yeah, my back-branch builds were without assertions.

> The core problem here seems to be that exec_stmt_execsql sets mod_stmt
> once when the query is first planned, and doesn't account for the idea
> that the statement's classification might change later.  But adding
> (or removing) a DO INSTEAD rule can indeed change that.
>
> Looking at what mod_stmt is used for, we've got
>
> (1) the Assert that's crashing, and its siblings, which are just meant
> to cross-check that mod_stmt is consistent with the SPI return code.
>
> (2) two places that decide to enforce STRICT behavior if mod_stmt
> is true.
>
> I think we could just drop those Asserts.  As for the other things,
> maybe we should force STRICT on the basis of examining the raw
> parsetree (which really is immutable) rather than what came out of
> the planner.  It doesn't seem like a great idea that INSERT ... INTO
> should stop being considered strict if there's currently a rewrite
> rule that changes it into something else.

Yes, that does sound like surprising behavior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Transactions involving multiple postgres foreign servers

2018-05-25 Thread Robert Haas
On Fri, May 18, 2018 at 11:21 AM, Masahiko Sawada  wrote:
> Regarding to API design, should we use 2PC for a distributed
> transaction if both two or more 2PC-capable foreign servers and
> 2PC-non-capable foreign server are involved with it?  Or should we end
> up with an error? the 2PC-non-capable server might be either that has
> 2PC functionality but just disables it or that doesn't have it.

It seems to me that this is functionality that many people will not
want to use.  First, doing a PREPARE and then a COMMIT for each FDW
write transaction is bound to be more expensive than just doing a
COMMIT.  Second, because the default value of
max_prepared_transactions is 0, this can only work at all if special
configuration has been done on the remote side.  Because of the second
point in particular, it seems to me that the default for this new
feature must be "off".  It would make no sense to ship a default configuration
of PostgreSQL that doesn't work with the default configuration of
postgres_fdw, and I do not think we want to change the default value
of max_prepared_transactions.  It was changed from 5 to 0 a number of
years back for good reason.

So, I think the question could be broadened a bit: how you enable this
feature if you want it, and what happens if you want it but it's not
available for your choice of FDW?  One possible enabling method is a
GUC (e.g. foreign_twophase_commit).  It could be true/false, with true
meaning use PREPARE for all FDW writes and fail if that's not
supported, or it could be three-valued, like require/prefer/disable,
with require throwing an error if PREPARE support is not available and
prefer using PREPARE where available but without failing when it isn't
available.  Another possibility could be to make it an FDW option,
possibly capable of being set at multiple levels (e.g. server or
foreign table).  If any FDW involved in the transaction demands
distributed 2PC semantics then the whole transaction must have those
semantics or it fails.  I was previous leaning toward the latter
approach, but I guess now the former approach is sounding better.  I'm
not totally certain I know what's best here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: SCRAM with channel binding downgrade attack

2018-05-25 Thread Heikki Linnakangas


On 25 May 2018 17:44:16 EEST, Robert Haas  wrote:
>On Wed, May 23, 2018 at 2:46 AM, Heikki Linnakangas 
>wrote:
>> We could provide "tls-unique" and "tls-server-end-point" in addition to
>> those, but I'd consider those to be developer only settings, useful only
>> for testing the protocol.
>
>It seems to me that this is really another sort of thing altogether.
>Whether or not you want to insist on channel binding is a completely
>separate thing from which channel binding methods you're willing to
>use.  It seems to me like the most logical thing would be to make
>these two separate connection options. 

Works for me.

- Heikki



Re: rule-related crash in v11

2018-05-25 Thread Tom Lane
Robert Haas  writes:
> For reasons that I'm not quite sure about, the following test case
> crashes in v11, but not earlier versions:

Crashes just fine in prior versions for me, at least as far back as 9.3,
but probably much further.  Note that I was doing an extra select fun()
right after creating the function --- I don't think that should affect
the behavior, but maybe it does?  Or maybe you were testing non-assert
builds?

The core problem here seems to be that exec_stmt_execsql sets mod_stmt
once when the query is first planned, and doesn't account for the idea
that the statement's classification might change later.  But adding
(or removing) a DO INSTEAD rule can indeed change that.

Looking at what mod_stmt is used for, we've got

(1) the Assert that's crashing, and its siblings, which are just meant
to cross-check that mod_stmt is consistent with the SPI return code.

(2) two places that decide to enforce STRICT behavior if mod_stmt
is true.

I think we could just drop those Asserts.  As for the other things,
maybe we should force STRICT on the basis of examining the raw
parsetree (which really is immutable) rather than what came out of
the planner.  It doesn't seem like a great idea that INSERT ... INTO
should stop being considered strict if there's currently a rewrite
rule that changes it into something else.

regards, tom lane



Re: [GSoC] github repo and initial work

2018-05-25 Thread Charles Cui
Got it, will go for the second method. Will let you guys know the progress.

2018-05-25 4:05 GMT-07:00 Aleksander Alekseev :

> Hello Charles,
>
> I suggest beginning with the second approach and adding to_jsonb/from_jsonb
> later. Both approaches are fine, but it seems to me that most users would
> expect a separate type, thus it's more important.
>
> --
> Best regards,
> Aleksander Alekseev
>


Re: [GSoC] github repo and initial work

2018-05-25 Thread Charles Cui
Thanks for correcting me, will definitely study citext and see how a new
type is registered in a plugin.
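
For the record, the citext-style registration boils down to SQL along these
lines (a rough sketch only; the type and function names are placeholders for
whatever the extension actually defines):

CREATE TYPE thrift;             -- shell type

CREATE FUNCTION thrift_in(cstring) RETURNS thrift
    AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;
CREATE FUNCTION thrift_out(thrift) RETURNS cstring
    AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;

CREATE TYPE thrift (
    INPUT = thrift_in,
    OUTPUT = thrift_out,
    INTERNALLENGTH = VARIABLE,
    STORAGE = extended
);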


2018-05-24 23:23 GMT-07:00 Aleksandr Parfenov :

> On Thu, 24 May 2018 18:25:28 -0700
> Charles Cui  wrote:
> > The second is to provide a thrift type just like json or jsonb. When you
> > create a table, postgres knows the ::thrift keyword.
> > I think method one should be easier to implement because it is only
> > limited to this plugin. Method two needs to modify the postgres kernel to
> > register a new type, which may be time consuming,
> > but is more natural. Any ideas on this?
> >
> > Thanks, Charles
>
> Hi Charles,
>
> I prefer the second way, with a separate type. But I think it is a good idea
> to wait for an answer from your project mentor or someone else.
>
> I'm not an expert in PostgreSQL user-defined types, but AFAIK, it
> doesn't require changes in PostgreSQL core, since types can be created
> in extensions. It doesn't require changes to the grammar or anything like that.
> You can look at citext contrib as an example.
>
> --
> Aleksandr Parfenov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
>


Re: Performance regression with PostgreSQL 11 and partitioning

2018-05-25 Thread Robert Haas
On Fri, May 25, 2018 at 10:30 AM, Thomas Reiss  wrote:
> Then I used the following to compare the planning time :
> explain (analyze) SELECT * FROM t1 WHERE dt = '2018-05-25';
>
> With PostgreSQL 10, planning time is 66ms; in v11, planning time rises to
> 143ms. I also did a little test with more than 20k partitions, and while
> the planning time was reasonable with PG10 (287.453 ms), it exploded
> with v11 with 4578.054 ms.
>
> Perf showed that the functions find_appinfos_by_relids and
> bms_is_member consume most of the CPU time with v11. With v10, these
> functions don't appear. It seems that find_appinfos_by_relids was
> introduced by commit 480f1f4329f.

Hmm.  Have you verified whether that commit is actually the one that
caused the regression?  It's certainly possible, but I wouldn't expect
calling find_appinfos_by_relids() with 1 AppendRelInfo to be too much
more expensive than calling find_childrel_appendrelinfo() as the
previous code did.  I wonder if some later change, perhaps related to
pruning, just caused this code path to be hit more often.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: SCRAM with channel binding downgrade attack

2018-05-25 Thread Robert Haas
On Wed, May 23, 2018 at 2:46 AM, Heikki Linnakangas  wrote:
> "tls-unique" and "tls-server-end-point" are overly technical to users. They
> don't care which one is used, there's no difference in security.
> Furthermore, if we add another channel binding type in the future, perhaps
> because someone finds a vulnerability in those types and a new one is added
> to address it, users would surely like to be automatically switched over to
> the new binding type. If they have "tls-unique" hardcoded in connection
> strings, that won't happen.

+1.

> So I think the options should be "prefer", "disable", and "require". That's
> also symmetrical with sslmode, which is nice.

+1.

> We could provide "tls-unique" and "tls-server-end-point" in addition to
> those, but I'd consider those to be developer only settings, useful only for
> testing the protocol.

It seems to me that this is really another sort of thing altogether.
Whether or not you want to insist on channel binding is a completely
separate thing from which channel binding methods you're willing to
use.  It seems to me like the most logical thing would be to make
these two separate connection options.  If it's discovered that
tls-unique sucks bigtime for some reason, people are going to want to
turn it off whether they are preferring channel binding or requiring
it.

> It would be nice to have a libpq setting like "authenticate_server=require",
> which would mean "I want man-in-the-middle protection". With that, a
> connection would be allowed, if either the server's SSL certificate is
> verified as with "sslmode=verify-full", *or* SCRAM authentication with
> channel binding was used. Or perhaps cram it into sslmode,
> "sslmode=verify-full-or-scram-channel-binding", just with a nicer name. (We
> can do that after v11 though, I think.)

IMHO we could do all of this after v11.  This seems like late
development being jammed through after beta1 to me.  But I just work
here.

Semantically, I see the value of an option that means basically
mitm_protection=true, but in practice I'm not sure there is any real
advantage over having the user just specify either ssmode=verify-full
or channelbinding=require depending on the technology they wish to
use.  They probably have a specific technology in mind; it's hard to
see how they're going to get an environment configured otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Unexpected casts while using date_trunc()

2018-05-25 Thread Tom Lane
Chris Bandy  writes:
> On 5/24/18 2:31 PM, Tom Lane wrote:
>> Andrew Gierth  writes:
>>> There's also the option of adding an explicit function
>>> date_trunc(text,date) returns date, which is a workaround that I (and
>>> probably quite a few other people) have used.

> Are we in agreement that the return type should be date?

That is a good question, actually.  That would be a larger behavior change
than just avoiding the undesired conversion to TZ.  I had imagined this
as just being equivalent to date_trunc(text, date::timestamp).  Casting
the result back down to date seems safe, though.
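
Concretely, that variant is essentially a one-liner (a sketch only, not
addressing the range question below):

CREATE FUNCTION date_trunc(text, date) RETURNS date
    LANGUAGE sql IMMUTABLE
    AS $$ SELECT date_trunc($1, $2::timestamp)::date $$;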

Another thing to consider is that the effective range of date is wider
than timestamp's, meaning coerce-to-timestamp can fail.  Is it worth
providing a whole additional code path so we never coerce the date to
timestamp at all?  I'd tend to think not, but perhaps somebody wants to
argue differently.

regards, tom lane



Re: Avoiding Tablespace path collision for primary and standby

2018-05-25 Thread Tom Lane
Ashwin Agrawal  writes:
> Proposing to create directory with timestamp at time of creating tablespace
> and create symbolic link to it instead.

I'm skeptical that this solves your problem.  What happens when the CREATE
TABLESPACE command is replicated to the standby with sub-second delay?
Clock skew is another reason to doubt that timestamp == unique identifier,
which is essentially what you're assuming here.

Even if we fixed that, the general idea of including a quasi-random
component in the directory name seems like it would have a lot of
unpleasant side effects in terms of reproducibility, testability, etc.

regards, tom lane



Re: Enhancement Idea - Expose the active value of a parameter in pg_settings

2018-05-25 Thread Robert Haas
On Fri, May 25, 2018 at 10:22 AM, Tom Lane  wrote:
> Personally, what I'd rather do is try to get rid of GUC behaviors like
> "the effective value depends on something else".  But convenience and
> backwards compatibility may be arguments against that.

Yeah.  The dependency between various GUCs is something that I don't
like very much either.  However, AFAICT, the limited number of GUCs
that have behaviors like this mostly all do for good reasons,
generally that there are two GUCs which people usually want set the
same way but occasionally not.  Decoupling the GUCs could lead to
people accidentally shooting themselves in the foot, and as you
mention it would also break configurations that work today when users
try to upgrade.  Maybe it would be worth going through that pain if we
could point to some really compelling benefit (if you do this, the
whole system can run 10% faster!) but I know of no such benefit.  It
seems more like a wart than a bullet wound.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Unexpected casts while using date_trunc()

2018-05-25 Thread Chris Bandy

On 5/24/18 2:31 PM, Tom Lane wrote:

Andrew Gierth  writes:

"Tom" == Tom Lane  writes:
  Tom> Yeah.  There are two relevant variants of date_trunc():
  [...]
  Tom> So we probably ought to change the docs here.



There's also the option of adding an explicit function
date_trunc(text,date) returns date, which is a workaround that I (and
probably quite a few other people) have used. I think having such a
function added to core would be less surprising than the current
behavior.


Ah!  Yes, of course, that would be better.  Seems like a workable
solution for Chris, too.  We still can't back-patch it, though.

regards, tom lane



I could take a pass at this about two weeks from now. (I won't be sad if 
someone else beats me to it.)


Are we in agreement that the return type should be date? I wasn't able 
to find a definitive reference for the expected behavior of date_trunc. 
Shall I replicate the behavior of casting to/from timestamp? What should 
happen when the user requests some time portion (e.g. hour) be truncated?


-- Chris



Performance regression with PostgreSQL 11 and partitioning

2018-05-25 Thread Thomas Reiss
Hello,

I spent some time to test the new features on partitioning with the
beta1. I noticed a potentially huge performance regression with
plan-time partition pruning.

To show the issue, I used this DO statement to generate some partitions,
one per day :
DO $$
DECLARE
  part_date date;
  ddl text;
BEGIN
  CREATE TABLE t1 (
num INTEGER NOT NULL,
dt  DATE NOT NULL
  ) PARTITION BY LIST (dt);

  FOR part_date IN SELECT d FROM generate_series(date '2010-01-01',
'2020-12-31', interval '1 day') d LOOP
ddl := 'CREATE TABLE t1_' || to_char(part_date, 'YYYY_MM_DD') || E'
PARTITION OF t1 FOR VALUES IN (\'' || part_date || E'\')';
EXECUTE ddl;
  END LOOP;
END;
$$;

Then I used the following to compare the planning time :
explain (analyze) SELECT * FROM t1 WHERE dt = '2018-05-25';

With PostgreSQL 10, planning time is 66ms; in v11, planning time rises to
143ms. I also did a little test with more than 20k partitions, and while
the planning time was reasonable with PG10 (287.453 ms), it exploded
with v11 with 4578.054 ms.

Perf showed that the functions find_appinfos_by_relids and
bms_is_member consume most of the CPU time with v11. With v10, these
functions don't appear. It seems that find_appinfos_by_relids was
introduced by commit 480f1f4329f.

Regards,
Thomas



Re: Subplan result caching

2018-05-25 Thread Robert Haas
On Wed, May 23, 2018 at 6:03 AM, Laurenz Albe  wrote:
> I think the cache should be limited in size, perhaps by work_mem.
> Also, there probably should be a GUC for this, defaulting to "off".

Eh, why?  Generally we want performance optimizations turned on by
default; otherwise only highly-knowledgeable users end up getting any
benefit.  An exception is when there's some significant downside to
the optimization, but I don't think I understand what the downside of
this could be.  Maybe it's bad if we populate the cache and never use
it?  But until the patch is written we don't know whether there's
really a case that regresses like that, and if there is, it would be
better to fix it so it doesn't rather than make the feature
off-by-default.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Subplan result caching

2018-05-25 Thread Robert Haas
On Wed, May 23, 2018 at 12:51 PM, Tom Lane  wrote:
> Ah.  That would work, though it'd make the number of subquery executions
> even less predictable (since some logically-equal values would compare
> as physically unequal).

In most cases that seems fine.  It might not be fine when the subquery
contains volatile functions, though.  I think I'd be sad if I wrote a
query expecting random() to be executed 26000 times and it got
executed 11 times instead.  But if the optimizer finds a way to
execute int4pl fewer times, that seems like a good thing.
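
As a purely hypothetical illustration of that concern: in the query below the
correlated subplan depends only on t.grp, which takes 11 distinct values, so a
cache keyed on that parameter would evaluate random() 11 times rather than
26000 times.

SELECT t.grp,
       (SELECT random() + t.grp) AS r
FROM (SELECT x % 11 AS grp FROM generate_series(1, 26000) AS x) AS t;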

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Enhancement Idea - Expose the active value of a parameter in pg_settings

2018-05-25 Thread Tom Lane
Robert Haas  writes:
> On Fri, May 25, 2018 at 10:11 AM, Andrew Dunstan
>  wrote:
>> He's proposing an extra column to show the actual value used, so
>> distinguishing them shouldn't be a problem.

> For most settings, that column would just be a duplicate.  For a
> handful, it would pull in the value of some other GUC.  If somebody
> finds that useful, cool, they can write a view that does it and
> install it on their own cluster.  I don't think that it makes a lot of
> sense to put it in core, though.  My guess would be that more people
> would be annoyed or confused by the extra column than would be pleased
> or enlightened by it.  I could of course be wrong.

Yeah, that's pretty much my evaluation --- given the tiny number of
GUCs that have behaviors like this, an extra column seems like it
would mostly be confusing.  Plus, pg_settings is too darn wide already.

Personally, what I'd rather do is try to get rid of GUC behaviors like
"the effective value depends on something else".  But convenience and
backwards compatibility may be arguments against that.

regards, tom lane



Re: Enhancement Idea - Expose the active value of a parameter in pg_settings

2018-05-25 Thread Robert Haas
On Fri, May 25, 2018 at 10:11 AM, Andrew Dunstan
 wrote:
> He's proposing an extra column to show the actual value used, so
> distinguishing them shouldn't be a problem.

For most settings, that column would just be a duplicate.  For a
handful, it would pull in the value of some other GUC.  If somebody
finds that useful, cool, they can write a view that does it and
install it on their own cluster.  I don't think that it makes a lot of
sense to put it in core, though.  My guess would be that more people
would be annoyed or confused by the extra column than would be pleased
or enlightened by it.  I could of course be wrong.
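
For what it's worth, such a view is easy to put together outside core; a
minimal sketch covering just the autovacuum_work_mem case could look like
this:

CREATE VIEW my_pg_settings AS
SELECT s.name, s.setting, s.unit,
       CASE WHEN s.name = 'autovacuum_work_mem' AND s.setting = '-1'
            THEN (SELECT m.setting FROM pg_settings m
                  WHERE m.name = 'maintenance_work_mem')
            ELSE s.setting
       END AS active_setting
FROM pg_settings s;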

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Enhancement Idea - Expose the active value of a parameter in pg_settings

2018-05-25 Thread Andrew Dunstan
On Fri, May 25, 2018 at 9:58 AM, Tom Lane  wrote:
> Greg Clough  writes:
>> I would like to propose that we expose the "active" value of parameters in 
>> pg_settings, instead of "-1".  In this example below, when it's set to "-1" 
>> I need to know that autovacuum_work_mem is related to the setting of 
>> maintenance_work_mem, so that I can determine that the actual setting is 
>> 64MB:
>
> If we did that, how would you tell the difference between "-1" and a hard
> setting of 64MB?
>

He's proposing an extra column to show the actual value used, so
distinguishing them shouldn't be a problem.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: In what range of the code can we read debug_query_string?

2018-05-25 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
> It is set for other kinds of message, (parse, bind, execute). I
> think fastpath, close, flush and sync don't need that. If it is
> reasonable to assume that we can see debug_query_string in the
> DESCRIBE path, the attached patch would work.

I think this patch is a bad idea.  In the first place, debug_query_string
is generally understood as the current *client* command string, not
something internally generated.  In the second place, it looks way too
easy for this to leave debug_query_string as a dangling pointer,
ie pointing at a dropped plancache entry.

There might be a case for providing some way to get at the current
actively-executing query's string, using a protocol that can deal
with nesting of execution.  (This might already be more or less
possible via ActivePortal->sourceText, depending on what you think
the semantics ought to be.)  But debug_query_string was never
intended to do that.

regards, tom lane



rule-related crash in v11

2018-05-25 Thread Robert Haas
For reasons that I'm not quite sure about, the following test case
crashes in v11, but not earlier versions:

create table abc(n int);
create table xyz(n int);
create function fun() returns int as $$begin insert into abc values
(1); return 1; end$$ language plpgsql;
create or replace  rule rule1  as on insert to abc do delete from xyz;
select fun();
create or replace  rule rule1  as on insert to abc do instead   delete from xyz;
select fun();

I get:

TRAP: FailedAssertion("!(!stmt->mod_stmt)", File: "pl_exec.c", Line: 4106)

The xyz table doesn't seem to be important; I can reproduce the crash
if I change 'delete from xyz' to 'do nothing' in both places.  But
it's critical to 'SELECT fun()' after the first CREATE OR REPLACE RULE
statement and before the second one.  The INSERT inside the function
is also critical -- omitting that prevents the crash.  I suspect the
problem is likely related to some of the changes made to spi.c rather
than to changes made on the plpgsql side of things, but that might be
wrong.

My colleague Tushar Ahuja deserves credit for finding this problem; I
can take credit only for modifying his test case to work against
unmodified PostgreSQL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Enhancement Idea - Expose the active value of a parameter in pg_settings

2018-05-25 Thread Tom Lane
Greg Clough  writes:
> I would like to propose that we expose the "active" value of parameters in 
> pg_settings, instead of "-1".  In this example below, when it's set to "-1" I 
> need to know that autovacuum_work_mem is related to the setting of 
> maintenance_work_mem, so that I can determine that the actual setting is 64MB:

If we did that, how would you tell the difference between "-1" and a hard
setting of 64MB?

regards, tom lane



Enhancement Idea - Expose the active value of a parameter in pg_settings

2018-05-25 Thread Greg Clough
Hi Hackers,

I would like to propose that we expose the "active" value of parameters in 
pg_settings, instead of "-1".  In this example below, when it's set to "-1" I 
need to know that autovacuum_work_mem is related to the setting of 
maintenance_work_mem, so that I can determine that the actual setting is 64MB:

postgresql.conf
===
#autovacuum_work_mem = -1               # min 1MB, or -1 to use maintenance_work_mem
#maintenance_work_mem = 64MB            # min 1MB


pg_settings
===
postgres=# SELECT name, setting, unit FROM pg_settings WHERE name IN 
('autovacuum_work_mem','maintenance_work_mem');
         name          | setting | unit
-----------------------+---------+------
 autovacuum_work_mem   | -1      | kB
 maintenance_work_mem  | 65536   | kB
(2 rows)


I think it would make sense to create a new column called something like 
"active_setting", which will allow simple verification of the setting that's in 
use without having to know the specifics about the parameter's relationship 
with others, and the version of PostgreSQL.  For parameters where "-1" is a 
real setting that has meaning (e.g. log_min_duration_statement), it should 
return "-1".

I presume that, as part of running the server, we have already decoded what 
the active values should be, so I'm asking if we can expose this data via 
the pg_settings view?

Regards
Greg Clough

https://ipreo.com




Re: pg_replication_slot_advance to return NULL instead of 0/0 if slot not advanced

2018-05-25 Thread Simon Riggs
On 25 May 2018 at 13:12, Magnus Hagander  wrote:
>
>
> On Fri, May 25, 2018 at 7:28 AM, Michael Paquier 
> wrote:
>>
>> Hi all,
>>
>> When attempting to use multiple times pg_replication_slot_advance on a
>> slot, then the caller gets back directly InvalidXLogRecPtr as result,
>> for example:
>> =# select * from pg_replication_slot_advance('popo', 'FF/0');
>>  slot_name |  end_lsn
>> ---+---
>>  popo  | 0/60021E0
>> (1 row)
>> =# select * from pg_replication_slot_advance('popo', 'FF/0');
>>  slot_name | end_lsn
>> ---+-
>>  popo  | 0/0
>> (1 row)
>>
>> Wouldn't it be more simple to return NULL to mean that the slot could
>> not be moved forward?  That would be easier to parse for clients.
>> Please see the attached.
>
>
> I agree that returning 0/0 on this is wrong.

Agreed

> However, can this actually occur for any case other than exactly the case
> of "moving the position to where the position already is"? If I look at the
> physical slot path at least that seems to be the only case, and in that case
> I think the correct thing to return would be the new position, and not NULL.

Docs say
"Returns name of the slot and real position to which it was advanced to."

so agreed

> If we actually *fail* to move the position, we give an error.
>
> Actually, isn't there also a race there? That is, if we try to move it, we
> check that we're not trying to move it backwards, and throw an error, but
> that's checked outside the lock. Then later we actually move it, and check
> *again* if we try to move it backwards, but if that one fails we return
> InvalidXLogRecPtr (which can happen in the case of concurrent activity on
> the slot, I think)? In this case, maybe we should just re-check that and
> raise an error appropriately?

Agreed

> (I haven't looked at the logical slot path, but I assume it would have
> something similar in it)

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_replication_slot_advance to return NULL instead of 0/0 if slot not advanced

2018-05-25 Thread Magnus Hagander
On Fri, May 25, 2018 at 7:28 AM, Michael Paquier 
wrote:

> Hi all,
>
> When attempting to use multiple times pg_replication_slot_advance on a
> slot, then the caller gets back directly InvalidXLogRecPtr as result,
> for example:
> =# select * from pg_replication_slot_advance('popo', 'FF/0');
>  slot_name |  end_lsn
> ---+---
>  popo  | 0/60021E0
> (1 row)
> =# select * from pg_replication_slot_advance('popo', 'FF/0');
>  slot_name | end_lsn
> ---+-
>  popo  | 0/0
> (1 row)
>
> Wouldn't it be more simple to return NULL to mean that the slot could
> not be moved forward?  That would be easier to parse for clients.
> Please see the attached.
>

I agree that returning 0/0 on this is wrong.

However, can this actually occur for any case other than exactly the case
of "moving the position to where the position already is"? If I look at the
physical slot path at least that seems to be the only case, and in that
case I think the correct thing to return would be the new position, and not
NULL. If we actually *fail* to move the position, we give an error.

Actually, isn't there also a race there? That is, if we try to move it, we
check that we're not trying to move it backwards, and throw an error, but
that's checked outside the lock. Then later we actually move it, and check
*again* if we try to move it backwards, but if that one fails we return
InvalidXLogRecPtr (which can happen in the case of concurrent activity on
the slot, I think)? In this case, maybe we should just re-check that and
raise an error appropriately?

(I haven't looked at the logical slot path, but I assume it would have
something similar in it)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


[Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-05-25 Thread Moon, Insung
Hello Hackers,

This proposes a way to develop "Table-level" Transparent Data Encryption (TDE)
and Key Management Service (KMS) support in PostgreSQL.


Issues on data encryption of PostgreSQL
==
Currently, in PostgreSQL, data can be encrypted using the pgcrypto tool.
However, it is inconvenient to use pgcrypto to encrypt data in some cases.

There are two significant inconveniences.

First, if we use pgcrypto to encrypt/decrypt data, we must call pgcrypto
functions everywhere we encrypt/decrypt.
Second, we must modify application code significantly if we want to migrate
to PostgreSQL from other databases that use TDE.

To resolve these inconveniences, many users want TDE to be supported.
There have also been a few proposals, comments, and questions about supporting
TDE in the PostgreSQL community.

However, PostgreSQL does not currently support TDE, so in the development
community there are discussions about whether it's necessary to support TDE
or not.

In these discussions, the following requirements were identified as necessary
to support TDE in PostgreSQL.

1) The performance overhead of encrypting and decrypting database data must be
minimized
2) Need to support WAL encryption.
3) Need to support Key Management Service.

Therefore, I'd like to propose a new design of TDE that deals with the above
requirements.
Since this feature will become very large, I'd like to hear opinions from the
community before starting to write the patch.

First, my proposal is table-level TDE, which means that the user can specify
which tables are to be encrypted.
Indexes, the TOAST table and WAL associated with a table that enables TDE are
also encrypted.

Moreover, I want to support encryption for large objects as well.
But I haven't found a good way to do that so far, so I'd like to leave it as a
future TODO.

My proposal has five characteristic features of "table-level TDE".

1) Buffer-level data encryption and decryption
2) Per-table encryption
3) 2-tier encryption key management
4) Working with external key management services(KMS)
5) WAL encryption

Here are more details for each item.


1. Buffer-level data encryption and decryption
==
Transparent data encryption and decryption accompany storage operations.
With the ordinary way, such as using pgcrypto, the biggest problem with
encrypted data is the performance overhead of decrypting the data
each time queries are run.

My proposal is to encrypt and decrypt data when performing disk I/O operations
to minimize performance overhead.
Therefore, the data in the shared memory layer is unencrypted so that the
performance overhead is minimized.

With this design, data encryption/decryption implementations can be developed
by modifying the code of the storage and buffer manager modules,
which are responsible for performing disk I/O operations.


2. Per-table encryption
==
Users can enable TDE per table as they want.
I introduce a new storage parameter "encryption_enabled" which enables TDE at
the table level.

// Generate  the encryption table
   CREATE TABLE foo WITH ( ENCRYPTION_ENABLED = ON );

// Change to the non-encryption table
   ALTER TABLE foo SET ( ENCRYPTION_ENABLED = OFF );

This approach minimizes the overhead for tables that do not require encryption.
For tables that enable TDE, the corresponding table key will be generated with
random values, and it is stored in a new system
catalog after being encrypted by the master key.

BTW, I want to support CBC mode encryption[3]. However, I'm not sure how to use
the IV in CBC mode for this proposal.
I'd like to hear opinions from security engineers.


3. 2-tier encryption key management
==
When it comes time to change cryptographic keys, there is a performance
overhead to decrypt and re-encrypt all data.

To solve this problem we employ 2-tier encryption.
With 2-tier encryption, all table keys are stored in the database cluster after
being encrypted by the master key, and master keys
must be stored externally to PostgreSQL.

Therefore, without the master key, it is impossible to decrypt the table keys.
Thus, it is impossible to decrypt the database data.

When changing the key, it's not necessary to re-encrypt all data.
We use the new master key only to decrypt and re-encrypt the table keys, which
minimizes the performance overhead.

For table keys, all TDE-enabled tables have different table keys.
As for master keys, all databases have different master keys. Table keys are
encrypted by the master key of their own database.
For WAL encryption, we have another cryptographic key. The WAL key is also
encrypted by a master key, but it is shared across the
database cluster.


4. Working with external key management services(KMS)
==
A key management service is an integrated approach for generating, fetching and 
managing encryption keys for key control. 
They may cover all aspects of security from the secure generation of keys, 
secure storing 

Re: [GSoC] github repo and initial work

2018-05-25 Thread Aleksander Alekseev
Hello Charles,

I suggest beginning with the second approach and adding to_jsonb/from_jsonb
later. Both approaches are fine, but it seems to me that most users would
expect a separate type, thus it's more important.

-- 
Best regards,
Aleksander Alekseev


signature.asc
Description: PGP signature


Avoiding Tablespace path collision for primary and standby

2018-05-25 Thread Ashwin Agrawal
Currently, if primary and standby are set up on the same machine (which is
always the case for development), with CREATE TABLESPACE xyz LOCATION '/abc',
primary and mirror both write to the "/abc/TABLESPACE_VERSION_DIRECTORY"
directory. Collision is certainly not an issue in any production deployment,
but solving it for development still seems extremely helpful.

Proposing to create a directory with a timestamp at the time of creating the
tablespace and create a symbolic link to it instead. So, it would be something
like "/abc/PG_<timestamp>/TABLESPACE_VERSION_DIRECTORY". This helps avoid
collision of primary and standby, as timestamps would differ between the
primary creating the tablespace and the mirror replaying the record for the
same.

Ideally, another advantage of this scheme is that creating the additional
TABLESPACE_VERSION_DIRECTORY inside it can also be eliminated, as even during
pg_upgrade the paths will not collide. So, it helps to avoid constructing
this additional string part at multiple places in the code for tablespace
access.

Since this is an on-disk change, it may have an impact on existing tools.

Attaching a patch to showcase the proposal. Tested by creating a tablespace
with primary and standby on the same machine; the tablespace test also passes.


adding_timestamp_to_tablespace_path
Description: Binary data


Re: XLogWrite uses palloc within a critical section

2018-05-25 Thread Kyotaro HORIGUCHI
Thank you for the comment.

At Fri, 25 May 2018 09:05:21 +0300, Heikki Linnakangas  wrote 
in <466a3c6d-7986-8cb1-d908-e85aa6a09...@iki.fi>
> On 25/05/18 07:45, Kyotaro HORIGUCHI wrote:
> > Hello.
> > I happened to see the following in XLogWrite.
> > 
> >>   ereport(PANIC,
> >>   (errcode_for_file_access(),
> >>errmsg("could not seek in log file %s to offset %u: %m",
> >>   XLogFileNameP(ThisTimeLineID, openLogSegNo),
> >>   startoffset)));
> > where XLogFileNameP calls palloc internally, and it is within a
> > critical section there. So it ends with an assertion failure hiding
> > the PANIC message. We should use XLogFileName instead. The
> > problem has existed at least since 9.3. The code is frequently
> > revised, so the patch had to be varied into four separate files.
> 
> Hmm, that's rather annoying, it sure would be handy if we could do
> small palloc()s like this in error messages safely.
> 
> I wonder if we could switch to ErrorContext in errstart()? That way,
> any small allocations in the ereport() arguments, like what
> XLogFileNameP() does, would be allocated in ErrorContext.

It already controls the error context on a per-recursion-level basis,
but that doesn't work for the top-level errmsg(). I'm not sure about the
purpose of edata->assoc_context; it seems to always be set to
ErrorContext.

As a PoC, just switching to ErrorContext at the top level and restoring it
afterwards seems to work fine. (The first attached patch does that; it
changes only ereport.)

# The second is a less invasive version of the previous patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 16531f7a0f..0e4877240f 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -154,6 +154,8 @@ static bool saved_timeval_set = false;
 static char formatted_start_time[FORMATTED_TS_LEN];
 static char formatted_log_time[FORMATTED_TS_LEN];
 
+/* Store memory context to restore after error reporting */
+MemoryContext	elog_oldcontext = NULL;
 
 /* Macro for checking errordata_stack_depth is reasonable */
 #define CHECK_STACK_DEPTH() \
@@ -396,8 +398,11 @@ errstart(int elevel, const char *filename, int lineno,
 	 * Any allocations for this error state level should go into ErrorContext
 	 */
 	edata->assoc_context = ErrorContext;
-
 	recursion_depth--;
+
+	/* Set memory context to ErrorContext if this is the toplevel */
+	if (recursion_depth == 0)
+		elog_oldcontext = MemoryContextSwitchTo(ErrorContext);
 	return true;
 }
 
@@ -414,19 +419,12 @@ errfinish(int dummy,...)
 {
	ErrorData  *edata = &errordata[errordata_stack_depth];
 	int			elevel;
-	MemoryContext oldcontext;
 	ErrorContextCallback *econtext;
 
 	recursion_depth++;
 	CHECK_STACK_DEPTH();
 	elevel = edata->elevel;
 
-	/*
-	 * Do processing in ErrorContext, which we hope has enough reserved space
-	 * to report an error.
-	 */
-	oldcontext = MemoryContextSwitchTo(ErrorContext);
-
 	/*
 	 * Call any context callback functions.  Errors occurring in callback
 	 * functions will be treated as recursive errors --- this ensures we will
@@ -509,9 +507,12 @@ errfinish(int dummy,...)
 	errordata_stack_depth--;
 
 	/* Exit error-handling context */
-	MemoryContextSwitchTo(oldcontext);
 	recursion_depth--;
 
+	/* Restore memory context if this is the top-level */
+	if (recursion_depth == 0 && elog_oldcontext)
+		MemoryContextSwitchTo(elog_oldcontext);
+
 	/*
 	 * Perform error recovery action as specified by elevel.
 	 */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adbd6a2126..9cd7a106ba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2477,7 +2477,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	ereport(PANIC,
 			(errcode_for_file_access(),
 			 errmsg("could not seek in log file %s to offset %u: %m",
-	XLogFileNameP(ThisTimeLineID, openLogSegNo),
+	XLogFileNameEP(ThisTimeLineID, openLogSegNo),
 	startoffset)));
 openLogOff = startoffset;
 			}
@@ -2500,7 +2500,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			(errcode_for_file_access(),
 			 errmsg("could not write to log file %s "
 	"at offset %u, length %zu: %m",
-	XLogFileNameP(ThisTimeLineID, openLogSegNo),
+	XLogFileNameEP(ThisTimeLineID, openLogSegNo),
 	openLogOff, nbytes)));
 }
 nleft -= written;
@@ -3743,7 +3743,8 @@ XLogFileClose(void)
 		ereport(PANIC,
 (errcode_for_file_access(),
  errmsg("could not close log file %s: %m",
-		XLogFileNameP(ThisTimeLineID, openLogSegNo))));
+		XLogFileNameEP(ThisTimeLineID, openLogSegNo))));
+
 	openLogFile = -1;
 }
 
@@ -10160,7 +10161,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 ereport(PANIC,
 		(errcode_for_file_access(),
 		 errmsg("could not fsync log file %s: %m",
-XLogFileNameP(ThisTimeLineID, segno))));
+XLogFileNameEP(ThisTimeLineID, segno))));
 			break;

Re: Keeping temporary tables in shared buffers

2018-05-25 Thread Ashwin Agrawal
On Thu, May 24, 2018 at 11:50 PM, Andres Freund  wrote:

> On 2018-05-25 09:40:10 +0300, Heikki Linnakangas wrote:
> > On 25/05/18 09:25, Asim Praveen wrote:
> > > My parochial vision of the overhead is restricted to 4 * NBuffers of
> > > additional shared memory, as 4 bytes are being added to BufferTag. May I
> > > please get some enlightenment?
> >
> > Any extra fields in BufferTag make computing the hash more expensive.
> It's a
> > very hot code path, so any cycles spent are significant.
>
> Indeed, very much so.
>
> But I'm not sure we need anything in the tags themselves. We don't
> denote buffers for unlogged tables in the tag itself either. As Tom
> observed the oids for temp tables are either unique or can be made
> unique easy enough.  And the temporaryness can be declared in a bit in
> the buffer header, rather than the tag itself. I don't see why a hash
> lookup would need to know that.
>

Currently, relfilenodes (specifically spcid, dbid, relfilenode) for temp and
regular tables can collide, as temp files have a "t_nnn" representation
on-disk. Due to this, the relfilenode allocation logic can assign the same
relfilenode to a temp and a non-temp table. If relfilenode uniqueness can be
achieved, then the need for adding anything to the buffer tag goes away.

> When starting to work on the radix tree stuff I had, to address the size
> of buffer tag issue you mention above, a prototype patch that created a
> shared 'relfilenode' table. That guaranteed that relfilenodes are
> unique.  That'd work here as well, and would allow to get rid of a good
> chunk of uglyness we have around allocating relfilenodes right now (like
> not unlinking files etc).
>

That would be great!


>
> But more generally, I don't see why it'd be that problematic to just get
> rid of the backendid? I don't really see any technical necessity to have
> it.
>

Backendid was also added, it seems, for the same reason of not having unique
relfilenodes for temp tables. So, yes, with uniqueness guaranteed this can go
away as well.


Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.

2018-05-25 Thread Arseny Sher

Michael Paquier  writes:

> but for now I would recommend to register this patch to the next
> commit fest under the category "Bug Fixes" so as it does not fall into
> the cracks: https://commitfest.postgresql.org/18/
>
> I have added an entry to the open items in the section for older bugs:
> https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Older_Bugs
> However this list tends to be... Er... Ignored.
>

Thank you, Michael. I have created the commitfest entry.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Fix some error handling for read() and errno

2018-05-25 Thread Michael Paquier
On Fri, May 25, 2018 at 01:19:58PM +0900, Kyotaro HORIGUCHI wrote:
> The case is not that of an empty file. read() reads 0 bytes without
> error while lseek has told us that the file has *more* data. I
> don't think that can happen. How about just commenting with
> something like the following?

Actually it can be useful to report that no data has been read and that
more data was expected, like that:
+   else if (nread == 0)
+   ereport(ERROR,
+   (errmsg("no data read from file \"%s\": expected %zu more 
bytes",
+   path, bytesleft)));
--
Michael
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 87942b4cca..d487347cc6 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -683,6 +683,11 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 	pgstat_report_wait_start(WAIT_EVENT_SLRU_READ);
 	if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
 	{
+		/*
+		 * XXX: Note that this may actually report success if the number
+		 * of bytes read is positive, but lacking data so that errno is
+		 * not set.
+		 */
 		pgstat_report_wait_end();
 		slru_errcause = SLRU_READ_FAILED;
 		slru_errno = errno;
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 65194db70e..52f634e706 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1219,6 +1219,7 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 	uint32		crc_offset;
 	pg_crc32c	calc_crc,
 file_crc;
+	int			r;
 
 	TwoPhaseFilePath(path, xid);
 
@@ -1272,15 +1273,28 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 	buf = (char *) palloc(stat.st_size);
 
 	pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_READ);
-	if (read(fd, buf, stat.st_size) != stat.st_size)
+	r = read(fd, buf, stat.st_size);
+	if (r != stat.st_size)
 	{
+		int		save_errno = errno;
+
 		pgstat_report_wait_end();
 		CloseTransientFile(fd);
 		if (give_warnings)
-			ereport(WARNING,
-	(errcode_for_file_access(),
-	 errmsg("could not read two-phase state file \"%s\": %m",
-			path)));
+		{
+			if (r < 0)
+			{
+errno = save_errno;
+ereport(WARNING,
+		(errcode_for_file_access(),
+		 errmsg("could not read two-phase state file \"%s\": %m",
+path)));
+			}
+			else
+ereport(WARNING,
+		(errmsg("could not read two-phase state file \"%s\": %d read, expected %zu",
+path, r, stat.st_size)));
+		}
 		pfree(buf);
 		return NULL;
 	}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adbd6a2126..2f2102eb71 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3395,21 +3395,24 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 
 		if (nread > 0)
 		{
+			int		r;
+
 			if (nread > sizeof(buffer))
 nread = sizeof(buffer);
 			errno = 0;
 			pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_READ);
-			if (read(srcfd, buffer, nread) != nread)
+			r = read(srcfd, buffer, nread);
+			if (r != nread)
 			{
-if (errno != 0)
+if (r < 0)
 	ereport(ERROR,
 			(errcode_for_file_access(),
 			 errmsg("could not read file \"%s\": %m",
 	path)));
 else
 	ereport(ERROR,
-			(errmsg("not enough data in file \"%s\"",
-	path)));
+			(errmsg("not enough data in file \"%s\": read %d, expected %d",
+	path, r, nread)));
 			}
 			pgstat_report_wait_end();
 		}
@@ -11594,6 +11597,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	int			emode = private->emode;
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
+	int			r;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -11675,8 +11679,10 @@ retry:
 	if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
 	{
 		char		fname[MAXFNAMELEN];
+		int			save_errno = errno;
 
 		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+		errno = save_errno;
 		ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
 (errcode_for_file_access(),
  errmsg("could not seek in log segment %s to offset %u: %m",
@@ -11685,16 +11691,28 @@ retry:
 	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+	r = read(readFile, readBuf, XLOG_BLCKSZ);
+	if (r != XLOG_BLCKSZ)
 	{
 		char		fname[MAXFNAMELEN];
+		int			save_errno = errno;
 
 		pgstat_report_wait_end();
 		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-(errcode_for_file_access(),
- errmsg("could not read from log segment %s, offset %u: %m",
-		fname, readOff)));
+
+		if (r < 0)
+		{
+			errno = save_errno;
+			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+	

Re: Possible bug in logical replication.

2018-05-25 Thread Arseny Sher

Michael Paquier  writes:

> Maybe I am being too naive, but wouldn't it be just enough to update the
> confirmed flush LSN to ctx->reader->ReadRecPtr?  This way, the slot
> advances up to the beginning of the last record where user wants to
> advance, and not the beginning of the next record:

The same problem would have to be handled in pg_logical_slot_get_changes_guts
and in apply worker feedback; and there is a convention that all commits since
confirmed_flush must be decoded, so we risk decoding such a boundary
commit twice.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Keeping temporary tables in shared buffers

2018-05-25 Thread Andres Freund
On 2018-05-25 09:40:10 +0300, Heikki Linnakangas wrote:
> On 25/05/18 09:25, Asim Praveen wrote:
> > My parochial vision of the overhead is restricted to 4 * NBuffers of
> > additional shared memory, as 4 bytes are being added to BufferTag.  May I
> > please get some enlightenment?
> 
> Any extra fields in BufferTag make computing the hash more expensive. It's a
> very hot code path, so any cycles spent are significant.

Indeed, very much so.

But I'm not sure we need anything in the tags themselves. We don't
denote buffers for unlogged tables in the tag itself either. As Tom
observed, the oids for temp tables are either unique or can be made
unique easily enough.  And the temporariness can be declared in a bit in
the buffer header, rather than the tag itself. I don't see why a hash
lookup would need to know that.


> In relation to Andres' patches to rewrite the buffer manager with a radix
> tree, there was actually some discussion of trying to make BufferTag
> *smaller*.

FWIW, in the latest version that doesn't matter that much
anymore. Instead of one big tree it's a hashtable of trees (although it
potentially should rather be a tree of trees). The hashtable maps to a
radix tree, and that radix tree is just indexed by the offset.  The root
of the tree is then cached inside the smgr, avoiding the need to
repeatedly look it up.


> For example, we could rearrange things so that
> pg_class.relfilenode is 64 bits wide. Then you could assume that it never
> wraps around, and is unique across all relations in the cluster. Then you
> could replace the 12-byte relfilenode+dbid+spcid triplet, with just the
> 8-byte relfilenode. Doing something like that might be the solution here,
> too.

OTOH it's quite useful to have the buffertag be something that can (or
rather could) be efficiently searched for in a hierarchical
fashion. While, by far, not as crucial performancewise as dropping an
individual relation, it would be nice not to have to scan all of s_b to
drop a database.


> > Temp tables have a unique filename on disk: t<backendID>_<relfilenode>.  The
> > logic to assign OIDs and relfilenodes, however, doesn't differ.  Given a
> > RelFileNode, it is not possible to tell if it's a temp table or not.
> > RelFileNodeBackend allows for that distinction but it's not used by buffer
> > manager.
> 
> Could you store the backendid in BufferDesc, outside of BufferTag? Is it
> possible for a normal table and a temporary table to have the same
> relfilenode+dbid+spcid triplet?

When starting to work on the radix tree stuff I had, to address the size
of buffer tag issue you mention above, a prototype patch that created a
shared 'relfilenode' table. That guaranteed that relfilenodes are
unique.  That'd work here as well, and would allow to get rid of a good
chunk of uglyness we have around allocating relfilenodes right now (like
not unlinking files etc).

But more generally, I don't see why it'd be that problematic to just get
rid of the backendid? I don't really see any technical necessity to have
it.

Greetings,

Andres Freund



Re: Possible bug in logical replication.

2018-05-25 Thread Arseny Sher
Hello,

Kyotaro HORIGUCHI  writes:

> restart_lsn stays at the beginning of a transaction until the
> transaction ends so just using restart_lsn allows repeated
> decoding of a transaction, in short, rewinding occurs. The
> function works only for inactive slot so the current code works
> fine on this point.

Sorry, I do not follow. restart_lsn is advanced whenever there is a
consistent snapshot dumped (in xl_running_xacts) which is old enough to
wholly decode all xacts not yet confirmed by the client. Could you
please elaborate, what's wrong with that?

> In addition to that, restart_lsn can also be on a
> page boundary.

Do you have an example of that? restart_lsn is set initially to WAL
insert position at ReplicationSlotReserveWal, and later it always points
to xl_running_xacts record with consistent snapshot dumped.

> So directly set ctx->reader->EndRecPtr by startlsn fixes the
> problem, but I found another problem here.

There is a minor issue with the patch. Now slot advancement hangs
polling for new WAL on my example from [1]; most probably because we
must exit the loop when ctx->reader->EndRecPtr == moveto.

> The function accepts any LSN even if it is not at the begiining
> of a record. We will see errors or crashs or infinite waiting or
> maybe any kind of trouble by such values. The moved LSN must
> always be at the "end of a record" (that is, at the start of the
> next recored). The attached patch also fixes this.

Indeed, but we have these problems only if we are trying to read WAL
since confirmed_flush.

[1] https://www.postgresql.org/message-id/873720e4hf.fsf%40ars-thinkpad

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Redesigning the executor (async, JIT, memory efficiency)

2018-05-25 Thread Andres Freund
Hi,

On 2018-05-25 09:26:47 +0300, Heikki Linnakangas wrote:
> Cool stuff!

Thanks!

> On 25/05/18 06:35, Andres Freund wrote:
For example, this converts TPCH's Q01:
> > 
> > tpch_1[26142][1]=# SET client_min_messages ='log';
> > tpch_1[26142][1]=# SELECT
> > l_returnflag,
> > l_linestatus,
> > sum(l_quantity) AS sum_qty,
> > sum(l_extendedprice) AS sum_base_price,
> > sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
> > sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
> > avg(l_quantity) AS avg_qty,
> > avg(l_extendedprice) AS avg_price,
> > avg(l_discount) AS avg_disc,
> > count(*) AS count_order
> > FROM
> > lineitem
> > WHERE
> > l_shipdate <= date '1998-12-01' - interval '74 days'
> > GROUP BY
> > l_returnflag,
> > l_linestatus
> > ORDER BY
> > l_returnflag,
> > l_linestatus;
> > LOG:  0: pp:
> > 
> > into:
> > 
> > 0: init_sort
> > 1: seqscan_first
> > 2: seqscan [j empty 5] > s0
> > 3: qual [j fail 2] < scan s0
> > 4: hashagg_tuple [j 2] < s0
> > 5: drain_hashagg [j empty 7] > s1
> > 6: sort_tuple [j 5] < s1
> > 7: sort
> > 8: drain_sort [j empty 10] > s2
> > 9: return < s2 [next 8]
> > 10: done
> > 
> > where s are numbered slots, j are, potentially conditional, jumps. This
> > works for a number of plans, but there's several where we currently bail
> > out.
> 
> How is this going to look like in EXPLAIN, even without ANALYZE? Would it
> still show a tree plan, or this linear program, or both?

I think we're going to have to continue showing the tree plan. I think
the linear version is far too hard to understand for anything
nontrivial.  I'd definitely continue to expect the Plan tree to be tree
shaped, and I'm currently not foreseeing entirely getting rid of PlanState
trees (albeit with smaller nodes, and potentially with relative rather
than absolute pointers to children).  So I don't really foresee a huge
problem continuing to display a tree shaped plan.

I'd expect there'd be an EXPLAIN option to show the linear plan, but
that that'd not be used frequently outside of development work.

Greetings,

Andres Freund



Re: Keeping temporary tables in shared buffers

2018-05-25 Thread Heikki Linnakangas

On 25/05/18 09:25, Asim Praveen wrote:

On Thu, May 24, 2018 at 8:19 PM, Tom Lane  wrote:


So then you have to think about how to transition smoothly between "rel
is in local buffers" and "rel is in shared buffers", bearing in mind that
ever having the same page in two different buffers would be disastrous.


Local buffers would not be used at all if temp tables start residing in
shared buffers.  The transition mentioned above shouldn't be needed.


What is the performance difference between the local buffer manager and 
the shared buffer manager? The local buffer manager avoids all the 
locking overhead, which has to amount to something, but how big a 
difference is it?



I think that would be a deal breaker right there, because of the
distributed overhead of making the tags bigger.  However, I don't
actually understand why you would need to do that.  Temp tables
have unique OIDs/relfilenodes anyway, don't they?  Or if I'm
misremembering and they don't, couldn't we make them so?


My parochial vision of the overhead is restricted to 4 * NBuffers of
additional shared memory, as 4 bytes are being added to BufferTag.  May I
please get some enlightenment?


Any extra fields in BufferTag make computing the hash more expensive. 
It's a very hot code path, so any cycles spent are significant.


In relation to Andres' patches to rewrite the buffer manager with a 
radix tree, there was actually some discussion of trying to make 
BufferTag *smaller*. For example, we could rearrange things so that 
pg_class.relfilenode is 64 bits wide. Then you could assume that it 
never wraps around, and is unique across all relations in the cluster. 
Then you could replace the 12-byte relfilenode+dbid+spcid triplet, with 
just the 8-byte relfilenode. Doing something like that might be the 
solution here, too.



Temp tables have a unique filename on disk: t<backendID>_<relfilenode>.  The
logic to assign OIDs and relfilenodes, however, doesn't differ.  Given a
RelFileNode, it is not possible to tell if it's a temp table or not.
RelFileNodeBackend allows for that distinction but it's not used by buffer
manager.


Could you store the backendid in BufferDesc, outside of BufferTag? Is it 
possible for a normal table and a temporary table to have the same 
relfilenode+dbid+spcid triplet?


- Heikki



Re: Redesigning the executor (async, JIT, memory efficiency)

2018-05-25 Thread Heikki Linnakangas

Cool stuff!

On 25/05/18 06:35, Andres Freund wrote:

For example, this converts TPCH's Q01:

tpch_1[26142][1]=# SET client_min_messages ='log';
tpch_1[26142][1]=# SELECT
l_returnflag,
l_linestatus,
sum(l_quantity) AS sum_qty,
sum(l_extendedprice) AS sum_base_price,
sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
avg(l_quantity) AS avg_qty,
avg(l_extendedprice) AS avg_price,
avg(l_discount) AS avg_disc,
count(*) AS count_order
FROM
lineitem
WHERE
l_shipdate <= date '1998-12-01' - interval '74 days'
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus;
LOG:  0: pp:

into:

0: init_sort
1: seqscan_first
2: seqscan [j empty 5] > s0
3: qual [j fail 2] < scan s0
4: hashagg_tuple [j 2] < s0
5: drain_hashagg [j empty 7] > s1
6: sort_tuple [j 5] < s1
7: sort
8: drain_sort [j empty 10] > s2
9: return < s2 [next 8]
10: done

where s are numbered slots, j are, potentially conditional, jumps. This
works for a number of plans, but there's several where we currently bail
out.
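
Purely for illustration, here is a sketch of how such a linear program of conditionally-jumping opcodes might be dispatched. Every type and helper in it is invented and is not part of Andres's prototype.

/*
 * Invented sketch only: dispatching a linear program of executor
 * opcodes with conditional jumps, in the spirit of the listing above.
 */
#include <stdbool.h>

typedef enum ExecOp
{
	OP_SEQSCAN,					/* produce the next tuple into a slot */
	OP_QUAL,					/* filter the tuple in a slot */
	OP_DONE						/* end of the program */
	/* ... init_sort, hashagg_tuple, drain_sort, return, etc. ... */
} ExecOp;

typedef struct ExecStep
{
	ExecOp		op;
	int			jump_target;	/* where to go on "empty" / "fail" */
	int			slot;			/* numbered slot this step reads or writes */
} ExecStep;

/* placeholders for the per-opcode work */
extern bool seqscan_next(int slot);
extern bool qual_passes(int slot);

static void
run_program(ExecStep *program)
{
	int			pc = 0;			/* program counter over the steps */

	for (;;)
	{
		ExecStep   *step = &program[pc];

		switch (step->op)
		{
			case OP_SEQSCAN:
				if (!seqscan_next(step->slot))
				{
					pc = step->jump_target;		/* [j empty N] */
					continue;
				}
				break;
			case OP_QUAL:
				if (!qual_passes(step->slot))
				{
					pc = step->jump_target;		/* [j fail N] */
					continue;
				}
				break;
			case OP_DONE:
				return;
		}
		pc++;					/* otherwise fall through to the next step */
	}
}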


What is this going to look like in EXPLAIN, even without ANALYZE? Would 
it still show a tree plan, or this linear program, or both?



> Despite being entirely unoptimized this already yields a measurable
> speedup over the current executor for the TPCH queries it supports.


Oh, that's impressive!

- Heikki



Re: Keeping temporary tables in shared buffers

2018-05-25 Thread Asim Praveen
On Thu, May 24, 2018 at 8:19 PM, Tom Lane  wrote:
>
> So then you have to think about how to transition smoothly between "rel
> is in local buffers" and "rel is in shared buffers", bearing in mind that
> ever having the same page in two different buffers would be disastrous.

Local buffers would not be used at all if temp tables start residing in
shared buffers.  The transition mentioned above shouldn't be needed.

>
> I think that would be a deal breaker right there, because of the
> distributed overhead of making the tags bigger.  However, I don't
> actually understand why you would need to do that.  Temp tables
> have unique OIDs/relfilenodes anyway, don't they?  Or if I'm
> misremembering and they don't, couldn't we make them so?

My parochial vision of the overhead is restricted to 4 * NBuffers of
additional shared memory, as 4 bytes are being added to BufferTag.  May I
please get some enlightenment?

Temp tables have a unique filename on disk: t_<backendID>_<relfilenode>.  The
logic to assign OIDs and relfilenodes, however, doesn't differ.  Given a
RelFileNode, it is not possible to tell if it's a temp table or not.
RelFileNodeBackend allows for that distinction but it's not used by the buffer
manager.

>
> taking a performance hit to avoid it could be a net loss.  The only reason
> why you'd care about writing at all is to try to make the buffers clean
> in case they have to be reclaimed for some other use --- and if the
> checkpointer does such a write instead of the bgwriter, why's that bad?

Yes, a temp table's buffer would need to be written out only if it needs to
be repurposed for a different page.  It is not bad; our description wasn't
clear enough.

Asim


Re: Possible bug in logical replication.

2018-05-25 Thread Michael Paquier
On Thu, May 24, 2018 at 10:14:01AM +0900, Kyotaro HORIGUCHI wrote:
> At Wed, 23 May 2018 15:56:22 +0900, Masahiko Sawada  
> wrote in 
>> Another possible way might be to make XLogFindNextRecord valid in
>> backend code and move startlsn to the first valid record with an lsn
>> >= startlsn by using that function. Please find attached patch.
> 
> Another reason for the code is the fact that confirmed_lsn stores
> EndRecPtr after the last XLogReadRecord call. That is, by definition,
> confirmed_lsn must be at the start of a record or on a page boundary,
> and we error out if not. For that reason, calling XLogFindNextRecord
> would not be the right thing to do here. We should just skip a header
> if we are on a boundary, but that is already done in XLogReadRecord.

Maybe I am being too naive, but wouldn't it be just enough to update the
confirmed flush LSN to ctx->reader->ReadRecPtr?  This way, the slot
advances up to the beginning of the last record to which the user wants to
advance, and not to the beginning of the next record:
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -395,7 +395,7 @@ pg_logical_replication_slot_advance(XLogRecPtr startlsn, XLogRecPtr moveto)
 
 	if (ctx->reader->EndRecPtr != InvalidXLogRecPtr)
 	{
-		LogicalConfirmReceivedLocation(moveto);
+		LogicalConfirmReceivedLocation(ctx->reader->ReadRecPtr);
 
 		/*
 		 * If only the confirmed_flush_lsn has changed the slot won't get

I agree with the point made above about not manually touching the XLogReader
context outside of xlogreader.c.
--
Michael




Re: [GSoC] github repo and initial work

2018-05-25 Thread Aleksandr Parfenov
On Thu, 24 May 2018 18:25:28 -0700
Charles Cui  wrote:
> The second is to provide a thrift type just like json or jsonb. When you
> create a table, postgres knows the ::thrift keyword.
> I think method one should be easier to implement because it is limited
> to this plugin. Method two needs to modify the postgres kernel to
> register a new type, which may be time consuming,
> but is more natural. Any ideas on this?
> 
> Thanks, Charles

Hi Charles,

I prefer the second way with a separate type. But I think it is a good idea
to wait for an answer from your project mentor or someone else.

I'm not an expert in PostgreSQL user-defined types, but AFAIK, it
doesn't require changes in PostgreSQL core, since types can be created
in extensions. It doesn't require changes to the grammar or anything like
that. You can look at the citext contrib module as an example.
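
For instance, a rough sketch of how an extension's SQL script could register such a type without any core or grammar changes; the thrift_in/thrift_out C functions are placeholders the extension would have to provide, in the same spirit as citext:

-- Rough sketch only: registering a "thrift" base type from an extension's
-- SQL script.  The C-level I/O functions are placeholders.
CREATE TYPE thrift;             -- shell type, so the I/O functions can refer to it

CREATE FUNCTION thrift_in(cstring) RETURNS thrift
    AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION thrift_out(thrift) RETURNS cstring
    AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;

CREATE TYPE thrift (
    INPUT = thrift_in,
    OUTPUT = thrift_out,
    INTERNALLENGTH = VARIABLE,
    STORAGE = extended
);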

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: XLogWrite uses palloc within a critical section

2018-05-25 Thread Heikki Linnakangas

On 25/05/18 07:45, Kyotaro HORIGUCHI wrote:

> Hello.
>
> I happened to see the following in XLogWrite.
>
> 	ereport(PANIC,
> 			(errcode_for_file_access(),
> 			 errmsg("could not seek in log file %s to offset %u: %m",
> 					XLogFileNameP(ThisTimeLineID, openLogSegNo),
> 					startoffset)));
>
> where XLogFileNameP calls palloc within, and it is within a
> critical section there. So it ends with an assertion failure hiding
> the PANIC message. We should use XLogFileName instead. The
> problem has existed at least since 9.3. The code is frequently
> revised, so the fix needed four different patch files.
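
For reference, a minimal sketch of the rewrite being suggested: build the name with the XLogFileName() macro into a stack buffer instead of palloc()ing via XLogFileNameP(). The macro's argument list differs across branches (newer ones also take wal_segment_size), so treat this as illustrative only, not the actual patch.

/*
 * Sketch only: avoid palloc() inside the critical section by filling a
 * caller-provided stack buffer.
 */
char		xlogfname[MAXFNAMELEN];

XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo, wal_segment_size);
ereport(PANIC,
		(errcode_for_file_access(),
		 errmsg("could not seek in log file %s to offset %u: %m",
				xlogfname, startoffset)));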


Hmm, that's rather annoying; it sure would be handy if we could do small 
palloc()s like this in error messages safely.


I wonder if we could switch to ErrorContext in errstart()? That way, any 
small allocations in the ereport() arguments, like what XLogFileNameP() 
does, would be allocated in ErrorContext.


- Heikki



Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

2018-05-25 Thread Thomas Munro
On Thu, Apr 26, 2018 at 8:33 AM, Andres Freund  wrote:
> On 2018-04-25 14:41:44 -0400, Robert Haas wrote:
>> On Mon, Apr 16, 2018 at 2:13 AM, Andrew Gierth
>>  wrote:
>> > The code that detects sequential behavior can not distinguish between
>> > pread() and lseek+read, it looks only at the actual offset of the
>> > current request compared to the previous one for the same fp.
>> >
>> >  Thomas> +1 for adopting pread()/pwrite() in PG12.
>> >
>> > ditto
>>
>> Likewise.
>
> +1 as well. Medium term I foresee usage of at least pwritev(), and
> possibly also preadv(). Being able to write out multiple buffers at once
> is pretty crucial if we ever want to do direct IO.

Also if we ever use threads and want to share file descriptors we'd
have to use it.

CC'ing Oskari Saarenmaa who proposed a patch for this a couple of years back[1].

Oskari, would you like to update your patch and post it for the
September commitfest?  At first glance, it probably needs autoconf-fu
to check if pread()/pwrite() are supported and fallback code, so
someone should update the patch to do that or explain why it's not
needed based on standards we require.  At least Windows apparently
needs special handling (ReadFile() and WriteFile() with an OVERLAPPED
object).
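
To make the Windows point concrete, here is a rough sketch (not Oskari's patch) of what a pg_pread() wrapper could look like: an OVERLAPPED-based path for Windows, plus a hypothetical HAVE_PREAD autoconf symbol guarding an lseek()+read() fallback for platforms that lack pread() entirely.

/*
 * Illustrative sketch only, not Oskari's patch.  HAVE_PREAD stands in
 * for a hypothetical autoconf-detected symbol; when it is defined,
 * callers would just use pread() directly.  ssize_t may need a typedef
 * on Windows.
 */
#ifdef WIN32
#include <windows.h>
#include <io.h>
#include <stdint.h>

static ssize_t
pg_pread(int fd, void *buf, size_t size, off_t offset)
{
	OVERLAPPED	overlapped = {0};
	HANDLE		handle = (HANDLE) _get_osfhandle(fd);
	DWORD		result;

	/* supply the offset through the OVERLAPPED structure */
	overlapped.Offset = (DWORD) offset;
	overlapped.OffsetHigh = (DWORD) ((uint64_t) offset >> 32);
	if (!ReadFile(handle, buf, (DWORD) size, &result, &overlapped))
		return -1;				/* real code would map GetLastError() to errno */
	return result;
}
#elif !defined(HAVE_PREAD)
#include <unistd.h>

/* last-resort fallback: loses pread()'s independence from the seek position */
static ssize_t
pg_pread(int fd, void *buf, size_t size, off_t offset)
{
	if (lseek(fd, offset, SEEK_SET) < 0)
		return -1;
	return read(fd, buf, size);
}
#endif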

Practically speaking, are there any Unix-like systems outside museums
that don't have them?  According to the man pages I looked at, this
stuff is from System V R4 (1988) and appeared in ancient BSD strains
too.  Hmm, I suppose it's possible that pademelon and gaur don't: they
apparently run HP-UX 10.20 (1996) which Wikipedia tells me is derived
from System V R3!  I can see that current HP-UX does have them... but
unfortunately their man pages don't have a HISTORY section.

FWIW these functions just showed up in the latest POSIX standard[2]
(issue 7, 2017/2018?), having moved from "XSI option" to "base".

[1] 
https://www.postgresql.org/message-id/flat/7fdcb664-4f8a-8626-75df-ffde85005829%40ohmu.fi
[2] http://pubs.opengroup.org/onlinepubs/9699919799/functions/pread.html

-- 
Thomas Munro
http://www.enterprisedb.com