Re: BUG #15977: Inconsistent behavior in chained transactions
Hello,

COMMIT AND CHAIN in an implicit block leaves blockState as TBLOCK_STARTED, which does not trigger the chaining, but ROLLBACK AND CHAIN sets blockState to TBLOCK_ABORT_PENDING, so the chaining is triggered. I think disabling s->chain beforehand should produce the desired behavior.

The patch applies with "patch", although "git apply" complained because of CRLF line terminations forced by the octet-stream MIME type. The patch compiles cleanly, and "make check" is ok. The patch works for me, and the solution seems appropriate. It should be committed for pg 12.0.

A test could be added in "regress/sql/transactions.sql"; I'd suggest something like:

  -- implicit transaction and not chained.
  COMMIT AND CHAIN;
  COMMIT;
  ROLLBACK AND CHAIN;
  ROLLBACK;

which should show the appropriate "no transaction in progress" warnings.

The doc could also be made a little clearer about what to expect when there is no explicit transaction in progress.

-- 
Fabien.
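For reference, a sketch of what the suggested test's expected output would look like outside any explicit transaction block (the warning text here is my reading of the current message wording, so treat it as an assumption):

```sql
-- implicit transaction and not chained.
COMMIT AND CHAIN;    -- WARNING:  there is no transaction in progress
COMMIT;              -- WARNING:  there is no transaction in progress
ROLLBACK AND CHAIN;  -- WARNING:  there is no transaction in progress
ROLLBACK;            -- WARNING:  there is no transaction in progress
```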
Re: refactoring - share str2*int64 functions
Bonjour Michaël,

>> - *ptr && WHATEVER(*ptr)
>> *ptr is redundant, WHATEVER yields false on '\0', and it costs on each
>> char but at the end. It might be debatable in some places, e.g. it is
>> likely that there are no spaces in the string, but likely that there are
>> more than one digit.
>
> Still, this makes the checks less robust?

I do not see any downside, but maybe I lack imagination.

On my laptop, isWHATEVER is implemented through an array mapping characters to a bitfield saying whether each char is WHATEVER, depending on the bit. This array is well defined for index 0 ('\0'). If an implementation is based on comparisons, for isdigit it would be:

  c >= '0' && c <= '9'

Then checking c != '\0' is redundant with c >= '0'.

Given the way the character checks are used in the function, we do not go beyond the end of the string, because we only proceed further when a character is actually recognized, else we return. So I cannot see any robustness issue, just a few cycles saved.

>> Hmmm. Have you looked at the fallback cases when the corresponding
>> builtins are not available?
>>
>> I'm unsure of a reliable way to detect a generic unsigned int overflow
>> without expensive dividing back and having to care about zero…
>
> Mr Freund has mentioned that here:
> https://www.postgresql.org/message-id/20190717184820.iqz7schxdbucm...@alap3.anarazel.de

Yep, that is what I mean by expensive: there is an integer division, which is avoided if b is known to be 10, hence is not zero, and the limit value can be checked directly on the input without having to perform a division each time. So I was pretty happy with my two discreet, small and efficient tests.

> That's also a matter of code and interface consistency IMHO.

Possibly. ISTM that part of the motivation is to reduce costs for heavily used conversions, e.g. on COPY.
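To illustrate the point about avoiding a per-character division: when the base is known to be 10, the overflow limits are compile-time constants, so no division happens at run time. The following is a hypothetical illustration, not the patch's actual code (it also shows the if/do-while shape discussed in the thread):

```c
#include <stdint.h>
#include <stdbool.h>
#include <ctype.h>

/*
 * Parse a base-10 unsigned 64-bit integer.  Returns false on empty
 * input, trailing garbage, or overflow.  Because the base is fixed at
 * 10, UINT64_MAX / 10 and UINT64_MAX % 10 are compile-time constants:
 * the overflow check costs two comparisons, no division.
 */
bool
str_to_uint64_base10(const char *s, uint64_t *out)
{
    uint64_t    acc = 0;

    if (!isdigit((unsigned char) *s))
        return false;           /* empty string or non-digit start */

    do
    {
        unsigned int digit = (unsigned int) (*s - '0');

        /* would acc * 10 + digit exceed UINT64_MAX? */
        if (acc > UINT64_MAX / 10 ||
            (acc == UINT64_MAX / 10 && digit > UINT64_MAX % 10))
            return false;       /* overflow */

        acc = acc * 10 + digit;
        s++;
    } while (isdigit((unsigned char) *s));

    if (*s != '\0')
        return false;           /* trailing garbage */

    *out = acc;
    return true;
}
```

Note that the loop only advances while a digit is recognized, which is the reason the '\0' check is redundant, as argued above.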
The "scanf" function is overly expensive because it has to interpret its format, and we have to check for overflows anyway…

Anyway, if we assume that the builtins exist and rely on an efficient hardware check, maybe we do not care about the fallback cases, which would just be slow but never executed.

Note that all this is moot, as all instances of string to uint64 conversion do not check for errors.

Attached v7 does create uint64 overflow inline functions. The stuff is not yet moved to "common/int.c", a file which does not exist, even if "common/int.h" does.

-- 
Fabien.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 221b47298c..28891ba337 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -79,6 +79,7 @@
 #include "utils/builtins.h"
 #include "utils/hashutils.h"
 #include "utils/memutils.h"
+#include "common/string.h"

 PG_MODULE_MAGIC;

@@ -1022,7 +1023,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 			/* parse command tag to retrieve the number of affected rows. */
 			if (completionTag && strncmp(completionTag, "COPY ", 5) == 0)
-				rows = pg_strtouint64(completionTag + 5, NULL, 10);
+				(void) pg_strtouint64(completionTag + 5, &rows);
 			else
 				rows = 0;
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 2c0ae395ba..8e75d52b06 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -21,6 +21,7 @@
 #include "catalog/heap.h"
 #include "catalog/pg_type.h"
 #include "commands/trigger.h"
+#include "common/string.h"
 #include "executor/executor.h"
 #include "executor/spi_priv.h"
 #include "miscadmin.h"
@@ -2338,8 +2339,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
 				CreateTableAsStmt *ctastmt = (CreateTableAsStmt *) stmt->utilityStmt;

 				if (strncmp(completionTag, "SELECT ", 7) == 0)
-					_SPI_current->processed =
-						pg_strtouint64(completionTag + 7, NULL, 10);
+					(void) pg_strtouint64(completionTag + 7, &_SPI_current->processed);
 				else
 				{
 					/*
@@ -2361,8 +2361,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
 			else if (IsA(stmt->utilityStmt, CopyStmt))
 			{
 				Assert(strncmp(completionTag, "COPY ", 5) == 0);
-				_SPI_current->processed = pg_strtouint64(completionTag + 5,
-														 NULL, 10);
+				(void) pg_strtouint64(completionTag + 5, &_SPI_current->processed);
 			}
 		}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..17cde42b4d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -34,6 +34,7 @@
 #include "fmgr.h"
 #include "miscadmin.h"
+#include "common/string.h"
 #include "nodes/extensible.h"
 #include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
@@ -80,7 +81,7 @@
 #define READ_UINT64_FIELD(fldname) \
 	token = pg_strtok(&length);		/* skip :fldname */ \
 	token = pg_strtok(&length);		/* get field value */ \
-	local_node->fldname = pg_strtouint64(token, NULL, 10)
+	(void) pg_strtouint64(token, &local_node->fldname)

/* Rea
Re: doc: update PL/pgSQL sample loop function
Hi

On Thu, 29 Aug 2019 at 5:03, Ian Barwick wrote:
> Hi
>
> Here:
>
>   https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-RECORDS-ITERATING
>
> we have a sample PL/pgSQL function (dating from at least 7.2) demonstrating
> query result loops, which refreshes some pseudo materialized views stored in
> a user-defined table.
>
> As we've had proper materialized views since 9.3, I thought it might
> be nice to update this with a self-contained sample which can be used
> as-is; see attached patch.
>
> (As a side note, the current sample function contains a couple of "%s"
> placeholders which should be just "%"; a quick search of plpgsql.sgml
> shows this is the only place they occur.)
>
> Will submit to the next commitfest.

+1

Pavel

> Regards
>
> Ian Barwick
>
> --
> Ian Barwick                   https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
doc: update PL/pgSQL sample loop function
Hi

Here:

  https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-RECORDS-ITERATING

we have a sample PL/pgSQL function (dating from at least 7.2) demonstrating query result loops, which refreshes some pseudo materialized views stored in a user-defined table.

As we've had proper materialized views since 9.3, I thought it might be nice to update this with a self-contained sample which can be used as-is; see attached patch.

(As a side note, the current sample function contains a couple of "%s" placeholders which should be just "%"; a quick search of plpgsql.sgml shows this is the only place they occur.)

Will submit to the next commitfest.

Regards

Ian Barwick

-- 
Ian Barwick                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

commit d9e99b90fd0e572b4fd2461d7188a0197dee16df
Author: Ian Barwick
Date:   Thu Aug 29 11:49:23 2019 +0900

    doc: update PL/pgSQL sample function

    The sample PL/pgSQL function here:

      https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-RECORDS-ITERATING

    dates from at least PostgreSQL 7.2 and updates pseudo-materialized
    views defined in a user table. Replace it with a more up-to-date
    example which does the same thing with actual materialized views,
    which have been available since PostgreSQL 9.3.

diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index ae73630a48..3194173594 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -2437,19 +2437,29 @@ END LOOP label ;
     resulting from the query and the loop body is executed for each row.
     Here is an example:
-CREATE FUNCTION cs_refresh_mviews() RETURNS integer AS $$
+CREATE FUNCTION refresh_mviews() RETURNS integer AS $$
 DECLARE
     mviews RECORD;
 BEGIN
-    RAISE NOTICE 'Refreshing materialized views...';
-
-    FOR mviews IN SELECT * FROM cs_materialized_views ORDER BY sort_key LOOP
+    RAISE NOTICE 'Refreshing all materialized views...';
+
+    FOR mviews IN
+       SELECT n.nspname AS mv_schema,
+              c.relname AS mv_name,
+              pg_catalog.pg_get_userbyid(c.relowner) AS owner
+         FROM pg_catalog.pg_class c
+    LEFT JOIN pg_catalog.pg_namespace n ON (n.oid = c.relnamespace)
+        WHERE c.relkind = 'm'
+     ORDER BY 1
+    LOOP

-        -- Now "mviews" has one record from cs_materialized_views
+        -- Now "mviews" has one record with information about the materialized view

-        RAISE NOTICE 'Refreshing materialized view %s ...', quote_ident(mviews.mv_name);
-        EXECUTE format('TRUNCATE TABLE %I', mviews.mv_name);
-        EXECUTE format('INSERT INTO %I %s', mviews.mv_name, mviews.mv_query);
+        RAISE NOTICE 'Refreshing materialized view %.% (owner: %)...',
+                     quote_ident(mviews.mv_schema),
+                     quote_ident(mviews.mv_name),
+                     quote_ident(mviews.owner);
+        EXECUTE format('REFRESH MATERIALIZED VIEW %I.%I', mviews.mv_schema, mviews.mv_name);
     END LOOP;

     RAISE NOTICE 'Done refreshing materialized views.';
Re: Improve error detections in TAP tests by spreading safe_psql
Michael Paquier writes:
> On Wed, Aug 28, 2019 at 12:19:08PM -0400, Tom Lane wrote:
>> The attached patch just follows a mechanical rule of "set on_error_die
>> to 0 in exactly those calls where the surrounding code verifies the
>> exit code is what it expects".
>
> I am fine with that approach.  There is an argument for dropping
> safe_psql then?

Well, it's useful if you just want the stdout back.  But its name is a bit misleading if the default behavior of psql is just as safe.  Not sure whether renaming it is worthwhile.

>> I have to wonder if it isn't better to just drop the is() test
>> and let PostgresNode::psql issue the failure.
>
> I got the same thought as you on this point about why the is() is
> actually necessary if we know that it would fail.  An advantage of the
> current code is that we are able to catch all errors in a given run at
> once.

Yeah, but only if the test cases are independent, which I think is mostly not true in our TAP scripts.  Otherwise you're just looking at cascading errors.

> An argument against back-patching the stuff changing the
> default flag are tests which rely on the current behavior.  This could
> be annoying for some people doing advanced testing.

Yup, the tradeoff is people possibly having test scripts outside our tree that would break, versus the possibility that we'll back-patch test changes that don't behave as expected in back branches.  I don't know if the former is true, but I'm afraid the latter is a certainty if we don't back-patch.

			regards, tom lane
Re: Re: Email to hackers for test coverage
On 2019-08-29 00:43, peter.eisentr...@2ndquadrant.com wrote:
> Make spaces and capitalization match surrounding code.
>
> That's fine, but I suggest that if you really want to make an impact in
> test coverage, look to increase function coverage rather than line
> coverage.  Or look for files that are not covered at all.

Thanks for pointing all these things out; I will reconsider my approach to the code coverage work.

-- 
Movead
Re: gharial segfaulting on REL_12_STABLE only
On Tue, Aug 27, 2019 at 2:09 PM Thomas Munro wrote:
> On Tue, Aug 27, 2019 at 1:48 PM Tom Lane wrote:
> > A stack trace would likely be really useful right about now.
>
> Yeah.  Looking into how to get that.

Erm.  I heard the system was in a very unhappy state and couldn't be logged into.  After it was rebooted, the problem appears to have gone away.  That is quite unsatisfying.

"anole" runs on the same host, and occasionally fails to launch any parallel workers, and it seems to be pretty unhappy too -- very long runtimes (minutes where my smaller machines take seconds).  So the machine may be massively overloaded and swapping, or something like that, something to be looked into; but that doesn't explain how we get to a segfault without an underlying hard-to-reach bug in our code...

-- 
Thomas Munro
https://enterprisedb.com
Re: Improve error detections in TAP tests by spreading safe_psql
On Wed, Aug 28, 2019 at 12:19:08PM -0400, Tom Lane wrote:
> The attached patch just follows a mechanical rule of "set on_error_die
> to 0 in exactly those calls where the surrounding code verifies the
> exit code is what it expects".  That leads to a lot of code that
> looks like, say,
>
>  # Insert should work on standby
>  is( $node_standby->psql(
>  	'postgres',
> -	qq{insert into testtab select generate_series(1,1000), 'foo';}),
> +	qq{insert into testtab select generate_series(1,1000), 'foo';},
> +	on_error_die => 0),
>  	0,
>  	'INSERT succeeds with truncated relation FSM');

I am fine with that approach.  There is an argument for dropping safe_psql then?

> I have to wonder if it isn't better to just drop the is() test
> and let PostgresNode::psql issue the failure.  The only thing
> the is() is really buying us is the ability to label the test
> with an ID string, which is helpful, but um ... couldn't that
> just be a comment?

I got the same thought as you on this point about why the is() is actually necessary if we know that it would fail.  An advantage of the current code is that we are able to catch all errors in a given run at once.  If the test dies immediately, you cannot catch that, and this needs repetitive retries if there is no dependency between each step of the test.

An argument against back-patching the stuff changing the default flag are tests which rely on the current behavior.  This could be annoying for some people doing advanced testing.

-- 
Michael
Re: refactoring - share str2*int64 functions
On Wed, Aug 28, 2019 at 09:50:44AM +0200, Fabien COELHO wrote:
> - *ptr && WHATEVER(*ptr)
> *ptr is redundant, WHATEVER yields false on '\0', and it costs on each
> char but at the end. It might be debatable in some places, e.g. it is
> likely that there are no spaces in the string, but likely that there are
> more than one digit.

Still, this makes the checks less robust?

> If you want all/some *ptr added back, no problem.
>
> - isdigit repeated on if and following while, used if/do-while instead.

I see, you don't check twice if the first character is a digit this way.

> Hmmm. Have you looked at the fallback cases when the corresponding builtins
> are not available?
>
> I'm unsure of a reliable way to detect a generic unsigned int overflow
> without expensive dividing back and having to care about zero…

Mr Freund has mentioned that here:
https://www.postgresql.org/message-id/20190717184820.iqz7schxdbucm...@alap3.anarazel.de

> So I was pretty happy with my two discreet, small and efficient tests.

That's also a matter of code and interface consistency IMHO.

-- 
Michael
Re: REINDEX filtering in the backend
On Wed, Aug 28, 2019 at 10:22:07AM +0200, Julien Rouhaud wrote:
>>> The filtering is done at table level (with and without the
>>> concurrently option), so SCHEMA, DATABASE and SYSTEM automatically
>>> benefit from it.  If this clause is used with a REINDEX INDEX, the
>>> statement errors out, as I don't see a valid use case for providing a
>>> single index name and asking to possibly filter it at the same time.
>>
>> Supporting that case would not be that much amount of work, no?
>
> Probably not, but I'm dubious about the use case.  Adding the lib
> version in the catalog would be more useful for people who want to
> write their own rules to reindex specific sets of indexes.

Hearing from others here would be helpful.  My take is that having a simple option doing the filtering, without the need to store anything in the catalogs, would be helpful enough for users mixing both index types on a single table.  Others may not agree.

>> I think that it would be nice to be able to do both, generating an
>> error for REINDEX INDEX if the index specified is not compatible, and
>> a LOG if the index is not filtered out when a list is processed.
>
> Do you mean having an error if the index does not contain any
> collation-based type?  Also, REINDEX only accepts a single name, so
> there shouldn't be any list processing for REINDEX INDEX?  I'm not
> really in favour of adding extra code for the filtering when the user
> asks for a specific index name to be reindexed.

I was referring to adding an error if trying to reindex an index with the filtering enabled.  That's useful to inform the user that what he intends to do is not compatible with the options provided.

> That's what I did when I first submitted the feature in reindexdb.  I
> didn't use them because it means switching to TAP tests.  I can drop
> the simple regression test (especially since I now realize that one is
> quite broken) and use the TAP one, or should I keep both?

There is no need for TAP I think.
You could for example store the relid and its relfilenode in a temporary table before running the reindex, run the REINDEX, and then compare with what pg_class stores.  And that's much cheaper than setting up a new instance for a TAP test.

-- 
Michael
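That suggestion could look roughly like the following in a plain regression test (the table and index names here are hypothetical, not from the patch):

```sql
-- Remember the index's relfilenode before the rebuild
-- (index name is a made-up placeholder).
CREATE TEMP TABLE reindex_before AS
    SELECT oid AS relid, relfilenode
    FROM pg_class
    WHERE relname = 'some_filtered_index';

REINDEX INDEX some_filtered_index;

-- A rebuilt index gets a new relfilenode, a skipped one keeps it.
SELECT b.relfilenode <> c.relfilenode AS was_rebuilt
FROM reindex_before b
JOIN pg_class c ON c.oid = b.relid;
```

The final SELECT then shows directly in the expected output whether the filtering skipped or rebuilt the index.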
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Aug 28, 2019 at 08:17:47PM +0300, Alexey Kondratov wrote:

Hi Tomas,

> Interesting.  Any idea where does the extra overhead in this
> particular case come from?  It's hard to deduce that from the single
> flame graph, when I don't have anything to compare it with (i.e. the
> flame graph for the "normal" case).

I guess that the bottleneck is in disk operations.  You can check the logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and writes (~26%) take around 35% of CPU time in summary.  To compare, please see the attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
        FROM generate_series(1, 2000))
FROM generate_series(1, 100);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total.  So we get very roughly the same ~x4-5 performance drop here.  JFYI, I am using a machine with SSD for tests.

>> Therefore, probably you may write changes on receiver in bigger
>> chunks, not each change separately.
>
> Possibly.  I/O is certainly a possible culprit, although we should be
> using buffered I/O and there certainly are not any fsyncs here.  So
> I'm not sure why would it be cheaper to do the writes in batches.
>
> BTW does this mean you see the overhead on the apply side?  Or are you
> running this on a single machine, and it's difficult to decide?

I run this on a single machine, but walsender and worker are utilizing almost 100% of CPU per each process all the time, and at the apply side I/O syscalls take about 1/3 of CPU time.  Though I am still not sure, for me this result somehow links the performance drop with problems at the receiver side.

Writing in batches was just a hypothesis, and to validate it I have performed a test with a large txn consisting of a smaller number of wide rows.  This test does not exhibit any significant performance drop, while it was streamed too.  So the hypothesis seems to be valid.  Anyway, I do not have other reasonable ideas beside that right now.
I've checked this patch again recently and tried to improve it in terms of performance.  As a result I've implemented a new POC version of the applier (attached).  Almost everything in the streaming logic stayed intact, but the apply worker is significantly different.

As I wrote earlier, I still claim that spilling changes to disk at the applier side adds additional overhead, but it is possible to get rid of it.  In my additional patch I do the following:

1) Maintain a pool of additional background workers (bgworkers) that are connected with the main logical apply worker via shm_mq's.  Each worker is dedicated to the processing of a specific streamed transaction.

2) When we receive a streamed change for some transaction, we check whether there is an existing dedicated bgworker in HTAB (xid -> bgworker), or there are some in the idle list, or we spawn a new one.

3) We pass all changes (between STREAM START/STOP) to that bgworker via shm_mq_send without intermediate waiting.  However, we wait for the bgworker to apply the entire chunk of changes at STREAM STOP, since we don't want transaction reordering.

4) When a transaction is committed/aborted, the worker is added to the idle list and waits for a reassigning message.

5) I have used the same machinery with apply_dispatch in the bgworkers, since most of the actions are practically very similar.

Thus, we do not spill anything at the applier side, so transaction changes are processed by bgworkers as normal backends do.  At the same time, changes processing is strictly serial, which prevents transaction reordering and possible conflicts/anomalies.  Even though we trade off performance in favor of stability, the result is rather impressive.

I have used a similar query for testing as before:

EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO large_test (num1, num2, num3)
SELECT round(random()*10), random(), random()*142
FROM generate_series(1, 100) s(i);

with 1kk (100), 3kk and 5kk rows; logical_work_mem = 64MB and synchronous_standby_names = 'FIRST 1 (large_sub)'.
The table schema is as follows:

CREATE TABLE large_test (
    id serial primary key,
    num1 bigint,
    num2 double precision,
    num3 double precision
);

Here are the results:

-----------------------------------------------------------
| N   | Time on master, sec | Total xact time, sec | Ratio |
-----------------------------------------------------------
|                  On commit (master, v13)                 |
-----------------------------------------------------------
| 1kk | 6.5                 | 17.6                 | x2.74 |
-----------------------------------------------------------
| 3kk | 21                  | 55.4                 | x2.64 |
-----------------------------------------------------------
| 5kk | 38.3                | 91.5                 | x2.39 |
-----------------------------------------------------------
| Stream +
Re: FETCH FIRST clause PERCENT option
The following review has been posted through the commitfest application:

make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hi everyone,

The v8 patch [1] applies cleanly and passes tests.  I reviewed the conversation to date and from what I can see, all identified bugs have been fixed and concerns either fixed or addressed.  Marking as ready for committer.

[1] https://www.postgresql.org/message-id/attachment/103486/percent-incremental-v8.patch

Ryan Lambert

The new status of this patch is: Ready for Committer
Re: RFC: seccomp-bpf support
On 2019-Aug-28, Joshua Brindle wrote:

> I think we need to rein in the thread somewhat. The feature allows
> end users to define some sandboxing within PG. Nothing is being forced
> on anyone but we would like the capability to harden a PG installation
> for many reasons already stated.

My own objection to this line of development is that no useful policy (allowed/denied syscall list) is, or intends to be, part of the final feature.  So we're shipping a hook system for which each independent vendor is going to develop their own policy.  Joe provided an example syscall list, but it's not part of the patch proper; and it seems, per the discussion, that the precise syscall list to use is a significant fraction of this.

So, as part of a committable patch, IMO it'd be good to have some sort of final list of syscalls -- maybe as part of the docbook part of the patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: no mailing list hits in google
On 2019-Aug-28, Thomas Kellerer wrote:

> Merlin Moncure wrote on 28.08.2019 at 18:22:
>> My test case here is the query: pgsql-hackers
>
> That search term is the first hit on DuckDuckGo:
> https://duckduckgo.com/?q=pgsql-hackers+ExecHashJoinNewBatch&t=h_&ia=web

Yes, but that's an old post, not the one from this year.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: RFC: seccomp-bpf support
On Thu, Aug 29, 2019 at 7:08 AM Joshua Brindle wrote:
> On Wed, Aug 28, 2019 at 2:53 PM Andres Freund wrote:
> > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:
> > > A prime example is madvise() which was a catastrophic failure that 1)
> > > isn't preventable by any LSM including SELinux, 2) isn't used by PG
> > > and is therefore a good candidate for a kill list, and 3) a clear win
> > > in the dont-let-PG-be-a-vector-for-kernel-compromise arena.
> >
> > IIRC it's used by glibc as part of its malloc implementation (also
> > threading etc) - but not necessarily hit during the most common
> > paths. That's *precisely* my problem with this approach.
>
> As long as glibc handles a returned error cleanly the syscall could be
> denied without harming the process and the bug would be mitigated.
>
> seccomp also allows argument whitelisting so things can get very
> granular, depending on who is setting up the lists.

Just by the way, there may also be differences between architectures.  After some head scratching, we recently discovered [1] that default seccomp whitelists currently cause PostgreSQL to panic for users of Docker, Nspawn etc. on POWER and ARM because of that.  That's a bug being fixed elsewhere, but it reveals another thing to be careful of if you're trying to build your own whitelist by guessing what libc needs to call.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLiR569VHLjtCNp3NT%2BjnKdhy8g2sdgKzWNojyWQVt6Bw%40mail.gmail.com

-- 
Thomas Munro
https://enterprisedb.com
Re: RFC: seccomp-bpf support
On 2019-08-28 21:38, Joshua Brindle wrote:
> I think we need to rein in the thread somewhat. The feature allows
> end users to define some sandboxing within PG. Nothing is being forced
> on anyone

Features come with a maintenance cost.  If we ship it, then people are going to try it out.  Then weird things will happen.  They will report mysterious bugs.  They will complain to their colleagues.  It all comes with a cost.

> but we would like the capability to harden a PG installation
> for many reasons already stated.

Most if not all of those reasons seem to have been questioned.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Memory-Bounded Hash Aggregation
I started to review this patch yesterday with Melanie Plageman, so we rebased this patch over the current master.  The main conflicts were due to a simplehash patch that has been committed separately [1].  I've attached the rebased patch.

I was playing with the code, and if one of the table's most common values isn't placed into the initial hash table, it spills a whole lot of tuples to disk that might have been avoided if we had some way to 'seed' the hash table with MCVs from the statistics.  Seems to me that you would need some way of dealing with values that are in the MCV list but ultimately don't show up in the scan.  I imagine that this kind of optimization would be most useful for aggregates on a full table scan.

Some questions:

Right now the patch always initializes 32 spill partitions.  Have you given any thought into how to intelligently pick an optimal number of partitions yet?

> That can be done as an add-on to approach #1 by evicting the entire
> hash table (writing out the partial states), then resetting the memory
> context.

By add-on approach, do you mean to say that you have something in mind to combine the two strategies?  Or do you mean that it could be implemented as a separate strategy?

> I think it's clear there's no perfect eviction strategy - for every
> algorithm we came up with we can construct a data set on which it
> performs terribly (I'm sure we could do that for the approach used by
> Greenplum, for example).
>
> So I think it makes sense to do what Jeff proposed, and then maybe try
> improving that in the future with a switch to different eviction
> strategy based on some heuristics.

I agree.  It definitely feels like both spilling strategies have their own use case.

That said, I think it's worth mentioning that with parallel aggregates it might actually be more useful to spill the trans values instead, and have them combined in a Gather or Finalize stage.
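As a side note on the fixed 32 partitions: a common scheme for this kind of spilling (hash join batching works similarly) is to take a few bits of the group hash per recursion level, so that recursing into a partition that is itself too large re-partitions its groups on fresh bits rather than sending them all to one bucket again.  A hypothetical sketch of that idea (this is not the patch's code, and the constant 32 is just its current default):

```c
#include <stdint.h>

#define NUM_SPILL_PARTITIONS 32     /* the patch's current fixed default */

/*
 * Route a group's hash value to one of 32 spill partitions by taking
 * 5 bits of the hash; a different 5-bit window is used at each
 * recursion depth so that re-spilling an oversized partition splits
 * its groups instead of keeping them together.
 */
int
spill_partition(uint32_t hash, int recursion_depth)
{
    int shift = 32 - 5 * (recursion_depth + 1); /* 5 bits per level */

    if (shift < 0)
        shift = 0;              /* hash bits exhausted; degenerate case */
    return (int) ((hash >> shift) & (NUM_SPILL_PARTITIONS - 1));
}
```

Choosing the partition count intelligently would then amount to estimating (ngroups * group size) / work_mem and rounding up to a power of two, rather than always using 32.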
[1] https://www.postgresql.org/message-id/flat/48abe675e1330f0c264ab2fe0d4ff23eb244f9ef.camel%40j-davis.com v1-0001-Rebased-memory-bounded-hash-aggregation.patch Description: Binary data
Re: RFC: seccomp-bpf support
Hi,

On 2019-08-28 15:38:11 -0400, Joshua Brindle wrote:
> It seems like complete system compromises should be prioritized over
> slowdowns, and it seems very unlikely to cause a noticeable slowdown
> anyway.

The point isn't really this specific issue, but that the argument that you'll not cause problems by disabling certain syscalls, or that it's easy to find which ones are used, just plainly isn't true.

> Are there PG users that backed out all of the Linux KPTI patches due
> to the slowdown?

Well, not backed out on a code level, but straight out disabled at boot time (i.e. pti=off)?  Yea, I know of several.

> I think we need to rein in the thread somewhat. The feature allows
> end users to define some sandboxing within PG. Nothing is being forced
> on anyone

Well, we'll have to deal with the fallout of this to some degree.  When postgres breaks people will complain, when it's slow, people will complain, ...

Greetings,

Andres Freund
Re: RFC: seccomp-bpf support
On Wed, Aug 28, 2019 at 3:22 PM Andres Freund wrote:
>
> Hi,
>
> On 2019-08-28 15:02:17 -0400, Joshua Brindle wrote:
> > On Wed, Aug 28, 2019 at 2:53 PM Andres Freund wrote:
> > > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:
> > > > A prime example is madvise() which was a catastrophic failure that 1)
> > > > isn't preventable by any LSM including SELinux, 2) isn't used by PG
> > > > and is therefore a good candidate for a kill list, and 3) a clear win
> > > > in the dont-let-PG-be-a-vector-for-kernel-compromise arena.
> > >
> > > IIRC it's used by glibc as part of its malloc implementation (also
> > > threading etc) - but not necessarily hit during the most common
> > > paths. That's *precisely* my problem with this approach.
> >
> > As long as glibc handles a returned error cleanly the syscall could be
> > denied without harming the process and the bug would be mitigated.
>
> And we'd hit mysterious slowdowns in production uses of PG when seccomp
> is enabled.

It seems like complete system compromises should be prioritized over slowdowns, and it seems very unlikely to cause a noticeable slowdown anyway.  Are there PG users that backed out all of the Linux KPTI patches due to the slowdown?

I think we need to rein in the thread somewhat.  The feature allows end users to define some sandboxing within PG.  Nothing is being forced on anyone, but we would like the capability to harden a PG installation for many reasons already stated.  This is being done in places all across the Linux ecosystem and is IMO a very useful mitigation.

Thank you.
Re: Index Skip Scan
> Sorry for the long delay.  Here is more or less what I had in mind.  After
> changing read_closest to read the whole page, I couldn't resist just merging
> it into readpage itself, since it's basically the same.  It could raise
> questions about performance and how intrusive this patch is, but I hope it's
> not that much of a problem (in the worst case we can split it back).  I've
> also added a few tests for the issue you've mentioned.  Thanks again, I
> appreciate how much effort you put into reviewing!

Putting it into one function makes sense I think.

Looking at the patch, I think in general there are some good improvements in there.  I'm afraid I did manage to find another incorrect query result though, having to do with the keepPrev part and skipping to the first tuple on an index page:

postgres=# drop table if exists b;
create table b as
  select a, b::int2 b, (b%2)::int2 c
  from generate_series(1,5) a, generate_series(1,366) b;
create index on b (a,b,c);
analyze b;
DROP TABLE
SELECT 1830
CREATE INDEX
ANALYZE
postgres=# set enable_indexskipscan=1;
SET
postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;
 a | b | c
---+---+---
 1 | 2 | 0
 2 | 4 | 0
 3 | 4 | 0
 4 | 4 | 0
 5 | 4 | 0
(5 rows)

postgres=# set enable_indexskipscan=0;
SET
postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;
 a | b | c
---+---+---
 1 | 2 | 0
 2 | 2 | 0
 3 | 2 | 0
 4 | 2 | 0
 5 | 2 | 0
(5 rows)

-Floris
Re: RFC: seccomp-bpf support
Hi, On 2019-08-28 15:02:17 -0400, Joshua Brindle wrote: > On Wed, Aug 28, 2019 at 2:53 PM Andres Freund wrote: > > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: > > > A prime example is madvise() which was a catastrophic failure that 1) > > > isn't preventable by any LSM including SELinux, 2) isn't used by PG > > > and is therefore a good candidate for a kill list, and 3) a clear win > > > in the dont-let-PG-be-a-vector-for-kernel-compromise arena. > > > > IIRC it's used by glibc as part of its malloc implementation (also > > threading etc) - but not necessarily hit during the most common > > paths. That's *precisely* my problem with this approach. > > > > As long as glibc handles a returned error cleanly the syscall could be > denied without harming the process and the bug would be mitigated. And we'd hit mysterious slowdowns in production uses of PG when seccomp is enabled. Greetings, Andres Freund
Re: RFC: seccomp-bpf support
Andres Freund writes: > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: >> A prime example is madvise() which was a catastrophic failure that 1) >> isn't preventable by any LSM including SELinux, 2) isn't used by PG >> and is therefore a good candidate for a kill list, and 3) a clear win >> in the dont-let-PG-be-a-vector-for-kernel-compromise arena. > IIRC it's used by glibc as part of its malloc implementation (also > threading etc) - but not necessarily hit during the most common > paths. That's *precisely* my problem with this approach. I think Andres is right here. There are madvise calls in glibc: glibc-2.28/malloc/malloc.c:__madvise (paligned_mem, size & ~psm1, MADV_DONTNEED); glibc-2.28/malloc/arena.c:__madvise ((char *) h + new_size, diff, MADV_DONTNEED); It appears that the first is only reachable from __malloc_trim which we don't use, but the second is reachable from free(). However, strace'ing says that it's never called during our standard regression tests, confirming Andres' thought that it's in seldom-reached paths. (I did not go through the free() logic in any detail, but it looks like it is only reached when dealing with quite-large allocations, which'd make sense.) So this makes a perfect example for Peter's point that testing is going to be a very fallible way of finding the set of syscalls that need to be allowed. Even if we had 100.00% code coverage of PG proper, we would not necessarily find calls like this. regards, tom lane
Re: RFC: seccomp-bpf support
On Wed, Aug 28, 2019 at 2:53 PM Andres Freund wrote: > > Hi, > > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: > > A prime example is madvise() which was a catastrophic failure that 1) > > isn't preventable by any LSM including SELinux, 2) isn't used by PG > > and is therefore a good candidate for a kill list, and 3) a clear win > > in the dont-let-PG-be-a-vector-for-kernel-compromise arena. > > IIRC it's used by glibc as part of its malloc implementation (also > threading etc) - but not necessarily hit during the most common > paths. That's *precisely* my problem with this approach. > As long as glibc handles a returned error cleanly the syscall could be denied without harming the process and the bug would be mitigated. seccomp also allows argument whitelisting so things can get very granular, depending on who is setting up the lists.
Re: Re: Email to hackers for test coverage
On Tue, Aug 27, 2019 at 8:30 PM Michael Paquier wrote: > Please note that I have not looked in details where we could put that, > but perhaps Robert and Peter G who worked on 4ea51cd to add support > for abbreviated keys have some ideas, so I am adding them in CC for > input. Movead is correct -- the NULL handling within ApplySortAbbrevFullComparator() cannot actually be used currently. I wouldn't change anything about the code, though, since it's useful to defensively handle NULLs. -- Peter Geoghegan
Re: RFC: seccomp-bpf support
Andres Freund writes: > On 2019-08-28 14:30:20 -0400, Tom Lane wrote: >> Admittedly, you can't get per-subprocess restrictions that way, but the >> incremental value from that seems *really* tiny. If listen() has a bug >> you need to fix the bug, not invent this amount of rickety infrastructure >> to limit who can call it. > And, as I mentioned in another email, once you can corrupt shared memory > in arbitrary ways, the differing restrictions aren't worth much > anyway. Postmaster might be separated out enough to survive attacks like > that, but backends definitely aren't. Another point in this area is that if you did feel a need for per-process syscall sets, having a restriction that the postmaster's allowed set be a superset of all the childrens' allowed sets seems quite the wrong thing. The set of calls the postmaster needs is probably a lot smaller than what the children need, seeing that it does so little. It's just different because it includes bind+listen which the children likely don't need. So the hierarchical way seccomp goes about this seems fairly wrong for our purposes regardless. regards, tom lane
Re: RFC: seccomp-bpf support
On Wed, Aug 28, 2019 at 2:30 PM Tom Lane wrote: > > > (And, TBH, I'm still wondering why SELinux isn't the way to address this.) Just going to address this one now. SELinux is an LSM and therefore only makes decisions when LSM hooks are invoked, which are not 1:1 for syscalls (not even close). Further, SELinux generally determines what objects a subject can access and only implements capabilities because it has to, to be non-bypassable. Seccomp filtering is an orthogonal system to SELinux and LSMs in general. We are already doing work to further restrict PG subprocesses with SELinux via [1] and [2], but that doesn't address the unused, high-risk, obsolete, etc syscall issue. A prime example is madvise() which was a catastrophic failure that 1) isn't preventable by any LSM including SELinux, 2) isn't used by PG and is therefore a good candidate for a kill list, and 3) a clear win in the dont-let-PG-be-a-vector-for-kernel-compromise arena. We are using SELinux, we are also going to use this, they work together.
Re: RFC: seccomp-bpf support
Hi, On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: > A prime example is madvise() which was a catastrophic failure that 1) > isn't preventable by any LSM including SELinux, 2) isn't used by PG > and is therefore a good candidate for a kill list, and 3) a clear win > in the dont-let-PG-be-a-vector-for-kernel-compromise arena. IIRC it's used by glibc as part of its malloc implementation (also threading etc) - but not necessarily hit during the most common paths. That's *precisely* my problem with this approach. Greetings, Andres Freund
Re: RFC: seccomp-bpf support
On Wed, Aug 28, 2019 at 2:47 PM Joshua Brindle wrote: > > On Wed, Aug 28, 2019 at 2:30 PM Tom Lane wrote: > > > > > > > (And, TBH, I'm still wondering why SELinux isn't the way to address this.) > > Just going to address this one now. SELinux is an LSM and therefore > only makes decisions when LSM hooks are invoked, which are not 1:1 for > syscalls (not even close). Further, SELinux generally determines what > objects a subject can access and only implements capabilities because > it has to, to be non-bypassable. > > Seccomp filtering is an orthogonal system to SELinux and LSMs in > general. We are already doing work to further restrict PG subprocesses > with SELinux via [1] and [2], but that doesn't address the unused, And I forgot the citations *sigh*, actually there should have only been [1]: 1. https://commitfest.postgresql.org/24/2259/ > high-risk, obsolete, etc syscall issue. A prime example is madvise() > which was a catastrophic failure that 1) isn't preventable by any LSM > including SELinux, 2) isn't used by PG and is therefore a good > candidate for a kill list, and 3) a clear win in the > dont-let-PG-be-a-vector-for-kernel-compromise arena. > > We are using SELinux, we are also going to use this, they work together.
Re: RFC: seccomp-bpf support
On 2019-08-28 14:30:20 -0400, Tom Lane wrote: > Andres Freund writes: > > Maybe I'm missing something, but it's not clear to me what meaningful > > attack surface can be reduced for PostgreSQL by forbidding certain > > syscalls, given the wide variety of syscalls required to run postgres. > > I think the idea is to block access to seldom-used syscalls because > those are exactly the ones most likely to have bugs. Yea, I can see some benefit in that. I'm just quite doubtful that the set of syscalls pg relies on doesn't already allow any determined attacker to trigger kernel bugs. E.g. the whole sysv ipc code is among those seldom-used areas of the code. As is the permission transfer through unix sockets. As is forking from within a syscall. ... > (We certainly hope that all the ones PG uses are well-debugged...) One would hope, but it's not quite my experience... > Whether the incremental protection is really worth the effort is > debatable, but as long as it's only people who think it *is* worth the > effort who have to deal with it, I don't mind. It seems almost guaranteed that there'll be bug reports about ominous crashes due to some less commonly used syscall being blacklisted. In a lot of cases that'll be hard to debug. After all, we already got such bug reports, without us providing anything builtin - and it's not like us adding our own filter support will make container solutions not use their filter, so there's no argument that doing so ourselves will reduce the fragility. > Admittedly, you can't get per-subprocess restrictions that way, but the > incremental value from that seems *really* tiny. If listen() has a bug > you need to fix the bug, not invent this amount of rickety infrastructure > to limit who can call it. And, as I mentioned in another email, once you can corrupt shared memory in arbitrary ways, the differing restrictions aren't worth much anyway. 
Postmaster might be separated out enough to survive attacks like that, but backends definitely aren't. Greetings, Andres Freund
Re: RFC: seccomp-bpf support
Hi, On 2019-08-28 14:23:00 -0400, Joshua Brindle wrote: > > or some similar technology where the filtering is done by logic > > that's outside the executable you wish to not trust. > > (After googling for libseccomp, I see that it's supposed to not > > allow syscalls to be turned back on once turned off, but that isn't > > any protection against this problem. An attacker who's found an ACE > > hole in Postgres can just issue ALTER SYSTEM SET to disable the > > feature, then force a postmaster restart, then profit.) A postmaster restart might not be enough, because the postmaster's restrictions can't be removed, once in place. But all that's needed to circumvent that is to force postmaster to die, and rely on systemd etc to restart it. > My preference would have been to enable it unconditionally but Joe was > being more practical. Well, the current approach is to configure the list of allowed syscalls in postgres. How would you ever secure that against the attacks described by Tom? As long as the restrictions are put into place by postgres itself, and as long as they're configurable, such attacks are possible, no? And as long as extensions etc need different syscalls, you'll need configurability. > > I follow the idea of limiting the attack surface for kernel bugs, > > but this doesn't seem like a useful implementation of that, even > > ignoring the ease-of-use problems Peter mentions. > > Limiting the kernel attack surface for network facing daemons is > imperative to hardening systems. All modern attacks are chained > together so a kernel bug is useful only if you can execute code, and > PG is a decent vector for executing code. I don't really buy that in the case of postgres. Normally, in a medium to high security world, once you have RCE in postgres, the valuable data can already be exfiltrated. And that's game over. 
The only real benefit of a kernel vulnerability is that it might allow the attack to persist for longer - but there's already plenty you can do inside postgres, once you have RCE. > At a minimum I would urge the community to look at adding high risk > syscalls to the kill list, systemd has some predefined sets we can > pick from like @obsolete, @cpu-emulation, @privileged, @mount, and > @module. I can see some small value in disallowing these - but we're back to the point where that is better done one layer *above* postgres, by a process with more privileges than PG. Because then a PG RCE doesn't have a way to prevent those filters from being applied. > The postmaster and per-role lists can further reduce the available > syscalls based on the exact extensions and PLs being used. I don't buy that per-session configurable lists buy you anything meaningful. With an RCE in one session, it's pretty trivial to corrupt shared memory to also trigger RCE in other sessions. And there's no way seccomp or anything like that will prevent that. An additional reason I'm quite sceptical about more fine-grained restrictions is that I think we're going to have to go for some use of threading in the next few years. While I think that's still far from agreed upon, I think there's a pretty large number of "senior" hackers that see this as the future. You can have per-thread seccomp filters, but that's so trivial to circumvent (just overwrite some vtable-like data in another thread's data, and have it call whatever gadget you want), that it's not even worth considering. Greetings, Andres Freund
Re: RFC: seccomp-bpf support
Andres Freund writes: > Maybe I'm missing something, but it's not clear to me what meaningful > attack surface can be reduced for PostgreSQL by forbidding certain > syscalls, given the wide variety of syscalls required to run postgres. I think the idea is to block access to seldom-used syscalls because those are exactly the ones most likely to have bugs. (We certainly hope that all the ones PG uses are well-debugged...) That seems fine. Whether the incremental protection is really worth the effort is debatable, but as long as it's only people who think it *is* worth the effort who have to deal with it, I don't mind. What I don't like about this proposal is that the filter configuration is kept somewhere where it's not at all hard for an attacker to modify it. It can't be a GUC, indeed it can't be in any file that the server has permissions to write on, or you might as well not bother. So I'd throw away pretty much everything in the submitted patch, and instead think about doing the filtering/limiting in something that's launched from the service start script and in turn launches the postmaster. That makes it (mostly?) Not Our Problem, but rather an independent component. Admittedly, you can't get per-subprocess restrictions that way, but the incremental value from that seems *really* tiny. If listen() has a bug you need to fix the bug, not invent this amount of rickety infrastructure to limit who can call it. (And, TBH, I'm still wondering why SELinux isn't the way to address this.) regards, tom lane
Re: RFC: seccomp-bpf support
On Wed, Aug 28, 2019 at 1:42 PM Tom Lane wrote: > > Peter Eisentraut writes: > > On 2019-08-28 17:13, Joe Conway wrote: > >> Attached is a patch for discussion, adding support for seccomp-bpf > >> (nowadays generally just called seccomp) syscall filtering at > >> configure-time using libseccomp. I would like to get this in shape to be > >> committed by the end of the November CF if possible. > > > To compute the initial set of allowed system calls, you need to have > > fantastic test coverage. What you don't want is some rarely used error > > recovery path to cause a system crash. I wouldn't trust our current > > coverage for this. > > Yeah, that seems like quite a serious problem. I think you'd want > to have some sort of static-code-analysis-based way of identifying > the syscalls in use, rather than trying to test your way to it. > > > Overall, I think this sounds like a maintenance headache, and the > > possible benefits are unclear. > > After thinking about this for awhile, I really don't follow what > threat model it's trying to protect against. Given that we'll allow > any syscall that an unmodified PG executable might use, it seems > like the only scenarios being protected against involve someone > having already compromised the server enough to have arbitrary code > execution. OK, fine, but then why wouldn't the attacker just > bypass libseccomp? Or tell it to let through the syscall he wants > to use? Having the list of allowed syscalls be determined inside > the process seems like fundamentally the wrong implementation. > I'd have expected a feature like this to be implemented by SELinux, SELinux is generally an object model and while it does implement e.g., capability checks, that is not the main intent, nor is it possible for LSMs to implement syscall filters, the features are orthogonal. > or some similar technology where the filtering is done by logic > that's outside the executable you wish to not trust. 
> (After googling for libseccomp, I see that it's supposed to not > allow syscalls to be turned back on once turned off, but that isn't > any protection against this problem. An attacker who's found an ACE > hole in Postgres can just issue ALTER SYSTEM SET to disable the > feature, then force a postmaster restart, then profit.) > My preference would have been to enable it unconditionally but Joe was being more practical. > I follow the idea of limiting the attack surface for kernel bugs, > but this doesn't seem like a useful implementation of that, even > ignoring the ease-of-use problems Peter mentions. Limiting the kernel attack surface for network-facing daemons is imperative to hardening systems. All modern attacks are chained together so a kernel bug is useful only if you can execute code, and PG is a decent vector for executing code. At a minimum I would urge the community to look at adding high-risk syscalls to the kill list; systemd has some predefined sets we can pick from like @obsolete, @cpu-emulation, @privileged, @mount, and @module. The goal is to prevent an ACE hole in Postgres from becoming a complete system compromise. This may not do it alone, and security-conscious integrators will want to, for example, add seccomp filters to systemd to prevent superuser from disabling them. The postmaster and per-role lists can further reduce the available syscalls based on the exact extensions and PLs being used. Each step reduces the surface more, and throwing it all out because one step can go rogue is unsatisfying. Thank you.
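To make the systemd option concrete, the predefined deny sets mentioned above can be applied from a unit drop-in, i.e. one layer above postgres. This is only a sketch: the group names are real systemd syscall sets, but the unit and file names are assumptions about local packaging, and the exact group list (especially @privileged) would need testing against a real installation before use.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/postgresql.service.d/seccomp.conf
[Service]
# "~" makes this a deny list: block the named syscall groups, allow the rest.
SystemCallFilter=~@obsolete @cpu-emulation @module @mount
# Return EPERM instead of killing the process outright.
SystemCallErrorNumber=EPERM
```

Because systemd installs the filter before exec'ing the postmaster, an attacker with RCE inside postgres cannot remove it, which is the property a GUC-based configuration cannot provide.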
Re: no mailing list hits in google
Merlin Moncure wrote on 28.08.2019 at 18:22: My test case here is the query: pgsql-hackers That search term is the first hit on DuckDuckGo: https://duckduckgo.com/?q=pgsql-hackers+ExecHashJoinNewBatch&t=h_&ia=web Searching for "postgres ExecHashJoinNewBatch" returns that at position 4 https://duckduckgo.com/?q=postgres+ExecHashJoinNewBatch&t=h_&ia=web
Re: RFC: seccomp-bpf support
Hi, On 2019-08-28 13:28:06 -0400, Joe Conway wrote: > > To compute the initial set of allowed system calls, you need to have > > fantastic test coverage. What you don't want is some rarely used error > > recovery path to cause a system crash. I wouldn't trust our current > > coverage for this. > So if you are worried about that make your default action 'log' and > watch audit.log. There will be no errors or crashes of postgres caused > by that because there will be no change in postgres visible behavior. But the benefit of integrating this into postgres becomes even less clear. > And if returning an error from a syscall causes a crash that would be a > serious bug and we should fix it. Err, there are a lot of syscall failures that'll cause PANICs, and where there's no reasonable way around that. Greetings, Andres Freund
Re: RFC: seccomp-bpf support
Hi, On 2019-08-28 11:13:27 -0400, Joe Conway wrote: > Recent security best-practices recommend, and certain highly > security-conscious organizations are beginning to require, that SECCOMP > be used to the extent possible. The major web browsers, container > runtime engines, and systemd are all examples of software that already > support seccomp. Maybe I'm missing something, but it's not clear to me what meaningful attack surface can be reduced for PostgreSQL by forbidding certain syscalls, given the wide variety of syscalls required to run postgres. That's different from something like a browser's CSS process, or such, which really doesn't need much beyond some IPC and memory allocations. But postgres is going to need syscalls as broad as fork/clone, exec, connect, shm*, etc. I guess you can argue that we'd still reduce the attack surface for kernel escalations, but that seems like a pretty small win compared to the cost. > * With built-in support, it is possible to lock down backend processes > more tightly than the postmaster. Which important syscalls would you get away with removing in backends that postmaster needs? I think the only one - which is a good one though - that I can think of is listen(). But even that might be too restrictive for some PLs running out of process. My main problem with seccomp is that it's *incredibly* fragile, especially for a program as complex as postgres. We already had seccomp related bug reports on list, even just due to the very permissive filtering by some container solutions. There's regularly new syscalls (e.g. epoll_create1(), and we'll soon get openat2()), different versions of glibc use different syscalls (e.g. switching from open() to always using openat()), the system configuration influences which syscalls are being used (e.g. vsyscalls only being used for certain clock sources), and kernel bugfixes change the exact set of syscalls being used ([1]). 
[1] https://lwn.net/Articles/795128/ Then there's also the issue that many extensions are going to need additional syscalls. > Notes on usage: > === > In order to determine your minimally required allow lists, do something > like the following on a non-production server with the same architecture > as production: > c) Cut and paste the result as the value of session_syscall_allow. That seems nearly guaranteed to miss a significant fraction of syscalls. There's just no way we're going to cover all the potential paths and configurations in our testsuite. I think if you actually wanted to do something like this, you'd need to use static analysis to come up with a more reliable list. Greetings, Andres Freund
Re: no mailing list hits in google
Hi, On 2019-08-28 10:26:35 -0700, Andres Freund wrote: > On August 28, 2019 9:22:44 AM PDT, Merlin Moncure wrote: > >Hackers, > >[apologies if this is the incorrect list or is already discussed > >material] > > Probably should be on the -www list. Redirecting. Please trim in future > replies. > > >I've noticed that mailing list discussions in -hackers and other > >mailing lists appear to not be indexed by google -- at all. > > I noticed that there's fewer and fewer hits too. Pretty annoying. I have an > online archive I can search, but that's not something everyone should have to > do. > > I think it's because robots.txt tells search engines to ignore the lists. > Quite hard to understand how that's a good idea. > > https://www.postgresql.org/robots.txt > > User-agent: * > Disallow: /admin/ > Disallow: /account/ > Disallow: /docs/devel/ > Disallow: /list/ > Disallow: /search/ > Disallow: /message-id/raw/ > Disallow: /message-id/flat/ > > Sitemap: https://www.postgresql.org/sitemap.xml > > Without /list, there's no links to the individual messages. So there needs to > be another external reference for a search engine to arrive at individual > messages. > > Andres > -- > Sent from my Android device with K-9 Mail. Please excuse my brevity. For reasons that I do not understand, the previous mail had a broken html part, making the above message invisible for people viewing the html part. Greetings, Andres Freund
Re: RFC: seccomp-bpf support
Peter Eisentraut writes: > On 2019-08-28 17:13, Joe Conway wrote: >> Attached is a patch for discussion, adding support for seccomp-bpf >> (nowadays generally just called seccomp) syscall filtering at >> configure-time using libseccomp. I would like to get this in shape to be >> committed by the end of the November CF if possible. > To compute the initial set of allowed system calls, you need to have > fantastic test coverage. What you don't want is some rarely used error > recovery path to cause a system crash. I wouldn't trust our current > coverage for this. Yeah, that seems like quite a serious problem. I think you'd want to have some sort of static-code-analysis-based way of identifying the syscalls in use, rather than trying to test your way to it. > Overall, I think this sounds like a maintenance headache, and the > possible benefits are unclear. After thinking about this for awhile, I really don't follow what threat model it's trying to protect against. Given that we'll allow any syscall that an unmodified PG executable might use, it seems like the only scenarios being protected against involve someone having already compromised the server enough to have arbitrary code execution. OK, fine, but then why wouldn't the attacker just bypass libseccomp? Or tell it to let through the syscall he wants to use? Having the list of allowed syscalls be determined inside the process seems like fundamentally the wrong implementation. I'd have expected a feature like this to be implemented by SELinux, or some similar technology where the filtering is done by logic that's outside the executable you wish to not trust. (After googling for libseccomp, I see that it's supposed to not allow syscalls to be turned back on once turned off, but that isn't any protection against this problem. An attacker who's found an ACE hole in Postgres can just issue ALTER SYSTEM SET to disable the feature, then force a postmaster restart, then profit.) 
I follow the idea of limiting the attack surface for kernel bugs, but this doesn't seem like a useful implementation of that, even ignoring the ease-of-use problems Peter mentions. regards, tom lane
Re: RFC: seccomp-bpf support
On 8/28/19 12:47 PM, David Fetter wrote: > On Wed, Aug 28, 2019 at 11:13:27AM -0400, Joe Conway wrote: >> SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall >> filtering mechanism which allows reduction of the kernel attack surface >> by preventing (or at least audit logging) normally unused syscalls. >> >> Quoting from this link: >> https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt >> >>"A large number of system calls are exposed to every userland process >> with many of them going unused for the entire lifetime of the >> process. As system calls change and mature, bugs are found and >> eradicated. A certain subset of userland applications benefit by >> having a reduced set of available system calls. The resulting set >> reduces the total kernel surface exposed to the application. System >> call filtering is meant for use with those applications." >> >> Recent security best-practices recommend, and certain highly >> security-conscious organizations are beginning to require, that SECCOMP >> be used to the extent possible. The major web browsers, container >> runtime engines, and systemd are all examples of software that already >> support seccomp. > > Neat! > > Are the seccomp interfaces for other kernels arranged in a manner > similar enough to have a unified interface in PostgreSQL, or is this > more of a Linux-only feature? As far as I know libseccomp is Linux specific, at least at the moment. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Re: RFC: seccomp-bpf support
On 8/28/19 1:03 PM, Peter Eisentraut wrote: > On 2019-08-28 17:13, Joe Conway wrote: >> * systemd does not implement seccomp filters by default. Packagers may >> decide to do so, but there is no guarantee. Adding them post install >> potentially requires cooperation by groups outside control of >> the database admins. > > Well, if we are interested in this, why don't we just ask packagers to > do so? That seems like a much easier solution. For the reason listed below. >> * In the container and systemd case there is no particularly good way to >> inspect what filters are active. It is possible to observe actions >> taken, but again, control is possibly outside the database admin >> group. For example, the best way to understand what happened is to >> review the auditd log, which is likely not readable by the DBA. > > Why does one need to know what filters are active (other than, > obviously, checking that the filters one has set were actually > activated)? What decisions would a DBA or database user be able to make > based on whether a filter is set or not? So that when an enforcement action happens, you can understand why it happened. Perhaps it was a bug (omission) in your allow list, or perhaps it was an intentional rule to prevent abuse (say blocking certain syscalls from plperlu), or it was because someone is trying to compromise your system (e.g. some obscure and clearly not needed syscall). >> * With built-in support, it is possible to lock down backend processes >> more tightly than the postmaster. > > Also the other way around? As I stated in the original email, the filters can add restrictions but never remove them. >> * With built-in support, it is possible to lock down different backend >> processes differently than each other, for example by using ALTER ROLE >> ... SET or ALTER DATABASE ... SET. > > What are use cases for this? For example to allow a specific user to use plperlu to exec shell code while others cannot. 
>> Attached is a patch for discussion, adding support for seccomp-bpf >> (nowadays generally just called seccomp) syscall filtering at >> configure-time using libseccomp. I would like to get this in shape to be >> committed by the end of the November CF if possible. > > To compute the initial set of allowed system calls, you need to have > fantastic test coverage. What you don't want is some rarely used error > recovery path to cause a system crash. I wouldn't trust our current > coverage for this. So if you are worried about that make your default action 'log' and watch audit.log. There will be no errors or crashes of postgres caused by that because there will be no change in postgres visible behavior. And if returning an error from a syscall causes a crash that would be a serious bug and we should fix it. > Also, the list of system calls in use changes all the time. The same > function call from PostgreSQL could be a system call or a glibc > implementation, depending on the specific installed packages or run-time > settings. True. That is why I did not provide an initial list and believe folks who want to use this should develop their own. > Extensions would need to maintain their own permissions list, and they > would need to be merged manually into the respective existing settings. People would have to generate their own list for extensions -- I don't believe it is the extension writers' problem. > Without good answers to these, I suspect that this feature would go > straight to the top of the list of "if in doubt, turn off". That is fine. Perhaps most people never use this, but when needed (and increasingly will be required) it is available. > Overall, I think this sounds like a maintenance headache, and the > possible benefits are unclear. The only people who will incur the maintenance headache are those who need to use it. The benefits are compliance with requirements. 
Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Re: no mailing list hits in google
Hi, On August 28, 2019 9:22:44 AM PDT, Merlin Moncure wrote: >Hackers, >[apologies if this is the incorrect list or is already discussed >material] Probably should be on the -www list. Redirecting. Please trim in future replies. >I've noticed that mailing list discussions in -hackers and other >mailing lists appear to not be indexed by google -- at all. I noticed that there's fewer and fewer hits too. Pretty annoying. I have an online archive I can search, but that's not something everyone should have to do. I think it's because robots.txt tells search engines to ignore the lists. Quite hard to understand how that's a good idea. https://www.postgresql.org/robots.txt

User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /docs/devel/
Disallow: /list/
Disallow: /search/
Disallow: /message-id/raw/
Disallow: /message-id/flat/

Sitemap: https://www.postgresql.org/sitemap.xml

Without /list, there are no links to the individual messages. So there needs to be another external reference for a search engine to arrive at individual messages. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Hi Tomas,

> Interesting. Any idea where the extra overhead in this particular case
> comes from? It's hard to deduce that from the single flame graph, when
> I don't have anything to compare it with (i.e. the flame graph for the
> "normal" case).

I guess that the bottleneck is in disk operations. You can check the logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and writes (~26%) take around 35% of CPU time in summary. To compare, please see the attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',') FROM generate_series(1, 2000))
FROM generate_series(1, 100);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total. So we get very roughly the same ~x4-5 performance drop here. JFYI, I am using a machine with an SSD for tests.

>> Therefore, probably you may write changes on receiver in bigger chunks,
>> not each change separately.
>
> Possibly. I/O is certainly a possible culprit, although we should be using
> buffered I/O and there certainly are not any fsyncs here. So I'm not sure
> why it would be cheaper to do the writes in batches.
>
> BTW does this mean you see the overhead on the apply side? Or are you
> running this on a single machine, and it's difficult to decide?

I run this on a single machine, but walsender and worker are utilizing almost 100% of CPU per each process all the time, and at the apply side I/O syscalls take about 1/3 of CPU time. Though I am still not sure, for me this result somehow links the performance drop with problems at the receiver side.

Writing in batches was just a hypothesis, and to validate it I have performed a test with a large txn consisting of a smaller number of wide rows. This test does not exhibit any significant performance drop, while it was streamed too. So the hypothesis seems to be valid. Anyway, I do not have other reasonable ideas besides that right now.

I've recently checked this patch again and tried to elaborate on it in terms of performance.
As a result I've implemented a new POC version of the applier (attached). Almost everything in the streaming logic stayed intact, but the apply worker is significantly different. As I wrote earlier, I still claim that spilling changes to disk at the applier side adds additional overhead, but it is possible to get rid of it. In my additional patch I do the following:

1) Maintain a pool of additional background workers (bgworkers) that are connected with the main logical apply worker via shm_mq's. Each worker is dedicated to the processing of a specific streamed transaction.

2) When we receive a streamed change for some transaction, we check whether there is an existing dedicated bgworker in a HTAB (xid -> bgworker), or there are some in the idle list, or spawn a new one.

3) We pass all changes (between STREAM START/STOP) to that bgworker via shm_mq_send without intermediate waiting. However, we wait for the bgworker to apply the entire chunk of changes at STREAM STOP, since we don't want transaction reordering.

4) When a transaction is committed/aborted, the worker is added to the idle list and waits for a reassignment message.

5) I have used the same machinery with apply_dispatch in bgworkers, since most of the actions are practically very similar.

Thus, we do not spill anything at the applier side, so transaction changes are processed by bgworkers as normal backends do. At the same time, changes processing is strictly serial, which prevents transaction reordering and possible conflicts/anomalies. Even though we trade off performance in favor of stability, the result is rather impressive. I have used a similar query for testing as before:

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
SELECT round(random()*10), random(), random()*142
FROM generate_series(1, 100) s(i);

with 1kk (100), 3kk and 5kk rows; logical_work_mem = 64MB and synchronous_standby_names = 'FIRST 1 (large_sub)'.
Table schema is as follows:

CREATE TABLE large_test (
    id serial primary key,
    num1 bigint,
    num2 double precision,
    num3 double precision
);

Here are the results:

---------------------------------------------------------------
| N   | Time on master, sec | Total xact time, sec | Ratio    |
---------------------------------------------------------------
|                  On commit (master, v13)                    |
---------------------------------------------------------------
| 1kk | 6.5                 | 17.6                 | x2.74    |
---------------------------------------------------------------
| 3kk | 21                  | 55.4                 | x2.64    |
---------------------------------------------------------------
| 5kk | 38.3                | 91.5                 | x2.39    |
---------------------------------------------------------------
|                      Stream + spill                         |
-
Re: RFC: seccomp-bpf support
On 2019-08-28 17:13, Joe Conway wrote:
> * systemd does not implement seccomp filters by default. Packagers may
> decide to do so, but there is no guarantee. Adding them post install
> potentially requires cooperation by groups outside control of
> the database admins.

Well, if we are interested in this, why don't we just ask packagers to do so? That seems like a much easier solution.

> * In the container and systemd case there is no particularly good way to
> inspect what filters are active. It is possible to observe actions
> taken, but again, control is possibly outside the database admin
> group. For example, the best way to understand what happened is to
> review the auditd log, which is likely not readable by the DBA.

Why does one need to know what filters are active (other than, obviously, checking that the filters one has set were actually activated)? What decisions would a DBA or database user be able to make based on whether a filter is set or not?

> * With built-in support, it is possible to lock down backend processes
> more tightly than the postmaster.

Also the other way around?

> * With built-in support, it is possible to lock down different backend
> processes differently than each other, for example by using ALTER ROLE
> ... SET or ALTER DATABASE ... SET.

What are use cases for this?

> Attached is a patch for discussion, adding support for seccomp-bpf
> (nowadays generally just called seccomp) syscall filtering at
> configure-time using libseccomp. I would like to get this in shape to be
> committed by the end of the November CF if possible.

To compute the initial set of allowed system calls, you need to have fantastic test coverage. What you don't want is some rarely used error recovery path to cause a system crash. I wouldn't trust our current coverage for this.

Also, the list of system calls in use changes all the time.
The same function call from PostgreSQL could be a system call or a glibc implementation, depending on the specific installed packages or run-time settings. Extensions would need to maintain their own permissions list, and they would need to be merged manually into the respective existing settings. Without good answers to these, I suspect that this feature would go straight to the top of the list of "if in doubt, turn off". Overall, I think this sounds like a maintenance headache, and the possible benefits are unclear. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: basebackup.c's sendFile() ignores read errors
On Tue, Aug 27, 2019 at 10:33 PM Robert Haas wrote:
> While reviewing a proposed patch to basebackup.c this morning, I found
> myself a bit underwhelmed by the quality of the code and comments in
> basebackup.c's sendFile(). I believe it's already been pointed out
> that the retry logic here is not particularly correct, and the
> comments demonstrate a pretty clear lack of understanding of how the
> actual race conditions that are possible here might operate.
>
> However, I then noticed another problem which I think is significantly
> more serious: this code totally ignores the possibility of a failure
> in fread(). It just assumes that if fread() fails to return a
> positive value, it's reached the end of the file, and if that happens,
> it just pads out the rest of the file with zeroes. So it looks to me
> like if fread() encountered, say, an I/O error while trying to read a
> data file, and if that error were on the first data block, then the
> entire contents of that file would be replaced with zero bytes in the
> backup, and no error or warning of any kind would be given to the
> user. If it hit the error later, everything from the point of the
> error onward would just get replaced with zero bytes. To be clear, I
> think it is fine for basebackup.c to fill out the rest of the expected
> length with zeroes if in fact the file has been truncated during the
> backup; recovery should fix it. But recovery is not going to fix data
> that got eaten because EIO was encountered while copying a file.

Per the fread(3) manpage, we cannot distinguish between an error and EOF. So to check for any error we must use ferror() after fread(). Attached is a patch which tests ferror() and throws an error if it returns true. However, I think fread()/ferror() both do not set errno (per some web references), and thus I am throwing a generic error here. I have manually tested it.

> The logic that rereads a block appears to have the opposite problem.
> Here, it will detect an error, but it will also fail in the face of a > concurrent truncation, which it shouldn't. > For this, I have checked for feof() assuming that when the file gets truncated we reach EOF. And if yes, getting out of the loop instead of throwing an error. I may be wrong as I couldn't reproduce this issue. Please have a look over the patch and let me know your views. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > > -- Jeevan Chalke Technical Architect, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index c91f66d..5e64acf 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -90,6 +90,18 @@ static char *statrelpath = NULL; */ #define THROTTLING_FREQUENCY 8 +/* + * Checks whether we encountered any error in fread(). fread() doesn't give + * any clue what has happened, so we check with ferror(). Also, neither + * fread() nor ferror() set errno, so we just throw a generic error. + */ +#define CHECK_FREAD_ERROR(fp, filename) \ +do { \ + if (ferror(fp)) \ + ereport(ERROR, \ +(errmsg("could not read from file \"%s\"", filename))); \ +} while (0) + /* The actual number of bytes, transfer of which may cause sleep. */ static uint64 throttling_sample; @@ -543,6 +555,9 @@ perform_base_backup(basebackup_options *opt) break; } + /* Check fread() error. */ + CHECK_FREAD_ERROR(fp, pathbuf); + if (len != wal_segment_size) { CheckXLogRemoved(segno, tli); @@ -1478,6 +1493,18 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf if (fread(buf + BLCKSZ * i, 1, BLCKSZ, fp) != BLCKSZ) { +/* + * If file is truncated, then we will hit + * end-of-file error in which case we don't + * want to error out, instead just pad it with + * zeros. 
+ */ +if (feof(fp)) +{ + cnt = BLCKSZ * i; + break; +} + ereport(ERROR, (errcode_for_file_access(), errmsg("could not reread block %d of file \"%s\": %m", @@ -1529,7 +1556,7 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf len += cnt; throttle(cnt); - if (len >= statbuf->st_size) + if (feof(fp) || len >= statbuf->st_size) { /* * Reached end of file. The file could be longer, if it was @@ -1540,6 +1567,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf } } + /* Check fread() error. */ + CHECK_FREAD_ERROR(fp, readfilename); + /* If the file was truncated while we were sending it, pad it with zeros */ if (len < statbuf->st_size) {
Re: RFC: seccomp-bpf support
On Wed, Aug 28, 2019 at 11:13:27AM -0400, Joe Conway wrote: > SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall > filtering mechanism which allows reduction of the kernel attack surface > by preventing (or at least audit logging) normally unused syscalls. > > Quoting from this link: > https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt > >"A large number of system calls are exposed to every userland process > with many of them going unused for the entire lifetime of the > process. As system calls change and mature, bugs are found and > eradicated. A certain subset of userland applications benefit by > having a reduced set of available system calls. The resulting set > reduces the total kernel surface exposed to the application. System > call filtering is meant for use with those applications." > > Recent security best-practices recommend, and certain highly > security-conscious organizations are beginning to require, that SECCOMP > be used to the extent possible. The major web browsers, container > runtime engines, and systemd are all examples of software that already > support seccomp. Neat! Are the seccomp interfaces for other kernels arranged in a manner similar enough to have a unified interface in PostgreSQL, or is this more of a Linux-only feature? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: Procedure support improvements
On 2019-08-26 20:08, Laurenz Albe wrote: > test=> CREATE OR REPLACE PROCEDURE testproc() LANGUAGE plpgsql AS >$$BEGIN PERFORM 42; COMMIT; PERFORM 'x'; END;$$; > CREATE PROCEDURE > test=> CALL testproc(); > CALL > test=> BEGIN; > BEGIN > test=> CALL testproc(); > ERROR: invalid transaction termination > CONTEXT: PL/pgSQL function testproc() line 1 at COMMIT > > Oops. > I find that indeed surprising. > > What is the rationale for this? It's mostly an implementation restriction. You would need to teach SPI_commit() and SPI_rollback() to manipulate the top-level transaction block state appropriately and carefully. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Email to hackers for test coverage
On 2019-08-22 11:46, movead...@highgo.ca wrote: > *1. src/include/utils/float.h:140* > > Analyze: > This is an error report line when converting a big float8 value > which a float4 can not storage to float4. > > Test case: > Add a test case as below in file float4.sql: > select float4(1234567890123456789012345678901234567890::float8); > +-- Add test case for float4() type fonversion function Check spelling > *2. src/include/utils/float.h:145* > > Analyze: > This is an error report line when converting a small float8 value > which a float4 can not storage to float4. > > Test case: > Add a test case as below in file float4.sql: > select float4(0.01::float8); > > *3.src/include/utils/sortsupport.h:264* > > Analyze: > It is reverse sorting for the data type that has abbreviated for > sort, for example macaddr, uuid, numeric, network and I choose > numeric to do it. > > Test cast: > Add a test case as below in file numeric.sql: > INSERT INTO num_input_test(n1) values('99.998'); > INSERT INTO num_input_test(n1) values('99.997'); > SELECT * FROM num_input_test ORDER BY n1 DESC; > INSERT INTO num_input_test(n1) VALUES ('nan'); > +INSERT INTO num_input_test(n1) values('99.998'); > +INSERT INTO num_input_test(n1) values('99.997'); Make spaces and capitalization match surrounding code. > Result and patch > > By adding the test cases, the test coverage of float.h increased from > 97.7% to 100% and sortsupport.h increased from 76.7% to 80.0%. That's fine, but I suggest that if you really want to make an impact in test coverage, look to increase function coverage rather than line coverage. Or look for files that are not covered at all. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
no mailing list hits in google
Hackers,

[apologies if this is the incorrect list or is already discussed material]

I've noticed that mailing list discussions in -hackers and other mailing lists appear to not be indexed by google -- at all. We are also not being tracked by any mailing list aggregators -- in contrast to a decade ago, when we had nabble and other systems to collect and organize results (tbh, often better than we do), we are now at an extreme disadvantage; mailing list activity was formerly an absolutely fantastic research resource via google for finding solutions to obscure technical problems in the database. Limited access to this information will directly lead to increased bug reports, lack of solution confidence, etc.

My test case here is the query:

pgsql-hackers ExecHashJoinNewBatch

I was searching out a link to a recent bug report for copy/paste into corporate email. In the old days this would fire right up, but now it returns no hits even though the discussion is available in the archives (which I had to find by looking up the specific day the thread was active).

Just a heads up.

merlin
Re: Statement timeout in pg_rewind
Alexander Kukushkin writes: > On Wed, 28 Aug 2019 at 04:51, Michael Paquier wrote: >> note that your patch had a warning as "result" is not needed in >> run_simple_command(). > Ohh, sorry about that. I compiled the whole source tree and the > warning was buried down among other output :( "make -s" is your friend ... regards, tom lane
Re: Improve error detections in TAP tests by spreading safe_psql
I wrote: > After a brief review of node->psql calls in the TAP tests, I'm > of the opinion that what we should do is revert fb57f40 and instead > change PostgresNode::psql so that on_error_die defaults to 1, then > fix the *very* small number of callers that need it to be zero. > Otherwise we're just going to be fighting this same fire forevermore. > Moreover, that's going to lead to a much smaller patch. Hmm .. I experimented with doing it this way, as attached, and I guess I have to take back the assertion that it'd be a much smaller patch. I found 47 calls that'd need to be changed to set on_error_die to 0, whereas your patch is touching 44 calls (though I think it missed some). I still think this is a safer, less bug-prone way to proceed though. The attached patch just follows a mechanical rule of "set on_error_die to 0 in exactly those calls where the surrounding code verifies the exit code is what it expects". That leads to a lot of code that looks like, say, # Insert should work on standby is( $node_standby->psql( 'postgres', - qq{insert into testtab select generate_series(1,1000), 'foo';}), + qq{insert into testtab select generate_series(1,1000), 'foo';}, + on_error_die => 0), 0, 'INSERT succeeds with truncated relation FSM'); I have to wonder if it isn't better to just drop the is() test and let PostgresNode::psql issue the failure. The only thing the is() is really buying us is the ability to label the test with an ID string, which is helpful, but um ... couldn't that just be a comment? 
Or we could think about adding another optional parameter: $node_standby->psql( 'postgres', qq{insert into testtab select generate_series(1,1000), 'foo';}, test_name => 'INSERT succeeds with truncated relation FSM'); regards, tom lane diff --git a/src/bin/pg_ctl/t/004_logrotate.pl b/src/bin/pg_ctl/t/004_logrotate.pl index 71dbfd2..8760804 100644 --- a/src/bin/pg_ctl/t/004_logrotate.pl +++ b/src/bin/pg_ctl/t/004_logrotate.pl @@ -19,7 +19,7 @@ $node->start(); # Verify that log output gets to the file -$node->psql('postgres', 'SELECT 1/0'); +$node->psql('postgres', 'SELECT 1/0', on_error_die => 0); my $current_logfiles = slurp_file($node->data_dir . '/current_logfiles'); @@ -75,7 +75,7 @@ chomp $lfname; # Verify that log output gets to this file, too -$node->psql('postgres', 'fee fi fo fum'); +$node->psql('postgres', 'fee fi fo fum', on_error_die => 0); my $second_logfile; for (my $attempts = 0; $attempts < $max_attempts; $attempts++) diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index f064ea1..53f4824 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -3274,6 +3274,7 @@ $node->psql( 'postgres', "CREATE COLLATION testing FROM \"C\"; DROP COLLATION testing;", on_error_stop => 0, + on_error_die => 0, stderr=> \$collation_check_stderr); if ($collation_check_stderr !~ /ERROR: /) diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl index b82d3f6..7cc1c88 100644 --- a/src/bin/pgbench/t/001_pgbench_with_server.pl +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl @@ -487,12 +487,10 @@ INSERT INTO seeded_random(seed, rand, val) VALUES } # check that all runs generated the same 4 values -my ($ret, $out, $err) = $node->psql('postgres', +my $out = $node->safe_psql('postgres', 'SELECT seed, rand, val, COUNT(*) FROM seeded_random GROUP BY seed, rand, val' ); -ok($ret == 0, "psql seeded_random count ok"); -ok($err eq '', "psql seeded_random count stderr is 
empty"); ok($out =~ /\b$seed\|uniform\|1\d\d\d\|2/, "psql seeded_random count uniform"); ok( $out =~ /\b$seed\|exponential\|2\d\d\d\|2/, diff --git a/src/test/authentication/t/001_password.pl b/src/test/authentication/t/001_password.pl index 3a3b0eb..b58304e 100644 --- a/src/test/authentication/t/001_password.pl +++ b/src/test/authentication/t/001_password.pl @@ -45,7 +45,10 @@ sub test_role $status_string = 'success' if ($expected_res eq 0); - my $res = $node->psql('postgres', undef, extra_params => [ '-U', $role ]); + my $res = $node->psql( + 'postgres', undef, + on_error_die => 0, + extra_params => [ '-U', $role ]); is($res, $expected_res, "authentication $status_string for method $method, role $role"); return; diff --git a/src/test/authentication/t/002_saslprep.pl b/src/test/authentication/t/002_saslprep.pl index c4b335c..0a75736 100644 --- a/src/test/authentication/t/002_saslprep.pl +++ b/src/test/authentication/t/002_saslprep.pl @@ -42,7 +42,10 @@ sub test_login $status_string = 'success' if ($expected_res eq 0); $ENV{"PGPASSWORD"} = $password; - my $res = $node->psql('postgres', undef, extra_params => [ '-U', $role ]); + my $res = $node->psql( + 'postgres', undef, + on_error_die => 0, + extra_params => [ '-U', $role ]); is($res, $expected_r
Re: Index Skip Scan
> On Mon, Aug 5, 2019 at 10:38 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> Of course we can modify it to read a whole page and leave the visibility
> check for the code after `index_getnext_tid` (although in case we know that
> all tuples on this page are visible I guess it's not strictly necessary,
> but we still get improvement from skipping itself).

Sorry for the long delay. Here is more or less what I had in mind. After changing read_closest to read the whole page I couldn't resist just merging it into readpage itself, since it's basically the same. It could raise questions about performance and how intrusive this patch is, but I hope it's not that much of a problem (in the worst case we can split it back). I've also added a few tests for the issue you've mentioned. Thanks again, I appreciate how much effort you put into reviewing!

v24-0001-Index-skip-scan.patch
Description: Binary data
RFC: seccomp-bpf support
SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall filtering mechanism which allows reduction of the kernel attack surface by preventing (or at least audit logging) normally unused syscalls.

Quoting from this link:
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt

  "A large number of system calls are exposed to every userland process
   with many of them going unused for the entire lifetime of the
   process. As system calls change and mature, bugs are found and
   eradicated. A certain subset of userland applications benefit by
   having a reduced set of available system calls. The resulting set
   reduces the total kernel surface exposed to the application. System
   call filtering is meant for use with those applications."

Recent security best-practices recommend, and certain highly security-conscious organizations are beginning to require, that SECCOMP be used to the extent possible. The major web browsers, container runtime engines, and systemd are all examples of software that already support seccomp.

-

A seccomp (bpf) filter is comprised of a default action, and a set of rules with actions pertaining to specific syscalls (possibly with even more specific sets of arguments). Once loaded into the kernel, a filter is inherited by all child processes and cannot be removed. It can, however, be overlaid with another filter. For any given syscall match, the most restrictive (a.k.a. highest precedence) action will be taken by the kernel.

PostgreSQL has already been run "in the wild" under seccomp control in containers, and possibly systemd. Adding seccomp support into PostgreSQL itself mitigates issues with these approaches, and has several advantages:

* Container seccomp filters tend to be extremely broad/permissive, typically allowing about 6 out of 7 of all syscalls. They must do this because the use cases for containers vary widely.

* systemd does not implement seccomp filters by default. Packagers may decide to do so, but there is no guarantee.
Adding them post install potentially requires cooperation by groups outside control of the database admins. * In the container and systemd case there is no particularly good way to inspect what filters are active. It is possible to observe actions taken, but again, control is possibly outside the database admin group. For example, the best way to understand what happened is to review the auditd log, which is likely not readable by the DBA. * With built-in support, it is possible to lock down backend processes more tightly than the postmaster. * With built-in support, it is possible to lock down different backend processes differently than each other, for example by using ALTER ROLE ... SET or ALTER DATABASE ... SET. * With built-in support, it is possible to calculate and return (in the form of an SRF) the effective filters being applied to the postmaster and the current backend. * With built-in support, it could be possible (this part not yet implemented) to have separate filters for different backend types, e.g. autovac workers, background writer, etc. - Attached is a patch for discussion, adding support for seccomp-bpf (nowadays generally just called seccomp) syscall filtering at configure-time using libseccomp. I would like to get this in shape to be committed by the end of the November CF if possible. The code itself has been through several rounds of revision based on discussions I have had with the author of libseccomp as well as a few other folks. However as of the moment: * Documentation - general discussion missing entirely * No regression tests - For convenience, here are a couple of additional links to relevant information regarding seccomp: https://en.wikipedia.org/wiki/Seccomp https://github.com/seccomp/libseccomp - Specific feedback requested: 1. Placement of pg_get_seccomp_filter() in src/backend/utils/adt/genfile.c originally made sense but after several rewrites no longer does. Ideas where it *should* go? 2. 
Where should a general discussion section go in the docs, if at all? 3. Currently this supports a global filter at the postmaster level, which is inherited by all child processes, and a secondary filter at the client backend session level. It likely makes sense to support secondary filters for other types of child processes, e.g. autovacuum workers, etc. Add that now (pg13), later release, or never? 4. What is the best way to approach testing of this feature? Tap testing perhaps? 5. Default GUC values - should we provide "starter" lists, or only a procedure for generating a list (as below). - Notes on usage: === In order to determine your minimally required allow lists, do something like the following on a non-production server with the same architecture as production: 0. Setup: * install libseccomp, libseccomp-dev, and seccomp * install auditd if not already installed * confi
Improve base backup protocol documentation
It was apparently entirely undocumented that the tablespace size estimates sent by the base backup protocol are in kilobytes. Here is a patch to document that. Also, a related clarification in the pg_basebackup.c source code: It was not clear without analyzing the whole stack that "totalsize" is in kilobytes and "totaldone" is in bytes. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 2b9c1b1c8397bf9f6613d0d134ad3fa19a9292b9 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 28 Aug 2019 16:46:54 +0200 Subject: [PATCH] Improve base backup protocol documentation Document that the tablespace sizes are in units of kilobytes. Make the pg_basebackup source code a bit clearer about this, too. --- doc/src/sgml/protocol.sgml| 4 ++-- src/bin/pg_basebackup/pg_basebackup.c | 14 +++--- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index b20f1690a7..f036d5f178 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2615,8 +2615,8 @@ Streaming Replication Protocol size (int8) - The approximate size of the tablespace, if progress report has - been requested; otherwise it's null. + The approximate size of the tablespace, in kilobytes (1024 bytes), + if progress report has been requested; otherwise it's null. 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c index 9207109ba3..498754eb32 100644 --- a/src/bin/pg_basebackup/pg_basebackup.c +++ b/src/bin/pg_basebackup/pg_basebackup.c @@ -115,7 +115,7 @@ static bool made_tablespace_dirs = false; static bool found_tablespace_dirs = false; /* Progress counters */ -static uint64 totalsize; +static uint64 totalsize_kb; static uint64 totaldone; static int tablespacecount; @@ -722,7 +722,7 @@ progress_report(int tablespacenum, const char *filename, bool force) return; /* Max once per second */ last_progress_report = now; - percent = totalsize ? (int) ((totaldone / 1024) * 100 / totalsize) : 0; + percent = totalsize_kb ? (int) ((totaldone / 1024) * 100 / totalsize_kb) : 0; /* * Avoid overflowing past 100% or the full size. This may make the total @@ -732,8 +732,8 @@ progress_report(int tablespacenum, const char *filename, bool force) */ if (percent > 100) percent = 100; - if (totaldone / 1024 > totalsize) - totalsize = totaldone / 1024; + if (totaldone / 1024 > totalsize_kb) + totalsize_kb = totaldone / 1024; /* * Separate step to keep platform-dependent format code out of @@ -742,7 +742,7 @@ progress_report(int tablespacenum, const char *filename, bool force) */ snprintf(totaldone_str, sizeof(totaldone_str), INT64_FORMAT, totaldone / 1024); - snprintf(totalsize_str, sizeof(totalsize_str), INT64_FORMAT, totalsize); + snprintf(totalsize_str, sizeof(totalsize_str), INT64_FORMAT, totalsize_kb); #define VERBOSE_FILENAME_LENGTH 35 if (verbose) @@ -1942,11 +1942,11 @@ BaseBackup(void) /* * Sum up the total size, for progress reporting */ - totalsize = totaldone = 0; + totalsize_kb = totaldone = 0; tablespacecount = PQntuples(res); for (i = 0; i < PQntuples(res); i++) { - totalsize += atol(PQgetvalue(res, i, 2)); + totalsize_kb += atol(PQgetvalue(res, i, 2)); /* * Verify tablespace directories are empty. Don't bother with the -- 2.22.0
Re: assertion at postmaster start
On 2019-Aug-26, Tom Lane wrote:
> I wrote:
> > I propose the attached. I'm inclined to think that the risk/benefit
> > of back-patching this is not very good, so I just want to stick it in
> > HEAD, unless somebody can explain why dead_end children are likely to
> > crash in the field.
>
> Pushed at ee3278239.
>
> I'm still curious as to the explanation for a dead_end child exiting
> with code 15, but I have no way to pursue the point.

Many thanks for all the investigation and fix! Sadly, I have *no* idea what could have happened that would have caused a connection at that point (my start scripts don't do it). It is possible that I had a terminal running some shell loop on psql ("watch psql -c something" perhaps). But I'm sure I didn't notice that when I reported this, or I would have mentioned it. However, I have no idea why it would have died with code 15. From my notes of what I was doing that day, I can't find any evidence that I would have had anything in shared_preload_libraries. (I don't have Frost's complete timestamped shell history, however.)

--
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Improve error detections in TAP tests by spreading safe_psql
Michael Paquier writes: > This is a follow-up of the discussion which happened here after Tom > committed fb57f40: > https://www.postgresql.org/message-id/20190828012439.ga1...@paquier.xyz > I have reviewed the TAP tests, and we have many more spots where it > is better to use PostgresNode::safe_psql instead of PostgresNode::psql, so > that the test dies immediately if there is a failure on a query which > should never fail. After a brief review of node->psql calls in the TAP tests, I'm of the opinion that what we should do is revert fb57f40 and instead change PostgresNode::psql so that on_error_die defaults to 1, then fix the *very* small number of callers that need it to be zero. Otherwise we're just going to be fighting this same fire forevermore. Moreover, that's going to lead to a much smaller patch. > Attached are the spots I have found. Any thoughts or opinions? Well, for instance, you missed SSLServer.pm, in which every single one of the psql calls is wrong. If we go this route we'd have to back-patch the change, else it'd be a serious hazard for test case back-patching. But the buildfarm would, more or less by definition, find any oversights --- so that doesn't scare me much. regards, tom lane
Doubt regarding logical decoding
I'm pretty new to this, and I find it hard to make sense of the source code. I'm trying to understand logical decoding. In the `HEAP2` RMGR, what does the HEAP2_NEW_CID operation do? Can somebody give an example of when this type of record is written to WAL files? When logical decoding decodes this record, what does it mean to it? Is it a transaction with sub-transactions, which does some DDL changes? -- Thank you
psql - improve test coverage from 41% to 88%
Hello devs, The attached patch improves psql code coverage by adding a specific TAP test. The 1709 tests take 4 seconds of CPU time (6.3 seconds elapsed) on my laptop. The infrastructure is updated to require the perl module "Expect", which allows testing interactive features such as tab completion and prompt changes. Coverage in "src/bin/psql" jumps from 40.0% of lines and 41.9% of functions to 78.4% of lines and 98.1% of functions with "check-world". -- Fabien.
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl index b7d36b65dd..aabc27ab6f 100644 --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl @@ -506,6 +506,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt" ], 1, + '', [qr{^$}], [qr/^WARNING.*checksum verification failed/s], 'pg_basebackup reports checksum mismatch'); @@ -526,6 +527,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt2" ], 1, + '', [qr{^$}], [qr/^WARNING.*further.*failures.*will.not.be.reported/s], 'pg_basebackup does not report more than 5 checksum mismatches'); @@ -542,6 +544,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt3" ], 1, + '', [qr{^$}], [qr/^WARNING.*7 total checksum verification failures/s], 'pg_basebackup correctly report the total number of checksum mismatches'); diff --git a/src/bin/pg_checksums/t/002_actions.pl b/src/bin/pg_checksums/t/002_actions.pl index 59228b916c..cf4811d382 100644 --- a/src/bin/pg_checksums/t/002_actions.pl +++ b/src/bin/pg_checksums/t/002_actions.pl @@ -62,6 +62,7 @@ sub check_relation_corruption '--filenode', $relfilenode_corrupted ], 1, + '', [qr/Bad checksums:.*1/], [qr/checksum verification failed/], "fails with corrupted data for single relfilenode on tablespace $tablespace" @@
-71,6 +72,7 @@ sub check_relation_corruption $node->command_checks_all( [ 'pg_checksums', '--check', '-D', $pgdata ], 1, + '', [qr/Bad checksums:.*1/], [qr/checksum verification failed/], "fails with corrupted data on tablespace $tablespace"); @@ -203,6 +205,7 @@ sub fail_corrupt $node->command_checks_all( [ 'pg_checksums', '--check', '-D', $pgdata ], 1, + '', [qr/^$/], [qr/could not read block 0 in file.*$file\":/], "fails for corrupted data in $file"); diff --git a/src/bin/pg_controldata/t/001_pg_controldata.pl b/src/bin/pg_controldata/t/001_pg_controldata.pl index 3b63ad230f..ebe9b80a52 100644 --- a/src/bin/pg_controldata/t/001_pg_controldata.pl +++ b/src/bin/pg_controldata/t/001_pg_controldata.pl @@ -33,6 +33,7 @@ close $fh; command_checks_all( [ 'pg_controldata', $node->data_dir ], 0, + '', [ qr/WARNING: Calculated CRC checksum does not match value stored in file/, qr/WARNING: invalid WAL segment size/ diff --git a/src/bin/pg_resetwal/t/002_corrupted.pl b/src/bin/pg_resetwal/t/002_corrupted.pl index f9940d7fc5..1990669d26 100644 --- a/src/bin/pg_resetwal/t/002_corrupted.pl +++ b/src/bin/pg_resetwal/t/002_corrupted.pl @@ -30,6 +30,7 @@ close $fh; command_checks_all( [ 'pg_resetwal', '-n', $node->data_dir ], 0, + '', [qr/pg_control version number/], [ qr/pg_resetwal: warning: pg_control exists but is broken or wrong version; ignoring it/ @@ -46,6 +47,7 @@ close $fh; command_checks_all( [ 'pg_resetwal', '-n', $node->data_dir ], 0, + '', [qr/pg_control version number/], [ qr/\Qpg_resetwal: warning: pg_control specifies invalid WAL segment size (0 bytes); proceed with caution\E/ diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl index b82d3f65c4..01010914fe 100644 --- a/src/bin/pgbench/t/001_pgbench_with_server.pl +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl @@ -50,7 +50,7 @@ sub pgbench push @cmd, @args; - $node->command_checks_all(\@cmd, $stat, $out, $err, $name); + $node->command_checks_all(\@cmd, 
$stat, '', $out, $err, $name); # cleanup? #unlink @filenames or die "cannot unlink files (@filenames): $!"; diff --git a/src/bin/pgbench/t/002_pgbench_no_server.pl b/src/bin/pgbench/t/002_pgbench_no_server.pl index f7fa18418b..b58f3548c3 100644 --- a/src/bin/pgbench/t/002_pgbench_no_server.pl +++ b/src/bin/pgbench/t/002_pgbench_no_server.pl @@ -25,7 +25,7 @@ sub pgbench my ($opts, $stat, $out, $err, $name) = @_; print STDERR "opts=$opts, stat=$stat, out=$out, err=$err, name=$name"; command_checks_all([ 'pgbench', split(/\s+/, $opts) ], - $stat, $out, $err, $name); + $stat, '', $out, $err, $name); return; } @@ -52,7 +52,7 @@ sub pgbench_scripts push @cmd, '-f', $filename; } } - command_checks_all(\@cmd, $stat, $out, $err, $name); + command_checks_all(\@cmd, $stat, '', $out, $err, $name); return; } diff --git a/src/b
Re: Crash in BRIN summarization
Heikki Linnakangas writes: > I bumped into a little bug in BRIN, while hacking on something > unrelated. This causes a segfault, or an assertion failure if assertions > are enabled: Good catch. > Fix attached. Hm, I don't particularly care for directly comparing Datum values like that. We don't do that elsewhere (I think) and it feels like a type violation to me. Since you've already tested for !attbyval, I think it'd be legitimate to extract the pointer values and compare those, eg if (!attr->attbyval && DatumGetPointer(result) != DatumGetPointer(column->bv_values[INCLUSION_UNION])) I realize that this makes no difference in terms of the machine code (at least for most machines?) but it avoids breaking the Datum abstraction. regards, tom lane
Re: pg_get_databasebyid(oid)
> Please add that to commitfest. Done: https://commitfest.postgresql.org/24/2261/ regards, Sergei
Re: pg_get_databasebyid(oid)
On Wed, Aug 28, 2019 at 5:38 PM Sergei Kornilov wrote: > Hello > We already have the function pg_get_userbyid(oid), which does a lookup in > the pg_authid catalog. My colleague asked me whether we can add a similar > function, pg_get_databasebyid(oid), with a lookup in pg_database. > It is a simple function to get a database name by OID, falling back to > 'unknown (OID=n)' if it is missing. > > The proposed patch is attached. Currently it lacks tests - I am unsure > which file in src/test/regress/sql/ is the most suitable; pg_get_userbyid > is called from privileges.sql only. > > regards, Sergei Please add that to commitfest. -- Ibrar Ahmed
pg_get_databasebyid(oid)
Hello We already have the function pg_get_userbyid(oid), which does a lookup in the pg_authid catalog. My colleague asked me whether we can add a similar function, pg_get_databasebyid(oid), with a lookup in pg_database. It is a simple function to get a database name by OID, falling back to 'unknown (OID=n)' if it is missing. The proposed patch is attached. Currently it lacks tests - I am unsure which file in src/test/regress/sql/ is the most suitable; pg_get_userbyid is called from privileges.sql only. regards, Sergei
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index c878a0ba4d..5ed5b3ac39 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -18333,6 +18333,10 @@ SELECT pg_type_is_visible('myschema.widget'::regtype); pg_get_userbyid + +pg_get_databasebyid + + pg_get_viewdef @@ -18513,6 +18517,11 @@ SELECT pg_type_is_visible('myschema.widget'::regtype); name get role name with given OID + + pg_get_databasebyid(db_oid) + name + get database name with given OID + pg_get_viewdef(view_name) text @@ -18703,8 +18712,9 @@ SELECT currval(pg_get_serial_sequence('sometable', 'id')); - pg_get_userbyid extracts a role's name given - its OID. + pg_get_userbyid and + pg_get_databasebyid extract a role's or a + database's name, respectively, given its OID.
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 3e64390d81..214a081555 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -31,6 +31,7 @@ #include "catalog/pg_authid.h" #include "catalog/pg_collation.h" #include "catalog/pg_constraint.h" +#include "catalog/pg_database.h" #include "catalog/pg_depend.h" #include "catalog/pg_language.h" #include "catalog/pg_opclass.h" @@ -2474,6 +2475,41 @@ pg_get_userbyid(PG_FUNCTION_ARGS) PG_RETURN_NAME(result); } +/* -- + * get_databasebyid - Get a database name by oid and + * fallback to 'unknown (OID=n)' + * -- + */ +Datum +pg_get_databasebyid(PG_FUNCTION_ARGS) +{ + Oid dbid = PG_GETARG_OID(0); + Name result; + HeapTuple dbtup; + Form_pg_database dbrec; + + /* + * Allocate space for the result + */ + result = (Name) palloc(NAMEDATALEN); + memset(NameStr(*result), 0, NAMEDATALEN); + + /* + * Get the pg_database entry and print the result + */ + dbtup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(dbid)); + if (HeapTupleIsValid(dbtup)) + { + dbrec = (Form_pg_database) GETSTRUCT(dbtup); + StrNCpy(NameStr(*result), NameStr(dbrec->datname), NAMEDATALEN); + ReleaseSysCache(dbtup); + } + else + sprintf(NameStr(*result), "unknown (OID=%u)", dbid); + + PG_RETURN_NAME(result); +} + /* * pg_get_serial_sequence diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index cf1f409351..4f1c55c3c7 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -3573,6 +3573,9 @@ { oid => '1642', descr => 'role name by OID (with fallback)', proname => 'pg_get_userbyid', provolatile => 's', prorettype => 'name', proargtypes => 'oid', prosrc => 'pg_get_userbyid' }, +{ oid => '9978', descr => 'database name by OID (with fallback)', + proname => 'pg_get_databasebyid', provolatile => 's', prorettype => 'name', + proargtypes => 'oid', prosrc => 'pg_get_databasebyid' }, { oid => '1643', descr => 'index description', proname => 
'pg_get_indexdef', provolatile => 's', prorettype => 'text', proargtypes => 'oid', prosrc => 'pg_get_indexdef' },
Re: [HACKERS] advanced partition matching algorithm for partition-wise join
On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita wrote: > It seems that I performed the above tests on an assertion-enabled > build. :( So I executed the tests one more time. Here are the > results. > > * 2-way self-join of pt: explain analyze select * from pt t0, pt t1 > where t0.a = t1.a; > - HEAD: > Planning Time: 0.969 ms > Execution Time: 13.843 ms > - with patch: > Planning Time: 1.720 ms > Execution Time: 14.393 ms > - with patch plus attached: > Planning Time: 1.630 ms > Execution Time: 14.002 ms > > * 4-way self-join of pt: explain analyze select * from pt t0, pt t1, > pt t2, pt t3 where t0.a = t1.a > > and t1.a = t2.a and t2.a = t3.a; > - HEAD: > Planning Time: 12.203 ms > Execution Time: 31.784 ms > - with patch: > Planning Time: 32.102 ms > Execution Time: 32.504 ms > - with patch plus attached: > Planning Time: 19.471 ms > Execution Time: 32.582 ms > > * 8-way self-join of pt: explain analyze select * from pt t0, pt t1, > pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a = > t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a > and t6.a = t7.a; > - HEAD: > Planning Time: 948.131 ms > Execution Time: 55.645 ms > - with patch: > Planning Time: 2939.813 ms > Execution Time: 56.760 ms > - with patch plus attached: > Planning Time: 1108.076 ms > Execution Time: 55.750 ms > > Note: the attached patch still uses the proposed partition matching > algorithm for these queries. As I said before, these queries don't > need that algorithm, so we could eliminate the planning overhead > compared to HEAD, by planning these queries as before, perhaps, but I > haven't modified the patch as such yet. I modified the patch further as such. Attached is an updated version of the patch created on top of the patch in [1]. I did the tests again using the updated version of the patch. 
Here are the results: * 2-way self-join of pt: Planning Time: 1.043 ms Execution Time: 13.931 ms * 4-way self-join of pt: Planning Time: 12.499 ms Execution Time: 32.392 ms * 8-way self-join of pt: Planning Time: 968.412 ms Execution Time: 56.328 ms The planning time for each test case still increased slightly, but I think that would be acceptable. To see the efficiency of the attached, I did another test with test cases that really need the new partition-matching algorithm: * explain analyze select * from pt6 t6, pt7 t7 where t6.a = t7.a; - base patch in [1] Planning Time: 1.758 ms Execution Time: 13.977 ms - with attached Planning Time: 1.777 ms Execution Time: 13.959 ms * explain analyze select * from pt4 t4, pt5 t5, pt6 t6, pt7 t7 where t4.a = t5.a and t5.a = t6.a and t6.a = t7.a; - base patch in [1] Planning Time: 33.201 ms Execution Time: 32.480 ms - with attached Planning Time: 21.019 ms Execution Time: 32.777 ms * explain analyze select * from pt0 t0, pt1 t1, pt2 t2, pt3 t3, pt4 t4, pt5 t5, pt6 t6, pt7 t7 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a and t6.a = t7.a; - base patch in [1] Planning Time: 3060.000 ms Execution Time: 55.553 ms - with attached Planning Time: 1144.996 ms Execution Time: 56.233 ms where pt0, pt1, pt2, pt3, pt4, pt5, pt6 and pt7 are list partitioned tables that have slightly different list values. (The structure and list values of ptN are almost the same as that of pt used above, but ptN's N-th partition ptNpN has an extra list value that pt's N-th partition ptpN doesn't have.) If anyone is interested in this testing, I'll send a script file for producing these list partitioned tables. About the attached: * The attached patch modified try_partitionwise_join() so that we call partition_bounds_equal() then partition_bounds_merge() (if necessary) to create the partition bounds for the join rel. 
We don't support merging the partition bounds in the hash-partitioning case, which makes the code handling that case in partition_bounds_merge() completely unnecessary, so I removed it entirely. * I removed this assertion in partition_bounds_merge(), because I think this is covered by two assertions above this. + Assert((*outer_parts == NIL || *inner_parts != NIL) && + (*inner_parts == NIL || *outer_parts != NIL)); * (I forgot to mention this in a previous email, but) I removed this bit of generate_matching_part_pairs(), because we already have the same check in try_partitionwise_join(), so this bit would be redundant IIUC. + switch (jointype) + { + case JOIN_INNER: + case JOIN_SEMI: + + /* +* An inner or semi join can not return any row when the +* matching partition on either side is missing. We should +* have eliminated all such cases while merging the bounds. +*/ + Assert(part1 >= 0 && part2 >= 0); + break; + + case JOIN_LEFT: + case JOIN_ANTI: + A
[no subject]
Hello, I hear that it is not recommended to use pg_resetwal with --wal-segsize to increase the WAL segment size. Is there any more detailed information about this? What effects could it have? Is it possible to change it while the server is fully offline? Regards, Paul
Re: Performance improvement of WAL writing?
On Wed, Aug 28, 2019 at 2:43 AM Moon, Insung wrote: > So what about modifying the XLogWrite function to write only the size > that needs to be recorded? > Could this idea benefit WAL writing performance? > If it's OK to improve, I want to make the modification. > What do you think of it? I have performed tests in this direction several years ago. The way it works now guarantees that the data of recycled WAL segments is never read from disk, as long as the WAL block size is a multiple of the filesystem's page/fragment size. The OS sees that the write() is on a fragment boundary and full fragment(s) in size. If the write() were smaller, the OS would be forced to combine the new data with the rest of the fragment's old data on disk. To do so, the system would have to wait until the old fragment is actually in the OS buffer. This means that once you have enough WAL segments to cycle through so that the checkpoint reason is never XLOG, and the blocks of the WAL segment being cycled in have been evicted since they were last used, this causes real reads. On spinning media, which are still an excellent choice for WAL, this turns into a total disaster, because it adds a rotational delay for every single WAL block that is only partially overwritten. I believe that this could only work if we stopped recycling WAL segments and instead deleted old segments, then created new files as needed. This, however, adds the overhead of constant metadata changes to WAL writing, and I have no idea what performance or reliability impact that may have. There were reasons we chose to implement WAL segment recycling many years ago. These reasons may no longer be valid on modern filesystems, but it definitely is not a performance question alone. Regards, Jan -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: Crash in BRIN summarization
Thank you for your fix. > This assumes that the merge function returns a newly-palloc'd value. > That's a shaky assumption; if one of the arguments is an empty range, > range_merge() returns the other argument, rather than a newly > constructed value. And surely we can't assume assume that for > user-defined opclasses. Your analysis looks right to me. > brin_inclusion_union() has a similar issue, but I didn't write a script > to reproduce that. Fix attached. I am not sure about this part. If it's okay to use col_a->bv_values without copying, it should also be okay to use col_b->bv_values, no?
Converting Nested loop to hashjoin for not is distinct from case
Hi, I have a join query with an 'is not distinct from' criterion (select * from table1 join table2 on a is not distinct from b). On examining it, it was the nested loop join that made the query slow. For a join with an 'is not distinct from' qual, can't the planner choose a hash join over a nested loop, or would there be a problem in doing so? Kindly enlighten me about this. Thanks Pradeep
Re: A problem about partitionwise join
Hi, On Tue, Aug 27, 2019 at 4:57 PM Richard Guo wrote: > Check the query below as a more illustrative example: > > create table p (k int, val int) partition by range(k); > create table p_1 partition of p for values from (1) to (10); > create table p_2 partition of p for values from (10) to (100); > > If we use quals 'foo.k = bar.k and foo.k = bar.val', we can generate > partitionwise join: > > # explain (costs off) > select * from p as foo join p as bar on foo.k = bar.k and foo.k = bar.val; >QUERY PLAN > - > Append >-> Hash Join > Hash Cond: (foo.k = bar.k) > -> Seq Scan on p_1 foo > -> Hash >-> Seq Scan on p_1 bar > Filter: (k = val) >-> Hash Join > Hash Cond: (foo_1.k = bar_1.k) > -> Seq Scan on p_2 foo_1 > -> Hash >-> Seq Scan on p_2 bar_1 > Filter: (k = val) > (13 rows) > > But if we exchange the order of the two quals to 'foo.k = bar.val and > foo.k = bar.k', then partitionwise join cannot be generated any more, > because we only have joinclause 'foo.k = bar.val' as it first reached > score of 3. We have missed the joinclause on the partition key although > it does exist. > > # explain (costs off) > select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k; >QUERY PLAN > - > Hash Join >Hash Cond: (foo.k = bar.val) >-> Append > -> Seq Scan on p_1 foo > -> Seq Scan on p_2 foo_1 >-> Hash > -> Append >-> Seq Scan on p_1 bar > Filter: (val = k) >-> Seq Scan on p_2 bar_1 > Filter: (val = k) > (11 rows) I think it would be nice if we can address this issue. Best regards, Etsuro Fujita
Re: Statement timeout in pg_rewind
Hi, Thank you, Michael, for committing it! On Wed, 28 Aug 2019 at 04:51, Michael Paquier wrote: > note that your patch had a warning as "result" is not needed in > run_simple_command(). Ohh, sorry about that. I compiled the whole source tree and the warning was buried among other output :( Regards, -- Alexander Kukushkin
Re: Comment in ginpostinglist.c doesn't match code
On 23/08/2019 11:44, Masahiko Sawada wrote: On Fri, Aug 23, 2019 at 7:05 AM Ashwin Agrawal wrote: On Thu, Aug 22, 2019 at 1:14 AM Heikki Linnakangas wrote: The patch also includes a little unit test module to test this without creating a 16 TB table. A whole new test module seems a bit like overkill just for this, but clearly we were missing test coverage here. And it will come in handy if we want to invent a new, better posting list format in the future. Thoughts on whether to include the test module or not? I like the test, as it importantly adds missing coverage. Also, it really simplifies the validation effort if a change is required in this area anytime in the future. So, I would +1 keeping it. I'd +1 too. It's valuable to test a hard-to-reproduce case. I often want to do such unit tests at a cheaper cost, though. Ok, including it seems to be the consensus. BTW, it's not related to this patch, but I got confused about where the "17 bits" in the following paragraph in ginpostinglist.c comes from. If we use only 43 bits out of a 64-bit unsigned integer, we have 21 bits left. * For encoding purposes, item pointers are represented as 64-bit unsigned * integers. The lowest 11 bits represent the offset number, and the next * lowest 32 bits are the block number. That leaves 17 bits unused, i.e. * only 43 low bits are used. Huh, you're right. I think it should be "leaves 21 bits unused". I suspect the patch used 15 bits for the offset number early in the development, which would've left 17 bits unused, and when it was later changed to use only 11 bits, that comment was neglected. Fixed that comment, too, and pushed. Thanks! - Heikki
Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Thank you Fujita San for the enhancement, will have a look. Regards, Amul On Wed, Aug 28, 2019 at 3:52 PM Etsuro Fujita wrote: > On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita > wrote: > > It seems that I performed the above tests on an assertion-enabled > > build. :( So I executed the tests one more time. Here are the > > results. > > > > * 2-way self-join of pt: explain analyze select * from pt t0, pt t1 > > where t0.a = t1.a; > > - HEAD: > > Planning Time: 0.969 ms > > Execution Time: 13.843 ms > > - with patch: > > Planning Time: 1.720 ms > > Execution Time: 14.393 ms > > - with patch plus attached: > > Planning Time: 1.630 ms > > Execution Time: 14.002 ms > > > > * 4-way self-join of pt: explain analyze select * from pt t0, pt t1, > > pt t2, pt t3 where t0.a = t1.a > > > > and t1.a = t2.a and t2.a = t3.a; > > - HEAD: > > Planning Time: 12.203 ms > > Execution Time: 31.784 ms > > - with patch: > > Planning Time: 32.102 ms > > Execution Time: 32.504 ms > > - with patch plus attached: > > Planning Time: 19.471 ms > > Execution Time: 32.582 ms > > > > * 8-way self-join of pt: explain analyze select * from pt t0, pt t1, > > pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a = > > t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a > > and t6.a = t7.a; > > - HEAD: > > Planning Time: 948.131 ms > > Execution Time: 55.645 ms > > - with patch: > > Planning Time: 2939.813 ms > > Execution Time: 56.760 ms > > - with patch plus attached: > > Planning Time: 1108.076 ms > > Execution Time: 55.750 ms > > > > Note: the attached patch still uses the proposed partition matching > > algorithm for these queries. As I said before, these queries don't > > need that algorithm, so we could eliminate the planning overhead > > compared to HEAD, by planning these queries as before, perhaps, but I > > haven't modified the patch as such yet. > > I modified the patch further as such. 
Attached is an updated version > of the patch created on top of the patch in [1]. I did the tests > again using the updated version of the patch. Here are the results: > > * 2-way self-join of pt: > Planning Time: 1.043 ms > Execution Time: 13.931 ms > > * 4-way self-join of pt: > Planning Time: 12.499 ms > Execution Time: 32.392 ms > > * 8-way self-join of pt: > Planning Time: 968.412 ms > Execution Time: 56.328 ms > > The planning time for each test case still increased slightly, but IMO > I think that would be acceptable. To see the efficiency of the > attached, I did another testing with test cases that really need the > new partition-matching algorithm: > > * explain analyze select * from pt6 t6, pt7 t7 where t6.a = t7.a; > - base patch in [1] > Planning Time: 1.758 ms > Execution Time: 13.977 ms > - with attached > Planning Time: 1.777 ms > Execution Time: 13.959 ms > > * explain analyze select * from pt4 t4, pt5 t5, pt6 t6, pt7 t7 where > t4.a = t5.a and t5.a = t6.a and t6.a = t7.a; > - base patch in [1] > Planning Time: 33.201 ms > Execution Time: 32.480 ms > - with attached > Planning Time: 21.019 ms > Execution Time: 32.777 ms > > * explain analyze select * from pt0 t0, pt1 t1, pt2 t2, pt3 t3, pt4 > t4, pt5 t5, pt6 t6, pt7 t7 where t0.a = t1.a and t1.a = t2.a and t2.a > = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a and t6.a = > t7.a; > - base patch in [1] > Planning Time: 3060.000 ms > Execution Time: 55.553 ms > - with attached > Planning Time: 1144.996 ms > Execution Time: 56.233 ms > > where pt0, pt1, pt2, pt3, pt4, pt5, pt6 and pt7 are list partitioned > tables that have slighly different list values. (The structure and > list values of ptN are almost the same as that of pt used above, but > ptN's N-th partition ptNpN has an extra list value that pt's N-th > partition ptpN doesn't have.) If anyone is interested in this > testing, I'll send a script file for producing these list partitioned > tables. 
> > About the attached: > > * The attached patch modified try_partitionwise_join() so that we call > partition_bounds_equal() then partition_bounds_merge() (if necessary) > to create the partition bounds for the join rel. We don't support for > merging the partition bounds for the hash-partitioning case, so this > makes code to handle the hash-partitioning case in > partition_bounds_merge() completely unnecessary, so I removed that > entirely. > > * I removed this assertion in partition_bounds_merge(), because I > think this is covered by two assertions above this. > > + Assert((*outer_parts == NIL || *inner_parts != NIL) && > + (*inner_parts == NIL || *outer_parts != NIL)); > > * (I forgot to mention this in a previous email, but) I removed this > bit of generate_matching_part_pairs(), because we already have the > same check in try_partitionwise_join(), so this bit would be redundant > IIUC. > > + switch (jointype) > + { > + case JOIN_INNER: > + case JOIN_SEMI: > + > + /* > +
Re: allow online change primary_conninfo
Hello Updated patch attached. (I also merged everything into one file) > + > + WAL receiver will be restarted after primary_slot_name > + was changed. > > The sentence sounds strange. Here is a suggestion: > The WAL receiver is restarted after an update of primary_slot_name (or > primary_conninfo). Changed. > The comment at the top of the call of ProcessStartupSigHup() in > HandleStartupProcInterrupts() needs to be updated as it mentions a > configuration file re-read, but that's not the case anymore. Changed to "Process any requests or signals received recently.", like in other places, e.g. the syslogger. > pendingRestartSource's name does not represent what it does, as it is > only associated with the restart of a WAL receiver when in streaming > state, and that's a no-op for the archive mode and the local mode. I renamed it to pendingWalRcvRestart and replaced the switch with a simple condition. > So when shutting down the WAL receiver after a parameter change, what > happens is that the startup process waits for retrieve_retry_interval > before moving back to the archive mode. Only after scanning again the > archives do we restart a new WAL receiver. As I answered earlier, there is no switch to archive mode or wal_retrieve_retry_interval waiting in my patch. 
I recheck on current revision too: 2019-08-28 12:16:27.295 MSK 11180 @ from [vxid: txid:0] [] DEBUG: sending write 0/30346C8 flush 0/30346C8 apply 0/30346C8 2019-08-28 12:16:27.493 MSK 11172 @ from [vxid: txid:0] [] LOG: received SIGHUP, reloading configuration files 2019-08-28 12:16:27.494 MSK 11172 @ from [vxid: txid:0] [] LOG: parameter "primary_conninfo" changed to "host='/tmp' port= sslmode=disable sslcompression=0 gssencmode=disable target_session_attrs=any" 2019-08-28 12:16:27.496 MSK 11173 @ from [vxid:1/0 txid:0] [] LOG: The WAL receiver is going to be restarted due to change of primary_conninfo 2019-08-28 12:16:27.496 MSK 11176 @ from [vxid: txid:0] [] DEBUG: checkpointer updated shared memory configuration values 2019-08-28 12:16:27.496 MSK 11180 @ from [vxid: txid:0] [] FATAL: terminating walreceiver process due to administrator command 2019-08-28 12:16:27.500 MSK 11335 @ from [vxid: txid:0] [] LOG: started streaming WAL from primary at 0/300 on timeline 1 2019-08-28 12:16:27.500 MSK 11335 @ from [vxid: txid:0] [] DEBUG: sendtime 2019-08-28 12:16:27.50037+03 receipttime 2019-08-28 12:16:27.500821+03 replication apply delay 0 ms transfer latency 0 ms No "DEBUG: switched WAL source from stream to archive after failure" messages, no time difference (wal_retrieve_retry_interval = 5s). regards, Sergeidiff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 89284dc5c0..b36749fe91 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3929,9 +3929,13 @@ ANY num_sync ( @@ -3946,9 +3950,13 @@ ANY num_sync ( ). - This parameter can only be set at server start. + This parameter can only be set in the postgresql.conf + file or on the server command line. This setting has no effect if primary_conninfo is not - set. + set or the server is not in standby mode. + + + The WAL receiver is restarted after an update of primary_slot_name. 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e651a841bb..4eed462f34 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -803,6 +803,12 @@ static XLogSource readSource = 0; /* XLOG_FROM_* code */ static XLogSource currentSource = 0; /* XLOG_FROM_* code */ static bool lastSourceFailed = false; +/* + * Need for restart running WalReceiver due the configuration change. + * Suitable only for XLOG_FROM_STREAM source + */ +static bool pendingWalRcvRestart = false; + typedef struct XLogPageReadPrivate { int emode; @@ -11787,48 +11793,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, if (!StandbyMode) return false; - /* - * If primary_conninfo is set, launch walreceiver to try - * to stream the missing WAL. - * - * If fetching_ckpt is true, RecPtr points to the initial - * checkpoint location. In that case, we use RedoStartLSN - * as the streaming start position instead of RecPtr, so - * that when we later jump backwards to start redo at - * RedoStartLSN, we will have the logs streamed already. - */ - if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0) - { - XLogRecPtr ptr; - TimeLineID tli; - - if (fetching_ckpt) - { - ptr = RedoStartLSN; - tli = ControlFile->checkPointCopy.ThisTimeLineID; - } - else - { - ptr = RecPtr; - - /* - * Use the record begin position to determine the - * TLI, rather than the position we're reading. - */ - tli = tliOfPointInHistory(tliRecPtr, expectedTLEs); - -
Crash in BRIN summarization
I bumped into a little bug in BRIN while hacking on something unrelated. This causes a segfault, or an assertion failure if assertions are enabled:

CREATE TABLE brintest (n numrange);
CREATE INDEX brinidx ON brintest USING brin (n);
INSERT INTO brintest VALUES ('empty');
INSERT INTO brintest VALUES (numrange(0, 2^1000::numeric));
INSERT INTO brintest VALUES ('(-1, 0)');
SELECT brin_desummarize_range('brinidx', 0);
SELECT brin_summarize_range('brinidx', 0);

gdb backtrace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x560fef71f4b2 in GetMemoryChunkContext (pointer=0x7fed65a8fc30) at ../../../../src/include/utils/memutils.h:129
129		AssertArg(MemoryContextIsValid(context));
(gdb) bt
#0  0x560fef71f4b2 in GetMemoryChunkContext (pointer=0x7fed65a8fc30) at ../../../../src/include/utils/memutils.h:129
#1  0x560fef721290 in pfree (pointer=0x7fed65a8fc30) at mcxt.c:1033
#2  0x560fef0bd13f in brin_inclusion_add_value (fcinfo=0x7ffc4cd36220) at brin_inclusion.c:239
#3  0x560fef6ee543 in FunctionCall4Coll (flinfo=0x560ff05d0cf0, collation=0, arg1=94626457109880, arg2=94626457868896, arg3=140657589550088, arg4=0) at fmgr.c:1214
#4  0x560fef0b39b7 in brinbuildCallback (index=0x7fed64f37460, htup=0x560ff0685d68, values=0x7ffc4cd365d0, isnull=0x7ffc4cd365b0, tupleIsAlive=true, brstate=0x560ff06859e0) at brin.c:650
#5  0x560fef12e418 in heapam_index_build_range_scan (heapRelation=0x7fed64f37248, indexRelation=0x7fed64f37460, indexInfo=0x560ff0685b48, allow_sync=false, anyvisible=true, progress=false, start_blockno=0, numblocks=1, callback=0x560fef0b37c9, callback_state=0x560ff06859e0, scan=0x560ff0685d18) at heapam_handler.c:1663
#6  0x560fef0b2858 in table_index_build_range_scan (table_rel=0x7fed64f37248, index_rel=0x7fed64f37460, index_info=0x560ff0685b48, allow_sync=false, anyvisible=true, progress=false, start_blockno=0, numblocks=1, callback=0x560fef0b37c9, callback_state=0x560ff06859e0, scan=0x0) at ../../../../src/include/access/tableam.h:1544
#7  0x560fef0b4dac in summarize_range (indexInfo=0x560ff0685b48, state=0x560ff06859e0, heapRel=0x7fed64f37248, heapBlk=0, heapNumBlks=1) at brin.c:1240
#8  0x560fef0b50d9 in brinsummarize (index=0x7fed64f37460, heapRel=0x7fed64f37248, pageRange=0, include_partial=true, numSummarized=0x7ffc4cd36928, numExisting=0x0) at brin.c:1375
#9  0x560fef0b43bf in brin_summarize_range (fcinfo=0x560ff06840d0) at brin.c:933
#10 0x560fef339acc in ExecInterpExpr (state=0x560ff0683fe8, econtext=0x560ff0683ce8, isnull=0x7ffc4cd36c17) at execExprInterp.c:650

Analysis: This is a memory management issue in the brin_inclusion_add_value() function:

	/* Finally, merge the new value to the existing union. */
	finfo = inclusion_get_procinfo(bdesc, attno, PROCNUM_MERGE);
	Assert(finfo != NULL);
	result = FunctionCall2Coll(finfo, colloid,
							   column->bv_values[INCLUSION_UNION],
							   newval);
	if (!attr->attbyval)
		pfree(DatumGetPointer(column->bv_values[INCLUSION_UNION]));
	column->bv_values[INCLUSION_UNION] = result;

This assumes that the merge function returns a newly-palloc'd value. That's a shaky assumption: if one of the arguments is an empty range, range_merge() returns the other argument, rather than a newly constructed value. And surely we can't assume that for user-defined opclasses. When that happens, we store the 'newval' directly in column->bv_values[INCLUSION_UNION], but it is allocated in a short-lived memory context (or points directly to a buffer in the buffer cache). On the next call, we try to pfree() the old value, but the memory context that contained it was already reset.

It took a while to work out the script to reproduce it. The first value passed to brin_inclusion_add_value() must be an empty range, so that column->bv_values[INCLUSION_UNION] is set to an empty range. On the second call, the new value must be non-empty, so that range_merge() returns 'newval' unchanged.
And it must be a very large value, because if it uses a short varlena header, the PG_GETARG_RANGE_P() call in range_merge() will make a copy of it.

brin_inclusion_union() has a similar issue, but I didn't write a script to reproduce that.

Fix attached.

- Heikki

diff --git a/src/backend/access/brin/brin_inclusion.c b/src/backend/access/brin/brin_inclusion.c
index 86788024ef..cbe14cb3b1 100644
--- a/src/backend/access/brin/brin_inclusion.c
+++ b/src/backend/access/brin/brin_inclusion.c
@@ -235,8 +235,13 @@ brin_inclusion_add_value(PG_FUNCTION_ARGS)
 	Assert(finfo != NULL);
 	result = FunctionCall2Coll(finfo, colloid,
							   column->bv_values[INCLUSION_UNION],
							   newval);
-	if (!attr->attbyval)
+	if (!attr->attbyval && result != column->bv_values[INCLUSION_UNION])
+	{
 		pfree(DatumGetPointer(column->bv_values[INCLUSION_UNION]));
+
+		if (result == newv
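The hazard being fixed is easy to reproduce outside PostgreSQL. Below is a minimal, hypothetical C sketch (the `Val`, `merge`, and `add_value` names stand in for the BRIN machinery and are not real PostgreSQL APIs): a merge that may return one of its arguments instead of a fresh allocation, and a caller that frees the stored value only when the result is genuinely new, and copies the result when it aliases the short-lived input.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a range value; 'empty' plays the role of the
 * empty range from the repro script. */
typedef struct Val
{
	int		empty;
	long	lo;
	long	hi;
} Val;

/* Like range_merge(): if one argument is empty, return the OTHER ARGUMENT
 * ITSELF rather than a newly allocated value. */
static Val *
merge(Val *a, Val *b)
{
	Val	   *r;

	if (a->empty)
		return b;
	if (b->empty)
		return a;
	r = malloc(sizeof(Val));
	r->empty = 0;
	r->lo = (a->lo < b->lo) ? a->lo : b->lo;
	r->hi = (a->hi > b->hi) ? a->hi : b->hi;
	return r;
}

/* Defensive caller, mirroring the patch: never free the stored value when
 * the merge just handed it back, and never retain a pointer to the
 * caller's short-lived new value. */
static Val *
add_value(Val *stored, Val *newval)
{
	Val	   *result = merge(stored, newval);

	if (result != stored)
		free(stored);			/* safe: result does not alias stored */

	if (result == newval)
	{
		/* copy, since newval's memory may go away after this call */
		Val	   *copy = malloc(sizeof(Val));

		memcpy(copy, newval, sizeof(Val));
		result = copy;
	}
	return result;
}
```

With the unpatched logic — an unconditional free of the stored value followed by keeping `result` as-is — the empty-then-non-empty sequence from the repro script would retain a dangling pointer to the short-lived `newval`, which is exactly the pfree() crash in the backtrace.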
Re: old_snapshot_threshold vs indexes
On Wed, Aug 28, 2019 at 3:05 AM Tom Lane wrote:
> I'd vote for back-patching to 10. Even if there is in fact no deadlock
> hazard, you've clearly demonstrated a significant performance hit that
> we're taking for basically no reason.

Done.

The second symptom reported in my first email looked like evidence of high levels of spinlock backoff, which I guess might have been coming from TransactionIdLimitedForOldSnapshots()'s hammering of oldSnapshotControl->mutex_threshold, when running heap_page_prune_opt()-heavy workloads like the one generated by test-indexscan.sql (from my earlier message) from many backends at the same time on a large system. That's just an observation I'm leaving here; I'm not planning to chase it any further for now.

--
Thomas Munro
https://enterprisedb.com
Re: REINDEX filtering in the backend
On Wed, Aug 28, 2019 at 7:03 AM Michael Paquier wrote:
>
> On Thu, Jul 11, 2019 at 11:14:20PM +0200, Julien Rouhaud wrote:
> > I didn't want to spend too much time enjoying bison and adding new
> > unreserved keywords, so for now I just implemented this syntax to
> > start a discussion for this feature in the next commitfest:
> >
> > REINDEX ( FILTER = COLLATION ) [...]
> >
> > The FILTER clause can be used multiple times, each one is OR-ed with
> > the ReindexStmt's option, so we could easily add a LIBC, ICU and other
> > filters, also making COLLATION (or more realistically a better new
> > keyword) an alias for (LIBC | ICU) for instance.
>
> I would prefer keeping the interface simple with only COLLATION, so as
> only collation-sensitive indexes should be updated, including icu and
> libc ones. Or would it be useful to have the filtering for both, as
> libicu can break things similarly to glibc in an update, still a
> breakage on one or the other would not happen at the same time? I
> don't know enough of libicu regarding that, eager to learn. In which
> case, wouldn't it be better to support that from the start?

I'm not sure either. Another thing would be to add extra syntax to be able to discard even more indexes. For instance we could store the version of the underlying lib used when the index is (re)created, and do something like REINDEX (FILTER = LIBC!=2.28) or REINDEX (FILTER = LIBC==2.27) or something similar.

> > The filtering is done at table level (with and without the
> > concurrently option), so SCHEMA, DATABASE and SYSTEM automatically
> > benefit from it. If this clause is used with a REINDEX INDEX, the
> > statement errors out, as I don't see a valid use case for providing a
> > single index name and asking to possibly filter it at the same time.
>
> Supporting that case would not be that much amount of work, no?

Probably not, but I'm dubious about the use case.
Adding the lib version in the catalog would be more useful for people who want to write their own rules to reindex a specific set of indexes.

> > I also added some minimal documentation and regression tests. I'll
> > add this patch to the next commitfest.
> >
> > [1] https://www.postgresql.org/message-id/7140716c-679e-a0b9-a273-b201329d8891%402ndquadrant.com
>
> +if ((stmt->options & REINDEXOPT_ALL_FILTERS) != 0)
> +elog(ERROR, "FILTER clause is not compatible with REINDEX INDEX");
> [...]
> + discard indexes whose ordering does not depend on a collation. Note that
> + the FILTER option is not compatible with REINDEX
> + SCHEMA.
>
> Why do you have both limitations?

That's actually a typo: the documentation should have specified that it's not compatible with REINDEX INDEX, not REINDEX SCHEMA. I'll fix.

> I think that it would be nice to be
> able to do both, generating an error for REINDEX INDEX if the index
> specified is not compatible, and a LOG if the index is not filtered
> out when a list is processed.

Do you mean having an error if the index does not contain any collation-based type? Also, REINDEX only accepts a single name, so there shouldn't be any list processing for REINDEX INDEX? I'm not really in favour of adding extra code to the filtering when the user asks for a specific index name to be reindexed.

> Please note that elog() cannot be used
> for user-facing failures, only for internal ones.

Indeed, I'll change it to an ereport with ERRCODE_FEATURE_NOT_SUPPORTED.

> > +REINDEX (VERBOSE, FILTER = COLLATION) TABLE reindex_verbose;
> > +-- One column, not depending on a collation
>
> In order to make sure that a reindex has been done for a given entry
> with the filtering, an idea is to save the relfilenode before the
> REINDEX and check it afterwards. That would be nice to make sure that
> only the wanted indexes are processed, but it is not possible to be
> sure of that based on your tests, and some tests should be done on
> relations which have collation-sensitive as well as
> collation-insensitive indexes.

That's what I did when I first submitted the feature in reindexdb. I didn't use them because it means switching to TAP tests. I can drop the simple regression test (especially since I now realize that one is quite broken) and use the TAP one instead, or should I keep both?

> + index = index_open(indexOid, AccessShareLock);
> + numAtts = index->rd_index->indnatts;
> + index_close(index, AccessShareLock);
>
> Wouldn't it be better to close that after doing the scan?

Yes indeed.

> Could you add tests with REINDEX CONCURRENTLY?

Sure!

> Bonus: support for reindexdb should be added. Let's not forget about it.

Yep. That was a first prototype to see if this approach is ok. I'll add more tests, run pgindent, and add reindexdb support if this approach is sensible.
Re: pgbench - implement strict TPC-B benchmark
> On Wed, Aug 28, 2019 at 7:37 AM Fabien COELHO wrote:
>
> > While doing benchmarking using different tools, including pgbench, I
> > found it useful as a temporary hack to add copy freeze and
> > maintenance_work_mem options (the last one not as an env variable,
> > just as a set before, although not sure if it's a best idea). Is it
> > similar to what you were talking about?
>
> About this patch:
>
> Concerning the --maintenance... option, ISTM that there could rather be a
> generic way to provide "set" settings, not a specific option for a
> specific parameter with a specific unit. Moreover, ISTM that it only needs
> to be set once on a connection, not per command. I'd suggest something
> like:
>
>    --connection-initialization '...'
>
> That would be issued when a connection is started, for any query, then the
> effect would be achieved with:
>
>    pgbench --conn…-init… "SET maintenance_work_mem TO '12MB'" ...
>
> The --help does not say that the option expects a parameter.
>
> Also, in your patch it is an initialization option, but the code does not
> check for that.
>
> Concerning the freeze option:
>
> It is also an initialization-specific option that should be checked for
> that.
>
> The option does not make sense if
>
> The alternative queries could be managed simply without intermediate
> variables.
>
> Pgbench documentation is not updated.
>
> There are no tests.
>
> This patch should be submitted in its own thread to help manage it in the
> CF app.

Thanks, that was a pretty deep answer for what was supposed to be just an alignment question :) But sure, I can prepare a proper version and post it separately.
Re: Performance improvement of WAL writing?
Hi, Moon-san.

At Wed, 28 Aug 2019 15:43:02 +0900, "Moon, Insung" wrote in
> Dear Hackers.
>
> Currently, the XLogWrite function is written in 8k (or 16kb) units
> regardless of the size of the new record.
> For example, even if a new record is only 300 bytes, pg_pwrite is
> called to write data in 8k units (if it cannot be written on one page,
> 16kb is written).
> Let's look at the worst case.
> 1) LogwrtResult.Flush is 8100 pos.
> 2) And the new record size is only 100 bytes.
> In this case, pg_pwrite is called which writes 16kb to update only 100 bytes.
> It is a rare case, but I think there is overhead for pg_pwrite for some systems.
> # For systems that often update one record in one transaction.

If a commit is lonely in the system, the 100 bytes ought to be flushed out immediately. If there are many concurrent commits, XLogFlush() waits for all in-flight insertions up to the LSN the backend is writing, then actually flushes. Of course sometimes it flushes just 100 bytes, but sometimes it flushes far more bytes, involving records from other backends. If another backend has flushed further than the backend's target LSN, the backend skips the flush.

If you want to involve more commits in a flush, commit_delay lets the backend wait for that duration so that more commits can come in within the window. Or synchronous_commit = off works somewhat more aggressively.

> So what about modifying the XLogWrite function only to write the size
> that should record?

If I understand your proposal correctly, you're proposing to separate fsync from XLogWrite. Actually, as you proposed, roughly speaking no flush happens except at segment switch or commit. If you are proposing not flushing immediately even at commit, commit_delay (or synchronous_commit) works that way.

> Can this idea benefit from WAL writing performance?
> If it's OK to improve, I want to do modification.
> How do you think of it?

So the proposal seems to be already achieved.
If not, could you elaborate on the proposal, or explain the actual problem?

> Best Regards.
> Moon.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
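The 16 kB worst case described at the top of the thread is pure page-alignment arithmetic. A rough sketch, assuming the default 8 kB WAL block size (`bytes_written` is a hypothetical helper for illustration, not the actual xlog.c code):

```c
#include <assert.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192		/* default WAL block size */

/*
 * Bytes that hit the disk when a record of 'reclen' bytes is appended at
 * WAL position 'pos' and writes always cover whole pages: from the start
 * of the page containing 'pos' through the end of the page containing the
 * record's last byte.
 */
static uint64_t
bytes_written(uint64_t pos, uint64_t reclen)
{
	uint64_t	start = pos - pos % XLOG_BLCKSZ;	/* page holding pos */
	uint64_t	end = pos + reclen;
	uint64_t	end_page = (end + XLOG_BLCKSZ - 1) / XLOG_BLCKSZ * XLOG_BLCKSZ;

	return end_page - start;
}
```

A 100-byte record at position 8100 straddles a page boundary, so two full pages (16 kB) go to disk — exactly the worst case from the original mail — while the same record at an earlier offset in a page would cost only one page.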
Re: refactoring - share str2*int64 functions
Michaël,

>> I have taken the liberty to optimize the existing int64 function by
>> removing spurious tests.
>
> Which are?

 - *ptr && WHATEVER(*ptr)

*ptr is redundant, WHATEVER yields false on '\0', and it costs on each char but at the end. It might be debatable in some places, e.g. it is likely that there are no spaces in the string, but likely that there is more than one digit. If you want all/some *ptr added back, no problem.

 - isdigit repeated on if and following while, used if/do-while instead.

>> I have not created uint64 specific inlined overflow functions.
>
> Why?  There is a comment below ;p

See comment about comment below :-)

>> If it looks ok, a separate patch could address the 32 & 16 versions.
>
> I am surprised to see a negative diff

Is it? Long duplicate functions are factored out (this was my initial intent), one file is removed…

> actually just by doing that (adding the 32 and 16 parts will add much
> more code of course). At quick glance, I think that this is on the
> right track. Some comments I have on the top of my mind:
>
> - It would be good to have the unsigned equivalents of
>   pg_mul_s64_overflow, etc. These are simple enough,

Hmmm. Have you looked at the fallback cases when the corresponding builtins are not available? I'm unsure of a reliable way to detect a generic unsigned int overflow without expensive dividing back and having to care about zero… So I was pretty happy with my two discrete, small and efficient tests.

> and per the feedback from Andres they could live in common/int.h.

Could be, however "string.c" already contains a string to int conversion function, so I put them together. Probably this function should be removed in the end, though.

> - It is more consistent to use upper-case statuses in the enum
>   strtoint_status.

I thought of that, but the first enum I found used lower case, so it did not seem obvious that pg style was to use upper case. Indeed, some enum constants use upper case.

> Could it be renamed to pg_strtoint_status?

Sure.
I also prefixed the enum constants for consistency. Attached patch uses a prefix and upper-case constants. Waiting for further input about other comments.

--
Fabien.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 221b47298c..28891ba337 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -79,6 +79,7 @@
 #include "utils/builtins.h"
 #include "utils/hashutils.h"
 #include "utils/memutils.h"
+#include "common/string.h"
 
 PG_MODULE_MAGIC;
 
@@ -1022,7 +1023,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 
 	/* parse command tag to retrieve the number of affected rows. */
 	if (completionTag && strncmp(completionTag, "COPY ", 5) == 0)
-		rows = pg_strtouint64(completionTag + 5, NULL, 10);
+		(void) pg_strtouint64(completionTag + 5, &rows);
 	else
 		rows = 0;
 
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 2c0ae395ba..8e75d52b06 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -21,6 +21,7 @@
 #include "catalog/heap.h"
 #include "catalog/pg_type.h"
 #include "commands/trigger.h"
+#include "common/string.h"
 #include "executor/executor.h"
 #include "executor/spi_priv.h"
 #include "miscadmin.h"
@@ -2338,8 +2339,7 @@ _SPI_execute_plan(SPIPlanTr plan, ParamListInfo paramLI,
 				CreateTableAsStmt *ctastmt = (CreateTableAsStmt *) stmt->utilityStmt;
 
 				if (strncmp(completionTag, "SELECT ", 7) == 0)
-					_SPI_current->processed =
-						pg_strtouint64(completionTag + 7, NULL, 10);
+					(void) pg_strtouint64(completionTag + 7, &_SPI_current->processed);
 				else
 				{
 					/*
@@ -2361,8 +2361,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
 			else if (IsA(stmt->utilityStmt, CopyStmt))
 			{
 				Assert(strncmp(completionTag, "COPY ", 5) == 0);
-				_SPI_current->processed = pg_strtouint64(completionTag + 5,
-														 NULL, 10);
+				(void) pg_strtouint64(completionTag + 5, &_SPI_current->processed);
 			}
 		}
 
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..17cde42b4d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -34,6 +34,7 @@
 
 #include "fmgr.h"
 #include "miscadmin.h"
+#include "common/string.h"
 #include "nodes/extensible.h"
 #include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
@@ -80,7 +81,7 @@
 #define READ_UINT64_FIELD(fldname) \
 	token = pg_strtok(&length);	/* skip :fldname */ \
 	token = pg_strtok(&length);	/* get field value */ \
-	local_node->fldname = pg_strtouint64(token, NULL, 10)
+	(void) pg_strtouint64(token, &local_node->fldname)
 
 /* Read a long integer field (anything written as ":fldname %ld") */
 #define READ_LONG_FIELD(fldname) \
diff --git a/src/backend/parser/parse_node.c b/src/backend/parser/parse_node.c
index 1baf7ef31f..8af5e083ee 100644
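For readers following the overflow discussion above: the two small, division-free tests work because the base is known to be 10, so both cutoff values fold to compile-time constants, unlike the generic fallback that must divide the result back. A sketch (`accumulate_digit` is a hypothetical name, not the patch's actual API):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Accumulate one decimal digit into *acc, reporting overflow without any
 * runtime division: with the base fixed at 10, both cutoff values are
 * compile-time constants.  Returns 1 on success, 0 on overflow.
 */
static int
accumulate_digit(uint64_t *acc, int digit)
{
	/* UINT64_MAX == 18446744073709551615, so: */
	const uint64_t cutoff = UINT64_MAX / 10;	/* folded by the compiler */
	const int	cutlim = UINT64_MAX % 10;		/* 5, likewise */

	if (*acc > cutoff || (*acc == cutoff && digit > cutlim))
		return 0;				/* *acc * 10 + digit would overflow */
	*acc = *acc * 10 + digit;
	return 1;
}
```

A generic unsigned multiply-overflow check without builtin support has to divide the product back (and special-case zero); with the base fixed, the two comparisons above suffice.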
Re: refactoring - share str2*int64 functions
On Wed, Aug 28, 2019 at 08:51:29AM +0200, Fabien COELHO wrote:
> Here is an updated patch for the u?int64 conversion functions.

Thanks!

> I have taken the liberty to optimize the existing int64 function by
> removing spurious tests.

Which are?

> I have not created uint64 specific inlined overflow functions.

Why?  There is a comment below ;p

> If it looks ok, a separate patch could address the 32 & 16 versions.

I am surprised to see a negative diff actually just by doing that (adding the 32 and 16 parts will add much more code of course). At quick glance, I think that this is on the right track. Some comments I have on the top of my mind:
- It would be good to have the unsigned equivalents of pg_mul_s64_overflow, etc. These are simple enough, and per the feedback from Andres they could live in common/int.h.
- It is more consistent to use upper-case statuses in the enum strtoint_status. Could it be renamed to pg_strtoint_status?
--
Michael

signature.asc
Description: PGP signature