Re: BUG #15977: Inconsistent behavior in chained transactions
Hello,

COMMIT AND CHAIN in an implicit block leaves blockState as TBLOCK_STARTED, which does not trigger the chaining, but ROLLBACK AND CHAIN sets blockState to TBLOCK_ABORT_PENDING, so the chaining is triggered. I think disabling s->chain beforehand should produce the desired behavior.

The patch applies with "patch", although "git apply" complained because of CRLF line terminations forced by the octet-stream MIME type. The patch compiles cleanly, and "make check" is ok. The patch works for me, and the solution seems appropriate. It should be committed for pg 12.0.

A test could be added in "regress/sql/transactions.sql"; I'd suggest something like:

  -- implicit transaction and not chained.
  COMMIT AND CHAIN;
  COMMIT;
  ROLLBACK AND CHAIN;
  ROLLBACK;

which should show the appropriate "no transaction in progress" warnings.

The doc could also be made a little clearer about what to expect when there is no explicit transaction in progress.

-- 
Fabien.
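For reference, a sketch of what the suggested test's expected output would look like outside any explicit transaction block (the warning text here is my reading of the current message wording, so treat it as an assumption):

```sql
-- implicit transaction and not chained.
COMMIT AND CHAIN;    -- WARNING:  there is no transaction in progress
COMMIT;              -- WARNING:  there is no transaction in progress
ROLLBACK AND CHAIN;  -- WARNING:  there is no transaction in progress
ROLLBACK;            -- WARNING:  there is no transaction in progress
```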
Re: refactoring - share str2*int64 functions
Bonjour Michaël,

>> - *ptr && WHATEVER(*ptr)
>> *ptr is redundant, WHATEVER yields false on '\0', and it costs on each
>> char but at the end. It might be debatable in some places, e.g. it is
>> likely that there are no spaces in the string, but likely that there are
>> more than one digit.
>
> Still, this makes the checks less robust?

I do not see any downside, but maybe I lack imagination.

On my laptop, isWHATEVER is implemented through an array mapping characters to a bitfield saying whether each char is WHATEVER, depending on the bit. This array is well defined for index 0 ('\0'). If an implementation is based on comparisons, for isdigit it would be:

  c >= '0' && c <= '9'

Then checking c != '\0' is redundant with c >= '0'.

Given the way the character checks are used in the function, we do not go beyond the end of the string, because we only proceed further when a character is actually recognized, else we return. So I cannot see any robustness issue, just a few cycles saved.

>> Hmmm. Have you looked at the fallback cases when the corresponding
>> builtins are not available?
>>
>> I'm unsure of a reliable way to detect a generic unsigned int overflow
>> without expensive dividing back and having to care about zero…
>
> Mr Freund has mentioned that here:
> https://www.postgresql.org/message-id/20190717184820.iqz7schxdbucm...@alap3.anarazel.de

Yep, that is what I mean by expensive: there is an integer division, which is avoided if b is known to be 10, hence is not zero, and the limit value can be checked directly on the input without having to perform a division each time. So I was pretty happy with my two discreet, small and efficient tests.

> That's also a matter of code and interface consistency IMHO.

Possibly. ISTM that part of the motivation is to reduce costs for heavily used conversions, e.g. on COPY.
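To illustrate the point about avoiding a per-character division: when the base is known to be 10, the overflow limits are compile-time constants, so no division happens at run time. The following is a hypothetical illustration, not the patch's actual code (it also shows the if/do-while shape discussed in the thread):

```c
#include <stdint.h>
#include <stdbool.h>
#include <ctype.h>

/*
 * Parse a base-10 unsigned 64-bit integer.  Returns false on empty
 * input, trailing garbage, or overflow.  Because the base is fixed at
 * 10, UINT64_MAX / 10 and UINT64_MAX % 10 are compile-time constants:
 * the overflow check costs two comparisons, no division.
 */
bool
str_to_uint64_base10(const char *s, uint64_t *out)
{
    uint64_t    acc = 0;

    if (!isdigit((unsigned char) *s))
        return false;           /* empty string or non-digit start */

    do
    {
        unsigned int digit = (unsigned int) (*s - '0');

        /* would acc * 10 + digit exceed UINT64_MAX? */
        if (acc > UINT64_MAX / 10 ||
            (acc == UINT64_MAX / 10 && digit > UINT64_MAX % 10))
            return false;       /* overflow */

        acc = acc * 10 + digit;
        s++;
    } while (isdigit((unsigned char) *s));

    if (*s != '\0')
        return false;           /* trailing garbage */

    *out = acc;
    return true;
}
```

Note that the loop only advances while a digit is recognized, which is the reason the '\0' check is redundant, as argued above.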
The "scanf" function is overly expensive because it has to interpret its format, and we have to check for overflows anyway…

Anyway, if we assume that the builtins exist and rely on an efficient hardware check, maybe we do not care about the fallback cases, which would just be slow but never executed.

Note that all this is moot, as all instances of string to uint64 conversion do not check for errors.

Attached v7 does create uint64 overflow inline functions. The stuff is not yet moved to "common/int.c", a file which does not exist, even if "common/int.h" does.

-- 
Fabien.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 221b47298c..28891ba337 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -79,6 +79,7 @@
 #include "utils/builtins.h"
 #include "utils/hashutils.h"
 #include "utils/memutils.h"
+#include "common/string.h"

 PG_MODULE_MAGIC;

@@ -1022,7 +1023,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 			/* parse command tag to retrieve the number of affected rows. */
 			if (completionTag && strncmp(completionTag, "COPY ", 5) == 0)
-				rows = pg_strtouint64(completionTag + 5, NULL, 10);
+				(void) pg_strtouint64(completionTag + 5, &rows);
 			else
 				rows = 0;
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 2c0ae395ba..8e75d52b06 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -21,6 +21,7 @@
 #include "catalog/heap.h"
 #include "catalog/pg_type.h"
 #include "commands/trigger.h"
+#include "common/string.h"
 #include "executor/executor.h"
 #include "executor/spi_priv.h"
 #include "miscadmin.h"
@@ -2338,8 +2339,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
 				CreateTableAsStmt *ctastmt = (CreateTableAsStmt *) stmt->utilityStmt;

 				if (strncmp(completionTag, "SELECT ", 7) == 0)
-					_SPI_current->processed =
-						pg_strtouint64(completionTag + 7, NULL, 10);
+					(void) pg_strtouint64(completionTag + 7, &_SPI_current->processed);
 				else
 				{
 					/*
@@ -2361,8 +2361,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
 			else if (IsA(stmt->utilityStmt, CopyStmt))
 			{
 				Assert(strncmp(completionTag, "COPY ", 5) == 0);
-				_SPI_current->processed = pg_strtouint64(completionTag + 5,
-														 NULL, 10);
+				(void) pg_strtouint64(completionTag + 5, &_SPI_current->processed);
 			}
 		}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..17cde42b4d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -34,6 +34,7 @@
 #include "fmgr.h"
 #include "miscadmin.h"
+#include "common/string.h"
 #include "nodes/extensible.h"
 #include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
@@ -80,7 +81,7 @@
 #define READ_UINT64_FIELD(fldname) \
 	token = pg_strtok(&length);		/* skip :fldname */ \
 	token = pg_strtok(&length);		/* get field value */ \
-	local_node->fldname = pg_strtouint64(token, NULL, 10)
+	(void) pg_strtouint64(token, &local_node->fldname)

/* Rea
Re: doc: update PL/pgSQL sample loop function
Hi

On Thu, 29 Aug 2019 at 5:03, Ian Barwick wrote:
> Hi
>
> Here:
>
>   https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-RECORDS-ITERATING
>
> we have a sample PL/pgSQL function (dating from at least 7.2) demonstrating
> query result loops, which refreshes some pseudo materialized views stored in
> a user-defined table.
>
> As we've had proper materialized views since 9.3, I thought it might
> be nice to update this with a self-contained sample which can be used
> as-is; see attached patch.
>
> (As a side note, the current sample function contains a couple of "%s"
> placeholders which should be just "%"; a quick search of plpgsql.sgml
> shows this is the only place they occur.)
>
> Will submit to the next commitfest.

+1

Pavel

> Regards
>
> Ian Barwick
>
> --
> Ian Barwick                   https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
doc: update PL/pgSQL sample loop function
Hi

Here:

  https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-RECORDS-ITERATING

we have a sample PL/pgSQL function (dating from at least 7.2) demonstrating query result loops, which refreshes some pseudo materialized views stored in a user-defined table.

As we've had proper materialized views since 9.3, I thought it might be nice to update this with a self-contained sample which can be used as-is; see attached patch.

(As a side note, the current sample function contains a couple of "%s" placeholders which should be just "%"; a quick search of plpgsql.sgml shows this is the only place they occur.)

Will submit to the next commitfest.

Regards

Ian Barwick

-- 
Ian Barwick                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

commit d9e99b90fd0e572b4fd2461d7188a0197dee16df
Author: Ian Barwick
Date:   Thu Aug 29 11:49:23 2019 +0900

    doc: update PL/pgSQL sample function

    The sample PL/pgSQL function here:

      https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-RECORDS-ITERATING

    dates from at least PostgreSQL 7.2 and updates pseudo-materialized
    views defined in a user table. Replace it with a more up-to-date
    example which does the same thing with actual materialized views,
    which have been available since PostgreSQL 9.3.

diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index ae73630a48..3194173594 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -2437,19 +2437,29 @@ END LOOP label ;
     resulting from the query and the loop body is executed for each row.
     Here is an example:
-CREATE FUNCTION cs_refresh_mviews() RETURNS integer AS $$
+CREATE FUNCTION refresh_mviews() RETURNS integer AS $$
 DECLARE
     mviews RECORD;
 BEGIN
-    RAISE NOTICE 'Refreshing materialized views...';
-
-    FOR mviews IN SELECT * FROM cs_materialized_views ORDER BY sort_key LOOP
+    RAISE NOTICE 'Refreshing all materialized views...';
+
+    FOR mviews IN
+       SELECT n.nspname AS mv_schema,
+              c.relname AS mv_name,
+              pg_catalog.pg_get_userbyid(c.relowner) AS owner
+         FROM pg_catalog.pg_class c
+    LEFT JOIN pg_catalog.pg_namespace n ON (n.oid = c.relnamespace)
+        WHERE c.relkind = 'm'
+     ORDER BY 1
+    LOOP

-        -- Now "mviews" has one record from cs_materialized_views
+        -- Now "mviews" has one record with information about the materialized view

-        RAISE NOTICE 'Refreshing materialized view %s ...', quote_ident(mviews.mv_name);
-        EXECUTE format('TRUNCATE TABLE %I', mviews.mv_name);
-        EXECUTE format('INSERT INTO %I %s', mviews.mv_name, mviews.mv_query);
+        RAISE NOTICE 'Refreshing materialized view %.% (owner: %)...',
+                     quote_ident(mviews.mv_schema),
+                     quote_ident(mviews.mv_name),
+                     quote_ident(mviews.owner);
+        EXECUTE format('REFRESH MATERIALIZED VIEW %I.%I', mviews.mv_schema, mviews.mv_name);
     END LOOP;

     RAISE NOTICE 'Done refreshing materialized views.';
Re: Improve error detections in TAP tests by spreading safe_psql
Michael Paquier writes:
> On Wed, Aug 28, 2019 at 12:19:08PM -0400, Tom Lane wrote:
>> The attached patch just follows a mechanical rule of "set on_error_die
>> to 0 in exactly those calls where the surrounding code verifies the
>> exit code is what it expects".
>
> I am fine with that approach.  There is an argument for dropping
> safe_psql then?

Well, it's useful if you just want the stdout back.  But its name is a bit misleading if the default behavior of psql is just as safe.  Not sure whether renaming it is worthwhile.

>> I have to wonder if it isn't better to just drop the is() test
>> and let PostgresNode::psql issue the failure.
>
> I got the same thought as you on this point about why the is() is
> actually necessary if we know that it would fail.  An advantage of the
> current code is that we are able to catch all errors in a given run at
> once.

Yeah, but only if the test cases are independent, which I think is mostly not true in our TAP scripts.  Otherwise you're just looking at cascading errors.

> An argument against back-patching the stuff changing the
> default flag are tests which rely on the current behavior.  This could
> be annoying for some people doing advanced testing.

Yup, the tradeoff is people possibly having test scripts outside our tree that would break, versus the possibility that we'll back-patch test changes that don't behave as expected in back branches.  I don't know if the former is true, but I'm afraid the latter is a certainty if we don't back-patch.

			regards, tom lane
Re: Re: Email to hackers for test coverage
On 2019-08-29 00:43, peter.eisentr...@2ndquadrant.com wrote:
> Make spaces and capitalization match surrounding code.
>
> That's fine, but I suggest that if you really want to make an impact in
> test coverage, look to increase function coverage rather than line
> coverage.  Or look for files that are not covered at all.

Thanks for pointing all these things out; I will reconsider my approach to the code coverage work.

-- 
Movead
Re: gharial segfaulting on REL_12_STABLE only
On Tue, Aug 27, 2019 at 2:09 PM Thomas Munro wrote:
> On Tue, Aug 27, 2019 at 1:48 PM Tom Lane wrote:
> > A stack trace would likely be really useful right about now.
>
> Yeah.  Looking into how to get that.

Erm.  I heard the system was in a very unhappy state and couldn't be logged into.  After it was rebooted, the problem appears to have gone away.  That is quite unsatisfying.

"anole" runs on the same host, and occasionally fails to launch any parallel workers, and it seems to be pretty unhappy too -- very long runtimes (minutes where my smaller machines take seconds).  So the machine may be massively overloaded and swapping, or something like that, something to be looked into; but that doesn't explain how we get to a segfault without an underlying hard-to-reach bug in our code...

-- 
Thomas Munro
https://enterprisedb.com
Re: Improve error detections in TAP tests by spreading safe_psql
On Wed, Aug 28, 2019 at 12:19:08PM -0400, Tom Lane wrote:
> The attached patch just follows a mechanical rule of "set on_error_die
> to 0 in exactly those calls where the surrounding code verifies the
> exit code is what it expects".  That leads to a lot of code that
> looks like, say,
>
>  # Insert should work on standby
>  is( $node_standby->psql(
>  	'postgres',
> -	qq{insert into testtab select generate_series(1,1000), 'foo';}),
> +	qq{insert into testtab select generate_series(1,1000), 'foo';},
> +	on_error_die => 0),
>  	0,
>  	'INSERT succeeds with truncated relation FSM');

I am fine with that approach.  There is an argument for dropping safe_psql then?

> I have to wonder if it isn't better to just drop the is() test
> and let PostgresNode::psql issue the failure.  The only thing
> the is() is really buying us is the ability to label the test
> with an ID string, which is helpful, but um ... couldn't that
> just be a comment?

I got the same thought as you on this point about why the is() is actually necessary if we know that it would fail.  An advantage of the current code is that we are able to catch all errors in a given run at once.  If the test dies immediately, you cannot catch that, and this needs repetitive retries if there is no dependency between each step of the test.

An argument against back-patching the stuff changing the default flag are tests which rely on the current behavior.  This could be annoying for some people doing advanced testing.

-- 
Michael
Re: refactoring - share str2*int64 functions
On Wed, Aug 28, 2019 at 09:50:44AM +0200, Fabien COELHO wrote:
> - *ptr && WHATEVER(*ptr)
> *ptr is redundant, WHATEVER yields false on '\0', and it costs on each
> char but at the end. It might be debatable in some places, e.g. it is
> likely that there are no spaces in the string, but likely that there are
> more than one digit.

Still, this makes the checks less robust?

> If you want all/some *ptr added back, no problem.
>
> - isdigit repeated on if and following while, used if/do-while instead.

I see, you don't check twice if the first character is a digit this way.

> Hmmm. Have you looked at the fallback cases when the corresponding builtins
> are not available?
>
> I'm unsure of a reliable way to detect a generic unsigned int overflow
> without expensive dividing back and having to care about zero…

Mr Freund has mentioned that here:
https://www.postgresql.org/message-id/20190717184820.iqz7schxdbucm...@alap3.anarazel.de

> So I was pretty happy with my two discreet, small and efficient tests.

That's also a matter of code and interface consistency IMHO.

-- 
Michael
Re: REINDEX filtering in the backend
On Wed, Aug 28, 2019 at 10:22:07AM +0200, Julien Rouhaud wrote:
>>> The filtering is done at table level (with and without the
>>> concurrently option), so SCHEMA, DATABASE and SYSTEM automatically
>>> benefit from it.  If this clause is used with a REINDEX INDEX, the
>>> statement errors out, as I don't see a valid use case for providing a
>>> single index name and asking to possibly filter it at the same time.
>>
>> Supporting that case would not be that much amount of work, no?
>
> Probably not, but I'm dubious about the use case.  Adding the lib
> version in the catalog would be more useful for people who want to
> write their own rules to reindex specific sets of indexes.

Hearing from others here would be helpful.  My take is that having a simple option doing the filtering, without the need to store anything in the catalogs, would be helpful enough for users mixing both index types on a single table.  Others may not agree.

>> I think that it would be nice to be able to do both, generating an
>> error for REINDEX INDEX if the index specified is not compatible, and
>> a LOG if the index is not filtered out when a list is processed.
>
> Do you mean having an error if the index does not contain any
> collation-based type?  Also, REINDEX only accepts a single name, so
> there shouldn't be any list processing for REINDEX INDEX?  I'm not
> really in favour of adding extra code for the filtering when the user
> asks for a specific index name to be reindexed.

I was referring to adding an error if trying to reindex an index with the filtering enabled.  That's useful to inform the user that what he intends to do is not compatible with the options provided.

> That's what I did when I first submitted the feature in reindexdb.  I
> didn't use them because it means switching to TAP tests.  I can drop
> the simple regression test (especially since I now realize that one is
> quite broken) and use the TAP one, or should I keep both?

There is no need for TAP I think.
You could for example store the relid and its relfilenode in a temporary table before running the reindex, run the REINDEX, and then compare with what pg_class stores.  And that's much cheaper than setting up a new instance for a TAP test.

-- 
Michael
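That suggestion could look roughly like the following in a plain regression test (the table and index names here are hypothetical, not from the patch):

```sql
-- Remember the index's relfilenode before the rebuild
-- (index name is a made-up placeholder).
CREATE TEMP TABLE reindex_before AS
    SELECT oid AS relid, relfilenode
    FROM pg_class
    WHERE relname = 'some_filtered_index';

REINDEX INDEX some_filtered_index;

-- A rebuilt index gets a new relfilenode, a skipped one keeps it.
SELECT b.relfilenode <> c.relfilenode AS was_rebuilt
FROM reindex_before b
JOIN pg_class c ON c.oid = b.relid;
```

The final SELECT then shows directly in the expected output whether the filtering skipped or rebuilt the index.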
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Aug 28, 2019 at 08:17:47PM +0300, Alexey Kondratov wrote:

Hi Tomas,

> Interesting.  Any idea where does the extra overhead in this
> particular case come from?  It's hard to deduce that from the single
> flame graph, when I don't have anything to compare it with (i.e. the
> flame graph for the "normal" case).

I guess that the bottleneck is in disk operations.  You can check the logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and writes (~26%) take around 35% of CPU time in summary.  To compare, please see the attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
        FROM generate_series(1, 2000))
FROM generate_series(1, 100);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total.  So we get very roughly the same ~x4-5 performance drop here.  JFYI, I am using a machine with SSD for tests.

>> Therefore, probably you may write changes on receiver in bigger
>> chunks, not each change separately.
>
> Possibly.  I/O is certainly a possible culprit, although we should be
> using buffered I/O and there certainly are not any fsyncs here.  So
> I'm not sure why would it be cheaper to do the writes in batches.
>
> BTW does this mean you see the overhead on the apply side?  Or are you
> running this on a single machine, and it's difficult to decide?

I run this on a single machine, but walsender and worker are utilizing almost 100% of CPU per each process all the time, and at the apply side I/O syscalls take about 1/3 of CPU time.  Though I am still not sure, for me this result somehow links the performance drop with problems at the receiver side.

Writing in batches was just a hypothesis, and to validate it I have performed a test with a large txn consisting of a smaller number of wide rows.  This test does not exhibit any significant performance drop, while it was streamed too.  So the hypothesis seems to be valid.  Anyway, I do not have other reasonable ideas beside that right now.
I've checked this patch again recently and tried to improve it in terms of performance.  As a result I've implemented a new POC version of the applier (attached).  Almost everything in the streaming logic stayed intact, but the apply worker is significantly different.

As I wrote earlier, I still claim that spilling changes to disk at the applier side adds additional overhead, but it is possible to get rid of it.  In my additional patch I do the following:

1) Maintain a pool of additional background workers (bgworkers) that are connected with the main logical apply worker via shm_mq's.  Each worker is dedicated to the processing of a specific streamed transaction.

2) When we receive a streamed change for some transaction, we check whether there is an existing dedicated bgworker in HTAB (xid -> bgworker), or there are some in the idle list, or we spawn a new one.

3) We pass all changes (between STREAM START/STOP) to that bgworker via shm_mq_send without intermediate waiting.  However, we wait for the bgworker to apply the entire chunk of changes at STREAM STOP, since we don't want transaction reordering.

4) When a transaction is committed/aborted, the worker is added to the idle list and waits for a reassigning message.

5) I have used the same machinery with apply_dispatch in the bgworkers, since most of the actions are practically very similar.

Thus, we do not spill anything at the applier side, so transaction changes are processed by bgworkers as normal backends do.  At the same time, changes processing is strictly serial, which prevents transaction reordering and possible conflicts/anomalies.  Even though we trade off performance in favor of stability, the result is rather impressive.

I have used a similar query for testing as before:

EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO large_test (num1, num2, num3)
SELECT round(random()*10), random(), random()*142
FROM generate_series(1, 100) s(i);

with 1kk (100), 3kk and 5kk rows; logical_work_mem = 64MB and synchronous_standby_names = 'FIRST 1 (large_sub)'.
The table schema is as follows:

CREATE TABLE large_test (
    id serial primary key,
    num1 bigint,
    num2 double precision,
    num3 double precision
);

Here are the results:

-----------------------------------------------------------
| N   | Time on master, sec | Total xact time, sec | Ratio |
-----------------------------------------------------------
|                  On commit (master, v13)                 |
-----------------------------------------------------------
| 1kk | 6.5                 | 17.6                 | x2.74 |
-----------------------------------------------------------
| 3kk | 21                  | 55.4                 | x2.64 |
-----------------------------------------------------------
| 5kk | 38.3                | 91.5                 | x2.39 |
-----------------------------------------------------------
| Stream +
Re: FETCH FIRST clause PERCENT option
The following review has been posted through the commitfest application:

make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hi everyone,

The v8 patch [1] applies cleanly and passes tests.  I reviewed the conversation to date and from what I can see, all identified bugs have been fixed and concerns either fixed or addressed.  Marking as ready for committer.

[1] https://www.postgresql.org/message-id/attachment/103486/percent-incremental-v8.patch

Ryan Lambert

The new status of this patch is: Ready for Committer
Re: RFC: seccomp-bpf support
On 2019-Aug-28, Joshua Brindle wrote:

> I think we need to rein in the thread somewhat. The feature allows
> end users to define some sandboxing within PG. Nothing is being forced
> on anyone but we would like the capability to harden a PG installation
> for many reasons already stated.

My own objection to this line of development is that no useful policy (allowed/denied syscall list) is, or intends to be, part of the final feature.  So we're shipping a hook system for which each independent vendor is going to develop their own policy.  Joe provided an example syscall list, but it's not part of the patch proper; and it seems, per the discussion, that the precise syscall list to use is a significant fraction of this.

So, as part of a committable patch, IMO it'd be good to have some sort of final list of syscalls -- maybe as part of the docbook part of the patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: no mailing list hits in google
On 2019-Aug-28, Thomas Kellerer wrote:

> Merlin Moncure wrote on 28.08.2019 at 18:22:
>> My test case here is the query: pgsql-hackers
>
> That search term is the first hit on DuckDuckGo:
> https://duckduckgo.com/?q=pgsql-hackers+ExecHashJoinNewBatch&t=h_&ia=web

Yes, but that's an old post, not the one from this year.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: RFC: seccomp-bpf support
On Thu, Aug 29, 2019 at 7:08 AM Joshua Brindle wrote:
> On Wed, Aug 28, 2019 at 2:53 PM Andres Freund wrote:
> > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:
> > > A prime example is madvise() which was a catastrophic failure that 1)
> > > isn't preventable by any LSM including SELinux, 2) isn't used by PG
> > > and is therefore a good candidate for a kill list, and 3) a clear win
> > > in the dont-let-PG-be-a-vector-for-kernel-compromise arena.
> >
> > IIRC it's used by glibc as part of its malloc implementation (also
> > threading etc) - but not necessarily hit during the most common
> > paths. That's *precisely* my problem with this approach.
>
> As long as glibc handles a returned error cleanly the syscall could be
> denied without harming the process and the bug would be mitigated.
>
> seccomp also allows argument whitelisting so things can get very
> granular, depending on who is setting up the lists.

Just by the way, there may also be differences between architectures.  After some head scratching, we recently discovered [1] that default seccomp whitelists currently cause PostgreSQL to panic for users of Docker, Nspawn etc. on POWER and ARM because of that.  That's a bug being fixed elsewhere, but it reveals another thing to be careful of if you're trying to build your own whitelist by guessing what libc needs to call.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLiR569VHLjtCNp3NT%2BjnKdhy8g2sdgKzWNojyWQVt6Bw%40mail.gmail.com

-- 
Thomas Munro
https://enterprisedb.com
Re: RFC: seccomp-bpf support
On 2019-08-28 21:38, Joshua Brindle wrote:
> I think we need to rein in the thread somewhat. The feature allows
> end users to define some sandboxing within PG. Nothing is being forced
> on anyone

Features come with a maintenance cost.  If we ship it, then people are going to try it out.  Then weird things will happen.  They will report mysterious bugs.  They will complain to their colleagues.  It all comes with a cost.

> but we would like the capability to harden a PG installation
> for many reasons already stated.

Most if not all of those reasons seem to have been questioned.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Memory-Bounded Hash Aggregation
I started to review this patch yesterday with Melanie Plageman, so we rebased this patch over the current master.  The main conflicts were due to a simplehash patch that has been committed separately [1].  I've attached the rebased patch.

I was playing with the code, and if one of the table's most common values isn't placed into the initial hash table, it spills a whole lot of tuples to disk that might have been avoided if we had some way to 'seed' the hash table with MCVs from the statistics.  Seems to me that you would need some way of dealing with values that are in the MCV list but ultimately don't show up in the scan.  I imagine that this kind of optimization would be most useful for aggregates on a full table scan.

Some questions:

Right now the patch always initializes 32 spill partitions.  Have you given any thought into how to intelligently pick an optimal number of partitions yet?

> That can be done as an add-on to approach #1 by evicting the entire
> hash table (writing out the partial states), then resetting the memory
> context.

By add-on approach, do you mean to say that you have something in mind to combine the two strategies?  Or do you mean that it could be implemented as a separate strategy?

> I think it's clear there's no perfect eviction strategy - for every
> algorithm we came up with we can construct a data set on which it
> performs terribly (I'm sure we could do that for the approach used by
> Greenplum, for example).
>
> So I think it makes sense to do what Jeff proposed, and then maybe try
> improving that in the future with a switch to different eviction
> strategy based on some heuristics.

I agree.  It definitely feels like both spilling strategies have their own use case.

That said, I think it's worth mentioning that with parallel aggregates it might actually be more useful to spill the trans values instead, and have them combined in a Gather or Finalize stage.
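As a side note on the fixed 32 partitions: a common scheme for this kind of spilling (hash join batching works similarly) is to take a few bits of the group hash per recursion level, so that recursing into a partition that is itself too large re-partitions its groups on fresh bits rather than sending them all to one bucket again.  A hypothetical sketch of that idea (this is not the patch's code, and the constant 32 is just its current default):

```c
#include <stdint.h>

#define NUM_SPILL_PARTITIONS 32     /* the patch's current fixed default */

/*
 * Route a group's hash value to one of 32 spill partitions by taking
 * 5 bits of the hash; a different 5-bit window is used at each
 * recursion depth so that re-spilling an oversized partition splits
 * its groups instead of keeping them together.
 */
int
spill_partition(uint32_t hash, int recursion_depth)
{
    int shift = 32 - 5 * (recursion_depth + 1); /* 5 bits per level */

    if (shift < 0)
        shift = 0;              /* hash bits exhausted; degenerate case */
    return (int) ((hash >> shift) & (NUM_SPILL_PARTITIONS - 1));
}
```

Choosing the partition count intelligently would then amount to estimating (ngroups * group size) / work_mem and rounding up to a power of two, rather than always using 32.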
[1] https://www.postgresql.org/message-id/flat/48abe675e1330f0c264ab2fe0d4ff23eb244f9ef.camel%40j-davis.com v1-0001-Rebased-memory-bounded-hash-aggregation.patch Description: Binary data
Re: RFC: seccomp-bpf support
Hi,

On 2019-08-28 15:38:11 -0400, Joshua Brindle wrote:
> It seems like complete system compromises should be prioritized over
> slowdowns, and it seems very unlikely to cause a noticeable slowdown
> anyway.

The point isn't really this specific issue, but that the argument that you'll not cause problems by disabling certain syscalls, or that it's easy to find which ones are used, just plainly isn't true.

> Are there PG users that backed out all of the Linux KPTI patches due
> to the slowdown?

Well, not backed out on a code level, but straight out disabled at boot time (i.e. pti=off)?  Yea, I know of several.

> I think we need to rein in the thread somewhat. The feature allows
> end users to define some sandboxing within PG. Nothing is being forced
> on anyone

Well, we'll have to deal with the fallout of this to some degree.  When postgres breaks people will complain, when it's slow, people will complain, ...

Greetings,

Andres Freund
Re: RFC: seccomp-bpf support
On Wed, Aug 28, 2019 at 3:22 PM Andres Freund wrote:
>
> Hi,
>
> On 2019-08-28 15:02:17 -0400, Joshua Brindle wrote:
> > On Wed, Aug 28, 2019 at 2:53 PM Andres Freund wrote:
> > > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:
> > > > A prime example is madvise() which was a catastrophic failure that 1)
> > > > isn't preventable by any LSM including SELinux, 2) isn't used by PG
> > > > and is therefore a good candidate for a kill list, and 3) a clear win
> > > > in the dont-let-PG-be-a-vector-for-kernel-compromise arena.
> > >
> > > IIRC it's used by glibc as part of its malloc implementation (also
> > > threading etc) - but not necessarily hit during the most common
> > > paths. That's *precisely* my problem with this approach.
> >
> > As long as glibc handles a returned error cleanly the syscall could be
> > denied without harming the process and the bug would be mitigated.
>
> And we'd hit mysterious slowdowns in production uses of PG when seccomp
> is enabled.

It seems like complete system compromises should be prioritized over slowdowns, and it seems very unlikely to cause a noticeable slowdown anyway.  Are there PG users that backed out all of the Linux KPTI patches due to the slowdown?

I think we need to rein in the thread somewhat.  The feature allows end users to define some sandboxing within PG.  Nothing is being forced on anyone, but we would like the capability to harden a PG installation for many reasons already stated.  This is being done in places all across the Linux ecosystem and is IMO a very useful mitigation.

Thank you.
Re: Index Skip Scan
> Sorry for the long delay.  Here is more or less what I had in mind.  After
> changing read_closest to read the whole page, I couldn't resist just merging
> it into readpage itself, since it's basically the same.  It could raise
> questions about performance and how intrusive this patch is, but I hope it's
> not that much of a problem (in the worst case we can split it back).  I've
> also added a few tests for the issue you've mentioned.  Thanks again, I
> appreciate how much effort you put into reviewing!

Putting it into one function makes sense I think.

Looking at the patch, I think in general there are some good improvements in there.  I'm afraid I did manage to find another incorrect query result though, having to do with the keepPrev part and skipping to the first tuple on an index page:

postgres=# drop table if exists b;
create table b as
  select a, b::int2 b, (b%2)::int2 c
  from generate_series(1,5) a, generate_series(1,366) b;
create index on b (a,b,c);
analyze b;
DROP TABLE
SELECT 1830
CREATE INDEX
ANALYZE
postgres=# set enable_indexskipscan=1;
SET
postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;
 a | b | c
---+---+---
 1 | 2 | 0
 2 | 4 | 0
 3 | 4 | 0
 4 | 4 | 0
 5 | 4 | 0
(5 rows)

postgres=# set enable_indexskipscan=0;
SET
postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;
 a | b | c
---+---+---
 1 | 2 | 0
 2 | 2 | 0
 3 | 2 | 0
 4 | 2 | 0
 5 | 2 | 0
(5 rows)

-Floris
Re: RFC: seccomp-bpf support
Hi, On 2019-08-28 15:02:17 -0400, Joshua Brindle wrote: > On Wed, Aug 28, 2019 at 2:53 PM Andres Freund wrote: > > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: > > > A prime example is madvise() which was a catastrophic failure that 1) > > > isn't preventable by any LSM including SELinux, 2) isn't used by PG > > > and is therefore a good candidate for a kill list, and 3) a clear win > > > in the dont-let-PG-be-a-vector-for-kernel-compromise arena. > > > > IIRC it's used by glibc as part of its malloc implementation (also > > threading etc) - but not necessarily hit during the most common > > paths. That's *precisely* my problem with this approach. > > > > As long as glibc handles a returned error cleanly the syscall could be > denied without harming the process and the bug would be mitigated. And we'd hit mysterious slowdowns in production uses of PG when seccomp is enabled. Greetings, Andres Freund
Re: RFC: seccomp-bpf support
Andres Freund writes: > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: >> A prime example is madvise() which was a catastrophic failure that 1) >> isn't preventable by any LSM including SELinux, 2) isn't used by PG >> and is therefore a good candidate for a kill list, and 3) a clear win >> in the dont-let-PG-be-a-vector-for-kernel-compromise arena. > IIRC it's used by glibc as part of its malloc implementation (also > threading etc) - but not necessarily hit during the most common > paths. That's *precisely* my problem with this approach. I think Andres is right here. There are madvise calls in glibc: glibc-2.28/malloc/malloc.c:__madvise (paligned_mem, size & ~psm1, MADV_DONTNEED); glibc-2.28/malloc/arena.c:__madvise ((char *) h + new_size, diff, MADV_DONTNEED); It appears that the first is only reachable from __malloc_trim which we don't use, but the second is reachable from free(). However, strace'ing says that it's never called during our standard regression tests, confirming Andres' thought that it's in seldom-reached paths. (I did not go through the free() logic in any detail, but it looks like it is only reached when dealing with quite-large allocations, which'd make sense.) So this makes a perfect example for Peter's point that testing is going to be a very fallible way of finding the set of syscalls that need to be allowed. Even if we had 100.00% code coverage of PG proper, we would not necessarily find calls like this. regards, tom lane
Re: RFC: seccomp-bpf support
On Wed, Aug 28, 2019 at 2:53 PM Andres Freund wrote: > > Hi, > > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: > > A prime example is madvise() which was a catastrophic failure that 1) > > isn't preventable by any LSM including SELinux, 2) isn't used by PG > > and is therefore a good candidate for a kill list, and 3) a clear win > > in the dont-let-PG-be-a-vector-for-kernel-compromise arena. > > IIRC it's used by glibc as part of its malloc implementation (also > threading etc) - but not necessarily hit during the most common > paths. That's *precisely* my problem with this approach. > As long as glibc handles a returned error cleanly the syscall could be denied without harming the process and the bug would be mitigated. seccomp also allows argument whitelisting so things can get very granular, depending on who is setting up the lists.
Re: Re: Email to hackers for test coverage
On Tue, Aug 27, 2019 at 8:30 PM Michael Paquier wrote: > Please note that I have not looked in details where we could put that, > but perhaps Robert and Peter G who worked on 4ea51cd to add support > for abbreviated keys have some ideas, so I am adding them in CC for > input. Movead is correct -- the NULL handling within ApplySortAbbrevFullComparator() cannot actually be used currently. I wouldn't change anything about the code, though, since it's useful to defensively handle NULLs. -- Peter Geoghegan
Re: RFC: seccomp-bpf support
Andres Freund writes: > On 2019-08-28 14:30:20 -0400, Tom Lane wrote: >> Admittedly, you can't get per-subprocess restrictions that way, but the >> incremental value from that seems *really* tiny. If listen() has a bug >> you need to fix the bug, not invent this amount of rickety infrastructure >> to limit who can call it. > And, as I mentioned in another email, once you can corrupt shared memory > in arbitrary ways, the differing restrictions aren't worth much > anyway. Postmaster might be separated out enough to survive attacks like > that, but backends definitely aren't. Another point in this area is that if you did feel a need for per-process syscall sets, having a restriction that the postmaster's allowed set be a superset of all the childrens' allowed sets seems quite the wrong thing. The set of calls the postmaster needs is probably a lot smaller than what the children need, seeing that it does so little. It's just different because it includes bind+listen which the children likely don't need. So the hierarchical way seccomp goes about this seems fairly wrong for our purposes regardless. regards, tom lane
Re: RFC: seccomp-bpf support
On Wed, Aug 28, 2019 at 2:30 PM Tom Lane wrote: > > > (And, TBH, I'm still wondering why SELinux isn't the way to address this.) Just going to address this one now. SELinux is an LSM and therefore only makes decisions when LSM hooks are invoked, which are not 1:1 for syscalls (not even close). Further, SELinux generally determines what objects a subject can access and only implements capabilities because it has to, to be non-bypassable. Seccomp filtering is an orthogonal system to SELinux and LSMs in general. We are already doing work to further restrict PG subprocesses with SELinux via [1] and [2], but that doesn't address the unused, high-risk, obsolete, etc syscall issue. A prime example is madvise() which was a catastrophic failure that 1) isn't preventable by any LSM including SELinux, 2) isn't used by PG and is therefore a good candidate for a kill list, and 3) a clear win in the dont-let-PG-be-a-vector-for-kernel-compromise arena. We are using SELinux, we are also going to use this, they work together.
Re: RFC: seccomp-bpf support
Hi, On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: > A prime example is madvise() which was a catastrophic failure that 1) > isn't preventable by any LSM including SELinux, 2) isn't used by PG > and is therefore a good candidate for a kill list, and 3) a clear win > in the dont-let-PG-be-a-vector-for-kernel-compromise arena. IIRC it's used by glibc as part of its malloc implementation (also threading etc) - but not necessarily hit during the most common paths. That's *precisely* my problem with this approach. Greetings, Andres Freund
Re: RFC: seccomp-bpf support
On Wed, Aug 28, 2019 at 2:47 PM Joshua Brindle wrote: > > On Wed, Aug 28, 2019 at 2:30 PM Tom Lane wrote: > > > > > > > (And, TBH, I'm still wondering why SELinux isn't the way to address this.) > > Just going to address this one now. SELinux is an LSM and therefore > only makes decisions when LSM hooks are invoked, which are not 1:1 for > syscalls (not even close). Further, SELinux generally determines what > objects a subject can access and only implements capabilities because > it has to, to be non-bypassable. > > Seccomp filtering is an orthogonal system to SELinux and LSMs in > general. We are already doing work to further restrict PG subprocesses > with SELinux via [1] and [2], but that doesn't address the unused, And I forgot the citations *sigh*, actually there should have only been [1]: 1. https://commitfest.postgresql.org/24/2259/ > high-risk, obsolete, etc syscall issue. A prime example is madvise() > which was a catastrophic failure that 1) isn't preventable by any LSM > including SELinux, 2) isn't used by PG and is therefore a good > candidate for a kill list, and 3) a clear win in the > dont-let-PG-be-a-vector-for-kernel-compromise arena. > > We are using SELinux, we are also going to use this, they work together.
Re: RFC: seccomp-bpf support
On 2019-08-28 14:30:20 -0400, Tom Lane wrote: > Andres Freund writes: > > Maybe I'm missing something, but it's not clear to me what meaningful > > attack surface can be reduced for PostgreSQL by forbidding certain > > syscalls, given the wide variety of syscalls required to run postgres. > > I think the idea is to block access to seldom-used syscalls because > those are exactly the ones most likely to have bugs. Yea, I can see some benefit in that. I'm just quite doubtful that the set of syscalls pg relies on doesn't already allow any determined attacker to trigger kernel bugs. E.g. the whole sysv ipc code is among those seldom-used areas of the code. As is the permission transfer through unix sockets. As is forking from within a syscall. ... > (We certainly hope that all the ones PG uses are well-debugged...) One would hope, but it's not quite my experience... > Whether the incremental protection is really worth the effort is > debatable, but as long as it's only people who think it *is* worth the > effort who have to deal with it, I don't mind. It seems almost guaranteed that there'll be bug reports about ominous crashes due to some less commonly used syscall being blacklisted. In a lot of cases that'll be hard to debug. After all, we already got such bug reports, without us providing anything builtin - and it's not like us adding our own filter support will make container solutions not use their filter, so there's no argument that doing so ourselves will reduce the fragility. > Admittedly, you can't get per-subprocess restrictions that way, but the > incremental value from that seems *really* tiny. If listen() has a bug > you need to fix the bug, not invent this amount of rickety infrastructure > to limit who can call it. And, as I mentioned in another email, once you can corrupt shared memory in arbitrary ways, the differing restrictions aren't worth much anyway. 
Postmaster might be separated out enough to survive attacks like that, but backends definitely aren't. Greetings, Andres Freund
Re: RFC: seccomp-bpf support
Hi, On 2019-08-28 14:23:00 -0400, Joshua Brindle wrote: > > or some similar technology where the filtering is done by logic > > that's outside the executable you wish to not trust. > > (After googling for libseccomp, I see that it's supposed to not > > allow syscalls to be turned back on once turned off, but that isn't > > any protection against this problem. An attacker who's found an ACE > > hole in Postgres can just issue ALTER SYSTEM SET to disable the > > feature, then force a postmaster restart, then profit.) A postmaster restart might not be enough, because the postmaster's restrictions can't be removed, once in place. But all that's needed to circumvent that is to force postmaster to die, and rely on systemd etc to restart it. > My preference would have been to enable it unconditionally but Joe was > being more practical. Well, the current approach is to configure the list of allowed syscalls in postgres. How would you ever secure that against the attacks described by Tom? As long as the restrictions are put into place by postgres itself, and as long as they're configurable, such attacks are possible, no? And as long as extensions etc need different syscalls, you'll need configurability. > > I follow the idea of limiting the attack surface for kernel bugs, > > but this doesn't seem like a useful implementation of that, even > > ignoring the ease-of-use problems Peter mentions. > > Limiting the kernel attack surface for network facing daemons is > imperative to hardening systems. All modern attacks are chained > together so a kernel bug is useful only if you can execute code, and > PG is a decent vector for executing code. I don't really buy that in the case of postgres. Normally, in a medium to high security world, once you have RCE in postgres, the valuable data can already be exfiltrated. And that's game over. 
The only real benefit of a kernel vulnerability is that it might allow the attack to persist for longer - but there's already plenty you can do inside postgres, once you have RCE. > At a minimum I would urge the community to look at adding high risk > syscalls to the kill list, systemd has some predefined sets we can > pick from like @obsolete, @cpu-emulation, @privileged, @mount, and > @module. I can see some small value in disallowing these - but we're back to the point where that is better done one layer *above* postgres, by a process with more privileges than PG. Because then a PG RCE doesn't have a way to prevent those filters from being applied. > The postmaster and per-role lists can further reduce the available > syscalls based on the exact extensions and PLs being used. I don't buy that per-session configurable lists buy you anything meaningful. With an RCE in one session, it's pretty trivial to corrupt shared memory to also trigger RCE in other sessions. And there's no way seccomp or anything like that will prevent that. An additional reason I'm quite sceptical about more fine-grained restrictions is that I think we're going to have to go for some use of threading in the next few years. While I think that's still far from agreed upon, I think there's a pretty large number of "senior" hackers that see this as the future. You can have per-thread seccomp filters, but that's so trivial to circumvent (just overwrite some vtable-like data in another thread's data, and have it call whatever gadget you want), that it's not even worth considering. Greetings, Andres Freund
Re: RFC: seccomp-bpf support
Andres Freund writes: > Maybe I'm missing something, but it's not clear to me what meaningful > attack surface can be reduced for PostgreSQL by forbidding certain > syscalls, given the wide variety of syscalls required to run postgres. I think the idea is to block access to seldom-used syscalls because those are exactly the ones most likely to have bugs. (We certainly hope that all the ones PG uses are well-debugged...) That seems fine. Whether the incremental protection is really worth the effort is debatable, but as long as it's only people who think it *is* worth the effort who have to deal with it, I don't mind. What I don't like about this proposal is that the filter configuration is kept somewhere where it's not at all hard for an attacker to modify it. It can't be a GUC, indeed it can't be in any file that the server has permissions to write on, or you might as well not bother. So I'd throw away pretty much everything in the submitted patch, and instead think about doing the filtering/limiting in something that's launched from the service start script and in turn launches the postmaster. That makes it (mostly?) Not Our Problem, but rather an independent component. Admittedly, you can't get per-subprocess restrictions that way, but the incremental value from that seems *really* tiny. If listen() has a bug you need to fix the bug, not invent this amount of rickety infrastructure to limit who can call it. (And, TBH, I'm still wondering why SELinux isn't the way to address this.) regards, tom lane
Re: RFC: seccomp-bpf support
On Wed, Aug 28, 2019 at 1:42 PM Tom Lane wrote: > > Peter Eisentraut writes: > > On 2019-08-28 17:13, Joe Conway wrote: > >> Attached is a patch for discussion, adding support for seccomp-bpf > >> (nowadays generally just called seccomp) syscall filtering at > >> configure-time using libseccomp. I would like to get this in shape to be > >> committed by the end of the November CF if possible. > > > To compute the initial set of allowed system calls, you need to have > > fantastic test coverage. What you don't want is some rarely used error > > recovery path to cause a system crash. I wouldn't trust our current > > coverage for this. > > Yeah, that seems like quite a serious problem. I think you'd want > to have some sort of static-code-analysis-based way of identifying > the syscalls in use, rather than trying to test your way to it. > > > Overall, I think this sounds like a maintenance headache, and the > > possible benefits are unclear. > > After thinking about this for awhile, I really don't follow what > threat model it's trying to protect against. Given that we'll allow > any syscall that an unmodified PG executable might use, it seems > like the only scenarios being protected against involve someone > having already compromised the server enough to have arbitrary code > execution. OK, fine, but then why wouldn't the attacker just > bypass libseccomp? Or tell it to let through the syscall he wants > to use? Having the list of allowed syscalls be determined inside > the process seems like fundamentally the wrong implementation. > I'd have expected a feature like this to be implemented by SELinux, SELinux is generally an object model and while it does implement e.g., capability checks, that is not the main intent, nor is it possible for LSMs to implement syscall filters, the features are orthogonal. > or some similar technology where the filtering is done by logic > that's outside the executable you wish to not trust. 
> (After googling for libseccomp, I see that it's supposed to not > allow syscalls to be turned back on once turned off, but that isn't > any protection against this problem. An attacker who's found an ACE > hole in Postgres can just issue ALTER SYSTEM SET to disable the > feature, then force a postmaster restart, then profit.) > My preference would have been to enable it unconditionally but Joe was being more practical. > I follow the idea of limiting the attack surface for kernel bugs, > but this doesn't seem like a useful implementation of that, even > ignoring the ease-of-use problems Peter mentions. Limiting the kernel attack surface for network-facing daemons is imperative to hardening systems. All modern attacks are chained together so a kernel bug is useful only if you can execute code, and PG is a decent vector for executing code. At a minimum I would urge the community to look at adding high-risk syscalls to the kill list; systemd has some predefined sets we can pick from like @obsolete, @cpu-emulation, @privileged, @mount, and @module. The goal is to prevent an ACE hole in Postgres from becoming a complete system compromise. This may not do it alone, and security-conscious integrators will want to, for example, add seccomp filters to systemd to prevent superuser from disabling them. The postmaster and per-role lists can further reduce the available syscalls based on the exact extensions and PLs being used. Each step reduces the surface more, and throwing it all out because one step can go rogue is unsatisfying. Thank you.
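To make the systemd option concrete, the predefined deny sets mentioned above can be applied from a unit drop-in, i.e. one layer above postgres. This is only a sketch: the group names are real systemd syscall sets, but the unit and file names are assumptions about local packaging, and the exact group list (especially @privileged) would need testing against a real installation before use.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/postgresql.service.d/seccomp.conf
[Service]
# "~" makes this a deny list: block the named syscall groups, allow the rest.
SystemCallFilter=~@obsolete @cpu-emulation @module @mount
# Return EPERM instead of killing the process outright.
SystemCallErrorNumber=EPERM
```

Because systemd installs the filter before exec'ing the postmaster, an attacker with RCE inside postgres cannot remove it, which is the property a GUC-based configuration cannot provide.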
Re: no mailing list hits in google
Merlin Moncure wrote on 28.08.2019 at 18:22: My test case here is the query: pgsql-hackers That search term is the first hit on DuckDuckGo: https://duckduckgo.com/?q=pgsql-hackers+ExecHashJoinNewBatch&t=h_&ia=web Searching for "postgres ExecHashJoinNewBatch" returns that at position 4 https://duckduckgo.com/?q=postgres+ExecHashJoinNewBatch&t=h_&ia=web
Re: RFC: seccomp-bpf support
Hi, On 2019-08-28 13:28:06 -0400, Joe Conway wrote: > > To compute the initial set of allowed system calls, you need to have > > fantastic test coverage. What you don't want is some rarely used error > > recovery path to cause a system crash. I wouldn't trust our current > > coverage for this. > So if you are worried about that make your default action 'log' and > watch audit.log. There will be no errors or crashes of postgres caused > by that because there will be no change in postgres visible behavior. But the benefit of integrating this into postgres becomes even less clear. > And if returning an error from a syscall causes a crash that would be a > serious bug and we should fix it. Err, there are a lot of syscall failures that'll cause PANICs, and where there's no reasonable way around that. Greetings, Andres Freund
Re: RFC: seccomp-bpf support
Hi, On 2019-08-28 11:13:27 -0400, Joe Conway wrote: > Recent security best-practices recommend, and certain highly > security-conscious organizations are beginning to require, that SECCOMP > be used to the extent possible. The major web browsers, container > runtime engines, and systemd are all examples of software that already > support seccomp. Maybe I'm missing something, but it's not clear to me what meaningful attack surface can be reduced for PostgreSQL by forbidding certain syscalls, given the wide variety of syscalls required to run postgres. That's different from something like a browser's CSS process, or such, which really doesn't need much beyond some IPC and memory allocations. But postgres is going to need syscalls as broad as fork/clone, exec, connect, shm*, etc. I guess you can argue that we'd still reduce the attack surface for kernel escalations, but that seems like a pretty small win compared to the cost. > * With built-in support, it is possible to lock down backend processes > more tightly than the postmaster. Which important syscalls would you get away with removing in backends that postmaster needs? I think the only one - which is a good one though - that I can think of is listen(). But even that might be too restrictive for some PLs running out of process. My main problem with seccomp is that it's *incredibly* fragile, especially for a program as complex as postgres. We already had seccomp related bug reports on list, even just due to the very permissive filtering by some container solutions. There's regularly new syscalls (e.g. epoll_create1(), and we'll soon get openat2()), different versions of glibc use different syscalls (e.g. switching from open() to always using openat()), the system configuration influences which syscalls are being used (e.g. vsyscalls only being used for certain clock sources), and kernel bugfixes change the exact set of syscalls being used ([1]). 
[1] https://lwn.net/Articles/795128/ Then there's also the issue that many extensions are going to need additional syscalls. > Notes on usage: > === > In order to determine your minimally required allow lists, do something > like the following on a non-production server with the same architecture > as production: > c) Cut and paste the result as the value of session_syscall_allow. That seems nearly guaranteed to miss a significant fraction of syscalls. There's just no way we're going to cover all the potential paths and configurations in our testsuite. I think if you actually wanted to do something like this, you'd need to use static analysis to come up with a more reliable list. Greetings, Andres Freund
Re: no mailing list hits in google
Hi, On 2019-08-28 10:26:35 -0700, Andres Freund wrote: > On August 28, 2019 9:22:44 AM PDT, Merlin Moncure wrote: > >Hackers, > >[apologies if this is the incorrect list or is already discussed > >material] > > Probably should be on the -www list. Redirecting. Please trim in future > replies. > > >I've noticed that mailing list discussions in -hackers and other > >mailing lists appear to not be indexed by google -- at all. > > I noticed that there's fewer and fewer hits too. Pretty annoying. I have an > online archive I can search, but that's not something everyone should have to > do. > > I think it's because robots.txt tells search engines to ignore the lists. > Quite hard to understand how that's a good idea. > > https://www.postgresql.org/robots.txt > > User-agent: * > Disallow: /admin/ > Disallow: /account/ > Disallow: /docs/devel/ > Disallow: /list/ > Disallow: /search/ > Disallow: /message-id/raw/ > Disallow: /message-id/flat/ > > Sitemap: https://www.postgresql.org/sitemap.xml > > Without /list, there's no links to the individual messages. So there needs to > be another external reference for a search engine to arrive at individual > messages. > > Andres > -- > Sent from my Android device with K-9 Mail. Please excuse my brevity. For reasons that I do not understand, the previous mail had a broken html part, making the above message invisible for people viewing the html part. Greetings, Andres Freund
Re: RFC: seccomp-bpf support
Peter Eisentraut writes: > On 2019-08-28 17:13, Joe Conway wrote: >> Attached is a patch for discussion, adding support for seccomp-bpf >> (nowadays generally just called seccomp) syscall filtering at >> configure-time using libseccomp. I would like to get this in shape to be >> committed by the end of the November CF if possible. > To compute the initial set of allowed system calls, you need to have > fantastic test coverage. What you don't want is some rarely used error > recovery path to cause a system crash. I wouldn't trust our current > coverage for this. Yeah, that seems like quite a serious problem. I think you'd want to have some sort of static-code-analysis-based way of identifying the syscalls in use, rather than trying to test your way to it. > Overall, I think this sounds like a maintenance headache, and the > possible benefits are unclear. After thinking about this for awhile, I really don't follow what threat model it's trying to protect against. Given that we'll allow any syscall that an unmodified PG executable might use, it seems like the only scenarios being protected against involve someone having already compromised the server enough to have arbitrary code execution. OK, fine, but then why wouldn't the attacker just bypass libseccomp? Or tell it to let through the syscall he wants to use? Having the list of allowed syscalls be determined inside the process seems like fundamentally the wrong implementation. I'd have expected a feature like this to be implemented by SELinux, or some similar technology where the filtering is done by logic that's outside the executable you wish to not trust. (After googling for libseccomp, I see that it's supposed to not allow syscalls to be turned back on once turned off, but that isn't any protection against this problem. An attacker who's found an ACE hole in Postgres can just issue ALTER SYSTEM SET to disable the feature, then force a postmaster restart, then profit.) 
I follow the idea of limiting the attack surface for kernel bugs, but this doesn't seem like a useful implementation of that, even ignoring the ease-of-use problems Peter mentions. regards, tom lane
Re: RFC: seccomp-bpf support
On 8/28/19 12:47 PM, David Fetter wrote: > On Wed, Aug 28, 2019 at 11:13:27AM -0400, Joe Conway wrote: >> SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall >> filtering mechanism which allows reduction of the kernel attack surface >> by preventing (or at least audit logging) normally unused syscalls. >> >> Quoting from this link: >> https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt >> >>"A large number of system calls are exposed to every userland process >> with many of them going unused for the entire lifetime of the >> process. As system calls change and mature, bugs are found and >> eradicated. A certain subset of userland applications benefit by >> having a reduced set of available system calls. The resulting set >> reduces the total kernel surface exposed to the application. System >> call filtering is meant for use with those applications." >> >> Recent security best-practices recommend, and certain highly >> security-conscious organizations are beginning to require, that SECCOMP >> be used to the extent possible. The major web browsers, container >> runtime engines, and systemd are all examples of software that already >> support seccomp. > > Neat! > > Are the seccomp interfaces for other kernels arranged in a manner > similar enough to have a unified interface in PostgreSQL, or is this > more of a Linux-only feature? As far as I know libseccomp is Linux specific, at least at the moment. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Re: RFC: seccomp-bpf support
On 8/28/19 1:03 PM, Peter Eisentraut wrote: > On 2019-08-28 17:13, Joe Conway wrote: >> * systemd does not implement seccomp filters by default. Packagers may >> decide to do so, but there is no guarantee. Adding them post install >> potentially requires cooperation by groups outside control of >> the database admins. > > Well, if we are interested in this, why don't we just ask packagers to > do so? That seems like a much easier solution. For the reason listed below. >> * In the container and systemd case there is no particularly good way to >> inspect what filters are active. It is possible to observe actions >> taken, but again, control is possibly outside the database admin >> group. For example, the best way to understand what happened is to >> review the auditd log, which is likely not readable by the DBA. > > Why does one need to know what filters are active (other than, > obviously, checking that the filters one has set were actually > activated)? What decisions would a DBA or database user be able to make > based on whether a filter is set or not? So that when an enforcement action happens, you can understand why it happened. Perhaps it was a bug (omission) in your allow list, or perhaps it was an intentional rule to prevent abuse (say blocking certain syscalls from plperlu), or it was because someone is trying to compromise your system (e.g. some obscure and clearly not needed syscall). >> * With built-in support, it is possible to lock down backend processes >> more tightly than the postmaster. > > Also the other way around? As I stated in the original email, the filters can add restrictions but never remove them. >> * With built-in support, it is possible to lock down different backend >> processes differently than each other, for example by using ALTER ROLE >> ... SET or ALTER DATABASE ... SET. > > What are use cases for this? For example to allow a specific user to use plperlu to exec shell code while others cannot. 
>> Attached is a patch for discussion, adding support for seccomp-bpf >> (nowadays generally just called seccomp) syscall filtering at >> configure-time using libseccomp. I would like to get this in shape to be >> committed by the end of the November CF if possible. > > To compute the initial set of allowed system calls, you need to have > fantastic test coverage. What you don't want is some rarely used error > recovery path to cause a system crash. I wouldn't trust our current > coverage for this. So if you are worried about that make your default action 'log' and watch audit.log. There will be no errors or crashes of postgres caused by that because there will be no change in postgres visible behavior. And if returning an error from a syscall causes a crash that would be a serious bug and we should fix it. > Also, the list of system calls in use changes all the time. The same > function call from PostgreSQL could be a system call or a glibc > implementation, depending on the specific installed packages or run-time > settings. True. That is why I did not provide an initial list and believe folks who want to use this should develop their own. > Extensions would need to maintain their own permissions list, and they > would need to be merged manually into the respective existing settings. People would have to generate their own list for extensions -- I don't believe it is the extension writers' problem. > Without good answers to these, I suspect that this feature would go > straight to the top of the list of "if in doubt, turn off". That is fine. Perhaps most people never use this, but when needed (and increasingly will be required) it is available. > Overall, I think this sounds like a maintenance headache, and the > possible benefits are unclear. The only people who will incur the maintenance headache are those who need to use it. The benefits are compliance with requirements. 
Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Re: no mailing list hits in google
Hi, On August 28, 2019 9:22:44 AM PDT, Merlin Moncure wrote: >Hackers, >[apologies if this is the incorrect list or is already discussed >material] Probably should be on the -www list. Redirecting. Please trim in future replies. >I've noticed that mailing list discussions in -hackers and other >mailing lists appear to not be indexed by google -- at all. I noticed that there's fewer and fewer hits too. Pretty annoying. I have an online archive I can search, but that's not something everyone should have to do. I think it's because robots.txt tells search engines to ignore the lists. Quite hard to understand how that's a good idea. https://www.postgresql.org/robots.txt

User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /docs/devel/
Disallow: /list/
Disallow: /search/
Disallow: /message-id/raw/
Disallow: /message-id/flat/

Sitemap: https://www.postgresql.org/sitemap.xml

Without /list, there are no links to the individual messages. So there needs to be another external reference for a search engine to arrive at individual messages. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Hi Tomas,

> Interesting. Any idea where the extra overhead in this particular case
> comes from? It's hard to deduce that from the single flame graph, when
> I don't have anything to compare it with (i.e. the flame graph for the
> "normal" case).

I guess that the bottleneck is in disk operations. You can check the logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and writes (~26%) take around 35% of CPU time in summary. To compare, please see the attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',') FROM generate_series(1, 2000))
FROM generate_series(1, 100);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total. So we get very roughly the same ~x4-5 performance drop here. JFYI, I am using a machine with an SSD for tests.

>> Therefore, probably you may write changes on receiver in bigger chunks,
>> not each change separately.
>
> Possibly. I/O is certainly a possible culprit, although we should be using
> buffered I/O and there certainly are not any fsyncs here. So I'm not sure
> why it would be cheaper to do the writes in batches.
>
> BTW does this mean you see the overhead on the apply side? Or are you
> running this on a single machine, and it's difficult to decide?

I run this on a single machine, but walsender and worker are utilizing almost 100% of CPU per each process all the time, and at the apply side I/O syscalls take about 1/3 of CPU time. Though I am still not sure, for me this result somehow links the performance drop with problems at the receiver side.

Writing in batches was just a hypothesis, and to validate it I have performed a test with a large txn consisting of a smaller number of wide rows. This test does not exhibit any significant performance drop, while it was streamed too. So the hypothesis seems to be valid. Anyway, I do not have other reasonable ideas besides that right now.

I've recently checked this patch again and tried to elaborate on it in terms of performance.
As a result I've implemented a new POC version of the applier (attached). Almost everything in the streaming logic stayed intact, but the apply worker is significantly different. As I wrote earlier, I still claim that spilling changes to disk at the applier side adds additional overhead, but it is possible to get rid of it. In my additional patch I do the following:

1) Maintain a pool of additional background workers (bgworkers) that are connected with the main logical apply worker via shm_mq's. Each worker is dedicated to the processing of a specific streamed transaction.

2) When we receive a streamed change for some transaction, we check whether there is an existing dedicated bgworker in a HTAB (xid -> bgworker), or there are some in the idle list, or spawn a new one.

3) We pass all changes (between STREAM START/STOP) to that bgworker via shm_mq_send without intermediate waiting. However, we wait for the bgworker to apply the entire chunk of changes at STREAM STOP, since we don't want transaction reordering.

4) When a transaction is committed/aborted, the worker is added to the idle list and waits for a reassignment message.

5) I have used the same machinery with apply_dispatch in bgworkers, since most of the actions are practically very similar.

Thus, we do not spill anything at the applier side, so transaction changes are processed by bgworkers as normal backends do. At the same time, changes processing is strictly serial, which prevents transaction reordering and possible conflicts/anomalies. Even though we trade off performance in favor of stability, the result is rather impressive. I have used a similar query for testing as before:

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
SELECT round(random()*10), random(), random()*142
FROM generate_series(1, 100) s(i);

with 1kk (100), 3kk and 5kk rows; logical_work_mem = 64MB and synchronous_standby_names = 'FIRST 1 (large_sub)'.
Table schema is as follows:

CREATE TABLE large_test (
    id serial primary key,
    num1 bigint,
    num2 double precision,
    num3 double precision
);

Here are the results:

---------------------------------------------------------------
| N   | Time on master, sec | Total xact time, sec | Ratio    |
---------------------------------------------------------------
|                  On commit (master, v13)                    |
---------------------------------------------------------------
| 1kk | 6.5                 | 17.6                 | x2.74    |
---------------------------------------------------------------
| 3kk | 21                  | 55.4                 | x2.64    |
---------------------------------------------------------------
| 5kk | 38.3                | 91.5                 | x2.39    |
---------------------------------------------------------------
|                      Stream + spill                         |
-
Re: RFC: seccomp-bpf support
On 2019-08-28 17:13, Joe Conway wrote:
> * systemd does not implement seccomp filters by default. Packagers may
> decide to do so, but there is no guarantee. Adding them post install
> potentially requires cooperation by groups outside control of
> the database admins.

Well, if we are interested in this, why don't we just ask packagers to do so? That seems like a much easier solution.

> * In the container and systemd case there is no particularly good way to
> inspect what filters are active. It is possible to observe actions
> taken, but again, control is possibly outside the database admin
> group. For example, the best way to understand what happened is to
> review the auditd log, which is likely not readable by the DBA.

Why does one need to know what filters are active (other than, obviously, checking that the filters one has set were actually activated)? What decisions would a DBA or database user be able to make based on whether a filter is set or not?

> * With built-in support, it is possible to lock down backend processes
> more tightly than the postmaster.

Also the other way around?

> * With built-in support, it is possible to lock down different backend
> processes differently than each other, for example by using ALTER ROLE
> ... SET or ALTER DATABASE ... SET.

What are use cases for this?

> Attached is a patch for discussion, adding support for seccomp-bpf
> (nowadays generally just called seccomp) syscall filtering at
> configure-time using libseccomp. I would like to get this in shape to be
> committed by the end of the November CF if possible.

To compute the initial set of allowed system calls, you need to have fantastic test coverage. What you don't want is some rarely used error recovery path to cause a system crash. I wouldn't trust our current coverage for this.

Also, the list of system calls in use changes all the time.
The same function call from PostgreSQL could be a system call or a glibc implementation, depending on the specific installed packages or run-time settings. Extensions would need to maintain their own permissions list, and they would need to be merged manually into the respective existing settings. Without good answers to these, I suspect that this feature would go straight to the top of the list of "if in doubt, turn off". Overall, I think this sounds like a maintenance headache, and the possible benefits are unclear. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: basebackup.c's sendFile() ignores read errors
On Tue, Aug 27, 2019 at 10:33 PM Robert Haas wrote:
> While reviewing a proposed patch to basebackup.c this morning, I found
> myself a bit underwhelmed by the quality of the code and comments in
> basebackup.c's sendFile(). I believe it's already been pointed out
> that the retry logic here is not particularly correct, and the
> comments demonstrate a pretty clear lack of understanding of how the
> actual race conditions that are possible here might operate.
>
> However, I then noticed another problem which I think is significantly
> more serious: this code totally ignores the possibility of a failure
> in fread(). It just assumes that if fread() fails to return a
> positive value, it's reached the end of the file, and if that happens,
> it just pads out the rest of the file with zeroes. So it looks to me
> like if fread() encountered, say, an I/O error while trying to read a
> data file, and if that error were on the first data block, then the
> entire contents of that file would be replaced with zero bytes in the
> backup, and no error or warning of any kind would be given to the
> user. If it hit the error later, everything from the point of the
> error onward would just get replaced with zero bytes. To be clear, I
> think it is fine for basebackup.c to fill out the rest of the expected
> length with zeroes if in fact the file has been truncated during the
> backup; recovery should fix it. But recovery is not going to fix data
> that got eaten because EIO was encountered while copying a file.

Per the fread(3) manpage, we cannot distinguish between an error and EOF. So to check for any error we must use ferror() after fread(). Attached is a patch which tests ferror() and throws an error if it returns true. However, I think fread()/ferror() both do not set errno (per some web references), and thus I am throwing a generic error here. I have manually tested it.

> The logic that rereads a block appears to have the opposite problem.
> Here, it will detect an error, but it will also fail in the face of a > concurrent truncation, which it shouldn't. > For this, I have checked for feof() assuming that when the file gets truncated we reach EOF. And if yes, getting out of the loop instead of throwing an error. I may be wrong as I couldn't reproduce this issue. Please have a look over the patch and let me know your views. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > > -- Jeevan Chalke Technical Architect, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index c91f66d..5e64acf 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -90,6 +90,18 @@ static char *statrelpath = NULL; */ #define THROTTLING_FREQUENCY 8 +/* + * Checks whether we encountered any error in fread(). fread() doesn't give + * any clue what has happened, so we check with ferror(). Also, neither + * fread() nor ferror() set errno, so we just throw a generic error. + */ +#define CHECK_FREAD_ERROR(fp, filename) \ +do { \ + if (ferror(fp)) \ + ereport(ERROR, \ +(errmsg("could not read from file \"%s\"", filename))); \ +} while (0) + /* The actual number of bytes, transfer of which may cause sleep. */ static uint64 throttling_sample; @@ -543,6 +555,9 @@ perform_base_backup(basebackup_options *opt) break; } + /* Check fread() error. */ + CHECK_FREAD_ERROR(fp, pathbuf); + if (len != wal_segment_size) { CheckXLogRemoved(segno, tli); @@ -1478,6 +1493,18 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf if (fread(buf + BLCKSZ * i, 1, BLCKSZ, fp) != BLCKSZ) { +/* + * If file is truncated, then we will hit + * end-of-file error in which case we don't + * want to error out, instead just pad it with + * zeros. 
+ */ +if (feof(fp)) +{ + cnt = BLCKSZ * i; + break; +} + ereport(ERROR, (errcode_for_file_access(), errmsg("could not reread block %d of file \"%s\": %m", @@ -1529,7 +1556,7 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf len += cnt; throttle(cnt); - if (len >= statbuf->st_size) + if (feof(fp) || len >= statbuf->st_size) { /* * Reached end of file. The file could be longer, if it was @@ -1540,6 +1567,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf } } + /* Check fread() error. */ + CHECK_FREAD_ERROR(fp, readfilename); + /* If the file was truncated while we were sending it, pad it with zeros */ if (len < statbuf->st_size) {
Re: RFC: seccomp-bpf support
On Wed, Aug 28, 2019 at 11:13:27AM -0400, Joe Conway wrote: > SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall > filtering mechanism which allows reduction of the kernel attack surface > by preventing (or at least audit logging) normally unused syscalls. > > Quoting from this link: > https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt > >"A large number of system calls are exposed to every userland process > with many of them going unused for the entire lifetime of the > process. As system calls change and mature, bugs are found and > eradicated. A certain subset of userland applications benefit by > having a reduced set of available system calls. The resulting set > reduces the total kernel surface exposed to the application. System > call filtering is meant for use with those applications." > > Recent security best-practices recommend, and certain highly > security-conscious organizations are beginning to require, that SECCOMP > be used to the extent possible. The major web browsers, container > runtime engines, and systemd are all examples of software that already > support seccomp. Neat! Are the seccomp interfaces for other kernels arranged in a manner similar enough to have a unified interface in PostgreSQL, or is this more of a Linux-only feature? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: Procedure support improvements
On 2019-08-26 20:08, Laurenz Albe wrote: > test=> CREATE OR REPLACE PROCEDURE testproc() LANGUAGE plpgsql AS >$$BEGIN PERFORM 42; COMMIT; PERFORM 'x'; END;$$; > CREATE PROCEDURE > test=> CALL testproc(); > CALL > test=> BEGIN; > BEGIN > test=> CALL testproc(); > ERROR: invalid transaction termination > CONTEXT: PL/pgSQL function testproc() line 1 at COMMIT > > Oops. > I find that indeed surprising. > > What is the rationale for this? It's mostly an implementation restriction. You would need to teach SPI_commit() and SPI_rollback() to manipulate the top-level transaction block state appropriately and carefully. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Email to hackers for test coverage
On 2019-08-22 11:46, movead...@highgo.ca wrote: > *1. src/include/utils/float.h:140* > > Analyze: > This is an error report line when converting a big float8 value > which a float4 can not storage to float4. > > Test case: > Add a test case as below in file float4.sql: > select float4(1234567890123456789012345678901234567890::float8); > +-- Add test case for float4() type fonversion function Check spelling > *2. src/include/utils/float.h:145* > > Analyze: > This is an error report line when converting a small float8 value > which a float4 can not storage to float4. > > Test case: > Add a test case as below in file float4.sql: > select float4(0.01::float8); > > *3.src/include/utils/sortsupport.h:264* > > Analyze: > It is reverse sorting for the data type that has abbreviated for > sort, for example macaddr, uuid, numeric, network and I choose > numeric to do it. > > Test cast: > Add a test case as below in file numeric.sql: > INSERT INTO num_input_test(n1) values('99.998'); > INSERT INTO num_input_test(n1) values('99.997'); > SELECT * FROM num_input_test ORDER BY n1 DESC; > INSERT INTO num_input_test(n1) VALUES ('nan'); > +INSERT INTO num_input_test(n1) values('99.998'); > +INSERT INTO num_input_test(n1) values('99.997'); Make spaces and capitalization match surrounding code. > Result and patch > > By adding the test cases, the test coverage of float.h increased from > 97.7% to 100% and sortsupport.h increased from 76.7% to 80.0%. That's fine, but I suggest that if you really want to make an impact in test coverage, look to increase function coverage rather than line coverage. Or look for files that are not covered at all. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
no mailing list hits in google
Hackers,

[apologies if this is the incorrect list or is already discussed material]

I've noticed that mailing list discussions in -hackers and other mailing lists appear to not be indexed by google -- at all. We are also not being tracked by any mailing list aggregators -- in contrast to a decade ago, when we had nabble and other systems to collect and organize results (tbh, often better than we do), we are now at an extreme disadvantage; mailing list activity was formerly an absolutely fantastic research resource via google for finding solutions to obscure technical problems in the database. Limited access to this information will directly lead to increased bug reports, lack of solution confidence, etc.

My test case here is the query:

pgsql-hackers ExecHashJoinNewBatch

I was searching out a link to a recent bug report for copy/paste into corporate email. In the old days this would fire right up, but now it returns no hits even though the discussion is available in the archives (which I had to find by looking up the specific day the thread was active).

Just a heads up.

merlin
Re: Statement timeout in pg_rewind
Alexander Kukushkin writes: > On Wed, 28 Aug 2019 at 04:51, Michael Paquier wrote: >> note that your patch had a warning as "result" is not needed in >> run_simple_command(). > Ohh, sorry about that. I compiled the whole source tree and the > warning was buried down among other output :( "make -s" is your friend ... regards, tom lane
Re: Improve error detections in TAP tests by spreading safe_psql
I wrote: > After a brief review of node->psql calls in the TAP tests, I'm > of the opinion that what we should do is revert fb57f40 and instead > change PostgresNode::psql so that on_error_die defaults to 1, then > fix the *very* small number of callers that need it to be zero. > Otherwise we're just going to be fighting this same fire forevermore. > Moreover, that's going to lead to a much smaller patch. Hmm .. I experimented with doing it this way, as attached, and I guess I have to take back the assertion that it'd be a much smaller patch. I found 47 calls that'd need to be changed to set on_error_die to 0, whereas your patch is touching 44 calls (though I think it missed some). I still think this is a safer, less bug-prone way to proceed though. The attached patch just follows a mechanical rule of "set on_error_die to 0 in exactly those calls where the surrounding code verifies the exit code is what it expects". That leads to a lot of code that looks like, say, # Insert should work on standby is( $node_standby->psql( 'postgres', - qq{insert into testtab select generate_series(1,1000), 'foo';}), + qq{insert into testtab select generate_series(1,1000), 'foo';}, + on_error_die => 0), 0, 'INSERT succeeds with truncated relation FSM'); I have to wonder if it isn't better to just drop the is() test and let PostgresNode::psql issue the failure. The only thing the is() is really buying us is the ability to label the test with an ID string, which is helpful, but um ... couldn't that just be a comment? 
Or we could think about adding another optional parameter: $node_standby->psql( 'postgres', qq{insert into testtab select generate_series(1,1000), 'foo';}, test_name => 'INSERT succeeds with truncated relation FSM'); regards, tom lane diff --git a/src/bin/pg_ctl/t/004_logrotate.pl b/src/bin/pg_ctl/t/004_logrotate.pl index 71dbfd2..8760804 100644 --- a/src/bin/pg_ctl/t/004_logrotate.pl +++ b/src/bin/pg_ctl/t/004_logrotate.pl @@ -19,7 +19,7 @@ $node->start(); # Verify that log output gets to the file -$node->psql('postgres', 'SELECT 1/0'); +$node->psql('postgres', 'SELECT 1/0', on_error_die => 0); my $current_logfiles = slurp_file($node->data_dir . '/current_logfiles'); @@ -75,7 +75,7 @@ chomp $lfname; # Verify that log output gets to this file, too -$node->psql('postgres', 'fee fi fo fum'); +$node->psql('postgres', 'fee fi fo fum', on_error_die => 0); my $second_logfile; for (my $attempts = 0; $attempts < $max_attempts; $attempts++) diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index f064ea1..53f4824 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -3274,6 +3274,7 @@ $node->psql( 'postgres', "CREATE COLLATION testing FROM \"C\"; DROP COLLATION testing;", on_error_stop => 0, + on_error_die => 0, stderr=> \$collation_check_stderr); if ($collation_check_stderr !~ /ERROR: /) diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl index b82d3f6..7cc1c88 100644 --- a/src/bin/pgbench/t/001_pgbench_with_server.pl +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl @@ -487,12 +487,10 @@ INSERT INTO seeded_random(seed, rand, val) VALUES } # check that all runs generated the same 4 values -my ($ret, $out, $err) = $node->psql('postgres', +my $out = $node->safe_psql('postgres', 'SELECT seed, rand, val, COUNT(*) FROM seeded_random GROUP BY seed, rand, val' ); -ok($ret == 0, "psql seeded_random count ok"); -ok($err eq '', "psql seeded_random count stderr is 
empty"); ok($out =~ /\b$seed\|uniform\|1\d\d\d\|2/, "psql seeded_random count uniform"); ok( $out =~ /\b$seed\|exponential\|2\d\d\d\|2/, diff --git a/src/test/authentication/t/001_password.pl b/src/test/authentication/t/001_password.pl index 3a3b0eb..b58304e 100644 --- a/src/test/authentication/t/001_password.pl +++ b/src/test/authentication/t/001_password.pl @@ -45,7 +45,10 @@ sub test_role $status_string = 'success' if ($expected_res eq 0); - my $res = $node->psql('postgres', undef, extra_params => [ '-U', $role ]); + my $res = $node->psql( + 'postgres', undef, + on_error_die => 0, + extra_params => [ '-U', $role ]); is($res, $expected_res, "authentication $status_string for method $method, role $role"); return; diff --git a/src/test/authentication/t/002_saslprep.pl b/src/test/authentication/t/002_saslprep.pl index c4b335c..0a75736 100644 --- a/src/test/authentication/t/002_saslprep.pl +++ b/src/test/authentication/t/002_saslprep.pl @@ -42,7 +42,10 @@ sub test_login $status_string = 'success' if ($expected_res eq 0); $ENV{"PGPASSWORD"} = $password; - my $res = $node->psql('postgres', undef, extra_params => [ '-U', $role ]); + my $res = $node->psql( + 'postgres', undef, + on_error_die => 0, + extra_params => [ '-U', $role ]); is($res, $expected_r
Re: Index Skip Scan
> On Mon, Aug 5, 2019 at 10:38 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> Of course we can modify it to read a whole page and leave the visibility
> check for the code after `index_getnext_tid` (although in case we know that
> all tuples on this page are visible I guess it's not strictly necessary,
> but we still get improvement from skipping itself).

Sorry for the long delay. Here is more or less what I had in mind. After changing read_closest to read the whole page I couldn't resist just merging it into readpage itself, since it's basically the same. It could raise questions about performance and how intrusive this patch is, but I hope it's not that much of a problem (in the worst case we can split it back). I've also added a few tests for the issue you've mentioned. Thanks again, I appreciate how much effort you put into reviewing!

v24-0001-Index-skip-scan.patch
Description: Binary data
RFC: seccomp-bpf support
SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall filtering mechanism which allows reduction of the kernel attack surface by preventing (or at least audit logging) normally unused syscalls.

Quoting from this link:
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt

  "A large number of system calls are exposed to every userland process
   with many of them going unused for the entire lifetime of the
   process. As system calls change and mature, bugs are found and
   eradicated. A certain subset of userland applications benefit by
   having a reduced set of available system calls. The resulting set
   reduces the total kernel surface exposed to the application. System
   call filtering is meant for use with those applications."

Recent security best-practices recommend, and certain highly security-conscious organizations are beginning to require, that SECCOMP be used to the extent possible. The major web browsers, container runtime engines, and systemd are all examples of software that already support seccomp.

-

A seccomp (bpf) filter is comprised of a default action, and a set of rules with actions pertaining to specific syscalls (possibly with even more specific sets of arguments). Once loaded into the kernel, a filter is inherited by all child processes and cannot be removed. It can, however, be overlaid with another filter. For any given syscall match, the most restrictive (a.k.a. highest precedence) action will be taken by the kernel.

PostgreSQL has already been run "in the wild" under seccomp control in containers, and possibly systemd. Adding seccomp support into PostgreSQL itself mitigates issues with these approaches, and has several advantages:

* Container seccomp filters tend to be extremely broad/permissive, typically allowing about 6 out of 7 of all syscalls. They must do this because the use cases for containers vary widely.

* systemd does not implement seccomp filters by default. Packagers may decide to do so, but there is no guarantee.
Adding them post install potentially requires cooperation by groups outside control of the database admins. * In the container and systemd case there is no particularly good way to inspect what filters are active. It is possible to observe actions taken, but again, control is possibly outside the database admin group. For example, the best way to understand what happened is to review the auditd log, which is likely not readable by the DBA. * With built-in support, it is possible to lock down backend processes more tightly than the postmaster. * With built-in support, it is possible to lock down different backend processes differently than each other, for example by using ALTER ROLE ... SET or ALTER DATABASE ... SET. * With built-in support, it is possible to calculate and return (in the form of an SRF) the effective filters being applied to the postmaster and the current backend. * With built-in support, it could be possible (this part not yet implemented) to have separate filters for different backend types, e.g. autovac workers, background writer, etc. - Attached is a patch for discussion, adding support for seccomp-bpf (nowadays generally just called seccomp) syscall filtering at configure-time using libseccomp. I would like to get this in shape to be committed by the end of the November CF if possible. The code itself has been through several rounds of revision based on discussions I have had with the author of libseccomp as well as a few other folks. However as of the moment: * Documentation - general discussion missing entirely * No regression tests - For convenience, here are a couple of additional links to relevant information regarding seccomp: https://en.wikipedia.org/wiki/Seccomp https://github.com/seccomp/libseccomp - Specific feedback requested: 1. Placement of pg_get_seccomp_filter() in src/backend/utils/adt/genfile.c originally made sense but after several rewrites no longer does. Ideas where it *should* go? 2. 
Where should a general discussion section go in the docs, if at all? 3. Currently this supports a global filter at the postmaster level, which is inherited by all child processes, and a secondary filter at the client backend session level. It likely makes sense to support secondary filters for other types of child processes, e.g. autovacuum workers, etc. Add that now (pg13), later release, or never? 4. What is the best way to approach testing of this feature? Tap testing perhaps? 5. Default GUC values - should we provide "starter" lists, or only a procedure for generating a list (as below). - Notes on usage: === In order to determine your minimally required allow lists, do something like the following on a non-production server with the same architecture as production: 0. Setup: * install libseccomp, libseccomp-dev, and seccomp * install auditd if not already installed * confi
Improve base backup protocol documentation
It was apparently entirely undocumented that the tablespace size estimates sent by the base backup protocol are in kilobytes. Here is a patch to document that. Also, a related clarification in the pg_basebackup.c source code: It was not clear without analyzing the whole stack that "totalsize" is in kilobytes and "totaldone" is in bytes. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 2b9c1b1c8397bf9f6613d0d134ad3fa19a9292b9 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 28 Aug 2019 16:46:54 +0200 Subject: [PATCH] Improve base backup protocol documentation Document that the tablespace sizes are in units of kilobytes. Make the pg_basebackup source code a bit clearer about this, too. --- doc/src/sgml/protocol.sgml| 4 ++-- src/bin/pg_basebackup/pg_basebackup.c | 14 +++--- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index b20f1690a7..f036d5f178 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2615,8 +2615,8 @@ Streaming Replication Protocol size (int8) - The approximate size of the tablespace, if progress report has - been requested; otherwise it's null. + The approximate size of the tablespace, in kilobytes (1024 bytes), + if progress report has been requested; otherwise it's null. 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c index 9207109ba3..498754eb32 100644 --- a/src/bin/pg_basebackup/pg_basebackup.c +++ b/src/bin/pg_basebackup/pg_basebackup.c @@ -115,7 +115,7 @@ static bool made_tablespace_dirs = false; static bool found_tablespace_dirs = false; /* Progress counters */ -static uint64 totalsize; +static uint64 totalsize_kb; static uint64 totaldone; static int tablespacecount; @@ -722,7 +722,7 @@ progress_report(int tablespacenum, const char *filename, bool force) return; /* Max once per second */ last_progress_report = now; - percent = totalsize ? (int) ((totaldone / 1024) * 100 / totalsize) : 0; + percent = totalsize_kb ? (int) ((totaldone / 1024) * 100 / totalsize_kb) : 0; /* * Avoid overflowing past 100% or the full size. This may make the total @@ -732,8 +732,8 @@ progress_report(int tablespacenum, const char *filename, bool force) */ if (percent > 100) percent = 100; - if (totaldone / 1024 > totalsize) - totalsize = totaldone / 1024; + if (totaldone / 1024 > totalsize_kb) + totalsize_kb = totaldone / 1024; /* * Separate step to keep platform-dependent format code out of @@ -742,7 +742,7 @@ progress_report(int tablespacenum, const char *filename, bool force) */ snprintf(totaldone_str, sizeof(totaldone_str), INT64_FORMAT, totaldone / 1024); - snprintf(totalsize_str, sizeof(totalsize_str), INT64_FORMAT, totalsize); + snprintf(totalsize_str, sizeof(totalsize_str), INT64_FORMAT, totalsize_kb); #define VERBOSE_FILENAME_LENGTH 35 if (verbose) @@ -1942,11 +1942,11 @@ BaseBackup(void) /* * Sum up the total size, for progress reporting */ - totalsize = totaldone = 0; + totalsize_kb = totaldone = 0; tablespacecount = PQntuples(res); for (i = 0; i < PQntuples(res); i++) { - totalsize += atol(PQgetvalue(res, i, 2)); + totalsize_kb += atol(PQgetvalue(res, i, 2)); /* * Verify tablespace directories are empty. Don't bother with the -- 2.22.0
Re: assertion at postmaster start
On 2019-Aug-26, Tom Lane wrote:
> I wrote:
> > I propose the attached. I'm inclined to think that the risk/benefit
> > of back-patching this is not very good, so I just want to stick it in
> > HEAD, unless somebody can explain why dead_end children are likely to
> > crash in the field.
>
> Pushed at ee3278239.
>
> I'm still curious as to the explanation for a dead_end child exiting
> with code 15, but I have no way to pursue the point.

Many thanks for all the investigation and fix! Sadly, I have *no* idea what could have happened that would have caused a connection at that point (my start scripts don't do it). It is possible that I had a terminal running some shell loop on psql ("watch psql -c something" perhaps). But I'm sure I didn't notice that when I reported this, or I would have mentioned it. However, I have no idea why it would have died with code 15. From my notes of what I was doing that day, I can't find any evidence that I would have had anything in shared_preload_libraries. (I don't have Frost's complete timestamped shell history, however.)

--
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Improve error detections in TAP tests by spreading safe_psql
Michael Paquier writes: > This is a follow-up of the discussion which happened here after Tom > committed fb57f40: > https://www.postgresql.org/message-id/20190828012439.ga1...@paquier.xyz > I have reviewed the TAP tests, and we have many more spots where it > is better to use PostgresNode::safe_psql instead of PostgresNode::psql, so > that the test dies immediately if there is a failure on a query which > should never fail. After a brief review of node->psql calls in the TAP tests, I'm of the opinion that what we should do is revert fb57f40 and instead change PostgresNode::psql so that on_error_die defaults to 1, then fix the *very* small number of callers that need it to be zero. Otherwise we're just going to be fighting this same fire forevermore. Moreover, that's going to lead to a much smaller patch. > Attached are the spots I have found. Any thoughts or opinions? Well, for instance, you missed SSLServer.pm, in which every single one of the psql calls is wrong. If we go this route we'd have to back-patch the change, else it'd be a serious hazard for test case back-patching. But the buildfarm would, more or less by definition, find any oversights --- so that doesn't scare me much. regards, tom lane
Doubt regarding logical decoding
I'm pretty new to this, and I find it hard to make sense of the source code. I'm trying to understand logical decoding. In the `HEAP2` RMGR, what does the HEAP2_NEW_CID operation do? Can somebody give an example of when this type of record is written to WAL files? When logical decoding decodes this record, what does it mean to it? Is it a transaction with sub-transactions, which does some DDL changes? -- Thank you
psql - improve test coverage from 41% to 88%
Hello devs, The attached patch improves psql code coverage by adding a specific TAP test. The 1709 tests take 4 seconds of CPU time (6.3 seconds elapsed) on my laptop. The infrastructure is updated to require the perl module "Expect", which allows testing interactive features such as tab completion and prompt changes. Coverage in "src/bin/psql" jumps from 40.0% of lines and 41.9% of functions to 78.4% of lines and 98.1% of functions with "check-world". -- Fabien.
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl index b7d36b65dd..aabc27ab6f 100644 --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl @@ -506,6 +506,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt" ], 1, + '', [qr{^$}], [qr/^WARNING.*checksum verification failed/s], 'pg_basebackup reports checksum mismatch'); @@ -526,6 +527,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt2" ], 1, + '', [qr{^$}], [qr/^WARNING.*further.*failures.*will.not.be.reported/s], 'pg_basebackup does not report more than 5 checksum mismatches'); @@ -542,6 +544,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt3" ], 1, + '', [qr{^$}], [qr/^WARNING.*7 total checksum verification failures/s], 'pg_basebackup correctly report the total number of checksum mismatches'); diff --git a/src/bin/pg_checksums/t/002_actions.pl b/src/bin/pg_checksums/t/002_actions.pl index 59228b916c..cf4811d382 100644 --- a/src/bin/pg_checksums/t/002_actions.pl +++ b/src/bin/pg_checksums/t/002_actions.pl @@ -62,6 +62,7 @@ sub check_relation_corruption '--filenode', $relfilenode_corrupted ], 1, + '', [qr/Bad checksums:.*1/], [qr/checksum verification failed/], "fails with corrupted data for single relfilenode on tablespace $tablespace" @@
-71,6 +72,7 @@ sub check_relation_corruption $node->command_checks_all( [ 'pg_checksums', '--check', '-D', $pgdata ], 1, + '', [qr/Bad checksums:.*1/], [qr/checksum verification failed/], "fails with corrupted data on tablespace $tablespace"); @@ -203,6 +205,7 @@ sub fail_corrupt $node->command_checks_all( [ 'pg_checksums', '--check', '-D', $pgdata ], 1, + '', [qr/^$/], [qr/could not read block 0 in file.*$file\":/], "fails for corrupted data in $file"); diff --git a/src/bin/pg_controldata/t/001_pg_controldata.pl b/src/bin/pg_controldata/t/001_pg_controldata.pl index 3b63ad230f..ebe9b80a52 100644 --- a/src/bin/pg_controldata/t/001_pg_controldata.pl +++ b/src/bin/pg_controldata/t/001_pg_controldata.pl @@ -33,6 +33,7 @@ close $fh; command_checks_all( [ 'pg_controldata', $node->data_dir ], 0, + '', [ qr/WARNING: Calculated CRC checksum does not match value stored in file/, qr/WARNING: invalid WAL segment size/ diff --git a/src/bin/pg_resetwal/t/002_corrupted.pl b/src/bin/pg_resetwal/t/002_corrupted.pl index f9940d7fc5..1990669d26 100644 --- a/src/bin/pg_resetwal/t/002_corrupted.pl +++ b/src/bin/pg_resetwal/t/002_corrupted.pl @@ -30,6 +30,7 @@ close $fh; command_checks_all( [ 'pg_resetwal', '-n', $node->data_dir ], 0, + '', [qr/pg_control version number/], [ qr/pg_resetwal: warning: pg_control exists but is broken or wrong version; ignoring it/ @@ -46,6 +47,7 @@ close $fh; command_checks_all( [ 'pg_resetwal', '-n', $node->data_dir ], 0, + '', [qr/pg_control version number/], [ qr/\Qpg_resetwal: warning: pg_control specifies invalid WAL segment size (0 bytes); proceed with caution\E/ diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl index b82d3f65c4..01010914fe 100644 --- a/src/bin/pgbench/t/001_pgbench_with_server.pl +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl @@ -50,7 +50,7 @@ sub pgbench push @cmd, @args; - $node->command_checks_all(\@cmd, $stat, $out, $err, $name); + $node->command_checks_all(\@cmd, 
$stat, '', $out, $err, $name); # cleanup? #unlink @filenames or die "cannot unlink files (@filenames): $!"; diff --git a/src/bin/pgbench/t/002_pgbench_no_server.pl b/src/bin/pgbench/t/002_pgbench_no_server.pl index f7fa18418b..b58f3548c3 100644 --- a/src/bin/pgbench/t/002_pgbench_no_server.pl +++ b/src/bin/pgbench/t/002_pgbench_no_server.pl @@ -25,7 +25,7 @@ sub pgbench my ($opts, $stat, $out, $err, $name) = @_; print STDERR "opts=$opts, stat=$stat, out=$out, err=$err, name=$name"; command_checks_all([ 'pgbench', split(/\s+/, $opts) ], - $stat, $out, $err, $name); + $stat, '', $out, $err, $name); return; } @@ -52,7 +52,7 @@ sub pgbench_scripts push @cmd, '-f', $filename; } } - command_checks_all(\@cmd, $stat, $out, $err, $name); + command_checks_all(\@cmd, $stat, '', $out, $err, $name); return; } diff --git a/src/b
Re: Crash in BRIN summarization
Heikki Linnakangas writes: > I bumped into a little bug in BRIN, while hacking on something > unrelated. This causes a segfault, or an assertion failure if assertions > are enabled: Good catch. > Fix attached. Hm, I don't particularly care for directly comparing Datum values like that. We don't do that elsewhere (I think) and it feels like a type violation to me. Since you've already tested for !attbyval, I think it'd be legitimate to extract the pointer values and compare those, eg if (!attr->attbyval && DatumGetPointer(result) != DatumGetPointer(column->bv_values[INCLUSION_UNION])) I realize that this makes no difference in terms of the machine code (at least for most machines?) but it avoids breaking the Datum abstraction. regards, tom lane
Re: pg_get_databasebyid(oid)
> Please add that to commitfest. Done: https://commitfest.postgresql.org/24/2261/ regards, Sergei
Re: pg_get_databasebyid(oid)
On Wed, Aug 28, 2019 at 5:38 PM Sergei Kornilov wrote: > Hello > We already have the function pg_get_userbyid(oid), which does a lookup in > the pg_authid catalog. My colleague asked me whether we can add a similar > function, pg_get_databasebyid(oid), with a lookup in pg_database. > It is a simple function to get a database name by OID, falling back to > 'unknown (OID=n)' if it is missing. > > The proposed patch is attached. Currently it lacks tests - I am unsure > which file in src/test/regress/sql/ is the most suitable; pg_get_userbyid > is called from privileges.sql only. > > regards, Sergei Please add that to commitfest. -- Ibrar Ahmed
pg_get_databasebyid(oid)
Hello We already have the function pg_get_userbyid(oid), which does a lookup in the pg_authid catalog. My colleague asked me whether we can add a similar function, pg_get_databasebyid(oid), with a lookup in pg_database. It is a simple function to get a database name by OID, falling back to 'unknown (OID=n)' if it is missing. The proposed patch is attached. Currently it lacks tests - I am unsure which file in src/test/regress/sql/ is the most suitable; pg_get_userbyid is called from privileges.sql only. regards, Sergei
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index c878a0ba4d..5ed5b3ac39 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -18333,6 +18333,10 @@ SELECT pg_type_is_visible('myschema.widget'::regtype); pg_get_userbyid + +pg_get_databasebyid + + pg_get_viewdef @@ -18513,6 +18517,11 @@ SELECT pg_type_is_visible('myschema.widget'::regtype); name get role name with given OID + + pg_get_databasebyid(db_oid) + name + get database name with given OID + pg_get_viewdef(view_name) text @@ -18703,8 +18712,9 @@ SELECT currval(pg_get_serial_sequence('sometable', 'id')); - pg_get_userbyid extracts a role's name given - its OID. + pg_get_userbyid and + pg_get_databasebyid extract a role's or a + database's name, respectively, given its OID.
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 3e64390d81..214a081555 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -31,6 +31,7 @@ #include "catalog/pg_authid.h" #include "catalog/pg_collation.h" #include "catalog/pg_constraint.h" +#include "catalog/pg_database.h" #include "catalog/pg_depend.h" #include "catalog/pg_language.h" #include "catalog/pg_opclass.h" @@ -2474,6 +2475,41 @@ pg_get_userbyid(PG_FUNCTION_ARGS) PG_RETURN_NAME(result); } +/* -- + * get_databasebyid - Get a database name by oid and + * fallback to 'unknown (OID=n)' + * -- + */ +Datum +pg_get_databasebyid(PG_FUNCTION_ARGS) +{ + Oid dbid = PG_GETARG_OID(0); + Name result; + HeapTuple dbtup; + Form_pg_database dbrec; + + /* + * Allocate space for the result + */ + result = (Name) palloc(NAMEDATALEN); + memset(NameStr(*result), 0, NAMEDATALEN); + + /* + * Get the pg_database entry and print the result + */ + dbtup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(dbid)); + if (HeapTupleIsValid(dbtup)) + { + dbrec = (Form_pg_database) GETSTRUCT(dbtup); + StrNCpy(NameStr(*result), NameStr(dbrec->datname), NAMEDATALEN); + ReleaseSysCache(dbtup); + } + else + sprintf(NameStr(*result), "unknown (OID=%u)", dbid); + + PG_RETURN_NAME(result); +} + /* * pg_get_serial_sequence diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index cf1f409351..4f1c55c3c7 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -3573,6 +3573,9 @@ { oid => '1642', descr => 'role name by OID (with fallback)', proname => 'pg_get_userbyid', provolatile => 's', prorettype => 'name', proargtypes => 'oid', prosrc => 'pg_get_userbyid' }, +{ oid => '9978', descr => 'database name by OID (with fallback)', + proname => 'pg_get_databasebyid', provolatile => 's', prorettype => 'name', + proargtypes => 'oid', prosrc => 'pg_get_databasebyid' }, { oid => '1643', descr => 'index description', proname => 
'pg_get_indexdef', provolatile => 's', prorettype => 'text', proargtypes => 'oid', prosrc => 'pg_get_indexdef' },
Re: [HACKERS] advanced partition matching algorithm for partition-wise join
On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita wrote: > It seems that I performed the above tests on an assertion-enabled > build. :( So I executed the tests one more time. Here are the > results. > > * 2-way self-join of pt: explain analyze select * from pt t0, pt t1 > where t0.a = t1.a; > - HEAD: > Planning Time: 0.969 ms > Execution Time: 13.843 ms > - with patch: > Planning Time: 1.720 ms > Execution Time: 14.393 ms > - with patch plus attached: > Planning Time: 1.630 ms > Execution Time: 14.002 ms > > * 4-way self-join of pt: explain analyze select * from pt t0, pt t1, > pt t2, pt t3 where t0.a = t1.a > > and t1.a = t2.a and t2.a = t3.a; > - HEAD: > Planning Time: 12.203 ms > Execution Time: 31.784 ms > - with patch: > Planning Time: 32.102 ms > Execution Time: 32.504 ms > - with patch plus attached: > Planning Time: 19.471 ms > Execution Time: 32.582 ms > > * 8-way self-join of pt: explain analyze select * from pt t0, pt t1, > pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a = > t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a > and t6.a = t7.a; > - HEAD: > Planning Time: 948.131 ms > Execution Time: 55.645 ms > - with patch: > Planning Time: 2939.813 ms > Execution Time: 56.760 ms > - with patch plus attached: > Planning Time: 1108.076 ms > Execution Time: 55.750 ms > > Note: the attached patch still uses the proposed partition matching > algorithm for these queries. As I said before, these queries don't > need that algorithm, so we could eliminate the planning overhead > compared to HEAD, by planning these queries as before, perhaps, but I > haven't modified the patch as such yet. I modified the patch further as such. Attached is an updated version of the patch created on top of the patch in [1]. I did the tests again using the updated version of the patch. 
Here are the results: * 2-way self-join of pt: Planning Time: 1.043 ms Execution Time: 13.931 ms * 4-way self-join of pt: Planning Time: 12.499 ms Execution Time: 32.392 ms * 8-way self-join of pt: Planning Time: 968.412 ms Execution Time: 56.328 ms The planning time for each test case still increased slightly, but I think that would be acceptable. To see the efficiency of the attached, I did another test with test cases that really need the new partition-matching algorithm: * explain analyze select * from pt6 t6, pt7 t7 where t6.a = t7.a; - base patch in [1] Planning Time: 1.758 ms Execution Time: 13.977 ms - with attached Planning Time: 1.777 ms Execution Time: 13.959 ms * explain analyze select * from pt4 t4, pt5 t5, pt6 t6, pt7 t7 where t4.a = t5.a and t5.a = t6.a and t6.a = t7.a; - base patch in [1] Planning Time: 33.201 ms Execution Time: 32.480 ms - with attached Planning Time: 21.019 ms Execution Time: 32.777 ms * explain analyze select * from pt0 t0, pt1 t1, pt2 t2, pt3 t3, pt4 t4, pt5 t5, pt6 t6, pt7 t7 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a and t6.a = t7.a; - base patch in [1] Planning Time: 3060.000 ms Execution Time: 55.553 ms - with attached Planning Time: 1144.996 ms Execution Time: 56.233 ms where pt0, pt1, pt2, pt3, pt4, pt5, pt6 and pt7 are list partitioned tables that have slightly different list values. (The structure and list values of ptN are almost the same as that of pt used above, but ptN's N-th partition ptNpN has an extra list value that pt's N-th partition ptpN doesn't have.) If anyone is interested in this testing, I'll send a script file for producing these list partitioned tables. About the attached: * The attached patch modified try_partitionwise_join() so that we call partition_bounds_equal() then partition_bounds_merge() (if necessary) to create the partition bounds for the join rel. 
We don't support merging the partition bounds in the hash-partitioning case, which makes the code handling that case in partition_bounds_merge() completely unnecessary, so I removed it entirely. * I removed this assertion in partition_bounds_merge(), because I think this is covered by two assertions above this. + Assert((*outer_parts == NIL || *inner_parts != NIL) && + (*inner_parts == NIL || *outer_parts != NIL)); * (I forgot to mention this in a previous email, but) I removed this bit of generate_matching_part_pairs(), because we already have the same check in try_partitionwise_join(), so this bit would be redundant IIUC. + switch (jointype) + { + case JOIN_INNER: + case JOIN_SEMI: + + /* +* An inner or semi join can not return any row when the +* matching partition on either side is missing. We should +* have eliminated all such cases while merging the bounds. +*/ + Assert(part1 >= 0 && part2 >= 0); + break; + + case JOIN_LEFT: + case JOIN_ANTI: + A
[no subject]
Hello, I hear that it is not recommended to use pg_resetwal with --wal-segsize to increase the WAL segment size. Is there any more detailed information about this? What effects could it have? Is it possible to change it while the server is fully offline? Regards, Paul
Re: Performance improvement of WAL writing?
On Wed, Aug 28, 2019 at 2:43 AM Moon, Insung wrote: > So what about modifying the XLogWrite function to write only the size > that needs to be recorded? > Could this idea benefit WAL writing performance? > If it's OK to improve, I want to make the modification. > What do you think of it? I have performed tests in this direction several years ago. The way it works now guarantees that the data of recycled WAL segments is never read from disk, as long as the WAL block size is a multiple of the filesystem's page/fragment size. The OS sees that the write() is on a fragment boundary and full fragment(s) in size. If the write() were smaller, the OS would be forced to combine the new data with the rest of the fragment's old data on disk. To do so, the system would have to wait until the old fragment is actually in the OS buffer. This means that once you have enough WAL segments to cycle through so that the checkpoint reason is never XLOG, and the blocks of the WAL segment being cycled in have been evicted since they were last used, this causes real reads. On spinning media, which are still an excellent choice for WAL, this turns into a total disaster, because it adds a rotational delay for every single WAL block that is only partially overwritten. I believe that this could only work if we stopped recycling WAL segments and instead deleted old segments, then created new files as needed. This, however, adds the overhead of constant metadata changes to WAL writing, and I have no idea what performance or reliability impact that may have. There were reasons we chose to implement WAL segment recycling many years ago. These reasons may no longer be valid on modern filesystems, but it definitely is not a performance question alone. Regards, Jan -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: Crash in BRIN summarization
Thank you for your fix. > This assumes that the merge function returns a newly-palloc'd value. > That's a shaky assumption; if one of the arguments is an empty range, > range_merge() returns the other argument, rather than a newly > constructed value. And surely we can't assume assume that for > user-defined opclasses. Your analysis looks right to me. > brin_inclusion_union() has a similar issue, but I didn't write a script > to reproduce that. Fix attached. I am not sure about this part. If it's okay to use col_a->bv_values without copying, it should also be okay to use col_b->bv_values, no?
Converting Nested loop to hashjoin for not is distinct from case
Hi, I have a join query with an 'is not distinct from' criterion (select * from table1 join table2 on a is not distinct from b). On examining it, it was the nested loop join that made the query slow. For a join with an 'is not distinct from' qual, can't the planner choose a hash join over a nested loop, or would there be a problem in doing so? Kindly enlighten me about this. Thanks Pradeep
Re: A problem about partitionwise join
Hi, On Tue, Aug 27, 2019 at 4:57 PM Richard Guo wrote: > Check the query below as a more illustrative example: > > create table p (k int, val int) partition by range(k); > create table p_1 partition of p for values from (1) to (10); > create table p_2 partition of p for values from (10) to (100); > > If we use quals 'foo.k = bar.k and foo.k = bar.val', we can generate > partitionwise join: > > # explain (costs off) > select * from p as foo join p as bar on foo.k = bar.k and foo.k = bar.val; >QUERY PLAN > - > Append >-> Hash Join > Hash Cond: (foo.k = bar.k) > -> Seq Scan on p_1 foo > -> Hash >-> Seq Scan on p_1 bar > Filter: (k = val) >-> Hash Join > Hash Cond: (foo_1.k = bar_1.k) > -> Seq Scan on p_2 foo_1 > -> Hash >-> Seq Scan on p_2 bar_1 > Filter: (k = val) > (13 rows) > > But if we exchange the order of the two quals to 'foo.k = bar.val and > foo.k = bar.k', then partitionwise join cannot be generated any more, > because we only have joinclause 'foo.k = bar.val' as it first reached > score of 3. We have missed the joinclause on the partition key although > it does exist. > > # explain (costs off) > select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k; >QUERY PLAN > - > Hash Join >Hash Cond: (foo.k = bar.val) >-> Append > -> Seq Scan on p_1 foo > -> Seq Scan on p_2 foo_1 >-> Hash > -> Append >-> Seq Scan on p_1 bar > Filter: (val = k) >-> Seq Scan on p_2 bar_1 > Filter: (val = k) > (11 rows) I think it would be nice if we can address this issue. Best regards, Etsuro Fujita
Re: Statement timeout in pg_rewind
Hi, Thank you, Michael, for committing it! On Wed, 28 Aug 2019 at 04:51, Michael Paquier wrote: > note that your patch had a warning as "result" is not needed in > run_simple_command(). Ohh, sorry about that. I compiled the whole source tree and the warning was buried among other output :( Regards, -- Alexander Kukushkin
Re: Comment in ginpostinglist.c doesn't match code
On 23/08/2019 11:44, Masahiko Sawada wrote: On Fri, Aug 23, 2019 at 7:05 AM Ashwin Agrawal wrote: On Thu, Aug 22, 2019 at 1:14 AM Heikki Linnakangas wrote: The patch also includes a little unit test module to test this without creating a 16 TB table. A whole new test module seems a bit like overkill just for this, but clearly we were missing test coverage here. And it will come in handy if we want to invent a new, better posting list format in the future. Thoughts on whether to include the test module or not? I like the test, as it importantly adds missing coverage. Also, it really simplifies the validation effort if a change is required in this area anytime in the future. So, I would +1 keeping it. I'd +1 too. It's valuable to test a hard-to-reproduce case. I often want to do such unit tests at a cheaper cost, though. Ok, including it seems to be the consensus. BTW, it's not related to this patch, but I got confused about where the "17 bits" in the following paragraph in ginpostinglist.c comes from. If we use only 43 bits out of a 64-bit unsigned integer, we have 21 bits left. * For encoding purposes, item pointers are represented as 64-bit unsigned * integers. The lowest 11 bits represent the offset number, and the next * lowest 32 bits are the block number. That leaves 17 bits unused, i.e. * only 43 low bits are used. Huh, you're right. I think it should be "leaves 21 bits unused". I suspect the patch used 15 bits for the offset number early in the development, which would've left 17 bits unused, and when it was later changed to use only 11 bits, that comment was neglected. Fixed that comment, too, and pushed. Thanks! - Heikki
Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Thank you Fujita San for the enhancement, will have a look. Regards, Amul On Wed, Aug 28, 2019 at 3:52 PM Etsuro Fujita wrote: > On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita > wrote: > > It seems that I performed the above tests on an assertion-enabled > > build. :( So I executed the tests one more time. Here are the > > results. > > > > * 2-way self-join of pt: explain analyze select * from pt t0, pt t1 > > where t0.a = t1.a; > > - HEAD: > > Planning Time: 0.969 ms > > Execution Time: 13.843 ms > > - with patch: > > Planning Time: 1.720 ms > > Execution Time: 14.393 ms > > - with patch plus attached: > > Planning Time: 1.630 ms > > Execution Time: 14.002 ms > > > > * 4-way self-join of pt: explain analyze select * from pt t0, pt t1, > > pt t2, pt t3 where t0.a = t1.a > > > > and t1.a = t2.a and t2.a = t3.a; > > - HEAD: > > Planning Time: 12.203 ms > > Execution Time: 31.784 ms > > - with patch: > > Planning Time: 32.102 ms > > Execution Time: 32.504 ms > > - with patch plus attached: > > Planning Time: 19.471 ms > > Execution Time: 32.582 ms > > > > * 8-way self-join of pt: explain analyze select * from pt t0, pt t1, > > pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a = > > t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a > > and t6.a = t7.a; > > - HEAD: > > Planning Time: 948.131 ms > > Execution Time: 55.645 ms > > - with patch: > > Planning Time: 2939.813 ms > > Execution Time: 56.760 ms > > - with patch plus attached: > > Planning Time: 1108.076 ms > > Execution Time: 55.750 ms > > > > Note: the attached patch still uses the proposed partition matching > > algorithm for these queries. As I said before, these queries don't > > need that algorithm, so we could eliminate the planning overhead > > compared to HEAD, by planning these queries as before, perhaps, but I > > haven't modified the patch as such yet. > > I modified the patch further as such. 
Attached is an updated version > of the patch created on top of the patch in [1]. I did the tests > again using the updated version of the patch. Here are the results: > > * 2-way self-join of pt: > Planning Time: 1.043 ms > Execution Time: 13.931 ms > > * 4-way self-join of pt: > Planning Time: 12.499 ms > Execution Time: 32.392 ms > > * 8-way self-join of pt: > Planning Time: 968.412 ms > Execution Time: 56.328 ms > > The planning time for each test case still increased slightly, but IMO > I think that would be acceptable. To see the efficiency of the > attached, I did another testing with test cases that really need the > new partition-matching algorithm: > > * explain analyze select * from pt6 t6, pt7 t7 where t6.a = t7.a; > - base patch in [1] > Planning Time: 1.758 ms > Execution Time: 13.977 ms > - with attached > Planning Time: 1.777 ms > Execution Time: 13.959 ms > > * explain analyze select * from pt4 t4, pt5 t5, pt6 t6, pt7 t7 where > t4.a = t5.a and t5.a = t6.a and t6.a = t7.a; > - base patch in [1] > Planning Time: 33.201 ms > Execution Time: 32.480 ms > - with attached > Planning Time: 21.019 ms > Execution Time: 32.777 ms > > * explain analyze select * from pt0 t0, pt1 t1, pt2 t2, pt3 t3, pt4 > t4, pt5 t5, pt6 t6, pt7 t7 where t0.a = t1.a and t1.a = t2.a and t2.a > = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a and t6.a = > t7.a; > - base patch in [1] > Planning Time: 3060.000 ms > Execution Time: 55.553 ms > - with attached > Planning Time: 1144.996 ms > Execution Time: 56.233 ms > > where pt0, pt1, pt2, pt3, pt4, pt5, pt6 and pt7 are list partitioned > tables that have slighly different list values. (The structure and > list values of ptN are almost the same as that of pt used above, but > ptN's N-th partition ptNpN has an extra list value that pt's N-th > partition ptpN doesn't have.) If anyone is interested in this > testing, I'll send a script file for producing these list partitioned > tables. 
> > About the attached: > > * The attached patch modified try_partitionwise_join() so that we call > partition_bounds_equal() then partition_bounds_merge() (if necessary) > to create the partition bounds for the join rel. We don't support for > merging the partition bounds for the hash-partitioning case, so this > makes code to handle the hash-partitioning case in > partition_bounds_merge() completely unnecessary, so I removed that > entirely. > > * I removed this assertion in partition_bounds_merge(), because I > think this is covered by two assertions above this. > > + Assert((*outer_parts == NIL || *inner_parts != NIL) && > + (*inner_parts == NIL || *outer_parts != NIL)); > > * (I forgot to mention this in a previous email, but) I removed this > bit of generate_matching_part_pairs(), because we already have the > same check in try_partitionwise_join(), so this bit would be redundant > IIUC. > > + switch (jointype) > + { > + case JOIN_INNER: > + case JOIN_SEMI: > + > + /* > +
Re: allow online change primary_conninfo
Hello Updated patch attached. (I also merged everything into one file) > + > + WAL receiver will be restarted after primary_slot_name > + was changed. > > The sentence sounds strange. Here is a suggestion: > The WAL receiver is restarted after an update of primary_slot_name (or > primary_conninfo). Changed. > The comment at the top of the call of ProcessStartupSigHup() in > HandleStartupProcInterrupts() needs to be updated as it mentions a > configuration file re-read, but that's not the case anymore. Changed to "Process any requests or signals received recently.", like in other places, e.g. the syslogger. > pendingRestartSource's name does not represent what it does, as it is > only associated with the restart of a WAL receiver when in streaming > state, and that's a no-op for the archive mode and the local mode. I renamed it to pendingWalRcvRestart and replaced the switch with a simple condition. > So when shutting down the WAL receiver after a parameter change, what > happens is that the startup process waits for retrieve_retry_interval > before moving back to the archive mode. Only after scanning again the > archives do we restart a new WAL receiver. As I answered earlier, there is no switch to archive mode or wal_retrieve_retry_interval waiting in my patch. 
I recheck on current revision too: 2019-08-28 12:16:27.295 MSK 11180 @ from [vxid: txid:0] [] DEBUG: sending write 0/30346C8 flush 0/30346C8 apply 0/30346C8 2019-08-28 12:16:27.493 MSK 11172 @ from [vxid: txid:0] [] LOG: received SIGHUP, reloading configuration files 2019-08-28 12:16:27.494 MSK 11172 @ from [vxid: txid:0] [] LOG: parameter "primary_conninfo" changed to "host='/tmp' port= sslmode=disable sslcompression=0 gssencmode=disable target_session_attrs=any" 2019-08-28 12:16:27.496 MSK 11173 @ from [vxid:1/0 txid:0] [] LOG: The WAL receiver is going to be restarted due to change of primary_conninfo 2019-08-28 12:16:27.496 MSK 11176 @ from [vxid: txid:0] [] DEBUG: checkpointer updated shared memory configuration values 2019-08-28 12:16:27.496 MSK 11180 @ from [vxid: txid:0] [] FATAL: terminating walreceiver process due to administrator command 2019-08-28 12:16:27.500 MSK 11335 @ from [vxid: txid:0] [] LOG: started streaming WAL from primary at 0/300 on timeline 1 2019-08-28 12:16:27.500 MSK 11335 @ from [vxid: txid:0] [] DEBUG: sendtime 2019-08-28 12:16:27.50037+03 receipttime 2019-08-28 12:16:27.500821+03 replication apply delay 0 ms transfer latency 0 ms No "DEBUG: switched WAL source from stream to archive after failure" messages, no time difference (wal_retrieve_retry_interval = 5s). regards, Sergeidiff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 89284dc5c0..b36749fe91 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3929,9 +3929,13 @@ ANY num_sync ( @@ -3946,9 +3950,13 @@ ANY num_sync ( ). - This parameter can only be set at server start. + This parameter can only be set in the postgresql.conf + file or on the server command line. This setting has no effect if primary_conninfo is not - set. + set or the server is not in standby mode. + + + The WAL receiver is restarted after an update of primary_slot_name. 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e651a841bb..4eed462f34 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -803,6 +803,12 @@ static XLogSource readSource = 0; /* XLOG_FROM_* code */ static XLogSource currentSource = 0; /* XLOG_FROM_* code */ static bool lastSourceFailed = false; +/* + * Need for restart running WalReceiver due the configuration change. + * Suitable only for XLOG_FROM_STREAM source + */ +static bool pendingWalRcvRestart = false; + typedef struct XLogPageReadPrivate { int emode; @@ -11787,48 +11793,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, if (!StandbyMode) return false; - /* - * If primary_conninfo is set, launch walreceiver to try - * to stream the missing WAL. - * - * If fetching_ckpt is true, RecPtr points to the initial - * checkpoint location. In that case, we use RedoStartLSN - * as the streaming start position instead of RecPtr, so - * that when we later jump backwards to start redo at - * RedoStartLSN, we will have the logs streamed already. - */ - if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0) - { - XLogRecPtr ptr; - TimeLineID tli; - - if (fetching_ckpt) - { - ptr = RedoStartLSN; - tli = ControlFile->checkPointCopy.ThisTimeLineID; - } - else - { - ptr = RecPtr; - - /* - * Use the record begin position to determine the - * TLI, rather than the position we're reading. - */ - tli = tliOfPointInHistory(tliRecPtr, expectedTLEs); - -
Crash in BRIN summarization
I bumped into a little bug in BRIN while hacking on something unrelated. This causes a segfault, or an assertion failure if assertions are enabled:

CREATE TABLE brintest (n numrange);
CREATE INDEX brinidx ON brintest USING brin (n);
INSERT INTO brintest VALUES ('empty');
INSERT INTO brintest VALUES (numrange(0, 2^1000::numeric));
INSERT INTO brintest VALUES ('(-1, 0)');
SELECT brin_desummarize_range('brinidx', 0);
SELECT brin_summarize_range('brinidx', 0);

gdb backtrace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x560fef71f4b2 in GetMemoryChunkContext (pointer=0x7fed65a8fc30) at ../../../../src/include/utils/memutils.h:129
129		AssertArg(MemoryContextIsValid(context));
(gdb) bt
#0  0x560fef71f4b2 in GetMemoryChunkContext (pointer=0x7fed65a8fc30) at ../../../../src/include/utils/memutils.h:129
#1  0x560fef721290 in pfree (pointer=0x7fed65a8fc30) at mcxt.c:1033
#2  0x560fef0bd13f in brin_inclusion_add_value (fcinfo=0x7ffc4cd36220) at brin_inclusion.c:239
#3  0x560fef6ee543 in FunctionCall4Coll (flinfo=0x560ff05d0cf0, collation=0, arg1=94626457109880, arg2=94626457868896, arg3=140657589550088, arg4=0) at fmgr.c:1214
#4  0x560fef0b39b7 in brinbuildCallback (index=0x7fed64f37460, htup=0x560ff0685d68, values=0x7ffc4cd365d0, isnull=0x7ffc4cd365b0, tupleIsAlive=true, brstate=0x560ff06859e0) at brin.c:650
#5  0x560fef12e418 in heapam_index_build_range_scan (heapRelation=0x7fed64f37248, indexRelation=0x7fed64f37460, indexInfo=0x560ff0685b48, allow_sync=false, anyvisible=true, progress=false, start_blockno=0, numblocks=1, callback=0x560fef0b37c9, callback_state=0x560ff06859e0, scan=0x560ff0685d18) at heapam_handler.c:1663
#6  0x560fef0b2858 in table_index_build_range_scan (table_rel=0x7fed64f37248, index_rel=0x7fed64f37460, index_info=0x560ff0685b48, allow_sync=false, anyvisible=true, progress=false, start_blockno=0, numblocks=1, callback=0x560fef0b37c9, callback_state=0x560ff06859e0, scan=0x0) at ../../../../src/include/access/tableam.h:1544
#7  0x560fef0b4dac in summarize_range (indexInfo=0x560ff0685b48, state=0x560ff06859e0, heapRel=0x7fed64f37248, heapBlk=0, heapNumBlks=1) at brin.c:1240
#8  0x560fef0b50d9 in brinsummarize (index=0x7fed64f37460, heapRel=0x7fed64f37248, pageRange=0, include_partial=true, numSummarized=0x7ffc4cd36928, numExisting=0x0) at brin.c:1375
#9  0x560fef0b43bf in brin_summarize_range (fcinfo=0x560ff06840d0) at brin.c:933
#10 0x560fef339acc in ExecInterpExpr (state=0x560ff0683fe8, econtext=0x560ff0683ce8, isnull=0x7ffc4cd36c17) at execExprInterp.c:650

Analysis: This is a memory management issue in the brin_inclusion_add_value() function:

	/* Finally, merge the new value to the existing union. */
	finfo = inclusion_get_procinfo(bdesc, attno, PROCNUM_MERGE);
	Assert(finfo != NULL);
	result = FunctionCall2Coll(finfo, colloid,
							   column->bv_values[INCLUSION_UNION],
							   newval);
	if (!attr->attbyval)
		pfree(DatumGetPointer(column->bv_values[INCLUSION_UNION]));
	column->bv_values[INCLUSION_UNION] = result;

This assumes that the merge function returns a newly-palloc'd value. That's a shaky assumption: if one of the arguments is an empty range, range_merge() returns the other argument, rather than a newly constructed value. And surely we can't assume that for user-defined opclasses. When that happens, we store the 'newval' directly in column->bv_values[INCLUSION_UNION], but it is allocated in a short-lived memory context (or points directly to a buffer in the buffer cache). On the next call, we try to pfree() the old value, but the memory context that contained it was already reset.

It took a while to work out the script to reproduce it. The first value passed to brin_inclusion_add_value() must be an empty range, so that column->bv_values[INCLUSION_UNION] is set to an empty range. On the second call, the new value must be non-empty, so that range_merge() returns 'newval' unchanged.
And it must be a very large value, because if it uses a short varlena header, the PG_GETARG_RANGE_P() call in range_merge() will make a copy of it.

brin_inclusion_union() has a similar issue, but I didn't write a script to reproduce that.

Fix attached.

- Heikki

diff --git a/src/backend/access/brin/brin_inclusion.c b/src/backend/access/brin/brin_inclusion.c
index 86788024ef..cbe14cb3b1 100644
--- a/src/backend/access/brin/brin_inclusion.c
+++ b/src/backend/access/brin/brin_inclusion.c
@@ -235,8 +235,13 @@ brin_inclusion_add_value(PG_FUNCTION_ARGS)
 	Assert(finfo != NULL);
 	result = FunctionCall2Coll(finfo, colloid,
							   column->bv_values[INCLUSION_UNION],
							   newval);
-	if (!attr->attbyval)
+	if (!attr->attbyval && result != column->bv_values[INCLUSION_UNION])
+	{
 		pfree(DatumGetPointer(column->bv_values[INCLUSION_UNION]));
+
+		if (result == newv
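The hazard being fixed is easy to reproduce outside PostgreSQL. Below is a minimal, hypothetical C sketch (the `Val`, `merge`, and `add_value` names stand in for the BRIN machinery and are not real PostgreSQL APIs): a merge that may return one of its arguments instead of a fresh allocation, and a caller that frees the stored value only when the result is genuinely new, and copies the result when it aliases the short-lived input.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a range value; 'empty' plays the role of the
 * empty range from the repro script. */
typedef struct Val
{
	int		empty;
	long	lo;
	long	hi;
} Val;

/* Like range_merge(): if one argument is empty, return the OTHER ARGUMENT
 * ITSELF rather than a newly allocated value. */
static Val *
merge(Val *a, Val *b)
{
	Val	   *r;

	if (a->empty)
		return b;
	if (b->empty)
		return a;
	r = malloc(sizeof(Val));
	r->empty = 0;
	r->lo = (a->lo < b->lo) ? a->lo : b->lo;
	r->hi = (a->hi > b->hi) ? a->hi : b->hi;
	return r;
}

/* Defensive caller, mirroring the patch: never free the stored value when
 * the merge just handed it back, and never retain a pointer to the
 * caller's short-lived new value. */
static Val *
add_value(Val *stored, Val *newval)
{
	Val	   *result = merge(stored, newval);

	if (result != stored)
		free(stored);			/* safe: result does not alias stored */

	if (result == newval)
	{
		/* copy, since newval's memory may go away after this call */
		Val	   *copy = malloc(sizeof(Val));

		memcpy(copy, newval, sizeof(Val));
		result = copy;
	}
	return result;
}
```

With the unpatched logic — an unconditional free of the stored value followed by keeping `result` as-is — the empty-then-non-empty sequence from the repro script would retain a dangling pointer to the short-lived `newval`, which is exactly the pfree() crash in the backtrace.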
Re: old_snapshot_threshold vs indexes
On Wed, Aug 28, 2019 at 3:05 AM Tom Lane wrote:
> I'd vote for back-patching to 10. Even if there is in fact no deadlock
> hazard, you've clearly demonstrated a significant performance hit that
> we're taking for basically no reason.

Done.

The second symptom reported in my first email looked like evidence of high levels of spinlock backoff, which I guess might have been coming from TransactionIdLimitedForOldSnapshots()'s hammering of oldSnapshotControl->mutex_threshold, when running heap_page_prune_opt()-heavy workloads like the one generated by test-indexscan.sql (from my earlier message) from many backends at the same time on a large system. That's just an observation I'm leaving here; I'm not planning to chase it any further for now.

--
Thomas Munro
https://enterprisedb.com
Re: REINDEX filtering in the backend
On Wed, Aug 28, 2019 at 7:03 AM Michael Paquier wrote:
>
> On Thu, Jul 11, 2019 at 11:14:20PM +0200, Julien Rouhaud wrote:
> > I didn't want to spend too much time enjoying bison and adding new
> > unreserved keywords, so for now I just implemented this syntax to
> > start a discussion for this feature in the next commitfest:
> >
> > REINDEX ( FILTER = COLLATION ) [...]
> >
> > The FILTER clause can be used multiple times, each one is OR-ed with
> > the ReindexStmt's option, so we could easily add a LIBC, ICU and other
> > filters, also making COLLATION (or more realistically a better new
> > keyword) an alias for (LIBC | ICU) for instance.
>
> I would prefer keeping the interface simple with only COLLATION, so as
> only collation-sensitive indexes should be updated, including icu and
> libc ones. Or would it be useful to have the filtering for both, as
> libicu can break things similarly to glibc in an update, still a
> breakage on one or the other would not happen at the same time? I
> don't know enough of libicu regarding that, eager to learn. In which
> case, wouldn't it be better to support that from the start?

I'm not sure either. Another thing would be to add extra syntax to be able to discard even more indexes. For instance we could store the version of the underlying lib used when the index is (re)created, and do something like REINDEX (FILTER = LIBC!=2.28) or REINDEX (FILTER = LIBC==2.27) or something similar.

> > The filtering is done at table level (with and without the
> > concurrently option), so SCHEMA, DATABASE and SYSTEM automatically
> > benefit from it. If this clause is used with a REINDEX INDEX, the
> > statement errors out, as I don't see a valid use case for providing a
> > single index name and asking to possibly filter it at the same time.
>
> Supporting that case would not be that much amount of work, no?

Probably not, but I'm dubious about the use case.
Adding the lib version in the catalog would be more useful for people who want to write their own rules to reindex a specific set of indexes.

> > I also added some minimal documentation and regression tests. I'll
> > add this patch to the next commitfest.
> >
> > [1] https://www.postgresql.org/message-id/7140716c-679e-a0b9-a273-b201329d8891%402ndquadrant.com
>
> +if ((stmt->options & REINDEXOPT_ALL_FILTERS) != 0)
> +elog(ERROR, "FILTER clause is not compatible with REINDEX INDEX");
> [...]
> + discard indexes whose ordering does not depend on a collation. Note that
> + the FILTER option is not compatible with REINDEX
> + SCHEMA.
>
> Why do you have both limitations?

That's actually a typo: the documentation should have specified that it's not compatible with REINDEX INDEX, not REINDEX SCHEMA. I'll fix.

> I think that it would be nice to be
> able to do both, generating an error for REINDEX INDEX if the index
> specified is not compatible, and a LOG if the index is not filtered
> out when a list is processed.

Do you mean having an error if the index does not contain any collation-based type? Also, REINDEX only accepts a single name, so there shouldn't be any list processing for REINDEX INDEX? I'm not really in favour of adding extra code to the filtering when the user asks for a specific index name to be reindexed.

> Please note that elog() cannot be used
> for user-facing failures, only for internal ones.

Indeed, I'll change it to an ereport with ERRCODE_FEATURE_NOT_SUPPORTED.

> > +REINDEX (VERBOSE, FILTER = COLLATION) TABLE reindex_verbose;
> > +-- One column, not depending on a collation
>
> In order to make sure that a reindex has been done for a given entry
> with the filtering, an idea is to save the relfilenode before the
> REINDEX and check it afterwards. That would be nice to make sure that
> only the wanted indexes are processed, but it is not possible to be
> sure of that based on your tests, and some tests should be done on
> relations which have collation-sensitive as well as
> collation-insensitive indexes.

That's what I did when I first submitted the feature in reindexdb. I didn't use them because it means switching to TAP tests. I can drop the simple regression test (especially since I now realize that one is quite broken) and use the TAP one instead, or should I keep both?

> + index = index_open(indexOid, AccessShareLock);
> + numAtts = index->rd_index->indnatts;
> + index_close(index, AccessShareLock);
>
> Wouldn't it be better to close that after doing the scan?

Yes indeed.

> Could you add tests with REINDEX CONCURRENTLY?

Sure!

> Bonus: support for reindexdb should be added. Let's not forget about it.

Yep. That was a first prototype to see if this approach is ok. I'll add more tests, run pgindent, and add reindexdb support if this approach is sensible.
Re: pgbench - implement strict TPC-B benchmark
> On Wed, Aug 28, 2019 at 7:37 AM Fabien COELHO wrote:
>
> > While doing benchmarking using different tools, including pgbench, I
> > found it useful as a temporary hack to add copy freeze and
> > maintenance_work_mem options (the last one not as an env variable,
> > just as a set before, although not sure if it's a best idea). Is it
> > similar to what you were talking about?
>
> About this patch:
>
> Concerning the --maintenance... option, ISTM that there could rather be a
> generic way to provide "set" settings, not a specific option for a
> specific parameter with a specific unit. Moreover, ISTM that it only needs
> to be set once on a connection, not per command. I'd suggest something
> like:
>
>    --connection-initialization '...'
>
> That would be issued when a connection is started, for any query, then the
> effect would be achieved with:
>
>    pgbench --conn…-init… "SET maintenance_work_mem TO '12MB'" ...
>
> The --help does not say that the option expects a parameter.
>
> Also, in your patch it is an initialization option, but the code does not
> check for that.
>
> Concerning the freeze option:
>
> It is also an initialization-specific option that should be checked for
> that.
>
> The option does not make sense if
>
> The alternative queries could be managed simply without intermediate
> variables.
>
> Pgbench documentation is not updated.
>
> There are no tests.
>
> This patch should be submitted in its own thread to help manage it in the
> CF app.

Thanks, that was a pretty deep answer for what was supposed to be just an alignment question :) But sure, I can prepare a proper version and post it separately.
Re: Performance improvement of WAL writing?
Hi, Moon-san.

At Wed, 28 Aug 2019 15:43:02 +0900, "Moon, Insung" wrote in
> Dear Hackers.
>
> Currently, the XLogWrite function is written in 8k (or 16kb) units
> regardless of the size of the new record.
> For example, even if a new record is only 300 bytes, pg_pwrite is
> called to write data in 8k units (if it cannot be written on one page,
> 16kb is written).
> Let's look at the worst case.
> 1) LogwrtResult.Flush is 8100 pos.
> 2) And the new record size is only 100 bytes.
> In this case, pg_pwrite is called which writes 16kb to update only 100 bytes.
> It is a rare case, but I think there is overhead for pg_pwrite for some systems.
> # For systems that often update one record in one transaction.

If a commit is lonely in the system, the 100 bytes ought to be flushed out immediately. If there are many concurrent commits, XLogFlush() waits for all in-flight insertions up to the LSN the backend is writing, then actually flushes. Of course sometimes it flushes just 100 bytes, but sometimes it flushes far more bytes, involving records from other backends. If another backend has flushed further than the backend's target LSN, the backend skips the flush.

If you want to involve more commits in a flush, commit_delay lets the backend wait for that duration so that more commits can come in within the window. Or synchronous_commit = off works somewhat more aggressively.

> So what about modifying the XLogWrite function only to write the size
> that should record?

If I understand your proposal correctly, you're proposing to separate fsync from XLogWrite. Actually, as you proposed, roughly speaking no flush happens except at segment switch or commit. If you are proposing not flushing immediately even at commit, commit_delay (or synchronous_commit) works that way.

> Can this idea benefit from WAL writing performance?
> If it's OK to improve, I want to do modification.
> How do you think of it?

So the proposal seems to be already achieved.
If not, could you elaborate on the proposal, or explain the actual problem?

> Best Regards.
> Moon.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
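The 16 kB worst case described at the top of the thread is pure page-alignment arithmetic. A rough sketch, assuming the default 8 kB WAL block size (`bytes_written` is a hypothetical helper for illustration, not the actual xlog.c code):

```c
#include <assert.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192		/* default WAL block size */

/*
 * Bytes that hit the disk when a record of 'reclen' bytes is appended at
 * WAL position 'pos' and writes always cover whole pages: from the start
 * of the page containing 'pos' through the end of the page containing the
 * record's last byte.
 */
static uint64_t
bytes_written(uint64_t pos, uint64_t reclen)
{
	uint64_t	start = pos - pos % XLOG_BLCKSZ;	/* page holding pos */
	uint64_t	end = pos + reclen;
	uint64_t	end_page = (end + XLOG_BLCKSZ - 1) / XLOG_BLCKSZ * XLOG_BLCKSZ;

	return end_page - start;
}
```

A 100-byte record at position 8100 straddles a page boundary, so two full pages (16 kB) go to disk — exactly the worst case from the original mail — while the same record at an earlier offset in a page would cost only one page.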
Re: refactoring - share str2*int64 functions
Michaël,

>> I have taken the liberty to optimize the existing int64 function by
>> removing spurious tests.
>
> Which are?

 - *ptr && WHATEVER(*ptr)

*ptr is redundant, WHATEVER yields false on '\0', and it costs on each char but at the end. It might be debatable in some places, e.g. it is likely that there are no spaces in the string, but likely that there is more than one digit. If you want all/some *ptr added back, no problem.

 - isdigit repeated on if and following while, used if/do-while instead.

>> I have not created uint64 specific inlined overflow functions.
>
> Why?  There is a comment below ;p

See comment about comment below :-)

>> If it looks ok, a separate patch could address the 32 & 16 versions.
>
> I am surprised to see a negative diff

Is it? Long duplicate functions are factored out (this was my initial intent), one file is removed…

> actually just by doing that (adding the 32 and 16 parts will add much
> more code of course). At quick glance, I think that this is on the
> right track. Some comments I have on the top of my mind:
>
> - It would be good to have the unsigned equivalents of
>   pg_mul_s64_overflow, etc. These are simple enough,

Hmmm. Have you looked at the fallback cases when the corresponding builtins are not available? I'm unsure of a reliable way to detect a generic unsigned int overflow without expensive dividing back and having to care about zero… So I was pretty happy with my two discrete, small and efficient tests.

> and per the feedback from Andres they could live in common/int.h.

Could be, however "string.c" already contains a string to int conversion function, so I put them together. Probably this function should be removed in the end, though.

> - It is more consistent to use upper-case statuses in the enum
>   strtoint_status.

I thought of that, but the first enum I found used lower case, so it did not seem obvious that pg style was to use upper case. Indeed, some enum constants use upper case.

> Could it be renamed to pg_strtoint_status?

Sure.
I also prefixed the enum constants for consistency. Attached patch uses a prefix and upper-case constants. Waiting for further input about other comments.

--
Fabien.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 221b47298c..28891ba337 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -79,6 +79,7 @@
 #include "utils/builtins.h"
 #include "utils/hashutils.h"
 #include "utils/memutils.h"
+#include "common/string.h"
 
 PG_MODULE_MAGIC;
 
@@ -1022,7 +1023,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 
 	/* parse command tag to retrieve the number of affected rows. */
 	if (completionTag && strncmp(completionTag, "COPY ", 5) == 0)
-		rows = pg_strtouint64(completionTag + 5, NULL, 10);
+		(void) pg_strtouint64(completionTag + 5, &rows);
 	else
 		rows = 0;
 
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 2c0ae395ba..8e75d52b06 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -21,6 +21,7 @@
 #include "catalog/heap.h"
 #include "catalog/pg_type.h"
 #include "commands/trigger.h"
+#include "common/string.h"
 #include "executor/executor.h"
 #include "executor/spi_priv.h"
 #include "miscadmin.h"
@@ -2338,8 +2339,7 @@ _SPI_execute_plan(SPIPlanTr plan, ParamListInfo paramLI,
 				CreateTableAsStmt *ctastmt = (CreateTableAsStmt *) stmt->utilityStmt;
 
 				if (strncmp(completionTag, "SELECT ", 7) == 0)
-					_SPI_current->processed =
-						pg_strtouint64(completionTag + 7, NULL, 10);
+					(void) pg_strtouint64(completionTag + 7, &_SPI_current->processed);
 				else
 				{
 					/*
@@ -2361,8 +2361,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
 			else if (IsA(stmt->utilityStmt, CopyStmt))
 			{
 				Assert(strncmp(completionTag, "COPY ", 5) == 0);
-				_SPI_current->processed = pg_strtouint64(completionTag + 5,
-														 NULL, 10);
+				(void) pg_strtouint64(completionTag + 5, &_SPI_current->processed);
 			}
 		}
 
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..17cde42b4d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -34,6 +34,7 @@
 
 #include "fmgr.h"
 #include "miscadmin.h"
+#include "common/string.h"
 #include "nodes/extensible.h"
 #include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
@@ -80,7 +81,7 @@
 #define READ_UINT64_FIELD(fldname) \
 	token = pg_strtok(&length);	/* skip :fldname */ \
 	token = pg_strtok(&length);	/* get field value */ \
-	local_node->fldname = pg_strtouint64(token, NULL, 10)
+	(void) pg_strtouint64(token, &local_node->fldname)
 
 /* Read a long integer field (anything written as ":fldname %ld") */
 #define READ_LONG_FIELD(fldname) \
diff --git a/src/backend/parser/parse_node.c b/src/backend/parser/parse_node.c
index 1baf7ef31f..8af5e083ee 100644
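For readers following the overflow discussion above: the two small, division-free tests work because the base is known to be 10, so both cutoff values fold to compile-time constants, unlike the generic fallback that must divide the result back. A sketch (`accumulate_digit` is a hypothetical name, not the patch's actual API):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Accumulate one decimal digit into *acc, reporting overflow without any
 * runtime division: with the base fixed at 10, both cutoff values are
 * compile-time constants.  Returns 1 on success, 0 on overflow.
 */
static int
accumulate_digit(uint64_t *acc, int digit)
{
	/* UINT64_MAX == 18446744073709551615, so: */
	const uint64_t cutoff = UINT64_MAX / 10;	/* folded by the compiler */
	const int	cutlim = UINT64_MAX % 10;		/* 5, likewise */

	if (*acc > cutoff || (*acc == cutoff && digit > cutlim))
		return 0;				/* *acc * 10 + digit would overflow */
	*acc = *acc * 10 + digit;
	return 1;
}
```

A generic unsigned multiply-overflow check without builtin support has to divide the product back (and special-case zero); with the base fixed, the two comparisons above suffice.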
Re: refactoring - share str2*int64 functions
On Wed, Aug 28, 2019 at 08:51:29AM +0200, Fabien COELHO wrote:
> Here is an updated patch for the u?int64 conversion functions.

Thanks!

> I have taken the liberty to optimize the existing int64 function by
> removing spurious tests.

Which are?

> I have not created uint64 specific inlined overflow functions.

Why?  There is a comment below ;p

> If it looks ok, a separate patch could address the 32 & 16 versions.

I am surprised to see a negative diff actually just by doing that (adding the 32 and 16 parts will add much more code of course). At quick glance, I think that this is on the right track. Some comments I have on the top of my mind:
- It would be good to have the unsigned equivalents of pg_mul_s64_overflow, etc. These are simple enough, and per the feedback from Andres they could live in common/int.h.
- It is more consistent to use upper-case statuses in the enum strtoint_status. Could it be renamed to pg_strtoint_status?
--
Michael

signature.asc
Description: PGP signature