Re: Should we add GUCs to allow partition pruning to be disabled?
On Thu, Apr 11, 2019 at 03:34:30PM +1200, David Rowley wrote: > On Thu, 21 Mar 2019 at 00:51, David Rowley > wrote: > > Just so I don't forget about this, I've added it to the July 'fest. > > > > https://commitfest.postgresql.org/23/2065/ > > Now that we have 428b260f8, I think the version of this that goes into > master should be more like the attached. I tweaked this patch some more (sorry): - remove "currently" since that's not expected to be changed (right?); - remove "especially"; - refer to "partition hierarchies" not "partitioning hierarchies"; - rewrite the bit about "When partition pruning is not possible" Also, I noticed a while ago while grepping for "probably be fixed in future releases" that some items under ddl-inherit-caveats are actually possible for relkind=p partitions in v11. I assume those will never be implemented for inheritance partitioning, so I propose another update to docs (if preferred, I'll bring up on a new thread). - unique constraints on parent table; - FK constraints on parent table; Note that FK constraints *referencing* a partitioned table are possible in v12 but not in v11. So if there's any finer-grained update to documentation of the individual limitations, it'd need to be tweaked for back branches (v10 and 11). Justin >From 3a787b95f5a35b53cd958855ec6fc4ff9fc9a455 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 11 Apr 2019 00:24:44 -0500 Subject: [PATCH v1 1/2] Reinstate warnings regarding large hierarchies Put back warnings regarding high planning time and/or RAM use for large inheritance hierarchies, and high planning time for large numbers of partitions not pruned during planning with declaratively partitioned tables.
Discussion: https://www.postgresql.org/message-id/CAKJS1f8RW-mHQ8aEWD5Dv0%2B8A1wH5tHHdYMGW9y5sXqnE0X9wA%40mail.gmail.com https://commitfest.postgresql.org/23/2065/ Author: Robert Haas, David Rowley Reviewed by: Amit Langote, Justin Pryzby --- doc/src/sgml/ddl.sgml | 20 1 file changed, 20 insertions(+) diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 244d5ce..83cbc66 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -3582,6 +3582,26 @@ VALUES ('Albany', NULL, NULL, 'NY'); offer flexibility but do not have some of the performance benefits of built-in declarative partitioning. + + + + When using table inheritance, partition hierarchies with more than a few + hundred partitions are not recommended. Larger partition hierarchies may + incur long planning time, and, in the case of UPDATE + and DELETE, excessive memory usage. When inheritance + is used, see also the limitations described in + . + + + + When using declarative partitioning, the overhead of query planning + is directly related to the number of unpruned partitions. Planning is + generally fast with small numbers of unpruned partitions, even in + partition hierarchies containing many thousands of partitions. However, + long planning time will be incurred by large partition hierarchies if + partition pruning is not possible during the planning phase. + + -- 2.1.4 >From 6bd80e7cdddc3c9552d44439b4b8e9843c1007e4 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 11 Apr 2019 00:22:56 -0500 Subject: [PATCH v1 2/2] Document features of declarative partitioning.. ..which will never be implemented for legacy inheritance. 
--- doc/src/sgml/ddl.sgml | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 83cbc66..3495a66 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -3450,8 +3450,9 @@ VALUES ('Albany', NULL, NULL, 'NY'); - These deficiencies will probably be fixed in some future release, - but in the meantime considerable care is needed in deciding whether + Some functionality not implemented for inheritance hierarchies is + implemented for declarative partitioning. + Considerable care is needed in deciding whether partitioning with legacy inheritance is useful for your application. -- 2.1.4
Re: bug in update tuple routing with foreign partitions
Fujita-san, Thanks for the review. On 2019/04/10 17:38, Etsuro Fujita wrote: > (2019/03/06 18:33), Amit Langote wrote: >> I noticed a bug with how UPDATE tuple routing initializes ResultRelInfos >> to use for partition routing targets. Specifically, the bug occurs when >> UPDATE targets include a foreign partition that is locally modified (as >> opposed to being modified directly on the remote server) and its >> ResultRelInfo (called subplan result rel in the source code) is reused for >> tuple routing: > >> The solution is to teach ExecSetupPartitionTupleRouting to avoid using a >> subplan result rel if its ri_FdwState is already set. > > I'm not sure that is a good idea, because that requires to create a new > ResultRelInfo, which is not free; I think it requires a lot of work. After considering this a bit more and studying your proposed solution, I tend to agree. Besides the performance considerations you point out that I too think are valid, I realize that going with my approach would mean embedding some assumptions in the core code about ri_FdwState; in this case, assuming that a subplan result rel's ri_FdwState is not to be re-purposed for tuple-routing inserts, which is only based on looking at postgres_fdw's implementation. Not to mention the ensuing complexity in the core tuple routing code -- it now needs to account for the fact that not all subplan result rels may have been re-purposed for tuple-routing, something that's too complicated to handle in PG 11. IOW, let's call this a postgres_fdw bug and fix it there as you propose. > Another solution to avoid that would be to store the fmstate created by > postgresBeginForeignInsert() into the ri_FdwState, not overwrite the > ri_FdwState, like the attached. This would not need any changes to the > core, and this would not cause any overheads either, IIUC. What do you > think about that? +1.
Just one comment: +/* + * If the given resultRelInfo already has PgFdwModifyState set, it means + * the foreign table is an UPDATE subplan resultrel; in which case, store + * the resulting state into the PgFdwModifyState. + */ The last "PgFdwModifyState" should be something else? Maybe, sub-PgModifyState or sub_fmstate? >> I've also added a >> test in postgres_fdw.sql to exercise this test case. > > Thanks again, but the test case you added works well without any fix: Oops, I should really adopt a habit of adding a test case before code. > +-- Check case where the foreign partition is a subplan target rel and > +-- foreign parttion is locally modified (target table being joined > +-- locally prevents a direct/remote modification plan). > +explain (verbose, costs off) > +update utrtest set a = 1 from (values (1), (2)) s(x) where a = s.x > returning *; > + QUERY PLAN > +-- > > + Update on public.utrtest > + Output: remp.a, remp.b, "*VALUES*".column1 > + Foreign Update on public.remp > + Remote SQL: UPDATE public.loct SET a = $2 WHERE ctid = $1 RETURNING > a, b > + Update on public.locp > + -> Hash Join > + Output: 1, remp.b, remp.ctid, "*VALUES*".*, "*VALUES*".column1 > + Hash Cond: (remp.a = "*VALUES*".column1) > + -> Foreign Scan on public.remp > + Output: remp.b, remp.ctid, remp.a > + Remote SQL: SELECT a, b, ctid FROM public.loct FOR UPDATE > + -> Hash > + Output: "*VALUES*".*, "*VALUES*".column1 > + -> Values Scan on "*VALUES*" > + Output: "*VALUES*".*, "*VALUES*".column1 > + -> Hash Join > + Output: 1, locp.b, locp.ctid, "*VALUES*".*, "*VALUES*".column1 > + Hash Cond: (locp.a = "*VALUES*".column1) > + -> Seq Scan on public.locp > + Output: locp.b, locp.ctid, locp.a > + -> Hash > + Output: "*VALUES*".*, "*VALUES*".column1 > + -> Values Scan on "*VALUES*" > + Output: "*VALUES*".*, "*VALUES*".column1 > +(24 rows) > > In this test case, the foreign partition is updated before ri_FdwState is > overwritten by postgresBeginForeignInsert() that's invoked by the tuple > 
routing code, so it would work without any fix. So I modified this test > case as such. Thanks for fixing that. I hadn't noticed that foreign partition remp is updated before local partition locp; should've added a locp0 for values (0) so that remp appears second in the ModifyTable's subplans. Or, well, make the foreign table remp appear later by making it a partition for values in (3) as your patch does. BTW, have you noticed that the RETURNING clause returns the same row twice? +update utrtest set a = 3 from (values (2), (3)) s(x) where a = s.x returning *; + a | b | x +---+---+--- + 3 | qux triggered ! | 2 + 3 | xyzzy triggered ! | 3 + 3 | qux triggered ! | 3 +(3 rows) You can see that
Re: speeding up planning with partitions
On 2019/04/11 14:03, David Rowley wrote: > On Fri, 5 Apr 2019 at 19:50, Amit Langote > wrote: >> While we're on the topic of the relation between constraint exclusion and >> partition pruning, I'd like to (re-) propose this documentation update >> patch. The partitioning chapter in ddl.sgml says update/delete of >> partitioned tables uses constraint exclusion internally to emulate >> partition pruning, which is no longer true as of 428b260f8. > > Update-docs-that-update-delete-no-longer-use-cons.patch looks good to > me. It should be changed as what the docs say is no longer true. Thanks for the quick review. :) Regards, Amit
Re: speeding up planning with partitions
On Fri, 5 Apr 2019 at 19:50, Amit Langote wrote: > While we're on the topic of the relation between constraint exclusion and > partition pruning, I'd like to (re-) propose this documentation update > patch. The partitioning chapter in ddl.sgml says update/delete of > partitioned tables uses constraint exclusion internally to emulate > partition pruning, which is no longer true as of 428b260f8. Update-docs-that-update-delete-no-longer-use-cons.patch looks good to me. It should be changed as what the docs say is no longer true. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Zedstore - compressed in-core columnar storage
> On Apr 10, 2019, at 9:08 PM, Mark Kirkwood > wrote: > > >> On 11/04/19 4:01 PM, Mark Kirkwood wrote: >>> On 9/04/19 12:27 PM, Ashwin Agrawal wrote: >>> >>> Heikki and I have been hacking recently for few weeks to implement >>> in-core columnar storage for PostgreSQL. Here's the design and initial >>> implementation of Zedstore, compressed in-core columnar storage (table >>> access method). Attaching the patch and link to github branch [1] to >>> follow along. >>> >>> >> >> Very nice. I realize that it is very early days, but applying this patch >> I've managed to stumble over some compression bugs doing some COPY's: >> >> benchz=# COPY dim1 FROM '/data0/dump/dim1.dat' >> USING DELIMITERS ','; >> psql: ERROR: compression failed. what now? >> CONTEXT: COPY dim1, line 458 >> >> The log has: >> >> 2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed. what >> now? >> 2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458 >> 2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, >> zedstore_compression.c:287 >> 2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM >> '/data0/dump/dim1.dat' >> USING DELIMITERS ','; >> >> The dataset is generated from an old DW benchmark I wrote >> (https://sourceforge.net/projects/benchw/). >> The row concerned looks like: >> >> 457,457th interesting measure,1th measure >> type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp >> 458,458th interesting measure,2th measure >> type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg >> >> >> I'll see if changing to LZ4 makes any difference. >> >> > > The COPY works with LZ4 configured. Thank you for trying it out.
Yes, we noticed that for certain patterns pg_lzcompress() actually requires a much larger output buffer: for one 86-byte source it required a 2296-byte output buffer. The current zedstore code doesn't handle this case and errors out. LZ4 works fine for the same patterns; I'd highly recommend using LZ4 only, as it is very fast as well.
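To illustrate why a compressor can need an output buffer larger than its input, here is a small Python sketch using zlib as a stand-in for pg_lzcompress() (the 86-byte length mirrors the case above; this is only an illustration, not zedstore code):

```python
import hashlib
import zlib

def incompressible_bytes(n, seed=b"zedstore"):
    """Deterministic, effectively random (hence incompressible) byte string."""
    out = bytearray()
    block = seed
    while len(out) < n:
        block = hashlib.sha256(block).digest()
        out.extend(block)
    return bytes(out[:n])

data = incompressible_bytes(86)
compressed = zlib.compress(data, 9)

# For incompressible input the "compressed" form is larger than the
# original, so an output buffer sized to the input would overflow.
print(len(data), len(compressed))
```

Any general-purpose compressor has this worst case, which is why callers either size the output buffer for the maximum expansion (as LZ4_compressBound() does for LZ4) or fall back to storing the data uncompressed.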
Re: Should we add GUCs to allow partition pruning to be disabled?
On 2019/04/11 13:50, David Rowley wrote: > On Thu, 11 Apr 2019 at 16:06, Amit Langote > wrote: >> I've posted a patch last week on the "speed up partition planning" thread >> [1] which modifies ddl.sgml to remove the text about UPDATE/DELETE using >> constraint exclusion under the covers. Do you think there's any merit to >> combining that with this one? > > Probably separate is better. I don't think anything you're proposing > there is for back-patching, but I think the original patch over here > should be. OK, no problem. I just thought to point out my patch because you've posted a version of the patch here for HEAD *because of* 428b260f8, the commit which also obsoleted the text that the other patch fixes. Anyway, let's leave the other patch on its own thread where there are a few other things to be sorted out. Thanks, Amit
Re: Should we add GUCs to allow partition pruning to be disabled?
On Thu, 11 Apr 2019 at 16:06, Amit Langote wrote: > > On 2019/04/11 12:34, David Rowley wrote: > > Now that we have 428b260f8, I think the version of this that goes into > > master should be more like the attached. > > Thanks, looks good. Thanks for looking. > I've posted a patch last week on the "speed up partition planning" thread > [1] which modifies ddl.sgml to remove the text about UPDATE/DELETE using > constraint exclusion under the covers. Do you think there's any merit to > combining that with this one? Probably separate is better. I don't think anything you're proposing there is for back-patching, but I think the original patch over here should be. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Reducing the runtime of the core regression tests
David Rowley writes: > I was surprised to see nothing mentioned about attempting to roughly > sort the test order in each parallel group according to their runtime. I'm confused about what you have in mind here? I'm pretty sure pg_regress launches all the scripts in a group at the same time, so that just rearranging the order they're listed in on the schedule line shouldn't make any noticeable difference. If you meant changing the order of operations within each script, I don't really want to go there. It'd require careful per-script analysis, and to the extent that the existing tests have some meaningful ordering (admittedly, many don't), we'd lose that. >> Thoughts? Anyone object to making these sorts of changes >> post-feature-freeze? > I think it's a good time to do this sort of thing. It should be > easier to differentiate tests destabilising due to this change out > from the noise of other changes that are going in since currently, > the rate of those other changes should not be very high. Doing it any > later in the freeze does not seem better since we might discover some > things that need to be fixed due to this. Yeah. I wouldn't propose pushing this in shortly before beta, but if we do it now then we've probably got a month to sort out any problems that may appear. regards, tom lane
Re: Cleanup/remove/update references to OID column
On Wed, Apr 10, 2019 at 11:59:18AM -0500, Justin Pryzby wrote: > I found and included fixes for a few more references: > > doc/src/sgml/catalogs.sgml | 2 +- > doc/src/sgml/ddl.sgml| 3 +-- > doc/src/sgml/information_schema.sgml | 4 ++-- > doc/src/sgml/ref/create_trigger.sgml | 2 +- > doc/src/sgml/spi.sgml| 2 +- Open Item++. Andres, as this is a consequence of 578b229, could you look at what is proposed here? -- Michael
Re: Proper usage of ndistinct vs. dependencies extended statistics
On Thu, 11 Apr 2019 at 11:53, Paul Martinez wrote: > > I have some questions about the different types of extended statistics > that were introduced in Postgres 10. > - Which types of queries are each statistic type supposed to improve? Multivariate ndistinct stats aim to improve distinct estimates over groups of columns. These can help in cases like GROUP BY a,b, SELECT DISTINCT a,b, SELECT a,b FROM x UNION SELECT a,b FROM y; They also help in determining the number of times an index will be rescanned in cases like nested loops with a parameterised inner path. I see multivariate ndistinct estimates are not used for normal selectivity estimates for unknown values. e.g PREPARE q1 (int, int) AS SELECT * FROM t1 WHERE a = $1 and b = $2; still assumes a and b are independent even when ndistinct stats exist on the two columns. There are a few other usages too. See the calls of estimate_num_groups(). Dependency stats just handle WHERE clauses (or more accurately, clauses containing a reference to a single relation). These only handle equality OpExprs. e.g "a = 10 and y = 3", not "a < 6 and y = 3". Further stat types (most common values) added in PG12 aim to allow inequality operators too. > - When should one type of statistic be used over the other? Should they > both always be used? If they both should always be used then we'd likely not have bothered making the types optional. Both ndistinct and dependency stats are fairly cheap to calculate and store, so it might not be too big an issue adding both types if you're not sure. With these two types there's not really any choice for the planner to decide to use one or the other, it just makes use of the ones it can use for the given situation. That won't be the case as more stats types get added. In PG12, for example, we had to choose whether MCV stats should be applied before dependency stats.
That might be a no-brainer, but perhaps in the future there will be stats types where the order in which to apply them is not so clear, although in those cases it might be questionable why you'd want to define more than one type of stats on the same set of columns. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
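To make the ndistinct case concrete, here is a small Python sketch of why an independence assumption misestimates the number of groups for correlated columns (this shows only the shape of the problem with made-up data, not the planner's actual clamped estimation formula):

```python
# Hypothetical correlated columns: b is fully determined by a, in the
# spirit of city/state style dependencies.
rows = [(a, a % 10) for a in range(1000)]

ndistinct_a = len({a for a, _ in rows})   # distinct values of a: 1000
ndistinct_b = len({b for _, b in rows})   # distinct values of b: 10
actual_groups = len(set(rows))            # what GROUP BY a, b really produces

# Treating the columns as independent suggests ndistinct(a) * ndistinct(b)
# groups; multivariate ndistinct stats let the planner see the true count.
independent_estimate = ndistinct_a * ndistinct_b

print(actual_groups, independent_estimate)  # 1000 vs 10000
```

Here the independence-based estimate is 10x too high, which is exactly the kind of error that can push the planner toward a poor grouping strategy or a wrong nested-loop rescan count.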
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 09:42:47PM +0200, Peter Eisentraut wrote: > That is a great analysis. Seems like block-level is the preferred way > forward. In any solution related to incremental backups that I have seen from the community, all of them tend to prefer block-level backups, because of the filtering that is possible based on the LSN of the page header. The holes in the middle of a page are also easier to handle, reducing the size of each incremental page in the actual backup. My preference tends toward a block-level approach if we were to do something in this area, though I fear that performance will be bad if we begin to scan all the relation files to fetch the set of blocks changed since a past LSN. Hence we need some kind of LSN map so that it is possible to skip one block or a group of blocks (say one LSN every 8/16 blocks, for example) at once for a given relation if the relation is mostly read-only. -- Michael
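The LSN-map idea sketched above could look roughly like this (a hypothetical Python sketch: the group size, the integer LSNs, and the function names are all illustrative assumptions, not proposed PostgreSQL code):

```python
GROUP_SIZE = 16  # blocks per map entry, as in the "one LSN every 8/16 blocks" idea

def build_lsn_map(block_lsns):
    """Record the highest page LSN per group of GROUP_SIZE consecutive blocks."""
    lsn_map = {}
    for blkno, lsn in enumerate(block_lsns):
        g = blkno // GROUP_SIZE
        lsn_map[g] = max(lsn_map.get(g, 0), lsn)
    return lsn_map

def blocks_to_back_up(block_lsns, lsn_map, since_lsn):
    """Skip whole groups untouched since the last backup, then filter blocks."""
    result = []
    for g, group_lsn in sorted(lsn_map.items()):
        if group_lsn <= since_lsn:
            continue  # the entire group is unchanged: no page reads needed
        start = g * GROUP_SIZE
        for blkno in range(start, min(start + GROUP_SIZE, len(block_lsns))):
            if block_lsns[blkno] > since_lsn:
                result.append(blkno)
    return result

# Mostly read-only relation: only block 40 was modified after LSN 100.
lsns = [50] * 64
lsns[40] = 120
m = build_lsn_map(lsns)
print(blocks_to_back_up(lsns, m, since_lsn=100))  # -> [40]
```

The point of the map is that for a mostly read-only relation, 63 of the 64 page reads are avoided: only the one group whose recorded LSN exceeds the backup's start LSN needs its pages inspected.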
Re: pg_rewind vs superuser
On Tue, Apr 09, 2019 at 10:38:19AM +0900, Michael Paquier wrote: > Sure. With something like the attached? I don't think that there is > much point to complicate the test code with multiple roles if the > default is a superuser. As this topic differs from the original thread, I have started a new thread, so let's discuss the proposed patch there: https://www.postgresql.org/message-id/20190411041336.gm2...@paquier.xyz -- Michael
Switch TAP tests of pg_rewind to use role with only function permissions
Hi all, Recent commit bfc80683 has added some documentation in pg_rewind about the fact that it is possible to do the operation with a non-superuser, assuming that this role has sufficient grant rights to execute the functions used by pg_rewind. Peter Eisentraut has suggested having some tests for this kind of user here: https://www.postgresql.org/message-id/e1570ba6-4459-d9b2-1321-9449adaae...@2ndquadrant.com Attached is a patch which switches all the TAP tests of pg_rewind to do that. As of now, the tests depend on a superuser for everything, and it seems to me that it makes little sense to make the tests more pluggable by being able to switch the roles used on-the-fly (the invocation of pg_rewind is stuck into RewindTest.pm) as a superuser has no restrictions. Any thoughts? -- Michael diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm index 900d452d8b..618de85161 100644 --- a/src/bin/pg_rewind/t/RewindTest.pm +++ b/src/bin/pg_rewind/t/RewindTest.pm @@ -144,6 +144,20 @@ sub start_master { $node_master->start; + # Create a custom role which will be used to run pg_rewind. This has + # minimal permissions to make pg_rewind able to work with an online + # source. + $node_master->psql('postgres', " + CREATE ROLE rewind_user LOGIN; + GRANT EXECUTE ON function pg_catalog.pg_ls_dir(text, boolean, boolean) + TO rewind_user; + GRANT EXECUTE ON function pg_catalog.pg_stat_file(text, boolean) + TO rewind_user; + GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text) + TO rewind_user; + GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, boolean) + TO rewind_user;"); + # Now run the test-specific parts to initialize the master before setting # up standby @@ -207,6 +221,9 @@ sub run_pg_rewind my $standby_connstr = $node_standby->connstr('postgres'); my $tmp_folder = TestLib::tempdir; + # Append the rewind role to the connection string.
+ $standby_connstr = "$standby_connstr user=rewind_user"; + # Stop the master and be ready to perform the rewind $node_master->stop;
Re: Zedstore - compressed in-core columnar storage
On 11/04/19 4:01 PM, Mark Kirkwood wrote: On 9/04/19 12:27 PM, Ashwin Agrawal wrote: Heikki and I have been hacking recently for few weeks to implement in-core columnar storage for PostgreSQL. Here's the design and initial implementation of Zedstore, compressed in-core columnar storage (table access method). Attaching the patch and link to github branch [1] to follow along. Very nice. I realize that it is very early days, but applying this patch I've managed to stumble over some compression bugs doing some COPY's: benchz=# COPY dim1 FROM '/data0/dump/dim1.dat' USING DELIMITERS ','; psql: ERROR: compression failed. what now? CONTEXT: COPY dim1, line 458 The log has: 2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed. what now? 2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458 2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, zedstore_compression.c:287 2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM '/data0/dump/dim1.dat' USING DELIMITERS ','; The dataset is generated from an old DW benchmark I wrote (https://sourceforge.net/projects/benchw/). The row concerned looks like: 457,457th interesting measure,1th measure type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp 458,458th interesting measure,2th measure type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg I'll see if changing to LZ4 makes any difference. The COPY works with LZ4 configured.
Re: Zedstore - compressed in-core columnar storage
On 9/04/19 12:27 PM, Ashwin Agrawal wrote: Heikki and I have been hacking recently for few weeks to implement in-core columnar storage for PostgreSQL. Here's the design and initial implementation of Zedstore, compressed in-core columnar storage (table access method). Attaching the patch and link to github branch [1] to follow along. Very nice. I realize that it is very early days, but applying this patch I've managed to stumble over some compression bugs doing some COPY's: benchz=# COPY dim1 FROM '/data0/dump/dim1.dat' USING DELIMITERS ','; psql: ERROR: compression failed. what now? CONTEXT: COPY dim1, line 458 The log has: 2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed. what now? 2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458 2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, zedstore_compression.c:287 2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM '/data0/dump/dim1.dat' USING DELIMITERS ','; The dataset is generated from an old DW benchmark I wrote (https://sourceforge.net/projects/benchw/). The row concerned looks like: 457,457th interesting measure,1th measure type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp 458,458th interesting measure,2th measure type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg I'll see if changing to LZ4 makes any difference. best wishes Mark
Re: REINDEX CONCURRENTLY 2.0
On Tue, Apr 09, 2019 at 03:50:27PM +0900, Michael Paquier wrote: > And here is the patch to address this issue. It happens that a bit > more than the dependency switch was lacking here: > - At swap time, we need to have the new index definition track > relispartition from the old index. > - Again at swap time, the inheritance link needs to be updated between > the old/new index and its parent when reindexing a partition index. Peter, this is an open item, and I think as the committer of the feature you are its owner. Well, in this case, I don't mind taking the ownership as need be as I know this stuff. Anyway, could you have a look at the patch proposed and see if you have any issues with it? -- Michael
Re: Should we add GUCs to allow partition pruning to be disabled?
On Thu, 21 Mar 2019 at 00:51, David Rowley wrote: > Just so I don't forget about this, I've added it to the July 'fest. > > https://commitfest.postgresql.org/23/2065/ Now that we have 428b260f8, I think the version of this that goes into master should be more like the attached. I think the original patch is fine for the back branches. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services docs_partitioning_warning_master.patch Description: Binary data
RE: Contribution to Perldoc for TestLib module in Postgres
Hi Ram, I think this documentation helps people who want to understand these functions. >Please find the updated patch. Added to the commitfest as well I have some comments. I think some users who would like to learn about the custom functions will check src/test/perl/README first. How about adding a comment to the README file, such as "See TestLib.pm if you want to know more details"? In the above document, why not write the description after the function name? I think it is better to write the function name first and then the description. In your code; #Checks if all the tests passed or not all_tests_passing() In my suggestion; all_tests_passing() Checks if all the tests passed or not Also, some functions return a value. How about adding return-value information to the above doc? I found ^M characters in your new code. Please use LF line endings, not CRLF, or get rid of them in the next patch. Regards, Aya Iwata
Re: Should the docs have a warning about pg_stat_reset()?
On Thu, 11 Apr 2019 at 06:52, Bruce Momjian wrote: > > OK, let me step back. Why are people resetting the statistics > regularly? Based on that purpose, does it make sense to clear the > stats that effect autovacuum? I can't speak for everyone, but once upon a time when I first started using PostgreSQL, to monitor the application's use of the database I recorded the output of pg_stat_user_tables once per day and then reset the statistics. It was useful to know the number of inserted tuples, seq scans, index scans etc so I could understand the load on the database better. Of course, nowadays with LEAD()/LAG() it's pretty easy to find the difference from the previous day. I'd have done it differently if those had existed back then. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
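That LEAD()/LAG()-style workflow, keeping cumulative counters and differencing successive snapshots instead of resetting, can be sketched as follows (a Python sketch with made-up numbers standing in for a counter such as pg_stat_user_tables.n_tup_ins; in SQL one would use LAG() over the recorded snapshots):

```python
# Daily snapshots of a cumulative statistics counter, recorded without
# ever calling pg_stat_reset().
snapshots = [
    ("2019-04-08", 1_000),
    ("2019-04-09", 1_450),
    ("2019-04-10", 2_100),
    ("2019-04-11", 2_150),
]

# The LAG()-style difference from the previous day's snapshot yields the
# same per-day numbers that resetting used to, without disturbing the
# counters that autovacuum relies on.
deltas = [
    (day, total - prev_total)
    for (_prev_day, prev_total), (day, total) in zip(snapshots, snapshots[1:])
]
print(deltas)  # [('2019-04-09', 450), ('2019-04-10', 650), ('2019-04-11', 50)]
```

The design point is that the raw counters stay monotonically increasing, so any consumer (monitoring, autovacuum scheduling) keeps a consistent view, while per-interval rates are derived at query time.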
Re: [PATCH v20] GSSAPI encryption support
Greetings, * Robbie Harwood (rharw...@redhat.com) wrote: > Bruce Momjian writes: > > On Wed, Apr 3, 2019 at 08:49:25AM +0200, Magnus Hagander wrote: > >> On Wed, Apr 3, 2019 at 12:22 AM Joe Conway wrote: > >> > >> Personally I don't find it as confusing as is either, and I find > >> hostgss to be a good analog of hostssl. On the other hand hostgssenc > >> is long and unintuitive. So +1 for leaving as is and -1 one for > >> changing it IMHO. > >> > >> I think for those who are well versed in pg_hba (and maybe gss as > >> well), it's not confusing. That includes me. > >> > >> However, for a new user, I can definitely see how it can be > >> considered confusing. And confusion in *security configuration* is > >> always a bad idea, even if it's just potential. > >> > >> Thus +1 on changing it. > >> > >> If it was on the table it might have been better to keep hostgss and > >> change the authentication method to gssauth or something, but that > >> ship sailed *years* ago. > > > > Uh, did we consider keeping hostgss and changing the auth part at the > > end to "gssauth"? > > I think that was implicitly rejected because we'd have to keep the > capability to configure "gss" there else break compatibility. Right, if we changed the name of the auth method then everyone who is using the "gss" auth method would have to update their pg_hba.conf files... That would be very ugly. Also, it wasn't implicitly rejected, it was discussed up-thread (see the comments between Magnus and I, specifically, quoted above- "that ship sailed *years* ago") and explicitly rejected. Thanks! Stephen signature.asc Description: PGP signature
Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid
On Mon, Apr 8, 2019 at 6:42 PM Noah Misch wrote: > - AIX animals failed two ways. First, I missed a "use" statement such that > poll_start() would fail if it needed more than one attempt. Second, I > assumed $pid would be gone as soon as kill(9, $pid) returned[1]. > [1] POSIX says "sig or at least one pending unblocked signal shall be > delivered to the sending thread before kill() returns." I doubt the > postmaster had another signal pending often enough to explain the failures, so > AIX probably doesn't follow POSIX in this respect. It looks like you fixed this, but I was curious about this observation as someone interested in learning more about kernel stuff and portability... Maybe I misunderstood, but isn't POSIX referring to kill(sig, $YOUR_OWN_PID) there? That is, if you signal *yourself*, and no other thread exists that could handle the signal, it will be handled by the sending thread, and in the case of SIGKILL it will therefore never return. But here, you were talking about a Perl script that kills the postmaster, no? If so, that passage doesn't seem to apply. In any case, regardless of whether the signal handler has run to completion when kill() returns, doesn't the pid have to continue to exist in the process table until it is reaped by its parent (possibly in response to SIGCHLD), with one of the wait*() family of system calls? -- Thomas Munro https://enterprisedb.com
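The reaping behaviour described here is easy to demonstrate (a POSIX-only Python sketch: it forks a child, SIGKILLs it, and shows that the pid survives in the process table as a zombie until waitpid() reaps it):

```python
import os
import signal
import time

# Fork a child that just waits around to be killed.
pid = os.fork()
if pid == 0:
    time.sleep(60)
    os._exit(0)

os.kill(pid, signal.SIGKILL)

# kill() returning does not mean the pid is gone: the process-table
# entry survives as a zombie until the parent reaps it, so signalling
# it again (with the null signal 0) still succeeds.
time.sleep(0.2)          # give the kernel a moment to terminate the child
os.kill(pid, 0)          # no exception raised: pid still exists

reaped, status = os.waitpid(pid, 0)
print(reaped == pid, os.WIFSIGNALED(status),
      os.WTERMSIG(status) == signal.SIGKILL)  # -> True True True
```

After the waitpid() call the zombie is gone and `os.kill(pid, 0)` would raise ProcessLookupError, which matches the point above: the pid only disappears once a wait*() call reaps it.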
Question about .partial WAL file
Hi, Hackers I noticed something strange. Is it actually harmless? I haven't detected any problem, but it makes me uneasy. Steps: - There are two standbys that connect to the primary. - Kill the primary and promote one standby. - Restart the other standby with primary_conninfo reset to connect to the new primary. I expected that the latest WAL segment file on the old timeline would be renamed with a .partial suffix, but it is not renamed in the restarted standby. xlog.c says the following, but I don't understand what bad situation can arise. * the archive. It's physically present in the new file with new TLI, * but recovery won't look there when it's recovering to the older --> * timeline. On the other hand, if we archive the partial segment, and --> * the original server on that timeline is still running and archives --> * the completed version of the same segment later, it will fail. (We * used to do that in 9.4 and below, and it caused such problems). * * As a compromise, we rename the last segment with the .partial * suffix, and archive it. Archive recovery will never try to read * .partial segments, so they will normally go unused. But in the odd * PITR case, the administrator can copy them manually to the pg_wal * directory (removing the suffix). They can be useful in debugging, * too. Regards Ryo Matsumura
Re: finding changed blocks using WAL scanning
On Wed, Apr 10, 2019 at 5:49 PM Robert Haas wrote: > There is one thing that does worry me about the file-per-LSN-range > approach, and that is memory consumption when trying to consume the > information. Suppose you have a really high velocity system. I don't > know exactly what the busiest systems around are doing in terms of > data churn these days, but let's say just for kicks that we are > dirtying 100GB/hour. That means, roughly 12.5 million block > references per hour. If each block reference takes 12 bytes, that's > maybe 150MB/hour in block reference files. If you run a daily > incremental backup, you've got to load all the block references for > the last 24 hours and deduplicate them, which means you're going to > need about 3.6GB of memory. If you run a weekly incremental backup, > you're going to need about 25GB of memory. That is not ideal. One > can keep the memory consumption to a more reasonable level by using > temporary files. For instance, say you realize you're going to need > 25GB of memory to store all the block references you have, but you > only have 1GB of memory that you're allowed to use. Well, just > hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno, > writing each batch to a separate temporary file, and then process each > of those 32 files separately. That does add some additional I/O, but > it's not crazily complicated and doesn't seem too terrible, at least > to me. Still, it's something not to like. Oh, I'm being dumb. We should just have the process that writes out these files sort the records first. Then when we read them back in to use them, we can just do a merge pass like MergeAppend would do. Then you never need very much memory at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PostgreSQL pollutes the file system
>>> On 2019-03-29 20:32, Joe Conway wrote: pg_util
>>>
>>> How is that better than just renaming to pg_$oldname?
>>
>> As I already said upthread:
>>
>>> This way, we would be free from the command name conflict problem
>
> Well, whatever we do -- if anything -- we would certainly need to keep
> the old names around for a while somehow. So this doesn't really make
> that issue go away.

Another complaint was that it's hard for novice users to remember the tool names. I think this approach would solve that problem. I agree that the command name conflict problem will not be solved by the idea. However, I do not believe there is a name conflict problem in the first place, so I am happy to keep the old names as they are.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: Reducing the runtime of the core regression tests
On Wed, Apr 10, 2019 at 4:56 PM Peter Geoghegan wrote: > The original fastpath tests don't seem particularly effective to me, > even without the oversight I mentioned. I suggest that you remove > them, since the minimal btree_index.sql fast path test is sufficient. To be clear: I propose that you remove the tests entirely, and we leave it at that. I don't intend to follow up with my own patch because I don't think that there is anything in the original test case that is worth salvaging. -- Peter Geoghegan
Re: PostgreSQL pollutes the file system
On Fri, Mar 29, 2019 at 13:25, Tom Lane wrote:
>
> Maybe if we want to merge these things into one executable,
> it should be a new one. "pg_util createrole bob" ?
>
+1 as I proposed in https://www.postgresql.org/message-id/bdd1adb1-c26d-ad1f-2f15-cc52056065d4%40timbira.com.br

-- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Re: Reducing the runtime of the core regression tests
On Wed, Apr 10, 2019 at 4:19 PM Tom Lane wrote: > > I'll come up with a patch to deal with this situation, by > > consolidating the old and new tests in some way. I don't think that > > your work needs to block on that, though. > > Should I leave out the part of my patch that creates index_fastpath.sql? > If we're going to end up removing that version of the test, there's no > point in churning the related lines beforehand. The suffix truncation stuff made it tricky to force a B-Tree to be tall without also consisting of many blocks. Simply using large, random key values in suffix attributes didn't work anymore. The solution I came up with for the new fastpath test that made it into btree_index.sql was to have redundancy in leading keys, while avoiding TOAST compression by using plain storage in the table/index. > One way or the other I want to get that test out of where it is, > because indexing.sql is currently the slowest test in its group. > But if you prefer to make btree_index run a bit longer rather than > inventing a new test script, that's no problem from where I stand. The original fastpath tests don't seem particularly effective to me, even without the oversight I mentioned. I suggest that you remove them, since the minimal btree_index.sql fast path test is sufficient. If there ever was a problem in this area, then amcheck would certainly detect it -- there is precisely one place for every tuple on v4 indexes. The original fastpath tests won't tickle the implementation in any interesting way in my opinion. -- Peter Geoghegan
Proper usage of ndistinct vs. dependencies extended statistics
Hello, I have some questions about the different types of extended statistics that were introduced in Postgres 10.

- Which types of queries is each statistic type supposed to improve?
- When should one type of statistic be used over the other? Should both always be used?

We have a multi-tenant application and all of our tables have a denormalized tenant_id column. (Most tables actually use the tenant_id as part of a composite primary key on (tenant_id, id).) As the docs suggest, we haven't created extended STATISTICS except when we observe the query planner making poor query plans. We've seen poor query plans on queries involving filters on foreign keys:

Table: fk_table
  tenant_id | integer
  id        | integer
  fk_id     | integer
  PRIMARY KEY (tenant_id, id)
  FOREIGN KEY (tenant_id, fk_id) REFERENCES left_table(tenant_id, id)

The id columns on these tables are unique, so there is a functional dependence between fk_id and tenant_id: if the fk_id values are the same, then the tenant_id values must also be the same. This table has ~4.6 million rows, ~1300 distinct values for tenant_id, and ~13000 distinct values for fk_id. For a single SELECT query that filters on tenant_id and fk_id, the planner erroneously estimates that it will return a single row (4,600,000 / 1300 / 13,000 ~= 0.3):

=> EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND fk_id = 1;
                                  QUERY PLAN
--
 Index Scan using fk_table_tenant_id_fk_id_index on fk_table (cost=0.43..4.45 rows=1 width=44) (actual time=0.016..1.547 rows=3113 loops=1)
   Index Cond: ((tenant_id = 100) AND (fk_id = 1))

In other places we've used an ndistinct statistic to solve this issue, but that doesn't help in this case. Postgres still estimates that the query will return a single row.
=> CREATE STATISTICS ndistinct_stat (ndistinct) ON tenant_id, fk_id FROM fk_table;
=> ANALYZE fk_table;
=> SELECT stxname, stxndistinct FROM pg_statistic_ext;
    stxname     |  stxndistinct
----------------+-----------------
 ndistinct_stat | {"1, 3": 3433}

=> EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND fk_id = 1;
-- (unchanged)

Why doesn't the ndistinct statistic get used when planning this query? (We're currently on Postgres 10.6.)

In contrast, if we create a functional dependency statistic, then Postgres accurately predicts the result size:

=> CREATE STATISTICS dep_stat (dependencies) ON tenant_id, fk_id FROM fk_table;
=> ANALYZE fk_table;
=> SELECT stxname, stxdependencies FROM pg_statistic_ext;
 stxname  |           stxdependencies
----------+---------------------------------------
 dep_stat | {"1 => 3": 1.00, "3 => 1": 0.060300}

=> EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND fk_id = 1;
                                  QUERY PLAN
--
 Index Scan using fk_table_tenant_id_fk_id_index on fk_table (cost=0.43..1042.23 rows=612 width=44) (actual time=0.011..0.813 rows=3056 loops=1)
   Index Cond: ((tenant_id = 100) AND (fk_id = 1))

So, in general, which type of extended statistic should be used? Where do the different kinds of statistics get used in the query planner? Is there an advantage to using one type of statistic over the other, or should we always create both? And in our specific example, with a schema designed for multi-tenancy, which types of statistics should we use for our foreign keys, where tenant_id is functionally dependent on the other foreign_id columns?

To explain where some of our confusion is coming from, here's the example where adding an ndistinct statistic helped: Postgres was adding a filter after an index scan instead of including the filter as part of the index scan. big_table had ~500,000,000 rows, ~3000 distinct values for column a, ~3000 distinct values for column b, but just ~4500 distinct values for the (a, b) tuple, and column b was functionally dependent on column c.
Postgres wanted to do:

=> SELECT * FROM big_table WHERE a = 1 AND b = 10 AND c IN (100, 101, 102, ...);
 Index Scan using big_table_a_b_c on big_table (cost=0.57..122.41 rows=1 width=16)
   Index Cond: ((a = 1) AND (b = 10))
   Filter: c = ANY ('{100, 101, 102, 103, 104, 105, ...}')

But then we did:

=> CREATE STATISTICS big_table_a_b_ndistinct (ndistinct) ON a, b FROM big_table;
=> ANALYZE big_table;
=> SELECT * FROM big_table WHERE a = 1 AND b = 10 AND c IN (100, 101, 102, ...);
 Index Scan using big_table_a_b_c on big_table (cost=0.57..122.41 rows=1 width=16)
   Index Cond: ((a = 1) AND (b = 10)) AND (c = ANY ('{100, 101, 102, 103, 104, 105, ...}'))

(This had very poor performance because Postgres thought it would have to filter 500,000,000 / 3000 / 3000 ~= 55 rows, but actually it had to filter 500,000,000 / 4500 ~= 110,000 rows.) Because of the functional
Re: Reducing the runtime of the core regression tests
Peter Geoghegan writes: > On Wed, Apr 10, 2019 at 3:35 PM Tom Lane wrote: >> * Likewise, I split up indexing.sql by moving the "fastpath" test into >> a new file index_fastpath.sql. > I just noticed that the "fastpath" test actually fails to test the > fastpath optimization -- the coverage we do have comes from another > test in btree_index.sql, that I wrote back in December. Oh! Hmm. > I'll come up with a patch to deal with this situation, by > consolidating the old and new tests in some way. I don't think that > your work needs to block on that, though. Should I leave out the part of my patch that creates index_fastpath.sql? If we're going to end up removing that version of the test, there's no point in churning the related lines beforehand. One way or the other I want to get that test out of where it is, because indexing.sql is currently the slowest test in its group. But if you prefer to make btree_index run a bit longer rather than inventing a new test script, that's no problem from where I stand. regards, tom lane
Re: serializable transaction: exclude constraint violation (backed by GIST index) instead of ssi conflict
On Thu, Apr 11, 2019 at 10:54 AM Thomas Munro wrote: > On Thu, Apr 11, 2019 at 9:43 AM Peter Billen wrote: > > I kinda expected/hoped that transaction t2 would get aborted by a > > serialization error, and not an exclude constraint violation. This makes > > the application session bound to transaction t2 failing, as only > > serialization errors are retried. > Yeah, I agree, the behaviour you are expecting is desirable and we > should figure out how to do that. The basic trick for btree unique > constraints was to figure out where the index *would* have written, to > give the SSI machinery a chance to object to that before raising the > UCV. I wonder if we can use the same technique here... at first > glance, check_exclusion_or_unique_constraint() is raising the error, > but is not index AM specific code, and it is somewhat removed from the > GIST code that would do the equivalent > CheckForSerializableConflictIn() call. I haven't looked into it > properly, but that certainly complicates matters somewhat... Perhaps > the index AM would actually need a new entrypoint that could be called > before the error is raised, or perhaps there is an easier way. Adding Kevin (architect of SSI and reviewer/committer of my UCV interception patch) and Shubham (author of GIST SSI support) to the CC list in case they have thoughts on this. -- Thomas Munro https://enterprisedb.com
Re: Libpq support to connect to standby server as priority
On Fri, Mar 29, 2019 at 7:06 PM Tsunakawa, Takayuki < tsunakawa.ta...@jp.fujitsu.com> wrote: > Hi Hari-san, > > I've reviewed all the files. The patch would be OK when the following > have been fixed, except for the complexity of fe-connect.c (which probably > cannot be improved.) > > Unfortunately, I'll be absent next week. The earliest date I can do the > test will be April 8 or 9. I hope someone could take care of this patch... > Thanks for the review. Apologies that I could not able finish it on time because of other work. > > (1) 0001 > With this read-only option type, application can connect to > to a read-only server in the list of hosts, in case > ... > before issuing the SHOW transaction_readonly to find out whether > > > "to" appears twice in a row. > transaction_readonly -> transaction_read_only > > > (2) 0001 > +succesful connection or failure. > > succesful -> successful > > > (3) 0008 > to conenct to a standby server with a faster check instead of > > conenct -> connect > > > (4) 0008 > Logically, recovery exit should be notified after the following statement: > > XLogCtl->SharedRecoveryInProgress = false; > > > (5) 0008 > + /* Update in_recovery status. */ > + if (LocalRecoveryInProgress) > + SetConfigOption("in_recovery", > + "on", > + PGC_INTERNAL, > PGC_S_OVERRIDE); > + > > This SetConfigOption() is called for every RecoveryInProgress() call on > the standby. It may cause undesirable overhead. How about just calling > SetConfigOption() once in InitPostgres() to set the initial value for > in_recovery? InitPostgres() and its subsidiary functions call > SetConfigOption() likewise. > > > (6) 0008 > async.c is for LISTEN/UNLISTEN/NOTIFY. How about adding the new functions > in postgres.c like RecoveryConflictInterrupt()? > > > (7) 0008 > + if (pid != 0) > + { > + (void) SendProcSignal(pid, reason, > procvxid.backendId); > + } > > The braces are not necessary because the block only contains a single > statement. 
> I fixed all the comments that you have raised above and attached the updated patches. Regards, Haribabu Kommi Fujitsu Australia 0001-Restructure-the-code-to-remove-duplicate-code.patch Description: Binary data 0002-New-TargetSessionAttrsType-enum.patch Description: Binary data 0003-Make-transaction_read_only-as-GUC_REPORT-varaible.patch Description: Binary data 0004-New-prefer-read-target_session_attrs-type.patch Description: Binary data 0005-New-read-only-target_session_attrs-type.patch Description: Binary data 0006-Primary-prefer-standby-and-standby-options.patch Description: Binary data 0007-New-function-to-rejecting-the-checked-write-connecti.patch Description: Binary data 0008-Server-recovery-mode-handling.patch Description: Binary data
Re: Reducing the runtime of the core regression tests
On Wed, Apr 10, 2019 at 3:35 PM Tom Lane wrote:
> I finally got some time to pursue that, and attached is a proposed patch
> that moves some tests around and slightly adjusts some other ones.
> To cut to the chase: on my workstation, this cuts the time for
> "make installcheck-parallel" from 21.9 sec to 13.9 sec, or almost 40%.
> I think that's a worthwhile improvement, considering how often all of us
> run those tests.

Great!

> * create_index.sql ran much longer than other tests in its parallel
> group, so I split out the SP-GiST-related tests into a new file
> create_index_spgist.sql, and moved the delete_test_table test case
> to btree_index.sql.

Putting the delete_test_table test case in btree_index.sql makes perfect sense.

> * Likewise, I split up indexing.sql by moving the "fastpath" test into
> a new file index_fastpath.sql.

I just noticed that the "fastpath" test actually fails to test the fastpath optimization -- the coverage we do have comes from another test in btree_index.sql, that I wrote back in December. While I did make a point of ensuring that we had test coverage for the nbtree fastpath optimization that went into Postgres 11, I also didn't consider the original fastpath test. I assumed that there were no tests to begin with, because gcov showed me that there was no test coverage back in December.

What happened here was that commit 074251db limited the fastpath to only be applied to B-Trees with at least 3 levels. While the original fastpath optimization tests actually tested the fastpath optimization when it first went in in March 2018, that only lasted a few weeks, since 074251db didn't adjust the test to still be effective.

I'll come up with a patch to deal with this situation, by consolidating the old and new tests in some way. I don't think that your work needs to block on that, though.

> Thoughts? Anyone object to making these sorts of changes
> post-feature-freeze?
IMV there should be no problem with pushing ahead with this after feature freeze. -- Peter Geoghegan
Re: Should the docs have a warning about pg_stat_reset()?
On Wed, Apr 10, 2019 at 16:33, Alvaro Herrera wrote:
>
> On 2019-Apr-10, Bruce Momjian wrote:
>
> > On Thu, Apr 11, 2019 at 04:14:11AM +1200, David Rowley wrote:
> >
> > > I still think we should start with a warning about pg_stat_reset().
> > > People are surprised by this, and these are just the ones who notice:
> > >
> > > https://www.postgresql.org/message-id/CAB_myF4sZpxNXdb-x=welpqbdou6ue8fhtm0fverpm-1j7p...@mail.gmail.com
> > >
> > > I imagine there are many others just suffering from bloat due to
> > > auto-vacuum not knowing how many dead tuples there are in the tables.
> >
> > OK, let me step back. Why are people resetting the statistics
> > regularly? Based on that purpose, does it make sense to clear the
> > stats that effect autovacuum?
>
> I agree that we should research that angle. IMO resetting stats should
> be seriously frowned upon. And if they do need to reset some counters
> for some valid reason, offer a mechanism that leaves the autovac-
> guiding counters alone.
>
Then you would have to change the way pg_stat_reset() works (it currently removes the hash tables). Even pg_stat_reset_single_table_counters() could cause trouble, although to a smaller extent. Resetting statistics leaves the autovacuum state machine in an invalid state. Since resetting statistics is a rare situation (at least I don't know of monitoring tools or practices that regularly execute those functions), would it be worth adding complexity to the pg_stat_reset* functions? autovacuum could handle those rare cases just fine.

> IMO the answer for $SUBJECT is yes.
>
+1. However, I also suggest a WARNING saying "autovacuum won't work because you reset statistics that it depends on" plus the detail "Consider executing ANALYZE on all your tables" / "Consider executing ANALYZE on table foo.bar".

-- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Re: Reducing the runtime of the core regression tests
Andres Freund writes: > On 2019-04-10 18:35:15 -0400, Tom Lane wrote: >> ... What I did instead was to shove >> that test case and some related ones into a new plpgsql test file, >> src/pl/plpgsql/src/sql/plpgsql_trap.sql, so that it's not part of the >> core regression tests at all. (We've talked before about moving >> chunks of plpgsql.sql into the plpgsql module, so this is sort of a >> down payment on that.) Now, if you think about the time to do >> check-world rather than just the core regression tests, this isn't >> obviously a win, and in fact it might be a loss because the plpgsql >> tests run serially not in parallel with anything else. > Hm, can't we "just" parallelize the plpgsql schedule instead? If somebody wants to work on that, I won't stand in the way, but it seems like material for a different patch. >> Thoughts? Anyone object to making these sorts of changes >> post-feature-freeze? > Hm. There's some advantage to doing so, because it won't break any large > pending changes. But it's also possible that it'll destabilize the > buildfarm some. In personal capacity I'm like +0.5. My thought was that there is (hopefully) going to be a lot of testing going on over the next few months, so making that faster would be a useful activity. regards, tom lane
Re: Reducing the runtime of the core regression tests
Hi, On 2019-04-10 18:35:15 -0400, Tom Lane wrote: > on my workstation, this cuts the time for "make installcheck-parallel" > from 21.9 sec to 13.9 sec, or almost 40%. I think that's a worthwhile > improvement, considering how often all of us run those tests. Awesome. > * The plpgsql test ran much longer than others, which turns out to be > largely due to the 2-second timeout in its test of statement_timeout. > In view of the experience reflected in commit f1e671a0b, just > reducing that timeout seems unsafe. What I did instead was to shove > that test case and some related ones into a new plpgsql test file, > src/pl/plpgsql/src/sql/plpgsql_trap.sql, so that it's not part of the > core regression tests at all. (We've talked before about moving > chunks of plpgsql.sql into the plpgsql module, so this is sort of a > down payment on that.) Now, if you think about the time to do > check-world rather than just the core regression tests, this isn't > obviously a win, and in fact it might be a loss because the plpgsql > tests run serially not in parallel with anything else. However, > by that same token, the parallel-testing overload we were concerned > about in f1e671a0b should be much less bad in the plpgsql context. > I therefore took a chance on reducing the timeout down to 1 second. > If the buildfarm doesn't like that, we can change it back to 2 seconds > again. It should still be a net win because of the fact that > check-world runs the core tests more than once. Hm, can't we "just" parallelize the plpgsql schedule instead? > Thoughts? Anyone object to making these sorts of changes > post-feature-freeze? Hm. There's some advantage to doing so, because it won't break any large pending changes. But it's also possible that it'll destabilize the buildfarm some. In personal capacity I'm like +0.5. Greetings, Andres Freund
Re: B-tree cache prefetches
On Tue, Nov 27, 2018 at 5:43 AM Andrey Borodin wrote:
> > On Aug 31, 2018, at 2:40, Thomas Munro wrote:
> > [1] https://arxiv.org/pdf/1509.05053.pdf
>
> I've implemented all of the strategies used in that paper.
> On a B-tree page we have line pointers ordered in key order and tuples residing at the end of the page. Tuples at the end are mostly "appended", i.e. they usually are accommodated at the end of the free space.
> The idea of the attached patch is to try to order tuples with different strategies. This ordering happens during VACUUM and can easily be broken with a single insert.
> Strategies are switched by #define:
> 1. USE_SEQ_ORDER - just order tuples in order of line pointers. This strategy has no idea behind it and is expected to work similarly to the baseline (no changes to code at all)
> 2. USE_EYZINGER_ORDER - depth-first topological search. If you always choose the lower branch, your search is just a sequential read of bytes.
> 3. USE_VEB_ORDER - Van Emde Boas layout - try to pack 3 tuples (mid and two sub-mids) together. The key idea is that you cache-miss when reading the mid, but the tuples of both next steps are already fetched by that cache miss.
> 4. USE_BT_ORDER - order the two sub-mids together so that you can prefetch both possible next mids in a single cache prefetch.
>
> Cache prefetches of B-tree binsearch are controlled by USE_PREFETCH.
>
> I've tested all these strategies on my laptop (2.2GHz Intel i7, 16GB RAM, MacOS 10.14.1)
> ./pgbench -i -s 25 postgres && ./pgbench -T 100 postgres
> No changes in configs.
>
> Here are the results in tps:
> without prefetch
> baseline 1448.331199
> seq      1433.737204
> bt       1463.701527
> veb      1457.586089
> eyz      1483.654765
>
> with prefetch
> baseline 1486.585895
> seq      1471.757042
> bt       1480.169967
> veb      1464.834678
> eyz      1460.323392
>
> These numbers are not statistically rigorous - just one run was done for each number, and the experiment needs to be tuned anyway.
> From this I can conclude:
> 1.
It is hard to observe significant performance influence on pgbench. Maybe I should use some more representative B-tree performance tests?
> 2. Cache prefetches seem to increase performance without any layout strategies. But the difference is hair-splitting: 2.6%.
> 3. For the implemented strategies my preliminary results correspond to the result in the paper: the Eyzinger layout is the best.
> 4. Among layout strategies with prefetch, the bt-order strategy seems to be the winner.
>
> How can I improve experimental evaluation of these layout strategies? Other thoughts? I'd be happy to hear any comment on this.

What about read-only tests with prewarmed indexes? Those numbers look IO bound.

This is focusing on prefetch within btree code, but I wonder if there is some lower hanging fruit that doesn't require any layout changes: heap prefetch during index scans. By analogy to figure 8a "software-pipelined prefetching" of [1], I wonder if you could build a pipeline using dirty (unlocked) memory reads, like this:

1. Prefetch buffer mapping hash bucket for heap_tids[n + 3]
2. Prefetch page header for heap_tids[n + 2] using a dirty read of the first value in the buffer mapping hash bucket
3. Prefetch heap tuple for tids[n + 1] using a dirty read of the heap page
4. Return heap tid for tids[0] to caller

If you're lucky, by the time the caller uses the tid to find the buffer, read the line pointer and access the tuple, all three things will be in cache and hopefully won't have changed since the 'dirty' reads. Prefetching the wrong cachelines would be possible obviously, if there is a buffer mapping hash table collision and the one you wanted isn't first, or stuff just changed. Maybe there aren't enough cachelines for this 3-level pipeline to work, considering that in between random other executor nodes could run, so perhaps just a 2-level or 1-level pipeline would pay off.
As I showed in my toy hash join prefetch experiment[2], I could get 20-30% speedup from prefetching just hash buckets (equivalent to prefetching buffer mapping hash buckets), and then a further 5-10% speedup from prefetching the first item in the hash bucket, but I expect nested hash joins to interfere with that as they compete for cache lines. I suppose I could try to do three levels and fetch the next tuple in the chain, but that seems unlikely to help because we're getting into lower probabilities that I'd even want that data; in contrast, the index scan -> heap scan case *definitely* needs to go through a line pointer to a tuple somewhere on the page, so 3-level might be worth experimenting with. Just a thought, I have not tried any of this. Maybe there are really 4 levels, if you count the buffer descriptor/lock.

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf
[2] https://www.postgresql.org/message-id/CA%2BhUKGKB%3DEWq%2Brj4NK8CX6ZqHZzuQYWSb9iDfYKzGN6-ejShUQ%40mail.gmail.com

-- Thomas Munro
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-10, Alvaro Herrera wrote: > but the test immediately does this: > > alter table at_partitioned alter column b type numeric using b::numeric; > > and watch what happens! (1663 is pg_default) > > alvherre=# select relname, reltablespace from pg_class where relname like > 'at_partitioned%'; >relname| reltablespace > --+--- > at_partitioned | 0 > at_partitioned_a_idx | 0 > at_partitioned_b_idx | 1663 > (3 filas) > > Outrageous! This is because ruleutils.c attaches a TABLESPACE clause when asked to dump an index definition; and tablecmds.c uses ruleutils to deparse the index definition into something that can be replayed via CREATE INDEX commands (or ALTER TABLE ADD CONSTRAINT UNIQUE/PRIMARY KEY, if that's the case.) This patch (PoC quality) fixes that behavior, but I'm looking to see what else it breaks. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 83f72470b10..acb6b09a576 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -10567,7 +10567,13 @@ ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel, tab->changedIndexOids = lappend_oid(tab->changedIndexOids, foundObject.objectId); tab->changedIndexDefs = lappend(tab->changedIndexDefs, - pg_get_indexdef_string(foundObject.objectId)); + /* + * we include a tablespace clause + * only if it's a plain index. + * This likely breaks some stuff. + */ + pg_get_indexdef_string(foundObject.objectId, + relKind == RELKIND_INDEX)); } } else if (relKind == RELKIND_SEQUENCE) diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 0c7a533e697..41144d84a76 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -1142,11 +1142,11 @@ pg_get_indexdef_ext(PG_FUNCTION_ARGS) * Returns a palloc'd C string; no pretty-printing. 
*/ char * -pg_get_indexdef_string(Oid indexrelid) +pg_get_indexdef_string(Oid indexrelid, bool tablespace) { return pg_get_indexdef_worker(indexrelid, 0, NULL, false, false, - true, true, + tablespace, true, 0, false); } @@ -2170,6 +2170,7 @@ pg_get_constraintdef_worker(Oid constraintId, bool fullCommand, pfree(options); } + /* XXX this tablespace works even if it's 0 */ tblspc = get_rel_tablespace(indexId); if (OidIsValid(tblspc)) appendStringInfo(, " USING INDEX TABLESPACE %s", diff --git a/src/include/utils/ruleutils.h b/src/include/utils/ruleutils.h index 7c49e9d0a83..6ff3b841057 100644 --- a/src/include/utils/ruleutils.h +++ b/src/include/utils/ruleutils.h @@ -18,7 +18,7 @@ #include "nodes/pg_list.h" -extern char *pg_get_indexdef_string(Oid indexrelid); +extern char *pg_get_indexdef_string(Oid indexrelid, bool tablespace); extern char *pg_get_indexdef_columns(Oid indexrelid, bool pretty); extern char *pg_get_partkeydef_columns(Oid relid, bool pretty);
finding changed blocks using WAL scanning
Over at https://www.postgresql.org/message-id/CA%2BTgmobFVe4J4AA7z9OMUzKnm09Tt%2BsybhxeL_Ddst3q3wqpzQ%40mail.gmail.com I mentioned parsing the WAL to extract block references so that incremental backup could efficiently determine which blocks needed to be copied. Ashwin replied in http://postgr.es/m/CALfoeitO-vkfjubMFQRmgyXghL-uUnZLNxbr=obrqqsm8kf...@mail.gmail.com to mention that the same approach could be useful for pg_rewind:

# Currently, pg_rewind requires
# all the WAL logs to be present on source side from point of divergence to
# rewind. Instead just parse the wal and keep the changed blocks around on
# sourece. Then don't need to retain the WAL but can still rewind using the
# changed block map.

Since there are at least two possible use cases for an efficient way to learn when blocks last changed, and in each case the performance benefits of being able to learn that are potentially quite large, I'm starting this thread to brainstorm how such a system might work.

It seems to me that there are basically two ways of storing this kind of information, plus a bunch of variants. One way is to store files that cover a range of LSNs, and basically contain a synopsis of the WAL for those LSNs. You omit all the actual data and just mention which blocks were changed by some record in that part of the WAL. In this type of scheme, the storage required is roughly proportional to the volume of WAL for which you wish to retain data. Pruning old data is easy; just remove the files that provide information about LSNs that you don't care about any more. The other way is to store data about each block, or each range of blocks, or all the blocks that hash onto a certain slot; for each, store the newest LSN that has modified that block, or a block in that range, or a block that hashes onto that slot.
In this system, storage is roughly proportional to the size of the database cluster, except maybe in the hashing case, but I *think* that case will degrade unless you basically expand the map to be roughly proportional to the size of the cluster anyway. I may be wrong.

Of these two variants, I am inclined to prefer the version where each file is a summary of the block references within some range of LSNs. It seems simpler to implement to me. You just read a bunch of WAL files and then when you get tired you stop and emit an output file. You need to protect yourself against untimely crashes. One way is to stick a checksum into the output file. After you finish writing it, fsync() it before you start writing the next one. After a restart, read the latest such file and see if the checksum is OK. If not, regenerate it; if so, assume it's good and move on. Files other than the last one can be assumed good. Another way is to create the file with a temporary name, fsync() it, and then rename it into place and fsync() again. The background worker that generates the files can have a GUC to remove them when they are older than some threshold amount of time, or you can keep them forever and let the user manually remove stuff they no longer want based on LSN. That's pretty much it.

The version where you keep an LSN per block/range/hash bucket seems more complex in terms of durability. Now you have to worry not only about creating new files, but also about modifying them. That seems to add some complexity. I think it may be possible to make it work without doing write-ahead logging for every change, but it certainly needs careful thought to avoid torn page problems and checkpoint synchronization issues. Moreover, it potentially uses lots and lots of inodes if there are many relations in the cluster. You can avoid that by not creating maps for small files, if you like, or by switching to the hash bucket approach. But either of those approaches is lossy.
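The temporary-name-plus-rename recipe is easy to get subtly wrong, so here is a minimal sketch of it in Python; the file layout and the use of SHA-256 as the checksum are my own choices for illustration, not part of the proposal:

```python
import hashlib
import os

def write_summary_atomically(path, payload: bytes):
    """Write a summary file so a crash never leaves a torn copy visible.

    A checksum is prepended so a reader can also verify the last file
    written before a crash, as described above.
    """
    digest = hashlib.sha256(payload).digest()
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(digest + payload)
        f.flush()
        os.fsync(f.fileno())      # data is durable before the rename
    os.rename(tmp, path)          # atomic replacement on POSIX filesystems
    # fsync the directory so the rename itself survives a crash
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

def read_summary(path):
    """Return the payload if the checksum matches, else None (regenerate)."""
    with open(path, "rb") as f:
        data = f.read()
    digest, payload = data[:32], data[32:]
    return payload if hashlib.sha256(payload).digest() == digest else None
```

The directory fsync() after the rename is what makes the new name itself durable; without it, a crash shortly after the rename can still lose the new directory entry.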
Omitting the maps for small files means you always have to assume everything in those files changed. The hash bucket approach is vulnerable to hash collisions; you have to assume that all blocks that hash to a given bucket have changed. Those are probably manageable disadvantages, but I think they are real ones.

There is one thing that does worry me about the file-per-LSN-range approach, and that is memory consumption when trying to consume the information. Suppose you have a really high velocity system. I don't know exactly what the busiest systems around are doing in terms of data churn these days, but let's say just for kicks that we are dirtying 100GB/hour. That means roughly 12.5 million block references per hour. If each block reference takes 12 bytes, that's maybe 150MB/hour in block reference files. If you run a daily incremental backup, you've got to load all the block references for the last 24 hours and deduplicate them, which means you're going to need about 3.6GB of memory. If you run a weekly incremental backup, you're going to need about 25GB of memory. That is not ideal. One can keep the memory consumption to a more reasonable
serializable transaction: exclude constraint violation (backed by GIST index) instead of ssi conflict
Hi all,

I understood that v11 includes predicate locking for gist indexes, as per https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ad55863e9392bff73377911ebbf9760027ed405 . I tried this in combination with an exclude constraint as follows:

drop table if exists t;
create table t(period tsrange);
alter table t add constraint bla exclude using gist(period with &&);

-- t1
begin transaction isolation level serializable;
select * from t where period && tsrange(now()::timestamp, now()::timestamp + interval '1 hour');
insert into t(period) values(tsrange(now()::timestamp, now()::timestamp + interval '1 hour'));

-- t2
begin transaction isolation level serializable;
select * from t where period && tsrange(now()::timestamp, now()::timestamp + interval '1 hour');
insert into t(period) values(tsrange(now()::timestamp, now()::timestamp + interval '1 hour'));

-- t1
commit;

-- t2
ERROR: conflicting key value violates exclusion constraint "bla"
DETAIL: Key (period)=(["2019-04-10 20:59:20.6265","2019-04-10 21:59:20.6265")) conflicts with existing key (period)=(["2019-04-10 20:59:13.332622","2019-04-10 21:59:13.332622")).

I kinda expected/hoped that transaction t2 would get aborted by a serialization error, and not an exclude constraint violation. This makes the application session bound to transaction t2 fail, as only serialization errors are retried.

We introduced the same kind of improvement/fix for btree indexes earlier, see https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=fcff8a575198478023ada8a48e13b50f70054766.

Should this also be applied for (exclude) constraints backed by a gist index (as gist indexes now support predicate locking), or am I making incorrect assumptions somewhere here?

Thanks.
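For what it's worth, the practical impact is in the application-side retry loop: such loops typically retry only SQLSTATE 40001 (serialization_failure), sometimes also 40P01 (deadlock_detected), while 23P01 (exclusion_violation) is treated as permanent. A minimal sketch of that classifier, with an invented driver-neutral exception class (the SQLSTATEs themselves are from PostgreSQL's error code list):

```python
# SQLSTATEs from PostgreSQL's error code appendix.
RETRYABLE_SQLSTATES = {
    "40001",  # serialization_failure
    "40P01",  # deadlock_detected
}

class DatabaseError(Exception):
    """Stand-in for a driver exception that carries an SQLSTATE."""
    def __init__(self, sqlstate, message=""):
        super().__init__(message)
        self.sqlstate = sqlstate

def run_with_retries(txn, max_attempts=3):
    """Re-run a transaction only on transient serialization conflicts;
    surface every other error immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DatabaseError as e:
            if e.sqlstate not in RETRYABLE_SQLSTATES or attempt == max_attempts:
                raise
```

Under the behavior reported above, t2 raises 23P01 and this loop gives up immediately; if the conflict were reported as 40001, it would be retried transparently.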
Re: PostgreSQL pollutes the file system
I just want to be on record that I don't think there is a problem here that needs to be solved. The choice to put Postgres-related binaries in /usr/bin or wherever is a distribution/packaging decision. As has been pointed out, if I download, build, and install Postgres, the binaries by default go in /usr/local/pgsql/bin.

It is a long-standing Unix tradition to have short-named commands from many sources in /usr/bin and /bin, not to mention other files, often with short names, in various directories all over the system. For example, on one of the Ubuntu machines at my work, take a look at all the 2-character commands in those directories, and how many different packages they come from, in the list at the bottom of this message.

At this point I think Postgres absolutely owns the name "psql" as a Unix binary and I would oppose any suggestion that this should be renamed. Just my own effort to teach my fingers to type something different would probably outweigh any benefit from renaming.

Having said this, if people are enthusiastic and can actually agree, there are a few changes that might make sense:

- move clusterdb, createdb, etc. (*db, but not initdb because that is a server, not client, program) into pg_db_util [subcommand] (or some such)
- move createuser, dropuser into pg_role_util [subcommand] (or some such)
- pgbench -> pg_bench (why no '_' anyway?)
- ecpg -> pg_ec (usually invoked by makefiles anyway, I'm guessing)

But I consider this worth doing only if people consider that it's an improvement for reasons other than just getting stuff out of /bin or /usr/bin.

List of 2-character commands and their source packages on one of our systems (the "no path found" ones are mostly symlinks into the Ubuntu "alternatives" system):

16:52 ijmorlan@ubuntu1604-102$ dpkg -S /usr/bin/?? /bin/?? | sort
dpkg-query: no path found matching pattern /usr/bin/cc
dpkg-query: no path found matching pattern /usr/bin/ex
dpkg-query: no path found matching pattern /usr/bin/fp
dpkg-query: no path found matching pattern /usr/bin/js
dpkg-query: no path found matching pattern /usr/bin/pc
dpkg-query: no path found matching pattern /usr/bin/rn
dpkg-query: no path found matching pattern /usr/bin/rt
dpkg-query: no path found matching pattern /usr/bin/vi
dpkg-query: no path found matching pattern /bin/mt
dpkg-query: no path found matching pattern /bin/nc
acct: /usr/bin/ac
apache2-utils: /usr/bin/ab
aspectj: /usr/bin/aj
at: /usr/bin/at
bc: /usr/bin/bc
bf: /usr/bin/bf
binutils: /usr/bin/ar
binutils: /usr/bin/as
binutils: /usr/bin/ld
binutils: /usr/bin/nm
bsdmainutils: /usr/bin/hd
bsdmainutils: /usr/bin/ul
byobu: /usr/bin/NF
coreutils: /bin/cp
coreutils: /bin/dd
coreutils: /bin/df
coreutils: /bin/ln
coreutils: /bin/ls
coreutils: /bin/mv
coreutils: /bin/rm
coreutils: /usr/bin/du
coreutils: /usr/bin/id
coreutils: /usr/bin/nl
coreutils: /usr/bin/od
coreutils: /usr/bin/pr
coreutils: /usr/bin/tr
coreutils: /usr/bin/wc
cups-client: /usr/bin/lp
dash: /bin/sh
dc: /usr/bin/dc
debhelper: /usr/bin/dh
diversion by dash from: /bin/sh
diversion by dash to: /bin/sh.distrib
ed: /bin/ed
ghostscript: /usr/bin/gs
graphviz: /usr/bin/gc
gv: /usr/bin/gv
i3-wm: /usr/bin/i3
ii: /usr/bin/ii
iproute2: /bin/ip
iproute2: /bin/ss
ispell: /usr/bin/sq
login: /bin/su
login: /usr/bin/sg
m4: /usr/bin/m4
mc: /usr/bin/mc
mercurial: /usr/bin/hg
mono-devel: /usr/bin/al
mono-devel: /usr/bin/lc
mono-devel: /usr/bin/sn
mtools: /usr/bin/lz
mtools: /usr/bin/uz
p7zip-full: /usr/bin/7z
procps: /bin/ps
rcs: /usr/bin/ci
rcs: /usr/bin/co
rs: /usr/bin/rs
ruby: /usr/bin/ri
sc: /usr/bin/sc
speech-tools: /usr/bin/dp
tex4ht: /usr/bin/ht
texlive-binaries: /usr/bin/mf
util-linux: /usr/bin/pg
xz-utils: /usr/bin/xz
Re: block-level incremental backup
On Wed, 10 Apr 2019 11:55:51 -0700 Andres Freund wrote:
> Hi,
>
> On 2019-04-10 14:38:43 -0400, Robert Haas wrote:
> > On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais wrote:
> > > In my current design, the scan is done backward from end to start and I
> > > keep all the records appearing after the last occurrence of their
> > > respective FPI.
> >
> > Oh, interesting. That seems like it would require pretty major
> > surgery on the WAL stream.
>
> Can't you just read each segment forward, and then reverse?

Not sure what you mean. I first look for the very last XLOG record by jumping to the last WAL segment and scanning it forward. Then, I do a backward scan from there to record the LSN of each xlogrecord to keep. Finally, I clone each WAL segment and edit it as needed (as described in my previous email). This is my current WIP though.

> That's not that much memory?

I don't know, yet. I did not measure it.
Re: block-level incremental backup
On Wed, 10 Apr 2019 14:38:43 -0400 Robert Haas wrote:
> On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais wrote:
> > In my current design, the scan is done backward from end to start and I
> > keep all the records appearing after the last occurrence of their
> > respective FPI.
>
> Oh, interesting. That seems like it would require pretty major
> surgery on the WAL stream.

Indeed. Presently, the surgery in my code is replacing redundant xlogrecords with noops. I now have to deal with multi-block records. So far, I tried to mark non-needed blocks with !BKPBLOCK_HAS_DATA and made a simple patch in core to ignore such marked blocks, but it doesn't play well with dependencies between xlogrecords, eg. during UPDATE. So my plan is to rewrite them to remove non-needed blocks using eg. XLOG_FPI.

As I wrote, this is mainly a hobby project right now for my own education. Not sure where it leads me, but I learn a lot while working on it.
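To make the backward-scan idea concrete, here is a toy sketch of the record-selection pass. The record representation is invented for illustration, and each record is assumed to touch exactly one block; real xlogrecords can touch several, which is exactly the multi-block complication discussed in this thread:

```python
def select_records(records):
    """records: list of (lsn, block, is_fpi) tuples in WAL order.

    Returns the LSNs worth keeping: each block's last FPI and every
    record on that block that comes after it. Everything before a
    block's last FPI is redundant, since the FPI restores the whole page.
    """
    seen_fpi = set()   # blocks whose covering (last) FPI we already met
    keep = []
    for lsn, block, is_fpi in reversed(records):
        if block in seen_fpi:
            continue               # superseded by a later FPI
        keep.append(lsn)
        if is_fpi:
            seen_fpi.add(block)    # earlier records on this block are redundant
    keep.reverse()                 # restore WAL order for replay
    return keep
```

Scanning backward makes "is there a later FPI for this block?" a constant-time set lookup, which is why the design reads the stream end-to-start.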
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-10, Andres Freund wrote:
> Hi,
>
> On 2019-04-10 09:28:21 -0400, Alvaro Herrera wrote:
> > So I think that apart from David's patch, we should just document all
> > these things carefully.
>
> Yea, I think that's the most important part.
>
> I'm not convinced that we should have any inheriting behaviour btw - it
> seems like there's a lot of different ways to think this should behave,
> with different good reason each.

So, I ended up with the attached patch. I think it works pretty well, and it passes all my POLA tests. But it doesn't pass pg_upgrade tests! And investigating closer, it seems closely related to what David was complaining about elsewhere, namely the tablespace being improperly set by some rewrite operations.

Here's the setup as created by regress' create_table.sql:

create table at_partitioned (a int, b text) partition by range (a);
create table at_part_1 partition of at_partitioned for values from (0) to (1000);
insert into at_partitioned values (512, '0.123');
create table at_part_2 (b text, a int);
insert into at_part_2 values ('1.234', 1024);
create index on at_partitioned (b);
create index on at_partitioned (a);

If you examine state at this point, it's all good:

alvherre=# select relname, reltablespace from pg_class where relname like 'at_partitioned%';
       relname        | reltablespace
----------------------+---------------
 at_partitioned       |             0
 at_partitioned_a_idx |             0
 at_partitioned_b_idx |             0

but the test immediately does this:

alter table at_partitioned alter column b type numeric using b::numeric;

and watch what happens! (1663 is pg_default)

alvherre=# select relname, reltablespace from pg_class where relname like 'at_partitioned%';
       relname        | reltablespace
----------------------+---------------
 at_partitioned       |             0
 at_partitioned_a_idx |             0
 at_partitioned_b_idx |          1663
(3 filas)

Outrageous! I'm going to have a look at this behavior now. IMO it's a separate bug, but without fixing that we obviously cannot fix the other one.
-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: block-level incremental backup
On 10.04.2019 19:51, Robert Haas wrote:
> On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik wrote:
> > Some time ago I implemented an alternative version of the ptrack utility
> > (not the one used in pg_probackup) which detects updated blocks at file
> > level. It is very simple and maybe it can sometimes be integrated in
> > master.
>
> I don't think this is completely crash-safe. It looks like it arranges
> to msync() the ptrack file at appropriate times (although I haven't
> exhaustively verified the logic), but it uses MS_ASYNC, so it's
> possible that the ptrack file could get updated on disk either before
> or after the relation file itself. I think before is probably OK -- it
> just risks having some blocks look modified when they aren't really --
> but after seems like it is very much not OK. And changing this to use
> MS_SYNC would probably be really expensive. Likely a better approach
> would be to hook into the new fsync queue machinery that Thomas Munro
> added to PostgreSQL 12.

I do not think that MS_SYNC or the fsync queue is needed here. If a power failure or OS crash causes the loss of some writes to the ptrack map, then in any case Postgres will perform recovery, and updating pages from WAL causes them to be marked in the ptrack map once again. So, as with CLOG and many other Postgres files, it is not critical to lose some writes, because they will be restored from WAL. And before truncating WAL, Postgres performs a checkpoint which flushes all changes to disk, including ptrack map updates.

> It looks like your system maps all the blocks in the system into a
> fixed-size map using hashing. If the number of modified blocks between
> the full backup and the incremental backup is large compared to the
> size of the ptrack map, you'll start to get a lot of false-positives.
> It will look as if much of the database needs to be backed up. For
> example, in your sample configuration, you have ptrack_map_size =
> 1000003. If you've got a 100GB database with 20% daily turnover, that's
> about 2.6 million blocks. If you bump a random entry ~2.6 million times
> in a map with 1000003 entries, on the average ~92% of the entries end
> up getting bumped, so you will get very little benefit from incremental
> backup. This problem drops off pretty fast if you raise the size of the
> map, but it's pretty critical that your map is large enough for the
> database you've got, or you may as well not bother.

This is why the ptrack block size should be larger than the page size. Assume that it is 1MB. 1MB is considered to be the optimal amount of disk IO, where frequent seeks do not degrade read speed (this is most critical for HDD). In other words, reading 10 random pages (20%) from this 1MB block will take almost the same amount of time (or even longer) as reading the whole 1MB in one operation. There will be just 10 used entries in the ptrack map, with a very small probability of collision. Actually, I have chosen this size (1000003) for the ptrack map because, with a 1MB block size, it allows mapping a 1TB database without a noticeable number of collisions, which seems to be enough for most Postgres installations. But increasing the ptrack map size 10 or even 100 times should also not cause problems with modern RAM sizes.

> It also appears that your system can't really handle resizing of the
> map in any friendly way. So if your data size grows, you may be faced
> with either letting the map become progressively less effective, or
> throwing it out and losing all the data you have. None of that is to
> say that what you're presenting here has no value, but I think it's
> possible to do better (and I think we should try).

Definitely I didn't consider the proposed patch a perfect solution, and certainly it requires improvements (and maybe a complete redesign). I just wanted to present this approach (maintaining a hash of blocks' LSNs in mapped memory and keeping track of modified blocks at the file level, unlike the current ptrack implementation, which logs changes in all the places in the Postgres code where data is updated).

Also, despite the fact that this patch may be considered a raw prototype, I have spent some time thinking about all aspects of this approach, including fault tolerance and false positives.
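The ~92% figure follows from standard occupancy math: after N uniform random bumps into S slots, the expected fraction of touched slots is 1 - (1 - 1/S)^N, which is approximately 1 - e^(-N/S). A quick sanity check with the numbers from this exchange (100GB database, 20% daily turnover, 8kB blocks, a map of 1000003 entries); the 10x-larger map is my own addition, to show how quickly the problem drops off:

```python
def touched_fraction(bumps, slots):
    """Expected fraction of map entries hit by `bumps` uniform random bumps."""
    return 1.0 - (1.0 - 1.0 / slots) ** bumps

# 100GB database, 20% daily turnover, 8kB blocks -> ~2.6 million dirty blocks
dirty_blocks = 100 * 2**30 * 20 // 100 // 8192

frac = touched_fraction(dirty_blocks, 1000003)            # ~92% of entries bumped
frac_10x = touched_fraction(dirty_blocks, 10 * 1000003)   # ~23% with a 10x map
```

With the map sized as discussed, nearly every entry looks modified after a day of churn; multiplying the map size by 10 already brings the touched fraction down to under a quarter.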
Re: PostgreSQL pollutes the file system
On 2019-04-10 15:01, Tatsuo Ishii wrote: >> On 2019-03-29 20:32, Joe Conway wrote: >>> pg_util >> >> How is that better than just renaming to pg_$oldname? > > As I already said in up thread: > >> This way, we would be free from the command name conflict problem Well, whatever we do -- if anything -- we would certainly need to keep the old names around for a while somehow. So this doesn't really make that issue go away. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PostgreSQL pollutes the file system
On 2019-04-10 15:15, Fred .Flintstone wrote: > The warnings would only be printed if the programs were executed with > the old file names. > This in order to inform people relying on the old names that they are > deprecated and they should move to the new names with the pg_ prefix. Yeah, that would be annoying. Let's not do that. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: block-level incremental backup
On 2019-04-10 17:31, Robert Haas wrote: > I think the way to think about this problem, or at least the way I > think about this problem, is that we need to decide whether want > file-level incremental backup, block-level incremental backup, or > byte-level incremental backup. That is a great analysis. Seems like block-level is the preferred way forward. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Should the docs have a warning about pg_stat_reset()?
On 2019-Apr-10, Bruce Momjian wrote:

> On Thu, Apr 11, 2019 at 04:14:11AM +1200, David Rowley wrote:
> > I still think we should start with a warning about pg_stat_reset().
> > People are surprised by this, and these are just the ones who notice:
> >
> > https://www.postgresql.org/message-id/CAB_myF4sZpxNXdb-x=welpqbdou6ue8fhtm0fverpm-1j7p...@mail.gmail.com
> >
> > I imagine there are many others just suffering from bloat due to
> > auto-vacuum not knowing how many dead tuples there are in the tables.
>
> OK, let me step back. Why are people resetting the statistics
> regularly? Based on that purpose, does it make sense to clear the
> stats that affect autovacuum?

I agree that we should research that angle. IMO resetting stats should be seriously frowned upon. And if they do need to reset some counters for some valid reason, offer a mechanism that leaves the autovacuum-guiding counters alone.

IMO the answer for $SUBJECT is yes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: block-level incremental backup
Hi,

On 2019-04-10 14:38:43 -0400, Robert Haas wrote:
> On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais wrote:
> > In my current design, the scan is done backward from end to start and I
> > keep all the records appearing after the last occurrence of their
> > respective FPI.
>
> Oh, interesting. That seems like it would require pretty major
> surgery on the WAL stream.

Can't you just read each segment forward, and then reverse? That's not that much memory? And sure, there's some inefficient cases where records span many segments, but that's rare enough that reading a few segments several times doesn't strike me as particularly bad?

Greetings,

Andres Freund
Re: Should the docs have a warning about pg_stat_reset()?
On Thu, Apr 11, 2019 at 04:14:11AM +1200, David Rowley wrote:
> On Sat, 30 Mar 2019 at 00:59, Robert Haas wrote:
> >
> > On Wed, Mar 27, 2019 at 7:49 PM David Rowley wrote:
> > > Yeah, analyze, not vacuum. It is a bit scary to add new ways for
> > > auto-vacuum to suddenly have a lot of work to do. When all workers
> > > are busy it can lead to neglect of other duties. It's true that there
> > > won't be much in the way of routine vacuuming work for the database
> > > the stats were just reset on, as of course, all the n_dead_tup
> > > counters were just reset. However, it could starve other databases of
> > > vacuum attention. Anti-wraparound vacuums on the current database may
> > > get neglected too.
> > >
> > > I'm not saying let's not do it, I'm just saying we need to think of
> > > what bad things could happen as a result of such a change.
> >
> > +1. I think that if we documented that pg_stat_reset() is going to
> > trigger an auto-analyze of every table in your system, we'd have some
> > people who didn't read the documentation and unleashed a storm of
> > auto-analyze activity, and other people who did read the documentation
> > and then intentionally used it to unleash a storm of auto-analyze
> > activity. Neither sounds that great.
>
> I still think we should start with a warning about pg_stat_reset().
> People are surprised by this, and these are just the ones who notice:
>
> https://www.postgresql.org/message-id/CAB_myF4sZpxNXdb-x=welpqbdou6ue8fhtm0fverpm-1j7p...@mail.gmail.com
>
> I imagine there are many others just suffering from bloat due to
> auto-vacuum not knowing how many dead tuples there are in the tables.

OK, let me step back. Why are people resetting the statistics regularly? Based on that purpose, does it make sense to clear the stats that affect autovacuum?

-- 
  Bruce Momjian        http://momjian.us
  EnterpriseDB         http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+        Ancient Roman grave inscription              +
Postgres "failures" dataset for machine learning
Hi all,

I was wondering if there exists either a test suite of pathological failure cases for postgres, or a dataset of failure scenarios. I'm not exactly sure what such a dataset would look like; possibly a bunch of snapshots of test databases while undergoing a bunch of different failure scenarios?

I'm experimenting with machine learning and I had an idea to build a classifier to determine if a running postgres database is having issues. Right now "issues" is very ambiguously defined, but I'm thinking of problems I've encountered at work, such as resource saturation, long running transactions, lock contention, etc. I know a lot of this is already covered by existing monitoring solutions, but I'm specifically interested to see if a ML model can learn monitoring rules on its own.

If the classifier turns out to be feasible then my hope would be to expand the ML model to have some diagnostic capabilities -- I've had difficulty in the past figuring out exactly what is going wrong with postgres when my workplace's production environment was having database issues.

Thanks,
Ben Simmons
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais wrote: > In my current design, the scan is done backward from end to start and I keep > all > the records appearing after the last occurrence of their respective FPI. Oh, interesting. That seems like it would require pretty major surgery on the WAL stream. > Summary files looks like what Andrey Borodin described as delta-files and > change maps. Yep. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: block-level incremental backup
Hi,

First, thank you for your answer!

On Wed, 10 Apr 2019 12:21:03 -0400 Robert Haas wrote:
> On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais wrote:
> > My idea would be to create a new tool working on archived WAL. No burden
> > server side. Basic concept is:
> >
> > * parse archives
> > * record latest relevant FPW for the incr backup
> > * write new WALs with recorded FPW and removing/rewriting duplicated
> >   WAL records.
> >
> > It's just a PoC and I haven't finished the WAL writing part...not even
> > talking about the replay part. I'm not even sure this project is a good
> > idea, but it is a good educational exercise for me in the meantime.
> >
> > Anyway, using real life OLTP production archives, my stats were:
> >
> >   # WAL    xlogrec kept    Size WAL kept
> >     127        39%             50%
> >     383        22%             38%
> >     639        20%             29%
> >
> > Based on these stats, I expect this would save a lot of time during
> > recovery as a first step. If it gets mature, it might even save a lot of
> > archive space or extend the retention period with degraded granularity.
> > It would even help taking full backups with a lower frequency.
> >
> > Any thoughts about this design would be much appreciated. I suppose this
> > should be offlist or in a new thread to avoid polluting this thread as
> > this is a slightly different subject.
>
> Interesting idea, but I don't see how it can work if you only deal
> with the FPWs and not the other records. For instance, suppose that
> you take a full backup at time T0, and then at time T1 there are two
> modifications to a certain block in quick succession. That block is
> then never touched again. Since no checkpoint intervenes between the
> modifications, the first one emits an FPI and the second does not.
> Capturing the FPI is fine as far as it goes, but unless you also do
> something with the non-FPI change, you lose that second modification.
> You could fix that by having your tool replicate the effects of WAL
> apply outside the server, but that sounds like a ton of work and a ton
> of possible bugs.

In my current design, the scan is done backward from end to start and I keep all the records appearing after the last occurrence of their respective FPI. The next challenge I have to tackle is dealing with multi-block records where some blocks need to be removed and others are FPIs to keep (eg. UPDATE).

> I have a related idea, though. Suppose that, as Peter says upthread,
> you have a replication slot that prevents old WAL from being removed.
> You also have a background worker that is connected to that slot. It
> decodes WAL and produces summary files containing all block-references
> extracted from those WAL records and the associated LSN (or maybe some
> approximation of the LSN instead of the exact value, to allow for
> compression and combining of nearby references). Then you hold onto
> those summary files after the actual WAL is removed. Now, when
> somebody asks the server for all blocks changed since a certain LSN,
> it can use those summary files to figure out which blocks to send
> without having to read all the pages in the database. Although I
> believe that a simple system that finds modified blocks by reading
> them all is good enough for a first version of this feature and useful
> in its own right, a more efficient system will be a lot more useful,
> and something like this seems to me to be probably the best way to
> implement it.

Summary files look like what Andrey Borodin described as delta-files and change maps.

> With an approach based
> on WAL-scanning, the work is done in the background and nobody has to
> wait for it.

Agree with this.
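The summary-file lookup described here boils down to merging the per-LSN-range files and deduplicating block references newer than the backup's start LSN. A toy sketch, with an invented in-memory file layout standing in for the on-disk format:

```python
def blocks_changed_since(summaries, since_lsn):
    """summaries: iterable of (start_lsn, end_lsn, refs) tuples, where refs
    is a list of (relfile, block, lsn) block references from that WAL range.

    Returns the set of (relfile, block) pairs modified after since_lsn,
    deduplicated across all summary files.
    """
    changed = set()
    for start_lsn, end_lsn, refs in summaries:
        if end_lsn <= since_lsn:
            continue                           # whole range predates the backup
        for relfile, block, lsn in refs:
            if lsn > since_lsn:
                changed.add((relfile, block))  # set membership does the dedup
    return changed
```

The deduplication is also the source of the memory concern raised earlier in the thread: the `changed` set must hold every distinct block reference for the whole interval at once.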
Re: Status of the table access method work
On Wed, Apr 10, 2019 at 8:32 PM Andres Freund wrote:
> On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote:
> > Your explanation of existing limitations looks very good and
> > convincing. But I think there is one you didn't mention. We require
> > new table AMs to basically save old "contract" between heap and
> > indexes. We have "all or nothing" choice during updates. So, storage
> > can either insert to none of indexes or insert to all of them
> > (HOT-like update).
>
> I think that's a problem, and yea, I should have mentioned it. I'd
> earlier thought about it and then forgot.
>
> I however don't think we should design the interface for this before we
> have at least one AM that's actually in decent-ish shape that needs
> it. I seriously doubt we'll get the interface right enough.
>
> Note: I'm *extremely* *extremely* doubtful that moving the full executor
> invocations for expression indices etc into the tableam is a sane thing
> to do. It's possible to convince me there's no alternative, but it'll be
> really hard.
>
> I suspect the right direction will be more going in a direction of
> computing new index tuples for expression indexes before tableam gets
> involved. If we do that right, we can also implement the stuff that
> 1c53c4dec3985512f7f2f53c9d76a5295cd0a2dd reverted in a proper way.

Probably we can invent a few modes in which a table AM might work: calculation of all new index tuples; calculation of new and old index tuples for updated fields; calculation of all new and old index tuples; and so on. And then index tuples would be calculated either in advance or by a callback.

> > I think any storage, which is going to address "write amplification"
> > point raised by Uber, needs to break this "contract".
>
> FWIW, I don't think it makes much point in using Uber as a justification
> for anything here. Their analysis was so deeply flawed and motivated by
> non-technical reasons that it should just be ignored.

Yeah, Uber is just a buzz word here.
But the problem that an update of a single indexed field leads to insertions into every index is well known among PostgreSQL users.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: pgbench - add minimal stats on initialization
Hello,

Thanks for the feedback.

> I have one minor observation that in case of initDropTables you log
> 'drop' and in case of initCreateTables you log 'create table'. We need
> to be consistent. The "drop tables" and "create tables" are the best
> fit here.

Ok. Attached version does that, plus avoids re-assigning "first" on each loop, plus checks that --no-vacuum indeed removes all vacuums in the TAP test.

-- 
Fabien.

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index b67ad5e823..7de0d914f1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3929,32 +3929,48 @@ checkInitSteps(const char *initialize_steps)
 static void
 runInitSteps(const char *initialize_steps)
 {
-	PGconn	   *con;
-	const char *step;
+	PQExpBufferData stats;
+	PGconn	   *con;
+	const char *step;
+	double		run_time = 0.0;
+	bool		first = true;
+
+	initPQExpBuffer(&stats);
 
 	if ((con = doConnect()) == NULL)
 		exit(1);
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
+		instr_time	start;
+		char	   *op = NULL;
+
+		INSTR_TIME_SET_CURRENT(start);
+
 		switch (*step)
 		{
 			case 'd':
+				op = "drop tables";
 				initDropTables(con);
 				break;
 			case 't':
+				op = "create tables";
 				initCreateTables(con);
 				break;
 			case 'g':
+				op = "generate";
 				initGenerateData(con);
 				break;
 			case 'v':
+				op = "vacuum";
 				initVacuum(con);
 				break;
 			case 'p':
+				op = "primary keys";
 				initCreatePKeys(con);
 				break;
 			case 'f':
+				op = "foreign keys";
 				initCreateFKeys(con);
 				break;
 			case ' ':
@@ -3965,10 +3981,30 @@ runInitSteps(const char *initialize_steps)
 				PQfinish(con);
 				exit(1);
 		}
+
+		if (op != NULL)
+		{
+			instr_time	diff;
+			double		elapsed_sec;
+
+			INSTR_TIME_SET_CURRENT(diff);
+			INSTR_TIME_SUBTRACT(diff, start);
+			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+
+			if (!first)
+				appendPQExpBufferStr(&stats, ", ");
+			else
+				first = false;
+
+			appendPQExpBuffer(&stats, "%s %.2f s", op, elapsed_sec);
+
+			run_time += elapsed_sec;
+		}
 	}
 
-	fprintf(stderr, "done.\n");
+	fprintf(stderr, "done in %.2f s (%s).\n", run_time, stats.data);
 	PQfinish(con);
+	termPQExpBuffer(&stats);
 }
 
 /*
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 62906d5e55..e910c2fc01 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -101,7 +101,7 @@ pgbench(
 	[qr{^$}],
 	[
 		qr{creating tables},
 		qr{vacuuming},
-		qr{creating primary keys}, qr{done\.}
+		qr{creating primary keys}, qr{done in \d+\.\d\d s}
 	],
 	'pgbench scale 1 initialization',);
@@ -116,7 +116,8 @@ pgbench(
 		qr{vacuuming},
 		qr{creating primary keys},
 		qr{creating foreign keys},
-		qr{done\.}
+		qr{(?!vacuuming)},	# no vacuum
+		qr{done in \d+\.\d\d s}
 	],
 	'pgbench scale 1 initialization');
@@ -131,7 +132,8 @@ pgbench(
 		qr{creating primary keys},
 		qr{.* of .* tuples \(.*\) done},
 		qr{creating foreign keys},
-		qr{done\.}
+		qr{(?!vacuuming)},	# no vacuum
+		qr{done in \d+\.\d\d s}
 	],
 	'pgbench --init-steps');
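The reporting the patch adds, per-step elapsed times rolled up into a single "done in ... s (...)" line, can be mimicked outside pgbench like this (timings are injected rather than measured so the output is deterministic; the step names follow the patch's labels):

```python
def format_init_report(step_timings):
    """step_timings: list of (step_name, elapsed_seconds) in execution order.
    Mirrors the 'done in %.2f s (%s).' line built with PQExpBuffer above."""
    total = sum(sec for _, sec in step_timings)
    parts = ", ".join(f"{name} {sec:.2f} s" for name, sec in step_timings)
    return f"done in {total:.2f} s ({parts})."

report = format_init_report([
    ("drop tables", 0.12),
    ("create tables", 0.05),
    ("generate", 1.30),
    ("primary keys", 0.40),
])
```

Accumulating the per-step values and printing one summary line at the end is what lets the patch keep a single `fprintf(stderr, ...)` call in place of the old bare "done.".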
Re: Status of the table access method work
Hi, On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote: > Your explanation of existing limitations looks very good and > convincing. But I think there is one you didn't mention. We require > new table AMs to basically preserve the old "contract" between heap and > indexes. We have an "all or nothing" choice during updates: storage > can either insert into none of the indexes or insert into all of them > (HOT-like update). I think that's a problem, and yea, I should have mentioned it. I'd earlier thought about it and then forgot. I however don't think we should design the interface for this before we have at least one AM that's actually in decent-ish shape that needs it. I seriously doubt we'll get the interface right enough. Note: I'm *extremely* *extremely* doubtful that moving the full executor invocations for expression indices etc. into the tableam is a sane thing to do. It's possible to convince me there's no alternative, but it'll be really hard. I suspect the right direction is to compute new index tuples for expression indexes before tableam gets involved. If we do that right, we can also implement the stuff that 1c53c4dec3985512f7f2f53c9d76a5295cd0a2dd reverted in a proper way. > I think any storage that is going to address the "write amplification" > point raised by Uber needs to break this "contract". FWIW, I don't think there's much point in using Uber as a justification for anything here. Their analysis was so deeply flawed and motivated by non-technical reasons that it should just be ignored. Greetings, Andres Freund
Re: Status of the table access method work
On Fri, Apr 5, 2019 at 11:25 PM Andres Freund wrote: > I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the > others that collaborated on making tableam happen. It was/is a huge > project. Thank you so much for bringing this project to commit! Excellent work! Your explanation of existing limitations looks very good and convincing. But I think there is one you didn't mention. We require new table AMs to basically preserve the old "contract" between heap and indexes. We have an "all or nothing" choice during updates: storage can either insert into none of the indexes or insert into all of them (HOT-like update). I think any storage that is going to address the "write amplification" point raised by Uber needs to break this "contract". For example, zheap promises to implement delete-marking indexes. But it's not yet published, and for me it's not clear that this approach is the best among the alternatives. With the delete-marking approach you need to update index tuples corresponding to old values of updated fields. But in addition to that, it's not trivial to delete an index tuple. In order to do that, you need to both locate this index tuple and know that this index value isn't present in the undo chain. So, another index lookup is likely required while purging the undo chain. Thus, we basically need two random index lookups for every deleted index tuple. Also, it becomes more complex to look up the appropriate heap tuple during an index scan. You then need to check not only visibility, but also that the index value matches (here we need to adjust the index_fetch_tuple interface), because the version visible to you might have a different index value. That may lead to O(N^2) performance when accessing a single row with N versions (MySQL InnoDB has this problem). An alternative idea is to have MVCC-aware indexes. This approach looks more attractive to me. In this approach you basically need xmin and xmax fields in index tuples. On insertion of an index tuple you fill its xmin.
On update, the previous index tuple is marked with xmax. After that, outdated index tuples can be deleted in a lazy manner when page space is required. So, only one random access is required per deleted index tuple. With this approach fetching a single row is O(N). Also, index-only scans become very easy and don't even need a visibility map. The only problem here is the extra space required for index tuples. But I think this is a well-isolated problem, which is easy to attack. For instance, some visibility information could be evicted to the undo chain (like zheap does for its tuples). Also, we can have a special bit for "all visible" index tuples. With the "all visible" bit set, a tuple can get rid of its visibility fields. We can do this for index tuples because, if an index tuple requires extra space, we can split the page, unlike the heap, where tuples are fixed in their pages and xmax needs to be updated in place. I understand that delete-marking indexes have some advantages, and some people find them more appealing. But my point is that we shouldn't build one of these approaches into the API unless we have concrete proof that it clearly outperforms the other. It would be better to have our table-AM API flexible enough to implement both. I can imagine proper encapsulation here bringing more of the interaction with indexes to the table AM side. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
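Concretely, the xmin/xmax idea sketched above could look roughly like this (a minimal illustration only, not zheap's or anyone's actual design; the tuple layout, names, and the grossly simplified snapshot rule are all invented for the example):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndexTuple:
    key: int
    heap_ptr: int
    xmin: int                   # xid that inserted this index entry
    xmax: Optional[int]         # xid that marked it dead on update, or None
    all_visible: bool = False   # "all visible" bit: visibility fields implied

def index_tuple_visible(tup: IndexTuple, snapshot_xid: int) -> bool:
    """True if this index entry should be returned to a snapshot that sees
    the effects of all transactions with xid <= snapshot_xid (a gross
    simplification of real snapshot semantics)."""
    if tup.all_visible:
        return True
    if tup.xmin > snapshot_xid:             # inserter not yet visible
        return False
    if tup.xmax is not None and tup.xmax <= snapshot_xid:
        return False                        # deleting xact already visible
    return True
```

The point of the sketch is the one made in the mail: with visibility carried in the index entry itself, an index-only scan can decide visibility without consulting a visibility map, and an updated key leaves behind an old entry that is filtered out lazily rather than deleted eagerly.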
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 12:56 PM Ashwin Agrawal wrote: > Not to fork the conversation from incremental backups, but a similar approach > is what we have been thinking of for pg_rewind. Currently, pg_rewind requires > all the WAL logs to be present on the source side from the point of divergence to > rewind. Instead, just parse the WAL and keep the changed blocks around on the > source. Then we don't need to retain the WAL but can still rewind using the > changed block map. So, rewind becomes very similar to the incremental backup > proposed here after performing rewind activity using target side WAL only. Interesting. So if we build a system like this for incremental backup, or for pg_rewind, the other one can use the same infrastructure. That sounds excellent. I'll start a new thread to talk about that, and hopefully you and Heikki and others will chime in with thoughts. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 7:51 AM Andrey Borodin wrote: > > 9 апр. 2019 г., в 20:48, Robert Haas написал(а): > > - This is just a design proposal at this point; there is no code. If > > this proposal, or some modified version of it, seems likely to be > > acceptable, I and/or my colleagues might try to implement it. > > I'll be happy to help with code, discussion and patch review. That would be great! We should probably give this discussion some more time before we plunge into the implementation phase, but I'd love to have some help with that, whether it's with coding or review or whatever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Cleanup/remove/update references to OID column
On Wed, Apr 10, 2019 at 06:32:35PM +0200, Daniel Verite wrote: > Justin Pryzby wrote: > > > Cleanup/remove/update references to OID column... > > Just spotted a couple of other references that need updates: > #1. In catalogs.sgml: > #2. In ddl.sgml, when describing ctid: I found and included fixes for a few more references: doc/src/sgml/catalogs.sgml | 2 +- doc/src/sgml/ddl.sgml| 3 +-- doc/src/sgml/information_schema.sgml | 4 ++-- doc/src/sgml/ref/create_trigger.sgml | 2 +- doc/src/sgml/spi.sgml| 2 +- Justin >From d77f3d95b8cbb6bbc4addaaf4d9b1bcc11f10f10 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Tue, 2 Apr 2019 19:13:55 -0500 Subject: [PATCH v2] Cleanup/remove/update references to OID column... ..in wake of 578b229718e8f. See also 93507e67c9ca54026019ebec3026de35d30370f9 1464755fc490a9911214817fe83077a3689250ab Reviewed-by: Daniel Verite --- doc/src/sgml/catalogs.sgml | 2 +- doc/src/sgml/ddl.sgml| 12 +--- doc/src/sgml/information_schema.sgml | 4 ++-- doc/src/sgml/ref/create_trigger.sgml | 2 +- doc/src/sgml/ref/insert.sgml | 12 +--- doc/src/sgml/ref/psql-ref.sgml | 3 +++ doc/src/sgml/spi.sgml| 2 +- 7 files changed, 18 insertions(+), 19 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 40ddec4..d544e60 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -1047,7 +1047,7 @@ The number of the column. Ordinary columns are numbered from 1 - up. System columns, such as oid, + up. System columns, such as ctid, have (arbitrary) negative numbers. diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 9e761db..244d5ce 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -1202,8 +1202,7 @@ CREATE TABLE circles ( ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row - identifier. The OID, or even better a user-defined serial - number, should be used to identify logical rows. + identifier. 
A primary key should be used to identify logical rows. @@ -3672,11 +3671,10 @@ VALUES ('Albany', NULL, NULL, 'NY'); Partitions cannot have columns that are not present in the parent. It is not possible to specify columns when creating partitions with - CREATE TABLE, nor is it possible to add columns to - partitions after-the-fact using ALTER TABLE. Tables may be - added as a partition with ALTER TABLE ... ATTACH PARTITION - only if their columns exactly match the parent, including any - oid column. + CREATE TABLE, to add columns to + partitions after-the-fact using ALTER TABLE, nor to + add a partition with ALTER TABLE ... ATTACH PARTITION + if its columns would not exactly match those of the parent. diff --git a/doc/src/sgml/information_schema.sgml b/doc/src/sgml/information_schema.sgml index 1321ade..9c618b1 100644 --- a/doc/src/sgml/information_schema.sgml +++ b/doc/src/sgml/information_schema.sgml @@ -1312,8 +1312,8 @@ The view columns contains information about all table columns (or view columns) in the database. System columns - (oid, etc.) are not included. Only those columns are - shown that the current user has access to (by way of being the + (ctid, etc.) are not included. The only columns shown + are those to which the current user has access (by way of being the owner or having some privilege). diff --git a/doc/src/sgml/ref/create_trigger.sgml b/doc/src/sgml/ref/create_trigger.sgml index 6456105..3339a4b 100644 --- a/doc/src/sgml/ref/create_trigger.sgml +++ b/doc/src/sgml/ref/create_trigger.sgml @@ -465,7 +465,7 @@ UPDATE OF column_name1 [, column_name2NEW row seen by the condition is the current value, as possibly modified by earlier triggers. Also, a BEFORE trigger's WHEN condition is not allowed to examine the - system columns of the NEW row (such as oid), + system columns of the NEW row (such as ctid), because those won't have been set yet. 
diff --git a/doc/src/sgml/ref/insert.sgml b/doc/src/sgml/ref/insert.sgml index 62e142f..3e1be4c 100644 --- a/doc/src/sgml/ref/insert.sgml +++ b/doc/src/sgml/ref/insert.sgml @@ -552,13 +552,11 @@ INSERT INTO table_name [ AS oid count The count is the - number of rows inserted or updated. If count is exactly one, and the - target table has OIDs, then oid is the OID - assigned to the inserted row. The single row must have been - inserted rather than updated. Otherwise oid is zero. + number of rows inserted or updated. + oid used to be the object ID of the inserted row + if rows was 1 and the target table had OIDs, but + OIDs system columns are not supported anymore; therefore + oid is always 0.
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 9:21 AM Robert Haas wrote: > I have a related idea, though. Suppose that, as Peter says upthread, > you have a replication slot that prevents old WAL from being removed. > You also have a background worker that is connected to that slot. It > decodes WAL and produces summary files containing all block-references > extracted from those WAL records and the associated LSN (or maybe some > approximation of the LSN instead of the exact value, to allow for > compression and combining of nearby references). Then you hold onto > those summary files after the actual WAL is removed. Now, when > somebody asks the server for all blocks changed since a certain LSN, > it can use those summary files to figure out which blocks to send > without having to read all the pages in the database. Although I > believe that a simple system that finds modified blocks by reading > them all is good enough for a first version of this feature and useful > in its own right, a more efficient system will be a lot more useful, > and something like this seems to me to be probably the best way to > implement it. > Not to fork the conversation from incremental backups, but a similar approach is what we have been thinking of for pg_rewind. Currently, pg_rewind requires all the WAL logs to be present on the source side from the point of divergence to rewind. Instead, just parse the WAL and keep the changed blocks around on the source. Then we don't need to retain the WAL but can still rewind using the changed block map. So, rewind becomes very similar to the incremental backup proposed here after performing rewind activity using target side WAL only.
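The shared piece of infrastructure both posts describe — extract block references from decoded WAL into a changed-block map, then answer "which blocks changed since LSN X" for either a rewind or an incremental backup — can be sketched abstractly like this (all structures and names here are invented for illustration; real WAL decoding is of course far more involved):

```python
def build_changed_block_map(wal_records):
    """wal_records: iterable of (lsn, [(relfile, blockno), ...]) pairs, one
    per decoded WAL record. Returns a map from (relfile, blockno) to the
    highest LSN that touched that block."""
    changed = {}
    for lsn, block_refs in wal_records:
        for ref in block_refs:
            changed[ref] = max(lsn, changed.get(ref, 0))
    return changed

def blocks_changed_since(changed, start_lsn):
    """The set of blocks an incremental backup (or a rewind) must copy:
    everything last touched at or after start_lsn."""
    return {ref for ref, lsn in changed.items() if lsn >= start_lsn}
```

The appeal of this shape is exactly the point made above: once the map exists, the WAL segments it was built from can be discarded, yet since-LSN queries can still be answered.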
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik wrote: > Some times ago I have implemented alternative version of ptrack utility > (not one used in pg_probackup) > which detects updated block at file level. It is very simple and may be > it can be sometimes integrated in master. I don't think this is completely crash-safe. It looks like it arranges to msync() the ptrack file at appropriate times (although I haven't exhaustively verified the logic), but it uses MS_ASYNC, so it's possible that the ptrack file could get updated on disk either before or after the relation file itself. I think before is probably OK -- it just risks having some blocks look modified when they aren't really -- but after seems like it is very much not OK. And changing this to use MS_SYNC would probably be really expensive. Likely a better approach would be to hook into the new fsync queue machinery that Thomas Munro added to PostgreSQL 12. It looks like your system maps all the blocks in the system into a fixed-size map using hashing. If the number of modified blocks between the full backup and the incremental backup is large compared to the size of the ptrack map, you'll start to get a lot of false-positives. It will look as if much of the database needs to be backed up. For example, in your sample configuration, you have ptrack_map_size = 103. If you've got a 100GB database with 20% daily turnover, that's about 2.6 million blocks. If you bump a random entry ~2.6 million times in a map with 103 entries, on the average ~92% of the entries end up getting bumped, so you will get very little benefit from incremental backup. This problem drops off pretty fast if you raise the size of the map, but it's pretty critical that your map is large enough for the database you've got, or you may as well not bother. It also appears that your system can't really handle resizing of the map in any friendly way.
So if your data size grows, you may be faced with either letting the map become progressively less effective, or throwing it out and losing all the data you have. None of that is to say that what you're presenting here has no value, but I think it's possible to do better (and I think we should try). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
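The "~92% of the entries end up getting bumped" claim follows the standard occupancy formula for random hashing. A quick sketch (note: the map size used in the example call is an assumption chosen for illustration, since the sample configuration value quoted above appears truncated):

```python
def expected_bumped_fraction(n_bumps: int, map_entries: int) -> float:
    """Expected fraction of slots hit at least once when n_bumps hashes land
    uniformly at random in a map with map_entries slots:
    1 - (1 - 1/m)^n, which is approximately 1 - exp(-n/m)."""
    return 1.0 - (1.0 - 1.0 / map_entries) ** n_bumps

# ~2.6 million modified blocks hashed into a one-million-entry map (assumed
# size) marks roughly 92% of the map, so nearly the whole database would be
# resent; a map 100x larger keeps the false-positive rate under 3%.
dense = expected_bumped_fraction(2_600_000, 1_000_000)
sparse = expected_bumped_fraction(2_600_000, 100_000_000)
```

This makes the sizing argument concrete: the false-positive rate is governed by the ratio of modified blocks to map entries, which is why a map that is too small for the database's turnover "may as well not bother".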
Re: Cleanup/remove/update references to OID column
Justin Pryzby wrote: > Cleanup/remove/update references to OID column... > > ..in wake of 578b229718e8f. Just spotted a couple of other references that need updates: #1. In catalogs.sgml: attnum int2 The number of the column. Ordinary columns are numbered from 1 up. System columns, such as oid, have (arbitrary) negative numbers. oid should be replaced by xmin or some other system column. #2. In ddl.sgml, when describing ctid: The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. The OID, or even better a user-defined serial number, should be used to identify logical rows. "The OID" used to refer to an entry above in that list, now it's not clear what it refers to. "serial number" also sounds somewhat obsolete now that IDENTITY is supported. The last sentence could be changed to: "A primary key should be used to identify logical rows". Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais wrote: > My idea would be to create a new tool working on archived WAL. No burden > server side. Basic concept is: > > * parse archives > * record latest relevant FPW for the incr backup > * write new WALs with recorded FPW and removing/rewriting duplicated > WAL records. > > It's just a PoC and I hadn't finished the WAL writing part...not even talking > about the replay part. I'm not even sure this project is a good idea, but it > is > a good educational exercise to me in the meantime. > > Anyway, using real life OLTP production archives, my stats were:
> # WAL | xlogrec kept | Size WAL kept
>   127 |          39% |           50%
>   383 |          22% |           38%
>   639 |          20% |           29%
> Based on these stats, I expect this would save a lot of time during recovery in > a first step. If it gets mature, it might even save a lot of archive space or > extend the retention period with degraded granularity. It would even help > taking full backups with a lower frequency. > > Any thoughts about this design would be much appreciated. I suppose this > should > be offlist or in a new thread to avoid polluting this thread as this is a > slightly different subject. Interesting idea, but I don't see how it can work if you only deal with the FPWs and not the other records. For instance, suppose that you take a full backup at time T0, and then at time T1 there are two modifications to a certain block in quick succession. That block is then never touched again. Since no checkpoint intervenes between the modifications, the first one emits an FPI and the second does not. Capturing the FPI is fine as far as it goes, but unless you also do something with the non-FPI change, you lose that second modification. You could fix that by having your tool replicate the effects of WAL apply outside the server, but that sounds like a ton of work and a ton of possible bugs. I have a related idea, though.
Suppose that, as Peter says upthread, you have a replication slot that prevents old WAL from being removed. You also have a background worker that is connected to that slot. It decodes WAL and produces summary files containing all block-references extracted from those WAL records and the associated LSN (or maybe some approximation of the LSN instead of the exact value, to allow for compression and combining of nearby references). Then you hold onto those summary files after the actual WAL is removed. Now, when somebody asks the server for all blocks changed since a certain LSN, it can use those summary files to figure out which blocks to send without having to read all the pages in the database. Although I believe that a simple system that finds modified blocks by reading them all is good enough for a first version of this feature and useful in its own right, a more efficient system will be a lot more useful, and something like this seems to me to be probably the best way to implement it. The reason why I think this is likely to be superior to other possible approaches, such as the ptrack approach Konstantin suggests elsewhere on this thread, is because it pushes the work of figuring out which blocks have been modified into the background. With a ptrack-type approach, the server has to do some non-zero amount of extra work in the foreground every time it modifies a block. With an approach based on WAL-scanning, the work is done in the background and nobody has to wait for it. It's possible that there are other considerations which aren't occurring to me right now, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
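The "simple system that finds modified blocks by reading them all" baseline mentioned above — scan every relation block and keep the ones whose page-header LSN is at or past the threshold — can be sketched as follows (a deliberately naive illustration, not working backup code; the layout assumed here follows PostgreSQL's page header, where pd_lsn is stored first as two 32-bit halves, but endianness and everything else is simplified):

```python
import struct

BLCKSZ = 8192

def page_lsn(page: bytes) -> int:
    """Read pd_lsn from the start of a page: (xlogid, xrecoff) halves,
    assumed little-endian here for illustration."""
    xlogid, xrecoff = struct.unpack_from("<II", page, 0)
    return (xlogid << 32) | xrecoff

def modified_blocks(relfile: bytes, threshold_lsn: int):
    """Yield (blockno, page) for every page stamped at or after
    threshold_lsn -- the blocks an incremental backup would send."""
    for blockno in range(len(relfile) // BLCKSZ):
        page = relfile[blockno * BLCKSZ:(blockno + 1) * BLCKSZ]
        if page_lsn(page) >= threshold_lsn:
            yield blockno, page
```

As the discussion notes, this reads every page (full read I/O) but sends only the changed ones (reduced network I/O), and it needs no prior setup; the WAL-summary approach trades background work for avoiding the full scan.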
Re: Should the docs have a warning about pg_stat_reset()?
On Sat, 30 Mar 2019 at 00:59, Robert Haas wrote: > > On Wed, Mar 27, 2019 at 7:49 PM David Rowley > wrote: > > Yeah, analyze, not vacuum. It is a bit scary to add new ways for > > auto-vacuum to suddenly have a lot of work to do. When all workers > > are busy it can lead to neglect of other duties. It's true that there > > won't be much in the way of routine vacuuming work for the database > > the stats were just reset on, as of course, all the n_dead_tup > > counters were just reset. However, it could starve other databases of > > vacuum attention. Anti-wraparound vacuums on the current database may > > get neglected too. > > > > I'm not saying let's not do it, I'm just saying we need to think of > > what bad things could happen as a result of such a change. > > +1. I think that if we documented that pg_stat_reset() is going to > trigger an auto-analyze of every table in your system, we'd have some > people who didn't read the documentation and unleashed a storm of > auto-analyze activity, and other people who did read the documentation > and then intentionally used it to unleash a storm of auto-analyze > activity. Neither sounds that great. I still think we should start with a warning about pg_stat_reset(). People are surprised by this, and these are just the ones who notice: https://www.postgresql.org/message-id/CAB_myF4sZpxNXdb-x=welpqbdou6ue8fhtm0fverpm-1j7p...@mail.gmail.com I imagine there are many others just suffering from bloat due to auto-vacuum not knowing how many dead tuples there are in the tables. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: pg_dump is broken for partition tablespaces
Hi, On 2019-04-10 09:28:21 -0400, Alvaro Herrera wrote: > So I think that apart from David's patch, we should just document all > these things carefully. Yea, I think that's the most important part. I'm not convinced that we should have any inheriting behaviour, btw - it seems like there are a lot of different ways one could think this should behave, each with good reasons. Greetings, Andres Freund
Re: Minimal logical decoding on standbys
Hi, On 2019-04-10 12:11:21 +0530, tushar wrote: > > On 03/13/2019 08:40 PM, tushar wrote: > > Hi , > > > > I am getting a server crash on standby while executing > > pg_logical_slot_get_changes function , please refer this scenario > > > > Master cluster( ./initdb -D master) > > set wal_level='hot_standby in master/postgresql.conf file > > start the server , connect to psql terminal and create a physical > > replication slot ( SELECT * from > > pg_create_physical_replication_slot('p1');) > > > > perform pg_basebackup using --slot 'p1' (./pg_basebackup -D slave/ -R > > --slot p1 -v)) > > set wal_level='logical' , hot_standby_feedback=on, > > primary_slot_name='p1' in slave/postgresql.conf file > > start the server , connect to psql terminal and create a logical > > replication slot ( SELECT * from > > pg_create_logical_replication_slot('t','test_decoding');) > > > > run pgbench ( ./pgbench -i -s 10 postgres) on master and select > > pg_logical_slot_get_changes on Slave database > > > > postgres=# select * from pg_logical_slot_get_changes('t',null,null); > > 2019-03-13 20:34:50.274 IST [26817] LOG: starting logical decoding for > > slot "t" > > 2019-03-13 20:34:50.274 IST [26817] DETAIL: Streaming transactions > > committing after 0/6C60, reading WAL from 0/6C28. > > 2019-03-13 20:34:50.274 IST [26817] STATEMENT: select * from > > pg_logical_slot_get_changes('t',null,null); > > 2019-03-13 20:34:50.275 IST [26817] LOG: logical decoding found > > consistent point at 0/6C28 > > 2019-03-13 20:34:50.275 IST [26817] DETAIL: There are no running > > transactions. > > 2019-03-13 20:34:50.275 IST [26817] STATEMENT: select * from > > pg_logical_slot_get_changes('t',null,null); > > TRAP: FailedAssertion("!(data == tupledata + tuplelen)", File: > > "decode.c", Line: 977) > > server closed the connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request. > > The connection to the server was lost. 
Attempting reset: 2019-03-13 > > 20:34:50.276 IST [26809] LOG: server process (PID 26817) was terminated > > by signal 6: Aborted > > > Andres - Do you think - this is an issue which needs to be fixed ? Yes, it definitely needs to be fixed. I just haven't had sufficient time to look into it. Have you reproduced this with Amit's latest version? Amit, have you spent any time looking into it? I know that you're not that deeply steeped into the internals of logical decoding, but perhaps there's something obvious going on. Greetings, Andres Freund
Re: block-level incremental backup
On Tue, Apr 9, 2019 at 5:28 PM Alvaro Herrera wrote: > On 2019-Apr-09, Peter Eisentraut wrote: > > On 2019-04-09 17:48, Robert Haas wrote: > > > 3. There should be a new tool that knows how to merge a full backup > > > with any number of incremental backups and produce a complete data > > > directory with no remaining partial files. > > > > Are there by any chance standard file formats and tools that describe a > > binary difference between directories? That would be really useful here. > > VCDIFF? https://tools.ietf.org/html/rfc3284 I don't understand VCDIFF very well, but I see some potential problems with going in this direction. First, suppose we take a full backup on Monday. Then, on Tuesday, we want to take an incremental backup. In my proposal, the backup client only needs to provide the database server with one piece of information: the start-LSN of the previous backup. The server determines which blocks are recently modified and sends them to the client, which stores them. The end. On the other hand, storing a maximally compact VCDIFF seems to require that, for each block modified in the Tuesday backup, we go read the corresponding block as it existed on Monday. Assuming that the server is using some efficient method of locating modified blocks, this will approximately double the amount of read I/O required to complete the backup: either the server or the client must now read not only the current version of the block but also the previous version. If the previous backup is an incremental backup that does not contain full block images but only VCDIFF content, whoever is performing the VCDIFF calculation will need to walk the entire backup chain and reconstruct the previous contents of each block so that it can compute the newest VCDIFF. A customer who does an incremental backup every day and maintains a synthetic full backup from 1 week prior will see a roughly eightfold increase in read I/O compared to the design I proposed.
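The read-I/O arithmetic in this argument can be made explicit. A small sketch, where the figures simply restate the assumptions used in this discussion (a 7-backup chain back to a full backup, and 20% daily turnover):

```python
def backup_read_amplification(chain_length: int) -> int:
    """Reads needed per changed block to compute a minimal delta: the current
    version, plus one reconstruction step per older backup in the chain."""
    return 1 + chain_length

def restore_read_cost(turnover: float, incrementals: int) -> float:
    """Total data read at restore time, as a fraction of database size:
    the full backup (1.0) plus each incremental read in its entirety."""
    return 1.0 + turnover * incrementals

# A week-old synthetic full with daily incrementals: chain of 7 older backups.
assert backup_read_amplification(7) == 8            # the "eightfold" increase
assert abs(restore_read_cost(0.20, 6) - 2.2) < 1e-9  # 100% + 6*20% = 220%
```

The block-copy design, by contrast, keeps both functions constant: one read per changed block at backup time, and roughly one database's worth of reads at restore time.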
The same problem exists at restore time. In my design, the total read I/O required is equal to the size of the database, plus however much metadata needs to be read from older delta files -- and that should be fairly small compared to the actual data being read, at least in normal, non-extreme cases. But if we are going to proceed by applying a series of delta files, we're going to need to read every older backup in its entirety. If the turnover percentage is significant, say 20%/day, and if the backup chain is say 7 backups long to get back to a full backup, this is a huge difference. Instead of having to read ~100% of the database size, as in my proposal, we'll need to read 100% + (6 * 20%) = 220% of the database size. Since VCDIFF uses an add-copy-run language to describe differences, we could try to work around the problem that I just described by describing each changed data block as an 8192-byte add, and unchanged blocks as an 8192-byte copy. If we did that, then I think that the problem at backup time goes away: we can write out a VCDIFF-format file for the changed blocks based just on knowing that those are the blocks that have changed, without needing to access the older file. Of course, if we do it this way, the file will be larger than it would be if we actually compared the old and new block contents and wrote out a minimal VCDIFF, but it does make taking a backup a lot simpler. Even with this proposal, though, I think we still have trouble with restore time. I proposed putting the metadata about which blocks are included in a delta file at the beginning of the file, which allows a restore of a new incremental backup to relatively efficiently flip through older backups to find just the blocks that it needs, without having to read the whole file. But I think (although I am not quite sure) that in the VCDIFF format, the payload for an ADD instruction is stored near the instruction itself.
The result would be that you'd have to basically read the whole file at restore time to figure out which blocks were available from that file and which ones needed to be retrieved from an older backup. So while this approach would fix the backup-time problem, I believe that it would still require significantly more read I/O at restore time than my proposal. Furthermore, if, at backup time, we have to do anything that requires access to the old data, either the client or the server needs to have access to that data. Notwithstanding the costs of reading it, that doesn't seem very desirable. The server is quite unlikely to have access to the backups, because most users want to back up to a different server in order to guard against a hardware failure. The client is more likely to be running on a machine where it has access to the data, because many users back up to the same machine every day, so the machine that is taking the current backup probably has the older one. However, accessing that old backup might not be cheap. It could be located in an object store in the
Re: Failure in contrib test _int on loach
On 09/04/2019 19:11, Anastasia Lubennikova wrote: 05.04.2019 19:41, Anastasia Lubennikova writes: In attachment, you can find a patch with a test that allows the bug to be reproduced on every run rather than randomly. Now I'm trying to find a way to fix the issue. The problem was caused by incorrect detection of the page to insert the new tuple into after a split. If gistinserttuple() of the tuple formed by gistgetadjusted() had to split the page, we must go back to the parent and descend back to the child that's a better fit for the new tuple. Previously this was handled by the code block with the following comment: * Concurrent split detected. There's no guarantee that the * downlink for this page is consistent with the tuple we're * inserting anymore, so go back to parent and rechoose the best * child. After introducing GistBuildNSN this code path became unreachable. To fix it, I added a new flag to detect such splits during index build. Isn't it possible that the grandparent page is also split, so that we'd need to climb further up? - Heikki
Re: Berserk Autovacuum (let's save next Mandrill)
Hi, On April 10, 2019 8:13:06 AM PDT, Alvaro Herrera wrote: >On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote: > >> Alternative point of "if your database is super large and actively >written, >> you may want to set autovacuum_freeze_max_age to even smaller values >so >> that autovacuum load is more evenly spread over time" may be needed. > >I don't think it's helpful to force emergency vacuuming more >frequently; >quite the contrary, it's likely to cause even more issues. We should >tweak autovacuum to perform freezing more preemptively instead. I still think the fundamental issue with making vacuum less painful is that all indexes have to be read entirely, even if there's not much work (say millions of rows frozen, hundreds removed). Without that issue we could vacuum much more frequently. And do it properly in insert-only workloads. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Berserk Autovacuum (let's save next Mandrill)
On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote: > Alternative point of "if your database is super large and actively written, > you may want to set autovacuum_freeze_max_age to even smaller values so > that autovacuum load is more evenly spread over time" may be needed. I don't think it's helpful to force emergency vacuuming more frequently; quite the contrary, it's likely to cause even more issues. We should tweak autovacuum to perform freezing more preemptively instead. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: block-level incremental backup
Hi, On Tue, 9 Apr 2019 11:48:38 -0400 Robert Haas wrote: > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have > developed technology that permits a block-level incremental backup to > be taken from a PostgreSQL server. I believe the idea in all of those > cases is that non-relation files should be backed up in their > entirety, but for relation files, only those blocks that have been > changed need to be backed up. I would like to propose that we should > have a solution for this problem in core, rather than leaving it to > each individual PostgreSQL company to develop and maintain their own > solution. Generally my idea is: > > 1. There should be a way to tell pg_basebackup to request from the > server only those blocks where LSN >= threshold_value. There are > several possible ways for the server to implement this, the simplest > of which is to just scan all the blocks and send only the ones that > satisfy that criterion. That might sound dumb, but it does still save > network bandwidth, and it works even without any prior setup. +1 this is a simple design and probably an easy first step bringing a lot of benefits already. > It will probably be more efficient in many cases to instead scan all the WAL > generated since that LSN and extract block references from it, but > that is only possible if the server has all of that WAL available or > can somehow get it from the archive. I seize the opportunity to discuss this on the fly. I've been playing with the idea of producing incremental backups from archives for many years, but I've only started PoC'ing on it this year. My idea would be to create a new tool working on archived WAL. No burden server side. The basic concept is: * parse archives * record the latest relevant FPW for the incremental backup * write new WALs with the recorded FPWs, removing/rewriting duplicated WAL records. It's just a PoC and I haven't finished the WAL writing part... not even talking about the replay part.
I'm not even sure this project is a good idea, but it is a good educational exercise for me in the meantime. Anyway, using real-life OLTP production archives, my stats were:

  # WAL   xlogrec kept   Size WAL kept
    127   39%            50%
    383   22%            38%
    639   20%            29%

Based on these stats, I expect this would save a lot of time during recovery in a first step. If it gets mature, it might even save a lot of archive space or extend the retention period with degraded granularity. It would even help taking full backups with a lower frequency. Any thoughts about this design would be much appreciated. I suppose this should be offlist or in a new thread to avoid polluting this thread as this is a slightly different subject. Regards, PS: I was surprised to still find some existing piece of code related to pglesslog in core. This project has been discontinued and the WAL format changed in the meantime.
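The changemap bookkeeping in the PoC described above — scan archived WAL and remember only the latest record touching each block — can be sketched with a toy parser. Everything here is an assumption for illustration (it models `pg_waldump`-style `blkref #N: rel X/Y/Z blk B` text lines, not whatever format the PoC actually reads):

```python
import re

# "blkref" lines as printed by pg_waldump; treating them as the input
# format is an assumption of this sketch, not the PoC's actual design.
BLKREF = re.compile(r"blkref #\d+: rel (\d+/\d+/\d+)(?: fork \w+)? blk (\d+)")

def changed_blocks(waldump_lines):
    """Map each (relfilenode, block) touched by the given WAL-record
    descriptions to the index of the *last* record touching it -- the
    'keep only the latest FPW' idea from the post: later records simply
    overwrite earlier entries."""
    latest = {}
    for recno, line in enumerate(waldump_lines):
        for rel, blk in BLKREF.findall(line):
            latest[(rel, int(blk))] = recno
    return latest

lines = [
    "rmgr: Heap   len: 54  blkref #0: rel 1663/13593/16396 blk 1640",
    "rmgr: Heap   len: 54  blkref #0: rel 1663/13593/16396 blk 1640",
    "rmgr: Btree  len: 72  blkref #0: rel 1663/13593/16402 blk 3",
]
print(changed_blocks(lines))
```

Only the most recent image of each block needs to survive into the rewritten WAL, which is where the "xlogrec kept" reductions in the table above come from.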
Re: block-level incremental backup
On 09.04.2019 18:48, Robert Haas wrote: 1. There should be a way to tell pg_basebackup to request from the server only those blocks where LSN >= threshold_value. Some time ago I implemented an alternative version of the ptrack utility (not the one used in pg_probackup) which detects updated blocks at file level. It is very simple and maybe it can someday be integrated into master. I attached a patch against vanilla to this mail. Right now it contains just two GUCs: ptrack_map_size: Size of ptrack map (number of elements) used for incremental backup: 0 disabled. ptrack_block_log: Logarithm of ptrack block size (amount of pages) and one function: pg_ptrack_get_changeset(startlsn pg_lsn) returns {relid,relfilenode,reltablespace,forknum,blocknum,segsize,updlsn,path} The idea is very simple: it creates a hash map of fixed size (ptrack_map_size) and stores the LSN of written pages in this map. As the default postgres page size seems to be too small for a ptrack block (requiring too large a hash map or increasing the number of conflicts, as well as increasing the number of random reads), it is possible to configure a ptrack block to consist of multiple pages (a power of 2). This patch uses the memory mapping mechanism. Unfortunately there is no portable wrapper for it in Postgres, so I had to provide my own implementations for Unix/Windows. Certainly it is not good and should be rewritten. How to use?

1. Define ptrack_map_size in postgres.conf, for example (use a simple number for more uniform hashing): ptrack_map_size = 103

2. Remember the current lsn.

psql postgres -c "select pg_current_wal_lsn()"
 pg_current_wal_lsn
--------------------
 0/224A268
(1 row)

3. Do some updates. $ pgbench -T 10 postgres

4. Select changed blocks.
select * from pg_ptrack_get_changeset('0/224A268');

 relid | relfilenode | reltablespace | forknum | blocknum | segsize |  updlsn   | path
-------+-------------+---------------+---------+----------+---------+-----------+------------------
 16390 |       16396 |          1663 |       0 |     1640 |       1 | 0/224FD88 | base/12710/16396
 16390 |       16396 |          1663 |       0 |     1641 |       1 | 0/2258680 | base/12710/16396
 16390 |       16396 |          1663 |       0 |     1642 |       1 | 0/22615A0 | base/12710/16396
 ...

Certainly ptrack should be used as part of some backup tool (as pg_basebackup or pg_probackup). -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 61a8f11..f4b8506 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -25,9 +25,20 @@
 #include
 #include
+#include
+#ifndef WIN32
+#include "sys/mman.h"
+#endif
+
 #include "miscadmin.h"
+#include "funcapi.h"
+#include "access/hash.h"
+#include "access/table.h"
 #include "access/xlogutils.h"
 #include "access/xlog.h"
+#include "access/htup_details.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_tablespace.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
@@ -36,6 +47,7 @@
 #include "storage/relfilenode.h"
 #include "storage/smgr.h"
 #include "storage/sync.h"
+#include "utils/builtins.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "pg_trace.h"
@@ -116,6 +128,18 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */
  */
 #define EXTENSION_DONT_CHECK_SIZE (1 << 4)

+/*
+ * Size of ptrack map (number of entries)
+ */
+int ptrack_map_size;
+
+/*
+ * Logarithm of ptrack block size (amount of pages)
+ */
+int ptrack_block_log;
+
+static int ptrack_fd;
+static pg_atomic_uint64* ptrack_map;

 /* local routines */
 static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
@@ -138,7 +162,7 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
 			BlockNumber blkno, bool skipFsync, int behavior);
 static BlockNumber
 	_mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg);
-
+static void ptrack_mark_block(SMgrRelation reln, ForkNumber forkno, BlockNumber blkno);

 /*
  * mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -422,6 +446,8 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		register_dirty_segment(reln, forknum, v);

 	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+
+	ptrack_mark_block(reln, forknum, blocknum);
 }

 /*
@@ -575,6 +601,8 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 		nblocks -= nflush;
 		blocknum += nflush;
 	}
+
+	ptrack_sync();
 }

 /*
@@ -700,6 +728,8 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	if (!skipFsync && !SmgrIsTemp(reln))
 		register_dirty_segment(reln, forknum, v);
+
+	ptrack_mark_block(reln, forknum, blocknum);
 }

 /*
@@ -886,6 +916,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 			FilePathName(v->mdfd_vfd;
 		segno--;
 	}
+
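Stripped to its essentials, the ptrack scheme in the patch excerpt above is a fixed-size map keyed by a hash of the block address, where each slot stores the highest LSN written. A toy model of that bookkeeping (a sketch only — names and values are illustrative, not the patch's C code):

```python
PTRACK_MAP_SIZE = 103    # "use a simple number for more uniform hashing"
PTRACK_BLOCK_LOG = 2     # a ptrack block = 2**2 = 4 heap pages

ptrack_map = [0] * PTRACK_MAP_SIZE  # slot -> max LSN seen for that slot

def slot_for(relfilenode, blocknum):
    # group pages into ptrack blocks, then hash into the fixed-size map
    return hash((relfilenode, blocknum >> PTRACK_BLOCK_LOG)) % PTRACK_MAP_SIZE

def mark_block(relfilenode, blocknum, lsn):
    # what the mdextend()/mdwrite() hooks in the patch conceptually do
    s = slot_for(relfilenode, blocknum)
    ptrack_map[s] = max(ptrack_map[s], lsn)

def maybe_changed(relfilenode, blocknum, since_lsn):
    # Conservative answer: hash collisions (or neighbors in the same
    # ptrack block) can only produce false positives, never a missed
    # change -- the property an incremental backup tool relies on.
    return ptrack_map[slot_for(relfilenode, blocknum)] >= since_lsn

mark_block(16396, 1640, lsn=200)
print(maybe_changed(16396, 1640, since_lsn=100))  # True
print(maybe_changed(16396, 1640, since_lsn=300))  # False
```

The tradeoff the post describes falls out directly: a larger `ptrack_block_log` shrinks the map but coarsens the answer, since every page in a ptrack block shares one slot.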
Re: Berserk Autovacuum (let's save next Mandrill)
On Sat, Apr 6, 2019 at 9:56 AM Darafei "Komяpa" Praliaskouski wrote:
>> Invoking autovacuum on a table based on inserts, not only deletes
>> and updates, seems a good idea to me. But in this case, I think that we
>> can not only freeze tuples but also update the visibility map even when
>> setting all-visible. Roughly speaking I think vacuum does the
>> following operations.
>>
>> 1. heap vacuum
>> 2. HOT pruning
>> 3. freezing tuples
>> 4. updating visibility map (all-visible and all-frozen)
>> 5. index vacuum/cleanup
>> 6. truncation
>>
>> With the proposed patch[1] we can control whether to do 5 or not. In addition
>> to that, another proposed patch[2] allows us to control 6.
>
> [1] is committed, [2] nears commit. Seems we have now all the infra to
> teach autovacuum to run itself based on inserts and not hurt anybody?
>
>> [1] https://commitfest.postgresql.org/22/1817/
>> [2] https://commitfest.postgresql.org/22/1981/

Reading the thread and the patch, I generally agree that: 1. With the current infrastructure, having autovacuum periodically scan append-only tables for freezing would be good, and 2. I can't think of any cases where this would be a bad thing. Also, I am not 100% convinced that the problems are avoidable by setting the wraparound prevention thresholds low enough. In cases where one is doing large bulk inserts all the time, vacuum freeze could have a lot of work to do, and in some cases I could imagine IO storms making that difficult. I plan to run some benchmarks on this to try to assess the performance impact of this patch in standard pgbench scenarios. I will also try to come up with some other benchmarks for append-only workloads.

> --
> Darafei Praliaskouski
> Support me: http://patreon.com/komzpa

-- Best Regards, Chris Travers Head of Database Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com Saarbrücker Straße 37a, 10405 Berlin
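For concreteness: today's autovacuum fires when dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples`; what this thread discusses is adding an analogous trigger driven by inserted tuples, so append-only tables get freezing and visibility-map updates without waiting for wraparound. A sketch of that decision — the insert-side parameter names and defaults here are hypothetical, not anything committed:

```python
def need_autovacuum(n_dead, n_inserted_since_vacuum, reltuples,
                    vac_threshold=50, vac_scale=0.2,
                    ins_threshold=10_000, ins_scale=0.2):
    """First test mirrors the existing dead-tuple trigger
    (autovacuum_vacuum_threshold / _scale_factor); the second is the
    hypothetical insert-driven trigger discussed in the thread."""
    if n_dead > vac_threshold + vac_scale * reltuples:
        return True   # classic update/delete-driven vacuum
    if n_inserted_since_vacuum > ins_threshold + ins_scale * reltuples:
        return True   # append-only table: vacuum to freeze + set VM bits
    return False

# A 1M-row append-only table with zero dead tuples still gets vacuumed
# once enough rows have been inserted since the last run.
print(need_autovacuum(n_dead=0, n_inserted_since_vacuum=250_000,
                      reltuples=1_000_000))  # True
```

Spreading this work over many small insert-triggered runs is exactly the alternative to the anti-wraparound "emergency" vacuum that the thread argues against forcing more often.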
Re: PostgreSQL pollutes the file system
Re: Fred .Flintstone 2019-04-10 > Does anyone oppose the proposal? I don't think part #3 has been discussed, and I'd oppose printing these warnings. Christoph
Re: PostgreSQL pollutes the file system
Does anyone oppose the proposal? How can we determine consensus? Is there any voting process? Is there any developer more versed with C than me who can write this patch? On Wed, Apr 10, 2019 at 2:52 PM Christoph Berg wrote: > > Re: Fred .Flintstone 2019-04-10 > > > It seems we do have a clear path forward on how to accomplish this and > > implement this change. > > > > 1. Rename executables to carry the pg_ prefix. > > 2. Create symlinks from the old names to the new names. > > 3. Modify the executables to read argv[0] and print a warning if the > > executable is called from the old name (symlink). > > > > This seems technically feasible and easy. > > > > How can we proceed? > > You can send a patch. > > But I don't think there has been a "clear" agreement that this is a > good idea. > > Christoph
Re: PostgreSQL pollutes the file system
> On 2019-03-29 20:32, Joe Conway wrote: >> pg_util > > How is that better than just renaming to pg_$oldname? As I already said upthread: > This way, we would be free from the command name conflict problem > and plus, we could do: > > pgsql --help > > which will print subcommand names when a user is not sure what the > subcommand name is. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: pgbench - add minimal stats on initialization
Hi Fabien, I have one minor observation: in case of initDropTables you log 'drop' and in case of initCreateTables you log 'create table'. We need to be consistent. "drop tables" and "create tables" are the best fit here. Otherwise, the patch is good. On Wed, Apr 10, 2019 at 2:18 PM Ibrar Ahmed wrote: > > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: tested, passed > Documentation: not tested > > Please ignore the last email. > > Patch works perfectly and the code is well-written. I have one minor > observation that in case of initDropTables you log "drop" and in case of > initCreateTables you log "create table". I think you need to be consistent. > And why not "drop tables" and "create tables"? > > The new status of this patch is: Waiting on Author -- Ibrar Ahmed
Re: PostgreSQL pollutes the file system
Re: Fred .Flintstone 2019-04-10 > It seems we do have a clear path forward on how to accomplish this and > implement this change. > > 1. Rename executables to carry the pg_ prefix. > 2. Create symlinks from the old names to the new names. > 3. Modify the executables to read argv[0] and print a warning if the > executable is called from the old name (symlink). > > This seems technically feasible and easy. > > How can we proceed? You can send a patch. But I don't think there has been a "clear" agreement that this is a good idea. Christoph
Re: PostgreSQL pollutes the file system
It seems we do have a clear path forward on how to accomplish this and implement this change. 1. Rename executables to carry the pg_ prefix. 2. Create symlinks from the old names to the new names. 3. Modify the executables to read argv[0] and print a warning if the executable is called from the old name (symlink). This seems technically feasible and easy. How can we proceed? On Tue, Apr 2, 2019 at 11:24 PM Fred .Flintstone wrote: > > It looks like this thread is featured on LWN under the article: > Program names and "pollution". > https://lwn.net/ > https://lwn.net/Articles/784508/ (Subscription required) > > On Sat, Mar 30, 2019 at 12:27 PM Peter Eisentraut > wrote: > > > > On 2019-03-29 20:32, Joe Conway wrote: > > > pg_util > > > > How is that better than just renaming to pg_$oldname? > > > > -- > > Peter Eisentraut http://www.2ndQuadrant.com/ > > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
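Step 3 of the proposal above — printing a warning when a binary is invoked through its legacy name — amounts to a basename check on argv[0]. A sketch of that dispatch logic (the real change would be C code inside each executable, and the rename mapping here is purely illustrative, since the thread has not settled a final list):

```python
import os

# old name -> proposed pg_-prefixed name; this mapping is a hypothetical
# example, not an agreed-upon rename list.
RENAMED = {
    "createdb": "pg_createdb",
    "dropdb": "pg_dropdb",
    "initdb": "pg_initdb",
}

def legacy_warning(argv0):
    """Return the deprecation warning to print when the program was
    invoked via a legacy (un-prefixed) symlink, else None."""
    name = os.path.basename(argv0)
    if name in RENAMED:
        return ("%s: warning: this name is deprecated, use %s instead"
                % (name, RENAMED[name]))
    return None

print(legacy_warning("/usr/bin/createdb"))     # warns: use pg_createdb
print(legacy_warning("/usr/bin/pg_createdb"))  # None: new name, no warning
```

Because the check keys off argv[0], one binary can serve both the symlink and the new name, which is what makes steps 1-3 cheap to combine.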
Re: GSOC 2019 proposal 'WAL-G safety features'
Hi! > On 9 Apr 2019, at 18:20, Zhichao Liu wrote: > > Dear PostgreSQL community, > > I am a GSoC 2019 applicant and am working on 'WAL-G safety features'. I have > finished an initial draft of my proposal and I would appreciate your comments > and advice on my proposal. I know it is pretty late for the improvement of my > proposal, but I will be glad to join the project this summer even without > GSoC! Please help me make my proposal and ideas better. > Thank you! > > The link is > https://docs.google.com/document/d/18cxbj1zId1BpMjgUkZ0MZgb1HdMg1S9h0U1WecZON3U/edit?usp=sharing It is great that you want to work on WAL-G; you do not need a proposal to do this. Just make PRs, ask questions, etc. :) On the WAL-G GitHub page you can find a way to the Slack channel. This list is intended for PostgreSQL core features; in GSoC, WAL-G is hosted under the PostgreSQL umbrella. Thanks! Best regards, Andrey Borodin.
Re: block-level incremental backup
Hi! > On 9 Apr 2019, at 20:48, Robert Haas wrote: > > Thoughts? Thanks for this long and thoughtful post! At Yandex, we have been using incremental backups for some years now. Initially we used a patched pgbarman, then we implemented this functionality in WAL-G. And there are many things to be done yet. We have more than 1 PB of clusters backed up with this technology. Most of the time we use this technology as part of an HA setup in our managed PostgreSQL service. So, for us the main goals are to operate backups cheaply and restore a new node quickly. Here's what I see from our perspective.

1. Yes, this feature is important.
2. This importance comes not from reduced disk storage; magnetic disks and object storages are very cheap.
3. Incremental backups save a lot of network bandwidth. It is non-trivial for the storage system to ingest hundreds of TB daily.
4. Incremental backups are a redundancy of WAL, intended for parallel application. An incremental backup applied sequentially is not very useful; it will not be much faster than simple WAL replay in many cases.
5. As long as increments duplicate WAL functionality, it is not worth pursuing tradeoffs of storage utilization reduction. We scan WAL during archival, extract the numbers of changed blocks and store a changemap for a group of WALs in the archive.
6. These changemaps can be used for incrementing the visibility map (if I recall correctly). But you cannot compare LSNs on a page of the visibility map: some operations do not bump them.
7. We use changemaps during backups and during WAL replay - we know blocks that will change far in advance and prefetch them into the page cache like pg_prefaulter does.
8. There is similar functionality in RMAN for one well-known database. They used to store 8 sets of change maps. That database also has cool functionality "increment for catchup".
9. We call an incremental backup a "delta backup".
This wording describes the purpose more precisely: it is not "the next version of the DB", it is "the difference between two DB states". But the wording choice does not matter much. Here are slides from my talk at PgConf.APAC[0]. I've proposed a talk on this matter to PGCon, but it was not accepted. I will try next year :) > On 9 Apr 2019, at 20:48, Robert Haas wrote: > - This is just a design proposal at this point; there is no code. If > this proposal, or some modified version of it, seems likely to be > acceptable, I and/or my colleagues might try to implement it. I'll be happy to help with code, discussion and patch review. Best regards, Andrey Borodin. [0] https://yadi.sk/i/Y_S1iqNN5WxS6A
Re: hyrax vs. RelationBuildPartitionDesc
On Wed, Apr 10, 2019 at 05:03:21PM +0900, Amit Langote wrote: > The problem lies in all branches that have partitioning, so it should be > listed under Older Bugs, right? You may have noticed that I posted > patches for all branches down to 10. I have noticed. The message from Tom upthread outlined that an open item should be added, but this is not one. That's what I wanted to emphasize. Thanks for making it clearer. -- Michael
Re: Zedstore - compressed in-core columnar storage
On Wed, Apr 10, 2019 at 12:55 PM Heikki Linnakangas wrote: > > On 10/04/2019 09:29, Amit Kapila wrote: > > On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal wrote: > >> Row store > >> - > >> > >> The tuples are stored one after another, sorted by TID. For each > >> tuple, we store its 48-bit TID, an undo record pointer, and the actual > >> tuple data uncompressed. > >> > > > > Storing an undo record pointer with each tuple can take quite a lot of > > space in cases where you can't compress them. > > Yeah. This does depend on compression to eliminate the unused fields > quite heavily at the moment. But you could have a flag in the header to > indicate "no undo pointer needed", and just leave it out when it's not needed. > > > Have you thought about how you will implement the multi-locker scheme in > > this design? In zheap, we have used undo for the same and it is easy > > to imagine when you have separate transaction slots for each > > transaction. I am not sure how you will implement the same here. > I've been thinking that the undo record would store all the XIDs > involved. So if there are multiple lockers, the UNDO record would store > a list of XIDs. > This will be quite tricky. Whenever a new locker arrives, you first need to fetch the previous undo to see which XIDs already have a lock on it. Not only that, it will make discarding undos much more complicated. We had considered this approach before implementing the current approach in zheap. > Alternatively, I suppose you could store multiple UNDO > pointers for the same tuple. > This will not only make the tuple unnecessarily long but would make it much harder to reclaim that space once the corresponding undo is discarded. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
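To make the tradeoff in the exchange above concrete: if a single undo record carries the full list of locker XIDs, every new locker must first fetch the tuple's previous undo record before it can write its own extended copy. A toy model of that scheme (purely illustrative — not zedstore or zheap code):

```python
# Toy model of the "one undo record stores all locker XIDs" scheme.
undo_log = []  # append-only list standing in for the undo log

def add_locker(tuple_undo_ptr, xid):
    """A new locker must read the tuple's previous undo record to learn
    the existing lockers, then append a fresh record with the extended
    XID list -- the extra fetch per locker is the cost Amit points out.
    Returns the new undo pointer to stamp into the tuple header."""
    prev_xids = [] if tuple_undo_ptr is None else undo_log[tuple_undo_ptr]
    undo_log.append(prev_xids + [xid])
    return len(undo_log) - 1

ptr = add_locker(None, 100)   # first locker
ptr = add_locker(ptr, 101)    # must fetch record 0 first
ptr = add_locker(ptr, 102)    # must fetch record 1 first
print(undo_log[ptr])          # all three lockers visible from the latest record
```

The discard problem follows from the same structure: older records can only be thrown away once no live tuple pointer (directly or via the copied XID lists) still depends on them, which is what makes discarding harder than with per-transaction slots.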
Re: pgsql: Unified logging system for command-line programs
On 2019-04-09 13:58, Christoph Berg wrote: > I'm not entirely sure what happened here, but I think this made > pg_restore verbose by default (and there is no --quiet option). That was by accident. Fixed. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgbench - add minimal stats on initialization
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Please ignore the last email. Patch works perfectly and the code is well-written. I have one minor observation that in case of initDropTables you log "drop" and in case of initCreateTables you log "create table". I think you need to be consistent. And why not "drop tables" and "create tables"? The new status of this patch is: Waiting on Author
Re: pgbench - add minimal stats on initialization
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature:       tested, failed
Spec compliant:           tested, failed
Documentation:            not tested

Patch works perfectly and the code is well-written. I have one minor observation that in case of initDropTables you log "drop" and in case of initCreateTables you log "create table". I think you need to be consistent. And why not "drop tables" and "create tables"? The new status of this patch is: Waiting on Author
Re: Transaction commits VS Transaction commits (with parallel) VS query mean time
On Wed, Apr 10, 2019 at 3:32 PM Amit Kapila wrote: > On Mon, Apr 8, 2019 at 8:51 AM Amit Kapila > wrote: > > > > On Mon, Apr 8, 2019 at 7:54 AM Jamison, Kirk > wrote: > > > So I am marking this thread as “Ready for Committer”. > > > > > > > Thanks, Hari and Jamison for verification. The patches for > > back-branches looks good to me. I will once again verify them before > > commit. I will commit this patch tomorrow unless someone has > > objections. Robert/Alvaro, do let me know if you see any problem with > > this fix? > > > > Pushed. > Thanks Amit. Will look into it further to handle all the internally generated transactions. Regards, Haribabu Kommi Fujitsu Australia
Re: shared-memory based stats collector
On Wed, Apr 10, 2019 at 09:39:29AM +0900, Kyotaro HORIGUCHI wrote: At Tue, 9 Apr 2019 17:03:33 +0200, Tomas Vondra wrote in <20190409150333.5iashyjxm5jmraml@development> Unfortunately, now that we're past code freeze it's clear this is a PG13 matter now :-( I personally consider this to be a very worthwhile & beneficial improvement, but I agree with Andres the patch did not quite get to committable state in the last CF. Considering how sensitive a part it touches, I suggest we try to get it committed early in the PG13 cycle. I'm willing to spend some time on doing tests/benchmarks and reviewing the code, if needed. I'm very happy to be told that. Actually the code was rush work (mainly for reverting refactoring) and left some stupid mistakes. I'm going through the patch again and polishing the code. While reviewing the patch I've always had an issue with evaluating how it behaves for various scenarios / workloads. The reviews generally did one specific benchmark, but I find that unsatisfactory. I wonder whether we could develop a small set of more comprehensive workloads for this patch (i.e. different numbers of objects, access patterns, ...). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
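One way to make the review benchmarks suggested above systematic is to enumerate the whole workload matrix up front rather than picking one scenario ad hoc. A trivial sketch — the dimensions and values here are invented for illustration, not an agreed benchmark plan:

```python
from itertools import product

# Hypothetical benchmark dimensions for a stats-collector patch:
# number of stats objects (tables), concurrent backends, access pattern.
N_TABLES = [10, 1_000, 100_000]
N_CLIENTS = [1, 16, 128]
PATTERNS = ["uniform", "skewed", "read-mostly"]

def workload_matrix():
    """Return one run description per combination, so no corner of the
    (objects x concurrency x pattern) space is silently skipped."""
    return [
        {"tables": t, "clients": c, "pattern": p}
        for t, c, p in product(N_TABLES, N_CLIENTS, PATTERNS)
    ]

runs = workload_matrix()
print(len(runs))  # 3 * 3 * 3 = 27 runs
```

Each run description could then be turned into a pgbench invocation (custom scripts for the skewed and read-mostly patterns), with the same matrix replayed against master and the patched build for comparison.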
Re: bug in update tuple routing with foreign partitions
(2019/03/06 18:33), Amit Langote wrote: I noticed a bug with how UPDATE tuple routing initializes ResultRelInfos to use for partition routing targets. Specifically, the bug occurs when UPDATE targets include a foreign partition that is locally modified (as opposed to being modified directly on the remote server) and its ResultRelInfo (called subplan result rel in the source code) is reused for tuple routing:

-- setup
create extension postgres_fdw;
create server loopback foreign data wrapper postgres_fdw;
create user mapping for current_user server loopback;
create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2base (a int check (a = 2));
create foreign table p2 partition of p for values in (2) server loopback options (table_name 'p2base');
insert into p values (1), (2);

-- see in the plan that foreign partition p2 is locally modified
explain verbose update p set a = 2 from (values (1), (2)) s(x) where a = s.x returning *;

                                  QUERY PLAN
─────────────────────────────────────────────────────────────────────────────
 Update on public.p  (cost=0.05..236.97 rows=50 width=42)
   Output: p1.a, "*VALUES*".column1
   Update on public.p1
   Foreign Update on public.p2
     Remote SQL: UPDATE public.p2base SET a = $2 WHERE ctid = $1 RETURNING a
   ->  Hash Join  (cost=0.05..45.37 rows=26 width=42)
         Output: 2, p1.ctid, "*VALUES*".*, "*VALUES*".column1
         Hash Cond: (p1.a = "*VALUES*".column1)
         ->  Seq Scan on public.p1  (cost=0.00..35.50 rows=2550 width=10)
               Output: p1.ctid, p1.a
         ->  Hash  (cost=0.03..0.03 rows=2 width=32)
               Output: "*VALUES*".*, "*VALUES*".column1
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=32)
                     Output: "*VALUES*".*, "*VALUES*".column1
   ->  Hash Join  (cost=100.05..191.59 rows=24 width=42)
         Output: 2, p2.ctid, "*VALUES*".*, "*VALUES*".column1
         Hash Cond: (p2.a = "*VALUES*".column1)
         ->  Foreign Scan on public.p2  (cost=100.00..182.27 rows=2409 width=10)
               Output: p2.ctid, p2.a
               Remote SQL: SELECT a, ctid FROM public.p2base FOR UPDATE
         ->  Hash  (cost=0.03..0.03 rows=2 width=32)
               Output: "*VALUES*".*, "*VALUES*".column1
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=32)
                     Output: "*VALUES*".*, "*VALUES*".column1

-- as opposed to directly on the remote side (because there's no local join)
explain verbose update p set a = 2 returning *;

                              QUERY PLAN
─────────────────────────────────────────────────────────────────────
 Update on public.p  (cost=0.00..227.40 rows=5280 width=10)
   Output: p1.a
   Update on public.p1
   Foreign Update on public.p2
   ->  Seq Scan on public.p1  (cost=0.00..35.50 rows=2550 width=10)
         Output: 2, p1.ctid
   ->  Foreign Update on public.p2  (cost=100.00..191.90 rows=2730 width=10)
         Remote SQL: UPDATE public.p2base SET a = 2 RETURNING a
(8 rows)

Running the first update query crashes:

update p set a = 2 from (values (1), (2)) s(x) where a = s.x returning tableoid::regclass, *;
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The problem seems to occur because ExecSetupPartitionTupleRouting thinks it can reuse p2's ResultRelInfo for tuple routing. In this case, it can't be used, because its ri_FdwState contains information that will be needed for postgresExecForeignUpdate to work, but it's still used today. Because it's assigned to be used for tuple routing, its ri_FdwState will be overwritten by postgresBeginForeignInsert that's invoked by the tuple routing code using the aforementioned ResultRelInfo. The crash occurs when postgresExecForeignUpdate() can't find the information it's expecting in the ri_FdwState. Agreed, as I said before in another thread. The solution is to teach ExecSetupPartitionTupleRouting to avoid using a subplan result rel if its ri_FdwState is already set. I'm not sure that is a good idea, because that requires creating a new ResultRelInfo, which is not free; I think it requires a lot of work.
Another solution to avoid that would be to store the fmstate created by postgresBeginForeignInsert() into the ri_FdwState, rather than overwriting the ri_FdwState, like the attached. This would not need any changes to the core, and this would not cause any overhead either, IIUC. What do you think about that? I have attached 2 patches, one for PG 11 where the problem first appeared and another for HEAD. The patch for PG 11 is significantly bigger due to having to handle the complexities of mapping subplan result rel indexes to leaf partition indexes. Most of that
Re: Problem with default partition pruning
At Wed, 10 Apr 2019 14:55:48 +0900, Amit Langote wrote in > On 2019/04/10 12:53, Kyotaro HORIGUCHI wrote: > > At Wed, 10 Apr 2019 11:17:53 +0900, Amit Langote > > wrote: > >> Yeah, I think we should move the "if (partconstr)" block to the "if > >> (is_orclause(clause))" block as I originally proposed in: > >> > >> https://www.postgresql.org/message-id/9bb31dfe-b0d0-53f3-3ea6-e64b811424cf%40lab.ntt.co.jp > >> > >> It's kind of a hack to get over the limitation that > >> get_matching_partitions() can't prune default partitions for certain OR > >> clauses and I think we shouldn't let that hack grow into what seems like > >> almost duplicating relation_excluded_by_constraints() but without the > >> constraint_exclusion GUC guard. > > > > That leaves an issue of contradicting clauses that are not an arm > > of an OR-expr. Isn't that what Hosoya-san is trying to fix? > > Yes, that's right. But as I said, maybe we should try not to duplicate > the functionality of relation_excluded_by_constraints() in partprune.c. Currently we classify the partition constraint as a constraint, so it should be handled not in partition pruning, but in the constraint exclusion code. That sounds reasonable. > To summarize, aside from the problem described by the subject of this > thread (the patch for that is v4_default_partition_pruning.patch posted by > Hosoya-san on 2019/04/04), we have identified a couple of other issues: > > 1. One that Thibaut reported on 2019/03/04 > > > I kept on testing with sub-partitioning. Thanks.
> > I found a case, using 2 default partitions, where a default partition is
> > not pruned:
> >
> > create table test2(id int, val text) partition by range (id);
> > create table test2_20_plus_def partition of test2 default;
> > create table test2_0_20 partition of test2 for values from (0) to (20) partition by range (id);
> > create table test2_0_10 partition of test2_0_20 for values from (0) to (10);
> > create table test2_10_20_def partition of test2_0_20 default;
> >
> > # explain (costs off) select * from test2 where id=5 or id=25;
> >             QUERY PLAN
> > -----------------------------------------
> >  Append
> >    ->  Seq Scan on test2_0_10
> >          Filter: ((id = 5) OR (id = 25))
> >    ->  Seq Scan on test2_10_20_def
> >          Filter: ((id = 5) OR (id = 25))
> >    ->  Seq Scan on test2_20_plus_def
> >          Filter: ((id = 5) OR (id = 25))
> > (7 rows)
>
> For this, we can move the "if (partconstr)" block into the same if
> (is_orclause(clause)) block, as proposed in the v1-delta.patch that I
> proposed on 2019/03/20. Note that that patch restricts the scope of
> applying predicate_refuted_by() only to the problem that's currently
> tricky to solve by partition pruning alone -- pruning default partitions
> for OR clauses like in the above example.

This is a failure of partition pruning, which should be resolved in the partprune code.

> 2. Hosoya-san reported on 2019/03/22 that a contradictory WHERE clause
> applied to a *partition* doesn't return an empty plan:
>
> > I understood Amit's proposal. But I think the issue Thibaut reported
> > would occur regardless of whether clauses have OR clauses or not, as
> > follows.
> >
> > I tested a query which should output "One-Time Filter: false".
> > # explain select * from test2_0_20 where id = 25;
> >                            QUERY PLAN
> > ---------------------------------------------------------------------
> >  Append  (cost=0.00..25.91 rows=6 width=36)
> >    ->  Seq Scan on test2_10_20_def  (cost=0.00..25.88 rows=6 width=36)
> >          Filter: (id = 25)
>
> So, she proposed to apply predicate_refuted_by to the whole
> baserestrictinfo (at least in the latest patch), which is the same as
> always performing constraint exclusion on sub-partitioned partitions. I
> initially thought it might be a good idea, but only later realized that
> now there will be two places doing the same constraint exclusion proof --
> gen_partprune_steps_internal(), and set_rel_size() calling
> relation_excluded_by_constraints(). The latter depends on the
> constraint_exclusion GUC, whose default being 'partition' would mean we'd
> not get an empty plan with it. Even if you turn it to 'on', a bug in
> get_relation_constraints() will prevent the partition constraint from
> being loaded and performing constraint exclusion with it; I reported it in
> [1].

Hmm. One perplexing thing here is the fact that the partition constraint is not a table constraint but a partitioning classification in the users' view.

> I think that we may be better off solving the latter problem as follows:
>
> 1. Modify relation_excluded_by_constraints() to *always* try to exclude
> "baserel" partitions using their partition constraint (disregarding
> constraint_exclusion = off/partition).
>
> 2. Modify prune_append_rel_partitions(), which runs much earlier these
> days compared to set_rel_size(), to call relation_excluded_by_constraints()
> modified as described
Re: hyrax vs. RelationBuildPartitionDesc
Hi,

On 2019/04/10 15:42, Michael Paquier wrote:
> On Mon, Apr 08, 2019 at 10:40:41AM -0400, Robert Haas wrote:
>> On Mon, Apr 8, 2019 at 9:59 AM Tom Lane wrote:
>>> Amit Langote writes:
>>> Yeah, it's an open issue IMO. I think we've been focusing on getting
>>> as many feature patches done as we could during the CF, but now it's
>>> time to start mopping up problems like this one.
>
> Please note that it is registered as an older bug and not an open
> item.

The problem lies in all branches that have partitioning, so it should be listed under Older Bugs, right? You may have noticed that I posted patches for all branches down to 10.

>> Do you have any further thoughts based on my last response?
>
> So your last response is that:
> https://www.postgresql.org/message-id/ca+tgmoa5rt+zr+vv+q1xlwqtdmcqknl6b4pjr4v6yac9k_l...@mail.gmail.com
> And what are you proposing as patch? Perhaps something among those
> lines?
> https://www.postgresql.org/message-id/036852f2-ba7f-7a1f-21c6-00bc3515e...@lab.ntt.co.jp

AFAIK, the patch there isn't meant to solve the problems discussed at the first link. It's meant to fix poor cache memory management of partition constraint expression trees, which seems to be a separate issue from the PartitionDesc memory management issue (the latter being the main topic of this thread). The problem with partition constraint expression trees was only mentioned in passing in this thread [1], although it had also come up in the past, as I said when posting the patch.

>> Does anyone else wish to offer opinions?
>
> It seems to me that Tom's argument to push in the way relcache
> information is handled by copying its contents sounds sensible to me.
> That's not perfect, but it is consistent with what exists (note
> PublicationActions for a rather-still-not-much-complex example of
> structure which gets copied from the relcache).
I gave my vote for direct access to unchangeable relcache substructures (TupleDesc, PartitionDesc, etc.), because they're accessed during the planning of every user query and are fairly expensive to copy compared to the list of indexes or the PublicationActions you're citing.

To further my point a bit, I wonder why PublicationActions needs to be copied out of the relcache at all? Looking at its usage in execReplication.c, it seems we can do fine without copying, because we are holding both a lock and a relcache reference on the replication target relation during the entirety of apply_handle_insert(), which means that information can't change under us, neither logically nor physically.

Also, to reiterate what I think was one of Robert's points upthread [2], the reason for requiring some code to copy a relcache substructure out of the relcache should be that the caller might want to change its contents; for example, the planner or its hooks may want to add/remove an index to/from the list of indexes copied from the relcache. The reason for copying should not be that relcache content may change under us despite holding a lock and a relcache reference.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/7961.1552498252%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/CA%2BTgmoa5rT%2BZR%2BVv%2Bq1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw%40mail.gmail.com
Re: Zedstore - compressed in-core columnar storage
On 10/04/2019 10:38, Konstantin Knizhnik wrote:
> I am also a little bit confused about UNDO records and MVCC support in
> Zedstore. Actually, a columnar store is mostly needed for analytics on
> read-only or append-only data. One of the disadvantages of Postgres is
> the rather large per-record space overhead caused by MVCC. It may be
> critical if you want to store huge timeseries with a relatively small
> number of columns (like measurements of some sensor). It would be nice
> to have a storage format which reduces this overhead when it is not
> needed (data is not updated).

Sure. Definitely something we need to optimize.

> Right now, even without UNDO pages, the size of zedstore is larger than
> the size of the original Postgres table. It seems very strange.

If you have a table with a lot of columns, but each column is small, e.g. lots of boolean columns, the item headers that zedstore currently stores for each datum take up a lot of space. We'll need to squeeze those harder to make this competitive. Instead of storing a header for each datum, if a group of consecutive tuples have the same visibility information, we could store the header just once, with an array of the datums, for example.

- Heikki
Re: Zedstore - compressed in-core columnar storage
On 10.04.2019 10:25, Heikki Linnakangas wrote:
> On 10/04/2019 09:29, Amit Kapila wrote:
> > On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal wrote:
> > > Row store - The tuples are stored one after another, sorted by TID. For
> > > each tuple, we store its 48-bit TID, an undo record pointer, and the
> > > actual tuple data uncompressed.
> >
> > Storing an undo record pointer with each tuple can take quite a lot of
> > space in cases where you can't compress them.
>
> Yeah. This does depend on compression to eliminate the unused fields quite
> heavily at the moment. But you could have a flag in the header to indicate
> "no undo pointer needed", and just leave it out when it's not needed.
>
> > Have you thought about how you will implement the multi-locker scheme in
> > this design? In zheap, we have used undo for the same, and it is easy to
> > imagine when you have separate transaction slots for each transaction. I
> > am not sure how you will implement the same here.
>
> I've been thinking that the undo record would store all the XIDs involved.
> So if there are multiple lockers, the UNDO record would store a list of
> XIDs. Alternatively, I suppose you could store multiple UNDO pointers for
> the same tuple.
>
> - Heikki

I am also a little bit confused about UNDO records and MVCC support in Zedstore. Actually, a columnar store is mostly needed for analytics on read-only or append-only data. One of the disadvantages of Postgres is the rather large per-record space overhead caused by MVCC. It may be critical if you want to store huge timeseries with a relatively small number of columns (like measurements of some sensor). It would be nice to have a storage format which reduces this overhead when it is not needed (data is not updated). Right now, even without UNDO pages, the size of zedstore is larger than the size of the original Postgres table. It seems very strange.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Zedstore - compressed in-core columnar storage
On 10/04/2019 09:29, Amit Kapila wrote:
> On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal wrote:
> > Row store - The tuples are stored one after another, sorted by TID. For
> > each tuple, we store its 48-bit TID, an undo record pointer, and the
> > actual tuple data uncompressed.
>
> Storing an undo record pointer with each tuple can take quite a lot of
> space in cases where you can't compress them.

Yeah. This does depend on compression to eliminate the unused fields quite heavily at the moment. But you could have a flag in the header to indicate "no undo pointer needed", and just leave it out when it's not needed.

> Have you thought about how you will implement the multi-locker scheme in
> this design? In zheap, we have used undo for the same, and it is easy to
> imagine when you have separate transaction slots for each transaction. I
> am not sure how you will implement the same here.

I've been thinking that the undo record would store all the XIDs involved. So if there are multiple lockers, the UNDO record would store a list of XIDs. Alternatively, I suppose you could store multiple UNDO pointers for the same tuple.

- Heikki
Re: [survey] New "Stable" QueryId based on normalized query text
On Tue, Apr 9, 2019 at 11:26 PM Bruce Momjian wrote:
>
> On Wed, Mar 20, 2019 at 03:19:58PM -0700, legrand legrand wrote:
> > > The rest of the thread raises quite a lot of concerns about the semantics,
> > > the cost and the correctness of this patch. After 5 minutes checking,
> > > it wouldn't suit your need if you use custom functions, custom types,
> > > custom operators (say using the intarray extension) or if your tables
> > > don't have columns in the same order in every environment. And there
> > > are probably other caveats that I didn't see;
> >
> > Yes, I know. It would have to be extended at least to function, type,
> > operator ... names, with a GUC pg_stat_statements.queryid_based = 'names'
> > (default being 'oids'), and with a second GUC ('fullyqualified'?) that
> > should include table, function, type, operator ... namespaces.
> >
> > Let "users" specify their needs, we will see ;o)
>
> Why can't we just expose the hash computation as an SQL function and
> let people call it with pg_stat_activity.query or wherever they want the
> value? We can install multiple functions if needed.

It'd be very nice to expose the queryid computation at the SQL level. However, it would only allow computing the top-level queryid from pg_stat_activity. For monitoring and performance purposes, people would probably want to see the queryid of the underlying query actually running, I think.
Re: hyrax vs. RelationBuildPartitionDesc
On Mon, Apr 08, 2019 at 10:40:41AM -0400, Robert Haas wrote: > On Mon, Apr 8, 2019 at 9:59 AM Tom Lane wrote: >> Amit Langote writes: >> Yeah, it's an open issue IMO. I think we've been focusing on getting >> as many feature patches done as we could during the CF, but now it's >> time to start mopping up problems like this one. Please note that it is registered as an older bug and not an open item. > Do you have any further thoughts based on my last response? So your last response is that: https://www.postgresql.org/message-id/ca+tgmoa5rt+zr+vv+q1xlwqtdmcqknl6b4pjr4v6yac9k_l...@mail.gmail.com And what are you proposing as patch? Perhaps something among those lines? https://www.postgresql.org/message-id/036852f2-ba7f-7a1f-21c6-00bc3515e...@lab.ntt.co.jp > Does anyone else wish to offer opinions? It seems to me that Tom's argument to push in the way relcache information is handled by copying its contents sounds sensible to me. That's not perfect, but it is consistent with what exists (note PublicationActions for a rather-still-not-much-complex example of structure which gets copied from the relcache). -- Michael
Re: Minimal logical decoding on standbys
On 03/13/2019 08:40 PM, tushar wrote:

Hi, I am getting a server crash on standby while executing the pg_logical_slot_get_changes function. Please refer to this scenario:

1. Create the master cluster (./initdb -D master).
2. Set wal_level='hot_standby' in master/postgresql.conf.
3. Start the server, connect to a psql terminal and create a physical replication slot (SELECT * from pg_create_physical_replication_slot('p1');).
4. Perform pg_basebackup using --slot 'p1' (./pg_basebackup -D slave/ -R --slot p1 -v).
5. Set wal_level='logical', hot_standby_feedback=on, primary_slot_name='p1' in slave/postgresql.conf.
6. Start the server, connect to a psql terminal and create a logical replication slot (SELECT * from pg_create_logical_replication_slot('t','test_decoding');).
7. Run pgbench (./pgbench -i -s 10 postgres) on the master and select pg_logical_slot_get_changes on the slave:

postgres=# select * from pg_logical_slot_get_changes('t',null,null);
2019-03-13 20:34:50.274 IST [26817] LOG: starting logical decoding for slot "t"
2019-03-13 20:34:50.274 IST [26817] DETAIL: Streaming transactions committing after 0/6C60, reading WAL from 0/6C28.
2019-03-13 20:34:50.274 IST [26817] STATEMENT: select * from pg_logical_slot_get_changes('t',null,null);
2019-03-13 20:34:50.275 IST [26817] LOG: logical decoding found consistent point at 0/6C28
2019-03-13 20:34:50.275 IST [26817] DETAIL: There are no running transactions.
2019-03-13 20:34:50.275 IST [26817] STATEMENT: select * from pg_logical_slot_get_changes('t',null,null);
TRAP: FailedAssertion("!(data == tupledata + tuplelen)", File: "decode.c", Line: 977)
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset:
2019-03-13 20:34:50.276 IST [26809] LOG: server process (PID 26817) was terminated by signal 6: Aborted

Andres - Do you think this is an issue which needs to be fixed?
-- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: Zedstore - compressed in-core columnar storage
On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal wrote:
>
> Heikki and I have been hacking recently for a few weeks to implement
> in-core columnar storage for PostgreSQL. Here's the design and initial
> implementation of Zedstore, compressed in-core columnar storage (table
> access method). Attaching the patch and a link to the github branch [1] to
> follow along.
>
> The objective is to gather feedback on the design and approach to the
> same. The implementation has the core basic pieces working but is not close
> to complete.
>
> Big thank you to Andres, Haribabu and team for the table access method
> APIs. Leveraged the APIs for implementing zedstore, and this proves the API
> to be in very good shape. Had to enhance the same minimally, but
> in general didn't have to touch the executor much.
>
> Motivations / Objectives
>
> * Performance improvement for queries selecting a subset of columns
>   (reduced IO).
> * Reduced on-disk footprint compared to heap table. Shorter tuple
>   headers and also leveraging compression of similar type data
> * Be a first-class citizen in the Postgres architecture (table data can
>   just independently live in columnar storage)
> * Fully MVCC compliant
> * All indexes supported
> * Hybrid row-column store, where some columns are stored together, and
>   others separately. Provide flexibility of granularity on how to
>   divide the columns. Columns accessed together can be stored
>   together.
> * Provide better control over bloat (similar to zheap)
> * Eliminate the need for separate toast tables
> * Faster add / drop column or changing the data type of a column by
>   avoiding a full rewrite of the table.
>
> High-level Design - B-trees for the win!
>
> To start simple, let's ignore the column store aspect for a moment and
> consider it as a compressed row store. The column store is a natural
> extension of this concept, explained in the next section.
>
> The basic on-disk data structure leveraged is a B-tree, indexed by
> TID. BTree being a great data structure, fast and versatile.
> Note this is not referring to existing Btree indexes, but instead a net
> new separate BTree for table data storage.
>
> TID - logical row identifier:
> TID is just a 48-bit row identifier. The traditional division into
> block and offset numbers is meaningless. In order to find a tuple with
> a given TID, one must always descend the B-tree. Having a logical TID
> provides the flexibility to move the tuples around different pages when
> page splits or page merges are performed.
>
> The internal pages of the B-tree are super simple and boring. Each
> internal page just stores an array of TID and downlink pairs. Let's
> focus on the leaf level. Leaf blocks have a short uncompressed header,
> followed by btree items. Two kinds of items exist:
>
> - plain item, holds one tuple or one datum, uncompressed payload
> - a "container item", holds multiple plain items, compressed payload
>
> +----------------------------------------------------------------------
> | Fixed-size page header:
> |
> |   LSN
> |   TID low and hi key (for Lehman & Yao B-tree operations)
> |   left and right page pointers
> |
> | Items:
> |
> |   TID | size | flags | uncompressed size | lastTID | payload (container item)
> |   TID | size | flags | uncompressed size | lastTID | payload (container item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   ...
> |
> +----------------------------------------------------------------------
>
> Row store
> ---------
>
> The tuples are stored one after another, sorted by TID. For each
> tuple, we store its 48-bit TID, an undo record pointer, and the actual
> tuple data uncompressed.

Storing an undo record pointer with each tuple can take quite a lot of space in cases where you can't compress them. Have you thought about how you will implement the multi-locker scheme in this design? In zheap, we have used undo for the same, and it is easy to imagine when you have separate transaction slots for each transaction. I am not sure how you will implement the same here.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Speed up build on Windows by generating symbol definition in batch
Thanks for reviewing! I've updated the patch according to your comments.

Best regards,
Peifeng Qiu

On Sun, Apr 7, 2019 at 2:31 PM Noah Misch wrote:
> On Sat, Mar 30, 2019 at 03:42:39PM +0900, Peifeng Qiu wrote:
> > When I watched the whole build process with a task manager, I discovered
> > that a lot of time was spent on generating export symbol definitions,
> > without consuming much CPU or IO.
> > The script that does this is src/tools/msvc/gendef.pl; it enumerates the
> > whole directory for ".obj" files and calls the dumpbin utility to generate
> > ".sym" files one by one like this:
> >
> > dumpbin /symbols /out:a.sym a.obj >NUL
> >
> > Actually the dumpbin utility accepts a wildcard file name, so we can
> > generate the export symbols of all ".obj" files in batch.
> >
> > dumpbin /symbols /out:all.sym *.obj >NUL
> >
> > This will avoid wasting time by creating and destroying the dumpbin process
> > repeatedly and can speed up the build process considerably.
> > I've tested on my 4-core 8-thread Intel i7 CPU. I've set MSBFLAGS=/m to
> > ensure it can utilize all CPU cores.
> > Building without this patch takes about 370 seconds. Building with this
> > patch takes about 200 seconds. That's almost a 2x speed up.
>
> I, too, get a strong improvement, from 201s to 149s. I can confirm it yields
> identical *.def files. Thanks for identifying this improvement.
>
> > - my ($objfile, $symfile) = @_;
> > - my ($symvol, $symdirs, $symbase) = splitpath($symfile);
> > - my $tmpfile = catpath($symvol, $symdirs, "symbols.out");
>
> You removed the last use of File::Spec::Functions, so remove its "use"
> statement.
>
> > - system("dumpbin /symbols /out:$tmpfile $_ >NUL")
> > - && die "Could not call dumpbin";
>
> This error handling was crude, but don't replace it with zero error
> handling.
>
> > - rename($tmpfile, $symfile);
>
> Keep the use of a temporary file, too.
>
> > +system("dumpbin /symbols /out:$symfile $ARGV[0]/*obj >NUL");
>
> That should be *.obj, not *obj.
[Attachment: generate-symbols-in-batch-on-windows-v2.patch]