Re: [HACKERS] Fun fact about autovacuum and orphan temp tables
On Sat, Oct 22, 2016 at 9:45 AM, Tom Lane wrote:
> Michael Paquier writes:
>> On Sat, Oct 22, 2016 at 12:15 AM, Jim Nasby wrote:
>>> On 10/21/16 8:47 AM, Tom Lane wrote:
>>>> If we are willing to do that then we don't really have to solve the
>>>> problem on the backend side. One could expect that autovacuum would
>>>> clean things up within a few minutes after a backend failure.
>
>> I am really thinking that we should just do that and call it a day
>> then, but document the fact that if one wants to look at the content
>> of orphaned tables after a crash he had better turn autovacuum to off
>> for the time of the analysis.
>
> Yeah, agreed. This also points up the value of Robert's suggestion
> of a "really off" setting.

Okay, so I suggest something like the attached as a HEAD-only change,
because that's a behavior modification.
--
Michael

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index f87f3e0..0a365a0 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -800,6 +800,14 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
+    Autovacuum detects orphaned temporary tables that could be left behind
+    by a crashed backend, and forcibly drops them.  In the event of a backend
+    crash, looking at the contents of orphaned temporary tables to perform
+    some research on their contents is possible though it is advised to
+    set autovacuum to off before doing so.
+
     The default thresholds and scale factors are taken from
     postgresql.conf, but it is possible to override them (and many other
     autovacuum control parameters) on a per-table basis; see
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index e3a6911..3a0f4a8 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2037,34 +2037,21 @@ do_autovacuum(void)
 			/* We just ignore it if the owning backend is still active */
 			if (backendID == MyBackendId || BackendIdGetProc(backendID) == NULL)
 			{
+				ObjectAddress object;
+
 				/*
-				 * We found an orphan temp table (which was probably left
-				 * behind by a crashed backend).  If it's so old as to need
-				 * vacuum for wraparound, forcibly drop it.  Otherwise just
-				 * log a complaint.
+				 * We found an orphan temp table which was probably left
+				 * behind by a crashed backend, so forcibly drop it.
 				 */
-				if (wraparound)
-				{
-					ObjectAddress object;
-
-					ereport(LOG,
-							(errmsg("autovacuum: dropping orphan temp table \"%s\".\"%s\" in database \"%s\"",
-									get_namespace_name(classForm->relnamespace),
-									NameStr(classForm->relname),
-									get_database_name(MyDatabaseId))));
-					object.classId = RelationRelationId;
-					object.objectId = relid;
-					object.objectSubId = 0;
-					performDeletion(&object, DROP_CASCADE, PERFORM_DELETION_INTERNAL);
-				}
-				else
-				{
-					ereport(LOG,
-							(errmsg("autovacuum: found orphan temp table \"%s\".\"%s\" in database \"%s\"",
-									get_namespace_name(classForm->relnamespace),
-									NameStr(classForm->relname),
-									get_database_name(MyDatabaseId))));
-				}
+				ereport(LOG,
+						(errmsg("autovacuum: dropping orphan temp table \"%s\".\"%s\" in database \"%s\"",
+								get_namespace_name(classForm->relnamespace),
+								NameStr(classForm->relname),
+								get_database_name(MyDatabaseId))));
+				object.classId = RelationRelationId;
+				objec
Re: [HACKERS] Parallel Index Scans
On Fri, Oct 21, 2016 at 10:55 PM, Robert Haas wrote:
> On Fri, Oct 21, 2016 at 9:27 AM, Amit Kapila wrote:
>>> I think the parallel_workers reloption should override the degree of
>>> parallelism for any sort of parallel scan on that table. Had I
>>> intended it to apply only to sequential scans, I would have named it
>>> differently.
>>
>> I think there is a big difference in the size of the relation to scan
>> between a parallel sequential scan and a parallel (range) index scan,
>> which could make it difficult for the user to choose the value of this
>> parameter. Why do you think that the parallel_workers reloption should
>> suffice for all types of scans on a table? I could only think of
>> providing it based on the thinking that fewer config knobs make life
>> easier.
>
> Well, we could do that, but it would be fairly complicated and it
> doesn't seem to me to be the right place to focus our efforts. I'd
> rather try to figure out some way to make the planner smarter, because
> even if users can override the number of workers on a
> per-table-per-scan-type basis, they're probably still going to find
> using parallel query pretty frustrating unless we make the
> number-of-workers formula smarter than it is today. Anyway, even if
> we do decide to add more reloptions than just parallel_degree someday,
> couldn't that be left for a separate patch?

That makes sense to me. As of now, the patch doesn't consider
reloptions for parallel index scans. So, I think we can leave it as it
is, and then later, as a separate patch, decide whether the table's
reloption or a separate reloption for the index would be the better
choice.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
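[Editor's note: for readers unfamiliar with the reloption under discussion, it is set per table. A minimal illustration, with an assumed table name:]

```sql
-- Override the planner's number-of-workers formula for scans of this table.
ALTER TABLE big_table SET (parallel_workers = 4);

-- Revert to the formula-derived value.
ALTER TABLE big_table RESET (parallel_workers);
```

The thread's question is whether this single knob should govern parallel index scans too, or whether indexes would eventually get a reloption of their own.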
Re: [HACKERS] make check-world failing for me
On 22 October 2016 at 01:52, Tom Lane wrote:
> Thom Brown writes:
>> I'm using the latest version of Linux Mint Debian Edition, having recently
>> upgraded from an older version, and now I can't get make check-world to
>> finish successfully.
>
>> The initial theory was that the version of Bison I'm using (3.0.2) is to
>> blame, but I've also tested it against 3.0.1, 2.7.1 and 2.3, but I get the
>> same problem:
>
> Don't see how it would be Bison's fault. What this looks like to me is
> that you've got a corrupted copy of src/test/perl/PostgresNode.pm,
> because this:
>
>> syntax error at
>> /home/thom/Development/postgresql/src/test/modules/commit_ts/../../../../src/test/perl/PostgresNode.pm
>> line 437, near ")
>> print"
>
> doesn't seem to have anything to do with what's actually at line 437
> in that file (at least not in current HEAD). I don't see any close
> matches elsewhere in that file, either.

I've just taken a fresh clone of the PostgreSQL git repo, and it
appears you're right. All the tests appear to be passing now, so sorry
for the noise.

Thanks,

Thom
Re: [HACKERS] make check-world failing for me
Thom Brown writes: > I'm using the latest version of Linux Mint Debian Edition, having recently > upgraded from an older version, and now I can't get make check-world to > finish successfully. > The initial theory was that the version of Bison I'm using (3.0.2) is to > blame, but I've also tested it against 3.0.1, 2.7.1 and 2.3, but I get the > same problem: Don't see how it would be Bison's fault. What this looks like to me is that you've got a corrupted copy of src/test/perl/PostgresNode.pm, because this: > syntax error at > /home/thom/Development/postgresql/src/test/modules/commit_ts/../../../../src/test/perl/PostgresNode.pm > line 437, near ") > print" doesn't seem to have anything to do with what's actually at line 437 in that file (at least not in current HEAD). I don't see any close matches elsewhere in that file, either. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fun fact about autovacuum and orphan temp tables
Michael Paquier writes: > On Sat, Oct 22, 2016 at 12:15 AM, Jim Nasby wrote: >> On 10/21/16 8:47 AM, Tom Lane wrote: >>> If we are willing to do that then we don't really have to solve the >>> problem on the backend side. One could expect that autovacuum would >>> clean things up within a few minutes after a backend failure. > I am really thinking that we should just do that and call it a day > then, but document the fact that if one wants to look at the content > of orphaned tables after a crash he had better turn autovacuum to off > for the time of the analysis. Yeah, agreed. This also points up the value of Robert's suggestion of a "really off" setting. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
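[Editor's note: the workflow being documented here, turn autovacuum off, inspect the orphans, then turn it back on, might look like the following sketch. The catalog query is illustrative only and is not part of the proposed patch.]

```sql
-- Disable autovacuum first so it does not drop the orphans mid-analysis,
-- then reload the configuration.
ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();

-- Temporary tables live in per-backend pg_temp_N schemas; after a crash,
-- entries whose owning backend is gone are the orphans.
SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relpersistence = 't';

-- When the analysis is done, re-enable autovacuum and let it drop them.
ALTER SYSTEM SET autovacuum = on;
SELECT pg_reload_conf();
```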
Re: [HACKERS] emergency outage requiring database restart
Alvaro Herrera writes: > Jim Nasby wrote: >> It occurs to me that it might be worth embedding the relation name in the >> free space of the first block. Most people would never notice the missing 64 >> bytes, but it would be incredibly helpful in cases like this... > Agreed. The problem is how to install it without breaking pg_upgrade. Well, that's the first problem. The second problem is how to cope with RENAME TABLE. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Draft release notes for next week's releases
On Fri, Oct 21, 2016 at 4:51 PM, Tom Lane wrote: > Please review ... Is somebody going to look at the bugfix for the issue where ON CONFLICT DO NOTHING is used at higher isolation levels [1]? I think that it's still possible to get it in. [1] https://www.postgresql.org/message-id/cam3swzr6an++h24e6y2nwetjmtcjxdbfgeifoss2jpfpemq...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Draft release notes for next week's releases
I have committed draft release notes at https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=eacaf6e29fd2a3047aff9738a35a8e9b05e55375 They should be up on the website at https://www.postgresql.org/docs/devel/static/release-9-6-1.html as soon as guaibasaurus does its next run (about half an hour from now, looks like). We need to consider how to document recovery procedures for the recently-fixed FSM and VM corruption bugs. For the moment I've assumed that the main documentation for this will be on the wiki, but if so, we need to put in at least placeholder pages for the release notes to point to. (Also, somebody want to start filling in the placeholders?) For the moment, there are just notes for 9.6.1; I'll slice and dice them for the back branches later. Note that quite a few of the items don't really apply to 9.6.1 because the bug fixes are already there in 9.6.0. This is visible in the SGML source because the comments for them look like Branch: REL9_6_STABLE Release: REL9_6_0 [36646d3af] 2016-09-03 13:28:53 -0400 Please review ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] tuplesort_gettuple_common() and *should_free argument
I noticed that the routine tuplesort_gettuple_common() fails to set *should_free in all paths in the master branch (no bug in 9.6). Consider the TSS_SORTEDONTAPE case, where we can return false without also setting "*should_free = false" to instruct the caller not to pfree() tuple memory that tuplesort still owns. I suppose that this is okay because the caller should never pfree() a tuple when a NULL pointer was returned at a higher level (as "pointer-to-tuple"). Even still, it seems like a bad idea to rely on the caller testing things so that it doesn't go on to pfree() a NULL pointer when seemingly instructed to do so by a tuplesort "get tuple" interface routine (that is, some higher-level routine that calls tuplesort_gettuple_common()). More importantly, there are no remaining cases where tuplesort_gettuple_common() sets "*should_free = true", because there is no remaining need for the caller to *ever* pfree() a tuple. Moreover, I don't anticipate any future need for this -- the scheme recently established around per-tape "slab slots" makes it seem unlikely to me that the need to set "*should_free = true" will ever arise again. So, for Postgres 10, why not rip out the "*should_free" arguments that appear in a bunch of places? This lets us slightly simplify most tuplesort.c callers. Now, it is still true that at least some callers of tuplesort_gettupleslot() (but not of any other "get tuple" interface routine) are going to *require* ownership of memory for returned tuples, which means a call to heap_copy_minimal_tuple() remains necessary there (see recent commit d8589946ddd5c43e1ebd01c5e92d0e177cbfc198 for details). But, that's beside the point. In the master branch only, the tuplesort_gettupleslot() test "if (!should_free)" seems tautological, just as similar tests are for every other tuplesort_gettuple_common() caller. So, tuplesort_gettupleslot() can safely assume it should *always* heap_copy_minimal_tuple() in the master branch. 
(BTW, I'm going to teach tuplesort_gettuple_common() to avoid this heap_copy_minimal_tuple() call for callers that don't actually need it, but that's a discussion for another thread). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Fast Default WIP patch for discussion
As promised and requested, find attached a work-in-progress patch for fast defaults. This is my first patch, I hope I used the right format.

The premise of the feature is to avoid a table rewrite when adding a column with a default. This is done by remembering the default value in pg_attribute instead of updating all the rows. When a tuple read from disk is shorter than the tuple descriptor mandates, the default is "plugged in" either when filling in the datum/null array of a virtual tuple, or by "expanding" the tuple to a complete heap or minimal tuple with all attributes.

Example:

CREATE TABLE t (pk INT NOT NULL PRIMARY KEY);
INSERT INTO t VALUES (1);
=> t: (1)

ALTER TABLE t ADD COLUMN c1 INT DEFAULT 5;
=> 1. Stores the default expression in pg_attrdef.adbin (no news)
   2. a. Stores the default values (computed at time of ALTER TABLE) in pg_attribute.att_existdef
      b. Sets pg_attribute.att_hasexistdef to true

SELECT * FROM t;
=> Builds a virtual tuple using the get*Attr() API (no news), but since the tuple has fewer attributes than the table signature, fills in the extra attributes from pg_attribute.att_existdef if att_hasexistdef is true
=> (1) becomes (1, 5)

INSERT INTO t VALUES (2);
=> Fills in DEFAULT for t.c1 from pg_attrdef.adbin as usual: (2, 5)
=> t: (1), (2, 5)

ALTER TABLE t ALTER COLUMN c1 SET DEFAULT -1;
=> 1. Drops the row from pg_attrdef for c1 (no news)
   2. Adds a row to pg_attrdef for c1 DEFAULT -1 (no news)
   3. Leaves pg_attribute.att_existdef alone!!

INSERT INTO t VALUES (3);
=> Fills in DEFAULT for t.c1 from pg_attrdef.adbin as usual: (3, -1)
=> t: (1), (2, 5), (3, -1)

SELECT * FROM t;
=> Builds a virtual tuple as above, filling in extra attributes from pg_attribute.att_existdef
=> (1) becomes (1, 5); result: (1, 5) (2, 5) (3, -1)

ALTER TABLE t ALTER COLUMN c1 DROP DEFAULT;
=> 1. Drops the row from pg_attrdef for c1 (no news)
   2. Leaves pg_attribute.att_existdef alone!!

INSERT INTO t VALUES (4);
=> Fills in DEFAULT: (4, NULL)
=> t: (1), (2, 5), (3, -1), (4, NULL)

SELECT * FROM t;
=> Builds a virtual tuple as above, filling in extra attributes from pg_attribute.att_existdef
=> (1) becomes (1, 5); result: (1, 5) (2, 5) (3, -1) (4, NULL)

You can find a new (incomplete) test file fast_default.sql in regress/sql.

Some key design points requiring discussion:

1. Storage of the "exist" (working name) default. Right now the patch stores the default value in its binary form, as it would be in the tuple, into a BYTEA. It would be feasible to store the pretty-printed literal instead, but this requires calling the I/O functions when a tuple descriptor is built.

2. The exist default is cached alongside the "current" default in the tuple descriptor's constraint structure. Seems most natural to me, but debatable.

3. Delayed vs. early expansion of the tuples. To avoid having to decide when to copy tuple descriptors with or without constraints and defaults, I have opted to expand the tuple at the core entry points. How do I know I have them all? An omission means wrong results!

4. attisnull(). This routine is used in many places, but to give correct results it must now be accompanied by the tuple descriptor. This becomes moderately messy, and it's not always clear where to get that. Interestingly, most usages are related to catalog lookups. Assuming we have no intention to support fast defaults for catalog tables, we could keep using the existing attisnull() API for catalog lookups and use a new version (name TBD) for user tables.

5. My head hurts looking at the PK/FK code: it's not always clear which tuple descriptor belongs to which tuple.

6. Performance of the expansion code. The current code needs to walk all defaults and then start padding by filling in values. But the outcome is always the same: we will produce the same payload and the same null map. It would be feasible to cache an "all defaults tuple", remember the offsets (and VARWIDTH, HASNULL) for each attribute, and then simply splice the short and default tuples together. This ought to be faster, but the metadata to cache is not insignificant, and the expansion code is messy enough without this already.

7. Grooming. Obviously we can remove all exist defaults for a table from pg_attribute whenever the table is rewritten. That's easy. But could we/should we keep track of the short tuples and either eliminate them, or drop exist defaults once they become obsolete because there is no tuple short enough for them to matter?

8. Do we need to worry about toasted defaults?

Cheers
Serge Rielau
Salesforce.com

fast_default.patch
Description: Binary data
Re: [HACKERS] Indirect indexes
Alvaro Herrera writes: > Robert Haas wrote: >> So, I think that this is a really promising direction, but also that >> you should try very hard to try to get out from under this 6-byte PK >> limitation. That seems really ugly, and in practice it probably means >> your PK is probably going to be limited to int4, which is kind of sad >> since it leaves people using int8 or text PKs out in the cold. > I think we could just add a new type, unsigned 6 byte int, specifically > for this purpose. I think that's a really bad idea, because after you've fixed this hopefully-temporary limitation, we'll still be stuck carrying this weird type forever. Besides which, doesn't the existing TID type already serve the purpose? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
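[Editor's note: the tid type Tom refers to is already exposed at the SQL level. A quick illustration, with an assumed table name:]

```sql
-- Every heap tuple carries a (block, offset) pair as its ctid,
-- and the existing tid type can hold such values directly.
SELECT ctid, * FROM some_table LIMIT 3;
SELECT '(0,1)'::tid;
```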
Re: [HACKERS] Indirect indexes
Robert Haas wrote: > So, I think that this is a really promising direction, but also that > you should try very hard to try to get out from under this 6-byte PK > limitation. That seems really ugly, and in practice it probably means > your PK is probably going to be limited to int4, which is kind of sad > since it leaves people using int8 or text PKs out in the cold. I think we could just add a new type, unsigned 6 byte int, specifically for this purpose. Little in the way of operators, as it's pointless to try to do arithmetic with object identifiers. (It'd be similar to UUID in spirit, but I wouldn't try to do anything too smart to generate them.) > I believe Claudio Freire is on to something when he suggests storing > the PK in the index tuple; one could try to skip storing the TID, or > always store it as all-zeroes. That bloats the index a bit. But then, maybe that is fine ... > The VACUUM problems seem fairly serious. It's true that these indexes > will be less subject to bloat, because they only need updating when > the PK or the indexed columns change, not when other indexed columns > change. On the other hand, there's nothing to prevent a PK from being > recycled for an unrelated tuple. We can guarantee that a TID won't be > recycled until all index references to the TID are gone, but there's > no such guarantee for a PK. AFAICT, that would mean that an indirect > index would have to be viewed as unreliable: after looking up the PK, > you'd *always* have to recheck that it actually matched the index > qual. Yes, recheck is always needed. As for vacuum, I was thinking this morning that perhaps the answer to that is just to not vacuum the index at all and instead rely on the killtuple interface (which removes tuples during scan). So we don't need to spend precious maint_work_mem space on a large list of PK values. 
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
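[Editor's note: Claudio's suggestion, which Alvaro discusses above, can be pictured with today's syntax, though the real implementation would store the PK internally rather than as a user-visible key column. A sketch only, with assumed names:]

```sql
-- An ordinary index that carries the primary key as a trailing column
-- approximates what an indirect index would store: lookups would go
-- index -> pk -> heap instead of index -> ctid -> heap.
CREATE INDEX idx ON tab (a, b, c, pk);
```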
Re: [HACKERS] Fun fact about autovacuum and orphan temp tables
On Sat, Oct 22, 2016 at 12:15 AM, Jim Nasby wrote: > On 10/21/16 8:47 AM, Tom Lane wrote: >>> >>> It seems to me that you'd even want to make the drop of orphaned >>> > tables mandatory once it is detected even it is not a wraparound >>> > autovacuum. >> >> If we are willing to do that then we don't really have to solve the >> problem on the backend side. One could expect that autovacuum would >> clean things up within a few minutes after a backend failure. > > Unless all the autovac workers are busy working on huge tables... maybe a > delay of several hours/days is OK in this case, but it's not wise to assume > autovac will always get to something within minutes. I am really thinking that we should just do that and call it a day then, but document the fact that if one wants to look at the content of orphaned tables after a crash he had better turn autovacuum to off for the time of the analysis. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] emergency outage requiring database restart
Jim Nasby wrote:
> On 10/21/16 2:02 PM, Alvaro Herrera wrote:
>> Merlin Moncure wrote:
>>> OK, I have some good (very- in the specific case of yours truly) news
>>> to report. Doing a filesystem level copy to a test server I was able
>>> to relfilenode swap one of the critical tables over the place of the
>>> refilenode of the stored backup. Not being able know the file to copy
>>> from, I figured out the source node by judging the size and using
>>> 'strings' utility. Data recovery for that table at least appears to
>>> be 100%.
>>
>> FWIW you can use pg_filedump and match based on the number of columns.
>> I suppose you could also use the pageinspect extension, by 'dd'ing a
>> page from the file into the database and feeding into heap_page_items as
>> bytea.
>
> It occurs to me that it might be worth embedding the relation name in the
> free space of the first block. Most people would never notice the missing
> 64 bytes, but it would be incredibly helpful in cases like this...

Agreed. The problem is how to install it without breaking pg_upgrade.

Note that for DR purposes it is better to put that data in the first
block of *each segment*. Otherwise if you have many >1GB tables, you
have to go segment by segment ...

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
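[Editor's note: the pageinspect approach mentioned above looks like this when the relation is still attached to the catalog; relation name and block number are assumed:]

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Dump the line pointers and tuple headers of block 0 of a relation;
-- the attribute count of the recovered tuples helps match orphaned
-- files back to tables.
SELECT * FROM heap_page_items(get_raw_page('some_table', 0));
```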
Re: [HACKERS] Renaming of pg_xlog and pg_clog
On Sat, Oct 22, 2016 at 4:29 AM, Alvaro Herrera wrote: > David Steele wrote: >> On 10/21/16 3:12 AM, David G. Johnston wrote: >> >> > I have no problem continuing keeping with historical precedent and >> > allowing mnemonic abbreviations in our directory and file names at this >> > point. >> >> I'm still in favor of pg_xact. A search of the 9.6 docs brings up a number >> of hits for "xact": pg_last_xact_replay_timestamp(), >> pg_advisory_xact_lock(), pg_advisory_xact_lock_shared(), >> pg_last_committed_xact(), pg_prepared_xacts(), etc. There are also numerous >> column names that have "xact" in them. > > I'm +1 on pg_clog -> pg_xact. So let's say that's the winner then. > Also +1 to renaming pg_subtrans to pg_subxact. Nice suggestion, good naming for consistency with the rest. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FSM corruption leading to errors
On Sat, Oct 22, 2016 at 5:17 AM, Jim Nasby wrote:
> On 10/20/16 10:15 PM, Michael Paquier wrote:
>> 2) If anything is found, stop the server and delete the files manually.
>> 3) Re-start the server.
>> OK, that's troublesome and costly for large relations, but we know
>> that's the safest way to go for any versions, and there is no need to
>> complicate the code with any one-time repairing extensions.
>
> Wouldn't you need to run around to all your replicas and do that as well?

Yeah...

>> Speaking of which, I implemented a small extension able to truncate
>> the FSM up to the size of the relation as attached, but as I looked at
>> it SMGR_TRUNCATE_FSM has been introduced in 9.6 so its range of action
>> is rather limited... And I pushed as well a version on github:
>> https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
>> The limited range of action of such an extension is an argument good
>> enough to just rely on the stop/delete-FSM/start method to fix an
>> instance and let VACUUM do the rest of the work. That looks to work,
>> but use it at your own risk.
>
> Shouldn't the truncation be logged before it's performed?

Doh. Yes, thanks for the reminder. And I commented on that upthread..

--
Michael
Re: [HACKERS] Indirect indexes
On 10/21/16 4:05 PM, Adam Brusselback wrote:
> My main point is that maybe the reason why most users use int4 pkeys
> (besides conventional wisdom) is because it gets the most support from
> features like this, and it may not be quite as skewed if that same
> support were given to other types.

I think it's a pretty safe bet that they're the most popular simply
because serial means int4 and not int8. I bet a fair number of users
don't even realize bigserial exists.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)  mobile: 512-569-9461
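[Editor's note: the serial/bigserial distinction Jim refers to, for readers who have not met it:]

```sql
-- serial is shorthand for an int4 column backed by a sequence;
-- bigserial provides the same convenience with int8.
CREATE TABLE s1 (id serial PRIMARY KEY);
CREATE TABLE s2 (id bigserial PRIMARY KEY);
```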
Re: [HACKERS] Improve output of BitmapAnd EXPLAIN ANALYZE
On 10/21/16 12:30 PM, Stephen Frost wrote:
> I don't see why we would want to stick 'N/A' in for the header, even if
> we are reporting the details, when we can provide a pretty reasonable
> number.

Because then it's absolutely clear that we don't have a valid rowcount,
only a guess (and a guess that's potentially off by a lot). No one is
used to seeing "N/A" in explain, so when they do see it they'll
immediately realize they don't know what's going on and hit google or
the docs up. Otherwise they'll just think it's an accurate rowcount
like for any other node...

> In particular, I certainly don't think we would want to report N/A
> sometimes (lossy case) and then an actual number other times (all exact
> case). That strikes me as much more likely to be confusing.

Fair enough. I'd certainly rather have a constant N/A than a guess at
the rowcount.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)  mobile: 512-569-9461
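[Editor's note: for context, the node under discussion appears in plans like the one below; whether the planner actually chooses a BitmapAnd depends on the statistics. Schema assumed:]

```sql
CREATE TABLE t (a int, b int);
CREATE INDEX ON t (a);
CREATE INDEX ON t (b);

-- With suitable data distributions, the two bitmap index scans are
-- combined by a BitmapAnd node, whose reported row count is what this
-- thread is debating.
EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 2;
```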
Re: [HACKERS] Indirect indexes
Just throwing an anecdote out there, but my company uses UUID for primary keys on every table in the DB. While int4 is for sure more popular, it would be nice if there weren't even more reasons to "force" people in that direction. I know I started regretting the decision to go with UUID primary keys slightly once I realized that we'd need exclusion constraints, and you have to jump through hoops to use them together. My main point is that maybe the reason why most users use int4 pkeys (besides conventional wisdom) is because it gets the most support from features like this, and it may not be quite as skewed if that same support were given to other types. Just my $0.02 -Adam On Fri, Oct 21, 2016 at 4:46 PM, Jim Nasby wrote: > On 10/19/16 7:52 AM, Robert Haas wrote: > >> So, I think that this is a really promising direction, but also that >> you should try very hard to try to get out from under this 6-byte PK >> limitation. That seems really ugly, and in practice it probably means >> your PK is probably going to be limited to int4, which is kind of sad >> since it leaves people using int8 or text PKs out in the cold. >> > > My impression is that int4 is by far the most popular PK type. Even if the > initial implementation is limited to that I think it'd have a lot of > potential. > -- > Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX > Experts in Analytics, Data Architecture and PostgreSQL > Data in Trouble? Get it in Treble! http://BlueTreble.com > 855-TREBLE2 (855-873-2532) mobile: 512-569-9461 > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
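[Editor's note: the "hoops" Adam mentions come from btree_gist, which at the time of writing has no uuid opclass, so including a UUID key in an EXCLUDE list needs a cast or similar workaround. A sketch with an assumed schema:]

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE booking (
    id     uuid PRIMARY KEY,
    room   int,
    during tsrange,
    -- A plain scalar + range exclusion works fine...
    EXCLUDE USING gist (room WITH =, during WITH &&)
);

-- ...but putting the uuid column itself in the EXCLUDE list, e.g.
-- EXCLUDE USING gist (id WITH =, ...), fails without a GiST opclass
-- for uuid, hence the workarounds Adam alludes to.
```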
Re: [HACKERS] emergency outage requiring database restart
On 10/21/16 2:02 PM, Alvaro Herrera wrote:
> Merlin Moncure wrote:
>> OK, I have some good (very- in the specific case of yours truly) news
>> to report. Doing a filesystem level copy to a test server I was able
>> to relfilenode swap one of the critical tables over the place of the
>> refilenode of the stored backup. Not being able know the file to copy
>> from, I figured out the source node by judging the size and using
>> 'strings' utility. Data recovery for that table at least appears to
>> be 100%.
>
> FWIW you can use pg_filedump and match based on the number of columns.
> I suppose you could also use the pageinspect extension, by 'dd'ing a
> page from the file into the database and feeding into heap_page_items as
> bytea.

It occurs to me that it might be worth embedding the relation name in
the free space of the first block. Most people would never notice the
missing 64 bytes, but it would be incredibly helpful in cases like
this...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)  mobile: 512-569-9461
Re: [HACKERS] Indirect indexes
On 10/21/16 2:48 PM, Sven R. Kunze wrote:
>> You don't need that limitation (and vacuum will be simpler) if you add
>> the PK as another key, akin to:
>>
>> CREATE INDIRECT INDEX idx ON tab (a, b, c);
>>
>> turns into
>>
>> CREATE INDEX idx ON tab (a, b, c, pk);
>
> I know I am late to this point but I wanted to present my mere user's
> point of view. First I liked it, as it does not introduce yet another
> syntax to learn.

I believe you mis-understood what Claudio was saying. He's not
suggesting an index with the PK on the end magically becomes an
indirect index; he was saying that a "simple" way to overcome the
6-byte index TID limitation would be to store the PK as part of the
index key. He used existing DDL to illustrate that, but that was just
for illustration, not how this would actually be implemented.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)  mobile: 512-569-9461
Re: [HACKERS] Indirect indexes
On 10/19/16 7:52 AM, Robert Haas wrote:
> So, I think that this is a really promising direction, but also that you should try very hard to get out from under this 6-byte PK limitation. That seems really ugly, and in practice it probably means your PK is going to be limited to int4, which is kind of sad since it leaves people using int8 or text PKs out in the cold.

My impression is that int4 is by far the most popular PK type. Even if the initial implementation is limited to that, I think it'd have a lot of potential.

-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Re: [HACKERS] Proposal for changes to recovery.conf API
On 09/28/2016 10:13 AM, Robert Haas wrote:
> On Tue, Sep 6, 2016 at 10:11 AM, David Steele wrote:
>> On 9/6/16 8:07 AM, Robert Haas wrote:
>>> On Wed, Aug 31, 2016 at 9:45 PM, Simon Riggs wrote:
>>>> Related cleanup
>>>> * Promotion signal file is now called "promote.trigger" rather than just "promote"
>>>> * Remove user configurable "trigger_file" mechanism - use "promote.trigger" for all cases
>>>
>>> I'm in favor of this. I don't think that it's very hard for authors of backup tools to adapt to this new world, and I don't see that allowing configurability here does anything other than create more cases to worry about.
>>
>> +1 from a backup tool author.
>
> It's time to wrap up this CommitFest, and this thread doesn't seem to contain anything that looks like a committable patch. So, I'm marking this "Returned with Feedback". I hope that the fact that there's been no discussion for the last three weeks doesn't mean this effort is dead; I would like very much to see it move forward.

Has this gone anywhere? Given that we're in "break all the things" mode for PostgreSQL 10, it would be the ideal time to consolidate recovery.conf with pg.conf.

-- 
Josh Berkus
Red Hat OSAS (any opinions are my own)
Re: [HACKERS] Remove autovacuum GUC?
On 10/20/16 11:50 PM, Craig Ringer wrote:
> Personally what I think is needed here is to make monitoring and bloat visibility not completely suck. So we can warn users if tables haven't been vac'd in ages and have recent churn. And so they can easily SELECT a view to get bloat estimates with an estimate of how much drift there could've been since last vacuum.

+10. I've seen people spend a bunch of time screwing around with the 2 major "bloat queries", both of which have some issues. And there is *no* way to actually quantify whether autovac is keeping up with things or not. Users turn off vacuum because they cannot see that it is doing anything except wasting I/O and CPU. So:

* A TL;DR in the docs saying what vacuum does and why not to turn it off. In particular, warning that turning autovac off will make a slow SB get slower even though it seems to help at first. IMHO we should also suggest that users who have periods of lower activity run a manual vacuum during those periods. That reduces the odds of autovac ruining your day unexpectedly, as well as allowing it to focus on high-velocity tables that need more vacuuming and not on huge tables that just happened to surpass their threshold during a site busy period.

* A comment in the conf file with the same TL;DR. Comments are free, let's use a few lines.

* Warn on startup when autovac is off? Well, I suspect that someone who thinks autovac=off is a good idea probably doesn't monitor their logs either, but it wouldn't hurt.

-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
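Any monitoring view along the lines Craig describes would compare dead-tuple counts against the documented trigger condition. A small sketch of that condition (using the documented defaults autovacuum_vacuum_threshold=50 and autovacuum_vacuum_scale_factor=0.2; this is an illustration, not code from the thread):

```python
def autovacuum_vacuum_needed(reltuples, n_dead_tup,
                             base_threshold=50, scale_factor=0.2):
    """Documented trigger condition:
    vacuum threshold = base threshold + scale factor * reltuples,
    fired once n_dead_tup exceeds it."""
    threshold = base_threshold + scale_factor * reltuples
    return n_dead_tup > threshold

# A 1M-row table needs > 200,050 dead tuples before autovacuum kicks in.
print(autovacuum_vacuum_needed(1_000_000, 150_000))  # False
print(autovacuum_vacuum_needed(1_000_000, 250_000))  # True
```

A "is autovac keeping up" check could run this per table against pg_stat_user_tables and flag tables stuck above their threshold.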
Re: [HACKERS] FSM corruption leading to errors
On 10/20/16 10:15 PM, Michael Paquier wrote:
> 2) If anything is found, stop the server and delete the files manually.
> 3) Re-start the server.
> OK, that's troublesome and costly for large relations, but we know that's the safest way to go for any versions, and there is no need to complicate the code with any one-time repairing extensions.

Wouldn't you need to run around to all your replicas and do that as well?

> Speaking of which, I implemented a small extension able to truncate the FSM up to the size of the relation as attached, but as I looked at it SMGR_TRUNCATE_FSM has been introduced in 9.6 so its range of action is rather limited... And I pushed as well a version on github: https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation The limited range of such an extension is an argument good enough to just rely on the stop/delete-FSM/start method to fix an instance and let VACUUM do the rest of the work. That looks to work, but use it at your own risk.

Shouldn't the truncation be logged before it's performed?

-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Re: [HACKERS] Indirect indexes
On 2016-10-18 20:04:32, Claudio Freire wrote:
> You don't need that limitation (and vacuum will be simpler) if you add the PK as another key, akin to:
>
> CREATE INDIRECT INDEX idx ON tab (a, b, c);
>
> turns into
>
> CREATE INDEX idx ON tab (a, b, c, pk);

I know I am late to this point but I wanted to present my mere user's point of view. First I liked it, as it does not introduce yet another syntax to learn. However, after following the discussion, I see that indirect indexes have their disadvantages/slowdowns as well. If adding "pk" to the end of the column list just converts the index to an indirect index, I am unable to use a direct index which might be better in certain cases.

So, from a "dumb" user's point of view, I wonder if PostgreSQL can make the right decision of direct/indirect reliably (which would be great). And if not, what would be the alternatives? Introducing CREATE DIRECT INDEX?

Cheers,
Sven

PS: I'm not saying I would be affected by this, but IIRC we have (..., pk) indexes in production which then would be converted to indirect ones. But I cannot tell whether indirect indexes would do good or harm there.
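For readers following the thread, the difference under discussion is the extra hop at lookup time: an indirect index stores primary-key values rather than heap TIDs, so a fetch goes key -> PK -> heap. A toy model of that two-hop lookup (a conceptual sketch only, not the proposed implementation):

```python
# PK index maps primary keys to heap locations (TIDs, modelled as tuples).
pk_index = {1: ("block0", 0), 2: ("block0", 1), 3: ("block1", 0)}
# An indirect index maps indexed values to sets of primary keys.
indirect_idx = {"alice": {1}, "bob": {2, 3}}

def indirect_lookup(key):
    """Resolve an indexed value to heap TIDs via the PK index."""
    return {pk_index[pk] for pk in indirect_idx.get(key, set())}

print(sorted(indirect_lookup("bob")))  # [('block0', 0+1 hop each)]
```

The payoff is on the write side: a heap update that moves a row but leaves the indexed value and PK unchanged needs no entry change in `indirect_idx`, only in `pk_index`.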
Re: [HACKERS] Renaming of pg_xlog and pg_clog
David Steele wrote:
> On 10/21/16 3:12 AM, David G. Johnston wrote:
>> I have no problem continuing keeping with historical precedent and allowing mnemonic abbreviations in our directory and file names at this point.
>
> I'm still in favor of pg_xact. A search of the 9.6 docs brings up a number of hits for "xact": pg_last_xact_replay_timestamp(), pg_advisory_xact_lock(), pg_advisory_xact_lock_shared(), pg_last_committed_xact(), pg_prepared_xacts, etc. There are also numerous column names that have "xact" in them.

I'm +1 on pg_clog -> pg_xact. Also +1 on renaming pg_subtrans to pg_subxact. In general, +1 to short mnemonic meaningful names; -1 to overly long names, -1 to meaningless names.

Regarding pg_receivexlog, I think I'd propose pg_recvwal rather than pg_receivewal, because of pg_recvlogical.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] emergency outage requiring database restart
Merlin Moncure wrote:
> OK, I have some good (very good, in the specific case of yours truly) news to report. Doing a filesystem-level copy to a test server, I was able to swap the relfilenode of one of the critical tables into the place of the relfilenode from the stored backup. Not being able to know which file to copy from, I figured out the source node by judging the size and using the 'strings' utility. Data recovery for that table, at least, appears to be 100%.

FWIW you can use pg_filedump and match based on the number of columns. I suppose you could also use the pageinspect extension, by 'dd'ing a page from the file into the database and feeding it into heap_page_items as bytea.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] emergency outage requiring database restart
On Fri, Oct 21, 2016 at 1:37 PM, Merlin Moncure wrote:
> On Fri, Oct 21, 2016 at 8:03 AM, Kevin Grittner wrote:
>> On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure wrote:
>>> Most or all the damage seemed to be to the system catalogs with at least two critical tables dropped or inaccessible in some fashion. A lot of the OIDs seemed to be pointing at the wrong thing.
>>
>> While the oid in pg_class often matches the filename, that is not true after some operations (like CLUSTER or VACUUM FULL). It is the relfilenode column that is the definitive link to the file.
>
> No such operations happened. In the first instance at least one table was dropped from the system catalogs. I have a hunch that the heap is fine (supported by the size of the database on disk). At this precise moment I'm restoring the database to another fileserver in order to do some forensic analysis, also in the hopes of getting the second database online in order to expedite recovery.
>
> ah -- done. :-) Deleting the init file didn't help, but starting up single-user allowed the startup to gracefully fail with a FATAL cache lookup.

OK, I have some good (very good, in the specific case of yours truly) news to report. Doing a filesystem-level copy to a test server, I was able to swap the relfilenode of one of the critical tables into the place of the relfilenode from the stored backup. Not being able to know which file to copy from, I figured out the source node by judging the size and using the 'strings' utility. Data recovery for that table, at least, appears to be 100%.

For those following along, this simple process is only likely to work easily if the table contains only system types; no user types, enums, composites, etc., since those have a unique OID for each data restore.

merlin
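The "judge by size" step Merlin describes can be automated by scanning the database directory for bare relfilenode files close to the expected table size. A sketch of that idea (hypothetical helper; for simplicity it ignores >1GB segment files such as `16384.1` and fork files such as `16384_fsm`):

```python
import os

def candidate_relfilenodes(datadir, expected_bytes, tolerance=0.1):
    """Return (filename, size) pairs for files named like bare
    relfilenodes whose on-disk size is within `tolerance` of the
    expected table size."""
    hits = []
    for name in os.listdir(datadir):
        path = os.path.join(datadir, name)
        # Main fork segments of relations are plain files with numeric names.
        if not (name.isdigit() and os.path.isfile(path)):
            continue
        size = os.path.getsize(path)
        if abs(size - expected_bytes) <= tolerance * expected_bytes:
            hits.append((name, size))
    return sorted(hits)
```

Candidates can then be confirmed with `strings` or pg_filedump, as suggested upthread.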
Re: [HACKERS] emergency outage requiring database restart
On Fri, Oct 21, 2016 at 8:03 AM, Kevin Grittner wrote:
> On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure wrote:
>> Most or all the damage seemed to be to the system catalogs with at least two critical tables dropped or inaccessible in some fashion. A lot of the OIDs seemed to be pointing at the wrong thing.
>
> While the oid in pg_class often matches the filename, that is not true after some operations (like CLUSTER or VACUUM FULL). It is the relfilenode column that is the definitive link to the file.

No such operations happened. In the first instance at least one table was dropped from the system catalogs. I have a hunch that the heap is fine (supported by the size of the database on disk). At this precise moment I'm restoring the database to another fileserver in order to do some forensic analysis, also in the hopes of getting the second database online in order to expedite recovery.

ah -- done. :-) Deleting the init file didn't help, but starting up single-user allowed the startup to gracefully fail with a FATAL cache lookup.

merlin
Re: [HACKERS] Renaming of pg_xlog and pg_clog
* Robert Haas (robertmh...@gmail.com) wrote:
> I don't think that the problem is that people are accidentally typing "pg_resetxlog $PGDATA" and pressing return. They're typing that on purpose, and if you change the sequence of characters required to get that effect, they'll just type the new thing instead. The problem is that they don't understand when it's appropriate to use that command and what the consequences are.

I agree that they don't understand, and that's why I believe that making the command name, or a required option, a bit more ominous would make them pause and realize that maybe they want to consider other options first. This is not exactly unheard of: apt-get requires an entire phrase to be entered when you're asking it to do something extremely dangerous (remove an essential package):

---
root@ionith:~# apt-get remove login
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be REMOVED:
  login
WARNING: The following essential packages will be removed.
This should NOT be done unless you know exactly what you are doing!
  login
0 upgraded, 0 newly installed, 1 to remove and 71 not upgraded.
After this operation, 1,212 kB disk space will be freed.
You are about to do something potentially harmful.
To continue type in the phrase 'Yes, do as I say!'
 ?]
---

My thought would be to make pg_resetxlog do something more along the lines of what pg_controldata does and have it, by default, just investigate the state of things and complain about problems, and then have it include an option to actually reset things with an appropriately ominous name. The goal there being, primarily, to give users a way to get information about why PG isn't starting or what it is complaining about, with some additional information about what happens if they forcibly reset the xlog, noting that it could very likely cause corruption, etc.

Thanks!

Stephen
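The apt-get-style gate Stephen points to is simple to prototype. The sketch below only illustrates the proposed user experience; the function name and messages are invented here, and this is not proposed pg_resetxlog code:

```python
def confirm_destructive(action, answer, phrase="Yes, do as I say!"):
    """Refuse a dangerous operation unless the exact confirmation
    phrase was supplied, apt-get style."""
    if answer != phrase:
        return f"{action}: aborted (type the phrase '{phrase}' to proceed)"
    return f"{action}: proceeding"

print(confirm_destructive("reset WAL", "yes"))
print(confirm_destructive("reset WAL", "Yes, do as I say!"))
```

The point of an exact phrase rather than a flag is that it cannot be supplied by reflex or cargo-culted from a blog post without reading the warning.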
Re: [HACKERS] Renaming of pg_xlog and pg_clog
On Fri, Oct 21, 2016 at 9:47 AM, Stephen Frost wrote:
> * Robert Haas (robertmh...@gmail.com) wrote:
>> On Thu, Oct 20, 2016 at 12:12 PM, Stephen Frost wrote:
>>> That said, I'd also like to see a --force or similar option or mechanism put in place to reduce the risk of users trashing their system because they think pg_resetwal is "safe." ("It's just gonna reset things to make the database start again, should be fine.")
>>
>> You know we already have that, right?
>
> Yes, but I was meaning an option which would be required to make pg_resetxlog actually *do* anything. In other words, I'd rather have it report some info back to the user, if it's run without the '--really-force' or what-have-you option, and only proceed with clearing the WAL or rewriting pg_control when '--really-force' is used.

I don't think that the problem is that people are accidentally typing "pg_resetxlog $PGDATA" and pressing return. They're typing that on purpose, and if you change the sequence of characters required to get that effect, they'll just type the new thing instead. The problem is that they don't understand when it's appropriate to use that command and what the consequences are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] PSA: Systemd will kill PostgreSQL
On Fri, Oct 21, 2016 at 8:29 AM, Peter Eisentraut wrote:
> On 8/16/16 1:05 PM, Tom Lane wrote:
>> Oh, interesting. It had occurred to me that we might be able to dodge this issue if we started to recommend using unnamed POSIX semaphores instead of SysV. (Obviously we'd want to check performance, but it's at least a plausible alternative.) I had not wanted to go there if it meant that we could have silent loss of SysV shmem with no other symptoms, because as I said upthread, I'm concerned about that breaking the multiple-postmaster interlock. However, if the cleanup kills only semaphores and not attached-to shmem, then that objection goes away and this becomes something we should seriously consider.
>
> I was digging around this issue the other day again. We have switched to unnamed POSIX semaphores by default now, which will help. But for dynamic shared memory (DSM) we use POSIX shared memory by default, which is cleaned up without regard to attachment. So there is still a potential for failures here, possibly more rare or obscure, given the usage of DSM.

The reason I did it that way is because System V shared memory is often subject to very low limits on how much can be allocated, which can also produce failures. It would be easy to switch the default implementation from POSIX to System V, but I suspect that would be a loser overall -- in other words, I suspect that if we switched the default, more people would get hosed by not being able to create those segments in the first place than are currently getting hosed by having them removed prematurely.

Also, POSIX shared memory segments, at least on Linux, are implemented as files. If you remove a file, people who have it open can normally continue to access it. So it might work OK as long as the file isn't removed until after everybody involved in a parallel query has already attached. That's a dangerous thing to bet on, though, because the DSM facility also supports server-lifetime DSMs. We're not using those capabilities right now in core, but we might start - e.g. Magnus was suggesting that we could use DSMs plus Thomas Munro's DSA and DHT patches to replace the stats collector or the temp files used by pg_stat_statements.

-- 
Robert Haas
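Robert's point about Linux POSIX shm rests on ordinary unlink semantics: removing a file's name does not destroy the file while a descriptor remains open. That property is easy to demonstrate with a plain file (the same behaviour files under /dev/shm have; this is an illustration of the filesystem semantics, not PostgreSQL's DSM code):

```python
import os
import tempfile

# Create a file, write to it, then remove its name while keeping the
# descriptor open -- an analogue of systemd cleaning up a shm segment.
fd, path = tempfile.mkstemp()
os.write(fd, b"shared state")
os.unlink(path)
assert not os.path.exists(path)   # the name is gone...

os.lseek(fd, 0, os.SEEK_SET)
print(os.read(fd, 100))           # ...but the data is still readable
os.close(fd)                      # the inode is freed only now
```

This is exactly why processes already attached to a removed segment keep working, while any process trying to attach afterwards fails.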
Re: [HACKERS] Improve output of BitmapAnd EXPLAIN ANALYZE
* Jim Nasby (jim.na...@bluetreble.com) wrote:
> On 10/21/16 8:21 AM, Stephen Frost wrote:
>> Counting each page as the relation's average number of tuples per page seems entirely reasonable to me, for what that is trying to report.
>
> My concern is that still leaves a lot of room for confusion when interpreting EXPLAIN ANALYZE. Every other node will tell you exactly what happened, and it's pretty easy to reason about whether rows should have gone up or down based on the type of node. You can't do that for Bitmap(And|Or) unless you know the details of how TIDBitmaps work. Reporting N/A makes it crystal clear that these nodes operate very differently from all the others.

I don't see why you think the numbers reported by BitmapAnd based on this approach wouldn't go up and down in a similar manner to what you would expect, based on that node type. Reporting N/A is entirely punting on it when we have perfectly useful information that can be reported.

> (On a related note, it would also be nice if we reported fractional rows when the row count is low and loops is high.)

I can certainly understand that, though I think I'd rather have an actual 'total' value or similar instead, but that's really a different discussion.

>> That said, I'm a big fan of how we have more detail for things like a HashJoin (buckets, batches, memory usage) and it might be nice to have more information like that for a BitmapAnd (and friends). In particular, I'm thinking of memory usage, exact vs. lossy pages, etc. Knowing that the bitmap has gotten to the point of being lossy might indicate that a user could raise work_mem, for example, and possibly avoid recheck costs.
>
> I think that's the best way to handle this: report N/A in the header and then provide details on exact vs. lossy. That provides a clear indication to users that these kinds of nodes are special, as well as a reminder as to why they're special. Certainly the node could report an exact rowcount in the header if there were no lossy pages too.

I don't see why we would want to stick 'N/A' in the header, even if we are reporting the details, when we can provide a pretty reasonable number. In particular, I certainly don't think we would want to report N/A sometimes (lossy case) and then an actual number other times (all-exact case). That strikes me as much more likely to be confusing.

Thanks!

Stephen
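The estimate Stephen argues for (counting each lossy page at the relation's average tuples per page) reduces to simple arithmetic; the sketch below illustrates the proposal under discussion, not committed EXPLAIN behaviour:

```python
def bitmap_heap_row_estimate(exact_rows, lossy_pages, reltuples, relpages):
    """Rows to report for a bitmap result: exact entries counted as-is,
    each lossy page charged at the relation's average tuples per page."""
    avg_tuples_per_page = reltuples / relpages if relpages else 0
    return exact_rows + lossy_pages * avg_tuples_per_page

# 1000 rows from exact pages plus 10 lossy pages at ~100 tuples/page:
print(bitmap_heap_row_estimate(1000, 10, 100_000, 1000))  # 2000.0
```

When no pages are lossy the formula degenerates to the exact count, which matches the observation that the header could be exact in that case anyway.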
Re: [HACKERS] Default setting for autovacuum_freeze_max_age
On Fri, Oct 21, 2016 at 1:17 PM, Josh Berkus wrote:
> Particularly, with 9.6's freeze map, point (2) is an even stronger reason to *lower* autovacuum_freeze_max_age. Since there's little duplicate work in a freeze scan, a lot of users will find that frequent freezing benefits them a lot ...

That's a very good point, although I hope that vacuum is mostly being triggered by vacuum_freeze_table_age rather than autovacuum_freeze_max_age.

On Bruce's original question, there is an answer written into our documentation: "Vacuum also allows removal of old files from the pg_clog subdirectory, which is why the default is a relatively low 200 million transactions."

-- 
Robert Haas
Re: [HACKERS] Parallel Index Scans
On Fri, Oct 21, 2016 at 9:27 AM, Amit Kapila wrote:
>> I think the parallel_workers reloption should override the degree of parallelism for any sort of parallel scan on that table. Had I intended it to apply only to sequential scans, I would have named it differently.
>
> I think there is a big difference in the size of the relation to scan between a parallel sequential scan and a parallel (range) index scan, which could make it difficult for the user to choose the value of this parameter. Why do you think that the parallel_workers reloption should suffice for all types of scans on a table? I could only think of providing it based on thinking that fewer config knobs make life easier.

Well, we could do that, but it would be fairly complicated and it doesn't seem to me to be the right place to focus our efforts. I'd rather try to figure out some way to make the planner smarter, because even if users can override the number of workers on a per-table-per-scan-type basis, they're probably still going to find using parallel query pretty frustrating unless we make the number-of-workers formula smarter than it is today. Anyway, even if we do decide to add more reloptions than just parallel_degree someday, couldn't that be left for a separate patch?

-- 
Robert Haas
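For context, the number-of-workers formula Robert refers to is, in the 9.6-era planner, roughly logarithmic in table size: one worker once the relation passes min_parallel_relation_size, plus one more each time the size triples. A rough approximation of that heuristic (an illustration, not the exact planner code):

```python
import math

def planned_workers(heap_pages, min_pages=1024, max_workers=8):
    """Approximate 9.6-era heuristic: 0 workers below min_pages
    (8MB with 8kB blocks), then 1 + one extra per tripling,
    capped by the max-workers setting."""
    if heap_pages < min_pages:
        return 0
    workers = 1 + int(math.log(heap_pages / min_pages, 3))
    return min(workers, max_workers)

print(planned_workers(512))       # 0: below the threshold
print(planned_workers(1024))      # 1
print(planned_workers(9 * 1024))  # 3
```

The coarseness of this curve is why a per-table override exists at all, and why Amit questions whether one knob can serve scan types with very different scan volumes.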
Re: [HACKERS] Default setting for autovacuum_freeze_max_age
On 10/21/2016 07:44 AM, Tom Lane wrote:
> Bruce Momjian writes:
>> Why is autovacuum_freeze_max_age's default set to 200 million, rather than something like 2 billion? It seems 2 billion is half way to wrap-around and would be a better default. Right now, the default seems to freeze 10x more often than it has to.
>
> Please see the archives. I do not remember the reasoning, but there was some, and you need to justify why it was wrong, not just assert that you think it's silly.

IIRC, there were a couple reasons (and I think they're still good reasons, which is why I haven't asked to change the default):

1. By setting it to 10% of the max space, we give users plenty of room to raise the number if they need to without getting into crisis territory.

2. Raising this threshold isn't an unalloyed good. The longer you wait to freeze, the more work you'll need to do when autovac freeze rolls around. There are actually situations where you want to make this threshold *lower*, although generally scheduled manual vacuum freezes serve that.

Particularly, with 9.6's freeze map, point (2) is an even stronger reason to *lower* autovacuum_freeze_max_age. Since there's little duplicate work in a freeze scan, a lot of users will find that frequent freezing benefits them a lot ... especially if they can take advantage of index-only scans.

-- 
Josh Berkus
Red Hat OSAS (any opinions are my own)
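Josh's point (1) is plain arithmetic over the XID horizon: roughly 2^31 transactions of lookahead exist before wraparound, so the current default leaves about 90% headroom while 2 billion would leave about 7%. A quick illustration of that comparison:

```python
WRAP_HORIZON = 2**31  # ~2.1 billion XIDs until wraparound

def headroom_fraction(freeze_max_age):
    """Fraction of the wraparound horizon still free once the
    forced-freeze threshold is reached."""
    return (WRAP_HORIZON - freeze_max_age) / WRAP_HORIZON

print(round(headroom_fraction(200_000_000), 2))    # current default
print(round(headroom_fraction(2_000_000_000), 2))  # proposed 2 billion
```

This is the "room to raise the number without getting into crisis territory" Josh describes, and the same headroom Alvaro appeals to downthread.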
Re: [HACKERS] [BUG] pg_basebackup from disconnected standby fails
On Sun, Oct 2, 2016 at 8:52 AM, Michael Paquier wrote:
>> So, if I understand correctly, then we can mark the version posted by you upthread [1], which includes a test along with Kyotaro's fix, as Ready for Committer. If so, then please change the status of the patch accordingly.
>
> Patch moved to next CF 2016-11, still with status "ready for committer".
>
> IMO, this thread has reached a conclusion to use this patch to address the problem reported: cab7npqtv5gmkqcndofgtgqoqxz2xlz4rrw247oqojzztvy6...@mail.gmail.com

I spent some time reading this thread today and trying to understand exactly what's going on here. Here's my attempt to restate the problem:

1. If a restartpoint flushes no dirty buffers, then the redo location advances but the minimum recovery point does not; therefore, the minimum recovery point can fall behind the redo location. That's correct behavior, because if the standby is subsequently restarted, it can correctly begin at the checkpoint record associated with the restartpoint and need not replay any further.

2. However, it causes a problem when a base backup with the "include WAL" option is taken from a standby because -- on standby servers only -- the end location for a backup as returned by do_pg_stop_backup() is the minimum recovery point. If this precedes the start point for the backup -- which is the restart point -- perform_base_backup() will conclude that no WAL files are required for the backup and fail an internal sanity check.

3. Removing the sanity check wouldn't help, because it can never be right to end up with zero WAL files as a result of a base backup. At a minimum, we need the WAL file which contains the checkpoint record from which the control file specifies that redo should start. Actually, I think that checkpoint record could be spread across multiple files: it might be big. We need them all, but the current code won't ensure that we get them.

The consensus solution on this thread seems to be that we should have do_pg_stop_backup() return the last-replayed XLOG location as the backup end point. If the control file has been updated with a newer redo location, then the associated checkpoint record has surely been replayed, so we'll definitely have enough WAL to replay that checkpoint record (and we don't need to replay anything else, because we're supposing that this is the case where the minimum recovery point precedes the redo location). Although this will work, it might include more WAL in the backup than is strictly necessary. If the additional WAL replayed beyond the minimum recovery point does NOT include a checkpoint, we don't need any of it; if it does, we need only the portion up to and including the last checkpoint record, and not anything beyond that.

I can think of two solutions that would be "tighter":

1. When performing a restartpoint, update the minimum recovery point to just beyond the checkpoint record. I think this can't hurt anyone who is actually restarting recovery on the same machine, because that's exactly the point where they are going to start recovery. A minimum recovery point which precedes the point at which they will start recovery is no better than one which is equal to the point at which they will start recovery.

2. In perform_base_backup(), if the endptr returned by do_pg_stop_backup() precedes the end of the checkpoint record returned by do_pg_start_backup(), use the latter value instead. Unfortunately, that's not so easy: we can't just say "if (endptr < startptr) endptr = startptr;" because startptr is the *start* of the checkpoint record, not the end. I suspect somebody could figure out a solution to this problem, though.

If we decide we don't want to aim for one of these tighter solutions and just adopt the previously-discussed patch, then at the very least it needs better comments. The minimal comment change the patch makes right now doesn't give any explanation of why the new algorithm is better than the old, and that's not cool. Without that, the next person who comes along to maintain this code won't have any easy way of knowing that we grappled with this problem and what our conclusions were.

-- 
Robert Haas
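The trade-off between the consensus fix and the tighter options can be illustrated with LSNs modelled as plain integers (a sketch of the alternatives Robert lays out, not actual backend code):

```python
def backup_end_lsn(min_recovery_point, last_replayed, ckpt_record_end):
    """Candidate backup end points for a standby base backup.

    consensus: return the last-replayed LSN (always safe, possibly
               includes more WAL than strictly needed).
    tighter:   stop at the end of the restartpoint's checkpoint record
               when the minimum recovery point precedes it."""
    consensus = max(min_recovery_point, last_replayed)
    tighter = max(min_recovery_point, ckpt_record_end)
    return consensus, tighter

# Min recovery point (100) fell behind the checkpoint record end (180);
# replay has advanced to 500. The consensus fix ships WAL up to 500,
# the tighter bound needs only up to 180.
print(backup_end_lsn(100, 500, 180))  # (500, 180)
```

Both candidates are strictly later than the broken status quo, which could return a minimum recovery point before the backup's start point and trip the sanity check.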
Re: [HACKERS] Default setting for autovacuum_freeze_max_age
Bruce Momjian wrote:
> On Fri, Oct 21, 2016 at 10:44:41AM -0400, Tom Lane wrote:
>> Bruce Momjian writes:
>>> Why is autovacuum_freeze_max_age's default set to 200 million, rather than something like 2 billion? It seems 2 billion is half way to wrap-around and would be a better default. Right now, the default seems to freeze 10x more often than it has to.
>>
>> Please see the archives. I do not remember the reasoning, but there was some, and you need to justify why it was wrong, not just assert that you think it's silly.
>
> I think the reasoning was to avoid checking old clog files, but with tuple transaction status bits, e.g. HEAP_XMIN_COMMITTED, which were added long ago, I don't remember why either.

HEAP_XMIN_COMMITTED existed way before autovacuum, so that doesn't add up, does it. As I recall, the reason was to be able to truncate pg_clog. I suppose nowadays it's possible to claim that we're not really bothered by a gigabyte or two of pg_clog?

*If* we're to raise the default then it should not be to 2 billion. That gives users no breathing room if they find themselves struggling with the freezing; with the current default, it's possible to increase it 2x or 4x if you're in trouble, which gives some breathing room until a permanent solution is found (better vacuuming). That disappears if you set the max to its max.

> I remember asking years ago and not getting a good answer, and giving up.

[citation needed]

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[HACKERS] Specifying the log file name of pgbench -l option
Hi all,

The log file generated by the pgbench -l option has a fixed name, 'pgbench_log.<pid>.<thread_id>', which makes it a little complicated for a script that runs pgbench repeatedly to identify the log file. The attached patch makes it possible to specify the log file name. I think this is useful for users who want to run pgbench repeatedly in a script and collect and analyze the results.

The one thing I'm concerned about is that this patch changes the -l option so that it requires an argument. But changing its behavior would be better than adding a new option. Please give me feedback.

Regards,

-- 
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment: pgbench_log_file_name.patch
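A wrapper script has to reconstruct names like these to collect the logs, which is the inconvenience being addressed. The sketch below assumes the <prefix>.<pid>[.<thread>] layout described in the pgbench documentation; the helper itself is invented for illustration, not part of the patch:

```python
def pgbench_log_name(prefix, pid, thread_id=None):
    """Build a pgbench transaction-log file name: <prefix>.<pid>,
    with an extra .<thread> suffix for additional worker threads."""
    name = f"{prefix}.{pid}"
    if thread_id is not None:
        name += f".{thread_id}"
    return name

print(pgbench_log_name("pgbench_log", 4321))   # today's fixed prefix
print(pgbench_log_name("run42", 4321, 2))      # with a custom prefix
```

With a configurable prefix, a script can name each run (e.g. run42.*) and glob for exactly its own files instead of guessing PIDs.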
Re: [HACKERS] Aggregate Push Down - Performing aggregation on foreign server
On Fri, Oct 21, 2016 at 11:20 AM, Tom Lane wrote: > Robert Haas writes: >> On Fri, Oct 21, 2016 at 10:47 AM, Tom Lane wrote: >>> dromedary seems to have found one, or at least an unstable test result. > >> OK, looking at that now. Thanks. > > Looking at further failures, it looks like 32-bit machines in general > get that result. Might be just a cost estimation difference. > > Also, some of the windows machines are folding "sqrt(2)" to a different > constant than is hard-wired into the expected-result file. That's > slightly annoying because it calls into question whether we can ship > floating-point computations to the far end at all :-(. IMHO, not shipping floating-point computations for that reason would be more annoying than helpful. To really guarantee that the remote and local results are identical, you'd need far more infrastructure than we have - you'd have to make sure the operating system collations matched, for example. And we're already assuming (without any proof) that the default collations match and, for that matter, that the datatypes match. If you're pushing down to PostgreSQL on the other end, you at least have a hope that the other side might be sufficiently identical to the local side that the results will be the same, but if somebody implements these pushdowns for Oracle or SQL Server, you almost might as well give up and go home. Even for PostgreSQL, getting the same results in every possible corner case requires a certain degree of optimism. For my money, the way to handle that is to add more control over pushdown rather than automatically disabling it in certain cases. For example, we could have GUCs that disable all pushdown or certain types of pushdown - e.g. you could specify that you don't trust the remote side to sort data properly, or that you don't like its floating-point implementation. 
I'm not sure what controls are most useful here, but I'd be willing to bet you a nice glass of white wine that many people will be happier with a system that they can control than they will be with one that just disables the optimization in every case where it might hypothetically not work out. It's not clear that we can even reasonably foresee all such cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Improve output of BitmapAnd EXPLAIN ANALYZE
On 10/21/16 8:21 AM, Stephen Frost wrote: Counting each page as the relation's average number of tuples per page seems entirely reasonable to me, for what that is trying to report. My concern is that still leaves a lot of room for confusion when interpreting EXPLAIN ANALYZE. Every other node will tell you exactly what happened and it's pretty easy to reason about whether rows should have gone up or down based on the type of node. You can't do that for Bitmap(And|Or) unless you know the details of how TIDBitmaps work. Reporting N/A makes it crystal clear that these nodes operate very differently than all the others. (On a related note, it would also be nice if we reported fractional rows when the row count is low and loops is high.) That said, I'm a big fan of how we have more detail for things like a HashJoin (buckets, batches, memory usage) and it might be nice to have more information like that for a BitmapAnd (and friends). In particular, I'm thinking of memory usage, exact vs. lossy pages, etc. Knowing that the bitmap has gotten to the point of being lossy might indicate that a user could up work_mem, for example, and possibly avoid recheck costs. I think that's the best way to handle this: report N/A in the header and then provide details on exact vs lossy. That provides a clear indication to users that these kinds of nodes are special, as well as a reminder as to why they're special. Certainly the node could report an exact rowcount in the header if there were no lossy pages too. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
Re: [HACKERS] Aggregate Push Down - Performing aggregation on foreign server
Robert Haas writes: > On Fri, Oct 21, 2016 at 10:47 AM, Tom Lane wrote: >> dromedary seems to have found one, or at least an unstable test result. > OK, looking at that now. Thanks. Looking at further failures, it looks like 32-bit machines in general get that result. Might be just a cost estimation difference. Also, some of the windows machines are folding "sqrt(2)" to a different constant than is hard-wired into the expected-result file. That's slightly annoying because it calls into question whether we can ship floating-point computations to the far end at all :-(. regards, tom lane
Re: [HACKERS] Fun fact about autovacuum and orphan temp tables
On 10/21/16 8:47 AM, Tom Lane wrote: It seems to me that you'd even want to make the drop of orphaned > tables mandatory once it is detected even it is not a wraparound > autovacuum. If we are willing to do that then we don't really have to solve the problem on the backend side. One could expect that autovacuum would clean things up within a few minutes after a backend failure. Unless all the autovac workers are busy working on huge tables... maybe a delay of several hours/days is OK in this case, but it's not wise to assume autovac will always get to something within minutes. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
Re: [HACKERS] Aggregate Push Down - Performing aggregation on foreign server
On Fri, Oct 21, 2016 at 10:47 AM, Tom Lane wrote: > Robert Haas writes: >> I didn't find anything structurally wrong with this patch, so I've >> committed it with many cosmetic changes. Judging by what happened >> with join pushdown, there are probably some residual bugs, but >> hopefully not too many. > > dromedary seems to have found one, or at least an unstable test result. OK, looking at that now. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Default setting for autovacuum_freeze_max_age
On Fri, Oct 21, 2016 at 10:44:41AM -0400, Tom Lane wrote: > Bruce Momjian writes: > > Why is autovacuum_freeze_max_age's default set to 200 million, rather > > than something like 2 billion? It seems 2 billion is half way to > > wrap-around and would be a better default. Right now, the default seems > > to freeze 10x more often than it has to. > > Please see the archives. I do not remember the reasoning, but there > was some, and you need to justify why it was wrong not just assert > that you think it's silly. I think the reasoning was to avoid checking old clog files, but with tuple transaction status bits, e.g. HEAP_XMIN_COMMITTED, which were added long ago, I don't remember why either. I remember asking years ago and not getting a good answer, and giving up. If no one can give an answer, I suggest we change the default. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] Aggregate Push Down - Performing aggregation on foreign server
Robert Haas writes: > I didn't find anything structurally wrong with this patch, so I've > committed it with many cosmetic changes. Judging by what happened > with join pushdown, there are probably some residual bugs, but > hopefully not too many. dromedary seems to have found one, or at least an unstable test result. regards, tom lane
Re: [HACKERS] Default setting for autovacuum_freeze_max_age
Bruce Momjian writes: > Why is autovacuum_freeze_max_age's default set to 200 million, rather > than something like 2 billion? It seems 2 billion is half way to > wrap-around and would be a better default. Right now, the default seems > to freeze 10x more often than it has to. Please see the archives. I do not remember the reasoning, but there was some, and you need to justify why it was wrong not just assert that you think it's silly. regards, tom lane
Re: [HACKERS] Aggregate Push Down - Performing aggregation on foreign server
On Thu, Oct 20, 2016 at 5:38 AM, Jeevan Chalke wrote: > Changes look good to me. > Thanks for the detailed review. I didn't find anything structurally wrong with this patch, so I've committed it with many cosmetic changes. Judging by what happened with join pushdown, there are probably some residual bugs, but hopefully not too many. Anyhow, I don't think waiting longer to commit this is necessarily going to help, so let's see what happens... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Renaming of pg_xlog and pg_clog
* Robert Haas (robertmh...@gmail.com) wrote: > On Thu, Oct 20, 2016 at 12:12 PM, Stephen Frost wrote: > > That said, I'd also like to see a --force or similar option or mechanism > > put in place to reduce the risk of users trashing their system because > > they think pg_resetwal is "safe." ("It's just gonna reset things to make > > the database start again, should be fine."). > > You know we already have that, right? Yes, but I was meaning an option which would be required to make pg_resetxlog actually *do* anything. In other words, I'd rather have it report some info back to the user, if it's run without the '--really-force' or what-have-you option, and only proceed with clearing the WAL or rewriting pg_control when '--really-force' is used. > > pg_destroydb almost seems like a better choice, though I suppose > > 'pg_clearwal' would be more acceptable. Doesn't have quite the same > > impact though. > > > > Not sure on the best answer here, but it's definitely a foot-gun that some > > users have ended up using on themselves with depressing regularity. > > Just to provide some perspective from the other side of this, I [...] I wasn't suggesting that we remove the capability. There are certainly use-cases for it, but, unfortunately, I've seen a number of cases where users simply google'd an error that they got back when trying to start PG and found someone saying "well, I got this error, but then I ran pg_resetxlog, and now the database starts up again." It likely doesn't help that the top links tend to be to mailing list archives where pg_resetxlog was brought up. Thanks! Stephen
Re: [HACKERS] Fun fact about autovacuum and orphan temp tables
Michael Paquier writes: > On Thu, Oct 20, 2016 at 9:30 PM, Constantin S. Pan wrote: >> I tried to fix the problem with a new backend not being >> able to reuse a temporary namespace when it contains >> thousands of temporary tables. I disabled locking of objects >> during namespace clearing process. See the patch attached. >> Please tell me if there are any reasons why this is wrong. > That's invasive. Invasive or otherwise, it's *completely unacceptable*. Without a lock you have no way to be sure that nothing else is touching the table. A less broken approach might be to split the cleanup into multiple shorter transactions, that is, after every N objects stop and commit what you've done so far. This shouldn't be that hard to do during backend exit, as I'm pretty sure we're starting a new transaction just for this purpose anyway. I don't know if it'd be possible to do it during the automatic cleanup when glomming onto a pre-existing temp namespace, because we're already within a user-started transaction at that point. But if we solve the problem where it's being created, maybe that's enough for now. >> I also added a GUC variable and changed the condition in >> autovacuum to let it nuke orphan tables on its way. >> See another patch attached. > It seems to me that you'd even want to make the drop of orphaned > tables mandatory once it is detected even it is not a wraparound > autovacuum. If we are willing to do that then we don't really have to solve the problem on the backend side. One could expect that autovacuum would clean things up within a few minutes after a backend failure. We'd have to be really darn sure that that "orphaned backend" test can never have any false positives, though. I'm not sure that it was ever designed to be race-condition-proof. regards, tom lane
Re: [HACKERS] Parallel Index Scans
On Thu, Oct 20, 2016 at 10:33 PM, Robert Haas wrote: > On Wed, Oct 19, 2016 at 11:07 PM, Amit Kapila wrote: >>> Ideally, the parallel_workers storage parameter will rarely be >>> necessary because the optimizer will generally do the right thing in >>> all case. >> >> Yeah, we can choose not to provide any parameter for parallel index >> scans, but some users might want to have a parameter similar to >> parallel table scans, so it could be handy for them to use. > > I think the parallel_workers reloption should override the degree of > parallelism for any sort of parallel scan on that table. Had I > intended it to apply only to sequential scans, I would have named it > differently. > I think there is a big difference in the size of the relation to scan between a parallel sequential scan and a parallel (range) index scan, which could make it difficult for the user to choose the value of this parameter. Why do you think that the parallel_workers reloption should suffice for all types of scans on a table? I can only think of providing it on the grounds that fewer config knobs make life easier. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Improve output of BitmapAnd EXPLAIN ANALYZE
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Stephen Frost writes: > > Changing it in a new major release seems entirely reasonable. > > It's still a crock though. I wonder whether it wouldn't be better to > change the nodeBitmap code so that when EXPLAIN ANALYZE is active, > it expends extra effort to try to produce a rowcount number. I'm certainly all for doing something better, just didn't think that we should be worried about making a change to the EXPLAIN ANALYZE output in a major release because Depesz might have to update the explain site. > We could certainly run through the result bitmap and count the number > of exact-TID bits. I don't see a practical way of doing something > with lossy page bits, but maybe those occur infrequently enough > that we could ignore them? Or we could arbitrarily decide that > a lossy page should be counted as MaxHeapTuplesPerPage, or a bit > less arbitrarily, count it as the relation's average number > of tuples per page. Counting each page as the relation's average number of tuples per page seems entirely reasonable to me, for what that is trying to report. That said, I'm a big fan of how we have more detail for things like a HashJoin (buckets, batches, memory usage) and it might be nice to have more information like that for a BitmapAnd (and friends). In particular, I'm thinking of memory usage, exact vs. lossy pages, etc. Knowing that the bitmap has gotten to the point of being lossy might indicate that a user could up work_mem, for example, and possibly avoid recheck costs. Thanks! Stephen
[HACKERS] Default setting for autovacuum_freeze_max_age
Why is autovacuum_freeze_max_age's default set to 200 million, rather than something like 2 billion? It seems 2 billion is half way to wrap-around and would be a better default. Right now, the default seems to freeze 10x more often than it has to. Does it default to 200 million so clog can be trimmed? Is that reasonable? We have tuple status flags of commit status so I assume changing from a normal xid to a frozen one doesn't have a performance benefit, does it? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] emergency outage requiring database restart
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure wrote: > Most or all the damage seemed to be to the system catalogs with > at least two critical tables dropped or inaccessible in some > fashion. A lot of the OIDs seemed to be pointing at the wrong > thing. While the oid in pg_class often matches the filename, that is not true after some operations (like CLUSTER or VACUUM FULL). It is the relfilenode column that is the definitive link to the file. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [RFC] Transaction management overhaul is necessary?
On 21 October 2016 at 18:57, Pavel Stehule wrote: > 2016-10-21 10:24 GMT+02:00 Tsunakawa, Takayuki > : >> >> Hello, >> >> From our experience in handling customers' problems, I feel it's necessary >> to evolve PostgreSQL's transaction management. The concrete problems are: >> >> 1. PostgreSQL cannot end and begin transactions in PL/pgSQL and PL/Java >> stored functions. >> This is often the reason people could not migrate to PostgreSQL. I've run into relatively few for whom this has landed up being a showstopper, but I agree it's a pain. There'll probably be more as bigger outfits seek to move from That Other Database where it's routine to do this. >> This was because psqlODBC starts and ends a subtransaction for each SQL >> statement by default to implement statement-level rollback. And PostgreSQL >> creates one CurTransactionContext memory context, which is 8KB, for each >> subtransaction and retain them until the top transaction ends. Surely that's where to start then. Find a way to pool and re-use, fully release, or otherwise be done with transaction contexts for released savepoints. >> The total >> memory used becomes 40GB (8KB * 5 million subtransactions.) This was >> avoided by setting the Protocol parameter to 7.4-1, which means >> transaction-level rollback. You can control transaction level rollback in psqlODBC directly. You do not need to fall back to the old protocol. Check the driver options. >> The savepoint approach for supporting statement-level rollback is >> inefficient, because it adds two roundtrips (SAVEPOINT and RELEASE) for each >> statement. Right. We can't just fire off each statement wrapped in SAVEPOINT and RELEASE SAVEPOINT because we need to get the result of the statement and decide whether to ROLLBACK TO SAVEPOINT or RELEASE SAVEPOINT. It only requires two round trips if you shove the SAVEPOINT in with the intended statement, but it's still messy. 
I'd like to see an alternative statement with semantics more akin to COMMIT - one which automatically turns into ROLLBACK if the tx is aborted. COMMIT SAVEPOINT would be too confusing since it's not truly committed. I don't know what to call it. But basically something that does RELEASE SAVEPOINT [named savepoint] unless the subxact is in aborted state, in which case it does ROLLBACK TO [named savepoint]. Bonus points for letting it remember the last savepoint created and use that. Furthermore, we should really add it on the protocol level so drivers can send subtransaction control messages more compactly, without needing to go through the parser etc, and without massively spamming the logs. For this purpose savepoint names would be internally generated so the driver wouldn't have to send them. We'd log savepoint boundaries when transaction logging was enabled. Since the client would send the first such protocol request we could do it on the sly without a protocol version bump; clients could just check server version and not use the new messages for older servers. If they send it to an older server they get a protocol error, which is fine. > You should to implement a CALL statement - that can be independent on outer > transaction. The behave inside procedure called by CALL statement should be > same like client side - and there you can controll transactions explicitly > without nesting. I agree that'd be desirable. Top level "procedures" are necessary for this, really. This would also enable us to return multiple result sets. We'd probably have to start at least one small read-only tx for the initial cache access to look up the proc and set everything up, but if we don't allocate xids local transactions are super cheap. However, I think trying to tackle the memory context bloat reported upthread would be a more effective starting point since it immediately targets the problem actually experienced. 
-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] PSA: Systemd will kill PostgreSQL
On 8/16/16 1:05 PM, Tom Lane wrote: > Oh, interesting. It had occurred to me that we might be able to dodge > this issue if we started to recommend using unnamed POSIX semaphores > instead of SysV. (Obviously we'd want to check performance, but it's > at least a plausible alternative.) I had not wanted to go there if > it meant that we could have silent loss of SysV shmem with no other > symptoms, because as I said upthread, I'm concerned about that breaking > the multiple-postmaster interlock. However, if the cleanup kills only > semaphores and not attached-to shmem, then that objection goes away and > this becomes something we should seriously consider. I was digging around this issue the other day again. We have switched to unnamed POSIX semaphores by default now, which will help. But for dynamic shared memory (DSM) we use POSIX shared memory by default, which is cleaned up without regard to attachment. So there is still a potential for failures here, possibly more rare or obscure, given the usage of DSM. (If someone is keeping score, it appears the "safest" combination is SysV shared memory + POSIX semaphores.) I have started a wiki page to collect this information: https://wiki.postgresql.org/wiki/Systemd To be continued, I suppose ... -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] pg_basebackup stream xlog to tar
On Mon, Oct 17, 2016 at 2:37 PM, Michael Paquier wrote: > Except that it looks in pretty good to me, so I am switching that as > ready for committer. + /* +* Create pg_xlog/archive_status (and thus pg_xlog) so we can write to +* basedir/pg_xlog as the directory entry in the tar file may arrive +* later. +*/ + snprintf(statusdir, sizeof(statusdir), "%s/pg_xlog/archive_status", +basedir); This part conflicts with f82ec32, where you need to make pg_basebackup aware of the backend version. I promise that's the last conflict, at least I don't have more patches planned in the area. -- Michael
Re: [HACKERS] Stopping logical replication protocol
Craig, Andres, what do you think about the previous message? 2016-10-06 0:35 GMT+03:00 Vladimir Gordiychuk : > > Vladimir? I'm running out of time to spend on this at least until the next > CF. Think you can make these changes? > > Yes, I can. But I think it should be discussed first. > > > Terminating COPY BOTH with ERRCODE_QUERY_CANCELED seems fine. If it > > was expecting the error the client can Sync and do whatever it next > > wants to do on the walsender protocol, or bail out nicely. > > Do you want to return CopyFail with ERRCODE_QUERY_CANCELED instead of > CopyDone on client-initiated stop of replication? > > > Hm, I'm *VERY* doubtful about doing this via yet another callback. Why > > don't we just error out in the prepare write callback? > > But what about the scenario where the output plugin collects changes in memory with > some transformation and sends them only inside commit_cb? > It means that OutputPluginPrepareWrite will not execute until the end of the > transaction and we face too-long stops of replication when decoding a huge > transaction. > > 2016-10-05 13:06 GMT+03:00 Craig Ringer : >> On 5 October 2016 at 05:14, Andres Freund wrote: >> > Hi, >> > >> > On 2016-09-23 13:01:27 +0800, Craig Ringer wrote: >> >> From f98f2388c57d938ebbe07ccd2dbe02138312858f Mon Sep 17 00:00:00 2001 >> >> From: Vladimir Gordiychuk >> >> Date: Wed, 7 Sep 2016 00:39:18 +0300 >> >> Subject: [PATCH 2/4] Client-initiated CopyDone during transaction >> decoding in >> >> walsender >> >> >> >> The prior patch caused the walsender to react to a client-initiated >> >> CopyDone while it's in the WalSndLoop. That's not enough to let clients >> >> end logical decoding mid-transaction because we loop in >> ReorderBufferCommit >> >> during decoding of a transaction without ever returning to WalSndLoop. >> >> >> >> Allow breaking out of ReorderBufferCommit by detecting that client >> >> input has requested an end to COPY BOTH mode, so clients can abort >> >> decoding mid-transaction. 
>> > >> > Hm, I'm *VERY* doubtful about doing this via yet another callback. Why >> > don't we just error out in the prepare write callback? >> >> That's sensible. >> >> > I don't think it's a good idea to return a non-error state if a command >> > is interrupted in the middle. We might already have sent a partial >> > transaction or such. Also compare this to interrupting a query - we >> > don't just returning rows, we error out. >> >> Good point. It's not usually visible to the user because if it's a >> client-initiated cancel the client will generally consume the error, >> but there's still an error generated. >> >> Terminating COPY BOTH with ERRCODE_QUERY_CANCELED seems fine. If it >> was expecting the error the client can Sync and do whatever it next >> wants to do on the walsender protocol, or bail out nicely. >> >> Vladimir? I'm running out of time to spend on this at least until the >> next CF. Think you can make these changes? >> >> >> -- >> Craig Ringer http://www.2ndQuadrant.com/ >> PostgreSQL Development, 24x7 Support, Training & Services >> > >
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Fri, Oct 21, 2016 at 1:07 PM, Tomas Vondra wrote: > On 10/21/2016 08:13 AM, Amit Kapila wrote: >> >> On Fri, Oct 21, 2016 at 6:31 AM, Robert Haas >> wrote: >>> >>> On Thu, Oct 20, 2016 at 4:04 PM, Tomas Vondra >>> wrote: > > I then started a run at 96 clients which I accidentally killed shortly > before it was scheduled to finish, but the results are not much > different; there is no hint of the runaway CLogControlLock contention > that Dilip sees on power2. > What shared_buffer size were you using? I assume the data set fit into shared buffers, right? >>> >>> >>> 8GB. >>> FWIW as I explained in the lengthy post earlier today, I can actually reproduce the significant CLogControlLock contention (and the patches do reduce it), even on x86_64. >>> >>> >>> /me goes back, rereads post. Sorry, I didn't look at this carefully >>> the first time. >>> For example consider these two tests: * http://tvondra.bitbucket.org/#dilip-300-unlogged-sync * http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip However, it seems I can also reproduce fairly bad regressions, like for example this case with data set exceeding shared_buffers: * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip >>> >>> >>> I'm not sure how seriously we should take the regressions. I mean, >>> what I see there is that CLogControlLock contention goes down by about >>> 50% -- which is the point of the patch -- and WALWriteLock contention >>> goes up dramatically -- which sucks, but can't really be blamed on the >>> patch except in the indirect sense that a backend can't spend much >>> time waiting for A if it's already spending all of its time waiting >>> for B. >>> >> >> Right, I think not only WALWriteLock, but contention on other locks >> also goes up as you can see in below table. I think there is nothing >> much we can do for that with this patch. One thing which is unclear >> is why on unlogged tests it is showing WALWriteLock? 
>> > > Well, although we don't write the table data to the WAL, we still need to > write commits and other stuff, right? > We do need to write commit, but do we need to flush it immediately to WAL for unlogged tables? It seems we allow WALWriter to do that; see the logic in RecordTransactionCommit. And on scale 3000 (which exceeds the > 16GB shared buffers in this case), there's a continuous stream of dirty > pages (not to WAL, but evicted from shared buffers), so iostat looks like > this: > > time tps wr_sec/s avgrq-sz avgqu-sz await %util > 08:48:21 81654 1367483 16.75 127264.60 1294.80 97.41 > 08:48:31 41514 697516 16.80 103271.11 3015.01 97.64 > 08:48:41 78892 1359779 17.24 97308.42 928.36 96.76 > 08:48:51 58735 978475 16.66 92303.00 1472.82 95.92 > 08:49:01 62441 1068605 17.11 78482.71 1615.56 95.57 > 08:49:11 55571 945365 17.01 113672.62 1923.37 98.07 > 08:49:21 69016 1161586 16.83 87055.66 1363.05 95.53 > 08:49:31 54552 913461 16.74 98695.87 1761.30 97.84 > > That's ~500-600 MB/s of continuous writes. I'm sure the storage could handle > more than this (will do some testing after the tests complete), but surely > the WAL has to compete for bandwidth (it's on the same volume / devices). > Another thing is that we only have 8 WAL insert locks, and maybe that leads > to contention with such high client counts. > Yeah, quite possible, but I don't think increasing that would benefit in general, because while writing WAL we need to take all the wal_insert locks. In any case, I think that is a separate problem to study. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [RFC] Transaction management overhaul is necessary?
2016-10-21 10:24 GMT+02:00 Tsunakawa, Takayuki <tsunakawa.ta...@jp.fujitsu.com>:
> Hello,
>
> From our experience in handling customers' problems, I feel it's necessary
> to evolve PostgreSQL's transaction management. The concrete problems are:
>
> 1. PostgreSQL cannot end and begin transactions in PL/pgSQL and PL/Java
> stored functions.
> This is often the reason people could not migrate to PostgreSQL.
>
> 2. PostgreSQL does not support statement-level rollback.
> When some customer ran a batch app using psqlODBC, one postgres process
> used dozens of GBs of memory and crashed the OS. The batch app prepares
> some SQL statements with parameters and executes them five million times
> with different parameter values in a single transaction. They didn't
> experience a problem with Oracle.
>
> This was because psqlODBC starts and ends a subtransaction for each SQL
> statement by default to implement statement-level rollback. And PostgreSQL
> creates one CurTransactionContext memory context, which is 8KB, for each
> subtransaction and retains them until the top transaction ends. The total
> memory used becomes 40GB (8KB * 5 million subtransactions.) This was
> avoided by setting the Protocol parameter to 7.4-1, which means
> transaction-level rollback.
>
> The savepoint approach for supporting statement-level rollback is
> inefficient, because it adds two roundtrips (SAVEPOINT and RELEASE) for
> each statement.
>
> I know autonomous transactions are also discussed, which seems to be
> difficult, so I hope some kind of transaction management overhaul can be
> discussed to cover all these transaction-related features. How should I
> start? I found the following item in the TODO list (but I haven't read it
> yet.) What other discussions should I look at?

You should implement a CALL statement that can be independent of the
outer transaction. The behavior inside a procedure called by a CALL
statement should be the same as on the client side: there you can
control transactions explicitly, without nesting.

Regards

Pavel

> --
> Implement stored procedures
> This might involve the control of transaction state and the return of
> multiple result sets
> PL/pgSQL stored procedure returning multiple result sets (SELECTs)?
> Proposal: real procedures again (8.4)
> http://archives.postgresql.org/pgsql-hackers/2010-09/msg00542.php
> Gathering specs and discussion on feature (post 9.1)
> --
>
> Regards
> Takayuki Tsunakawa
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Fri, Oct 21, 2016 at 4:25 PM, Amit Kapila wrote: > On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan wrote: >> On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila wrote: >> >> I read the following paragraph from the Volcano paper just now: >> >> """ >> During implementation and benchmarking of parallel sorting, we added >> two more features to exchange. First, we wanted to implement a merge >> network in which some processors produce sorted streams merge >> concurrently by other processors. Volcano’s sort iterator can be used >> to generate a sorted stream. A merge iterator was easily derived from >> the sort module. It uses a single level merge, instead of the cascaded >> merge of runs used in sort. The input of a merge iterator is an >> exchange. Differently from other operators, the merge iterator >> requires to distinguish the input records by their producer. As an >> example, for a join operation it does not matter where the input >> records were created, and all inputs can be accumulated in a single >> input stream. For a merge operation, it is crucial to distinguish the >> input records by their producer in order to merge multiple sorted >> streams correctly. >> """ >> >> I don't really understand this paragraph, but thought I'd ask: why the >> need to "distinguish the input records by their producer in order to >> merge multiple sorted streams correctly"? Isn't that talking about >> partitioning, where each workers *ownership* of a range matters? >> > > I think so, but it seems from above text that is mainly required for > merge iterator which probably will be used in merge join. > >> My >> patch doesn't care which values belong to which workers. And, it >> focuses quite a lot on dealing well with the memory bandwidth bound, >> I/O bound part of the sort where we write out the index itself, just >> by piggy-backing on tuplesort.c. 
>> I don't think that that's useful for a general-purpose executor node
>> -- tuple-at-a-time processing when fetching from workers would kill
>> performance.
>
> Right, but what is written in the text you quoted seems to be doable
> with tuple-at-a-time processing.

To be clear, by saying the above, I don't mean that we should try that
approach instead of what you are proposing, but it is worth some
discussion to see whether it has any significant merits.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
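The Volcano passage quoted above says a merge iterator must "distinguish the input records by their producer". A minimal sketch of why (plain Python, illustrative only, not the patch's code): each producer's stream is individually sorted, so after emitting a record the merge must refill from the *same* stream it came from; dumping all inputs into one undifferentiated stream, as a join could, would destroy the sorted runs.

```python
import heapq

def merge_sorted_streams(streams):
    """K-way merge that tracks which producer each record came from."""
    iters = [iter(s) for s in streams]
    heap = []
    for producer_id, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, producer_id))
    merged = []
    while heap:
        value, producer_id = heapq.heappop(heap)
        merged.append(value)
        # Refill from the *same* producer: this is the step that needs
        # the producer identity to keep the output globally sorted.
        nxt = next(iters[producer_id], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, producer_id))
    return merged
```

For example, merging the three sorted runs [1, 4, 7], [2, 5, 8], [3, 6, 9] yields 1..9 in order, while losing the producer identity would give no way to know which run to advance next.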
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan wrote: > On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila wrote: > > I read the following paragraph from the Volcano paper just now: > > """ > During implementation and benchmarking of parallel sorting, we added > two more features to exchange. First, we wanted to implement a merge > network in which some processors produce sorted streams merge > concurrently by other processors. Volcano’s sort iterator can be used > to generate a sorted stream. A merge iterator was easily derived from > the sort module. It uses a single level merge, instead of the cascaded > merge of runs used in sort. The input of a merge iterator is an > exchange. Differently from other operators, the merge iterator > requires to distinguish the input records by their producer. As an > example, for a join operation it does not matter where the input > records were created, and all inputs can be accumulated in a single > input stream. For a merge operation, it is crucial to distinguish the > input records by their producer in order to merge multiple sorted > streams correctly. > """ > > I don't really understand this paragraph, but thought I'd ask: why the > need to "distinguish the input records by their producer in order to > merge multiple sorted streams correctly"? Isn't that talking about > partitioning, where each workers *ownership* of a range matters? > I think so, but it seems from above text that is mainly required for merge iterator which probably will be used in merge join. > My > patch doesn't care which values belong to which workers. And, it > focuses quite a lot on dealing well with the memory bandwidth bound, > I/O bound part of the sort where we write out the index itself, just > by piggy-backing on tuplesort.c. I don't think that that's useful for > a general-purpose executor node -- tuple-at-a-time processing when > fetching from workers would kill performance. 
Right, but what is written in the text you quoted seems to be doable
with tuple-at-a-time processing.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Oct 20, 2016 at 12:03 AM, Peter Geoghegan wrote:
> On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas wrote:
>> Gather Merge can't emit a tuple unless it has buffered at least one
>> tuple from every producer; otherwise, the next tuple it receives from
>> one of those producers might precede whichever tuple it chooses to
>> emit.

Right. Now, after again looking at the Gather Merge patch, I think I can
better understand how it performs merging.

>> However, it doesn't need to wait until all of the workers are
>> completely done. The leader only needs to be at least slightly ahead
>> of the slowest worker. I'm not sure how that compares to Peter's
>> approach.
>
> I don't think that eager merging will prove all that effective,
> however it's implemented. I see a very I/O bound system when parallel
> CREATE INDEX merges serially. There is no obvious reason why you'd
> have a straggler worker process with CREATE INDEX, really.
>
>> What I'm worried about is that we're implementing two separate systems
>> to do the same thing, and that the parallel sort approach is actually
>> a lot less general. I think it's possible to imagine a Parallel Sort
>> implementation which does things Gather Merge can't. If all of the
>> workers collaborate to sort all of the data rather than each worker
>> sorting its own data, then you've got something which Gather Merge
>> can't match. But this is not that.
>
> It's not that yet, certainly. I think I've sketched a path forward for
> making partitioning a part of logtape.c that is promising. The sharing
> of ranges within tapes and so on will probably have a significant
> amount in common with what I've come up with.
>
> I don't think that any executor infrastructure is a particularly good
> model when *batch output* is needed -- the tuple queue mechanism will
> be a significant bottleneck, particularly because it does not
> integrate read-ahead, etc.

The tuple queue mechanism might not be super-efficient for *batch
output* (cases where many tuples need to be read and written), but I see
no reason why it would be slower than the disk I/O which I think you are
using in the patch. IIUC, in the patch each worker, including the
leader, does the tape sort for its share of tuples, and then finally the
leader merges them and populates the index. I am not sure the mechanism
used in the patch can be as useful as using a tuple queue if the workers
can finish their part of the sorting in memory.

> The bottom line is that it's inherently difficult for me to refute the
> idea that Gather Merge could do just as well as what I have here,
> because proving that involves adding a significant amount of new
> infrastructure (e.g., to teach the executor about IndexTuples).

I think there could be a simpler way: we can force the Gather Merge node
when all the tuples need to be sorted and compute the time till it
merges all tuples. Similarly, with your patch, we can wait till the
final merge is completed. However, after doing an initial study of both
patches, I feel one can construct cases where Gather Merge can win and
also cases where your patch can win. In particular, Gather Merge can win
where the workers can perform the sort mostly in memory. I am not sure
if it's easy to get the best of both worlds.

Your patch needs a rebase, and I noticed one warning:

sort\logtape.c(1422): warning C4700: uninitialized local variable 'lt' used

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
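Robert's invariant quoted above -- the leader can emit as soon as it has buffered one tuple from every producer, staying only slightly ahead of the slowest worker -- can be sketched as a toy Python model (illustrative only, not the Gather Merge code):

```python
import heapq

def sorted_worker(worker_id, rows, produced):
    """A worker yielding its partition in sorted order, counting yields."""
    for row in sorted(rows):
        produced[worker_id] += 1
        yield row

partitions = [[9, 3, 6], [1, 4, 7], [2, 5, 8]]
produced = [0] * len(partitions)
workers = [sorted_worker(i, p, produced) for i, p in enumerate(partitions)]

# The leader primes its heap with exactly one tuple per worker ...
heap = [(next(w), i) for i, w in enumerate(workers)]
heapq.heapify(heap)

# ... and can emit the smallest of those immediately: no worker has to
# be anywhere near finished for the leader to start producing output.
first_emitted, _ = heapq.heappop(heap)
```

After the first emission, each worker has produced exactly one tuple (produced == [1, 1, 1]), which is the sense in which the leader only needs to stay slightly ahead of the slowest worker rather than wait for complete results.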
Re: [HACKERS] Fun fact about autovacuum and orphan temp tables
On Fri, 21 Oct 2016 14:29:24 +0900 Michael Paquier wrote:
> That's invasive. I am wondering if a cleaner approach here would be a
> flag in deleteOneObject() that performs the lock cleanup, as that's
> what you are trying to solve here.

The problem occurs earlier, at the findDependentObjects step. All the
objects inside the namespace are being locked before any of them gets
deleted, which leads to the "too many locks" condition.

Cheers,
Constantin Pan
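The point above -- every dependent object is locked during findDependentObjects before any deletion happens -- means peak lock usage scales with the number of objects in the namespace. A toy model (plain Python, purely illustrative, not the server code):

```python
def drop_namespace_peak_locks(n_objects, lock_table_size):
    """Toy model: lock every dependent object first, delete afterwards.

    Peak lock usage equals n_objects, so a large temp namespace
    overflows a fixed-size lock table before anything is dropped.
    """
    held = 0
    for _ in range(n_objects):        # findDependentObjects: lock pass
        held += 1
        if held > lock_table_size:
            raise RuntimeError("out of shared memory (too many locks)")
    peak = held
    for _ in range(n_objects):        # per-object deletion pass
        held -= 1
    return peak
```

Deferring lock release inside deleteOneObject() cannot help here, because the overflow occurs in the first loop, before deleteOneObject() is ever reached.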
Re: [HACKERS] Question about behavior of snapshot too old feature
On Mon, Oct 17, 2016 at 10:04 PM, Kevin Grittner wrote: > On Sun, Oct 16, 2016 at 9:26 PM, Masahiko Sawada > wrote: > >> When I set old_snapshot_threshold = 0 I got error at step #3, which >> means that the error is occurred without table pruning. > > The "snapshot too old" error can happen without pruning, but only > because there is no way to tell the difference between a page that > has been pruned since the snapshot was taken and a page which has > had some other kind of modification since the snapshot was taken. > > Ignoring false positives for a moment (where the page is updated by > something other than pruning), what is required for early pruning > is that the snapshot has expired (which due to "rounding" and > avoidance of locking could easily take up to a minute or two more > than the old_snapshot_threshold setting) and then there is page > pruning due to a vacuum or just HOT pruning from a page read. At > some point after that, a read which is part of returning data to > the user (e.g., not just positioning for index modification) can > see that the snapshot is too old and that the LSN for the page is > past the snapshot LSN. That is when you get the error. >> We have regression test for this feature but it sets >> old_snapshot_threshold = 0, I doubt about we can test it properly. >> Am I missing something? > > This is a hard feature to test properly, and certainly hard to test > without the test running for a long time. The zero setting is > really not intended to be used in production, but only to allow > some half-way decent testing that doesn't take extreme lengths of > time. If you add some delays of a few minutes each at key points > in a test, you should be able to get a test that works with a > setting of 1min. It is not impossible that we might need to add a > memory barrier to one or two places to get such tests to behave > consistently, but I have not been able to spot where, if anywhere, > that would be. Thank you for explanation! 
I understood. When old_snapshot_threshold = 0, it skips allocating the
shared memory area for the xid array and skips some of the logic in
order to avoid using that shared memory, so I was a little concerned
about that.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
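Kevin's two conditions above can be condensed into a toy predicate (illustrative Python, not the server code): the error fires only when the snapshot has outlived the threshold *and* the page's LSN shows it was modified after the snapshot was taken -- and because the LSN cannot distinguish pruning from any other modification, false positives are possible.

```python
def snapshot_too_old(snapshot_age_min, threshold_min, page_lsn, snapshot_lsn):
    """Toy model: both conditions must hold for the error to fire.

    The page LSN cannot tell pruning apart from other page
    modifications, which is the source of false positives.
    """
    return snapshot_age_min > threshold_min and page_lsn > snapshot_lsn
```

A fresh snapshot never errors, and neither does an expired snapshot reading a page that has not changed since the snapshot was taken; only the combination does.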
Re: [HACKERS] Indirect indexes
On Thu, Oct 20, 2016 at 4:24 PM, Bruce Momjian wrote:
> On Thu, Oct 20, 2016 at 05:14:51PM +0200, Petr Jelinek wrote:
>>> Also, it seems indirect indexes would be useful for indexing columns
>>> that are not updated frequently on tables that are updated frequently,
>>> and whose primary key is not updated frequently. That's quite a logic
>>> problem for users to understand.
>>
>> Which covers like 99.9% of the problematic cases I see on a daily basis.
>>
>> And by that logic we should not have indexes at all; they are not
>> automatically created and the user needs to think about whether they
>> need them or not.
>
> Do you have to resort to extreme statements to make your point? The use
> of indexes is clear to most users, while the use of indirect indexes
> would not be, as I stated earlier.

It's not that difficult to explain, I think. We just tell them (the
non-sophisticated users) that they are similar to the non-clustered
indexes that other DBMSs have (SQL Server, MySQL), which add the PK
columns to the non-clustered index when the table is clustered. The same
way as there, the index doesn't need an update when neither the columns
nor the PK is updated. So we have the same benefit, except that we have
the feature for our heap tables. I think it's the same for any other
feature that is added (partial indexes, cubes, new syntax like LATERAL
and FILTER). People will learn and start to use it. We can't expect it
to be used by everyone the day it's released.

>> Also helping a user who does not have a performance problem by 1% is
>> very different from helping a user who has a performance problem by
>> 50%, even if she needs to think about the solution a bit.
>>
>> WARM can do a WARM update 50% of the time; an indirect index can do a
>> HOT update 100% of the time (provided the column is not changed). I
>> don't see why we could not have both solutions.
>
> We don't know enough about the limits of WARM to say it is limited to
> 50%.
Re: [HACKERS] Renaming of pg_xlog and pg_clog
On 10/21/16 3:12 AM, David G. Johnston wrote:
> I have no problem continuing to keep with historical precedent and
> allowing mnemonic abbreviations in our directory and file names at
> this point.

I'm still in favor of pg_xact. A search of the 9.6 docs brings up a
number of hits for "xact": pg_last_xact_replay_timestamp(),
pg_advisory_xact_lock(), pg_advisory_xact_lock_shared(),
pg_last_committed_xact(), pg_prepared_xacts(), etc. There are also
numerous column names that have "xact" in them. It's not just an arcane
developer term when it shows up in a number of our user-facing
functions/columns.

--
-David
da...@pgmasters.net
[HACKERS] [RFC] Transaction management overhaul is necessary?
Hello,

From our experience in handling customers' problems, I feel it's
necessary to evolve PostgreSQL's transaction management. The concrete
problems are:

1. PostgreSQL cannot end and begin transactions in PL/pgSQL and PL/Java
stored functions.
This is often the reason people could not migrate to PostgreSQL.

2. PostgreSQL does not support statement-level rollback.
When some customer ran a batch app using psqlODBC, one postgres process
used dozens of GBs of memory and crashed the OS. The batch app prepares
some SQL statements with parameters and executes them five million times
with different parameter values in a single transaction. They didn't
experience a problem with Oracle.

This was because psqlODBC starts and ends a subtransaction for each SQL
statement by default to implement statement-level rollback. And
PostgreSQL creates one CurTransactionContext memory context, which is
8KB, for each subtransaction and retains them until the top transaction
ends. The total memory used becomes 40GB (8KB * 5 million
subtransactions.) This was avoided by setting the Protocol parameter to
7.4-1, which means transaction-level rollback.

The savepoint approach for supporting statement-level rollback is
inefficient, because it adds two roundtrips (SAVEPOINT and RELEASE) for
each statement.

I know autonomous transactions are also discussed, which seems to be
difficult, so I hope some kind of transaction management overhaul can be
discussed to cover all these transaction-related features. How should I
start? I found the following item in the TODO list (but I haven't read
it yet.) What other discussions should I look at?

--
Implement stored procedures
This might involve the control of transaction state and the return of
multiple result sets
PL/pgSQL stored procedure returning multiple result sets (SELECTs)?
Proposal: real procedures again (8.4)
http://archives.postgresql.org/pgsql-hackers/2010-09/msg00542.php
Gathering specs and discussion on feature (post 9.1)
--

Regards
Takayuki Tsunakawa
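The 40GB figure in the report above follows directly from the stated numbers. A quick sketch of the arithmetic (plain Python; the constants are taken from the report, not measured independently):

```python
# Each subtransaction allocates one CurTransactionContext of 8KB that
# is retained until the top-level transaction ends.
SUBXACT_CONTEXT_BYTES = 8 * 1024
STATEMENTS = 5_000_000      # one subtransaction per executed statement

total_bytes = SUBXACT_CONTEXT_BYTES * STATEMENTS
print(total_bytes)          # 40960000000 bytes, i.e. roughly 40GB

# The savepoint-based workaround adds two extra round trips
# (SAVEPOINT and RELEASE) per statement:
extra_roundtrips = 2 * STATEMENTS
print(extra_roundtrips)     # 10000000 extra round trips
```

This makes the cost of both cited problems concrete: tens of gigabytes of retained memory contexts in one case, and ten million extra client-server round trips in the other.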
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On 10/21/2016 08:13 AM, Amit Kapila wrote:

On Fri, Oct 21, 2016 at 6:31 AM, Robert Haas wrote:

On Thu, Oct 20, 2016 at 4:04 PM, Tomas Vondra wrote:

I then started a run at 96 clients which I accidentally killed shortly
before it was scheduled to finish, but the results are not much
different; there is no hint of the runaway CLogControlLock contention
that Dilip sees on power2.

What shared_buffer size were you using? I assume the data set fit into
shared buffers, right?

8GB.

FWIW as I explained in the lengthy post earlier today, I can actually
reproduce the significant CLogControlLock contention (and the patches do
reduce it), even on x86_64.

/me goes back, rereads post. Sorry, I didn't look at this carefully the
first time.

For example consider these two tests:

* http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
* http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

However, it seems I can also reproduce fairly bad regressions, like for
example this case with data set exceeding shared_buffers:

* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip

I'm not sure how seriously we should take the regressions. I mean, what
I see there is that CLogControlLock contention goes down by about 50% --
which is the point of the patch -- and WALWriteLock contention goes up
dramatically -- which sucks, but can't really be blamed on the patch
except in the indirect sense that a backend can't spend much time
waiting for A if it's already spending all of its time waiting for B.

Right, I think not only WALWriteLock, but contention on other locks also
goes up as you can see in below table. I think there is nothing much we
can do for that with this patch. One thing which is unclear is why on
unlogged tests it is showing WALWriteLock?

Well, although we don't write the table data to the WAL, we still need
to write commits and other stuff, right?
And on scale 3000 (which exceeds the 16GB shared buffers in this case),
there's a continuous stream of dirty pages (not to WAL, but evicted from
shared buffers), so iostat looks like this:

time      tps     wr_sec/s  avgrq-sz   avgqu-sz    await  %util
08:48:21  81654   1367483      16.75  127264.60  1294.80  97.41
08:48:31  41514    697516      16.80  103271.11  3015.01  97.64
08:48:41  78892   1359779      17.24   97308.42   928.36  96.76
08:48:51  58735    978475      16.66   92303.00  1472.82  95.92
08:49:01  62441   1068605      17.11   78482.71  1615.56  95.57
08:49:11  55571    945365      17.01  113672.62  1923.37  98.07
08:49:21  69016   1161586      16.83   87055.66  1363.05  95.53
08:49:31  54552    913461      16.74   98695.87  1761.30  97.84

That's ~500-600 MB/s of continuous writes. I'm sure the storage could
handle more than this (will do some testing after the tests complete),
but surely the WAL has to compete for bandwidth (it's on the same
volume / devices).

Another thing is that we only have 8 WAL insert locks, and maybe that
leads to contention with such high client counts.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
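The "~500-600 MB/s" estimate above can be checked against the wr_sec/s column of the iostat sample (iostat sectors are 512 bytes):

```python
# wr_sec/s samples taken from the iostat output above
wr_sec_per_s = [1367483, 697516, 1359779, 978475,
                1068605, 945365, 1161586, 913461]

# one sector = 512 bytes
mb_per_s = [sectors * 512 / 1e6 for sectors in wr_sec_per_s]
avg_mb_per_s = sum(mb_per_s) / len(mb_per_s)
# avg_mb_per_s comes out around 543 MB/s, consistent with ~500-600 MB/s
```

The per-sample rates range from roughly 360 to 700 MB/s, so the quoted figure describes the sustained average rather than the peaks.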