Logging the feature of SQL-level read/write commits
Hi,

I am trying to efficiently roll back a manually selected subset of
committed SQL transactions by scanning an SQL transaction log. This
feature is useful when a database administrator wants to roll back not
the entire database, but only particular SQL statements that affect a
certain set of tables. Unfortunately, this is impossible in the current
PostgreSQL setup, because PostgreSQL's WAL (Write-Ahead Log) file
doesn't provide any SQL statement-level redo records, only physical
block-level redo records.

To this end, my goal is to improve PostgreSQL to produce augmented
transaction logs. In particular, every committed transaction ID in the
augmented transaction log will carry an additional section called
"rollback SQL statements": a minimal series of DELETE & INSERT SQL
statements that effectively rolls the transaction back to the
immediately preceding state. For example, suppose that we have the
following SQL table:

= Table1
column1 | column2
      1 |      20
      2 |      30
      3 |      40
      4 |      50
=

And suppose that the following 100th transaction was committed:

UPDATE Table1 SET column1 = 10, column2 = 20 WHERE column2 > 20;

Then the augmented transaction log file will contain the following
entry for the above committed transaction:

Committed Transaction ID = 100
Rollback SQL Statements =
- DELETE FROM Table1 WHERE column1 = 10 AND column2 = 20
- INSERT INTO Table1 (column1, column2) VALUES (2, 30)
- INSERT INTO Table1 (column1, column2) VALUES (3, 40)
- INSERT INTO Table1 (column1, column2) VALUES (4, 50)

Note that the above rollback SQL statements are in the simplest form,
without involving any complex SQL operations such as JOINs or
sub-queries.
Also note that we cannot create the above rollback SQL statements
purely from the original sequence of SQL transactions, because we don't
know which rows of Table1 are affected without actually scanning the
entire table and evaluating transaction #100's WHERE clause (i.e.,
column2 > 20) on every single row of Table1. Therefore, to generate a
list of simple rollback SQL statements like the above, we have no
choice but to embed this logging feature in the PostgreSQL source code
at the point where the WAL (Write-Ahead Log) file is being written.

Since current PostgreSQL doesn't support this feature, I plan to
implement it in the source code. But I have never worked on the
PostgreSQL source code before, and I wonder if anybody could give me a
hint on which source files (and functions) record redo records in the
WAL file. In particular, when the server records the location & values
of updated blocks into the WAL file for each SQL statement that
modifies any relation, we could additionally make the server write the
list of the simplest INSERT & DELETE SQL statements that effectively
enforce those table write operations. If such SQL-level enforcement
information is available in the WAL file, one can easily derive the
corresponding rollback (i.e., inverse) SQL statements from it.

Thanks for anybody's comments.

Ronny
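For illustration, the idea can be sketched outside the server. Below is
a minimal, hypothetical Python sketch (not PostgreSQL code; the
function name, the dict-based row format, and the lack of value quoting
are all assumptions) showing how before/after row images captured at
WAL-write time would be turned into the simplest inverse DELETE &
INSERT statements:

```python
def inverse_statements(table, old_rows, new_row):
    """Hypothetical sketch: given the before-images of the rows one
    UPDATE touched (old_rows) and the single after-image it wrote
    (new_row), emit simple statements that undo the update.
    Rows are dicts of column -> value; value quoting is ignored."""
    stmts = []
    # Remove the rows the transaction created ...
    where = " AND ".join(f"{c} = {v}" for c, v in new_row.items())
    stmts.append(f"DELETE FROM {table} WHERE {where};")
    # ... and restore the rows it overwrote.
    for old in old_rows:
        cols = ", ".join(old)
        vals = ", ".join(str(v) for v in old.values())
        stmts.append(f"INSERT INTO {table} ({cols}) VALUES ({vals});")
    return stmts
```

For the transaction #100 example, the before-images would be the rows
(2, 30), (3, 40), (4, 50) and the after-image (10, 20), yielding one
DELETE followed by three INSERTs.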
Re: First-draft release notes for back branches are up
On Sat, May 4, 2019 at 1:29 PM Tom Lane wrote:
> Thomas Munro writes:
> > On Sat, May 4, 2019 at 10:29 AM Tom Lane wrote:
> > + Tolerate EINVAL and ENOSYS
> > + error results, where appropriate, for fsync calls (Thomas Munro,
> > + James Sewell)
>
> > Nit-picking: ENOSYS is for sync_file_range, EINVAL is for fsync.
>
> Yeah, I didn't really think it was worth distinguishing.  If there
> is some more general term that covers both calls, maybe we should
> use that?

I would just do s/fsync/fsync and sync_file_range/.  And I guess also
wrap them in ?

--
Thomas Munro
https://enterprisedb.com
Re: improving wraparound behavior
On Fri, May 3, 2019 at 17:27, Robert Haas wrote:
> HINT: Commit or roll back old prepared transactions, drop stale
> replication slots, or kill long-running sessions.
> Ensure that autovacuum is progressing, or run a manual database-wide VACUUM.

First of all, +1 for this patch. But after reading this email, I
couldn't resist sharing an idea about the stale XID horizon. [A cup of
tea...] I realized that there is no automatic way to recover from old
prepared transactions, stale replication slots, or even long-running
sessions when we have a wraparound situation. Wouldn't it make sense to
add a parameter that recovers from a stale XID horizon? I mean, if we
reach a critical situation (xidStopLimit), free the resource that is
preventing the XID from advancing (and hope autovacuum has some time to
prevent a stop-assigning-xids situation). The situation is analogous to
the OOM killer, which kills processes that are starving the system of
resources.

--
Euler Taveira
Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Re: improving wraparound behavior
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-05-03 22:41:11 -0400, Stephen Frost wrote:
> > I suppose it is a pretty big change in the base autovacuum launcher to
> > be something that's run per database instead and then deal with the
> > coordination between the two... but I can't help but feel like it
> > wouldn't be that much *work*.  I'm not against doing something smaller
> > but was something smaller actually proposed for this specific issue..?
>
> I think it'd be fairly significant. And that we should redo it from
> scratch if we go there - because what we have isn't worth using as a
> basis.

Alright, what I'm hearing here is that we should probably have a
dedicated thread for this discussion, if someone has the cycles to
spend on it.  I'm not against that.

> > > I'm thinking that we'd do something roughly like (in actual code) for
> > > GetNewTransactionId():
> > >
> > > TransactionId dat_limit = ShmemVariableCache->oldestXid;
> > > TransactionId slot_limit = Min(replication_slot_xmin, replication_slot_catalog_xmin);
> > > TransactionId walsender_limit;
> > > TransactionId prepared_xact_limit;
> > > TransactionId backend_limit;
> > >
> > > ComputeOldestXminFromProcarray(&walsender_limit, &prepared_xact_limit, &backend_limit);
> > >
> > > if (IsOldest(dat_limit))
> > >     ereport(elevel,
> > >             errmsg("close to xid wraparound, held back by database %s"),
> > >             errdetail("current xid %u, horizon for database %u, shutting down at %u"),
> > >             errhint("..."));
> > > else if (IsOldest(slot_limit))
> > >     ereport(elevel,
> > >             errmsg("close to xid wraparound, held back by replication slot %s"),
> > >             ...);
> > >
> > > where IsOldest wouldn't actually compare plainly numerically, but would
> > > actually prefer showing the slot, backend, walsender, prepared_xact, as
> > > long as they are pretty close to the dat_limit - as in those cases
> > > vacuuming wouldn't actually solve the issue, unless the other problems
> > > are addressed first (as autovacuum won't compute a cutoff horizon that's
> > > newer than any of those).
> >
> > Where the errhint() above includes a recommendation to run the SRF
> > described below, I take it?
>
> Not necessarily. I feel conciseness is important too, and this would be
> the most important thing to tackle.

I'm imagining a relatively rare scenario, just to be clear, where
"pretty close to the dat_limit" would apply to more than just one
thing.

> > Also, should this really be an 'else if', or should it be just a set of
> > 'if()'s, thereby giving users more info right up-front?
>
> Possibly? But it'd also make it even harder to read the log / the system
> to keep up with logging, because we already log *so* much when close to
> wraparound.

Yes, we definitely log a *lot*, and probably too much, since other
critical messages might get lost in the noise.

> If we didn't order it, it'd be hard for users to figure out which to
> address first. If we ordered it, people have to look further up in the
> log to figure out which is the most urgent one (unless we reverse the
> order, which is odd too).

This makes me think we should both order it and combine it into one
message... but that'd then be pretty difficult to deal with,
potentially, from a translation standpoint, and just from a "wow,
that's a huge log message" standpoint, which is kind of the idea behind
the SRF - to give you all that info in a more easily digestible manner.
Not sure I've got any great ideas on how to improve on this.

I do think that if we know that there are multiple different things
within a small number of xids of the oldest xmin then we should notify
the user about all of them, either directly in the error messages or by
referring them to the SRF, so they have the opportunity to address them
all, or at least know about them all.  As mentioned though, it's likely
to be quite rare to run into this, so you'd have to be extra unlucky to
even hit this case, and perhaps the extra code complication just isn't
worth it.
Thanks,

Stephen
Re: improving wraparound behavior
Hi,

On 2019-05-03 22:41:11 -0400, Stephen Frost wrote:
> I suppose it is a pretty big change in the base autovacuum launcher to
> be something that's run per database instead and then deal with the
> coordination between the two... but I can't help but feel like it
> wouldn't be that much *work*.  I'm not against doing something smaller
> but was something smaller actually proposed for this specific issue..?

I think it'd be fairly significant. And that we should redo it from
scratch if we go there - because what we have isn't worth using as a
basis.

> > I'm thinking that we'd do something roughly like (in actual code) for
> > GetNewTransactionId():
> >
> > TransactionId dat_limit = ShmemVariableCache->oldestXid;
> > TransactionId slot_limit = Min(replication_slot_xmin, replication_slot_catalog_xmin);
> > TransactionId walsender_limit;
> > TransactionId prepared_xact_limit;
> > TransactionId backend_limit;
> >
> > ComputeOldestXminFromProcarray(&walsender_limit, &prepared_xact_limit, &backend_limit);
> >
> > if (IsOldest(dat_limit))
> >     ereport(elevel,
> >             errmsg("close to xid wraparound, held back by database %s"),
> >             errdetail("current xid %u, horizon for database %u, shutting down at %u"),
> >             errhint("..."));
> > else if (IsOldest(slot_limit))
> >     ereport(elevel,
> >             errmsg("close to xid wraparound, held back by replication slot %s"),
> >             ...);
> >
> > where IsOldest wouldn't actually compare plainly numerically, but would
> > actually prefer showing the slot, backend, walsender, prepared_xact, as
> > long as they are pretty close to the dat_limit - as in those cases
> > vacuuming wouldn't actually solve the issue, unless the other problems
> > are addressed first (as autovacuum won't compute a cutoff horizon that's
> > newer than any of those).
>
> Where the errhint() above includes a recommendation to run the SRF
> described below, I take it?

Not necessarily. I feel conciseness is important too, and this would be
the most important thing to tackle.

> Also, should this really be an 'else if', or should it be just a set of
> 'if()'s, thereby giving users more info right up-front?

Possibly? But it'd also make it even harder to read the log / the system
to keep up with logging, because we already log *so* much when close to
wraparound.

If we didn't order it, it'd be hard for users to figure out which to
address first. If we ordered it, people have to look further up in the
log to figure out which is the most urgent one (unless we reverse the
order, which is odd too).

Greetings,

Andres Freund
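For what it's worth, the selection rule described in this thread
("prefer showing the slot, backend, walsender, prepared_xact, as long
as they are pretty close to the dat_limit") can be roughed out like
this. A hypothetical Python sketch only: plain integers stand in for
32-bit XIDs (so circular XID arithmetic is deliberately ignored), and
the closeness threshold is an invented constant, not a real GUC:

```python
CLOSE = 1_000_000  # "pretty close" threshold; an assumption

def pick_limiter(dat_limit, others):
    """Pick which xmin holder to blame.  others maps a holder name
    (replication slot, walsender, prepared xact, backend) to its
    horizon.  Prefer blaming a non-database holder whose horizon is
    at or below dat_limit + CLOSE, because vacuum cannot advance the
    database horizon past it anyway; otherwise blame the database."""
    candidates = {n: x for n, x in others.items()
                  if x - dat_limit <= CLOSE}
    if candidates:
        name = min(candidates, key=candidates.get)
        return name, candidates[name]
    return "database", dat_limit
```

Note how a slot sitting just above the database horizon still gets
reported in preference to the database, matching the rationale above:
vacuuming won't solve the issue until the slot is dealt with.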
Re: improving wraparound behavior
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-05-03 22:03:18 -0400, Stephen Frost wrote:
> > * Robert Haas (robertmh...@gmail.com) wrote:
> > > I am not sure exactly how to fix this,
> > > because the calculation we use to determine the XID that can be used
> > > to vacuum a specific table is pretty complex; how can the postmaster
> > > know whether it's going to be able to make any progress in *any* table
> > > in some database to which it's not even connected?  But it's surely
> > > crazy to just keep doing something over and over that can't possibly
> > > work.
> >
> > I definitely agree that it's foolish to keep doing something that isn't
> > going to work, and it seems like a pretty large part of the issue here
> > is that we don't have enough information to be more intelligent because
> > we aren't connected to the database that needs the work to be done.
> >
> > Now, presuming we're talking about 'new feature work' here to try and
> > address this, and not something that we think we can back-patch, I had
> > another thought.
> >
> > I certainly get that having lots of extra processes around can be a
> > waste of resources... but I don't recall a lot of people complaining
> > about the autovacuum launcher process using up lots of resources
> > unnecessarily.
> >
> > Perhaps we should consider having the 'global' autovacuum launcher, when
> > it's decided that a database needs work to be done, launch a 'per-DB'
> > launcher which manages launching autovacuum processes for that database?
> > If the launcher is still running then it's because there's still work to
> > be done on that database and the 'global' autovacuum launcher can skip
> > it.  If the 'per-DB' launcher runs out of things to do, and the database
> > it's working on is no longer in a danger zone, then it exits.
> >
> > There are certainly some other variations on this idea and I don't know
> > that it's really better than keeping more information in shared
> > memory or something else, but it seems like part of the issue is that
> > the thing firing off the processes hasn't got enough info to do so
> > intelligently and maybe we could fix that by having per-DB launchers
> > that are actually connected to a DB.
>
> This sounds a lot more like a wholesale redesign than some small
> incremental work. Which I think we should do, but we probably ought to
> do something more minimal before the resources for something like this
> are there. Perfect being the enemy of the good, and all that.

I suppose it is a pretty big change in the base autovacuum launcher to
be something that's run per database instead and then deal with the
coordination between the two... but I can't help but feel like it
wouldn't be that much *work*.  I'm not against doing something smaller
but was something smaller actually proposed for this specific issue..?

> > I might have misunderstood it, but it sounded like Andres was
> > proposing a new function which would basically tell you what's holding
> > back the xid horizon and that sounds fantastic and would be great to
> > include in this message, if possible.
> >
> > As in:
> >
> > HINT: Run the function pg_what_is_holding_xmin_back() to identify what
> > is preventing autovacuum from progressing and address it.
>
> I was basically thinking of doing *both*, amending the message, *and*
> having a new UDF.
>
> Basically, instead of the current:
>
> char *oldest_datname = get_database_name(oldest_datoid);
>
> /* complain even if that DB has disappeared */
> if (oldest_datname)
>     ereport(WARNING,
>             (errmsg("database \"%s\" must be vacuumed within %u transactions",
>                     oldest_datname,
>                     xidWrapLimit - xid),
>              errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
>                      "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
> else
>     ereport(WARNING,
>             (errmsg("database with OID %u must be vacuumed within %u transactions",
>                     oldest_datoid,
>                     xidWrapLimit - xid),
>              errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
>                      "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
> }
>
> which is dramatically unhelpful, because often vacuuming won't do squat
> (because of old [prepared] transaction, replication connection, or slot).
>
> I'm thinking that we'd do something roughly like (in actual code) for
> GetNewTransactionId():
>
> TransactionId dat_limit = ShmemVariableCache->oldestXid;
> TransactionId slot_limit =
Re: improving wraparound behavior
Hi,

On 2019-05-03 22:03:18 -0400, Stephen Frost wrote:
> * Robert Haas (robertmh...@gmail.com) wrote:
> > I am not sure exactly how to fix this,
> > because the calculation we use to determine the XID that can be used
> > to vacuum a specific table is pretty complex; how can the postmaster
> > know whether it's going to be able to make any progress in *any* table
> > in some database to which it's not even connected?  But it's surely
> > crazy to just keep doing something over and over that can't possibly
> > work.
>
> I definitely agree that it's foolish to keep doing something that isn't
> going to work, and it seems like a pretty large part of the issue here
> is that we don't have enough information to be more intelligent because
> we aren't connected to the database that needs the work to be done.
>
> Now, presuming we're talking about 'new feature work' here to try and
> address this, and not something that we think we can back-patch, I had
> another thought.
>
> I certainly get that having lots of extra processes around can be a
> waste of resources... but I don't recall a lot of people complaining
> about the autovacuum launcher process using up lots of resources
> unnecessarily.
>
> Perhaps we should consider having the 'global' autovacuum launcher, when
> it's decided that a database needs work to be done, launch a 'per-DB'
> launcher which manages launching autovacuum processes for that database?
> If the launcher is still running then it's because there's still work to
> be done on that database and the 'global' autovacuum launcher can skip
> it.  If the 'per-DB' launcher runs out of things to do, and the database
> it's working on is no longer in a danger zone, then it exits.
>
> There are certainly some other variations on this idea and I don't know
> that it's really better than keeping more information in shared
> memory or something else, but it seems like part of the issue is that
> the thing firing off the processes hasn't got enough info to do so
> intelligently and maybe we could fix that by having per-DB launchers
> that are actually connected to a DB.

This sounds a lot more like a wholesale redesign than some small
incremental work. Which I think we should do, but we probably ought to
do something more minimal before the resources for something like this
are there. Perfect being the enemy of the good, and all that.

> I might have misunderstood it, but it sounded like Andres was
> proposing a new function which would basically tell you what's holding
> back the xid horizon and that sounds fantastic and would be great to
> include in this message, if possible.
>
> As in:
>
> HINT: Run the function pg_what_is_holding_xmin_back() to identify what
> is preventing autovacuum from progressing and address it.

I was basically thinking of doing *both*, amending the message, *and*
having a new UDF.

Basically, instead of the current:

char *oldest_datname = get_database_name(oldest_datoid);

/* complain even if that DB has disappeared */
if (oldest_datname)
    ereport(WARNING,
            (errmsg("database \"%s\" must be vacuumed within %u transactions",
                    oldest_datname,
                    xidWrapLimit - xid),
             errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
                     "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
else
    ereport(WARNING,
            (errmsg("database with OID %u must be vacuumed within %u transactions",
                    oldest_datoid,
                    xidWrapLimit - xid),
             errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
                     "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
}

which is dramatically unhelpful, because often vacuuming won't do squat
(because of old [prepared] transaction, replication connection, or slot).

I'm thinking that we'd do something roughly like (in actual code) for
GetNewTransactionId():

TransactionId dat_limit = ShmemVariableCache->oldestXid;
TransactionId slot_limit = Min(replication_slot_xmin, replication_slot_catalog_xmin);
TransactionId walsender_limit;
TransactionId prepared_xact_limit;
TransactionId backend_limit;

ComputeOldestXminFromProcarray(&walsender_limit, &prepared_xact_limit, &backend_limit);

if (IsOldest(dat_limit))
    ereport(elevel,
            errmsg("close to xid wraparound, held back by database %s"),
            errdetail("current xid %u, horizon for database %u, shutting down at %u"),
            errhint("..."));
else if (IsOldest(slot_limit))
    ereport(elevel,
            errmsg("close to xid wraparound, held back by replication slot %s"),
            ...);

where IsOldest wouldn't actually compare plainly numerically, but would
actually prefer showing the slot, backend, walsender, prepared_xact, as
long as they are pretty close to the dat_limit - as in those cases
vacuuming wouldn't actually solve the issue, unless the other problems
are addressed first (as autovacuum won't compute a cutoff horizon
that's newer than any of those).
Re: improving wraparound behavior
Hi,

On 2019-05-03 22:11:08 -0400, Stephen Frost wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > I don't think we necessarily need a new WAL record for what I'm
> > describing above (as XLOG_SMGR_TRUNCATE already carries information
> > about which forks are truncated, we could just have it acquire the
> > exclusive lock), and I don't think we'd need a ton of code for eliding
> > the WAL logged lock either. Think the issue with backpatching would be
> > that we can't remove the logged lock, without creating hazards for
> > standbys running older versions of postgres.
>
> While it's pretty rare, I don't believe this would be the only case of
> "you need to upgrade your replicas before your primary" due to changes
> in WAL.  Of course, we need to make sure that we actually figure out
> that the WAL being sent is something that the replica doesn't know how
> to properly handle because it's from a newer primary; we can't simply
> do the wrong thing in that case.

Don't think this is severe enough to warrant a step like this. I think
for the back-patch case we could live with a) move truncation to after
the rest of vacuum b) don't truncate if it'd error out anyway, but log
an error.

Greetings,

Andres Freund
Re: improving wraparound behavior
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> I don't think we necessarily need a new WAL record for what I'm
> describing above (as XLOG_SMGR_TRUNCATE already carries information
> about which forks are truncated, we could just have it acquire the
> exclusive lock), and I don't think we'd need a ton of code for eliding
> the WAL logged lock either. Think the issue with backpatching would be
> that we can't remove the logged lock, without creating hazards for
> standbys running older versions of postgres.

While it's pretty rare, I don't believe this would be the only case of
"you need to upgrade your replicas before your primary" due to changes
in WAL.  Of course, we need to make sure that we actually figure out
that the WAL being sent is something that the replica doesn't know how
to properly handle because it's from a newer primary; we can't simply
do the wrong thing in that case.

Thanks,

Stephen
Re: improving wraparound behavior
Hi,

On 2019-05-03 21:36:24 -0400, Robert Haas wrote:
> On Fri, May 3, 2019 at 8:45 PM Andres Freund wrote:
> > Part of my opposition to just disabling it when close to a wraparound,
> > is that it still allows to get close to wraparound because of truncation
> > issues.
>
> Sure ... it would definitely be better if vacuum didn't consume XIDs
> when it truncates.  On the other hand, only a minority of VACUUM
> operations will truncate, so I don't think there's really a big
> problem in practice here.

I've seen a number of out-of-xid shutdowns precisely because of
truncations.  Initially because autovacuum commits suicide because
somebody else wants a conflicting lock, later because there's so much
dead space that people kill (auto)vacuum to get rid of the exclusive
locks.

> > IMO preventing getting closer to wraparound is more important
> > than making it more "comfortable" to be in a wraparound situation.
>
> I think that's a false dichotomy.  It's impossible to create a
> situation where no user ever gets into a wraparound situation, unless
> we're prepared to do things like automatically drop replication slots
> and automatically roll back (or commit?) prepared transactions.  So,
> while it is good to prevent a user from getting into a wraparound
> situation where we can, it is ALSO good to make it easy to recover
> from those situations as painlessly as possible when they do happen.

Sure, but I've seen a number of real-world cases of xid wraparound
shutdowns related to truncations, and no real world problem due to
truncations assigning an xid.

> > The second problem I see is that even somebody close to a wraparound
> > might have an urgent need to free up some space. So I'm a bit wary of
> > just disabling it.
>
> I would find that ... really surprising.  If you have < 11 million
> XIDs left before your data gets eaten by a grue, and you file a bug
> report complaining that vacuum won't truncate your tables until you
> catch up on vacuuming a bit, I am prepared to offer you no sympathy at
> all.

I've seen wraparound issues triggered by auto-vacuum generating so much
WAL that the system ran out of space, crash-restart, repeat.  And being
unable to reclaim space could make that even harder to tackle.

> > Wonder if there's a reasonable way that'd allow to do the WAL logging
> > for the truncation without using an xid. One way would be to just get
> > rid of the lock on the primary as previously discussed. But we could
> > also drive the locking through the WAL records that do the actual
> > truncation - then there'd not be a need for an xid. It's probably not a
> > entirely trivial change, but I don't think it'd be too bad?
>
> Beats me.  For me, this is just a bug, not an excuse to redesign
> vacuum truncation.  Before Hot Standby, when you got into severe
> wraparound trouble, you could vacuum all your tables without consuming
> any XIDs.  Now you can't.  That's bad, and I think we should come up
> with some kind of back-patchable solution to that problem.

I agree we need to do at least a minimal version that can be
backpatched.  I don't think we necessarily need a new WAL record for
what I'm describing above (as XLOG_SMGR_TRUNCATE already carries
information about which forks are truncated, we could just have it
acquire the exclusive lock), and I don't think we'd need a ton of code
for eliding the WAL logged lock either.  I think the issue with
backpatching would be that we can't remove the logged lock without
creating hazards for standbys running older versions of postgres.

Greetings,

Andres Freund
Re: improving wraparound behavior
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> I am not sure exactly how to fix this,
> because the calculation we use to determine the XID that can be used
> to vacuum a specific table is pretty complex; how can the postmaster
> know whether it's going to be able to make any progress in *any* table
> in some database to which it's not even connected?  But it's surely
> crazy to just keep doing something over and over that can't possibly
> work.

I definitely agree that it's foolish to keep doing something that isn't
going to work, and it seems like a pretty large part of the issue here
is that we don't have enough information to be more intelligent because
we aren't connected to the database that needs the work to be done.

Now, presuming we're talking about 'new feature work' here to try and
address this, and not something that we think we can back-patch, I had
another thought.

I certainly get that having lots of extra processes around can be a
waste of resources... but I don't recall a lot of people complaining
about the autovacuum launcher process using up lots of resources
unnecessarily.

Perhaps we should consider having the 'global' autovacuum launcher,
when it's decided that a database needs work to be done, launch a
'per-DB' launcher which manages launching autovacuum processes for that
database?  If the launcher is still running then it's because there's
still work to be done on that database and the 'global' autovacuum
launcher can skip it.  If the 'per-DB' launcher runs out of things to
do, and the database it's working on is no longer in a danger zone,
then it exits.

There are certainly some other variations on this idea and I don't know
that it's really better than keeping more information in shared memory
or something else, but it seems like part of the issue is that the
thing firing off the processes hasn't got enough info to do so
intelligently and maybe we could fix that by having per-DB launchers
that are actually connected to a DB.

> 2. Once you get to the point where you start to emit errors when
> attempting to assign an XID, you can still run plain old VACUUM
> because it doesn't consume an XID ... except that if it tries to
> truncate the relation, then it will take AccessExclusiveLock, which
> has to be logged, which forces an XID assignment, which makes VACUUM
> fail.  So if you burn through XIDs until the system gets to this
> point, and then you roll back the prepared transaction that caused the
> problem in the first place, autovacuum sits there trying to vacuum
> tables in a tight loop and fails over and over again as soon as it
> hits a table that it thinks needs to be truncated.  This seems really
> lame, and also easily fixed.
>
> Attached is a patch that disables vacuum truncation if xidWarnLimit
> has been reached.  With this patch, in my testing, autovacuum is able
> to recover the system once the prepared transaction has been rolled
> back.  Without this patch, not only does that not happen, but if you
> had a database with enough relations that need truncation, you could
> conceivably cause XID wraparound just from running a database-wide
> VACUUM, the one tool you have available to avoid XID wraparound.  I
> think that this amounts to a back-patchable bug fix.

Sounds reasonable to me but I've not looked at the patch at all.

> 3. The message you get when you hit xidStopLimit seems like bad advice to me:
>
> ERROR: database is not accepting commands to avoid wraparound data
> loss in database "%s"
> HINT: Stop the postmaster and vacuum that database in single-user mode.
> You might also need to commit or roll back old prepared transactions,
> or drop stale replication slots.
>
> Why do we want people to stop the postmaster and vacuum that database
> in single-user mode?  Why not just run VACUUM in multi-user mode, or
> let autovacuum take care of the problem?  Granted, if VACUUM is going
> to fail in multi-user mode, and if switching to single-user mode is
> going to make it succeed, then it's a good suggestion.  But it seems
> that it doesn't fail in multi-user mode, unless it tries to truncate
> something, which is a bug we should fix.  Telling people to go to
> single-user mode where they can continue to assign XIDs even though
> they have almost no XIDs left seems extremely dangerous, actually.
>
> Also, I think that old prepared transactions and stale replication
> slots should be emphasized more prominently.  Maybe something like:
>
> HINT: Commit or roll back old prepared transactions, drop stale
> replication slots, or kill long-running sessions.
> Ensure that autovacuum is progressing, or run a manual database-wide VACUUM.

I agree that a better message would definitely be good and that
recommending single-user isn't a terribly useful thing to do.

I might have misunderstood it, but it sounded like Andres was proposing
a new function which would basically tell you what's holding back the
xid horizon, and that sounds fantastic and would be great to include in
this message, if possible.

As in:

HINT: Run the function pg_what_is_holding_xmin_back() to identify what
is preventing autovacuum from progressing and address it.
Re: range_agg
On Fri, May 03, 2019 at 03:56:41PM -0700, Paul Jungwirth wrote:
> Hello,
>
> I wrote an extension to add a range_agg function with similar behavior to
> existing *_agg functions, and I'm wondering if folks would like to have it
> in core? Here is the repo: https://github.com/pjungwir/range_agg
>
> I'm also working on a patch for temporal foreign keys, and having range_agg
> would make the FK check easier and faster, which is why I'd like to get it
> added. But also it just seems useful, like array_agg, json_agg, etc.
>
> One question is how to aggregate ranges that would leave gaps and/or
> overlaps.

This suggests three different ways to extend ranges over aggregation:
one is a union of (in general) disjoint intervals; the other two are
unions of intervals, each of which carries a weight.  Please pardon the
ASCII art.  The aggregation of:

[1, 4)
[2, 5)
[8, 10)

could turn into:

{[1,5), [8, 10)}
    (union without weight)

{{[1,2),1}, {[2,4),2}, {[4,5),1}, {[8,10),1}}
    (strictly positive weights which don't (in general) cover the space)

{{[1,2),1}, {[2,4),2}, {[4,5),1}, {[5,8),0}, {[8,10),1}}
    (non-negative weights which guarantee the space is covered)

There is no principled reason to choose one over the others.

> What do people think? I plan to work on a patch regardless, so that I can
> use it for temporal FKs, but I'd appreciate some feedback on the "user
> interface".

I think the cases above, or at least the first two of them, should be
available.  They could be called range_agg, weighted_range_agg, and
covering_range_agg.

Best,
David.
--
David Fetter http://fetter.org/ Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
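For the archives, the three forms above can be sketched in a few lines
of Python over half-open integer intervals (the function names follow
the suggestion above; this is an illustration of the semantics, not the
proposed C implementation):

```python
def covering_range_agg(ranges):
    """Third form: split [min lo, max hi) at every input endpoint and
    tag each piece with the number of input ranges covering it, so
    zero-weight gap pieces are included."""
    points = sorted({p for r in ranges for p in r})
    out = []
    for a, b in zip(points, points[1:]):
        w = sum(1 for lo, hi in ranges if lo <= a and b <= hi)
        out.append(((a, b), w))
    return out

def weighted_range_agg(ranges):
    # Second form: drop the zero-weight gap pieces.
    return [(r, w) for r, w in covering_range_agg(ranges) if w > 0]

def range_agg(ranges):
    # First form: merge adjacent positive-weight pieces into a plain
    # union of disjoint intervals, discarding the weights.
    out = []
    for (a, b), _ in weighted_range_agg(ranges):
        if out and out[-1][1] == a:
            out[-1] = (out[-1][0], b)
        else:
            out.append((a, b))
    return out
```

Running these on [1, 4), [2, 5), [8, 10) reproduces the three results
in the ASCII art above.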
Re: improving wraparound behavior
On Fri, May 3, 2019 at 8:45 PM Andres Freund wrote: > Part of my opposition to just disabling it when close to a wraparound, > is that it still allows to get close to wraparound because of truncation > issues. Sure ... it would definitely be better if vacuum didn't consume XIDs when it truncates. On the other hand, only a minority of VACUUM operations will truncate, so I don't think there's really a big problem in practice here. > IMO preventing getting closer to wraparound is more important > than making it more "comfortable" to be in a wraparound situation. I think that's a false dichotomy. It's impossible to create a situation where no user ever gets into a wraparound situation, unless we're prepared to do things like automatically drop replication slots and automatically roll back (or commit?) prepared transactions. So, while it is good to prevent a user from getting into a wraparound situation where we can, it is ALSO good to make it easy to recover from those situations as painlessly as possible when they do happen. > The second problem I see is that even somebody close to a wraparound > might have an urgent need to free up some space. So I'm a bit wary of > just disabling it. I would find that ... really surprising. If you have < 11 million XIDs left before your data gets eaten by a grue, and you file a bug report complaining that vacuum won't truncate your tables until you catch up on vacuuming a bit, I am prepared to offer you no sympathy at all. I mean, I'm not going to say that we couldn't invent more complicated behavior, at least on master, like making the new VACUUM (TRUNCATE) object ternary-valued: default is on when you have more than 11 million XIDs remaining and off otherwise, but you can override either value by saying VACUUM (TRUNCATE { ON | OFF }). But is that really a thing? People have less than 11 million XIDs left and they're like "forget anti-wraparound, I want to truncate away some empty pages"? 
> Wonder if there's a reasonable way that'd allow to do the WAL logging > for the truncation without using an xid. One way would be to just get > rid of the lock on the primary as previously discussed. But we could > also drive the locking through the WAL records that do the actual > truncation - then there'd not be a need for an xid. It's probably not an > entirely trivial change, but I don't think it'd be too bad? Beats me. For me, this is just a bug, not an excuse to redesign vacuum truncation. Before Hot Standby, when you got into severe wraparound trouble, you could vacuum all your tables without consuming any XIDs. Now you can't. That's bad, and I think we should come up with some kind of back-patchable solution to that problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: First-draft release notes for back branches are up
Thomas Munro writes: > On Sat, May 4, 2019 at 10:29 AM Tom Lane wrote: > + Tolerate EINVAL and ENOSYS > + error results, where appropriate, for fsync calls (Thomas Munro, > + James Sewell) > Nit-picking: ENOSYS is for sync_file_range, EINVAL is for fsync. Yeah, I didn't really think it was worth distinguishing. If there is some more general term that covers both calls, maybe we should use that? regards, tom lane
Re: First-draft release notes for back branches are up
On Sat, May 4, 2019 at 10:29 AM Tom Lane wrote: > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8b3bce2017b15e05f000c3c5947653a3e2c5a29f > > Please send any corrections by Sunday. + Tolerate EINVAL and ENOSYS + error results, where appropriate, for fsync calls (Thomas Munro, + James Sewell) Nit-picking: ENOSYS is for sync_file_range, EINVAL is for fsync. -- Thomas Munro https://enterprisedb.com
Re: improving wraparound behavior
Hi, On 2019-05-03 18:42:35 -0400, Robert Haas wrote: > On Fri, May 3, 2019 at 4:47 PM Andres Freund wrote: > > I'd actually say the proper fix would be to instead move the truncation > > to *after* finishing updating relfrozenxid etc. If we truncate, the > > additional cost of another in-place pg_class update, to update relpages, > > is basically insignificant. And the risk of errors, or being cancelled, > > during truncation is much higher than before (due to the AEL). > > That would prevent the ERROR from impeding relfrozenxid advancement, > but it does not prevent the error itself, nor the XID consumption. If > autovacuum is hitting that ERROR, it will spew errors in the log but > succeed in advancing relfrozenxid anyway. I don't think that's as > nice as the behavior I proposed, but it's certainly better than the > status quo. If you are hitting that error due to a manual VACUUM, > under your proposal, you'll stop the manual VACUUM as soon as you hit > the first table where this happens, which is not what you want. > You'll also keep consuming XIDs, which is not what you want either, > especially if you are in single-user mode because the number of > remaining XIDs is less than a million. Part of my opposition to just disabling it when close to a wraparound, is that it still allows to get close to wraparound because of truncation issues. IMO preventing getting closer to wraparound is more important than making it more "comfortable" to be in a wraparound situation. The second problem I see is that even somebody close to a wraparound might have an urgent need to free up some space. So I'm a bit wary of just disabling it. Wonder if there's a reasonable way that'd allow to do the WAL logging for the truncation without using an xid. One way would be to just get rid of the lock on the primary as previously discussed. But we could also drive the locking through the WAL records that do the actual truncation - then there'd not be a need for an xid. 
It's probably not an entirely trivial change, but I don't think it'd be too bad? > > > Also, I think that old prepared transactions and stale replication > > > slots should be emphasized more prominently. Maybe something like: > > > > > > HINT: Commit or roll back old prepared transactions, drop stale > > > replication slots, or kill long-running sessions. > > > Ensure that autovacuum is progressing, or run a manual database-wide > > > VACUUM. > > > > I think it'd be good to instead compute what the actual problem is. It'd > > not be particularly hard to show some of these in the errdetail: > > > > 1) the xid horizon (xid + age) of the problematic database; potentially, > >if connected to that database, additionally compute what the oldest > >xid is (although that's computationally potentially too expensive) s/oldest xid/oldest relfrozenxid/ > > 2) the xid horizon (xid + age) due to prepared transactions, and the > >oldest transaction's name > > 3) the xid horizon (xid + age) due to replication slot, and the "oldest" > >slot's name > > 4) the xid horizon (xid + age) and pid for the connection with the > >oldest snapshot. > > > > I think that'd make it much, much easier for users to pinpoint what's going on. > > > > In fact, I think we probably should additionally add a function that can > > display the above. That'd make it much easier to write monitoring > > queries. > I think that the error and hint that you get from > GetNewTransactionId() has to be something that we can generate very > quickly, without doing anything that might hang on cluster with lots > of databases or lots of relations; but if there's useful detail we can > display there, that's good. With a view, it's more OK if it takes a > long time on a big cluster. Yea, I agree it has to be reasonably fast. But all of the above, with the exception of the optional "oldest table", should be cheap enough to compute. 
Sure, a scan through PGXACT isn't cheap, but in comparison to an ereport() and an impending shutdown it's peanuts. In contrast to scanning a pg_class that could be many many gigabytes. Greetings, Andres Freund
accounting for memory used for BufFile during hash joins
Hi, I'm starting this thread mostly to keep track of patches developed in response to issue [1] reported on pgsql-performance. The symptoms are very simple - a query performing a hash join ends up using much more memory than expected (pretty much ignoring work_mem), and possibly ending up with OOM. The root cause is that hash join treats batches as pretty much free, but that's not really true - we do allocate two BufFile structs per batch, and each BufFile is ~8kB as it includes PGAlignedBuffer. This is not ideal even if we happen to estimate everything correctly, because for example with work_mem=4MB and nbatch=1024, it means we'll use about 16MB (2*8kB*1024) for the BufFile structures alone, plus the work_mem for the hash table itself. But it can easily explode when we under-estimate the hash side. In the pgsql-performance message, the hash side (with the patches applied, allowing the query to complete) looks like this: Hash (cost=2823846.37..2823846.37 rows=34619 width=930) (actual time=252946.367..252946.367 rows=113478127 loops=1) So it's 3277x under-estimated. It starts with 16 batches, and ends up adding more and more batches until it fails with 524288 of them (it gets to that many batches because some of the values are very common and we don't disable the growth earlier). The OOM is not very surprising, because with 524288 batches it'd need about 8GB of memory, and the system only has 8GB RAM installed. The two attached patches both account for the BufFile memory, but then use very different strategies when the work_mem limit is reached. The first patch realizes it's impossible to keep adding batches without breaking the work_mem limit, because at some point the BufFile will need more memory than that. But it does not make sense to stop adding batches entirely, because then the hash table could grow indefinitely. 
So the patch abandons the idea of enforcing work_mem in this situation, and instead attempts to minimize memory usage over time - it increases the spaceAllowed in a way that ensures doubling the number of batches actually reduces memory usage in the long run. The second patch tries to enforce work_mem more strictly. That would be impossible if we were to keep all the BufFile structs in memory, so instead it slices the batches into chunks that fit into work_mem, and then uses a single "overflow" file for slices currently not in memory. These extra slices can't be counted into work_mem, but we should need just very few of them. For example with work_mem=4MB the slice is 128 batches, so we need 128x fewer overflow files (compared to per-batch). Neither of those patches tweaks ExecChooseHashTableSize() to consider memory needed for BufFiles while deciding how many batches will be needed. That's something that probably needs to happen, but it would not help with the underestimate issue. I'm not entirely sure which of those approaches is the right one. The first one is clearly just a "damage control" for cases where the hash side turned out to be much larger than we expected. With good estimates we probably would not have picked a hash join for those (that is, we should have realized we can't keep work_mem and prohibit hash join). The second patch however makes hash join viable for some of those cases, and it seems to work pretty well (there are some numbers in the message posted to pgsql-performance thread). So I kinda like this second one. It's all just PoC quality, at this point, far from committable state. 
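The per-batch overhead figures quoted earlier in the message can be sanity-checked with some back-of-the-envelope arithmetic (~8 kB per BufFile, two BufFiles per batch, as stated above; this is not the actual allocator accounting):

```python
BUFFILE_SIZE = 8 * 1024  # ~8 kB each, dominated by PGAlignedBuffer

def buffile_overhead(nbatch):
    # two BufFile structs (inner + outer side) per batch
    return 2 * BUFFILE_SIZE * nbatch

# the well-estimated case: work_mem=4MB, nbatch=1024
print(buffile_overhead(1024) // (1024 ** 2))    # 16 (MB), on top of work_mem

# the runaway case from the pgsql-performance report: nbatch=524288
print(buffile_overhead(524288) // (1024 ** 3))  # 8 (GB) -- the OOM
```

So the BufFile overhead already dwarfs work_mem at nbatch=1024, and at 524288 batches it alone exceeds the reporter's 8GB of RAM.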
[1] https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6ffaa751f2..4d5a6872cc 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -80,6 +80,7 @@ static bool ExecParallelHashTuplePrealloc(HashJoinTable hashtable,
 static void ExecParallelHashMergeCounters(HashJoinTable hashtable);
 static void ExecParallelHashCloseBatchAccessors(HashJoinTable hashtable);
+static void ExecHashUpdateSpacePeak(HashJoinTable hashtable);
 
 /*
  * ExecHash
@@ -193,10 +194,8 @@ MultiExecPrivateHash(HashState *node)
 	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
 		ExecHashIncreaseNumBuckets(hashtable);
 
-	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
-	hashtable->spaceUsed += hashtable->nbuckets * sizeof(HashJoinTuple);
-	if (hashtable->spaceUsed > hashtable->spacePeak)
-		hashtable->spacePeak = hashtable->spaceUsed;
+	/* refresh info about peak used memory */
+	ExecHashUpdateSpacePeak(hashtable);
 
 	hashtable->partialTuples = hashtable->totalTuples;
 }
@@ -1647,12 +1646,56 @@ ExecHashTableInsert(HashJoinTable
Re: using index or check in ALTER TABLE SET NOT NULL
On Thu, 2 May 2019 at 14:22, David Rowley wrote: > > On Thu, 2 May 2019 at 13:08, Tom Lane wrote: > > > > Not a blocker perhaps, but it's better if we can get new behavior to > > be more or less right the first time. > > It's not really new behaviour though. The code in question is for > ATTACH PARTITION and was added in c31e9d4bafd (back in 2017). > bbb96c3704c is the commit for using constraints to speed up SET NOT > NULL. The mention of using the constraint for proof was made DEBUG1 in > the initial commit. What we need to decide is if we want to make > ATTACH PARTITION's INFO message a DEBUG1 in pg12 or not. FWIW, I'm in favour of changing ATTACH PARTITION's INFO messages to DEBUG1 for PG12 to align it to what bbb96c3704c did. Anyone else? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
range_agg
Hello, I wrote an extension to add a range_agg function with similar behavior to existing *_agg functions, and I'm wondering if folks would like to have it in core? Here is the repo: https://github.com/pjungwir/range_agg I'm also working on a patch for temporal foreign keys, and having range_agg would make the FK check easier and faster, which is why I'd like to get it added. But also it just seems useful, like array_agg, json_agg, etc. One question is how to aggregate ranges that would leave gaps and/or overlaps. So in my extension there is a one-param version that forbids gaps & overlaps, but I let you permit them by passing extra parameters, so the signature is: range_agg(r anyrange, permit_gaps boolean, permit_overlaps boolean) Perhaps another useful choice would be to return NULL if a gap/overlap is found, so that each param would have three choices instead of just two: accept the inputs, raise an error, return a NULL. What do people think? I plan to work on a patch regardless, so that I can use it for temporal FKs, but I'd appreciate some feedback on the "user interface". Thanks, -- Paul ~{:-) p...@illuminatedcomputing.com
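A rough model of the proposed flag behavior, in Python rather than SQL just so the policy choices are explicit (this is not the extension's code; what the aggregate should return once gaps are permitted is exactly the open "user interface" question, so the spanning range returned here is only a placeholder):

```python
def range_agg(ranges, permit_gaps=False, permit_overlaps=False):
    """Aggregate half-open ranges [lo, hi), enforcing the gap/overlap
    policy; raising an error is the default, as in the one-param form."""
    ranges = sorted(ranges)
    for (alo, ahi), (blo, bhi) in zip(ranges, ranges[1:]):
        if blo > ahi and not permit_gaps:
            raise ValueError("gap between %s and %s" % (ahi, blo))
        if blo < ahi and not permit_overlaps:
            raise ValueError("overlap between %s and %s" % (blo, ahi))
    return (ranges[0][0], max(hi for _, hi in ranges))

print(range_agg([(1, 4), (4, 6)]))                    # (1, 6)
print(range_agg([(1, 4), (6, 8)], permit_gaps=True))  # (1, 8)
# range_agg([(1, 4), (6, 8)]) would raise: gap between 4 and 6
```

The third choice Paul mentions -- returning NULL on a violation -- would amount to returning None here instead of raising, independently per flag.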
Re: improving wraparound behavior
Robert Haas writes: > I spent a significant chunk of today burning through roughly 2^31 XIDs > just to see what would happen. ... > 2. Once you get to the point where you start to emit errors when > attempting to assign an XID, you can still run plain old VACUUM > because it doesn't consume an XID ... except that if it tries to > truncate the relation, then it will take AccessExclusiveLock, which > has to be logged, which forces an XID assignment, which makes VACUUM > fail. Yeah. I tripped over that earlier this week in connection with the REINDEX business: taking an AEL only forces XID assignment when wal_level is above "minimal", so it's easy to come to the wrong conclusions depending on your test environment. I suspect that previous testing of wraparound behavior (yes there has been some) didn't see this effect because the default wal_level didn't use to cause it to happen. But anyway, it's there now and I agree we'd better do something about it. My brain is too fried from release-note-writing to have any trustworthy opinion right now about whether your patch is the best way. regards, tom lane
Re: improving wraparound behavior
On Fri, May 3, 2019 at 4:47 PM Andres Freund wrote: > I'd actually say the proper fix would be to instead move the truncation > to *after* finishing updating relfrozenxid etc. If we truncate, the > additional cost of another in-place pg_class update, to update relpages, > is basically insignificant. And the risk of errors, or being cancelled, > during truncation is much higher than before (due to the AEL). That would prevent the ERROR from impeding relfrozenxid advancement, but it does not prevent the error itself, nor the XID consumption. If autovacuum is hitting that ERROR, it will spew errors in the log but succeed in advancing relfrozenxid anyway. I don't think that's as nice as the behavior I proposed, but it's certainly better than the status quo. If you are hitting that error due to a manual VACUUM, under your proposal, you'll stop the manual VACUUM as soon as you hit the first table where this happens, which is not what you want. You'll also keep consuming XIDs, which is not what you want either, especially if you are in single-user mode because the number of remaining XIDs is less than a million. > > Also, I think that old prepared transactions and stale replication > > slots should be emphasized more prominently. Maybe something like: > > > > HINT: Commit or roll back old prepared transactions, drop stale > > replication slots, or kill long-running sessions. > > Ensure that autovacuum is progressing, or run a manual database-wide VACUUM. > > I think it'd be good to instead compute what the actual problem is. 
It'd > not be particularly hard to show some these in the errdetail: > > 1) the xid horizon (xid + age) of the problematic database; potentially, >if connected to that database, additionally compute what the oldest >xid is (although that's computationally potentially too expensive) > 2) the xid horizon (xid + age) due to prepared transactions, and the >oldest transaction's name > 3) the xid horizon (xid + age) due to replication slot, and the "oldest" >slot's name > 4) the xid horizon (xid + age) and pid for the connection with the >oldest snapshot. > > I think that'd allow users much much easier to pinpoint what's going on. > > In fact, I think we probably should additionally add a function that can > display the above. That'd make it much easier to write monitoring > queries. I think that the error and hint that you get from GetNewTransactionId() has to be something that we can generate very quickly, without doing anything that might hang on cluster with lots of databases or lots of relations; but if there's useful detail we can display there, that's good. With a view, it's more OK if it takes a long time on a big cluster. > IMO we also ought to compute the *actual* relfrozenxid/relminmxid for a > table. I.e. the oldest xid actually present. It's pretty common for most > tables to have effective horizons that are much newer than what > GetOldestXmin()/vacuum_set_xid_limits() can return. Obviously we can do > so only when scanning all non-frozen pages. But being able to record > "more aggressive" horizons would often save unnecessary work. And it > ought to not be hard. I think especially for regular non-freeze, > non-wraparound vacuums that'll often result in a much newer relfrozenxid > (as we'll otherwise just have GetOldestXmin() - vacuum_freeze_min_age). Sure, that would make sense. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: POC: GROUP BY optimization
On Fri, May 03, 2019 at 10:28:21PM +0200, Dmitry Dolgov wrote: On Tue, Apr 9, 2019 at 5:21 PM Tomas Vondra wrote: So I personally would suggest to treat those patches as independent until the very last moment, develop the costing improvements needed by each of them, and then decide which of them are committable / in what order. I had the same idea, but judging from the questions raised in this thread, it's quite hard to go with reordering based only on frequency of values. I hoped that the cost_sort improvement patch would be simple enough to incorporate it here, but of course it wasn't. Working on the assumption that the amount of work required for sorting depends only on the number of distinct groups and how costly it is to compare values of this data type, I've ended up extracting get_width_multiplier and get_func_cost parts from the cost_sort patch and including them into 0003-Reorder-by-values-distribution. This allows taking into account situations when we compare e.g. long strings or a custom data type with high procost for comparison (but I've used these values directly without any adjusting coefficients yet). On Wed, Jun 13, 2018 at 6:41 PM Teodor Sigaev wrote: > So that's a nice improvement, although I think we should also consider > non-uniform distributions (using the per-column MCV lists). Could you clarify how to do that? Since I'm not familiar with this topic, I would like to ask the same question, how to do that and what are the advantages? I don't recall the exact details of the discussion, since most of it happened almost a year ago, but the main concern about the original costing approach is that it very much assumes uniform distribution. For example if you have N tuples with M groups (which are not computed using estimate_num_groups IIRC, and we can hardly do much better), then the patch assumed each group is ~N/M rows and used that for computing the number of comparisons for a particular sequence of ORDER BY clauses. 
But that may result in pretty significant errors in cases with a couple of very large groups. But hey - we can get some of that information from MCV lists, so maybe we can compute a safer less-optimistic estimate. I mean, we can't compute a "perfect" estimate of how many comparisons we'll have to do, because we only have per-column MCV lists and maybe some multi-column MCV lists (but definitely not on all combinations of columns in the ORDER BY clause). But what I think we could do is use the largest possible group instead of the average one. So for example when estimating the number of comparisons for columns (c1,..,cN), you could look at MCV lists for these columns and compute m(i) = Max(largest group in MCV list for i-th column) and then use Min(m(1), ..., m(k)) when estimating the number of comparisons. Of course, this is likely to be a worst-case estimate, and it's not quite obvious that comparing worst-case estimates is necessarily safer than comparing the regular ones. So I'm not sure about it. It might be better to just compute the 'average group' in a different way, not by arithmetic mean, but say as geometric mean from each MCV list. Not sure. I guess this needs some research and experimentation - constructing cases that are likely to cause problems (non-uniform distributions, columns with a small number of exceptionally large groups, ...) and then showing how the costing deals with those. FWIW I've mentioned MCV lists in the incrementalsort thread too, in which case it was much clearer how to use them to improve the startup cost estimate - it's enough to look at the first group in per-column MCV lists (first in the ORDER BY sense, not by frequency). On Sat, Jun 16, 2018 at 5:59 PM Tomas Vondra wrote: I still think we need to be careful when introducing new optimizations in this area - reordering the grouping keys by ndistinct, ORDER BY or whatever. 
In particular I don't think we should commit these patches that may quite easily cause regressions, and then hope some hypothetical future patch fixes the costing. I'm a bit concerned about this part of the discussion. There is an idea through the whole thread about avoiding the situation when a user knows which order is better and we generate a different one by mistake. From what I see right now even if all the objections would be addressed, there is a chance that some coefficients will not be good enough (e.g. the width multiplier is based on an average width, or it can suddenly happen that all the compared strings have some difference at the very beginning) and the chosen order will not be optimal. Does it mean that in any case the implementation of such optimization should provide a way to override it? I don't think we have to provide an override no matter what. Otherwise we'd have to have an override for everything, because no optimizer decision is perfect - it's all built on estimates, after all. But I assume we agree that optimizations based on estimates that are known to be
Re: improving wraparound behavior
Hi, On 2019-05-03 16:26:46 -0400, Robert Haas wrote: > 2. Once you get to the point where you start to emit errors when > attempting to assign an XID, you can still run plain old VACUUM > because it doesn't consume an XID ... except that if it tries to > truncate the relation, then it will take AccessExclusiveLock, which > has to be logged, which forces an XID assignment, which makes VACUUM > fail. So if you burn through XIDs until the system gets to this > point, and then you roll back the prepared transaction that caused the > problem in the first place, autovacuum sits there trying to vacuum > tables in a tight loop and fails over and over again as soon as hits a > table that it thinks needs to be truncated. This seems really lame, > and also easily fixed. > > Attached is a patch that disables vacuum truncation if xidWarnLimit > has been reached. With this patch, in my testing, autovacuum is able > to recover the system once the prepared transaction has been rolled > back. Without this patch, not only does that not happen, but if you > had a database with enough relations that need truncation, you could > conceivably cause XID wraparound just from running a database-wide > VACUUM, the one tool you have available to avoid XID wraparound. I > think that this amounts to a back-patchable bug fix. > > (One could argue that truncation should be disabled sooner than this, > like when we've exceed autovacuum_freeze_max_age, or later than this, > like when we hit xidStopLimit, but I think xidWarnLimit is probably > the best compromise.) I'd actually say the proper fix would be to instead move the truncation to *after* finishing updating relfrozenxid etc. If we truncate, the additional cost of another in-place pg_class update, to update relpages, is basically insignificant. And the risk of errors, or being cancelled, during truncation is much higher than before (due to the AEL). 
> Also, I think that old prepared transactions and stale replication > slots should be emphasized more prominently. Maybe something like: > > HINT: Commit or roll back old prepared transactions, drop stale > replication slots, or kill long-running sessions. > Ensure that autovacuum is progressing, or run a manual database-wide VACUUM. I think it'd be good to instead compute what the actual problem is. It'd not be particularly hard to show some of these in the errdetail:

1) the xid horizon (xid + age) of the problematic database; potentially, if connected to that database, additionally compute what the oldest xid is (although that's computationally potentially too expensive)
2) the xid horizon (xid + age) due to prepared transactions, and the oldest transaction's name
3) the xid horizon (xid + age) due to replication slot, and the "oldest" slot's name
4) the xid horizon (xid + age) and pid for the connection with the oldest snapshot.

I think that'd make it much, much easier for users to pinpoint what's going on. In fact, I think we probably should additionally add a function that can display the above. That'd make it much easier to write monitoring queries. IMO we also ought to compute the *actual* relfrozenxid/relminmxid for a table. I.e. the oldest xid actually present. It's pretty common for most tables to have effective horizons that are much newer than what GetOldestXmin()/vacuum_set_xid_limits() can return. Obviously we can do so only when scanning all non-frozen pages. But being able to record "more aggressive" horizons would often save unnecessary work. And it ought to not be hard. I think especially for regular non-freeze, non-wraparound vacuums that'll often result in a much newer relfrozenxid (as we'll otherwise just have GetOldestXmin() - vacuum_freeze_min_age). Greetings, Andres Freund
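To make the "xid + age" horizons above concrete: XIDs are 32-bit counters compared modulo wraparound, so each horizon's age is a circular distance from the next XID to be assigned. A minimal sketch of that arithmetic (illustrative Python, not backend code, and ignoring the reserved special XIDs):

```python
XID_MODULUS = 2 ** 32

def xid_age(next_xid, horizon_xid):
    # number of XIDs assigned since horizon_xid, modulo wraparound
    return (next_xid - horizon_xid) % XID_MODULUS

print(xid_age(1000, 100))               # 900
print(xid_age(500, XID_MODULUS - 500))  # 1000 -- the age survives the counter wrapping
```

Each of items 1-4 would pair such an age with the thing holding the horizon back (database, prepared transaction, slot, or backend pid), which is what would make the errdetail actionable.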
Re: POC: GROUP BY optimization
> On Tue, Apr 9, 2019 at 5:21 PM Tomas Vondra > wrote: > > So I personally would suggest to treat those patches as independent until > the very last moment, develop the costing improvements needed by each > of them, and then decide which of them are committable / in what order. I had the same idea, but judging from the questions raised in this thread, it's quite hard to go with reordering based only on frequency of values. I hoped that the cost_sort improvement patch would be simple enough to incorporate it here, but of course it wasn't. Working on the assumption that the amount of work required for sorting depends only on the number of distinct groups and how costly it is to compare values of this data type, I've ended up extracting get_width_multiplier and get_func_cost parts from the cost_sort patch and including them into 0003-Reorder-by-values-distribution. This allows taking into account situations when we compare e.g. long strings or a custom data type with high procost for comparison (but I've used these values directly without any adjusting coefficients yet). > On Wed, Jun 13, 2018 at 6:41 PM Teodor Sigaev wrote: > > > So that's a nice improvement, although I think we should also consider > > non-uniform distributions (using the per-column MCV lists). > > Could you clarify how to do that? Since I'm not familiar with this topic, I would like to ask the same question, how to do that and what are the advantages? > On Sat, Jun 16, 2018 at 5:59 PM Tomas Vondra > wrote: > > I still think we need to be careful when introducing new optimizations > in this area - reordering the grouping keys by ndistinct, ORDER BY or > whatever. In particular I don't think we should commit these patches > that may quite easily cause regressions, and then hope some hypothetical > future patch fixes the costing. I'm a bit concerned about this part of the discussion. 
There is an idea through the whole thread about avoiding the situation when a user knows which order is better and we generate a different one by mistake. From what I see right now even if all the objections would be addressed, there is a chance that some coefficients will not be good enough (e.g. the width multiplier is based on an average width, or it can suddenly happen that all the compared strings have some difference at the very beginning) and the chosen order will not be optimal. Does it mean that in any case the implementation of such optimization should provide a way to override it?

v12-0001-Add-tests-for-group-by-optimization.patch
Description: Binary data
v12-0002-Reorder-to-match-ORDER-BY-or-index.patch
Description: Binary data
v12-0003-Reorder-by-values-distribution.patch
Description: Binary data
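The "largest possible group" estimate Tomas describes upthread -- bounding the per-group row count by Min(m(1), ..., m(k)) over the columns' largest MCV fractions, instead of assuming a uniform ~N/M rows per group -- can be sketched like this (the frequencies and row count are made-up illustrative numbers, not planner output):

```python
def worst_case_group_rows(nrows, mcv_lists):
    """mcv_lists: one list of (value, fraction) pairs per ORDER BY column.
    Returns Min over columns of that column's largest MCV group, in rows."""
    per_column_max = [max(frac for _, frac in mcv) * nrows
                      for mcv in mcv_lists]
    return min(per_column_max)

nrows = 1_000_000
mcvs = [
    [('a', 0.5), ('b', 0.3)],  # c1: dominated by one huge group
    [(1, 0.05), (2, 0.04)],    # c2: fairly uniform
]
print(worst_case_group_rows(nrows, mcvs[:1]))  # 500000.0 -- sorting by c1 alone
print(worst_case_group_rows(nrows, mcvs))      # 50000.0  -- adding c2 caps the group size
```

This illustrates why it is a worst-case bound: the true largest (c1, c2) group could be smaller than 50000 rows, but never larger, assuming the MCV lists really do contain each column's most common values.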
Re: error messages in extended statistics
On Fri, May 03, 2019 at 12:21:36PM -0400, Tom Lane wrote: Alvaro Herrera writes: Error reporting in extended statistics is inconsistent -- many messages that are ereport() in mvdistinct.c are elog() in the other modules. ... I think this should be cleaned up, while at the same time not giving too much hassle for translators; for example, this message dependencies.c: elog(ERROR, "invalid MVDependencies size %zd (expected at least %zd)", should not only be turned into an ereport(), but also the MVDependencies part turned into a %s. (Alternatively, we could decide I was wrong and turn them all back into elogs, but I obviously vote against that.) FWIW, I'd vote the other way: that seems like a clear "internal error", so making translators deal with it is just make-work. It should be an elog. If there's a reasonably plausible way for a user to trigger an error condition, then yes ereport, but if we're reporting situations that couldn't happen without a server bug then elog seems fine. Yeah, I agree. Most (perhaps all) of those errors are internal errors, and thus should be elog. I'll take care of cleaning this up a bit. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Why is infinite_recurse test suddenly failing?
Mark Wong writes: > On Thu, May 02, 2019 at 11:45:34AM -0400, Tom Lane wrote: >> While we're looking at this --- Mark, if you could install gdb >> on your buildfarm hosts, that would be really handy. I think that's >> the only extra thing the buildfarm script needs to extract stack >> traces from core dumps. We'd likely already know where the problem >> is if we had a stack trace ... > Ok, I think I have gdb installed now... Thanks! regards, tom lane
Re: Why is infinite_recurse test suddenly failing?
On Thu, May 02, 2019 at 11:45:34AM -0400, Tom Lane wrote: > Andres Freund writes: > > Hm, I just noticed: > >'HEAD' => [ > >'force_parallel_mode = > > regress' > > ] > > Oooh, I didn't see that. > > > on all those animals. So it's not necessarily the case that HEAD and > > backbranch runs are behaving all that identical. Note that isn't a > > recent config change, so it's not an explanation as to why they started > > to fail only recently. > > No, but it does point at another area of the code in which a relevant > change could've occurred. > > While we're looking at this --- Mark, if you could install gdb > on your buildfarm hosts, that would be really handy. I think that's > the only extra thing the buildfarm script needs to extract stack > traces from core dumps. We'd likely already know where the problem > is if we had a stack trace ... Ok, I think I have gdb installed now... Regards, Mark -- Mark Wong 2ndQuadrant - PostgreSQL Solutions for the Enterprise https://www.2ndQuadrant.com/
Re: Per-tablespace autovacuum settings
On Thu, Apr 25, 2019 at 12:36 PM Oleksii Kliukin wrote: > - Fallbacks to autovacuum parameters in another scope. Right now in the > absence of the per-table and per-tablespace autovacuum parameters the code > uses the ones from the global scope. However, if only some of the reloptions > are set on a per-table level (i.e. none of the autovacuum related ones), we > assume defaults for the rest of reloptions without consulting the lower > level (i.e., per-tablespace options). This is so because we don’t have the > mechanism to tell whether the option is set to its default value (some of > them use -1 to request the fallback to the outer level, but for some it’s > not possible, i.e. autovacuum_enabled is just a boolean value). That sounds like it's probably not acceptable? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
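The fallback problem Oleksii describes can be illustrated with a small sketch: a numeric option can use -1 as an "unset" sentinel and fall through table → tablespace → global, but a plain boolean like autovacuum_enabled has no spare value, so once any per-table reloption is set the tablespace level gets skipped. All names and values here are invented for illustration, not the reloptions code:

```python
# Resolving an autovacuum option across scopes: table -> tablespace -> global.
UNSET = -1  # sentinel available to numeric options only

def resolve_numeric(table_val, tablespace_val, global_val):
    """Return the innermost explicitly-set value, falling back outward."""
    for val in (table_val, tablespace_val):
        if val != UNSET:
            return val
    return global_val

def resolve_bool_broken(table_has_any_reloption, table_val,
                        tablespace_val, global_val):
    """Mimics the behaviour described in the thread: once *any* reloption
    is set at table level, the boolean takes its table-level default and
    the tablespace level is never consulted."""
    if table_has_any_reloption:
        return table_val           # default filled in, tablespace skipped
    if tablespace_val is not None:
        return tablespace_val
    return global_val

# Numeric: table unset, falls through to the tablespace value.
scale = resolve_numeric(UNSET, 0.05, 0.2)

# Boolean: fillfactor (say) is set on the table, so even though the
# tablespace says "disabled", the table-level default wins.
enabled = resolve_bool_broken(True, True, False, True)
```

The sketch shows why Robert finds the behaviour questionable: `enabled` comes out `True` although the tablespace setting says otherwise.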
Re: error messages in extended statistics
Alvaro Herrera writes: > Error reporting in extended statistics is inconsistent -- many messages > that are ereport() in mvdistinct.c are elog() in the other modules. > ... > I think this should be cleaned up, while at the same time not giving too > much hassle for translators; for example, this message > dependencies.c: elog(ERROR, "invalid MVDependencies size %zd (expected at > least %zd)", > should not only be turned into an ereport(), but also the MVDependencies > part turned into a %s. (Alternatively, we could decide I was wrong and > turn them all back into elogs, but I obviously vote against that.) FWIW, I'd vote the other way: that seems like a clear "internal error", so making translators deal with it is just make-work. It should be an elog. If there's a reasonably plausible way for a user to trigger an error condition, then yes ereport, but if we're reporting situations that couldn't happen without a server bug then elog seems fine. regards, tom lane
error messages in extended statistics
Hello Error reporting in extended statistics is inconsistent -- many messages that are ereport() in mvdistinct.c are elog() in the other modules. I think what happened is that I changed them from elog to ereport when committing mvdistinct, but Tomas and Simon didn't follow suit when committing the other two modules. As a result, some messages that should be essentially duplicates only show up once, because the elog() ones are not marked translatable. I think this should be cleaned up, while at the same time not giving too much hassle for translators; for example, this message dependencies.c: elog(ERROR, "invalid MVDependencies size %zd (expected at least %zd)", should not only be turned into an ereport(), but also the MVDependencies part turned into a %s. (Alternatively, we could decide I was wrong and turn them all back into elogs, but I obviously vote against that.) $ git grep 'elog\|errmsg' src/backend/statistics dependencies.c: elog(ERROR, "cache lookup failed for ordering operator for type %u", dependencies.c: elog(ERROR, "invalid MVDependencies size %zd (expected at least %zd)", dependencies.c: elog(ERROR, "invalid dependency magic %d (expected %d)", dependencies.c: elog(ERROR, "invalid dependency type %d (expected %d)", dependencies.c: errmsg("invalid zero-length item array in MVDependencies"))); dependencies.c: elog(ERROR, "invalid dependencies size %zd (expected at least %zd)", dependencies.c: elog(ERROR, "cache lookup failed for statistics object %u", mvoid); dependencies.c: elog(ERROR, dependencies.c: errmsg("cannot accept a value of type %s", "pg_dependencies"))); dependencies.c: errmsg("cannot accept a value of type %s", "pg_dependencies"))); extended_stats.c:errmsg("statistics object \"%s.%s\" could not be computed for relation \"%s.%s\"", extended_stats.c: elog(ERROR, "unexpected statistics type requested: %d", type); extended_stats.c: elog(ERROR, "stxkind is not a 1-D char array"); extended_stats.c: elog(ERROR, "cache lookup failed for statistics 
object %u", statOid); mcv.c: elog(ERROR, "cache lookup failed for ordering operator for type %u", mcv.c: elog(ERROR, "cache lookup failed for statistics object %u", mvoid); mcv.c: elog(ERROR, mcv.c: elog(ERROR, "invalid MCV size %zd (expected at least %zu)", mcv.c: elog(ERROR, "invalid MCV magic %u (expected %u)", mcv.c: elog(ERROR, "invalid MCV type %u (expected %u)", mcv.c: elog(ERROR, "invalid zero-length dimension array in MCVList"); mcv.c: elog(ERROR, "invalid length (%d) dimension array in MCVList", mcv.c: elog(ERROR, "invalid zero-length item array in MCVList"); mcv.c: elog(ERROR, "invalid length (%u) item array in MCVList", mcv.c: elog(ERROR, "invalid MCV size %zd (expected %zu)", mcv.c: elog(ERROR, "invalid MCV size %zd (expected %zu)", mcv.c: errmsg("function returning record called in context " mcv.c: errmsg("cannot accept a value of type %s", "pg_mcv_list"))); mcv.c: errmsg("cannot accept a value of type %s", "pg_mcv_list"))); mcv.c: elog(ERROR, "unknown clause type: %d", clause->type); mvdistinct.c: elog(ERROR, "cache lookup failed for statistics object %u", mvoid); mvdistinct.c: elog(ERROR, mvdistinct.c: elog(ERROR, "invalid MVNDistinct size %zd (expected at least %zd)", mvdistinct.c:errmsg("invalid ndistinct magic %08x (expected %08x)", mvdistinct.c:errmsg("invalid ndistinct type %d (expected %d)", mvdistinct.c:errmsg("invalid zero-length item array in MVNDistinct"))); mvdistinct.c:errmsg("invalid MVNDistinct size %zd (expected at least %zd)", mvdistinct.c:errmsg("cannot accept a value of type %s", "pg_ndistinct"))); mvdistinct.c:errmsg("cannot accept a value of type %s", "pg_ndistinct"))); mvdistinct.c: elog(ERROR, "cache lookup failed for ordering operator for type %u", -- Álvaro Herrera Developer, https://www.PostgreSQL.org/
Re: Heap lock levels for REINDEX INDEX CONCURRENTLY not quite right?
Hi, On 2019-05-03 09:37:07 +0200, Peter Eisentraut wrote: > REINDEX CONCURRENTLY is still deadlock prone because of > WaitForOlderSnapshots(), so this doesn't actually fix your test case, > but that seems unrelated to this particular issue. Right. I've not tested the change, but it looks reasonable to me. Moving the reset of *heapOid into the unlock logic is perhaps debatable, but I think it's OK. Greetings, Andres Freund
Re: POC: Cleaning up orphaned files using undo logs
On Fri, May 3, 2019 at 12:46 AM Dilip Kumar wrote: > I might be completely missing something, but (ptr - array_base) is only valid > when you first get the array; qsort will swap the elements > around, and after that you will never be able to make out which element > was at a lower index and which one was at a higher index. Basically, our > goal is to preserve the order of the undo records for the same block, > but their order might get changed by a swap when they are getting > compared with the undo record pointer of another block, and once > the order is swapped we will never know what their initial positions were. *facepalm* Yeah, you're right. Still, I think we should see if there's some way of getting rid of that structure, or at least making it an internal detail that is used by the code that's doing the sorting rather than something that is exposed as an external interface. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
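Dilip's point is that qsort is not stable: records that compare equal (same block) can be reordered, and `(ptr - array_base)` stops being meaningful once elements have been swapped. One standard way to keep insertion order among equal keys — without exposing any external structure — is to tag each record with its original index and use that index as a tiebreaker. An illustrative sketch, not the actual undo code:

```python
# A qsort-style sort is not stable: records for the same block may be
# reordered.  Tagging each record with its original position and sorting
# by (block, original_index) preserves per-block insertion order, and
# the tag can stay internal to the sorting code.

records = [
    {"block": 7, "op": "insert"},
    {"block": 3, "op": "delete"},
    {"block": 7, "op": "update"},  # must stay after the first block-7 record
    {"block": 3, "op": "insert"},
]

tagged = list(enumerate(records))              # (original_index, record)
tagged.sort(key=lambda t: (t[1]["block"], t[0]))
sorted_records = [rec for _, rec in tagged]
```

In C the same effect is usually achieved by widening the sort key with the element's original array index inside the comparator, keeping the detail internal to the sorting module as Robert suggests.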
Re: make \d pg_toast.foo show its indices
On Fri, May 03, 2019 at 02:55:47PM +0200, Rafia Sabih wrote: > On Mon, 22 Apr 2019 at 17:49, Justin Pryzby wrote: > > > > It's deliberate that \dt doesn't show toast tables. > > \d shows them, but doesn't show their indices. > > > > It seems to me that their indices should be shown, without having to think > > and > > know to query pg_index. > > > > postgres=# \d pg_toast.pg_toast_2600 > > TOAST table "pg_toast.pg_toast_2600" > >Column | Type > > +- > > chunk_id | oid > > chunk_seq | integer > > chunk_data | bytea > > Indexes: > > "pg_toast_2600_index" PRIMARY KEY, btree (chunk_id, chunk_seq) > > +1. Thanks - what about also showing the associated non-toast table ? postgres=# \d pg_toast.pg_toast_2620 TOAST table "pg_toast.pg_toast_2620" Column | Type +- chunk_id | oid chunk_seq | integer chunk_data | bytea FOR TABLE: "pg_catalog.pg_trigger" Indexes: "pg_toast_2620_index" PRIMARY KEY, btree (chunk_id, chunk_seq) That could be displayed differently, perhaps in the header, but I think this is more consistent with other display. 
Justin >From 7c15ebe408cc5f2af51120ea152e7997ee768f81 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Fri, 3 May 2019 09:24:51 -0500 Subject: [PATCH v2 1/2] make \d pg_toast.foo show its indices --- src/bin/psql/describe.c | 1 + 1 file changed, 1 insertion(+) diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index d7390d5..d26d986 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -2274,6 +2274,7 @@ describeOneTableDetails(const char *schemaname, else if (tableinfo.relkind == RELKIND_RELATION || tableinfo.relkind == RELKIND_MATVIEW || tableinfo.relkind == RELKIND_FOREIGN_TABLE || + tableinfo.relkind == RELKIND_TOASTVALUE || tableinfo.relkind == RELKIND_PARTITIONED_TABLE) { /* Footer information about a table */ -- 2.7.4 >From 38f50cdb727c67ae7aece8e85caf2960e824cb65 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Tue, 30 Apr 2019 19:05:53 -0500 Subject: [PATCH v2 2/2] print table associated with given TOAST table --- src/bin/psql/describe.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index d26d986..ebdf18a 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -2152,6 +2152,28 @@ describeOneTableDetails(const char *schemaname, } } + /* print table associated with given TOAST table */ + if (tableinfo.relkind == RELKIND_TOASTVALUE) + { + PGresult *result = NULL; + printfPQExpBuffer(&buf, + "SELECT relnamespace::pg_catalog.regnamespace, relname FROM pg_class WHERE reltoastrelid = '%s'", + oid); + result = PSQLexec(buf.data); + if (!result) { + goto error_return; + } else if (1 != PQntuples(result)) { + PQclear(result); + goto error_return; + } else { + char *schemaname = PQgetvalue(result, 0, 0); + char *relname = PQgetvalue(result, 0, 1); + appendPQExpBuffer(&tmpbuf, _("For table: \"%s.%s\""), + schemaname, relname); + printTableAddFooter(&cont, tmpbuf.data); + } + } + if (tableinfo.relkind == RELKIND_PARTITIONED_TABLE) { /* Get the partition key information */ -- 2.7.4
Re: Google Season of Docs 2019 - Starter
Greetings, * Evangelos Karatarakis (baggeliskap...@gmail.com) wrote: > I am interested in participating in GSoD 2019 and more specifically I am > interested in working on the project of improving the Introductory Tutorial > for beginners in PostgreSQL and in databases. I have a practical and > theoretical background in PostgreSQL due to previous work. Thus I am not in > lack of prior knowledge which is needed. Besides, I am about to get my > degree as an Electrical and Computer Engineer in a while. On the other hand > I am a starter in the field of Technical Writing but I am willing to engage > with this profession. For this reason I consider GSoD and this project idea > a great opportunity for me. Do you think it is a good idea for me to work on a > proposal since I am a starter Technical Writer? I am looking forward to > your opinion. This is covered here, I believe: https://developers.google.com/season-of-docs/docs/ Specifically: "Technical writers are technical writers worldwide who’re accepted to take part in this year’s Season of Docs. Applicants should be able to demonstrate prior technical writing experience by submitting role descriptions and work samples. See the technical writer guide and responsibilities." This program isn't intended, as I understand the above to say, as a starter for technical writers. I recommend you ask Google directly regarding your interest in the program. Thanks, Stephen signature.asc Description: PGP signature
Google Season of Docs 2019 - Starter
Greetings I am interested in participating in GSoD 2019 and more specifically I am interested in working on the project of improving the Introductory Tutorial for beginners in PostgreSQL and in databases. I have a practical and theoretical background in PostgreSQL due to previous work. Thus I am not in lack of prior knowledge which is needed. Besides, I am about to get my degree as an Electrical and Computer Engineer in a while. On the other hand I am a starter in the field of Technical Writing but I am willing to engage with this profession. For this reason I consider GSoD and this project idea a great opportunity for me. Do you think it is a good idea for me to work on a proposal since I am a starter Technical Writer? I am looking forward to your opinion. Thank you in advance. Regards. Evangelos Karatarakis Electrical and Computer Engineering Student Technical University of Crete Greece
Re: pg_upgrade --clone error checking
On Fri, May 3, 2019 at 3:53 AM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 2019-05-02 20:03, Jeff Janes wrote: > > It looks like it was designed for early checking, it just wasn't placed > > early enough. So changing it is pretty easy, as check_file_clone does > > not need to be invented, and there is no additional code duplication > > over what was already there. > > > > This patch moves the checking to near the beginning. > > I think the reason it was ordered that way is that it wants to do all > the checks of the old cluster before doing any checks touching the new > cluster. > But is there a reason to want to do that? I understand we don't want to keep starting and stopping the clusters needlessly, so we should do everything we can in one before moving to the other. But for checks that don't need a running cluster, why would it matter? The existence and contents of PG_VERSION of the new cluster directory are already checked at the very beginning (and pg_upgrade even tries to start it up and shuts it down again if a pid file also exists), so there is precedent for touching the new cluster directory at the filesystem level early (albeit in a read-only manner), and if a pid file exists, doing even more than that. I didn't move check_file_clone to before the liveness check is done, out of an abundance of caution. But creating a transient file with a name of no significance ("PG_VERSION.clonetest") in a cluster that is not even running seems like a very low-risk thing to do. The pay-off is that we get an inevitable error message much sooner. Cheers, Jeff
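The kind of probe Jeff describes — create a throwaway file next to the data directory and see whether the filesystem can clone it — can be sketched as follows. This is an illustrative Python sketch, not pg_upgrade's C code; it assumes Linux's `FICLONE` ioctl (reflink), and on other platforms or non-reflink filesystems the ioctl simply fails and the probe reports "not supported":

```python
# Sketch of an early --clone capability check: write a transient probe
# file, try to reflink it, clean both files up.  FICLONE is Linux-only;
# any OSError (EOPNOTSUPP, EXDEV, unsupported platform) means cloning
# is unavailable.  File names mirror the "PG_VERSION.clonetest" idea.
import fcntl
import os
import tempfile

FICLONE = 0x40049409  # _IOW(0x94, 9, int), from <linux/fs.h>

def file_clone_supported(directory):
    src = os.path.join(directory, "PG_VERSION.clonetest.src")
    dst = os.path.join(directory, "PG_VERSION.clonetest")
    try:
        with open(src, "w") as f:
            f.write("probe")
        with open(src, "rb") as s, open(dst, "wb") as d:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
        return True
    except OSError:
        return False
    finally:
        for path in (src, dst):
            try:
                os.unlink(path)
            except OSError:
                pass

with tempfile.TemporaryDirectory() as tmp:
    result = file_clone_supported(tmp)
```

Because the probe only creates and removes a file with an insignificant name, it can safely run before either cluster is started — which is the point of moving the check earlier.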
Re: make \d pg_toast.foo show its indices
On Mon, 22 Apr 2019 at 17:49, Justin Pryzby wrote: > > It's deliberate that \dt doesn't show toast tables. > \d shows them, but doesn't show their indices. > > It seems to me that their indices should be shown, without having to think and > know to query pg_index. > > postgres=# \d pg_toast.pg_toast_2600 > TOAST table "pg_toast.pg_toast_2600" >Column | Type > +- > chunk_id | oid > chunk_seq | integer > chunk_data | bytea > Indexes: > "pg_toast_2600_index" PRIMARY KEY, btree (chunk_id, chunk_seq) > +1. -- Regards, Rafia Sabih
Re: Statistical aggregate functions are not working with partitionwise aggregate
On Fri, May 3, 2019 at 2:56 PM Rajkumar Raghuwanshi < rajkumar.raghuwan...@enterprisedb.com> wrote: > Hi, > > On PG-head, some statistical aggregate functions are not giving correct > output when partitionwise aggregate is enabled, while the same works on v11. > I had a quick look at this and observed that something is broken with PARTIAL aggregation. I can reproduce the same issue with a larger dataset which results in a parallel scan. CREATE TABLE tbl1(a int2,b float4) partition by range(a); create table tbl1_p1 partition of tbl1 for values from (minvalue) to (0); create table tbl1_p2 partition of tbl1 for values from (0) to (maxvalue); insert into tbl1 select i%2, i from generate_series(1, 100) i; # SELECT regr_count(b, a) FROM tbl1; regr_count 0 (1 row) postgres:5432 [120536]=# explain SELECT regr_count(b, a) FROM tbl1; QUERY PLAN Finalize Aggregate (cost=15418.08..15418.09 rows=1 width=8) -> Gather (cost=15417.87..15418.08 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=14417.87..14417.88 rows=1 width=8) -> Parallel Append (cost=0.00..11091.62 rows=443500 width=6) -> Parallel Seq Scan on tbl1_p2 (cost=0.00..8850.00 rows=442500 width=6) -> Parallel Seq Scan on tbl1_p1 (cost=0.00..24.12 rows=1412 width=6) (7 rows) postgres:5432 [120536]=# set max_parallel_workers_per_gather to 0; SET postgres:5432 [120536]=# SELECT regr_count(b, a) FROM tbl1; regr_count 100 (1 row) After looking further, it seems that it got broken by the following commit: commit a9c35cf85ca1ff72f16f0f10d7ddee6e582b62b8 Author: Andres Freund Date: Sat Jan 26 14:17:52 2019 -0800 Change function call information to be variable length. This commit is too big to understand, and thus I could not get to the exact cause. Thanks > below are some of examples.
> > CREATE TABLE tbl(a int2,b float4) partition by range(a); > create table tbl_p1 partition of tbl for values from (minvalue) to (0); > create table tbl_p2 partition of tbl for values from (0) to (maxvalue); > insert into tbl values (-1,-1),(0,0),(1,1),(2,2); > > --when partitionwise aggregate is off > postgres=# SELECT regr_count(b, a) FROM tbl; > regr_count > > 4 > (1 row) > postgres=# SELECT regr_avgx(b, a), regr_avgy(b, a) FROM tbl; > regr_avgx | regr_avgy > ---+--- >0.5 | 0.5 > (1 row) > postgres=# SELECT corr(b, a) FROM tbl; > corr > -- > 1 > (1 row) > > --when partitionwise aggregate is on > postgres=# SET enable_partitionwise_aggregate = true; > SET > postgres=# SELECT regr_count(b, a) FROM tbl; > regr_count > > 0 > (1 row) > postgres=# SELECT regr_avgx(b, a), regr_avgy(b, a) FROM tbl; > regr_avgx | regr_avgy > ---+--- >| > (1 row) > postgres=# SELECT corr(b, a) FROM tbl; > corr > -- > > (1 row) > > Thanks & Regards, > Rajkumar Raghuwanshi > QMG, EnterpriseDB Corporation > -- Jeevan Chalke Technical Architect, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company
Statistical aggregate functions are not working with partitionwise aggregate
Hi, On PG-head, some statistical aggregate functions are not giving correct output when partitionwise aggregate is enabled, while the same works on v11. Below are some examples. CREATE TABLE tbl(a int2,b float4) partition by range(a); create table tbl_p1 partition of tbl for values from (minvalue) to (0); create table tbl_p2 partition of tbl for values from (0) to (maxvalue); insert into tbl values (-1,-1),(0,0),(1,1),(2,2); --when partitionwise aggregate is off postgres=# SELECT regr_count(b, a) FROM tbl; regr_count 4 (1 row) postgres=# SELECT regr_avgx(b, a), regr_avgy(b, a) FROM tbl; regr_avgx | regr_avgy ---+--- 0.5 | 0.5 (1 row) postgres=# SELECT corr(b, a) FROM tbl; corr -- 1 (1 row) --when partitionwise aggregate is on postgres=# SET enable_partitionwise_aggregate = true; SET postgres=# SELECT regr_count(b, a) FROM tbl; regr_count 0 (1 row) postgres=# SELECT regr_avgx(b, a), regr_avgy(b, a) FROM tbl; regr_avgx | regr_avgy ---+--- | (1 row) postgres=# SELECT corr(b, a) FROM tbl; corr -- (1 row) Thanks & Regards, Rajkumar Raghuwanshi QMG, EnterpriseDB Corporation
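For readers unfamiliar with partial aggregation: `regr_count(Y, X)` counts rows where neither argument is NULL; under partitionwise or parallel aggregation each partition/worker accumulates its own transition state, and a combine step merges the states before the final result. If the combine step loses the partial counts, the query returns 0 exactly as in the report above. A pure-Python illustration of the intended semantics (not the C implementation):

```python
# regr_count under partial aggregation: per-partition transition states
# plus a combine step.  The data mirrors the 4-row example from the
# report, where the correct answer is 4.

def transition(state, y, x):
    """Advance one partial state by one row: count non-NULL (y, x) pairs."""
    return state + 1 if (y is not None and x is not None) else state

def combine(state_a, state_b):
    """Merge two partial states; for a count this is simple addition."""
    return state_a + state_b

partitions = [
    [(-1, -1)],                 # tbl_p1
    [(0, 0), (1, 1), (2, 2)],   # tbl_p2
]

partial_states = []
for rows in partitions:
    state = 0
    for y, x in rows:
        state = transition(state, y, x)
    partial_states.append(state)

total = 0
for state in partial_states:
    total = combine(total, state)
```

With correct combining the total is 4; the reported bug corresponds to the finalize step seeing empty or mis-read partial states, yielding 0.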
Re: Unhappy about API changes in the no-fsm-for-small-rels patch
On Fri, May 3, 2019 at 11:43 AM John Naylor wrote: > On Thu, May 2, 2019 at 4:57 PM Amit Kapila wrote: > > On Thu, May 2, 2019 at 12:39 PM John Naylor > > wrote: > > > > > Can you please test/review? > > There isn't enough time. But since I already wrote some debugging > calls earlier (attached), I gave it a brief spin, I found this patch > isn't as careful as HEAD making sure we don't try the same block twice > in a row. If you insert enough tuples into an empty table such that we > need to extend, you get something like this: > > DEBUG: Not enough space on block 0 > DEBUG: Now trying block 0 > DEBUG: Not enough space on block 0 > DEBUG: Updating local map for block 0 > > At this point, I'm sorry to say, but I'm in favor of reverting. > Fair enough. I think we have tried to come up with a patch for an alternative approach, but it needs time. I will revert this tomorrow. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: pg_upgrade --clone error checking
On 2019-05-02 20:03, Jeff Janes wrote: > It looks like it was designed for early checking, it just wasn't placed > early enough. So changing it is pretty easy, as check_file_clone does > not need to be invented, and there is no additional code duplication > over what was already there. > > This patch moves the checking to near the beginning. I think the reason it was ordered that way is that it wants to do all the checks of the old cluster before doing any checks touching the new cluster. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH v1] Show whether tables are logged in \dt+
On Sat, 27 Apr 2019 at 06:18, David Fetter wrote: > > On Fri, Apr 26, 2019 at 04:22:18PM +0200, Rafia Sabih wrote: > > On Fri, 26 Apr 2019 at 14:49, Rafia Sabih wrote: > > > > > > On Wed, 24 Apr 2019 at 10:30, Fabien COELHO wrote: > > > > > > > > > > > > Hello David, > > > > > > > > >>> I noticed that there wasn't a bulk way to see table logged-ness in > > > > >>> psql, > > > > >>> so I made it part of \dt+. > > > > >> > > > > >> Applies, compiles, works for me. > > > > >> > > > > >> ISTM That temporary-ness is not shown either. Maybe the persistence > > > > >> column > > > > >> should be shown as is? > > > > > > > > > > Temporariness added, but not raw. > > > > > > > > Ok, it is better like this way. > > > > > > > > >> Tests? > > > > > > > > > > Included, but they're not stable for temp tables. I'm a little stumped > > > > > as to how to either stabilize them or test some other way. > > > > > > > > Hmmm. First there is the username which appears, so there should be a > > > > dedicated user for the test. > > > > > > > > I'm unsure how to work around the temporary schema number, which is > > > > undeterministic with parallel execution it. I'm afraid the only viable > > > > approach is not to show temporary tables, too bad:-( > > > > > > > > >> Doc? > > > > > > > > > > What further documentation does it need? > > > > > > > > Indeed, there is no precise doc, so nothing to update :-)/:-( > > > > > > > > > > > > Maybe you could consider adding a case for prior 9.1 version, something > > > > like: > > > >... case c.relistemp then 'temporary' else 'permanent' end as ... > > > > > > > > > > > I was reviewing this patch and found a bug, > > > > > > create table t (i int); > > > create index idx on t(i); > > > \di+ > > > psql: print.c:3452: printQuery: Assertion `opt->translate_columns == > > > ((void *)0) || opt->n_translate_columns >= cont.ncolumns' failed. 
> > > > Looking into this further, apparently the position of > > > > if (verbose) > > { > > + /* > > + * Show whether the table is permanent, temporary, or unlogged. > > + */ > > + if (pset.sversion >= 91000) > > + appendPQExpBuffer(, > > + ",\n case c.relpersistence when 'p' then 'permanent' when 't' > > then 'temporary' when 'u' then 'unlogged' else 'unknown' end as > > \"%s\"", > > + gettext_noop("Persistence")); > > > > is not right, it is being called for indexes with verbose option also. > > There should be an extra check for it being not called for index case. > > Something like, > > if (verbose) > > { > > /* > > * Show whether the table is permanent, temporary, or unlogged. > > */ > > if (!showIndexes) > > if (pset.sversion >= 91000) > > appendPQExpBuffer(, > > ",\n case c.relpersistence when 'p' then 'permanent' when 't' then > > 'temporary' when 'u' then 'unlogged' else 'unknown' end as \"%s\"", > > gettext_noop("Persistence")); > > > > Not sure, how do modify it in a more neat way. > > I suspect that as this may get a little messier, but I've made it > fairly neat short of a major refactor. > I found the following warning on the compilation, describe.c: In function ‘listTables’: describe.c:3705:7: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation] else if (pset.sversion >= 80100) ^~ describe.c:3710:3: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’ appendPQExpBuffer(, Talking of indentation, you might want to run pgindent once. Other than that the patch looks good to me. -- Regards, Rafia Sabih
Re: Heap lock levels for REINDEX INDEX CONCURRENTLY not quite right?
On 2019-05-02 10:44, Peter Eisentraut wrote: >> so there's a lock upgrade hazard. > Confirmed. Here is a patch along the lines of your sketch. I cleaned up the variable naming a bit too. REINDEX CONCURRENTLY is still deadlock prone because of WaitForOlderSnapshots(), so this doesn't actually fix your test case, but that seems unrelated to this particular issue. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 8a4dcc3e6b849fd72bed29c9308c9a897eec0cc5 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Fri, 3 May 2019 09:15:29 +0200 Subject: [PATCH] Fix table lock levels for REINDEX INDEX CONCURRENTLY REINDEX CONCURRENTLY locks tables with ShareUpdateExclusiveLock rather than the ShareLock used by a plain REINDEX. However, RangeVarCallbackForReindexIndex() was not updated for that and still used the ShareLock only. This would lead to lock upgrades later, leading to possible deadlocks. Reported-by: Andres Freund Discussion: https://www.postgresql.org/message-id/flat/20190430151735.wi52sxjvxsjvaxxt%40alap3.anarazel.de --- src/backend/commands/indexcmds.c | 45 +++- 1 file changed, 33 insertions(+), 12 deletions(-) diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index fabba3bacd..acfb2d24c9 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -91,6 +91,15 @@ static bool ReindexRelationConcurrently(Oid relationOid, int options); static void ReindexPartitionedIndex(Relation parentIdx); static void update_relispartition(Oid relationId, bool newval); +/* + * callback argument type for RangeVarCallbackForReindexIndex() + */ +struct ReindexIndexCallbackState +{ + boolconcurrent; /* flag from statement */ + Oid locked_table_oid; /* tracks previously locked table */ +}; + /* * CheckIndexCompatible * Determine whether an existing index definition is compatible with a @@ -2303,8 +2312,8 @@ ChooseIndexColumnNames(List *indexElems) void 
ReindexIndex(RangeVar *indexRelation, int options, bool concurrent) { + struct ReindexIndexCallbackState state; Oid indOid; - Oid heapOid = InvalidOid; Relation irel; char persistence; @@ -2313,11 +2322,13 @@ ReindexIndex(RangeVar *indexRelation, int options, bool concurrent) * obtain lock on table first, to avoid deadlock hazard. The lock level * used here must match the index lock obtained in reindex_index(). */ + state.concurrent = concurrent; + state.locked_table_oid = InvalidOid; indOid = RangeVarGetRelidExtended(indexRelation, concurrent ? ShareUpdateExclusiveLock : AccessExclusiveLock, 0, RangeVarCallbackForReindexIndex, - (void *) &heapOid); + &state); /* * Obtain the current persistence of the existing index. We already hold @@ -2350,7 +2361,15 @@ RangeVarCallbackForReindexIndex(const RangeVar *relation, Oid relId, Oid oldRelId, void *arg) { char relkind; - Oid *heapOid = (Oid *) arg; + struct ReindexIndexCallbackState *state = arg; + LOCKMODE table_lockmode; + + /* +* Lock level here should match table lock in reindex_index() for +* non-concurrent case and table locks used by index_concurrently_*() for +* concurrent case. +*/ + table_lockmode = state->concurrent ? ShareUpdateExclusiveLock : ShareLock; /* * If we previously locked some other index's heap, and the name we're @@ -2359,9 +2378,8 @@ RangeVarCallbackForReindexIndex(const RangeVar *relation, */ if (relId != oldRelId && OidIsValid(oldRelId)) { - /* lock level here should match reindex_index() heap lock */ - UnlockRelationOid(*heapOid, ShareLock); - *heapOid = InvalidOid; + UnlockRelationOid(state->locked_table_oid, table_lockmode); + state->locked_table_oid = InvalidOid; } /* If the relation does not exist, there's nothing more to do. */ @@ -2389,14 +2407,17 @@ RangeVarCallbackForReindexIndex(const RangeVar *relation, /* Lock heap before index to avoid deadlock. */ if (relId != oldRelId) { + Oid table_oid = IndexGetRelation(relId, true); + /* -* Lock level
Re: Heap lock levels for REINDEX INDEX CONCURRENTLY not quite right?
On 2019-05-02 22:42, Andres Freund wrote: > RangeVarGetRelidExtended() can call the callback multiple times, if > there are any concurrent schema changes. That's why it's unlocking the > previously locked heap oid. Ah that explains it then. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Identity columns should own only one sequence
On Thu, 2019-05-02 at 22:43 +0200, Peter Eisentraut wrote: > On 2019-04-29 18:28, Laurenz Albe wrote: > > I still think that there is merit to Michael's idea of removing > > sequence "ownership" (which is just a dependency) when the DEFAULT > > on the column is dropped, but this approach is possibly cleaner. > I think the proper way to address this would be to create some kind of > dependency between the sequence and the default. That is certainly true. But that's hard to retrofit into existing databases, so it would probably be a modification that is not backpatchable. Yours, Laurenz Albe
Re: Unhappy about API changes in the no-fsm-for-small-rels patch
On Thu, May 2, 2019 at 4:57 PM Amit Kapila wrote: > > On Thu, May 2, 2019 at 12:39 PM John Naylor > wrote: > > > Okay, I have updated the patch to incorporate your changes and call > relcache invalidation at required places. I have updated comments at a > few places as well. The summarization of this patch is that (a) it > moves the local map to relation cache (b) performs the cache > invalidation whenever we create fsm (either via backend or vacuum), > when some space in a page is freed by vacuum (provided fsm doesn't > exist) or whenever the local map is cleared (c) additionally, we clear > the local map when we found a block from FSM, when we have already > tried all the blocks present in cache or when we are allowed to create > FSM. > > If we agree on this, then we can update the README accordingly. > > Can you please test/review? There isn't enough time. But since I already wrote some debugging calls earlier (attached), I gave it a brief spin, I found this patch isn't as careful as HEAD making sure we don't try the same block twice in a row. If you insert enough tuples into an empty table such that we need to extend, you get something like this: DEBUG: Not enough space on block 0 DEBUG: Now trying block 0 DEBUG: Not enough space on block 0 DEBUG: Updating local map for block 0 At this point, I'm sorry to say, but I'm in favor of reverting. There just wasn't enough time to redesign and debug a feature like this during feature freeze. -- John Naylorhttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services debug-free-space.patch Description: Binary data
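John's debug trace shows the invariant being violated: after block 0 fails to fit a tuple, the local map hands back block 0 again. The rule the map must keep is that a failed block's bit gets cleared so the next lookup cannot return it before the relation is extended. A toy version of that bookkeeping — illustrative only, not the PostgreSQL implementation:

```python
# Toy "local map" for a small relation without an FSM: a bitmap of
# candidate blocks.  Once a block fails to fit a tuple, its bit is
# cleared, so next_candidate() can never return the same block twice
# in a row -- the property violated in the debug trace above.

class LocalMap:
    def __init__(self, nblocks):
        self.try_block = [True] * nblocks

    def next_candidate(self):
        """Return the lowest block still worth trying, or None."""
        for blk, available in enumerate(self.try_block):
            if available:
                return blk
        return None  # caller must extend the relation

    def mark_full(self, blk):
        """Record that the insert did not fit on this block."""
        self.try_block[blk] = False

m = LocalMap(2)
tried = []
blk = m.next_candidate()
while blk is not None:
    tried.append(blk)
    m.mark_full(blk)        # pretend every block was too full
    blk = m.next_candidate()
```

Here each block is visited exactly once before the map is exhausted and extension becomes necessary — the behaviour HEAD was careful about and the reverted patch lost.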