Re: function for testing that causes the backend to terminate
On Thu, Apr 29, 2021 at 4:36 PM Bharath Rupireddy wrote:
>
> On Thu, Apr 29, 2021 at 4:27 PM Dave Cramer wrote:
> > For testing unusual situations I'd like to be able to cause a backend to
> > terminate due to something like a segfault. Do we currently have this in
> > testing ?
>
> Well, you could use pg_terminate_backend which sends SIGTERM to the
> backend. However, we don't have a function that sends SIGSEGV yet; you
> could signal the backend with SIGSEGV directly, if possible.

And I came across an extension called pg_crash [1], see if that helps.

[1] https://www.cybertec-postgresql.com/en/pg_crash-crashing-postgresql-automatically/

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
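For what it's worth, the "signal the backend with SIGSEGV directly" route can be done today from a psql session, assuming shell access on the host where the server runs (the session below is only illustrative; the PID will differ):

```
-- Find the PID of the backend you want to crash (here, our own session):
SELECT pg_backend_pid();
 pg_backend_pid
----------------
          16926

-- Send SIGSEGV to it; psql's \! runs a shell command on the client host,
-- so this only works when client and server share a host:
\! kill -SEGV 16926

-- The postmaster then reports a crashed backend and reinitializes,
-- terminating all other sessions as well.
```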
Re: Enhanced error message to include hint messages for redundant options error
On Sat, May 1, 2021 at 10:43 AM Bharath Rupireddy wrote:
>
> On Fri, Apr 30, 2021 at 2:49 PM Dilip Kumar wrote:
> > Looking into this again, why not as shown below? IMHO, this way the
> > code will be logically the same as it was before the patch; basically,
> > why process an extra statement ( *volatility_item = defel;) if we
> > have already decided to error?
>
> I changed my mind given the concerns raised on removing the goto
> statements. We could just do as below:

Okay, that makes sense.

> diff --git a/src/backend/commands/functioncmds.c
> b/src/backend/commands/functioncmds.c
> index 9548287217..1f1c74c379 100644
> --- a/src/backend/commands/functioncmds.c
> +++ b/src/backend/commands/functioncmds.c
> @@ -575,7 +575,7 @@ compute_common_attribute(ParseState *pstate,
>  duplicate_error:
>  	ereport(ERROR,
>  			(errcode(ERRCODE_SYNTAX_ERROR),
> -			 errmsg("conflicting or redundant options"),
> +			 errmsg("option \"%s\" specified more than once", defel->defname),
>  			 parser_errposition(pstate, defel->location)));
>  	return false;	/* keep compiler quiet */
>
> I'm not attaching the above one-line change as a patch; maybe Vignesh can
> merge it into the main patch.

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Enhanced error message to include hint messages for redundant options error
On Fri, Apr 30, 2021 at 2:49 PM Dilip Kumar wrote:
> Looking into this again, why not as shown below? IMHO, this way the
> code will be logically the same as it was before the patch; basically,
> why process an extra statement ( *volatility_item = defel;) if we
> have already decided to error?

I changed my mind given the concerns raised on removing the goto
statements. We could just do as below:

diff --git a/src/backend/commands/functioncmds.c
b/src/backend/commands/functioncmds.c
index 9548287217..1f1c74c379 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -575,7 +575,7 @@ compute_common_attribute(ParseState *pstate,
 duplicate_error:
 	ereport(ERROR,
 			(errcode(ERRCODE_SYNTAX_ERROR),
-			 errmsg("conflicting or redundant options"),
+			 errmsg("option \"%s\" specified more than once", defel->defname),
 			 parser_errposition(pstate, defel->location)));
 	return false;	/* keep compiler quiet */

I'm not attaching the above one-line change as a patch; maybe Vignesh can
merge it into the main patch.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: Log enhancement for aclcheck permissions failures
On Sat, May 1, 2021 at 5:26 AM Bingyu Shen wrote:
> Hi hackers,
>
> I was wondering if we can improve the error messages for acl permission
> failures. The current implementation to report errors is in
> "backend/catalog/aclchk.c":
>
> void aclcheck_error(AclResult aclerr, ObjectType objtype, const char *objectname);
>
> Based on the AclResult type, it prints log messages like
> "permission denied for schema %s",
> which tells the admins what the domain of the permission denial could be,
> like a table name or schema name.
>
> However, I find that the log messages *lack* more details, i.e., the
> *exact permission* that causes the denial. Novice users
> may end up over-granting permissions to fix the issues
> and cause a security vulnerability in the database.
>
> I think the log messages can be better if we add some diagnostic
> information like which *role* is denied and what *permission* it lacks.
> This way the users know which permission to grant exactly,
> without trial and error.

I think it's easy for users (even novice ones) to know exactly what
permission they are lacking by just looking at the query. The permissions
we have in parsenodes.h with the ACL_ prefix are quite clear and can be
understood from the type of query. So, I don't think printing that obvious
information in the log message is an improvement we would want to make. To
know the current role with which the query is run, users can use
CURRENT_ROLE or pg_roles.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: New assertion failed in lazy_scan_heap() on mereswine
On Sat, May 01, 2021 at 04:43:21PM +1200, Thomas Munro wrote:
> TRAP: FailedAssertion("!all_visible_according_to_vm || prunestate.all_visible", File:
> "/home/pgsql/build-farm/buildroot/HEAD/pgsql.build/../pgsql/src/backend/access/heap/vacuumlazy.c",
> Line: 1347, PID: 16926)
> 2021-04-30 04:15:50.317 PDT [10349:18] DETAIL: Failed process was
> running: autovacuum: VACUUM ANALYZE pg_catalog.pg_attribute
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2021-04-30%2006:43:27

The same assertion was reported here:
https://www.postgresql.org/message-id/flat/OS0PR01MB611340CBD300A7C4FD6B6101FB5F9%40OS0PR01MB6113.jpnprd01.prod.outlook.com

|TRAP: FailedAssertion("!all_visible_according_to_vm || prunestate.all_visible", File: "/home/pgsql/build-farm/buildroot/HEAD/pgsql.build/../pgsql/src/backend/access/heap/vacuumlazy.c", Line: 1347, PID: 16926)
|2021-04-30 04:15:50.317 PDT [10349:18] DETAIL: Failed process was running: autovacuum: VACUUM ANALYZE pg_catalog.pg_attribute

And Michael added it to the open items:
https://wiki.postgresql.org/index.php?title=PostgreSQL_14_Open_Items&type=revision&diff=35954&oldid=35952

--
Justin
Re: New assertion failed in lazy_scan_heap() on mereswine
On Sat, May 1, 2021 at 10:14 AM Thomas Munro wrote:
>
> Hi,
>
> TRAP: FailedAssertion("!all_visible_according_to_vm || prunestate.all_visible", File:
> "/home/pgsql/build-farm/buildroot/HEAD/pgsql.build/../pgsql/src/backend/access/heap/vacuumlazy.c",
> Line: 1347, PID: 16926)
>
> 2021-04-30 04:15:50.317 PDT [10349:18] DETAIL: Failed process was
> running: autovacuum: VACUUM ANALYZE pg_catalog.pg_attribute

There is another thread [1] that has reported the same issue and also
provided a script to reproduce it.

[1] https://www.postgresql.org/message-id/os0pr01mb611340cbd300a7c4fd6b6101fb...@os0pr01mb6113.jpnprd01.prod.outlook.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
New assertion failed in lazy_scan_heap() on mereswine
Hi,

TRAP: FailedAssertion("!all_visible_according_to_vm || prunestate.all_visible", File:
"/home/pgsql/build-farm/buildroot/HEAD/pgsql.build/../pgsql/src/backend/access/heap/vacuumlazy.c",
Line: 1347, PID: 16926)

2021-04-30 04:15:50.317 PDT [10349:18] DETAIL: Failed process was
running: autovacuum: VACUUM ANALYZE pg_catalog.pg_attribute

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2021-04-30%2006:43:27
Re: Transactions involving multiple postgres foreign servers, take 2
On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu wrote:
>
> Hi,
> For v35-0007-Prepare-foreign-transactions-at-commit-time.patch:

Thank you for reviewing the patch!

> With this commit, the foreign server modified within the transaction marked
> as 'modified'.
>
> transaction marked -> transaction is marked

Will fix.

> +#define IsForeignTwophaseCommitRequested() \
> +    (foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
>
> Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro
> should be named: IsForeignTwophaseCommitRequired.

But even if foreign_twophase_commit is FOREIGN_TWOPHASE_COMMIT_REQUIRED,
two-phase commit is not used if there is only one modified server, right?
The name IsForeignTwophaseCommitRequested seems fine to me.

> +static bool
> +checkForeignTwophaseCommitRequired(bool local_modified)
>
> +        if (!ServerSupportTwophaseCommit(fdw_part))
> +            have_no_twophase = true;
> ...
> +    if (have_no_twophase)
> +        ereport(ERROR,
>
> It seems the error case should be reported within the loop. This way, we
> don't need to iterate over the other participant(s).
> Accordingly, nserverswritten should be incremented for the local server
> prior to the loop. The condition in the loop would become
> if (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
> have_no_twophase is no longer needed.

Hmm, if we process a server that is not capable of two-phase commit first
and then process another server that is capable of it, we should raise an
error but could not detect that.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: Granting control of SUSET gucs to non-superusers
On Fri, 30 Apr 2021 at 22:00, Mark Dilger wrote:
> Viewing all of this in terms of which controls allow the tenant to escape
> a hypothetical sandbox seems like the wrong approach. Shouldn't we let
> service providers decide which controls would allow the tenant to escape
> the specific sandbox the provider has designed?

I'm not even sure I should be mentioning this possibility, but what if we
made each GUC parameter a grantable privilege? I'm honestly not sure
whether this is insane or not. Numerically it's a lot of privileges, but
conceptually it's relatively simple.

What I like least about it is the idea of giving up entirely on the notion
of grouping privileges into reasonable packages: some of these privileges
would be quite safe to grant in many or even most circumstances, while
others would usually not be reasonable to grant.
Re: Granting control of SUSET gucs to non-superusers
> On Apr 30, 2021, at 4:28 PM, Stephen Frost wrote:
>
> "Can't be used to gain superuser" may be a sufficiently clear grouping, as
> was more or less contemplated by the "admin" approach. If that doesn't work
> though then we need an understanding of what the limits on these groups are,
> so we can competently fit new GUCs into these groups (or invent new ones if a
> new GUC truly falls outside all existing, but I would expect that to be a
> rather rare case..).

When I first heard that providers want to build sandboxes around
PostgreSQL, I thought the idea was a little silly, because providers can
just spin up a virtual machine per tenant and give each tenant superuser
privileges on their respective VM. Who cares if they mess it up after
that? The problem with that idea turns out to be that the providers want
to take responsibility for some of the database maintenance, possibly
including backups, replication, etc.

I think the set of controls the provider hands over to the tenant will
depend very much on the division of responsibility. If the provider is
managing replication, then control over session_replication_role and
wal_compression is unlikely to be handed to the tenant, but if the tenant
is responsible for their own replication scheme, it might be.

Viewing all of this in terms of which controls allow the tenant to escape
a hypothetical sandbox seems like the wrong approach. Shouldn't we let
service providers decide which controls would allow the tenant to escape
the specific sandbox the provider has designed?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Transactions involving multiple postgres foreign servers, take 2
On Tue, Apr 27, 2021 at 10:03 AM Masahiro Ikeda wrote:
>
> On 2021/03/17 12:03, Masahiko Sawada wrote:
> > I've attached the updated version patch set.
>
> Thanks for updating the patches! I'm now restarting my review of 2PC
> because I'd like to use this feature in PG15.

Thank you for reviewing the patch! Much appreciated.

> I think the following logic of resolving and removing the fdwxact entries
> by the transaction resolver needs to be fixed.
>
> 1. check if pending fdwxact entries exist
>
> HoldInDoubtFdwXacts() checks if there are entries which satisfy the
> condition (InvalidBackendId and so on). After that it gets the indexes of
> the fdwxacts array. The fdwXactLock is released at the end of this phase.
>
> 2. resolve and remove the entries held in the 1st phase.
>
> ResolveFdwXacts() resolves the status of each fdwxact entry using the
> indexes. At the end of resolution, the transaction resolver removes the
> entry from the fdwxacts array via remove_fdwact().
>
> The entry is removed as follows. Since removal works by index, the indexes
> obtained in the 1st phase are no longer meaningful afterwards:
>
>     /* Remove the entry from active array */
>     FdwXactCtl->num_fdwxacts--;
>     FdwXactCtl->fdwxacts[i] = FdwXactCtl->fdwxacts[FdwXactCtl->num_fdwxacts];
>
> This seems to lead to resolving unexpected fdwxact entries, and it can
> cause the following assertion failure. That's how I noticed. For example,
> there is the case in which a backend inserts a new fdwxact entry into the
> free space which the resolver removed an entry from right before, and the
> resolver then accesses the new entry, which doesn't need to be resolved
> yet, because it uses the indexes collected in the 1st phase.
>
>     Assert(fdwxact->locking_backend == MyBackendId);
>
> The simple solution is to hold fdwXactLock exclusively all the way from
> the beginning of the 1st phase to the end of the 2nd phase. But I worry
> that the performance impact would become too big...
>
> I came up with two solutions, although there may be better ones.
>
> A. remove resolved entries at once, after resolution of all held entries
> is finished
>
> This way, we don't need to take the exclusive lock for a long time. But
> this has other problems: pg_remove_foreign_xact() can still remove
> entries, and we need to handle resolution failures.
>
> I wondered whether we can solve the first problem by introducing a new
> lock, like a "removing" lock, such that only the processes which hold the
> lock can remove entries. The performance impact is limited since insertion
> of fdwxact entries is not blocked by this lock. And the second problem can
> be solved using a try-catch block.
>
> B. merge the 1st and 2nd phases
>
> Now the resolver resolves the entries together; that's the reason why it's
> difficult to remove them. So executing checking, resolving, and removing
> per entry seems to solve the problem. I think this is better since it's
> simpler than A. If I'm missing something, please let me know.

It seems to me that solution B would be simpler and better. I'll try to
fix this issue using solution B and rebase the patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: RFE: Make statistics robust for unplanned events
On Thu, Apr 22, 2021 at 3:35 PM Tom Lane wrote:
> Peter Geoghegan writes:
> > We already *almost* pay the full cost of durably storing the
> > information used by autovacuum.c's relation_needs_vacanalyze() to
> > determine if a VACUUM is required -- we're only missing
> > new_dead_tuples/tabentry->n_dead_tuples. Why not go one tiny baby step
> > further to fix this issue?
>
> Definitely worth thinking about, but I'm a little confused about how
> you see this working. Those pg_class fields are updated by vacuum
> (or analyze) itself. How could they usefully serve as input to
> autovacuum's decisions?

Honestly, the details weren't very well thought out. My basic point
still stands, which is that it shouldn't be *that* expensive to make the
relevant information crash-safe, which would protect the system from
certain pathological cases. Maybe it doesn't even have to be crash-safe
in the way that pg_class.reltuples is -- something approximate might
work quite well.

Assuming that there are no dead tuples after a crash seems rather naive.
I seem to recall that certain init scripts I saw years ago used
immediate shutdown mode very frequently. Stuff like that is bound to
happen in some installations, and so we should protect users from
hard-to-foresee extreme consequences. Sure, using immediate shutdown
like that isn't recommended practice, but that's no reason to allow a
nasty failure mode.

--
Peter Geoghegan
Re: Granting control of SUSET gucs to non-superusers
Stephen Frost writes:
> On Fri, Apr 30, 2021 at 19:19 Mark Dilger wrote:
>> PostgreSQL defines a number of GUCs that can only be set by superusers. I
>> would like to support granting privileges on subsets of these to
>> non-superuser roles, inspired by Stephen Frost's recent work on
>> pg_read_all_data and pg_write_all_data roles.

> New predefined roles are relatively inexpensive. That said, whatever sets
> we define need to have some meaning to them- one which is reasonably
> future-proofed so that we have some idea what category a new GUC would fit
> into.

> "Can't be used to gain superuser" may be a sufficiently clear grouping, as
> was more or less contemplated by the "admin" approach. If that doesn't
> work though then we need an understanding of what the limits on these
> groups are, so we can competently fit new GUCs into these groups (or invent
> new ones if a new GUC truly falls outside all existing, but I would expect
> that to be a rather rare case..).

> We may also wish to keep some GUCs superuser only when they really only
> make sense to be used in a developer context...

Hmm, is there really any point in that? We already have roles like
"pg_write_server_files" and "pg_execute_server_program", which allow
trivial escalation to superuser if one wishes, but are still useful as
being roles you're a bit less likely to break your database with
accidentally than running as full superuser.

So ISTM that "pg_set_superuser_parameters" could be treated as being one
of those same sorts of roles that you don't give out to untrusted
people, and then we don't have to worry about exactly which GUCs might
be exactly how dangerous to whom.

If we try to define it as being some lesser level of privilege than
that, I'm afraid we will spend lots of not-very-productive time trying
to classify the security threats from different GUCs ... and they all
have *some* security issue involved, or they wouldn't be restricted in
the first place. Plus, I'm not looking forward to having to issue CVEs
when we realize we misclassified something.

			regards, tom lane
Re: Granting control of SUSET gucs to non-superusers
On 04/30/21 19:19, Mark Dilger wrote:
> We could certainly debate which GUCs could be used to escape the sandbox
> vs. which ones could not, but I would prefer a design that allows the
> provider to make that determination.

I find myself wondering how many GUCs flagged SUSET are not flagged that
way because of a determination already made that they could be used to
escape. (Maybe some of the logging ones, only usable to conceal your
escape.)

But there might be ways for a provider, scrutinizing each of those
individually, to conclude "this will not allow escape from the sandbox
/I/ have set up, provided the value being set satisfies constraints x
and y" ... a generalization of the LOAD from $libdir/plugins idea.

So that suggests to me some mechanism where a provider could grant
setting foo to role bar using validator baz().

Can SUSET GUCs be set from SECURITY DEFINER functions? Maybe there are
already the pieces to do that, minus some syntax sugar.

Regards,
-Chap
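As a sketch of that SECURITY DEFINER route (all names here are hypothetical, and the allowed-values check is just a stand-in for the "validator baz()" idea — not a vetted security pattern):

```
-- Owned by a superuser; SECURITY DEFINER makes the body run with the
-- owner's rights, so the SUSET GUC can be set. The body enforces the
-- provider's chosen constraints before setting anything.
CREATE FUNCTION set_log_statement(val text) RETURNS void
LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
    IF val NOT IN ('none', 'ddl') THEN   -- provider-chosen constraint
        RAISE EXCEPTION 'value "%" not permitted', val;
    END IF;
    PERFORM set_config('log_statement', val, false);
END;
$$;

-- Only the tenant role (here called "bar") may call it:
REVOKE ALL ON FUNCTION set_log_statement(text) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION set_log_statement(text) TO bar;
```

The usual SECURITY DEFINER caveats (pinning search_path, etc.) would apply; the point is only that the pieces for per-GUC, per-role, validated grants arguably exist already.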
Log enhancement for aclcheck permissions failures
Hi hackers,

I was wondering if we can improve the error messages for acl permission
failures. The current implementation to report errors is in
"backend/catalog/aclchk.c":

void aclcheck_error(AclResult aclerr, ObjectType objtype, const char *objectname);

Based on the AclResult type, it prints log messages like
"permission denied for schema %s",
which tells the admins what the domain of the permission denial could be,
like a table name or schema name.

However, I find that the log messages *lack* more details, i.e., the
*exact permission* that causes the denial. Novice users may end up
over-granting permissions to fix the issues and cause a security
vulnerability in the database.

I think the log messages can be better if we add some diagnostic
information like which *role* is denied and what *permission* it lacks.
This way the users know which permission to grant exactly, without trial
and error.

It is not hard to improve the log messages after looking into the code.
Most places use the function aclcheck_error() right after the permission
check, e.g., pg_type_aclcheck(), pg_tablespace_aclcheck(). For example,
in backend/commands/dbcommands.c, it checks whether the user has CREATE
permission:

    aclresult = pg_tablespace_aclcheck(dst_deftablespace, GetUserId(),
                                       ACL_CREATE);
    if (aclresult != ACLCHECK_OK)
        aclcheck_error(aclresult, OBJECT_TABLESPACE, tablespacename);

We can simply change the aclcheck_error() function parameters a bit so
that we can pass the exact permission to the function and tell the users
exactly why the permission is denied. Something like:

    void aclcheck_error(AclResult aclerr, ObjectType objtype,
                        const char *objectname, const char *privilegename);

Any thoughts would be appreciated. Thanks!

Best regards,
Bingyu
Re: Granting control of SUSET gucs to non-superusers
Greetings,

On Fri, Apr 30, 2021 at 19:19 Mark Dilger wrote:
> PostgreSQL defines a number of GUCs that can only be set by superusers. I
> would like to support granting privileges on subsets of these to
> non-superuser roles, inspired by Stephen Frost's recent work on
> pg_read_all_data and pg_write_all_data roles.

There's been some effort started in this direction which I was working on
(see the patches about an "admin" role and set of GUCs). I have been
meaning to get back to that, but the specific concern I had was about
coming up with how to define the proper set of GUCs.

> The specific use case motivating this work is that of a PostgreSQL service
> provider. The provider guarantees certain aspects of the service, such as
> periodic backups, replication, uptime, availability, etc., while making no
> guarantees of other aspects, such as performance associated with the design
> of the schema or the queries executed. The provider should be able to
> grant to the tenant privileges to set any GUC which cannot be used to
> "escape the sandbox" and interfere with the handful of metrics being
> guaranteed. Given that the guarantees made by one provider may differ from
> those made by another, the exact set of GUCs which the provider allows the
> tenant to control may differ.
>
> By my count, there are currently 50 such GUCs, already broken down into 15
> config groups. Creating a single new role pg_set_all_gucs seems much too
> coarse a control, but creating 50 new groups may be excessive. We could
> certainly debate which GUCs could be used to escape the sandbox vs. which
> ones could not, but I would prefer a design that allows the provider to
> make that determination. The patch I would like to submit would only give
> the provider the mechanism for controlling these things, but would not make
> the security choices for them.
>
> Do folks think it would make sense to create a role per config group?
> Adding an extra 15 default roles seems high to me, but organizing the
> feature this way would make the roles easier to document, because there
> would be a one-to-one correlation between the roles and the config groups.

New predefined roles are relatively inexpensive. That said, whatever sets
we define need to have some meaning to them- one which is reasonably
future-proofed so that we have some idea what category a new GUC would fit
into.

"Can't be used to gain superuser" may be a sufficiently clear grouping, as
was more or less contemplated by the "admin" approach. If that doesn't
work though then we need an understanding of what the limits on these
groups are, so we can competently fit new GUCs into these groups (or invent
new ones if a new GUC truly falls outside all existing, but I would expect
that to be a rather rare case..).

We may also wish to keep some GUCs superuser only when they really only
make sense to be used in a developer context...

Thanks,

Stephen
Granting control of SUSET gucs to non-superusers
Hackers,

PostgreSQL defines a number of GUCs that can only be set by superusers. I
would like to support granting privileges on subsets of these to
non-superuser roles, inspired by Stephen Frost's recent work on
pg_read_all_data and pg_write_all_data roles.

The specific use case motivating this work is that of a PostgreSQL service
provider. The provider guarantees certain aspects of the service, such as
periodic backups, replication, uptime, availability, etc., while making no
guarantees of other aspects, such as performance associated with the design
of the schema or the queries executed. The provider should be able to
grant to the tenant privileges to set any GUC which cannot be used to
"escape the sandbox" and interfere with the handful of metrics being
guaranteed. Given that the guarantees made by one provider may differ from
those made by another, the exact set of GUCs which the provider allows the
tenant to control may differ.

By my count, there are currently 50 such GUCs, already broken down into 15
config groups. Creating a single new role pg_set_all_gucs seems much too
coarse a control, but creating 50 new groups may be excessive. We could
certainly debate which GUCs could be used to escape the sandbox vs. which
ones could not, but I would prefer a design that allows the provider to
make that determination. The patch I would like to submit would only give
the provider the mechanism for controlling these things, but would not make
the security choices for them.

Do folks think it would make sense to create a role per config group?
Adding an extra 15 default roles seems high to me, but organizing the
feature this way would make the roles easier to document, because there
would be a one-to-one correlation between the roles and the config groups.

I have a WIP patch that I'm not attaching, but if I get any feedback, I
might be able to adjust the patch before the first version posted. The
basic idea is that it allows things like:

    GRANT pg_set_stats_monitoring TO tenant_role;

And then tenant_role could, for example:

    SET log_parser_stats TO off;

Thanks

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 2:07 PM Robert Haas wrote:
> OK. I thought about this in regards to zheap, which has this exact
> problem, because it wants to do so-called "in place" updates where the
> new version of the row goes right on top of the old one in the table
> page, and the old version of the row gets written into the undo log.
> Just to keep things simple, we said that initially we'd only use this
> in-place update strategy when no indexed columns were changed, so that
> there's only ever one set of index entries for a given TID. In that
> model, the index AMs don't really need to care that there are actually
> multiple tuples for the same TID, because those tuples differ only in
> columns that the index doesn't care about anyway. An index scan has to
> be careful to fetch the correct version of the tuple, but it has a
> Snapshot available, so it can do that.

Right. So zheap (in the current prototype implementation) is like heapam
with its HOT optimization, except that it isn't subject to the same
limitations with regard to fitting heap tuples on the same heap page to
keep the same HOT chain going over time. You kind of have the moral
equivalent of a HOT chain that can largely live in UNDO. That seems like
a very useful thing on its own. A lot of the problems with HOT are in
this area -- maybe the vast majority, even.

A remaining problem is that we must generate a new round of index tuples
for each and every index when only one indexed column is logically
modified by an UPDATE statement. I think that this is much less of a
problem now due to bottom-up index deletion. Sure, it sucks that we
still have to dirty the page at all. But it's nevertheless true that it
all but eliminates version-driven page splits, which are where almost
all of the remaining downside is. It's very reasonable to now wonder if
this particular all-indexes problem is worth solving at all in light of
that. (Modern hardware characteristics also make a comprehensive fix
less valuable in practice.)

> However, there's no easy and
> efficient way to handle updates and deletes. Suppose for example that
> a tuple has been updated 5 times, creating versions t1..t5. t5 is now
> in the zheap page, and the other versions are in the undo. t5 points
> to t4 which points to t3 and so forth. Now an updater comes along and
> let's say that the updater's snapshot sees t2. It may be that t3..t5
> are *uncommitted* updates in which case the attempt to update t2 may
> succeed if the transaction that performed them aborts, or it may be
> that the updating transactions have committed, in which case we're
> going to have to fail. But that decision isn't made by the scan that
> sees t3; it happens when the TID reaches the ModifyTable node. So what
> zheap ends up doing is finding the right tuple version during the
> scan, by making use of the snapshot, and then having to go repeat that
> work when it's time to try to perform the update. It would be nice to
> avoid this.

I believe that this is another consequence of the fact that Postgres
versions tuples, not pages. This is not a minor theoretical point. It's
very different to what Oracle does. It's almost a necessary consequence
of our basic approach to extensibility, because you can have things like
index tuples whose values are equal but visibly distinct (e.g., the
numeric value '5.0' is equal to but distinct from '5'). It also has a
lot to do with how crash recovery works.

> If we could feed system columns from the scan through to
> the update, we could pass along an undo pointer and avoid the extra
> overhead. So it seems to me, so far anyway, that there's no very
> fundamental problem here, but there is an efficiency issue which we
> could address if we had a bit more planner and executor infrastructure
> to help us out.

FWIW you don't necessarily have to do the EPQ stuff. You could in theory
do a statement-level rollback, and repeat. The EPQ mechanism is unique
to Postgres. Maybe it doesn't matter, but I don't think that it's
essential to follow this in other table AMs.

> Now in the long run the vision for zheap was that we'd eventually want
> to do in-place updates even when indexed columns have been modified,
> and this gets a whole lot trickier, because now there can be multiple
> sets of index entries pointing at the same TID which don't agree on
> the values of the indexed columns.

It's much easier when you have a very simple type system that doesn't
allow differences like my "numeric '5.0' vs '5'" example -- a system
that is built for this from the ground up. If there are meaningful
semantic differences among opclass-equal index tuples, then we can never
assume that index tuples will always be locatable after an update
affecting indexed columns (if only because we need to preserve the '5.0'
and '5' variants in an index on a numeric column).

If we could at least be sure that two index tuples that point to the
same stable/logical zheap TID (in a world where TIDs were stable
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 2:13 PM Jeff Davis wrote:
> FWIW, this is not a problem in my table AM. I am fine having different
> TIDs for each version, just like heapam.

This means that we are largely in agreement about the general nature of
the problem. That seems like a good basis to redefine TID-like
identifiers so that they can accommodate what you want to do.

> For index-organized tables it does seem like an interesting problem.

I strongly suspect that index-organized tables (or indirect indexes, or
anything else that assumes that TID-like identifiers map directly to
logical rows as opposed to physical versions) are going to break too
many assumptions to ever be tractable. Assuming I have that right, it
would advance the discussion if we could all agree on that being a
non-goal for the tableam interface in general. This would allow us to
clearly discuss how to solve the remaining problem of accommodating
column stores and suchlike. That seems hard, but much more tractable.

The fact that the tableam has almost no non-goals has always bothered me
a bit. Especially on this particular point about purely logical TID-like
identifiers.

--
Peter Geoghegan
Re: MaxOffsetNumber for Table AMs
On Fri, 2021-04-30 at 12:29 -0700, Peter Geoghegan wrote: > > Is the problem you're worried about here that, with something like > > an > > index-organized table, you can have multiple row versions that have > > the same logical tuple ID, i.e. primary key value? > > That's what I'm talking about. I'd like to hear what you think about > it. FWIW, this is not a problem in my table AM. I am fine having different TIDs for each version, just like heapam. For index-organized tables it does seem like an interesting problem. Regards, Jeff Davis
Re: pg_amcheck contrib application
On Fri, Apr 30, 2021 at 4:26 PM Mark Dilger wrote: > After further reflection, no other verbiage seems any better. I'd say go > ahead and commit it this way. OK. I'll plan to do that on Monday, barring objections. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 3:30 PM Peter Geoghegan wrote: > > Is the problem you're worried about here that, with something like an > > index-organized table, you can have multiple row versions that have > > the same logical tuple ID, i.e. primary key value? And that the > > interfaces aren't well-suited to that? Because that's a problem I have > > thought about and can comment on, even though I think the question of > > having multiple versions with the same TID is distinguishable from the > > question of how *wide* TIDs should be. But maybe that's not what you > > are talking about here, in which case I guess I need a clearer > > explanation of the concern. > > That's what I'm talking about. I'd like to hear what you think about it. OK. I thought about this in regards to zheap, which has this exact problem, because it wants to do so-called "in place" updates where the new version of the row goes right on top of the old one in the table page, and the old version of the row gets written into the undo log. Just to keep things simple, we said that initially we'd only use this in-place update strategy when no indexed columns were changed, so that there's only ever one set of index entries for a given TID. In that model, the index AMs don't really need to care that there are actually multiple tuples for the same TID, because those tuples differ only in columns that the index doesn't care about anyway. An index scan has to be careful to fetch the correct version of the tuple, but it has a Snapshot available, so it can do that. However, there's no easy and efficient way to handle updates and deletes. Suppose for example that a tuple has been updated 5 times, creating versions t1..t5. t5 is now in the zheap page, and the other versions are in the undo. t5 points to t4 which points to t3 and so forth. Now an updater comes along and let's say that the updater's snapshot sees t2. 
It may be that t3..t5 are *uncommitted* updates in which case the attempt to update t2 may succeed if the transactions that performed them abort, or it may be that the updating transactions have committed, in which case we're going to have to fail. But that decision isn't made by the scan that sees t2; it happens when the TID reaches the ModifyTable node. So what zheap ends up doing is finding the right tuple version during the scan, by making use of the snapshot, and then having to go repeat that work when it's time to try to perform the update. It would be nice to avoid this. If we could feed system columns from the scan through to the update, we could pass along an undo pointer and avoid the extra overhead. So it seems to me, so far anyway, that there's no very fundamental problem here, but there is an efficiency issue which we could address if we had a bit more planner and executor infrastructure to help us out. Now in the long run the vision for zheap was that we'd eventually want to do in-place updates even when indexed columns have been modified, and this gets a whole lot trickier, because now there can be multiple sets of index entries pointing at the same TID which don't agree on the values of the indexed columns. As old row versions die off, some of those pointers need to be cleaned out, and others do not. I thought we might solve this problem by something akin to retail index deletion: have an update or delete on a zheap tuple go re-find the associated index entries and mark them for possible cleanup, and then vacuum can ignore all unmarked tuples. There might be some efficiency problems with this idea I hadn't considered, based on your remarks today. But regardless of the wisdom or folly of this approach, the broader point is that we can't assume that all heap types are going to have the same maintenance requirements. 
I think most of them are going to have some kind of maintenance operation that need to or at least can optionally be performed from time to time, but it might be triggered by completely different criteria than vacuum. New table AMs might well choose to use 64-bit XIDs, avoiding the need for wraparound processing altogether. Maybe they have such good opportunistic cleanup mechanisms that periodic vacuum for bloat isn't even really needed. Maybe they bloat when updates and deletes commit but not when inserts and updates abort, because those cases are handled via some other mechanism. Who knows, really? It's hard to predict what not-yet-written AMs might care about, and even if we knew, it seems crazy to try to rewrite the way vacuum works to cater to those needs before we actually have some working AMs to use as a testbed. It strikes me that certain interesting cases might not really need anything very in-depth here. For example, consider indirect indexes, where the index references the primary key value rather than the TID. Well, the indirect index should probably be vacuumed periodically to prevent bloat, but it doesn't need to be vacuumed to recycle TIDs because it doesn't contain TIDs. BRIN indexes, BTW, also
Re: pg_amcheck contrib application
> On Apr 30, 2021, at 1:04 PM, Mark Dilger wrote: > >> toast value %u was expected to end at chunk %d, but ended while >> expecting chunk %d >> >> i.e. same as the currently-committed code, except for changing "ended >> at" to "ended while expecting." > > I find the grammar of this new formulation anomalous for hard to articulate > reasons not quite the same as but akin to mismatched verb aspect. After further reflection, no other verbiage seems any better. I'd say go ahead and commit it this way. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_amcheck contrib application
> On Apr 30, 2021, at 12:47 PM, Robert Haas wrote: > > Hmm, I think that might need adjustment, actually. What I was trying > to do is compensate for the fact that what we now have is the next > chunk_seq value we expect, not the last one we saw, nor the total > number of chunks we've seen regardless of what chunk_seq they had. But > I thought it would be too confusing to just give the chunk number we > were expecting and not say anything about how many chunks we thought > there would be in total. So maybe what I should do is change it to > something like this: > > toast value %u was expected to end at chunk %d, but ended while > expecting chunk %d > > i.e. same as the currently-committed code, except for changing "ended > at" to "ended while expecting." I find the grammar of this new formulation anomalous for hard to articulate reasons not quite the same as but akin to mismatched verb aspect. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_amcheck contrib application
On Fri, Apr 30, 2021 at 3:41 PM Mark Dilger wrote: > I think that's committable. > > The only nitpick might be > > - psprintf("toast value %u was expected to end > at chunk %d, but ended at chunk %d", > + psprintf("toast value %u index scan ended > early while expecting chunk %d of %d", > > When reporting to users about positions within a zero-based indexing scheme, > what does "while expecting chunk 3 of 4" mean? Is it talking about the last > chunk from the set [0..3] which has cardinality 4, or does it mean the > next-to-last chunk from [0..4] which ends with chunk 4, or what? The prior > language isn't any more clear than what you have here, so I have no objection > to committing this, but the prior language was probably as goofy as it was > because it was trying to deal with this issue. Hmm, I think that might need adjustment, actually. What I was trying to do is compensate for the fact that what we now have is the next chunk_seq value we expect, not the last one we saw, nor the total number of chunks we've seen regardless of what chunk_seq they had. But I thought it would be too confusing to just give the chunk number we were expecting and not say anything about how many chunks we thought there would be in total. So maybe what I should do is change it to something like this: toast value %u was expected to end at chunk %d, but ended while expecting chunk %d i.e. same as the currently-committed code, except for changing "ended at" to "ended while expecting." -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_amcheck contrib application
> On Apr 30, 2021, at 12:29 PM, Robert Haas wrote: > > OK, how about this version? I think that's committable. The only nitpick might be - psprintf("toast value %u was expected to end at chunk %d, but ended at chunk %d", + psprintf("toast value %u index scan ended early while expecting chunk %d of %d", When reporting to users about positions within a zero-based indexing scheme, what does "while expecting chunk 3 of 4" mean? Is it talking about the last chunk from the set [0..3] which has cardinality 4, or does it mean the next-to-last chunk from [0..4] which ends with chunk 4, or what? The prior language isn't any more clear than what you have here, so I have no objection to committing this, but the prior language was probably as goofy as it was because it was trying to deal with this issue. Thoughts? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 12:20 PM Robert Haas wrote: > Why can't it do what it does already? It's not broken for heap, so why > should it be broken for anything else? And why are non-HOT updates > specifically a problem? No reason. > > You obviously cannot have the equivalent of > > duplicate TIDs when your new table AM runs into these scenarios. So > > what do you do instead? How do you make your clustered index/IoT style > > identifiers (i.e. your strictly logical TID-like identifiers) deal > > with that case? > > Is the problem you're worried about here that, with something like an > index-organized table, you can have multiple row versions that have > the same logical tuple ID, i.e. primary key value? And that the > interfaces aren't well-suited to that? Because that's a problem I have > thought about and can comment on, even though I think the question of > having multiple versions with the same TID is distinguishable from the > question of how *wide* TIDs should be. But maybe that's not what you > are talking about here, in which case I guess I need a clearer > explanation of the concern. That's what I'm talking about. I'd like to hear what you think about it. It's not exactly a narrow concern. For one thing, it is enough to totally validate my suggestion about how we might widen TIDs and still have nbtree deduplication. -- Peter Geoghegan
Re: pg_amcheck contrib application
On Fri, Apr 30, 2021 at 3:26 PM Mark Dilger wrote: > It looks mostly good to me. There is a off-by-one error introduced with: > > - else if (chunkno != (endchunk + 1)) > + else if (expected_chunk_seq < last_chunk_seq) > > I think that needs to be > > + else if (expected_chunk_seq <= last_chunk_seq) > > because otherwise it won't complain if the only missing chunk is the very > last one. OK, how about this version? -- Robert Haas EDB: http://www.enterprisedb.com simply-remove-chunkno-concept-v4.patch Description: Binary data
Re: pg_amcheck contrib application
> On Apr 30, 2021, at 11:56 AM, Robert Haas wrote: > > Can you review this version? It looks mostly good to me. There is an off-by-one error introduced with: - else if (chunkno != (endchunk + 1)) + else if (expected_chunk_seq < last_chunk_seq) I think that needs to be + else if (expected_chunk_seq <= last_chunk_seq) because otherwise it won't complain if the only missing chunk is the very last one. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_amcheck contrib application
> On Apr 30, 2021, at 11:56 AM, Robert Haas wrote: > > On Fri, Apr 30, 2021 at 2:31 PM Mark Dilger > wrote: >> Just to be clear, I did not use your patch v1 as the starting point. > > I thought that might be the case, but I was trying to understand what > you didn't like about my version, and comparing them seemed like a way > to figure that out. > >> I took the code as committed to master as the starting point, used your >> corruption report verbiage changes and at least some of your variable naming >> choices, but did not use the rest, in large part because it didn't work. It >> caused corruption messages to be reported against tables that have no >> corruption. For that matter, your v2 patch doesn't work either, and in the >> same way. To wit: >> >> heap table "postgres"."pg_catalog"."pg_rewrite", block 6, offset 4, >> attribute 7: >> toast value 13461 chunk 0 has size 1995, but expected size 1996 >> >> I think there is something wrong with the way you are trying to calculate >> and use extsize, because I'm not corrupting pg_catalog.pg_rewrite. You can >> get these same results by applying your patch to master, building, and >> running 'make check' from src/bin/pg_amcheck/ > > Argh, OK, I didn't realize. Should be fixed in this version. > >>> 4. You don't return if chunk_seq > last_chunk_seq. That seems wrong, >>> because we cannot compute a sensible expected size in that case. I >>> think your code will subtract a larger value from a smaller one and, >>> this being unsigned arithmetic, say that the expected chunk size is >>> something gigantic. >> >> Your conclusion is probably right, but I think your analysis is based on a >> misreading of what "last_chunk_seq" means. It's not the last one seen, but >> the last one expected. (Should we rename the variable to avoid confusion?) >> It won't compute a gigantic size. Rather, it will expect *every* chunk with >> chunk_seq >= last_chunk_seq to have whatever size is appropriate for the >> last chunk. 
> > I realize it's the last one expected. That's the point: we don't have > any expectation for the sizes of chunks higher than the last one we > expected to see. If the value is 2000 bytes and the chunk size is 1996 > bytes, we expect chunk 0 to be 1996 bytes and chunk 1 to be 4 bytes. > If not, we can complain. But it makes no sense to complain about chunk > 2 being of a size we don't expect. We don't expect it to exist in the > first place, so we have no notion of what size it ought to be. > >> If we have seen any chunks, the variable is holding the expected next chunk >> seq, which is one greater than the last chunk seq we saw. >> >> If we expect chunks 0..3 and see chunk 0 but not chunk 1, it will complain >> ..."expected to end at chunk 4, but ended at chunk 1". This is clearly by >> design and not merely a bug, though I tend to agree with you that this is a >> strange wording choice. I can't remember exactly when and how we decided to >> word the message this way, but it has annoyed me for a while, and I assumed >> it was something you suggested a while back, because I don't recall doing >> it. Either way, since you seem to also be bothered by this, I agree we >> should change it. > > Can you review this version? > > -- > Robert Haas > EDB: http://www.enterprisedb.com > As requested off-list, here are NOT FOR COMMIT, WIP patches for testing only. The first patch allows toast tables to be updated and adds regression tests of corrupted toasted attributes. I never quite got deletes from toast tables to work, and there are probably other gotchas still lurking even with inserts and updates, but it limps along well enough for testing pg_amcheck. The second patch updates the expected output of pg_amcheck to match the verbiage that you suggested upthread. 
v1-0001-Adding-modify-toast-and-test-pg_amcheck.patch.WIP Description: Binary data v1-0002-Modifying-toast-corruption-test-expected-output.patch.WIP Description: Binary data — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 2:23 PM Peter Geoghegan wrote: > I don't know how it's possible to do any of this without first > addressing what the table AM does in cases where heapam currently does > a non-HOT update. Why can't it do what it does already? It's not broken for heap, so why should it be broken for anything else? And why are non-HOT updates specifically a problem? > You obviously cannot have the equivalent of > duplicate TIDs when your new table AM runs into these scenarios. So > what do you do instead? How do you make your clustered index/IoT style > identifiers (i.e. your strictly logical TID-like identifiers) deal > with that case? Is the problem you're worried about here that, with something like an index-organized table, you can have multiple row versions that have the same logical tuple ID, i.e. primary key value? And that the interfaces aren't well-suited to that? Because that's a problem I have thought about and can comment on, even though I think the question of having multiple versions with the same TID is distinguishable from the question of how *wide* TIDs should be. But maybe that's not what you are talking about here, in which case I guess I need a clearer explanation of the concern. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_amcheck contrib application
On Fri, Apr 30, 2021 at 2:31 PM Mark Dilger wrote: > Just to be clear, I did not use your patch v1 as the starting point. I thought that might be the case, but I was trying to understand what you didn't like about my version, and comparing them seemed like a way to figure that out. > I took the code as committed to master as the starting point, used your > corruption report verbiage changes and at least some of your variable naming > choices, but did not use the rest, in large part because it didn't work. It > caused corruption messages to be reported against tables that have no > corruption. For that matter, your v2 patch doesn't work either, and in the > same way. To wit: > > heap table "postgres"."pg_catalog"."pg_rewrite", block 6, offset 4, > attribute 7: > toast value 13461 chunk 0 has size 1995, but expected size 1996 > > I think there is something wrong with the way you are trying to calculate and > use extsize, because I'm not corrupting pg_catalog.pg_rewrite. You can get > these same results by applying your patch to master, building, and running > 'make check' from src/bin/pg_amcheck/ Argh, OK, I didn't realize. Should be fixed in this version. > > 4. You don't return if chunk_seq > last_chunk_seq. That seems wrong, > > because we cannot compute a sensible expected size in that case. I > > think your code will subtract a larger value from a smaller one and, > > this being unsigned arithmetic, say that the expected chunk size is > > something gigantic. > > Your conclusion is probably right, but I think your analysis is based on a > misreading of what "last_chunk_seq" means. It's not the last one seen, but > the last one expected. (Should we rename the variable to avoid confusion?) > It won't compute a gigantic size. Rather, it will expect *every* chunk with > chunk_seq >= last_chunk_seq to have whatever size is appropriate for the last > chunk. I realize it's the last one expected. 
That's the point: we don't have any expectation for the sizes of chunks higher than the last one we expected to see. If the value is 2000 bytes and the chunk size is 1996 bytes, we expect chunk 0 to be 1996 bytes and chunk 1 to be 4 bytes. If not, we can complain. But it makes no sense to complain about chunk 2 being of a size we don't expect. We don't expect it to exist in the first place, so we have no notion of what size it ought to be. > If we have seen any chunks, the variable is holding the expected next chunk > seq, which is one greater than the last chunk seq we saw. > > If we expect chunks 0..3 and see chunk 0 but not chunk 1, it will complain > ..."expected to end at chunk 4, but ended at chunk 1". This is clearly by > design and not merely a bug, though I tend to agree with you that this is a > strange wording choice. I can't remember exactly when and how we decided to > word the message this way, but it has annoyed me for a while, and I assumed > it was something you suggested a while back, because I don't recall doing it. > Either way, since you seem to also be bothered by this, I agree we should > change it. Can you review this version? -- Robert Haas EDB: http://www.enterprisedb.com simply-remove-chunkno-concept-v3.patch Description: Binary data
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 11:23 AM Robert Haas wrote: > On Fri, Apr 30, 2021 at 2:05 PM Peter Geoghegan wrote: > > I agree in principle, but making that work well is very hard in > > practice because of the format of IndexTuple -- which bleeds into > > everything. That TID is special is probably a natural consequence of > > the fact that we don't have an offset-based format of the kind you see > > in other DB systems -- systems that don't emphasize extensibility. We > > cannot jump to a hypothetical TID attribute inexpensively inside code > > like _bt_compare() because we don't have a cheap way to jump straight > > to the datum for any attribute. So we just store TID in IndexTuple > > directly instead. Imagine how much more expensive VACUUM would be if > > it had to grovel through the IndexTuple format. > > I can't imagine that, so maybe you want to enlighten me? I see that > there's a potential problem there, and I'm glad you pointed it out > because I hadn't thought about it previously ... but if you always put > the column or columns that VACUUM would need first, it's not obvious > to me that it would be all that expensive. Maybe. The point is that it is a problem that needs to be solved. > Deforming the tuple to a > sufficient degree to extract the first column, which would even be > fixed-width, shouldn't take much work. I think that it's reasonable to impose some cost on index AMs here, but that needs to be bounded sensibly and unambiguously. For example, it would probably be okay if you had either 6 byte or 8 byte TIDs, but no other variations. You could require index AMs (the subset of index AMs that are ever able to store 8 byte TIDs) to directly encode which width they're dealing with at the level of each IndexTuple. That would create some problems for nbtree deduplication, especially in boundary cases, but ISTM that you can manage the complexity by sensibly restricting how the TIDs work across the board. 
For example, the TIDs should always work like unsigned integers -- the table AM must be willing to work with that restriction. You'd then have posting list tuples in nbtree whose TIDs were all either 6 bytes or 8 bytes wide, with a mix of each possible (though not particularly likely) on the same leaf page. Say when you have a table that exceeds the current MaxBlockNumber restrictions. It would be relatively straightforward for nbtree deduplication to simply refuse to mix 6 byte and 8 byte datums together to avoid complexity in boundary cases. The deduplication pass logic has the flexibility that this requires already. > > I wonder how the same useful performance characteristics can be > maintained with a variable-width TID design. If you solve the problem > by changing IndexTuple, then you are kind of obligated to not use > varlena headers to keep the on-disk size manageable. Who knows where > it all ends? > > What's wrong with varlena headers? It would end up being a 1-byte > header in practically every case, and no variable-width representation > can do without a length word of some sort. I'm not saying varlena is > as efficient as some new design could hypothetically be, but it > doesn't seem like it'd be a big enough problem to stress about. If you > used a variable-width representation for integers, you might actually > save bytes in a lot of cases. An awful lot of the TIDs people store in > practice probably contain several zero bytes, and if we make them > wider, that's going to be even more true. Maybe all of this is true, and maybe it works out to be the best path forward in the long term, all things considered. But whether or not that's true is crucially dependent on what real practical table AMs (of which there will only ever be a tiny number) actually need to do. Why should we assume that the table AM cannot accept some restrictions? What good does it do to legalistically define the problem as a problem for index AMs to solve? 
-- Peter Geoghegan
Re: MaxOffsetNumber for Table AMs
On Fri, 2021-04-30 at 13:56 -0400, Robert Haas wrote: > I think that would be the best long-term plan. We should still have *some* answer in the short term for table AM authors like me. If I use offset numbers as high as MaxOffsetNumber, then itemptr_to_uint64 will fail. If I base my calculations for the TID to row number mapping on MaxOffsetNumber at all, then it will break if we change MaxOffsetNumber (as was suggested[1]). My takeaway so far is that the only safe thing to do is hard code it to 2000. I suppose I can do that until we settle on something better (at which point I can force a reindex, I suppose). [1] https://postgr.es/m/caeze2wit1ethhbhj+cyvbpthrwuzu2vqc-bmzu3apk3ioth...@mail.gmail.com > Even though these are distinguishable concerns, they basically point > in the same direction as far as index layout is concerned. The > implications for the table AM layer are somewhat different in the two > cases, but both argue that some places that are now talking about > TIDs > should be changed to talk about Datums or something of that sort. Logically, that makes a lot of sense to me. Peter seems to have quite a few practical implementation concerns though, so it could be a long road. Regards, Jeff Davis
Re: pg_upgrade test for binary compatibility of core data types
On Sat, Mar 06, 2021 at 03:01:43PM -0500, Tom Lane wrote: > Peter Eisentraut writes: > > On 2021-01-12 22:44, Andrew Dunstan wrote: > >> Cross version pg_upgrade is tested regularly in the buildfarm, but not > >> using test.sh. Instead it uses the saved data repository from a previous > >> run of the buildfarm client for the source branch, and tries to upgrade > >> that to the target branch. > > > Does it maintain a set of fixups similar to what is in test.sh? Are > > those two sets the same? > > Responding to Peter: the first answer is yes, the second is I didn't > check, but certainly Justin's patch makes them closer. Right - I had meant to send this. https://github.com/PGBuildFarm/client-code/blob/master/PGBuild/Modules/TestUpgradeXversion.pm $opsql = 'drop operator if exists public.=> (bigint, NONE)'; .. my $missing_funcs = q{drop function if exists public.boxarea(box); drop function if exists public.funny_dup17(); .. my $prstmt = join(';', 'drop operator if exists #@# (bigint,NONE)', 'drop operator if exists #%# (bigint,NONE)', 'drop operator if exists !=- (bigint,NONE)', .. $prstmt = join(';', 'drop operator @#@ (NONE, bigint)', .. 'drop aggregate if exists public.array_cat_accum(anyarray)', > I spent some time poking through this set of patches. I agree that > there's problem(s) here that we need to solve, but it feels like this > isn't a great way to solve them. What I see in the patchset is: For starters, is there a "release beta checklist" ? Testing test.sh should be on it. So should fuzz testing. > v4-0001 mostly teaches test.sh about specific changes that have to be > made to historic versions of the regression database to allow them > to be reloaded into current servers. As already discussed, this is > really duplicative of knowledge that's been embedded into the buildfarm > client over time. It'd be better if we could refactor that so that > the buildfarm shares a common database of these actions with test.sh. 
> And said database ought to be in our git tree, so committers could > fix problems without having to get Andrew involved every time. > I think this could be represented as a psql script, at least in > versions that have psql \if (but that came in in v10, so maybe > we're there already). I started this. I don't know if it's compatible with the buildfarm client, but I think any issues maybe can be avoided by using "IF EXISTS". > v4-0002 is a bunch of random changes that mostly seem to revert hacky > adjustments previously made to improve test coverage. I don't really > agree with any of these, nor see why they're necessary. If they > are necessary then we need to restore the coverage somewhere else. > Admittedly, the previous changes were a bit hacky, but deleting them > (without even bothering to adjust the relevant comments) isn't the > answer. It was necessary to avoid --wal-segsize and -g to allow testing upgrades from versions which don't support those options. I think test.sh should be portable back to all supported versions. When those options were added, it broke test.sh upgrading from old versions. I changed this to a shell conditional for the "new" features: | "$1" -N -A trust ${oldsrc:+--wal-segsize 1 -g} Ideally it would check the version. > v4-0003 is really the heart of the matter: it adds a table with some > previously-not-covered datatypes plus a query that purports to make sure > that we are covering all types of interest. Actually the 'manytypes' table intends to include *all* core datatypes itself, not just those that aren't included somewhere else. I think "included somewhere else" depends on the order of the regression tests, and type_sanity runs early, so the table might need to include many types that are created later, to avoid "false positives" in the associated test. > But I'm not sure I believe > that query. It's got hard-wired assumptions about which typtype values > need to be covered. Why is it okay to exclude range and multirange? 
> Are we sure that all composites are okay to exclude? Likewise, the > restriction to pg_catalog and information_schema schemas seems likely to > bite us someday. There are some very random exclusions based on name > patterns, which seem unsafe (let's list the specific type OIDs), and > again the nearby comments don't match the code. But the biggest issue > is that this can only cover core datatypes, not any contrib stuff. I changed to use regtype/OIDs, included range/multirange and stopped including only pg_catalog/information_schema. But didn't yet handle composites. > I don't know what we could do about contrib types. Maybe we should > figure that covering core types is already a step forward, and be > happy with getting that done. Right .. this is meant
Re: pg_amcheck contrib application
> On Apr 30, 2021, at 9:39 AM, Robert Haas wrote: > > On Mon, Apr 26, 2021 at 1:52 PM Mark Dilger > wrote: >> The attached patch changes amcheck corruption reports as discussed upthread. >> This patch is submitted for the v14 development cycle as a bug fix, per >> your complaint that the committed code generates reports sufficiently >> confusing to a user as to constitute a bug. >> >> All other code refactoring and additional checks discussed upthread are >> reserved for the v15 development cycle and are not included here. >> >> The minimal patch (not attached) that does not rename any variables is 135 >> lines. Your patch was 159 lines. The patch (attached) which includes your >> variable renaming is 174 lines. > > Hi, > > I have compared this against my version. I found the following differences: Just to be clear, I did not use your patch v1 as the starting point. I took the code as committed to master as the starting point, used your corruption report verbiage changes and at least some of your variable naming choices, but did not use the rest, in large part because it didn't work. It caused corruption messages to be reported against tables that have no corruption. For that matter, your v2 patch doesn't work either, and in the same way. To wit: heap table "postgres"."pg_catalog"."pg_rewrite", block 6, offset 4, attribute 7: toast value 13461 chunk 0 has size 1995, but expected size 1996 I think there is something wrong with the way you are trying to calculate and use extsize, because I'm not corrupting pg_catalog.pg_rewrite. You can get these same results by applying your patch to master, building, and running 'make check' from src/bin/pg_amcheck/ > 1. This version passes last_chunk_seq rather than extsize to > check_toast_tuple(). But this results in having to call > VARATT_EXTERNAL_GET_EXTSIZE() inside that function. I thought it was > nicer to do that in the caller, so that we don't do it twice. 
I don't see that VARATT_EXTERNAL_GET_EXTSIZE() is worth too much concern, given that it is just a struct access and a bit mask. You are avoiding calculating that twice, but at the expense of calculating last_chunk_seq twice, which involves division. I don't think the division can be optimized as a mere bit shift, since TOAST_MAX_CHUNK_SIZE is not in general a power of two. (For example, on my laptop it is 1996.) I don't say this to nitpick at the performance one way vs. the other. I doubt it makes any real difference. I'm just confused why you want to change this particular thing right now, given that it is not a bug. > 2. You fixed some out-of-date comments. Yes, because they were wrong. That's on me. I failed to update them in a prior patch. > 3. You move the test for an unexpected chunk sequence further down in > the function. I don't see the point; Relative to your patch, perhaps. Relative to master, no tests have been moved. > I had put it by the related null > check, and still think that's better. You also deleted my comment /* > Either the TOAST index is corrupt, or we don't have all chunks. */ > which I would have preferred to keep. That's fine. I didn't mean to remove it. I was just taking a minimalist approach to constructing the patch. > 4. You don't return if chunk_seq > last_chunk_seq. That seems wrong, > because we cannot compute a sensible expected size in that case. I > think your code will subtract a larger value from a smaller one and, > this being unsigned arithmetic, say that the expected chunk size is > something gigantic. Your conclusion is probably right, but I think your analysis is based on a misreading of what "last_chunk_seq" means. It's not the last one seen, but the last one expected. (Should we rename the variable to avoid confusion?) It won't compute a gigantic size. Rather, it will expect *every* chunk with chunk_seq >= last_chunk_seq to have whatever size is appropriate for the last chunk.
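To make the chunk arithmetic concrete, here is a small standalone sketch. EXAMPLE_TOAST_MAX_CHUNK_SIZE and the function names are illustrative stand-ins, not amcheck's actual code; the real TOAST_MAX_CHUNK_SIZE is derived from the block size at build time:

```c
#include <assert.h>
#include <stdint.h>

/* Example value only: the real TOAST_MAX_CHUNK_SIZE is computed from the
 * block size at build time and is not, in general, a power of two. */
#define EXAMPLE_TOAST_MAX_CHUNK_SIZE 1996

/* Sequence number of the last chunk expected for a toast value of the
 * given external size.  Plain division: it cannot be a bit shift when
 * the chunk size is not a power of two. */
static uint32_t
last_chunk_seq(uint32_t extsize)
{
    return (extsize - 1) / EXAMPLE_TOAST_MAX_CHUNK_SIZE;
}

/* Expected size of a chunk: full-sized except for the last one, and any
 * chunk at or past last_chunk_seq gets the last chunk's expected size. */
static uint32_t
expected_chunk_size(uint32_t extsize, uint32_t chunk_seq)
{
    if (chunk_seq < last_chunk_seq(extsize))
        return EXAMPLE_TOAST_MAX_CHUNK_SIZE;
    return extsize - last_chunk_seq(extsize) * EXAMPLE_TOAST_MAX_CHUNK_SIZE;
}
```

For a 1997-byte value this expects chunk 0 to be 1996 bytes and chunk 1 (the last) to be 1 byte, which matches the "every chunk with chunk_seq >= last_chunk_seq" behavior described above.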
> Returning and not issuing that complaint at all > seems better. That might be best. I had been resisting that because I don't want the extraneous chunks to be reported without chunk size information. When debugging corrupted toast, it may be interesting to know the size of the extraneous chunks. If there are 1000 extra chunks, somebody might want to see the sizes of them. > 5. You fixed the incorrect formula I had introduced for the expected > size of the last chunk. Not really. I just didn't introduce any change in that area. > 6. You changed the variable name in check_toasted_attribute() from > expected_chunkno to chunkno, and initialized it later in the function > instead of at declaration time. I don't find this to be an > improvement; I think I just left the variable name and its initialization unchanged. > including the word "expected" seems to me to be > substantially clearer. But I think I should have gone with > expected_chunk_seq for better consistency. I agree that is a better name.
Re: MaxOffsetNumber for Table AMs
On Fri, 2021-04-30 at 10:50 -0700, Peter Geoghegan wrote: > I don't know. This conversation is still too abstract for me to be > able to take a firm position. ISTM that we tend to talk about the > table AM in excessively abstract terms. It would be a lot easier if > we > had clear fixed goals for a small number of additional table AMs. https://github.com/citusdata/citus/tree/master/src/backend/columnar My colleagues and I have been working on a "columnar" table AM. It doesn't currently support indexes, but it would be useful to support them. The basic idea is we have "stripes" of ~150000 tuples that are rearranged and compressed, and stored in an smgr-controlled file that goes through the buffer cache and uses generic WAL. To support indexes, we could do our own lookups from a "row number" to a particular offset where we can find and decompress the stripe that holds that row number, and then scan forward in the stripe to find the particular row. This will be terrible for random access, but [*waves hands*] we will keep state and use a few optimizations so that this is not terribly slow for clustered access. Granted, TID lookup on columnar will be much slower than for a heap (and we can use a CustomScan so that the costs reflect that). But it will satisfy important use cases: 1. Indexes defined on partition parent tables. Even if the index is never used for queries, we don't want to throw an error when defining the partitioned parent index. 2. Unique indexes and exclusion constraints. 3. Clustered index scans can still be reasonably fast. 4. Could be used for UPDATE/DELETE as well. > More generally, it seems like a good idea to try to make new table > AMs > reasonably close to heapam insofar as possible. The reality is that > everything evolved around heapam, and that that's likely to matter in > all kinds of ways that nobody fully understands just yet. Agreed. I think of this as an evolving situation where we take steps toward a better abstraction.
One (hopefully reasonable) step I'd like to take is a well-specified TID. Regards, Jeff Davis
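The row-number-to-stripe lookup described above could be sketched like this (the struct fields and function name are hypothetical, not the actual citus columnar code; a real AM would go on to decompress the found stripe and scan forward to the row):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-stripe metadata: which row numbers the stripe holds
 * and where the compressed stripe data lives in the file. */
typedef struct StripeEntry
{
    uint64_t first_row;   /* row number of the stripe's first tuple */
    uint64_t row_count;   /* number of tuples in this stripe */
    uint64_t file_offset; /* where to read/decompress the stripe */
} StripeEntry;

/* Binary-search a stripe list (sorted by first_row) for the stripe that
 * contains row_number; returns NULL if the row is out of range. */
static const StripeEntry *
find_stripe(const StripeEntry *stripes, size_t nstripes, uint64_t row_number)
{
    size_t lo = 0;
    size_t hi = nstripes;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (row_number < stripes[mid].first_row)
            hi = mid;
        else if (row_number >= stripes[mid].first_row + stripes[mid].row_count)
            lo = mid + 1;
        else
            return &stripes[mid];
    }
    return NULL;
}
```

This is the cheap part; the cost Jeff alludes to is in decompressing the stripe and scanning forward, which is why clustered access (reusing the already-decompressed stripe) matters so much.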
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 2:05 PM Peter Geoghegan wrote: > I agree in principle, but making that work well is very hard in > practice because of the format of IndexTuple -- which bleeds into > everything. That TID is special is probably a natural consequence of > the fact that we don't have an offset-based format of the kind you see > in other DB systems -- systems that don't emphasize extensibility. We > cannot jump to a hypothetical TID attribute inexpensively inside code > like _bt_compare() because we don't have a cheap way to jump straight > to the datum for any attribute. So we just store TID in IndexTuple > directly instead. Imagine how much more expensive VACUUM would be if > it had to grovel through the IndexTuple format. I can't imagine that, so maybe you want to enlighten me? I see that there's a potential problem there, and I'm glad you pointed it out because I hadn't thought about it previously ... but if you always put the column or columns that VACUUM would need first, it's not obvious to me that it would be all that expensive. Deforming the tuple to a sufficient degree to extract the first column, which would even be fixed-width, shouldn't take much work. > I wonder how the same useful performance characteristics can be > maintained with a variable-width TID design. If you solve the problem > by changing IndexTuple, then you are kind of obligated to not use > varlena headers to keep the on-disk size manageable. Who knows where > it all ends? What's wrong with varlena headers? It would end up being a 1-byte header in practically every case, and no variable-width representation can do without a length word of some sort. I'm not saying varlena is as efficient as some new design could hypothetically be, but it doesn't seem like it'd be a big enough problem to stress about. If you used a variable-width representation for integers, you might actually save bytes in a lot of cases. 
An awful lot of the TIDs people store in practice probably contain several zero bytes, and if we make them wider, that's going to be even more true. -- Robert Haas EDB: http://www.enterprisedb.com
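As a rough illustration of the space argument, a simple variable-length integer encoding (sketched here in the LEB128 style; not any actual PostgreSQL on-disk format) stores values with many leading zero bytes in just a byte or two, at the cost of needing 7 bytes for a value that uses all 48 bits:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal unsigned LEB128-style encoder: 7 payload bits per byte, high
 * bit set on every byte except the final one.  Returns bytes written. */
static int
encode_varint(uint64_t value, unsigned char *out)
{
    int n = 0;

    do
    {
        unsigned char b = value & 0x7F;

        value >>= 7;
        if (value != 0)
            b |= 0x80;          /* continuation bit */
        out[n++] = b;
    } while (value != 0);
    return n;
}
```

A TID for block 0, offset 1 would take a single byte under this scheme instead of six, while a value needing the full 48 bits would take seven; the win depends entirely on small values being the common case, which is Robert's point about zero bytes.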
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 10:56 AM Robert Haas wrote: > I think that would be the best long-term plan. I guess I have two > distinguishable concerns. One is that I want to be able to have > indexes with a payload that's not just a 6-byte TID; e.g. adding a > partition identifier to support global indexes, or replacing the > 6-byte TID with a primary key reference to support indirect indexes. > The other concern is to be able to have table AMs that use arbitrary > methods to identify a tuple. For example, if somebody implemented an > index-organized table, the "TID" would really be the primary key. > > Even though these are distinguishable concerns, they basically point > in the same direction as far as index layout is concerned. The > implications for the table AM layer are somewhat different in the two > cases, but both argue that some places that are now talking about TIDs > should be changed to talk about Datums or something of that sort. I don't know how it's possible to do any of this without first addressing what the table AM does in cases where heapam currently does a non-HOT update. You obviously cannot have the equivalent of duplicate TIDs when your new table AM runs into these scenarios. So what do you do instead? How do you make your clustered index/IoT style identifiers (i.e. your strictly logical TID-like identifiers) deal with that case? ISTM that this is by far the biggest issue with generalizing the table AM for use by a tableam (that isn't already very much like heapam). I am always surprised to be the one that has to point it out during these discussions. It's a huge issue. -- Peter Geoghegan
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 10:39 AM Robert Haas wrote: > I agree up to a point but ... are you imagining that the TID continues > to have its own special place in the page, while the partition > identifier is stored more like a regular tuple column? Because it > seems to me that it would be better to try to eliminate the > special-storage case, just like we did for OIDs. I agree in principle, but making that work well is very hard in practice because of the format of IndexTuple -- which bleeds into everything. That TID is special is probably a natural consequence of the fact that we don't have an offset-based format of the kind you see in other DB systems -- systems that don't emphasize extensibility. We cannot jump to a hypothetical TID attribute inexpensively inside code like _bt_compare() because we don't have a cheap way to jump straight to the datum for any attribute. So we just store TID in IndexTuple directly instead. Imagine how much more expensive VACUUM would be if it had to grovel through the IndexTuple format. I wonder how the same useful performance characteristics can be maintained with a variable-width TID design. If you solve the problem by changing IndexTuple, then you are kind of obligated to not use varlena headers to keep the on-disk size manageable. Who knows where it all ends? -- Peter Geoghegan
Re: Procedures versus the "fastpath" API
On Fri, Apr 30, 2021 at 12:57 PM Tom Lane wrote: > By my count, we have three votes for forbidding procedure calls via > fastpath in all branches (me, Joe, Michael), and two for doing > something laxer (Noah, Laurenz). The former is surely the safer > choice, so I'm going to go do that. FWIW, I'm also for the stricter approach. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 1:37 PM Jeff Davis wrote: > The particular problem I have now is that Table AMs seem to support > indexes just fine, but TIDs are under-specified so I don't know what I > really have to work with. BlockNumber seems well-specified as > 0..0xFFFFFFFE (inclusive), but I don't know what the valid range of > OffsetNumber is for the purposes of a TableAM. I agree that this is a problem. > Part of changing to uint64 would be specifying the TIDs in a way that I > could rely on in the future. I mean, from my perspective, the problem here is that the abstraction layer is leaky and things outside of the table AM layer know what heap is doing under the hood, and rely on it. If we could refactor the abstraction to be less leaky, it would be clearer what assumptions table AM authors can make. If we can't, any specification doesn't seem worth much. > In the future we may support primary unique indexes at the table AM > layer, which would get more interesting. I can see an argument for a > TID being an arbitrary datum in that case, but I haven't really > considered the design implications. Is this what you are suggesting? I think that would be the best long-term plan. I guess I have two distinguishable concerns. One is that I want to be able to have indexes with a payload that's not just a 6-byte TID; e.g. adding a partition identifier to support global indexes, or replacing the 6-byte TID with a primary key reference to support indirect indexes. The other concern is to be able to have table AMs that use arbitrary methods to identify a tuple. For example, if somebody implemented an index-organized table, the "TID" would really be the primary key. Even though these are distinguishable concerns, they basically point in the same direction as far as index layout is concerned. 
The implications for the table AM layer are somewhat different in the two cases, but both argue that some places that are now talking about TIDs should be changed to talk about Datums or something of that sort. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Fri, 2021-04-30 at 12:35 -0400, Tom Lane wrote: > ISTM that would be up to the index AM. We'd need some interlocks on > which index AMs could be used with which table AMs in any case, I > think. I'm not sure why? It seems like we should be able to come up with something that's generic enough. > I think the hard part may really be in places like tidbitmap.c, which > one would wish to be AM-independent, but right now it's quite > specific > to heap-style TIDs. Maybe we can think of a way to parameterize it. For my particular AM, being able to have a parameterized granularity might be nice, but not required. Regards, Jeff Davis
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 10:04 AM Jeff Davis wrote: > On Fri, 2021-04-30 at 08:36 -0700, Peter Geoghegan wrote: > > Compatibility with index AMs is more than a matter of switching out > > the tuple identifier -- if you invent something that has totally > > different performance characteristics for index AMs, then it's likely > > to break tacit assumptions about the cost of certain things. > > I think it would be reasonable to document and expect that table AMs > offer some locality of access for tuples with similar IDs. Do you think > we need something stronger than that? I don't know. This conversation is still too abstract for me to be able to take a firm position. ISTM that we tend to talk about the table AM in excessively abstract terms. It would be a lot easier if we had clear fixed goals for a small number of additional table AMs. > > Plus deduplication ameliorates problems with version churn > > in > > indexes -- presumably the same problems will exist when any new table > > AM is used, and so it'll be useful to address the same problems in > > the > > same way. > > I got lost after "presumably the same problems", can you explain? Well, there are now two things in nbtree that specifically help with version churn caused by non-HOT updates: deduplication, and bottom-up index deletion (especially the latter). Presumably any new table AM will have something like non-HOT updates. Though they may rarely be a problem (say because the new table AM isn't really for OLTP), whenever they are a problem they'll be a very big problem. It seems like a good idea to have the same index AM level protections against accumulating version-churn index tuples in an unbounded way. More generally, it seems like a good idea to try to make new table AMs reasonably close to heapam insofar as possible. The reality is that everything evolved around heapam, and that that's likely to matter in all kinds of ways that nobody fully understands just yet. 
We have a somewhat abstract table AM interface, which is good, but that doesn't mean that table AMs can be designed as if it was a green field situation. -- Peter Geoghegan
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 1:28 PM Peter Geoghegan wrote: > Global indexes should work by adding an extra column that is somewhat > like a TID, that may even have its own pg_attribute entry. It's much > more natural to make the partition number a separate column IMV -- > nbtree suffix truncation and deduplication can work in about the same > way as before. Plus you'll need to do predicate pushdown using the > partition identifier in some scenarios anyway. You can make the > partition identifier variable-width without imposing the cost and > complexity of variable-width TIDs on index AMs. I agree up to a point but ... are you imagining that the TID continues to have its own special place in the page, while the partition identifier is stored more like a regular tuple column? Because it seems to me that it would be better to try to eliminate the special-storage case, just like we did for OIDs. If you want a 6-byte TID, put a 6-byte column in the tuple for it. If you also want a partition identifier, put an extra column in the tuple for that. If you want a wider TID or a varlena TID, well, put columns for that into the tuple instead of the 6-byte column you'd normally put. This seems extremely flexible and a lot more aesthetically appealing than what we have today. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Fri, 2021-04-30 at 12:51 -0400, Robert Haas wrote: > There are two major reasons why I want variable-width tuple IDs. One > is global indexes, where you need as many bits as the AMs > implementing > the partitions need, plus some extra bits to identify which partition > is relevant for a particular tuple. No fixed number of bits that you > make available can ever be sufficient here, because a global index > always needs to have extra bits compared to a partition-local index; > if you let the partition-local index use more bits, the global index > now needs even more space. The other is indirect indexes, work Álvaro > proposed a few years ago, where the index entry points to the primary > key value rather than a TID. The space needs for that are based on > the > type of the primary key column. This proposal solves neither of those > problems. The particular problem I have now is that Table AMs seem to support indexes just fine, but TIDs are under-specified so I don't know what I really have to work with. BlockNumber seems well-specified as 0..0xFFFFFFFE (inclusive), but I don't know what the valid range of OffsetNumber is for the purposes of a TableAM. Part of changing to uint64 would be specifying the TIDs in a way that I could rely on in the future. The problems you mention above are above the table AM layer, so they seem orthogonal. There would still need to be an ID that table AMs can use to fetch a tuple from a particular physical table. In the future we may support primary unique indexes at the table AM layer, which would get more interesting. I can see an argument for a TID being an arbitrary datum in that case, but I haven't really considered the design implications. Is this what you are suggesting? Regards, Jeff Davis
Re: Use simplehash.h instead of dynahash in SMgr
Hi David, You're probably aware of this, but just to make it explicit: Jakub Wartak was testing performance of recovery, and one of the bottlenecks he found in some of his cases was dynahash as used by SMgr. It seems quite possible that this work would benefit some of his test workloads. He last posted about it here: https://postgr.es/m/vi1pr0701mb69608cbce44d80857e59572ef6...@vi1pr0701mb6960.eurprd07.prod.outlook.com though the fraction of dynahash-from-SMgr is not as high there as in some of other reports I saw. -- Álvaro Herrera Valdivia, Chile
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 1:10 PM Tom Lane wrote: > I agree that global indexes need more bits, but it doesn't necessarily > follow that we must have variable-width TIDs. We could for example > say that "real" TIDs are only 48 bits and index AMs that want to be > usable as global indexes must be capable of handling 64-bit TIDs, > leaving 16 bits for partition ID. A more forward-looking definition > would require global index AMs to store 96 bits (partition OID plus > 64-bit TID). Either way would be far simpler for every moving part > involved than going over to full varlena TIDs. 16 bits is not much for a partition identifier. We've already had complaints about INNER_VAR being too small, so apparently there are people who want to use really large numbers of partitions. But even if we imagine a hypothetical world where nobody uses more than a couple thousand partitions at once, it's very reasonable to want to avoid recycling partition identifiers so that detaching a partition can be O(1), and there's no way that's going to be viable if the whole address space is only 16 bits, because with time series data people are going to be continually creating new partitions and dropping old ones. I would guess that it probably is viable with 32 bits, but we'd have to have a mapping layer rather than using the OID directly to avoid wraparound collisions. Now this problem can be avoided by just requiring the AM to store more bits, exactly as you say. I suspect 96 bits is large enough for all of the practical use cases people have, or at least within spitting distance. But it strikes me as relatively inefficient to say that we're always going to store 96 bits for every TID. I certainly don't think we want to break on-disk compatibility and widen every existing btree index by changing all the 6-byte TIDs they're storing now to store 12 bytes TIDs that are at least half zero bytes, so I think we're bound to end up with at least two options: 6 and 12. 
But variable-width would be a lot nicer. You could store small TIDs and small partition identifiers very compactly, and only use the full number of bytes when the situation demands it. > > What problem do you think this proposal does solve? > > Accommodating table AMs that want more than 48 bits for a TID. > We're already starting to run up against the fact that that's not > enough bits for plausible use-cases. 64 bits may someday in the far > future not be enough either, but I think that's a very long way off. Do people actually want to store more than 2^48 rows in a table, or is this more about the division of a TID into a block number and an item number? -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 10:10 AM Tom Lane wrote: > > There are two major reasons why I want variable-width tuple IDs. One > > is global indexes, where you need as many bits as the AMs implementing > > the partitions need, plus some extra bits to identify which partition > > is relevant for a particular tuple. No fixed number of bits that you > > make available can ever be sufficient here, > > I agree that global indexes need more bits, but it doesn't necessarily > follow that we must have variable-width TIDs. We could for example > say that "real" TIDs are only 48 bits and index AMs that want to be > usable as global indexes must be capable of handling 64-bit TIDs, > leaving 16 bits for partition ID. A more forward-looking definition > would require global index AMs to store 96 bits (partition OID plus > 64-bit TID). Either way would be far simpler for every moving part > involved than going over to full varlena TIDs. The question of how the on-disk format on indexes needs to be changed to accomodate global indexes seems like an entirely separate question to how we go about expanding or redefining TIDs. Global indexes should work by adding an extra column that is somewhat like a TID, that may even have its own pg_attribute entry. It's much more natural to make the partition number a separate column IMV -- nbtree suffix truncation and deduplication can work in about the same way as before. Plus you'll need to do predicate pushdown using the partition identifier in some scenarios anyway. You can make the partition identifier variable-width without imposing the cost and complexity of variable-width TIDs on index AMs. I believe that the main reason why there have been so few problems with any of the nbtree work in the past few releases is that it avoided certain kinds of special cases. Any special cases in the on-disk format and in the space accounting used when choosing a split point ought to be avoided at all costs. 
We can probably afford to add a lot of complexity to make global indexes work, but it ought to be contained to cases that actually use global indexes in an obvious way. -- Peter Geoghegan
Re: MaxOffsetNumber for Table AMs
Robert Haas writes: > On Fri, Apr 30, 2021 at 11:06 AM Tom Lane wrote: >> Andres seems to feel that we should try to allow variable-width >> tupleids, but I'm afraid that the cost/benefit ratio for that >> would be pretty poor. > There are two major reasons why I want variable-width tuple IDs. One > is global indexes, where you need as many bits as the AMs implementing > the partitions need, plus some extra bits to identify which partition > is relevant for a particular tuple. No fixed number of bits that you > make available can ever be sufficient here, I agree that global indexes need more bits, but it doesn't necessarily follow that we must have variable-width TIDs. We could for example say that "real" TIDs are only 48 bits and index AMs that want to be usable as global indexes must be capable of handling 64-bit TIDs, leaving 16 bits for partition ID. A more forward-looking definition would require global index AMs to store 96 bits (partition OID plus 64-bit TID). Either way would be far simpler for every moving part involved than going over to full varlena TIDs. > Another problem in this general area is that there is quite a bit of > code that thinks a TID is specifically a block number and an offset, > like the Bitmap Index/Heap Scan code, for example. But making tuple > IDs wider doesn't help with that problem either. Agreed, that's an area that will need a lot of thought for anything that we do here. But varlena TIDs surely do not make that easier to fix. > What problem do you think this proposal does solve? Accommodating table AMs that want more than 48 bits for a TID. We're already starting to run up against the fact that that's not enough bits for plausible use-cases. 64 bits may someday in the far future not be enough either, but I think that's a very long way off. regards, tom lane
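The 48-bits-of-TID-plus-16-bits-of-partition layout suggested here comes down to plain bit packing; a minimal sketch (names and layout illustrative only, not a proposed on-disk format):

```c
#include <assert.h>
#include <stdint.h>

#define TID_BITS 48
#define TID_MASK ((UINT64_C(1) << TID_BITS) - 1)

/* Pack a 16-bit partition id and a 48-bit "real" TID into the 64-bit
 * tuple identifier a global index AM would be required to handle. */
static uint64_t
pack_global_tid(uint16_t partition_id, uint64_t tid48)
{
    return ((uint64_t) partition_id << TID_BITS) | (tid48 & TID_MASK);
}

static uint16_t
unpack_partition(uint64_t gtid)
{
    return (uint16_t) (gtid >> TID_BITS);
}

static uint64_t
unpack_tid(uint64_t gtid)
{
    return gtid & TID_MASK;
}
```

The 96-bit variant would do the same thing with a full 32-bit partition OID alongside a 64-bit TID; the attraction of either is that every moving part stays fixed-width.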
Re: MaxOffsetNumber for Table AMs
On Fri, 2021-04-30 at 08:36 -0700, Peter Geoghegan wrote: > Compatibility with index AMs is more than a matter of switching out > the tuple identifier -- if you invent something that has totally > different performance characteristics for index AMs, then it's likely > to break tacit assumptions about the cost of certain things. I think it would be reasonable to document and expect that table AMs offer some locality of access for tuples with similar IDs. Do you think we need something stronger than that? > Plus deduplication ameliorates problems with version churn > in > indexes -- presumably the same problems will exist when any new table > AM is used, and so it'll be useful to address the same problems in > the > same way. I got lost after "presumably the same problems", can you explain? > I agree that it might well be useful to make TIDs fully logical (not > "physiological" -- physical across blocks, logical within blocks) for > some new table AM. Even then, it would still definitely be a good > idea > to make these logical TID values correlate with the physical table > structure in some way. Agreed. Regards, Jeff Davis
Re: Procedures versus the "fastpath" API
Michael Paquier writes: > On Tue, Mar 09, 2021 at 02:33:47PM -0500, Joe Conway wrote: >> My vote would be reject using fastpath for procedures in all relevant >> branches. >> If someday someone cares enough to make it work, it is a new feature for a >> new >> major release. > FWIW, my vote would go for issuing an error if attempting to use a > procedure in the fast path for all the branches. The lack of > complaint about the error you are mentioning sounds like a pretty good > argument to fail properly on existing branches, and work on this as a > new feature in the future if there is anybody willing to make a case > for it. I let this thread grow cold because I was hoping for some more votes, but with the quarterly releases approaching, it's time to close out the issue one way or the other. By my count, we have three votes for forbidding procedure calls via fastpath in all branches (me, Joe, Michael), and two for doing something laxer (Noah, Laurenz). The former is surely the safer choice, so I'm going to go do that. regards, tom lane
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 11:06 AM Tom Lane wrote: > My thought at the moment is that all APIs above the AM level ought > to be redefined to use uint64 for tuple identifiers. heapam and > related index AMs could map block + offset into that in some > convenient way, and other AMs could do what they like. > > Andres seems to feel that we should try to allow variable-width > tupleids, but I'm afraid that the cost/benefit ratio for that > would be pretty poor. There are two major reasons why I want variable-width tuple IDs. One is global indexes, where you need as many bits as the AMs implementing the partitions need, plus some extra bits to identify which partition is relevant for a particular tuple. No fixed number of bits that you make available can ever be sufficient here, because a global index always needs to have extra bits compared to a partition-local index; if you let the partition-local index use more bits, the global index now needs even more space. The other is indirect indexes, work Álvaro proposed a few years ago, where the index entry points to the primary key value rather than a TID. The space needs for that are based on the type of the primary key column. This proposal solves neither of those problems. Another problem in this general area is that there is quite a bit of code that thinks a TID is specifically a block number and an offset, like the Bitmap Index/Heap Scan code, for example. But making tuple IDs wider doesn't help with that problem either. What problem do you think this proposal does solve? -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Fri, 2021-04-30 at 12:04 +0200, Matthias van de Meent wrote: > Other than that, I believe you've also missed the special offset > numbers specified in itemptr.h (SpecTokenOffsetNumber and > MovedPartitionsOffsetNumber). I am not well enough aware of the usage > of these OffsetNumber values, but those might also be limiting the > values any tableAM can use for their TIDs. Yes, thank you. If it is treated specially in a heap tuple, it can't be a regular TID. > > It just seems like it's used when scanning heap or index pages for > > stack-allocated arrays. For a table AM it appears to waste 5 bits. > > MaxOffsetNumber is used for postgres' Page layout, of which the > MaxOffsetNumber is defined as how many item pointers could exist on a > page, and AFAIK should be used for postgres' Page layout only. No > thing can or should change that. If any code asserts limitations to > the ip_posid of table tuples that could also not be tableam tuples, > then I believe that is probably a mistake in postgres, and that > should be fixed. A name that would better fit your definition would be something like "MaxItemsPerPage". The name "MaxOffsetNumber" implies that any number past that must be either invalid or special. But it seems like you are saying that if I use an offset number of 5000 in my table AM, then that's fine and should be treated like a normal TID. > No. The documentation for that function explicitly mentions that > these item pointers are optimized for storage when using the heap > tableam, and that that code will be changed once there exist tableAMs > with different TID / ip_posid constraints (see the comment on lines > 32 and 33 of that file). Thank you. I'm a table AM author, and I'd like to use whatever the real range of TIDs is. Does that mean it's time to change that code in ginpostinglist.c now? > > 1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64(). > > I believe that this is the right way when there exist tableAMs that > use those upper 5 bits. 
Does that mean we should declare the valid range of offsets to be between 1 and 0xfffc (inclusive)? I'm trying to use some mapping now that's somewhat stable so that I don't have to worry that something will break later, and then require reindexing all tables with my table AM. Regards, Jeff Davis
Re: pg_amcheck contrib application
On Mon, Apr 26, 2021 at 1:52 PM Mark Dilger wrote: > The attached patch changes amcheck corruption reports as discussed upthread. > This patch is submitted for the v14 development cycle as a bug fix, per your > complaint that the committed code generates reports sufficiently confusing to > a user as to constitute a bug. > > All other code refactoring and additional checks discussed upthread are > reserved for the v15 development cycle and are not included here. > > The minimal patch (not attached) that does not rename any variables is 135 > lines. Your patch was 159 lines. The patch (attached) which includes your > variable renaming is 174 lines. Hi, I have compared this against my version. I found the following differences:

1. This version passes last_chunk_seq rather than extsize to check_toast_tuple(). But this results in having to call VARATT_EXTERNAL_GET_EXTSIZE() inside that function. I thought it was nicer to do that in the caller, so that we don't do it twice.

2. You fixed some out-of-date comments.

3. You move the test for an unexpected chunk sequence further down in the function. I don't see the point; I had put it by the related null check, and still think that's better. You also deleted my comment /* Either the TOAST index is corrupt, or we don't have all chunks. */ which I would have preferred to keep.

4. You don't return if chunk_seq > last_chunk_seq. That seems wrong, because we cannot compute a sensible expected size in that case. I think your code will subtract a larger value from a smaller one and, this being unsigned arithmetic, say that the expected chunk size is something gigantic. Returning and not issuing that complaint at all seems better.

5. You fixed the incorrect formula I had introduced for the expected size of the last chunk.

6. You changed the variable name in check_toasted_attribute() from expected_chunkno to chunkno, and initialized it later in the function instead of at declaration time.
I don't find this to be an improvement; including the word "expected" seems to me to be substantially clearer. But I think I should have gone with expected_chunk_seq for better consistency.

7. You restored the message "toast value %u was expected to end at chunk %d, but ended at chunk %d" which my version deleted. I deleted that message because I thought it was redundant, but I guess it's not: there's nothing else to complain about if the sequence of chunks ends early. I think we should change the test from != to < though, because if it's >, then we must have already complained about unexpected chunks. Also, I think the message is actually wrong, because even though you renamed the variable, it still ends up being the expected next chunkno rather than the last chunkno we actually saw.

PFA my counter-proposal based on the above analysis. -- Robert Haas EDB: http://www.enterprisedb.com simply-remove-chunkno-concept-v2.patch Description: Binary data
Re: MaxOffsetNumber for Table AMs
Jeff Davis writes: > On Fri, 2021-04-30 at 11:06 -0400, Tom Lane wrote: >> My thought at the moment is that all APIs above the AM level ought >> to be redefined to use uint64 for tuple identifiers. > Do you mean that indexes would be expected to hold a uint64, a 48-bit > int (that directly maps to a uint64), or still hold an ItemPointerData? ISTM that would be up to the index AM. We'd need some interlocks on which index AMs could be used with which table AMs in any case, I think. It'd likely not be hard for existing index AMs to be repurposed to store "any 48-bit TID", but extending them to full 64-bit TIDs may be impractical. I think the hard part may really be in places like tidbitmap.c, which one would wish to be AM-independent, but right now it's quite specific to heap-style TIDs. Maybe we can think of a way to parameterize it. regards, tom lane
Re: MaxOffsetNumber for Table AMs
On Fri, 2021-04-30 at 11:06 -0400, Tom Lane wrote: > My thought at the moment is that all APIs above the AM level ought > to be redefined to use uint64 for tuple identifiers. One challenge might be reliance on InvalidOffsetNumber as a special value in a number of places (e.g. bitmap index scans). That doesn't seem like a major problem though. > heapam and > related index AMs could map block + offset into that in some > convenient way, and other AMs could do what they like. Do you mean that indexes would be expected to hold a uint64, a 48-bit int (that directly maps to a uint64), or still hold an ItemPointerData? Regards, Jeff Davis
Re: Race condition in InvalidateObsoleteReplicationSlots()
Hi, On 2021-04-29 13:28:20 -0400, Álvaro Herrera wrote: > On 2021-Apr-07, Andres Freund wrote: > > > I'm also confused by the use of ConditionVariableTimedSleep(timeout = > > 10). Why do we need a timed sleep here in the first place? And why with > > such a short sleep? > > I was scared of the possibility that a process would not set the CV for > whatever reason, causing checkpointing to become stuck. Maybe that's > misguided thinking if CVs are reliable enough. They better be, or we have bigger problems. And if it's an escape hatch we surely ought not to use 10ms as the timeout. That's an appropriate time for something *not* using condition variables... > I attach a couple of changes to your 0001. It's all cosmetic; what > looks not so cosmetic is the change of "continue" to "break" in helper > routine; if !s->in_use, we'd loop infinitely. The other routine > already checks that before calling the helper; since you hold > ReplicationSlotControlLock at that point, it should not be possible to > drop it in between. Anyway, it's a trivial change to make, so it should > be correct. > I also added a "continue" at the bottom of one block; currently that > doesn't change any behavior, but if we add code at the other block, it > might not be what's intended. Seems sane. > Are you getting this set pushed, or would you like me to handle it? > (There seems to be some minor conflict in 13) I'd be quite happy for you to handle it... Greetings, Andres Freund
Re: Race condition in InvalidateObsoleteReplicationSlots()
Hi, On 2021-04-08 17:03:41 +0530, Amit Kapila wrote: > I haven't tested the patch but I couldn't spot any problems while > reading it. A minor point, don't we need to use > ConditionVariableCancelSleep() at some point after doing > ConditionVariableTimedSleep? It's not really necessary - unless the CV could get deallocated as part of dynamic shared memory or such. Greetings, Andres Freund
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 8:06 AM Tom Lane wrote: > My thought at the moment is that all APIs above the AM level ought > to be redefined to use uint64 for tuple identifiers. heapam and > related index AMs could map block + offset into that in some > convenient way, and other AMs could do what they like. > > Andres seems to feel that we should try to allow variable-width > tupleids, but I'm afraid that the cost/benefit ratio for that > would be pretty poor. I agree. It'll be easier for a new table AM to be developed with that constraint than it will be to retrofit it to every index AM. It probably wouldn't be that hard to make nbtree deduplication and GIN posting list compression work with uint64 TIDs. But variable-width TIDs are a very different matter. Compatibility with index AMs is more than a matter of switching out the tuple identifier -- if you invent something that has totally different performance characteristics for index AMs, then it's likely to break tacit assumptions about the cost of certain things. For example, index tuple deletion probably relies on the fact that there just isn't that many table blocks to visit (to get an XID for recovery conflict purposes) in practice due to various locality-related effects. Plus deduplication ameliorates problems with version churn in indexes -- presumably the same problems will exist when any new table AM is used, and so it'll be useful to address the same problems in the same way. I agree that it might well be useful to make TIDs fully logical (not "physiological" -- physical across blocks, logical within blocks) for some new table AM. Even then, it would still definitely be a good idea to make these logical TID values correlate with the physical table structure in some way. Plus TIDs should still be fixed size. If a new table AM can't do it that way then that certainly needs to be justified -- it's unreasonable to imagine that it simply isn't the table AM's problem to solve. -- Peter Geoghegan
Re: Help needed with a reproducer for CVE-2020-25695 not based on REFRESH MATERIALIZED VIEW
We've figured it out. Please ignore. Regards. On Fri, Apr 30, 2021 at 3:13 PM Patrik Novotny wrote: > Hi, > > I need to reproduce the CVE-2020-25695 on PostgreSQL 9.2.24. I know this > is not a supported version, however, it is important for us to have a > reproducer for this version as well. > > The reproducer for supported versions[1] is based on REFRESH MATERIALIZED > VIEW which is not implemented until version 9.3. > > I was trying to reproduce this using ANALYZE as you can see in this > poc.sql file[2]. However, it doesn't reproduce the issue. > > It would be really appreciated if someone could take a look at it and help. > > > [1] > https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/test/regress/sql/privileges.sql;h=013bc95c74bd20e5ab7f1826ea7e676da2a0e85b;hb=HEAD#l896 > [2] https://pastebin.com/6hgziYRD > > > Regards, > > -- > Patrik Novotný > Associate Software Engineer > Red Hat > panov...@redhat.com > -- Patrik Novotný Associate Software Engineer Red Hat panov...@redhat.com
Re: MaxOffsetNumber for Table AMs
Jeff Davis writes: > The notion of TID is based on pages and line pointers, which makes > sense for heapam, but that's not likely to make sense for a custom > table AM. > The obvious answer is to make a simple mapping between a TID and > whatever makes sense to the AM (for the sake of discussion, let's say a > plain row number). I'm inclined to think that when we get around to doing something about this, we need to make a much bigger change than just poking at the margins of type tid. My thought at the moment is that all APIs above the AM level ought to be redefined to use uint64 for tuple identifiers. heapam and related index AMs could map block + offset into that in some convenient way, and other AMs could do what they like. Andres seems to feel that we should try to allow variable-width tupleids, but I'm afraid that the cost/benefit ratio for that would be pretty poor. regards, tom lane
Re: pg_hba.conf.sample wording improvement
On Wed, Apr 28, 2021 at 07:51:43AM +0200, Peter Eisentraut wrote: > I propose the attached patch to shake up the wording in the connection type > section of pg_hba.conf.sample a bit. After the hostgssenc part was added > on, the whole thing became a bit wordy, and it's also a bit inaccurate for > example in that the current wording for "host" appears to say that it does > not apply to GSS-encrypted connections. Yes, much better. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Apr 30, 2021 at 3:01 PM Amit Kapila wrote: > > On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar wrote: > > > > On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila > > wrote: > > > > > > > The idea I have is to additionally check that we are decoding > > > streaming or prepared transaction (the same check as we have for > > > setting curtxn) or we can check if CheckXidAlive is a valid > > > transaction id. What do you think? > > > > I think a check based on CheckXidAlive looks good to me. This will > > protect against if a similar error is raised from any other path as > > you mentioned above. > > > > We can't use CheckXidAlive because it is reset by that time. Right. So, I > used the other approach which led to the attached. The patch looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
(Thanks for committing the fix.) On Thu, Apr 29, 2021 at 1:11 AM Álvaro Herrera wrote: > On Wed, Apr 28, 2021, at 10:21 AM, Amit Langote wrote: > I noticed that rd_partdesc_nodetached_xmin can sometimes end up with > value 0. While you seem to be already aware of that, because otherwise > you wouldn't have added TransactionIdIsValid(...) in condition in > RelationGetPartitionDesc(), the comments nearby don't mention why such > a thing might happen. Also, I guess it can't be helped that the > partdesc_nodetached will have to be leaked when the xmin is 0, but > that shouldn't be as problematic as the case we discussed earlier. > > > The only case I am aware where that can happen is if the pg_inherits tuple is > frozen. (That's exactly what the affected test case was testing, note the > "VACUUM FREEZE pg_inherits" there). So that test case blew up immediately; > but I think the real-world chances that people are going to be doing that are > pretty low, so I'm not really concerned about the leak. The case I was looking at is when a partition detach appears as in-progress to a serializable transaction. If the caller wants to omit detached partitions, such a partition ends up in rd_partdesc_nodetached, with the corresponding xmin being set to 0 due to the way find_inheritance_children_extended() sets *detached_xmin. The next query in the transaction that wants to omit detached partitions, seeing rd_partdesc_nodetached_xmin being invalid, rebuilds the partdesc, again including that partition because the snapshot wouldn't have changed, and so on until the transaction ends. Now, this can perhaps be "fixed" by making find_inheritance_children_extended() set the xmin outside the snapshot-checking block, but maybe there's no need to address this on priority. Rather, a point that bothers me a bit is that we're including a detached partition in the partdesc labeled "nodetached" in this particular case. 
Maybe we should avoid that by considering in this scenario that no detached partitions exist for this transaction, and so initialize rd_partdesc instead of rd_partdesc_nodetached. That will let us avoid the situations where the xmin is left in an invalid state. Maybe like the attached (it also fixes a couple of typos/thinkos in the previous commit). Note that we still end up in the same situation as before, where each query in the serializable transaction that sees the detach as in-progress has to rebuild the partition descriptor omitting the detached partitions, even when it's clear that the detach-in-progress partition will be included every time. -- Amit Langote EDB: http://www.enterprisedb.com partdesc_nodetached-only-if-detached-visible.patch Description: Binary data
Help needed with a reproducer for CVE-2020-25695 not based on REFRESH MATERIALIZED VIEW
Hi, I need to reproduce the CVE-2020-25695 on PostgreSQL 9.2.24. I know this is not a supported version, however, it is important for us to have a reproducer for this version as well. The reproducer for supported versions[1] is based on REFRESH MATERIALIZED VIEW which is not implemented until version 9.3. I was trying to reproduce this using ANALYZE as you can see in this poc.sql file[2]. However, it doesn't reproduce the issue. It would be really appreciated if someone could take a look at it and help. [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/test/regress/sql/privileges.sql;h=013bc95c74bd20e5ab7f1826ea7e676da2a0e85b;hb=HEAD#l896 [2] https://pastebin.com/6hgziYRD Regards, -- Patrik Novotný Associate Software Engineer Red Hat panov...@redhat.com
Re: MaxOffsetNumber for Table AMs
On Fri, 30 Apr 2021, 09:46 Jeff Davis, wrote: > > The notion of TID is based on pages and line pointers, which makes > sense for heapam, but that's not likely to make sense for a custom > table AM. > > The obvious answer is to make a simple mapping between a TID and > whatever makes sense to the AM (for the sake of discussion, let's say a > plain row number). > > The most natural thing would be to say that we have 48 bits, so it can > just be a 48-bit number. Of course, there are some restrictions on > valid values that complicate this: > > * InvalidBlockNumber of 0xFFFFFFFF. Not a problem. > * InvalidOffsetNumber of 0. Not a problem. > * MaxOffsetNumber of 2048. Does this limit really apply to table AMs? > MaxOffsetNumber is not per se 2048. It is BLCKSZ / sizeof(ItemIdData), which is only 2048 for an 8kiB BLCKSZ. As we support BLCKSZ up to 32kiB, MaxOffsetNumber can be as large as 8192. Other than that, I believe you've also missed the special offset numbers specified in itemptr.h (SpecTokenOffsetNumber and MovedPartitionsOffsetNumber). I am not well enough aware of the usage of these OffsetNumber values, but those might also be limiting the values any tableAM can use for their TIDs. It just seems like it's used when scanning heap or index pages for > stack-allocated arrays. For a table AM it appears to waste 5 bits. > MaxOffsetNumber is used for postgres' Page layout, of which the MaxOffsetNumber is defined as how many item pointers could exist on a page, and AFAIK should be used for postgres' Page layout only. Nothing can or should change that. If any code asserts limitations to the ip_posid of table tuples that could also not be tableam tuples, then I believe that is probably a mistake in postgres, and that should be fixed. * ginpostinglist.c:itemptr_to_uint64() seems to think 2047 is the max > offset number. Is this a bug? No. 
The documentation for that function explicitly mentions that these item pointers are optimized for storage when using the heap tableam, and that that code will be changed once there exist tableAMs with different TID / ip_posid constraints (see the comment on lines 32 and 33 of that file). Note that the limiting number that itemptr_to_uint64 should mention for bit calculations is actually MaxHeaptuplesPerPage, which is about one seventh of MaxOffsetNumber. The resulting number of bits reserved is not a miscalculation though, because MaxHeaptuplesPerPage (for 32kiB BLCKSZ) requires the mentioned 11 bits, and adapting bit swizzling for multiple page sizes was apparently not considered worth the effort. As a table AM author, I'd like to know what the real limits are so that > I can use whatever bits are available to map from TID to row number and > back, without worrying that something will break in the future. A few > possibilities: > > 1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64(). > I believe that this is the right way when there exist tableAMs that use those upper 5 bits. > 2. Change MaxOffsetNumber to 2047. This seems likely to break > extensions that rely on it. > If you're going to change MaxOffsetNumber, I believe that it's better to change it to ((BLCKSZ - sizeof(PageHeaderData)) / sizeof(ItemIdData)), as that is the maximum amount of ItemIds you could put on a Page that has no page opaque. 3. Define MaxOffsetNumber as 65536 and introduce a new > MaxItemsPerPage as 2048 for the stack-allocated arrays. We'd still need > to fix itemptr_to_uint64(). I believe that that breaks more things than otherwise required. ip_posid is already limited to uint16, so I see no reason to add a constant that would assert that the value of any uint16 is less its max value plus one. With regards, Matthias van de Meent
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar wrote: > > On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila wrote: > > > > The idea I have is to additionally check that we are decoding > > streaming or prepared transaction (the same check as we have for > > setting curtxn) or we can check if CheckXidAlive is a valid > > transaction id. What do you think? > > I think a check based on CheckXidAlive looks good to me. This will > protect against if a similar error is raised from any other path as > you mentioned above. > We can't use CheckXidAlive because it is reset by that time. So, I used the other approach which led to the attached. -- With Regards, Amit Kapila. v1-0001-Tighten-the-concurrent-abort-check-during-decodin.patch Description: Binary data
Re: Enhanced error message to include hint messages for redundant options error
On Fri, Apr 30, 2021 at 12:36 PM Bharath Rupireddy wrote: > > > > other comments > > > > if (strcmp(defel->defname, "volatility") == 0) > > { > > if (is_procedure) > > - goto procedure_error; > > + is_procedure_error = true; > > if (*volatility_item) > > - goto duplicate_error; > > + is_duplicate_error = true; > > > > Another side effect I see is, in the above check earlier if > > is_procedure was true it was directly goto procedure_error, but now it > > will also check the if (*volatility_item) and it may set > > is_duplicate_error also true, which seems wrong to me. I think you > > can change it to > > > > if (is_procedure) > >is_procedure_error = true; > > else if (*volatility_item) > > is_duplicate_error = true; > > Thanks. Done that way, see the attached v3. Let's see what others have to say. > > Also attaching Vignesh's v3 patch as-is, just for completion. Looking into this again, why not as shown below? IMHO, this way the code will be logically the same as it was before the patch; basically, why process an extra statement (*volatility_item = defel;) if we have already decided to error?

if (is_procedure)
    is_procedure_error = true;
else if (*volatility_item)
    is_duplicate_error = true;
else
    *volatility_item = defel;

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Enhanced error message to include hint messages for redundant options error
On Fri, Apr 30, 2021 at 5:06 PM Bharath Rupireddy wrote: > > On Fri, Apr 30, 2021 at 11:23 AM Dilip Kumar wrote: > > > > On Fri, Apr 30, 2021 at 11:09 AM Bharath Rupireddy > > wrote: > > > > > > On Fri, Apr 30, 2021 at 10:51 AM Dilip Kumar > > > wrote: > > > > > > > > On Fri, Apr 30, 2021 at 10:43 AM Bharath Rupireddy > > > > wrote: > > > > > > > > > > On Fri, Apr 30, 2021 at 10:17 AM Dilip Kumar > > > > > wrote: > > > > > > In this function, we already have the "defel" variable then I do not > > > > > > understand why you are using one extra variable and assigning defel > > > > > > to > > > > > > that? > > > > > > If the goal is to just improve the error message then you can simply > > > > > > use defel->defname? > > > > > > > > > > Yeah. I can do that. Thanks for the comment. > > > > > > > > > > While on this, I also removed the duplicate_error and procedure_error > > > > > goto statements, because IMHO, using goto statements is not an elegant > > > > > way. I used boolean flags to do the job instead. See the attached and > > > > > let me know what you think. > > > > > > > > Okay, but I see one side effect of this, basically earlier on > > > > procedure_error and duplicate_error we were not assigning anything to > > > > output parameters, e.g. volatility_item, but now those values will be > > > > assigned with defel even if there is an error. > > > > > > Yes, but on ereport(ERROR, we don't come back right? The txn gets > > > aborted and the control is not returned to the caller instead it will > > > go to sigjmp_buf of the backend. > > > > > > > So I think we should > > > > better avoid such change. But even if you want to do then better > > > > check for any impacts on the caller. > > > > > > AFAICS, there will not be any impact on the caller, as the control > > > doesn't return to the caller on error. > > > > I see. 
> > > > other comments > > > > if (strcmp(defel->defname, "volatility") == 0) > > { > > if (is_procedure) > > - goto procedure_error; > > + is_procedure_error = true; > > if (*volatility_item) > > - goto duplicate_error; > > + is_duplicate_error = true; > > > > Another side effect I see is, in the above check earlier if > > is_procedure was true it was directly goto procedure_error, but now it > > will also check the if (*volatility_item) and it may set > > is_duplicate_error also true, which seems wrong to me. I think you > > can change it to > > > > if (is_procedure) > >is_procedure_error = true; > > else if (*volatility_item) > > is_duplicate_error = true; > > Thanks. Done that way, see the attached v3. Let's see what others has to say. > Hmmm - I am not so sure about those goto replacements. I think the poor goto has a bad reputation, but not all gotos are bad. I've met some very nice gotos. Each goto here was doing exactly what it looked like it was doing, whereas all these boolean replacements have now introduced subtle differences. e.g. now the *volatility_item = defel; assignment (and all similar assignments) will happen which previously did not happen at all. It leaves the reader wondering if assigning to those references could have any side-effects at the caller. Probably there are no problems at allbut can you be sure? Meanwhile, those "inelegant" gotos did not give any cause for such doubts. -- Kind Regards, Peter Smith. Fujitsu Australia
Replication slot used in logical decoding of documental database give error: got sequence entry 258 for toast chunk 538757697 instead of seq 0
Hi, our setup: the Postgres server is running on a CentOS release 6.10 (Final) instance. The server version is PostgreSQL 9.5.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18), 64-bit, with the following parameters set:

wal_level = 'logical'  # replica < logical
max_replication_slots = 10
max_wal_senders = 10
track_commit_timestamp = on

We are using the decoder_raw module to retrieve/read the WAL data through the logical decoding mechanism. Application setup: the actual TCapture engine is a Java application which runs as a separate program outside Postgres, and which must be started explicitly. When TCapture is running, it scans the transaction log (with the TCapt module) of all primary databases and picks up transactions which must be replicated. Transactions which have been picked up are stored in the “Replication Database”, a PG user database exclusively used by TCapture. In the Replication Database, the transaction is ‘copied’ to all replicate databases which have a subscription for it. The transaction is then applied to the replicate tables by the dedicated Java application module, which inserts it. We run the TCapt module in a loop to read a primary database, which is a documental database (with binary columns).
Behavior reported (bug): we have had the TCapture Replication Server running successfully for weeks, but recently we encountered the following error:

cat log/TCapture_enodp_2021-04-12-11\:30\:16_err.log
org.postgresql.util.PSQLException: ERROR: got sequence entry 258 for toast chunk 538757697 instead of seq 0
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2553)
at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1212)
at org.postgresql.core.v3.QueryExecutorImpl.readFromCopy(QueryExecutorImpl.java:1112)
at org.postgresql.core.v3.CopyDualImpl.readFromCopy(CopyDualImpl.java:44)
at org.postgresql.core.v3.replication.V3PGReplicationStream.receiveNextData(V3PGReplicationStream.java:160)
at org.postgresql.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:125)
at org.postgresql.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:82)
at com.edslab.TCapt.receiveChangesOccursBeforTCapt(TCapt.java:421)
at com.edslab.TCapt.run(TCapt.java:182)
at java.lang.Thread.run(Thread.java:745)

After restarting our TCapt module (see https://www.tcapture.net/ to better understand the TCapture project), the error went away. But this causes the producer module (TCapt) to shut down. Please note that we run TCapture with other Postgres versions (9.6, 10, 11, etc.) without problems! Is there any resolution for this issue, or is it resolved in a higher version of Postgres? Regards, Silvio
Re: Replication slot stats misgivings
On Fri, Apr 30, 2021 at 5:55 AM Masahiko Sawada wrote: > > After more thought, it seems to me that we should use txn->size here > regardless of the top transaction or subtransaction since we're > iterating changes associated with a transaction that is either the top > transaction or a subtransaction. Otherwise, I think if some > subtransactions are not serialized, we will end up adding bytes > including those subtransactions during iterating other serialized > subtransactions. Whereas in ReorderBufferProcessTXN() we should use > txn->total_txn since txn is always the top transaction. I've attached > another patch to do this. > LGTM. I have slightly edited the comments in the attached. I'll push this early next week unless there are more comments. -- With Regards, Amit Kapila. use_total_size_v6.patch Description: Binary data
MaxOffsetNumber for Table AMs
The notion of TID is based on pages and line pointers, which makes sense for heapam, but that's not likely to make sense for a custom table AM. The obvious answer is to make a simple mapping between a TID and whatever makes sense to the AM (for the sake of discussion, let's say a plain row number). The most natural thing would be to say that we have 48 bits, so it can just be a 48-bit number. Of course, there are some restrictions on valid values that complicate this:

* InvalidBlockNumber of 0xFFFFFFFF. Not a problem.
* InvalidOffsetNumber of 0. Not a problem.
* MaxOffsetNumber of 2048. Does this limit really apply to table AMs? It just seems like it's used when scanning heap or index pages for stack-allocated arrays. For a table AM it appears to waste 5 bits.
* ginpostinglist.c:itemptr_to_uint64() seems to think 2047 is the max offset number. Is this a bug?

As a table AM author, I'd like to know what the real limits are so that I can use whatever bits are available to map from TID to row number and back, without worrying that something will break in the future. A few possibilities:

1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().
2. Change MaxOffsetNumber to 2047. This seems likely to break extensions that rely on it.
3. Define MaxOffsetNumber as 65536 and introduce a new MaxItemsPerPage as 2048 for the stack-allocated arrays. We'd still need to fix itemptr_to_uint64().

Thoughts? Regards, Jeff Davis
Re: Enhanced error message to include hint messages for redundant options error
On Fri, Apr 30, 2021 at 11:23 AM Dilip Kumar wrote: > > On Fri, Apr 30, 2021 at 11:09 AM Bharath Rupireddy > wrote: > > > > On Fri, Apr 30, 2021 at 10:51 AM Dilip Kumar wrote: > > > > > > On Fri, Apr 30, 2021 at 10:43 AM Bharath Rupireddy > > > wrote: > > > > > > > > On Fri, Apr 30, 2021 at 10:17 AM Dilip Kumar > > > > wrote: > > > > > In this function, we already have the "defel" variable then I do not > > > > > understand why you are using one extra variable and assigning defel to > > > > > that? > > > > > If the goal is to just improve the error message then you can simply > > > > > use defel->defname? > > > > > > > > Yeah. I can do that. Thanks for the comment. > > > > > > > > While on this, I also removed the duplicate_error and procedure_error > > > > goto statements, because IMHO, using goto statements is not an elegant > > > > way. I used boolean flags to do the job instead. See the attached and > > > > let me know what you think. > > > > > > Okay, but I see one side effect of this, basically earlier on > > > procedure_error and duplicate_error we were not assigning anything to > > > output parameters, e.g. volatility_item, but now those values will be > > > assigned with defel even if there is an error. > > > > Yes, but on ereport(ERROR, we don't come back right? The txn gets > > aborted and the control is not returned to the caller instead it will > > go to sigjmp_buf of the backend. > > > > > So I think we should > > > better avoid such change. But even if you want to do then better > > > check for any impacts on the caller. > > > > AFAICS, there will not be any impact on the caller, as the control > > doesn't return to the caller on error. > > I see. 
> > other comments
> >
> >     if (strcmp(defel->defname, "volatility") == 0)
> >     {
> >         if (is_procedure)
> > -           goto procedure_error;
> > +           is_procedure_error = true;
> >         if (*volatility_item)
> > -           goto duplicate_error;
> > +           is_duplicate_error = true;
>
> Another side effect I see is, in the above check earlier if
> is_procedure was true it was directly goto procedure_error, but now it
> will also check the if (*volatility_item) and it may set
> is_duplicate_error also true, which seems wrong to me. I think you
> can change it to
>
> if (is_procedure)
>     is_procedure_error = true;
> else if (*volatility_item)
>     is_duplicate_error = true;

Thanks. Done that way, see the attached v3. Let's see what others have
to say. Also attaching Vignesh's v3 patch as-is, just for completion.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

From 6ed153cdb45a0d41d3889dc7d80b3a743eb66725 Mon Sep 17 00:00:00 2001
From: vignesh
Date: Mon, 26 Apr 2021 18:40:36 +0530
Subject: [PATCH v3] Enhance error message.

Enhanced error message, so that the user can easily identify the error.
---
 contrib/file_fdw/file_fdw.c                 |  6 +--
 src/backend/catalog/aclchk.c                |  4 +-
 src/backend/commands/copy.c                 | 23 +-
 src/backend/commands/dbcommands.c           | 28 ++--
 src/backend/commands/extension.c            |  8 ++--
 src/backend/commands/foreigncmds.c          |  4 +-
 src/backend/commands/functioncmds.c         | 12 +++---
 src/backend/commands/publicationcmds.c      |  4 +-
 src/backend/commands/sequence.c             | 18
 src/backend/commands/subscriptioncmds.c     | 18
 src/backend/commands/tablecmds.c            |  2 +-
 src/backend/commands/typecmds.c             | 14 +++---
 src/backend/commands/user.c                 | 48 ++---
 src/backend/parser/parse_utilcmd.c          |  2 +-
 src/backend/replication/pgoutput/pgoutput.c | 10 ++---
 src/backend/replication/walsender.c         |  6 +--
 src/test/regress/expected/copy2.out         | 24 ++-
 src/test/regress/expected/foreign_data.out  |  4 +-
 src/test/regress/expected/publication.out   |  2 +-
 19 files changed, 119 insertions(+), 118 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 2c2f149fb0..10aa2fca28 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -292,8 +292,7 @@ file_fdw_validator(PG_FUNCTION_ARGS)
 			if (force_not_null)
 				ereport(ERROR,
 						(errcode(ERRCODE_SYNTAX_ERROR),
-						 errmsg("conflicting or redundant options"),
-						 errhint("Option \"force_not_null\" supplied more than once for a column.")));
+						 errmsg("option \"%s\" specified more than once", def->defname)));
 			force_not_null = def;
 			/* Don't care what the value is, as long as it's a legal boolean */
 			(void) defGetBoolean(def);
@@ -304,8 +303,7 @@
 			if (force_null)
 				ereport(ERROR,
 						(errcode(ERRCODE_SYNTAX_ERROR),
-						 errmsg("conflicting or redundant options"),
-						 errhint("Option \"force_null\" supplied more than once for a column.")));
+						 errmsg("option \"%s\" specified more than once", def->defname)));
 			force_null = def;
 			(void) defGetBoolean(def);
 		}
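The pattern the patch standardizes — scan a list of options and raise "option \"%s\" specified more than once" on the first repeat — can be sketched in isolation. This is plain C; DefElem here is a simplified stand-in for PostgreSQL's node type, and the real code raises ereport(ERROR, ...) instead of returning the offending name:

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for PostgreSQL's DefElem node. */
typedef struct DefElem
{
    const char *defname;
} DefElem;

/* Returns the name of the first option that repeats an earlier one,
 * or NULL if all options are distinct. In the validator this is the
 * point where ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR),
 * errmsg("option \"%s\" specified more than once", ...))) fires. */
static const char *
find_duplicate_option(const DefElem *options, size_t noptions)
{
    for (size_t i = 1; i < noptions; i++)
        for (size_t j = 0; j < i; j++)
            if (strcmp(options[i].defname, options[j].defname) == 0)
                return options[i].defname;
    return NULL;
}
```

The real validators track each option with a dedicated pointer (e.g. force_not_null above) rather than a nested scan, but the reported condition — the same defname seen twice — is the same.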
Re: Included xid in restoring reorder buffer changes from disk log message
On Thu, Apr 29, 2021 at 9:45 PM vignesh C wrote:
>
> Hi,
>
> While debugging one of the logical decoding issues, I found that xid was
> not included in the restoring reorder buffer changes from disk log
> messages. Attached a patch for it. I felt including the XID will be
> helpful in debugging. Thoughts?
>

It makes sense to include the xid in the debug message. Earlier I wondered
whether it would also be a good idea to print the toplevel_xid, but since
this is for debugging purposes and, once we have the xid, we can fetch the
other parameters, adding just the xid seems good enough.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
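For illustration only, the shape of such an enhanced log line might look like the following. The exact wording, variable names, and log level are assumptions, not the actual patch; elog() is modeled with snprintf so the sketch stands alone:

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical sketch: format the "restoring reorder buffer changes
 * from disk" debug line with the transaction id included, so that a
 * debugger can correlate the restore with a specific transaction.
 * In the backend this would be an elog(DEBUG2, ...) call. */
static int
format_restore_message(char *buf, size_t buflen,
                       uint32_t xid, uint64_t nchanges)
{
    return snprintf(buf, buflen,
                    "restored %llu changes from disk for transaction %u",
                    (unsigned long long) nchanges, xid);
}
```

The point of the thread is simply that the %u for the xid was missing from the existing message.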
Re: Replication slot stats misgivings
On Fri, Apr 30, 2021 at 5:55 AM Masahiko Sawada wrote:
>
> On Thu, Apr 29, 2021 at 9:44 PM vignesh C wrote:
> >
> > On Thu, Apr 29, 2021 at 3:06 PM Amit Kapila wrote:
> > >
> > > On Wed, Apr 28, 2021 at 5:01 PM Amit Kapila wrote:
> > > >
> > > > On Wed, Apr 28, 2021 at 4:51 PM Masahiko Sawada wrote:
> > > > >
> > > > > On Wed, Apr 28, 2021 at 6:39 PM Amit Kapila wrote:
> > > >
> > > > @@ -1369,7 +1369,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
> > > >  * Update the total bytes processed before releasing the current set
> > > >  * of changes and restoring the new set of changes.
> > > >  */
> > > > - rb->totalBytes += rb->size;
> > > > + rb->totalBytes += entry->txn->total_size;
> > > >  if (ReorderBufferRestoreChanges(rb, entry->txn, >file,
> > > >                                  >entries[off].segno))
> > > >
> > > > I have not tested this but won't in the above change you need to check
> > > > txn->toptxn for subtxns?
> > > >
> > >
> > > Now, I am able to reproduce this issue:
> > > Create table t1(c1 int);
> > > select pg_create_logical_replication_slot('s', 'test_decoding');
> > > Begin;
> > > insert into t1 values(1);
> > > savepoint s1;
> > > insert into t1 select generate_series(1, 10);
> > > commit;
> > >
> > > postgres=# select count(*) from pg_logical_slot_peek_changes('s1', NULL, NULL);
> > >  count
> > > -------
> > >     15
> > > (1 row)
> > >
> > > postgres=# select * from pg_stat_replication_slots;
> > >  slot_name | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | total_txns | total_bytes | stats_reset
> > > -----------+------------+-------------+-------------+-------------+--------------+--------------+------------+-------------+----------------------------------
> > >  s1        |          0 |           0 |           0 |           0 |            0 |            0 |          2 |    13200672 | 2021-04-29 14:33:55.156566+05:30
> > > (1 row)
> > >
> > > select * from pg_stat_reset_replication_slot('s1');
> > >
> > > Now reduce the logical decoding work mem to allow spilling.
> > > postgres=# set logical_decoding_work_mem='64kB';
> > > SET
> > > postgres=# select count(*) from pg_logical_slot_peek_changes('s1', NULL, NULL);
> > >  count
> > > -------
> > >     15
> > > (1 row)
> > >
> > > postgres=# select * from pg_stat_replication_slots;
> > >  slot_name | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | total_txns | total_bytes | stats_reset
> > > -----------+------------+-------------+-------------+-------------+--------------+--------------+------------+-------------+----------------------------------
> > >  s1        |          1 |         202 |        1320 |           0 |            0 |            0 |          2 |         672 | 2021-04-29 14:35:21.836613+05:30
> > > (1 row)
> > >
> > > You can notice that after we have allowed spilling the 'total_bytes'
> > > stats is showing a different value. The attached patch fixes the issue
> > > for me. Let me know what do you think about this?
> >
> > I found one issue with the following scenario when testing with
> > logical_decoding_work_mem as 64kB:
> >
> > BEGIN;
> > INSERT INTO t1 values(generate_series(1,1));
> > SAVEPOINT s1;
> > INSERT INTO t1 values(generate_series(1,1));
> > COMMIT;
> > SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot1', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> > select * from pg_stat_replication_slots;
> >     slot_name     | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | total_txns | total_bytes | stats_reset
> > ------------------+------------+-------------+-------------+-------------+--------------+--------------+------------+-------------+----------------------------------
> >  regression_slot1 |          6 |         154 |     9130176 |           0 |            0 |            0 |          1 |     4262016 | 2021-04-29 17:50:00.080663+05:30
> > (1 row)
> >
> > Same thing works fine with logical_decoding_work_mem as 64MB:
> > select * from pg_stat_replication_slots;
> >     slot_name     | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | total_txns | total_bytes | stats_reset
> > ------------------+------------+-------------+-------------+-------------+--------------+--------------+------------+-------------+----------------------------------
> >  regression_slot1 |          6 |         154 |     9130176 |           0 |            0 |            0 |          1 |         264 | 2021-04-29 17:50:00.080663+05:30
> > (1 row)
> >
> > The patch required one change:
> > - rb->totalBytes += rb->size;
> > + if (entry->txn->toptxn)
> > +     rb->totalBytes += entry->txn->toptxn->total_size;
> > + else
> > +     rb->totalBytes += entry->txn->total_size;
>
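The accounting rule in the quoted one-line change can be modeled on its own: when the restored set of changes belongs to a subtransaction, its bytes must be attributed via the top-level transaction, otherwise total_bytes drifts once spilling kicks in. A small sketch (Txn is an illustrative stand-in for ReorderBufferTXN, not the real struct):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-in for ReorderBufferTXN: a transaction either is
 * top-level (toptxn == NULL) or points at its top-level transaction. */
typedef struct Txn
{
    struct Txn *toptxn;     /* NULL for a top-level transaction */
    uint64_t    total_size; /* accumulated change size */
} Txn;

/* Mirrors the quoted fix: account the top-level transaction's
 * total_size when the entry's txn is a subtransaction. */
static uint64_t
bytes_to_account(const Txn *txn)
{
    if (txn->toptxn)
        return txn->toptxn->total_size;
    return txn->total_size;
}
```

The earlier version (rb->totalBytes += rb->size) charged whatever happened to remain in the buffer at that moment, which is why the spilled and non-spilled runs above report different total_bytes for the same workload.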