Will there be https://wiki.postgresql.org/wiki/PgCon_2024_Developer_Unconference ?
Hello Everybody!

For at least the last two years we have had Developer Unconference notes in the PostgreSQL Wiki:

https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference
https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference

And I know that people took notes at least at the unconference sessions I attended. So is there a plan to collect them in https://wiki.postgresql.org/wiki/PgCon_2024_Developer_Unconference like in previous years?

-- Hannu
Re: DROP OWNED BY fails to clean out pg_init_privs grants
Hi Daniel,

pg_upgrade is just one important user of pg_dump, which is the one that generates REVOKE for a non-existent role. We should definitely also fix pg_dump, likely by just checking that the role exists when generating REVOKE commands (which may be good practice for other cases too, so instead of casting to ::regrole, do the actual join).

## Here is the fix for pg_dump

While flying to Vancouver I looked around in the pg_dump code, and it looks like the easiest way to mitigate the dangling pg_init_privs entries is to replace the query in pg_dump with one that filters out invalid entries.

The current query is at line 9336:

    /* Fetch initial-privileges data */
    if (fout->remoteVersion >= 90600)
    {
        printfPQExpBuffer(query,
                          "SELECT objoid, classoid, objsubid, privtype, initprivs "
                          "FROM pg_init_privs");
        res = ExecuteSqlQuery(fout, query->data, PGRES_TUPLES_OK);

And we need the same, but filtering out invalid aclitems from initprivs, something like this:

    WITH q AS (
        SELECT objoid, classoid, objsubid, privtype,
               unnest(initprivs) AS initpriv
          FROM pg_init_privs
    )
    SELECT objoid, classoid, objsubid, privtype,
           array_agg(initpriv) AS initprivs
      FROM q
     WHERE is_valid_value_for_type(initpriv::text, 'aclitem')
     GROUP BY 1, 2, 3, 4;

### The proposed replacement query

Unfortunately we do not have an existing is_valid_value_for_type(value text, type text, OUT res boolean) function, so for a read-only workaround the following seems to work.

Here I first collect the initprivs array elements which fail the conversion to text and back into an array and store it in the GUC pg_dump.bad_aclitems. Then I use this stored list to filter out the bad ones in the actual query.
    DO $$
    DECLARE
        aclitem_text text;
        bad_aclitems text[] = '{}';
    BEGIN
        FOR aclitem_text IN SELECT DISTINCT unnest(initprivs)::text FROM pg_init_privs LOOP
            BEGIN
                /* try to convert back to aclitem */
                PERFORM aclitem_text::aclitem;
            EXCEPTION WHEN OTHERS THEN
                /* collect bad aclitems */
                bad_aclitems := bad_aclitems || ARRAY[aclitem_text];
            END;
        END LOOP;
        IF bad_aclitems != '{}' THEN
            RAISE WARNING 'Ignoring bad aclitems "%" in pg_init_privs', bad_aclitems;
        END IF;
        PERFORM set_config('pg_dump.bad_aclitems', bad_aclitems::text, false); -- true for trx-local
    END;
    $$;

    WITH q AS (
        SELECT objoid, classoid, objsubid, privtype,
               unnest(initprivs) AS initpriv
          FROM pg_init_privs
    )
    SELECT objoid, classoid, objsubid, privtype,
           array_agg(initpriv) AS initprivs
      FROM q
     WHERE NOT initpriv::text = ANY (current_setting('pg_dump.bad_aclitems')::text[])
     GROUP BY 1, 2, 3, 4;

-- Hannu

On Sun, May 26, 2024 at 11:27 PM Daniel Gustafsson wrote:
>
> > On 26 May 2024, at 23:25, Tom Lane wrote:
> >
> > Hannu Krosing writes:
> >> Attached is a minimal patch to allow missing roles in REVOKE command
> >
> > FTR, I think this is a very bad idea.
>
> Agreed, this is papering over a bug. If we are worried about pg_upgrade it
> would be better to add a check to pg_upgrade which detects this case and
> advices the user how to deal with it.
>
> --
> Daniel Gustafsson
>
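The missing generic validity-check helper wished for above could be sketched in PL/pgSQL roughly like this (the function name and signature are made up for illustration; nothing like it exists in core):

```sql
CREATE OR REPLACE FUNCTION is_valid_value_for_type(value text, typename text)
RETURNS boolean
LANGUAGE plpgsql
AS $$
BEGIN
    /* attempt the text-to-type cast; any error means the value is invalid */
    EXECUTE format('SELECT %L::%s', value, typename);
    RETURN true;
EXCEPTION WHEN OTHERS THEN
    RETURN false;
END;
$$;

-- e.g. is_valid_value_for_type('16390=arwdDxtm/16390', 'aclitem')
-- would return false once role 16390 no longer exists
```

With a helper of this shape the one-off DO-block-plus-GUC workaround above would not be needed, at the cost of one subtransaction per checked value.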
Re: DROP OWNED BY fails to clean out pg_init_privs grants
Attached is a minimal patch to allow missing roles in the REVOKE command.

This should fix the pg_upgrade issue, and also the case where somebody has dropped a role you are trying to revoke privileges from:

smalltest=# create table revoketest();
CREATE TABLE
smalltest=# revoke select on revoketest from bob;
WARNING:  ignoring REVOKE FROM a missing role "bob"
REVOKE
smalltest=# create user bob;
CREATE ROLE
smalltest=# grant select on revoketest to bob;
GRANT
smalltest=# \du
                              List of roles
 Role name |                        Attributes
-----------+------------------------------------------------------------
 bob       |
 hannuk    | Superuser, Create role, Create DB, Replication, Bypass RLS

smalltest=# \dp
                                    Access privileges
 Schema |    Name    | Type  |    Access privileges    | Column privileges | Policies
--------+------------+-------+-------------------------+-------------------+----------
 public | revoketest | table | hannuk=arwdDxtm/hannuk +|                   |
        |            |       | bob=r/hannuk            |                   |
 public | vacwatch   | table |                         |                   |
(2 rows)

smalltest=# revoke select on revoketest from bob, joe;
WARNING:  ignoring REVOKE FROM a missing role "joe"
REVOKE
smalltest=# \dp
                                    Access privileges
 Schema |    Name    | Type  |   Access privileges    | Column privileges | Policies
--------+------------+-------+------------------------+-------------------+----------
 public | revoketest | table | hannuk=arwdDxtm/hannuk |                   |
 public | vacwatch   | table |                        |                   |
(2 rows)

On Sun, May 26, 2024 at 12:05 AM Hannu Krosing wrote:
>
> On Sat, May 25, 2024 at 4:48 PM Tom Lane wrote:
> >
> > Hannu Krosing writes:
> > > Having an pg_init_privs entry referencing a non-existing user is
> > > certainly of no practical use.
> >
> > Sure, that's not up for debate. What I think we're discussing
> > right now is
> >
> > 1. What other cases are badly handled by the pg_init_privs
> > mechanisms.
> >
> > 2. How much of that is practical to fix in v17, seeing that
> > it's all long-standing bugs and we're already past beta1.
> >
> > I kind of doubt that the answer to #2 is "all of it".
> > But perhaps we can do better than "none of it".
> Putting the fix either in pg_dump or making REVOKE tolerate
> non-existing users would definitely be the most practical / useful fixes,
> as these would actually allow pg_upgrade to v17 to work without
> changing anything in older versions.
>
> Currently one already can revoke a privilege that is not there in the
> first place, with the end state being that the privilege (still) does
> not exist.
> This does not even generate a warning.
>
> Extending this to revoking from users that do not exist does not seem
> any different on a conceptual level, though I understand that the
> implementation would be very different, as it needs to catch the user
> lookup error from a very different part of the code.
>
> That said, it would be better if we can have something that would be
> easy to backport -- something that would make pg_upgrade work for all
> supported versions.
> Making REVOKE silently ignore revoking from non-existing users would
> improve general robustness but could conceivably change behaviour if
> somebody relies on it in their workflows.
>
> Regards,
> Hannu

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 7abf3c2a74..aff91582d7 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -444,6 +444,9 @@ ExecuteGrantStmt(GrantStmt *stmt)
 	 * Convert the RoleSpec list into an Oid list. Note that at this point we
 	 * insert an ACL_ID_PUBLIC into the list if appropriate, so downstream
 	 * there shouldn't be any additional work needed to support this case.
+	 *
+	 * Allow missing grantees in case of REVOKE (!istmt.is_grant);
+	 * if no valid roles are found, return immediately.
 	 */
 	foreach(cell, stmt->grantees)
 	{
@@ -456,12 +459,20 @@ ExecuteGrantStmt(GrantStmt *stmt)
 				grantee_uid = ACL_ID_PUBLIC;
 				break;
 			default:
-				grantee_uid = get_rolespec_oid(grantee, false);
+				grantee_uid = get_rolespec_oid(grantee, !istmt.is_grant);
 				break;
 		}
-		istmt.grantees = lappend_oid(istmt.grantees, grantee_uid);
+		if (OidIsValid(grantee_uid))
+			istmt.grantees = lappend_oid(istmt.grantees, grantee_uid);
+		else
+
Re: DROP OWNED BY fails to clean out pg_init_privs grants
On Sat, May 25, 2024 at 4:48 PM Tom Lane wrote:
>
> Hannu Krosing writes:
> > Having an pg_init_privs entry referencing a non-existing user is
> > certainly of no practical use.
>
> Sure, that's not up for debate. What I think we're discussing
> right now is
>
> 1. What other cases are badly handled by the pg_init_privs
> mechanisms.
>
> 2. How much of that is practical to fix in v17, seeing that
> it's all long-standing bugs and we're already past beta1.
>
> I kind of doubt that the answer to #2 is "all of it".
> But perhaps we can do better than "none of it".

Putting the fix either in pg_dump or making REVOKE tolerate non-existing users would definitely be the most practical / useful fixes, as these would actually allow pg_upgrade to v17 to work without changing anything in older versions.

Currently one can already revoke a privilege that is not there in the first place, with the end state being that the privilege (still) does not exist. This does not even generate a warning.

Extending this to revoking from users that do not exist does not seem any different on a conceptual level, though I understand that the implementation would be very different, as it needs to catch the user lookup error from a very different part of the code.

That said, it would be better if we can have something that is easy to backport -- something that would make pg_upgrade work for all supported versions. Making REVOKE silently ignore revoking from non-existing users would improve general robustness, but could conceivably change behaviour if somebody relies on it in their workflows.

Regards,
Hannu
Re: DROP OWNED BY fails to clean out pg_init_privs grants
On Fri, May 24, 2024 at 10:00 PM Tom Lane wrote:
>
> Robert Haas writes:
> > On Fri, May 24, 2024 at 2:57 PM Tom Lane wrote:
> >> Doesn't seem right to me. That will give pg_dump the wrong idea
> >> of what the initial privileges actually were, and I don't see how
> >> it can construct correct delta GRANT/REVOKE on the basis of false
> >> information. During the dump reload, the extension will be
> >> recreated with the original owner (I think), causing its objects'
> >> privileges to go back to the original pg_init_privs values.
>
> > Oh! That does seem like it would make what I said wrong, but how would
> > it even know who the original owner was? Shouldn't we be recreating
> > the object with the owner it had at dump time?
>
> Keep in mind that the whole point here is for the pg_dump script to
> just say "CREATE EXTENSION foo", not to mess with the individual
> objects therein. So the objects are (probably) going to be owned by
> the user that issued CREATE EXTENSION.
>
> In the original conception, that was the end of it: what you got for
> the member objects was whatever state CREATE EXTENSION left behind.
> The idea of pg_init_privs is to support dump/reload of subsequent
> manual alterations of privileges for extension-created objects.
> I'm not, at this point, 100% certain that that's a fully realizable
> goal.

The issue became visible because pg_dump issued a bogus

REVOKE ALL ON TABLE public.pg_stat_statements FROM "16390";

Maybe the right place for a fix is in pg_dump, and the fix would be to *not* issue the REVOKE ALL ON ... FROM at all? Or alternatively change REVOKE to treat non-existing users as a no-op?

Also, the pg_init_privs entry should either go away or at least be changed at the point when the user referenced in it is dropped. Having a pg_init_privs entry referencing a non-existing user is certainly of no practical use. Or maybe we should change the user at that point to NULL or some special non-existing-user id?
> But I definitely think it's insane to expect that to work > without also tracking changes in the ownership of said objects. > > Maybe forbidding ALTER OWNER on extension-owned objects isn't > such a bad idea?
Re: DROP OWNED BY fails to clean out pg_init_privs grants
While the 'DROP OWNED BY fails to clean out pg_init_privs grants' issue is now fixed, we have a similar issue with REASSIGN OWNED BY that is still there.

Tested on a fresh git checkout from May 20th:

test=# create user privtestuser superuser;
CREATE ROLE
test=# set role privtestuser;
SET
test=# create extension pg_stat_statements;
CREATE EXTENSION
test=# select * from pg_init_privs where privtype = 'e';
 objoid | classoid | objsubid | privtype |                      initprivs
--------+----------+----------+----------+------------------------------------------------------
  16405 |     1259 |        0 | e        | {privtestuser=arwdDxtm/privtestuser,=r/privtestuser}
  16422 |     1259 |        0 | e        | {privtestuser=arwdDxtm/privtestuser,=r/privtestuser}
  16427 |     1255 |        0 | e        | {privtestuser=X/privtestuser}
(3 rows)

test=# reset role;
RESET
test=# reassign owned by privtestuser to hannuk;
REASSIGN OWNED
test=# select * from pg_init_privs where privtype = 'e';
 objoid | classoid | objsubid | privtype |                      initprivs
--------+----------+----------+----------+------------------------------------------------------
  16405 |     1259 |        0 | e        | {privtestuser=arwdDxtm/privtestuser,=r/privtestuser}
  16422 |     1259 |        0 | e        | {privtestuser=arwdDxtm/privtestuser,=r/privtestuser}
  16427 |     1255 |        0 | e        | {privtestuser=X/privtestuser}
(3 rows)

test=# drop user privtestuser;
DROP ROLE
test=# select * from pg_init_privs where privtype = 'e';
 objoid | classoid | objsubid | privtype |            initprivs
--------+----------+----------+----------+---------------------------------
  16405 |     1259 |        0 | e        | {16390=arwdDxtm/16390,=r/16390}
  16422 |     1259 |        0 | e        | {16390=arwdDxtm/16390,=r/16390}
  16427 |     1255 |        0 | e        | {16390=X/16390}
(3 rows)

This will cause pg_dump to produce something that can't be loaded back into the database:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements WITH SCHEMA public;
...
REVOKE ALL ON TABLE public.pg_stat_statements FROM "16390";
...

And this will, among other things, break pg_upgrade.

- Hannu

On Tue, Apr 30, 2024 at 6:40 AM David G. Johnston wrote:
>
> On Monday, April 29, 2024, Tom Lane wrote:
>>
>> "David G. Johnston" writes:
>> > My solution to this was to rely on the fact that the bootstrap superuser is
>> > assigned OID 10 regardless of its name.
>> >> Yeah, I wrote it that way to start with too, but reconsidered >> because >> >> (1) I don't like hard-coding numeric OIDs. We can avoid that in C >> code but it's harder to do in SQL. > > > If the tests don’t involve, e.g., the predefined role pg_monitor and its > grantor of the memberships in the other predefined roles, this indeed can be > avoided. So I think my test still needs to check for 10 even if some other > superuser is allowed to produce the test output since a key output in my case > was the bootstrap superuser and the initdb roles. > >> >> (2) It's not clear to me that this test couldn't be run by a >> non-bootstrap superuser. I think "current_user" is actually >> the correct thing for the role executing the test. > > > Agreed, testing against current_role is correct if the things being queried > were created while executing the test. I would need to do this as well to > remove the current requirement that my tests be run by the bootstrap > superuser. > > David J. >
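The dangling entries shown earlier in this thread could be spotted with a catalog query along these lines (a sketch, not from the original mails; it uses the built-in aclexplode() function, and grantee 0 means PUBLIC):

```sql
-- list pg_init_privs ACL entries whose grantor or grantee role no longer exists
SELECT p.objoid, p.classoid, p.objsubid,
       a.grantor, a.grantee, a.privilege_type
  FROM pg_init_privs p,
       LATERAL aclexplode(p.initprivs) AS a
 WHERE NOT EXISTS (SELECT 1 FROM pg_authid WHERE oid = a.grantor)
    OR (a.grantee <> 0
        AND NOT EXISTS (SELECT 1 FROM pg_authid WHERE oid = a.grantee));
```

A check of this shape is roughly what adding a detection step to pg_upgrade, as suggested upthread, would amount to.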
Re: Function and Procedure with same signature?
On Thu, Mar 7, 2024 at 5:46 PM Tom Lane wrote:
>
> Hannu Krosing writes:
> > On Sat, Feb 10, 2024 at 12:38 AM Tom Lane wrote:
> >> Worth noting perhaps that this is actually required by the SQL
> >> standard: per spec, functions and procedures are both "routines"
> >> and share the same namespace,
>
> > Can you point me to a place in the standard where it requires all
> > kinds of ROUTINES to be using the same namespace ?
>
> [ digs around a bit... ] Well, the spec is vaguer about this than
> I thought. It is clear on one thing: 11.60 conformance rules
> include ...

Thanks for the thorough analysis of the standard.

I went and looked at what the other relevant databases do in this space, based on their documentation.

TL;DR:

* MS SQL Server
  - no overloading allowed anywhere
* MySQL
  - no overloading
* Oracle
  - no overloading at top level
  - overloading and independent namespaces for functions and procedures
* Teradata
  - function overloading allowed
  - not clear from the documentation if this also applies to procedures
  - the function overloading docs do not mention possible clashes with procedure names anywhere
* DB2
  - function overloading fully supported
  - procedure overloading supported, but only for a distinct NUMBER OF ARGUMENTS

I'll try to get access to a Teradata instance to verify the above.

So at a high level, most other Serious Databases
- do support function overloading
- keep functions and procedures in separate namespaces

I still think that PostgreSQL having functions and procedures share the same namespace is an unneeded and unjustified restriction.

I plan to do some hands-on testing on Teradata and DB2 to understand this better, but my current thinking is that we should not be more restrictive than others unless there is a strong technical reason for it. And currently I do not see any.

> It could be argued that this doesn't prohibit having both a function
> and a procedure with the same data type list, only that you can't
> write ROUTINE when trying to drop or alter one.
> But I think that's
> just motivated reasoning. The paragraphs for being
> FUNCTION or PROCEDURE are exactly like the above except they say
> "exactly one function" or "exactly one procedure". If you argue that
> this text means we must allow functions and procedures with the same
> parameter lists, then you are also arguing that we must allow multiple
> functions with the same parameter lists, and it's just the user's
> tough luck if they need to drop one of them.

The main issue is not dropping them, but the inability to determine which one to call.

We already have this in the case of two overloaded functions with the same initial argument types where the rest have defaults:

---
hannuk=# create function get(i int, out o int) begin atomic select i; end;
CREATE FUNCTION
hannuk=# create function get(i int, j int default 0, out o int) begin atomic select i+j; end;
CREATE FUNCTION
hannuk=# select get(1);
ERROR:  function get(integer) is not unique
LINE 1: select get(1);
               ^
HINT:  Could not choose a best candidate function. You might need to add explicit type casts.
---

> A related point is that our tolerance for overloading routine
> names isn't unlimited: we don't allow duplicate names with the
> same list of input parameters, even if the output parameters are
> different.

This again has a good reason, as there would be many cases where you could not decide which one to call. Not allowing overloading based only on return types is common across all OO languages.

My point is that this does not apply to FUNCTION vs. PROCEDURE, as it is very clear from the CALL syntax which one is meant.

> This is not something that the SQL spec cares to
> address, but there are good ambiguity-avoidance reasons for it.
> I think limiting overloading so that a ROUTINE specification is
> unambiguous is also a good thing.

I think ROUTINE being unambiguous is not a very good goal. What if the next version of the standard introduces DROP DATABASE OBJECT?
> I remain unexcited about changing our definition of this.
> "Oracle allows it" is not something that has any weight in
> my view: they have made a bunch of bad design decisions
> as well as good ones, and I think this is a bad one.

Fully agree that "Oracle allows it" is a non-argument. My main point is that there is no strong reason to have objects which are distinct at the syntax level be in the same namespace.

# Oracle is actually much more restrictive in its top-level object namespace

All of the following share the same namespace: [Packages, Private synonyms, Sequences, Stand-alone procedures, Stand-alone stored functions, Tables, User-defined operators, User-defined types, Views]. (I guess this makes the parser easier to write.)

The equivalent in PostgreSQL would be [extensions, schemas, tabl
CREATE TABLE creates a composite type corresponding to the table row, which is and is not there
I could not find any explanation of the following behaviour in the docs.

Our documentation for CREATE TABLE says:

    CREATE TABLE also automatically creates a data type that represents the
    composite type corresponding to one row of the table. Therefore, tables
    cannot have the same name as any existing data type in the same schema.

But these composite types are only sometimes there:

hannuk=# CREATE TABLE pair(a int, b int);
CREATE TABLE
hannuk=# INSERT INTO pair VALUES(1,2);
INSERT 0 1
hannuk=# select pg_typeof(p) from pair as p;
 pg_typeof
-----------
 pair
hannuk=# select pg_typeof(pg_typeof(p)) from pair as p;
 pg_typeof
-----------
 regtype

# first case where I can not use the table-defined type

hannuk=# create table anoter_pair of pair;
ERROR:  type pair is not a composite type

# the type definitely is there as promised

hannuk=# create type pair as (a int, b int);
ERROR:  type "pair" already exists

# and I can create a similar type with another name and use it to create a table

hannuk=# create type pair2 as (a int, b int);
CREATE TYPE
hannuk=# create table anoter_pair of pair2;
CREATE TABLE

# and I can even use it in LIKE

hannuk=# CREATE TABLE pair3(like pair2);
CREATE TABLE

# the type is present in pg_type with typtype 'c' for Composite

hannuk=# select typname, typtype from pg_type where typname = 'pair';
 typname | typtype
---------+---------
 pair    | c
(1 row)

# and I can add a comment to the type

hannuk=# COMMENT ON TYPE pair is 'A Shroedingers type';
COMMENT

# but \dT does not show it (second case)

hannuk=# \dT pair
     List of data types
 Schema | Name | Description
--------+------+-------------
(0 rows)

---
Hannu
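For what it's worth, the difference between the two cases is visible via pg_class: a table's row type is tied to a relation of kind 'r' (ordinary table), while CREATE TABLE ... OF and \dT only accept free-standing composite types, whose backing relation has kind 'c'. A quick way to see this, assuming the pair/pair2 session above:

```sql
-- compare the relation kind behind each composite type
SELECT t.typname, t.typtype, c.relkind
  FROM pg_type t
  JOIN pg_class c ON c.oid = t.typrelid
 WHERE t.typname IN ('pair', 'pair2');
-- 'pair' should show relkind 'r' (table row type),
-- 'pair2' relkind 'c' (stand-alone composite type)
```

So both show typtype 'c' in pg_type, and the distinction that drives the errors above lives one join away, in relkind.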
Re: Function and Procedure with same signature?
Hi Tom,

On Sat, Feb 10, 2024 at 12:38 AM Tom Lane wrote:
>
> "David G. Johnston" writes:
> > On Fri, Feb 9, 2024, 12:05 Deepak M wrote:
> >> Folks, When tried to create a function with the same signature as
> >> procedure it fails.
>
> > That seems like a good hint you cannot do it. Specifically because they
> > get defined in the same internal catalog within which names must be unique.

The fact that we currently enforce it this way seems like an implementation deficiency that should be fixed.

Bringing this up again, as there seems to be no good reason to have this restriction: there is no place in the call syntax where it would be unclear whether a FUNCTION or a PROCEDURE is meant.

And at least Oracle does allow this (and possibly others do too), so having this unnecessary restriction in PostgreSQL makes migrating Oracle applications that use this feature needlessly complicated.

> Worth noting perhaps that this is actually required by the SQL
> standard: per spec, functions and procedures are both "routines"
> and share the same namespace,

Can you point me to a place in the standard where it requires all kinds of ROUTINES to be using the same namespace?

All I could find was that ROUTINES are either FUNCTIONS, PROCEDURES or METHODS, and then samples of their usage which made clear that all three are different and their usage is disjoint at the syntax level.

As for DROP ROUTINE, we could just raise an error and recommend using the more specific DROP FUNCTION or DROP PROCEDURE if there is ambiguity.

--
Best Regards
Hannu
Re: pgbench - adding pl/pgsql versions of tests
My justification for adding pl/pgsql tests as part of the immediately available tests is that pl/pgsql itself is always enabled, so having a no-effort way to test its performance would be really helpful.

We also should have "tpcb-like as an SQL function" to round out the "test what's available in the server" set.

---
Hannu

On Fri, Feb 2, 2024 at 9:44 PM Robert Haas wrote:
>
> On Tue, Aug 15, 2023 at 11:41 AM Nathan Bossart wrote:
> > Why's that? I'm not aware of any project policy that prohibits such
> > enhancements to pgbench. It might take some effort to gather consensus on
> > a proposal like this, but IMHO that doesn't mean we shouldn't try. If the
> > prevailing wisdom is that we shouldn't add more built-in scripts because
> > there is an existing way to provide custom ones, then it's not clear that
> > we should proceed with $SUBJECT, anyway.
>
> I don't think there's a policy against adding more built-in scripts to
> pgbench, but I'm skeptical of such efforts because I don't see how to
> decide which ones are worthy of inclusion and which are not. Adding
> everyone's favorite thing will be too cluttered, and adding nothing
> forecloses nothing because people can always provide their own. If we
> could establish that certain custom scripts are widely used across
> many people, then those might be worth adding.
>
> I have a vague recollection of someone proposing something similar to
> this in the past, possibly Jeff Davis. If there is in fact a paper
> trail showing that the same thing has been proposed more than once by
> unrelated people, that would be a point in favor of adding that
> particular thing.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
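For illustration, the tpcb-like transaction wrapped into a pl/pgsql function might look roughly like this (a sketch only, not the attached patch; pgbench's standard table names are assumed, and the function name is made up):

```sql
-- tpcb-like transaction body as a single pl/pgsql function
CREATE FUNCTION pgbench_tpcb_like(p_aid int, p_bid int, p_tid int, p_delta int)
RETURNS int LANGUAGE plpgsql AS $$
DECLARE
    v_abalance int;
BEGIN
    UPDATE pgbench_accounts SET abalance = abalance + p_delta WHERE aid = p_aid;
    SELECT abalance INTO v_abalance FROM pgbench_accounts WHERE aid = p_aid;
    UPDATE pgbench_tellers  SET tbalance = tbalance + p_delta WHERE tid = p_tid;
    UPDATE pgbench_branches SET bbalance = bbalance + p_delta WHERE bid = p_bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
    VALUES (p_tid, p_bid, p_aid, p_delta, now());
    RETURN v_abalance;
END;
$$;

-- a custom pgbench script would then shrink to something like:
--   \set aid random(1, 100000 * :scale)
--   \set bid random(1, 1 * :scale)
--   \set tid random(1, 10 * :scale)
--   \set delta random(-5000, 5000)
--   SELECT pgbench_tpcb_like(:aid, :bid, :tid, :delta);
```

The point of shipping such a variant built in would be measuring pl/pgsql call and execution overhead against the plain-SQL tpcb-like script, with zero setup effort.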
Re: pgbench - adding pl/pgsql versions of tests
Thanks for the update. I will give it another go over the weekend Cheers, Hannu On Thu, Feb 1, 2024 at 7:33 PM vignesh C wrote: > On Fri, 18 Aug 2023 at 23:04, Hannu Krosing wrote: > > > > I will address the comments here over this coming weekend. > > The patch which you submitted has been awaiting your attention for > quite some time now. As such, we have moved it to "Returned with > Feedback" and removed it from the reviewing queue. Depending on > timing, this may be reversible. Kindly address the feedback you have > received, and resubmit the patch to the next CommitFest. > > Regards, > Vignesh >
Re: Emitting JSON to file using COPY TO
> On Sat, Dec 2, 2023 at 4:11 PM Tom Lane wrote:
> > Joe Conway writes:
> >> I noticed that, with the PoC patch, "json" is the only format that must be
> >> quoted. Without quotes, I see a syntax error.

In the longer term we should move any format-specific COPY flag names and values out of the grammar, and their checking into the parts that actually implement whatever the flag is influencing.

Similar to what we do with OPTIONS at all levels of FDW definitions (the WRAPPER itself, SERVER, USER MAPPING, FOREIGN TABLE) [*].

[*] https://www.postgresql.org/docs/current/sql-createforeigndatawrapper.html

> >> I'm assuming there's a
> >> conflict with another json-related rule somewhere in gram.y, but I haven't
> >> tracked down exactly which one is causing it.
>
> While I've not looked too closely, I suspect this might be due to the
> FORMAT_LA hack in base_yylex:
>
>     /* Replace FORMAT by FORMAT_LA if it's followed by JSON */
>     switch (next_token)
>     {
>         case JSON:
>             cur_token = FORMAT_LA;
>             break;
>     }

My hope is that turning the WITH clause into a fully independent part with no grammar-defined keys or values would also solve the issue of quoting "json".

For backwards compatibility we may even go the route of keeping WITH as is, but adding OPTIONS, which can take any values at the grammar level.

I shared my "Pluggable Copy" talk slides from Berlin '22 in another thread.

-- Hannu
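For reference, the FDW-style OPTIONS clause being pointed to looks like this (an example in the spirit of the linked docs; the server name and option values are made up):

```sql
-- at the grammar level, OPTIONS is just free-form key/value pairs;
-- the FDW's validator function checks them, not gram.y
CREATE SERVER film_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'example.com', dbname 'films', port '5432');
```

Under such a scheme a new COPY format name like json would be an ordinary option value, with no need to quote it or teach the grammar about it.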
Why are wal_keep_size, max_slot_wal_keep_size requiring server restart?
Hello fellow Hackers,

Does anyone know why we have decided that the wal_keep_size and max_slot_wal_keep_size GUCs "can only be set in the postgresql.conf file or on the server command line" [1]?

It does not seem fundamentally needed, as they are "kind of guidance", especially the second one.

The first one -- wal_keep_size -- could be something that is directly relied on in some code paths, so setting it in a live database could involve a two-step process, where you first set the value and then wait for all current transactions to finish before doing any operations that depend on the new value, like removing the WAL files no longer kept because of the earlier, larger value. Moving it up should need no extra action. Moving it up and then down immediately after could cause some interesting race conditions when you move it lower than it was in the beginning, so "wait for all transactions to finish" should apply in all cases.

For the second one -- max_slot_wal_keep_size -- I can not immediately come up with a scenario where just setting it could cause any unexpected consequences. If you set it to a value below a current slot's value, you *do* expect the slot to be invalidated. If you set it to a larger-than-current value, then infringing slots get more time to correct themselves.

Both behaviours would be much more useful if you did not have to restart the whole server to make adjustments.

---
[1] https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-WAL-KEEP-SIZE

Best Regards
Hannu
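As an aside, the change-level that actually applies to a GUC can be checked from pg_settings: context 'postmaster' means a full restart is required, while 'sighup' only needs a configuration reload (pg_reload_conf() or SIGHUP):

```sql
-- inspect how these two GUCs may be changed
SELECT name, context, setting, unit
  FROM pg_settings
 WHERE name IN ('wal_keep_size', 'max_slot_wal_keep_size');
```

So whatever the answer is, it should show up directly in the context column.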
Re: Allowing TRUNCATE of FK target when session_replication_role=replica
Thanks for the pointers.

One thing though, re:

> The former is true but the latter is not. Logical replication requires
> wal_level = logical. That's also true for skipping FSM.

wal_level=logical is only needed *on the provider* side, at least when running pglogical. Also, even for native logical replication it is possible to disconnect the initial copy from CDC streaming, in which case again you can set wal_level=minimal on the target side.

Will check [1] and [2] and come back with a more detailed proposal.

---
Best regards,
Hannu

On Tue, Oct 31, 2023 at 5:56 PM Euler Taveira wrote:
>
> On Tue, Oct 31, 2023, at 5:09 AM, Hannu Krosing wrote:
>
> Currently we do not allow TRUNCATE of a table when any Foreign Keys
> point to that table.
>
> It is allowed iff you *also* truncate all tables referencing it.
>
> At the same time we do allow one to delete all rows when
> session_replication_role=replica
>
> That's true.
>
> This causes all kinds of pain when trying to copy in large amounts of
> data, especially at the start of logical replication set-up, as many
> optimisations to COPY require the table to be TRUNCATEd.
>
> The main two are the ability to FREEZE while copying and the skipping of
> WAL generation in case of wal_level=minimal, both of which can achieve
> significant benefits when data amounts are large.
>
> Is there any reason to not allow TRUNCATE when
> session_replication_role=replica ?
>
> That's basically the same proposal as [1]. That patch was rejected because it
> was implemented in a different way that doesn't require
> session_replication_role = replica to bypass the FK checks.
> > There are at least 3 cases that can benefit from this feature: > > 1) if your scenario includes an additional table only in the subscriber > side that contains a foreign key to a replicated table then you will break > your > replication like > > ERROR: cannot truncate a table referenced in a foreign key constraint > DETAIL: Table "foo" references "bar". > HINT: Truncate table "foo" at the same time, or use TRUNCATE ... CASCADE. > CONTEXT: processing remote data for replication origin "pg_16406" during > message type "TRUNCATE" in transaction 12880, finished at 0/297FE08 > > and you have to manually fix your replication. If we allow > session_replication_role = replica to bypass FK check for TRUNCATE commands, > we > wouldn't have an error. I'm not saying that it is a safe operation for logical > replication scenarios. Maybe it is not because table foo will contain invalid > references to table bar and someone should fix it in the subscriber side. > However, the current implementation already allows such orphan rows due to > session_replication_role behavior. > > 2) truncate table at subscriber side during the initial copy. As you > mentioned, > this feature should take advantage of the FREEZE and FSM optimizations. There > was a proposal a few years ago [2]. > > 3) resynchronize a table. Same advantages as item 2. > > Unless there are any serious objections, I will send a patch to also > allow TRUNCATE in this case. > > > You should start checking the previous proposal [1]. > > > [1] > https://www.postgresql.org/message-id/ff835f71-3c6c-335e-4c7b-b9e1646cf3d7%402ndquadrant.it > [2] > https://www.postgresql.org/message-id/CF3B6672-2A43-4204-A60A-68F359218A9B%40endpoint.com > > > -- > Euler Taveira > EDB https://www.enterprisedb.com/ >
Allowing TRUNCATE of FK target when session_replication_role=replica
Hi,

Currently we do not allow TRUNCATE of a table when any Foreign Keys point to that table. At the same time we do allow one to delete all rows when session_replication_role=replica.

This causes all kinds of pain when trying to copy in large amounts of data, especially at the start of logical replication set-up, as many optimisations to COPY require the table to be TRUNCATEd. The main two are the ability to FREEZE while copying and the skipping of WAL generation in the case of wal_level=minimal, both of which can achieve significant benefits when data amounts are large.

Is there any reason to not allow TRUNCATE when session_replication_role=replica?

Unless there are any serious objections, I will send a patch to also allow TRUNCATE in this case.

Best Regards
Hannu
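The asymmetry being described can be reproduced with a pair of throwaway tables (names are illustrative):

```sql
CREATE TABLE bar (id int PRIMARY KEY);
CREATE TABLE foo (bar_id int REFERENCES bar (id));

SET session_replication_role = replica;

DELETE FROM bar;   -- allowed: replica mode skips the FK enforcement triggers
TRUNCATE bar;      -- still rejected:
-- ERROR:  cannot truncate a table referenced in a foreign key constraint
```

Both statements leave bar empty (and foo potentially orphaned) in exactly the same way, which is what makes the TRUNCATE-only restriction feel arbitrary in replica mode.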
Re: Initdb-time block size specification
Sure, I was just hoping that somebody already knew without needing to specifically check :) And as I see in David's response, the tests are actually broken for other sizes. I'll see if I can (convince somebody to) set this up . Cheers Hannu On Tue, Sep 5, 2023 at 10:23 PM Andres Freund wrote: > > Hi, > > On 2023-09-05 21:52:18 +0200, Hannu Krosing wrote: > > Something I also asked at this years Unconference - Do we currently > > have Build Farm animals testing with different page sizes ? > > You can check that yourself as easily as anybody else. > > Greetings, > > Andres Freund
Re: Initdb-time block size specification
Something I also asked at this year's Unconference - Do we currently have Build Farm animals testing with different page sizes ? I'd say that testing all sizes from 4KB up (so 4, 8, 16, 32) should be done at least before each release if not continuously. -- Cheers Hannu On Tue, Sep 5, 2023 at 4:31 PM David Christensen wrote: > > On Tue, Sep 5, 2023 at 9:04 AM Robert Haas wrote: > > > > On Sat, Sep 2, 2023 at 3:09 PM Tomas Vondra > > wrote: > > > Except that no one is forcing you to actually go 130mph or 32mph, right? > > > You make it seem like this patch forces people to use some other page > > > size, but that's clearly not what it's doing - it gives you the option > > > to use smaller or larger block, if you chose to. Just like increasing > > > the speed limit to 130mph doesn't mean you can't keep going 65mph. > > > > > > The thing is - we *already* allow using different block size, except > > > that you have to do custom build. This just makes it easier. > > > > Right. Which is worth doing if it doesn't hurt performance and is > > likely to be useful to a lot of people, and is not worth doing if it > > will hurt performance and be useful to relatively few people. My bet > > is on the latter. > > Agreed that this doesn't make sense if there are major performance > regressions, however the goal here is patch evaluation to measure this > against other workloads and see if this is the case; from my localized > testing things were within acceptable noise levels with the latest > version. > > I agree with Tomas' earlier thoughts: we already allow different block > sizes, and if there are baked-in algorithmic assumptions about block > size (which there probably are), then identifying those or places in > the code where we need additional work or test coverage will only > improve things overall for those non-standard block sizes. > > Best, > > David > >
Re: pgbench - adding pl/pgsql versions of tests
I will address the comments here over this coming weekend. I think that in addition to the current "tpc-b like" test we could also have more modern "tpc-c like" and "tpc-h like" tests. And why not any other "*-like" from the rest of TPC-*, YCSB, sysbench, ... :) though maybe not as part of pgbench but as extensions ? --- Hannu On Wed, Aug 16, 2023 at 10:06 AM Fabien COELHO wrote: > > > Hello Nathan, > > >> I'm unclear about what variety of scripts that could be provided given the > >> tables made available with pgbench. ISTM that other scenari would involve > >> both an initialization and associated scripts, and any proposal would be > >> bared because it would open the door to anything. > > > > Why's that? > > Just a wild guess based on 19 years of occasional contributions to pg and > pgbench in particular:-) > > > I'm not aware of any project policy that prohibits such enhancements to > > pgbench. > > Attempts in extending pgbench often fall under "you can do it outside (eg > with a custom script) so there is no need to put that in pgbench as it > would add to the maintenance burden with a weak benefit proven by the fact > that it is not there already". > > > It might take some effort to gather consensus on a proposal like this, > > but IMHO that doesn't mean we shouldn't try. > > Done it in the past. Probably will do it again in the future:-) > > > If the prevailing wisdom is that we shouldn't add more built-in scripts > > because there is an existing way to provide custom ones, then it's not > > clear that we should proceed with $SUBJECT, anyway. > > I'm afraid there is that argument. I do not think that this policy is good > wrt $SUBJECT, ISTM that having an easy way to test something with a > PL/pgSQL function would help promote the language by advertising/showing > the potential performance benefit (or not, depending). Just one function > would be enough for that. > > -- > Fabien.
Re: How to build a new grammer for pg?
I would look at how Babelfish DB did it when adding SQL Server compatibility https://babelfishpg.org/ and https://github.com/babelfish-for-postgresql/ another source to inspect could be https://github.com/IvorySQL/IvorySQL for "oracle compatible PostgreSQL" On Tue, Aug 1, 2023 at 10:07 PM Jonah H. Harris wrote: > > On Tue, Aug 1, 2023 at 3:45 PM Andrew Dunstan wrote: >> >> Or to enable some language other than SQL (QUEL anyone?) > > > A few years ago, I got a minimal POSTQUEL working again to release as a patch > for April Fools' Day, which I never did. I should dig that up somewhere :) > > Anyway, as far as OP's original question regarding replacing the grammar, > there are a couple of such implementations floating around that have done > that. But, I actually think the pluggable parser patches were good examples > of how to integrate a replacement parser that generates the expected parse > tree nodes for anyone who wants to do their own custom parser. See Julien > Rouhaud's SQLOL in the "Hook for extensible parsing" thread and Jim > Mlodgenski's "Parser Hook" thread. > > -- > Jonah H. Harris >
Re: incremental-checkopints
On Wed, Jul 26, 2023 at 9:54 PM Matthias van de Meent wrote: > > Then you ignore the max_wal_size GUC as PostgreSQL so often already > does. At least, it doesn't do what I expect it to do at face value - > limit the size of the WAL directory to the given size. That would require stopping any new writes once WAL size == max_wal_size until the checkpoint is completed. I don't think anybody would want that. > But more reasonably, you'd keep track of the count of modified pages > that are yet to be fully WAL-logged, and keep that into account as a > debt that you have to the current WAL insert pointer when considering > checkpoint distances and max_wal_size. I think Peter Geoghegan has worked on a somewhat similar approach to account for "accumulated work needed until some desired outcome", though I think it was on the VACUUM side of things.
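Matthias's "debt" accounting idea can be sketched as a toy model. Everything below is illustrative only; the function name, the 0.9 trigger threshold, and the numbers are invented for the example and are not PostgreSQL internals:

```python
# Toy model: when deciding whether to trigger a checkpoint, count not just
# the WAL already written but also the full-page images we would still owe
# if every dirty-but-not-yet-FPI'd page had to be logged at write-out time.
BLCKSZ = 8192  # PostgreSQL's default block size

def wal_budget_used(wal_written, dirty_pages_without_fpi, max_wal_size):
    """Fraction of max_wal_size consumed, counting pending FPIs as debt."""
    fpi_debt = dirty_pages_without_fpi * BLCKSZ
    return (wal_written + fpi_debt) / max_wal_size

# 512 MB of WAL written, 100,000 dirty pages still owing an FPI,
# max_wal_size = 1 GB: the debt pushes us well past the naive 50% mark.
used = wal_budget_used(512 * 1024**2, 100_000, 1024**3)
trigger_checkpoint = used >= 0.9
```

With the debt included, the checkpointer would start early enough that the deferred FPIs cannot blow past max_wal_size later.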
Re: incremental-checkopints
Starting from incremental checkpoints is approaching the problem from the wrong end. What you actually want is atomic disk writes, which would allow turning off full_page_writes. Without this you really cannot do incremental checkpoints efficiently, as checkpoints are currently what is used to determine when "the first write to a page after a checkpoint" happens and thereby when a full page write is needed. On Wed, Jul 26, 2023 at 8:58 PM Tomas Vondra wrote: > > > > On 7/26/23 15:16, Matthias van de Meent wrote: > > On Wed, 26 Jul 2023 at 14:41, Alvaro Herrera > > wrote: > >> > >> Hello > >> > >> On 2023-Jul-26, Thomas wen wrote: > >> > >>> Hi Hackes: I found this page : > >>> https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints,PostgreSQL > >>> no incremental checkpoints have been implemented so far. When a > >>> checkpoint is triggered, the performance jitter of PostgreSQL is very > >>> noticeable. I think incremental checkpoints should be implemented as > >>> soon as possible > >> > >> I think my first question is why do you think that is necessary; there > >> are probably other tools to achieve better performance. For example, > >> you may want to try making checkpoint_completion_target closer to 1, and > >> the checkpoint interval longer (both checkpoint_timeout and > >> max_wal_size). Also, changing shared_buffers may improve things. You > >> can try adding more RAM to the machine. > > > > Even with all those tuning options, a significant portion of a > > checkpoint's IO (up to 50%) originates from FPIs in the WAL, which (in > > general) will most often appear at the start of each checkpoint due to > > each first update to a page after a checkpoint needing an FPI. > > Yeah, FPIs are certainly expensive and can represent huge part of the > WAL produced. But how would incremental checkpoints make that step > unnecessary?
> > > If instead we WAL-logged only the pages we are about to write to disk > > (like MySQL's double-write buffer, but in WAL instead of a separate > > cyclical buffer file), then a checkpoint_completion_target close to 1 > > would probably solve the issue, but with "WAL-logged torn page > > protection at first update after checkpoint" we'll probably always > > have higher-than-average FPI load just after a new checkpoint. > > > > So essentially instead of WAL-logging the FPI on the first change, we'd > only do that later when actually writing-out the page (either during a > checkpoint or because of memory pressure)? How would you make sure > there's enough WAL space until the next checkpoint? I mean, FPIs are a > huge write amplification source ... > > Imagine the system has max_wal_size set to 1GB, and does 1M updates > before writing 512MB of WAL and thus triggering a checkpoint. Now it > needs to write FPIs for 1M updates - easily 8GB of WAL, maybe more with > indexes. What then? > > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > >
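Tomas's back-of-the-envelope number is easy to verify. A quick sketch of the arithmetic (heap pages only, ignoring index pages and WAL record overhead, which only make it worse):

```python
# 1M updates touching 1M distinct heap pages, each page owing one
# full-page image of BLCKSZ bytes at the default 8 KB block size.
BLCKSZ = 8192
updates = 1_000_000

heap_fpi_bytes = updates * BLCKSZ
heap_fpi_gib = heap_fpi_bytes / 1024**3   # ~7.6 GiB of FPIs, heap alone
```

So with max_wal_size = 1 GB, deferring the FPIs to write-out time does not shrink them; roughly 8 GB of torn-page protection still has to land somewhere before the checkpoint can complete.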
Re: Example Table AM implementation
Thanks a lot Mark, I will take a look at this and get back to you if I find anything unclear --- Hannu On Tue, Jul 4, 2023 at 10:14 PM Mark Dilger wrote: > > Hackers, > > Over in [1], Hannu Krosing asked me to create and post several Table Access > Methods for testing/example purposes. I am fairly happy to do so, but each > one is large, and should be considered separately for inclusion/rejection in > contrib/, or in src/test/modules as Michael Paquier suggests. As such, I am > starting this new email thread for the first such TAM. I've named it "pile", > which is an English synonym of "heap", and which is also four characters in > length, making for easier side-by-side diffs with the heap code. The pile > code is a deep copy of the heap code, meaning that pile functions do not call > heap code, nor run the in-core regression tests, but rather pile's own > modified copy of the heap code, the regression tests, and even the test data. > Rather than creating a bare-bones skeleton which needs to be populated with > an implementation and regression tests, this patch instead offers a fully > fleshed out TAM which can be pruned down to something reasonably compact once > the user changes it into whatever they want it to be. To reiterate, the > patch is highly duplicative of in-core files. > > Hannu, I'm happy to post something like this three times again, for the named > TAMs you request, but could you first review this patch and maybe try turning > it into something else, such as the in memory temp tables, overlay tables, or > python based tables that you mentioned in [1]? Anything that needs to be > changed to make similar TAMs suitable for the community should be discussed > prior to spamming -hackers with more TAMs. Thanks. > > > [1] > https://www.postgresql.org/message-id/CAMT0RQQXtq8tgVPdFb0mk4v%2BcVuGvPWk1Oz9LDr0EgBfrV6e6w%40mail.gmail.com > > > > — > Mark Dilger > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > >
Including a sample Table Access Method with core code
At PgCon 2023 in Ottawa we had an Unconference session on Table Access Methods [1] One thing that was briefly mentioned (but is missing from the notes) is the need to have a sample API client in contrib/, both to have a 2nd user of the API (making it more likely that non-heap AMs are doable) and to serve as an easy starting point for someone interested in developing a new AM. There are a few candidates which could be lightweight enough for this * in-memory temp tables, especially if you specify the max table size at creation and/or limit the data types which can be used. * "overlay tables" - tables which "overlay" another - possibly read-only - table and store only changed rows and tombstones for deletions. (this would likely make more sense as an FDW, as the Table AM currently knows nothing about Primary Keys and these are likely needed for overlays) * Table AM as a (pl/)Python Class - this is inspired by the amazing Multicorn [2] FDW-in-Python tool which made it ridiculously easy to expose anything (mailbox, twitter feed, git commit history, you-name-it) as a Foreign Table Creating any of these seems to be a project of a size suitable for a student course project or maybe Google Summer of Code [3]. I have included Mark Dilger directly on this mail as he mentioned he has a Perl script that makes a functional copy of the heap AM that can be compiled and installed as a custom AM. @mark - maybe you can create 3 boilerplate Table AMs for the above, named `mem_am`, `overlay_am` and `py3_am`, and we could put them somewhere for interested parties to play with ? [1] https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference#Table_AMs [2] https://multicorn.org/ - unfortunately unmaintained since 2016, but there are some forks supporting later PostgreSQL versions [3] https://wiki.postgresql.org/wiki/GSoC - Google Summer of Code --- Best Regards Hannu
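None of these sample AMs exist yet, so as a thought experiment only, here is a deliberately naive Python sketch of the surface an in-memory/overlay AM would need to cover (insert, delete via tombstones, scan, truncate, and a size cap fixed at creation). The real Table AM interface is the C TableAmRoutine callback struct, not a class like this:

```python
class ToyMemTable:
    """Toy stand-in for an in-memory table AM: rows live only in RAM,
    with tombstones recording deletions (as an overlay AM would need)."""

    def __init__(self, max_rows=None):
        self.rows = {}            # tid -> row tuple
        self.tombstones = set()   # tids deleted since creation
        self.next_tid = 0
        self.max_rows = max_rows  # size fixed at creation, per the proposal

    def insert(self, row):
        if self.max_rows is not None and len(self.rows) >= self.max_rows:
            raise MemoryError("table full")
        tid = self.next_tid
        self.next_tid += 1
        self.rows[tid] = row
        return tid

    def delete(self, tid):
        self.rows.pop(tid, None)
        self.tombstones.add(tid)

    def scan(self):
        return [row for tid, row in sorted(self.rows.items())
                if tid not in self.tombstones]

    def truncate(self):
        self.rows.clear()
        self.tombstones.clear()

t = ToyMemTable(max_rows=2)
t.insert(("a",))
tid_b = t.insert(("b",))
t.delete(tid_b)
```

Even this toy shows why an overlay AM is awkward without primary-key awareness: the tombstones here are keyed by tid, whereas overlaying another table would need logical row identity.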
Re: Let's make PostgreSQL multi-threaded
One more unexpected benefit of having shared caches would be easing access to other databases. If the system caches are there for all databases anyway, then it becomes much easier to make queries using objects from multiple databases. Note that this does not strictly need threads, just shared caches. On Thu, Jun 15, 2023 at 11:04 AM Hannu Krosing wrote: > > On Thu, Jun 15, 2023 at 10:41 AM James Addison wrote: > > > > This is making me wonder about other performance/scalability areas > > that might not have been considered due to focus on the details of the > > existing codebase, but I'll save that for another thread and will try > > to learn more first. > > A gradual move to more shared structures seems to be a way forward > > It should get us all the benefits of threading minus the need for TLB > reloading and (in some cases) reduction of per-process virtual memory > mapping tables. > > In any case we would need to implement all the locking and parallelism > management of these shared structures that are not there in the > current process architecture. > > So a fair bit of work but also a clearly defined benefits of > 1) reduced memory usage > 2) no need to rebuild caches for each new connection > 3) no need to track PREPARE statements inside connection poolers. > > There can be extra complexity when different connections use the same > prepared statement name (say "PREP001") for different queries. > For this wel likely will need a good cooperation with connection > pooler where it passes some kind of client connection id along at the > transaction start
Re: Let's make PostgreSQL multi-threaded
On Thu, Jun 15, 2023 at 10:41 AM James Addison wrote: > > This is making me wonder about other performance/scalability areas > that might not have been considered due to focus on the details of the > existing codebase, but I'll save that for another thread and will try > to learn more first. A gradual move to more shared structures seems to be a way forward. It should get us all the benefits of threading minus the need for TLB reloading and (in some cases) a reduction of per-process virtual memory mapping tables. In any case we would need to implement all the locking and parallelism management of these shared structures that is not there in the current process architecture. So a fair bit of work, but also clearly defined benefits: 1) reduced memory usage 2) no need to rebuild caches for each new connection 3) no need to track PREPARE statements inside connection poolers. There can be extra complexity when different connections use the same prepared statement name (say "PREP001") for different queries. For this we will likely need good cooperation with the connection pooler, where it passes some kind of client connection id along at transaction start
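The naming collision described above can be illustrated with a toy cache. The sketch below is hypothetical throughout: the key layout and the client_id supplied by the pooler are assumptions for illustration, not an existing PostgreSQL or pooler mechanism:

```python
class SharedPreparedCache:
    """Toy sketch of a server-side prepared-statement cache shared across
    pooled connections.  Keying only by (database, user, name) breaks when
    two clients both use "PREP001" for different queries; including a
    client-connection id handed over by the pooler disambiguates them."""

    def __init__(self):
        self.cache = {}

    def prepare(self, db, user, client_id, name, query):
        self.cache[(db, user, client_id, name)] = query

    def lookup(self, db, user, client_id, name):
        return self.cache.get((db, user, client_id, name))

c = SharedPreparedCache()
# Two pooled clients reuse the same statement name for different queries.
c.prepare("appdb", "bob", "client-1", "PREP001", "SELECT 1")
c.prepare("appdb", "bob", "client-2", "PREP001", "SELECT 2")
```

Without the client_id component both PREPAREs would land on the same key and the second would silently overwrite the first.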
Re: Let's make PostgreSQL multi-threaded
On Thu, Jun 15, 2023 at 9:12 AM Konstantin Knizhnik wrote: > There are three different but related directions of improving current > Postgres: > 1. Replacing processes with threads Here we could likely start with making parallel query multi-threaded. This would also remove the big blocker for parallelizing things like CREATE TABLE AS SELECT ... where we are currently held back by the restriction that only the leader process can write. > 2. Builtin connection pooler Would definitely be a nice thing to have. And we could even start by integrating a non-threaded pooler like pgbouncer to run as a postgresql worker process (or two). > 3. Lightweight backends (shared catalog/relation/prepared statements caches) Shared prepared statement caches (which of course have to be per-user and per-database) would give the additional benefit of lightweight connection poolers not needing to track these. Currently the missing support for named prepared statements is one of the main hindrances to using pgbouncer with JDBC in transaction pooling mode (you can use it, but have to turn off automatic statement preparing) > > The motivation for such changes are also similar: > 1. Increase Postgres scalability > 2. Reduce memory consumption > 3. Make Postgres better fit cloud and serverless requirements The memory consumption reduction would be a big and clear win for many workloads. Also, just moving more things into shared memory will prepare us for a move to a threaded server (if it eventually happens) > I am not sure now which one should be addressed first or them can be done > together. Shared caches seem like a guaranteed win at least on memory usage. There could be performance (and complexity) downsides for specific workloads, but they would be the same as for the threaded model, so would also be a good learning opportunity. > Replacing static variables with thread-local is the first and may be the > easiest step.
I think we got our first patch doing this (as part of patches for running PG threaded on Solaris) quite early in the OSS development, possibly even in the last century :) > It requires more or less mechanical changes. More challenging thing is > replacing private per-backend data structures > with shared ones (caches, file descriptors,...) Indeed, sharing caches would also be part of the work needed for the threaded model, so anyone feeling strongly about moving to threads could start with this :) --- Hannu
Re: Let's make PostgreSQL multi-threaded
On Tue, Jun 13, 2023 at 9:55 AM Kyotaro Horiguchi wrote: > > At Tue, 13 Jun 2023 09:55:36 +0300, Konstantin Knizhnik > wrote in > > Postgres backend is "thick" not because of large number of local > > variables. > > It is because of local caches: catalog cache, relation cache, prepared > > statements cache,... > > If they are not rewritten, then backend still may consume a lot of > > memory even if it will be thread rather then process. > > But threads simplify development of global caches, although it can be > > done with DSM. > > With the process model, that local stuff are flushed out upon > reconnection. If we switch to the thread model, we will need an > expiration mechanism for those stuff. The part that cannot be so easily solved is that "the local stuff" can include some leakage that is not directly controlled by us. I remember a few times when memory leaks in some PostGIS packages caused slow memory exhaustion, and the simple fix was limiting connection lifetime to something between 15 min and an hour. The main problem here is that PostGIS uses a few tens of other GPL GIS-related packages which are all changing independently, and thus it is quite hard to be sure that none of these has developed a leak. And you also likely cannot just stop upgrading these, as they also contain security fixes. I have no idea what the fix could be in the case of a threaded server.
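The lifetime-capping workaround mentioned above is essentially what pooler settings such as pgbouncer's server_lifetime do. A minimal sketch of the recycling logic (illustrative only, not pgbouncer's actual implementation):

```python
import time

class RecyclingPool:
    """Close server connections older than max_lifetime_s so that slow
    leaks in backend-loaded libraries cannot accumulate indefinitely."""

    def __init__(self, max_lifetime_s, clock=time.monotonic):
        self.max_lifetime_s = max_lifetime_s
        self.clock = clock          # injectable for testing
        self.conns = []             # list of (created_at, conn)

    def add(self, conn):
        self.conns.append((self.clock(), conn))

    def reap(self):
        now = self.clock()
        expired = [c for t, c in self.conns
                   if now - t > self.max_lifetime_s]
        self.conns = [(t, c) for t, c in self.conns
                      if now - t <= self.max_lifetime_s]
        return expired  # caller closes these, freeing any leaked memory

# Demo with a fake clock: a 30-minute lifetime cap.
fake_now = [0]
pool = RecyclingPool(1800, clock=lambda: fake_now[0])
pool.add("conn-1")
fake_now[0] = 1000
pool.add("conn-2")
fake_now[0] = 2000
expired = pool.reap()   # conn-1 is now 2000 s old and gets recycled
```

The process model gets this cleanup for free on backend exit; a threaded server would need exactly this kind of explicit policy for anything a leaky library allocates.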
Re: Let's make PostgreSQL multi-threaded
On Mon, Jun 5, 2023 at 4:52 PM Heikki Linnakangas wrote: > > If there are no major objections, I'm going to update the developer FAQ, > removing the excuses there for why we don't use threads [1]. I think it is not wise to start the wholesale removal of the objections there. But I think it is worthwhile to revisit the section about threads, maybe split out the historic part which is no longer true, and provide both pros and cons for the rest. I started with this short summary from the discussion in this thread, feel free to expand, argue, fix :) * marks a current excuse -- marks a counterargument or ack As an example, threads are not yet used instead of multiple processes for backends because: * Historically, threads were poorly supported and buggy. -- yes they were; not relevant now that threads are well-supported and non-buggy * An error in one backend can corrupt other backends if they're threads within a single process -- still valid for silent corruption -- for detected crash - yes, but we are restarting all backends in case of crash anyway. * Speed improvements using threads are small compared to the remaining backend startup time. -- we now have some measurements that show significant performance improvements not related to startup time * The backend code would be more complex.
-- this is still the case -- even more worrisome is that all extensions also need to be rewritten -- and many incompatibilities will be silent and take potentially years to find * Terminating backend processes allows the OS to cleanly and quickly free all resources, protecting against memory and file descriptor leaks and making backend shutdown cheaper and faster -- still true * Debugging threaded programs is much harder than debugging worker processes, and core dumps are much less useful -- this was countered by claiming that -- by now we have reasonable debugger support for threads -- there is no direct debugger support for debugging the exact system set up like PostgreSQL processes + shared memory * Sharing of read-only executable mappings and the use of shared_buffers means processes, like threads, are very memory efficient -- this seems to say that the current process model is as good as threads ? -- there were a few counterarguments -- per-backend virtual memory mapping can add up to significant amount of extra RAM usage -- the discussion did not yet touch various per-backend caches (pg_catalog cache, statement cache) which are arguably easier to implement in threaded model -- TLB reload at each process switch is expensive and would be mostly avoided in case of threads * Regular creation and destruction of processes helps protect against memory fragmentation, which can be hard to manage in long-running processes -- probably still true -
Re: Let's make PostgreSQL multi-threaded
I discovered this thread from a Twitter post "PostgreSQL will finally be rewritten in Rust" :) On Mon, Jun 5, 2023 at 5:18 PM Tom Lane wrote: > > Heikki Linnakangas writes: > > I spoke with some folks at PGCon about making PostgreSQL multi-threaded, > > so that the whole server runs in a single process, with multiple > > threads. It has been discussed many times in the past, last thread on > > pgsql-hackers was back in 2017 when Konstantin made some experiments [0]. > > > I feel that there is now pretty strong consensus that it would be a good > > thing, more so than before. Lots of work to get there, and lots of > > details to be hashed out, but no objections to the idea at a high level. > > > The purpose of this email is to make that silent consensus explicit. If > > you have objections to switching from the current multi-process > > architecture to a single-process, multi-threaded architecture, please > > speak up. > > For the record, I think this will be a disaster. There is far too much > code that will get broken, largely silently, and much of it is not > under our control. > > regards, tom lane > >
Re: Let's make PostgreSQL multi-threaded
On Thu, Jun 8, 2023 at 4:56 PM Robert Haas wrote: > > On Thu, Jun 8, 2023 at 8:44 AM Hannu Krosing wrote: > > > That sounds like a bad idea, dynamic shared memory is more expensive > > > to maintain than our static shared memory systems, not in the least > > > because DSM is not guaranteed to share the same addresses in each > > > process' address space. > > > > Then this too needs to be fixed > > Honestly, I'm struggling to respond to this non-sarcastically. I mean, > I was the one who implemented DSM. Do you think it works the way that > it works because I considered doing something smart and decided to do > something dumb instead? No, I meant that this needs to be fixed at the OS level, by being able to use the same mapping. We should not shy away from asking the OS people to add the useful features still missing. It was mentioned in the Unconference Kernel Hacker AMA talk, and the said kernel hacker works for Oracle, and they also seemed to need this :)
Re: Let's make PostgreSQL multi-threaded
On Thu, Jun 8, 2023 at 2:15 PM Matthias van de Meent wrote: > > On Thu, 8 Jun 2023 at 11:54, Hannu Krosing wrote: > > > > On Wed, Jun 7, 2023 at 11:37 PM Andres Freund wrote: > > > > > > Hi, > > > > > > On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote: > > > > 2. While I wouldn't want to necessarily discourage a moonshot effort, I > > > > would ask if developer time could be better spent on tackling some of > > > > the > > > > other problems around vertical scalability? Per some PGCon discussions, > > > > there's still room for improvement in how PostgreSQL can best utilize > > > > resources available very large "commodity" machines (a 448-core / 24TB > > > > RAM > > > > instance comes to mind). > > > > > > I think we're starting to hit quite a few limits related to the process > > > model, > > > particularly on bigger machines. The overhead of cross-process context > > > switches is inherently higher than switching between threads in the same > > > process - and my suspicion is that that overhead will continue to > > > increase. Once you have a significant number of connections we end up > > > spending > > > a *lot* of time in TLB misses, and that's inherent to the process model, > > > because you can't share the TLB across processes. > > > > > > This part was touched in the "AMA with a Linux Kernale Hacker" > > Unconference session where he mentioned that the had proposed a > > 'mshare' syscall for this. > > > > So maybe a more fruitful way to fixing the perceived issues with > > process model is to push for small changes in Linux to overcome these > > avoiding a wholesale rewrite ? > > We support not just Linux, but also Windows and several (?) BSDs. I'm > not against pushing Linux to make things easier for us, but Linux is > an open source project, too, where someone need to put in time to get > the shiny things that you want. And I'd rather see our time spent in > PostgreSQL, as Linux is only used by a part of our user base. 
Do we have any statistics for the distribution of our user base ? My gut feeling says that for performance-critical use the non-Linux share is in the low single digits at best. My fascination for OpenSource started with the realisation that instead of workarounds you can actually fix the problem at the source. So if the specific problem is that the TLB is not shared, then the proper fix is making it shared instead of rewriting everything else to get around it. None of us is limited to writing code in PostgreSQL only. If the easiest and most generic fix can be done in Linux then so be it. It is also possible that Windows and *BSD already have a similar feature. > > > > The amount of duplicated code we have to deal with due to to the process > > > model > > > is quite substantial. We have local memory, statically allocated shared > > > memory > > > and dynamically allocated shared memory variants for some things. And > > > that's > > > just going to continue. > > > > Maybe we can already remove the distinction between static and dynamic > > shared memory ? > > That sounds like a bad idea, dynamic shared memory is more expensive > to maintain than our static shared memory systems, not in the least > because DSM is not guaranteed to share the same addresses in each > process' address space. Then this too needs to be fixed > > > Though I already heard some complaints at the conference discussions > > that having the dynamic version available has made some developers > > sloppy in using it resulting in wastefulness. > > Do you know any examples of this wastefulness? No. Just somebody mentioned it in a hallway conversation and the rest of the developers present mumbled approvingly :) > > > > I'm purposely giving a nonanswer on whether it's a worthwhile goal, but > > > > rather I'd be curious where it could stack up against some other > > > > efforts to > > > > continue to help PostgreSQL improve performance and handle very large > > > > workloads.
> > > > > > There's plenty of things we can do before, but in the end I think > > > tackling the > > > issues you mention and moving to threads are quite tightly linked. > > > > Still we should be focusing our attention at solving the issues and > > not at "moving to threads" and hoping this will fix the issues by > > itself. > > I suspect that it is much easier to solve some of the issues when > working in a shared address space. Probably. But it woul
Re: Use COPY for populating all pgbench tables
I guess that COPY will still be slower than generating the data server-side ( --init-steps=...G... ) ? What I'd really like to see is providing all the pgbench functions also on the server. Specifically the various random(...) functions - random_exponential(...), random_gaussian(...), random_zipfian(...) - so that custom data generation could also be done easily server-side with matching distributions. On Thu, Jun 8, 2023 at 7:34 AM David Rowley wrote: > > On Thu, 8 Jun 2023 at 07:16, Tristan Partin wrote: > > > > master: > > > > 5000 of 5000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 > > s)) > > vacuuming... > > creating primary keys... > > done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side > > generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s). > > > > patchset: > > > > 5000 of 5000 tuples (100%) of pgbench_accounts done (elapsed 243.82 > > s, remaining 0.00 s)) > > vacuuming... > > creating primary keys... > > done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side > > generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s). > > I've also previously found pgbench -i to be slow. It was a while ago, > and IIRC, it was due to the printfPQExpBuffer() being a bottleneck > inside pgbench. > > On seeing your email, it makes me wonder if PG16's hex integer > literals might help here. These should be much faster to generate in > pgbench and also parse on the postgres side. > > I wrote a quick and dirty patch to try that and I'm not really getting > the same performance increases as I'd have expected. I also tested > with your patch too and it does not look that impressive either when > running pgbench on the same machine as postgres. > > pgbench copy speedup > > ** master > drowley@amd3990x:~$ pgbench -i -s 1000 postgres > 1 of 1 tuples (100%) done (elapsed 74.15 s, remaining 0.00 s) > vacuuming... > creating primary keys...
> done in 95.71 s (drop tables 0.00 s, create tables 0.01 s, client-side > generate 74.45 s, vacuum 0.12 s, primary keys 21.13 s). > > ** David's Patched > drowley@amd3990x:~$ pgbench -i -s 1000 postgres > 1 of 1 tuples (100%) done (elapsed 69.64 s, remaining 0.00 s) > vacuuming... > creating primary keys... > done in 90.22 s (drop tables 0.00 s, create tables 0.01 s, client-side > generate 69.91 s, vacuum 0.12 s, primary keys 20.18 s). > > ** Tristan's patch > drowley@amd3990x:~$ pgbench -i -s 1000 postgres > 1 of 1 tuples (100%) of pgbench_accounts done (elapsed > 77.44 s, remaining 0.00 s) > vacuuming... > creating primary keys... > done in 98.64 s (drop tables 0.00 s, create tables 0.01 s, client-side > generate 77.47 s, vacuum 0.12 s, primary keys 21.04 s). > > I'm interested to see what numbers you get. You'd need to test on > PG16 however. I left the old code in place to generate the decimal > numbers for versions < 16. > > David
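For the server-side random_zipfian() case mentioned above, a bounded Zipf sampler is straightforward to write (e.g. as the body of a PL/Python function). The sketch below uses plain inverse-CDF sampling over the finite range; note this is not pgbench's actual algorithm, just a stand-in producing a 1/k^s distribution:

```python
import random

def random_zipfian(lb, ub, s, rng=random.random):
    """Draw an integer in [lb, ub] with P(k) proportional to 1/k**s,
    via inverse-CDF over the finite support (fine for modest ranges)."""
    n = ub - lb + 1
    weights = [1.0 / (k ** s) for k in range(1, n + 1)]
    total = sum(weights)
    u = rng() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if u <= acc:
            return lb + i
    return ub  # float-rounding safety net

random.seed(42)
samples = [random_zipfian(1, 100, 1.5) for _ in range(1000)]
```

With s = 1.5 over [1, 100], roughly 40% of the draws land on 1, matching the heavy skew pgbench users expect from this family of functions.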
Re: Let's make PostgreSQL multi-threaded
On Thu, Jun 8, 2023 at 11:54 AM Hannu Krosing wrote:
>
> On Wed, Jun 7, 2023 at 11:37 PM Andres Freund wrote:
> >
> > Hi,
> >
> > On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote:
> > > 2. While I wouldn't want to necessarily discourage a moonshot effort, I
> > > would ask if developer time could be better spent on tackling some of the
> > > other problems around vertical scalability? Per some PGCon discussions,
> > > there's still room for improvement in how PostgreSQL can best utilize
> > > resources available very large "commodity" machines (a 448-core / 24TB RAM
> > > instance comes to mind).
> >
> > I think we're starting to hit quite a few limits related to the process model,
> > particularly on bigger machines. The overhead of cross-process context
> > switches is inherently higher than switching between threads in the same
> > process - and my suspicion is that that overhead will continue to
> > increase. Once you have a significant number of connections we end up spending
> > a *lot* of time in TLB misses, and that's inherent to the process model,
> > because you can't share the TLB across processes.
>
> This part was touched in the "AMA with a Linux Kernel Hacker"
> Unconference session where he mentioned that he had proposed a
> 'mshare' syscall for this.

Also, the *static* huge pages already let you solve this problem now by sharing the page tables

Cheers
Hannu
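For reference, the static huge pages mentioned above are enabled through existing configuration; a minimal sketch (the page count below is illustrative - size it to shared_buffers):

```
# /etc/sysctl.conf - reserve static huge pages in the kernel first
# (illustrative value: 4096 x 2MB pages = 8GB)
vm.nr_hugepages = 4096

# postgresql.conf - with 'on', the server refuses to start if the
# reserved huge pages are unavailable ('try' is the default)
huge_pages = on
```

Besides cutting TLB pressure, backends mapping shared buffers from huge pages share much of the page-table structure, which is the effect being discussed here.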
Re: Let's make PostgreSQL multi-threaded
On Thu, Jun 8, 2023 at 12:09 AM Andres Freund wrote: ... > We could e.g. eventually decide that we > don't support parallel query without threading support - which would allow us > to get rid of a very significant amount of code and runtime overhead. Here I was hoping to go in the opposite direction and support parallel query across replicas. This looks much more doable based on the process model than the single process / multiple threads model. --- Cheers Hannu
Re: Let's make PostgreSQL multi-threaded
I think I remember that in the early days of development somebody did send a patch-set for making PostgreSQL threaded on Solaris. I don't remember why this did not catch on. On Wed, Jun 7, 2023 at 11:40 PM Thomas Kellerer wrote: > > Tomas Vondra schrieb am 07.06.2023 um 21:20: > > Also, which other projects did this transition? Is there something we > > could learn from them? Were they restricted to much smaller list of > > platforms? > > Firebird did this a while ago if I'm not mistaken. > > Not open source, but Oracle was historically multi-threaded on Windows and > multi-process on all other platforms. > I _think_ starting with 19c you can optionally run it multi-threaded on Linux > as well. > > But I doubt, they are willing to share any insights ;) > > >
Re: Let's make PostgreSQL multi-threaded
On Wed, Jun 7, 2023 at 11:37 PM Andres Freund wrote:
>
> Hi,
>
> On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote:
> > 2. While I wouldn't want to necessarily discourage a moonshot effort, I
> > would ask if developer time could be better spent on tackling some of the
> > other problems around vertical scalability? Per some PGCon discussions,
> > there's still room for improvement in how PostgreSQL can best utilize
> > resources available very large "commodity" machines (a 448-core / 24TB RAM
> > instance comes to mind).
>
> I think we're starting to hit quite a few limits related to the process model,
> particularly on bigger machines. The overhead of cross-process context
> switches is inherently higher than switching between threads in the same
> process - and my suspicion is that that overhead will continue to
> increase. Once you have a significant number of connections we end up spending
> a *lot* of time in TLB misses, and that's inherent to the process model,
> because you can't share the TLB across processes.

This part was touched in the "AMA with a Linux Kernel Hacker" Unconference session where he mentioned that he had proposed a 'mshare' syscall for this.

So maybe a more fruitful way of fixing the perceived issues with the process model is to push for small changes in Linux to overcome these, avoiding a wholesale rewrite ?

> The amount of duplicated code we have to deal with due to the process model
> is quite substantial. We have local memory, statically allocated shared memory
> and dynamically allocated shared memory variants for some things. And that's
> just going to continue.

Maybe we can already remove the distinction between static and dynamic shared memory ?

Though I have already heard some complaints at the conference discussions that having the dynamic version available has made some developers sloppy in using it, resulting in wastefulness.
> >
> > I'm purposely giving a nonanswer on whether it's a worthwhile goal, but
> > rather I'd be curious where it could stack up against some other efforts to
> > continue to help PostgreSQL improve performance and handle very large
> > workloads.
>
> There's plenty of things we can do before, but in the end I think tackling the
> issues you mention and moving to threads are quite tightly linked.

Still, we should be focusing our attention on solving the issues and not on "moving to threads" and hoping this will fix the issues by itself.

Cheers
Hannu
Re: Time to move pg_test_timing to measure in nanoseconds
Sure, will do. On Sun, Mar 26, 2023 at 11:40 PM Andres Freund wrote: > > Hi, > > On 2023-03-26 16:43:21 +0200, Hannu Krosing wrote: > > Currently pg_test_timing utility measures its timing overhead in > > microseconds, giving results like this > > I have a patch that does that and a bit more that's included in a larger > patchset by David Geier: > https://postgr.es/m/198ef658-a5b7-9862-2017-faf85d59e3a8%40gmail.com > > Could you review that part of the patchset? > > Greetings, > > Andres Freund
Re: Disable vacuuming to provide data history
There is also another blocker - our timestamp resolution is 1 microsecond, and we are dangerously close to speeds where one could update a row twice in the same microsecond.

I have been thinking about this, and what is needed is

1. a nanosecond-resolution "abstime" type - not absolutely necessary, but would help with corner cases.

2. VACUUM should be able to "freeze" by replacing xmin/xmax values with commit timestamps, or adding tmin/tmax where necessary.

3. Optionally VACUUM could move historic rows to archive tables with explicit tmin/tmax columns (this also solves the pg_dump problem)

Most of the above design - apart from the timestamp resolution and vacuum being the one doing the stamping in of commit timestamps - is not really new. Up to version 6.2 PostgreSQL had tmin/tmax instead of xmin/xmax and you could specify the timestamp you want to query any table at.

And the original Postgres design was a Full History Database where you could say

  SELECT name, population FROM cities['epoch' .. 'now']

to get all historic population values. And historic data was meant to be moved to the WORM optical drives which had just arrived on the market

---
Hannu

On Sat, Feb 25, 2023 at 3:11 AM Vik Fearing wrote:
>
> On 2/24/23 22:06, Corey Huinker wrote:
> > On Thu, Feb 23, 2023 at 6:04 AM wrote:
> >
> > [1] some implementations don't use null, they use an end-timestamp set to
> > a date implausibly far in the future ( 3999-12-31 for example ),
>
> The specification is, "At any point in time, all rows that have their
> system-time period end column set to the highest value supported by the
> data type of that column are known as current system rows; all other
> rows are known as historical system rows."
>
> I would like to see us use 'infinity' for this.
>
> The main design blocker for me is how to handle dump/restore. The
> standard does not bother thinking about that.
> --
> Vik Fearing
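A minimal sketch of point 3 above - an archive table with explicit tmin/tmax columns and the corresponding "as of" query (table and column names are made up for illustration):

```sql
-- Hypothetical archive table holding frozen historic rows;
-- tmin/tmax carry commit timestamps instead of transaction ids
CREATE TABLE cities_archive (
    LIKE cities,
    tmin timestamptz NOT NULL,  -- commit time of the inserting transaction
    tmax timestamptz NOT NULL   -- commit time of the deleting/updating transaction
);

-- "What was the population as of 2020-01-01?"
SELECT name, population
  FROM cities_archive
 WHERE tmin <= '2020-01-01'
   AND tmax >  '2020-01-01';
```

Because the archive table is an ordinary table with ordinary columns, pg_dump handles it without needing to know anything about historic visibility - which is what makes this variant solve the dump/restore problem raised downthread.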
Time to move pg_test_timing to measure in nanoseconds
Currently the pg_test_timing utility measures its timing overhead in microseconds, giving results like this

~$ /usr/lib/postgresql/15/bin/pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 18.97 ns
Histogram of timing durations:
   < us   % of total      count
      1     98.11132   155154419
      2      1.88756     2985010
      4      0.00040         630
      8      0.00012         184
     16      0.00058         919
     32      0.00003          40
     64      0.00000           6

I got curious and wanted to see how the 98.1% timings are distributed (raw uncleaned patch attached)

And this is what I got when I increased the measuring resolution to nanoseconds

hannuk@hannuk1:~/work/postgres15_uni_dist_on/src/bin/pg_test_timing$ ./pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 17.34 ns, min: 15, same: 0
Histogram of timing durations:
   < ns   % of total      count
      1      0.00000           0
      2      0.00000           0
      4      0.00000           0
      8      0.00000           0
     16      1.14387     1979085
     32     98.47924   170385392
     64      0.21666      374859
    128      0.15654      270843
    256      0.00297        5139
    512      0.00016         272
   1024      0.00004          73
   2048      0.00018         306
   4096      0.00022         375
   8192      0.00006          99
  16384      0.00005          80
  32768      0.00001          20
  65536      0.00000           6
 131072      0.00000           2

As most of the samples seem to be in the ranges 8..15 and 16..32 nanoseconds, the current way of measuring at microsecond resolution is clearly inadequate.

The attached patch is not meant to be applied as-is but is rather there as a helper to easily verify the above numbers.

QUESTIONS

1. Do you think it is ok to just change pg_test_timing to return the result in nanoseconds, or should there be a flag that asks for nanosecond resolution ?

2.
Should the output be changed to give ranges instead of

*** pg_test_timing.c.orig	2023-01-16 01:09:51.839251695 +0100
--- pg_test_timing.c	2023-01-16 22:24:49.768037225 +0100
***************
*** 11,16 ****
--- 11,20 ----
  #include "getopt_long.h"
  #include "portability/instr_time.h"

+ #define INSTR_TIME_GET_NANOSEC(t) \
+ 	(((uint64) (t).tv_sec * (uint64) 1000000000) + (uint64) ((t).tv_nsec))
+
+
  static const char *progname;

  static unsigned int test_duration = 3;
***************
*** 19,26 ****
  static uint64 test_timing(unsigned int duration);
  static void output(uint64 loop_count);

! /* record duration in powers of 2 microseconds */
! long long int histogram[32];

  int
  main(int argc, char *argv[])
--- 23,30 ----
  static uint64 test_timing(unsigned int duration);
  static void output(uint64 loop_count);

! /* record duration in powers of 2 nanoseconds */
! long long int histogram[64];

  int
  main(int argc, char *argv[])
***************
*** 124,139 ****
  	uint64		total_time;
  	int64		time_elapsed = 0;
  	uint64		loop_count = 0;
  	uint64		prev,
  				cur;
  	instr_time	start_time,
  				end_time,
  				temp;

! 	total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;

  	INSTR_TIME_SET_CURRENT(start_time);
! 	cur = INSTR_TIME_GET_MICROSEC(start_time);

  	while (time_elapsed < total_time)
  	{
--- 128,145 ----
  	uint64		total_time;
  	int64		time_elapsed = 0;
  	uint64		loop_count = 0;
+ 	uint64		same_count = 0;
+ 	uint64		min_diff = UINT64_MAX;
  	uint64		prev,
  				cur;
  	instr_time	start_time,
  				end_time,
  				temp;

! 	total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;

  	INSTR_TIME_SET_CURRENT(start_time);
! 	cur = INSTR_TIME_GET_NANOSEC(start_time);

  	while (time_elapsed < total_time)
  	{
***************
*** 141,157 ****
  		bits = 0;

  		prev = cur;
  		INSTR_TIME_SET_CURRENT(temp);
! 		cur = INSTR_TIME_GET_MICROSEC(temp);
  		diff = cur - prev;

  		/* Did time go backwards? */
! 		if (diff < 0)
  		{
  			fprintf(stderr, _("Detected clock going backwards in time.\n"));
  			fprintf(stderr, _("Time warp: %d ms\n"), diff);
  			exit(1);
  		}

  		/* What is the highest bit in the time diff? */
  		while (diff)
--- 147,172 ----
  		bits = 0;

  		prev = cur;
+ 		// INSTR_TIME_SET_CURRENT(temp);
+ 		// prev = INSTR_TIME_GET_NANOSEC(temp);
  		INSTR_TIME_SET_CURRENT(temp);
! 		cur = INSTR_TIME_GET_NANOSEC(temp);
  		diff = cur - prev;

  		/* Did time go backwards? */
! 		if (unlikely(diff <= 0))
  		{
+ 			if (diff == 0)
+ 				same_count++;
+ 			else
+ 			{
  			fprintf(stderr, _("Detected clock going backwards in time.\n"));
  			fprintf(stderr, _("Time warp: %d ms\n"), diff);
  			exit(1);
+ 			}
  		}
+ 		if (min_diff > diff)
+ 			min_diff = diff;

  		/* What is the highest bit in the time diff? */
  		while (diff)
***************
*** 165,179 ****
  		loop_count++;
  		INSTR_TIME_SUBTRACT(temp, start_time);
! 		time_elapsed = INSTR_TIME_GET_MICROSEC(temp);
  	}
pgbench - adding pl/pgsql versions of tests
Hello Hackers,

The attached patch adds pl/pgsql versions of the "tpcb-like" and "simple-update" internal test scripts.

The tests perform functionally exactly the same, but are generally faster as they avoid most client-server latency.

The reasons I'd like to have them as part of pgbench are two:

1. so I don't have to create the script and function manually each time I want to test mainly the database (instead of the client-database system)

2. so that new users of PostgreSQL can easily see how much better OLTP workloads perform when packaged up as a server-side function

The new user-visible functionality is two new built-in scripts, shown by -b list:

$ pgbench -b list
Available builtin scripts:
  tpcb-like:
  plpgsql-tpcb-like:
  simple-update:
  plpgsql-simple-update:
  select-only:

which one can run using the -b / --builtin= option

pgbench -b plpgsql-tpcb-like ...
or
pgbench -b plpgsql-simple-update ...

And a flag --no-functions which lets you skip creating the functions at init time.

There are also character flags for -I / --init: `Y` to drop the functions and `y` to create them. Creating is the default behaviour, but can be disabled via the long flag --no-functions.

I selected Y/y as they were unused and can be thought of as an "inverted lambda symbol" :)

If there are no strong objections, I'll add it to the commitfest as well

- Hannu Krosing

Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index c4a44debeb..1edcec2f5c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -172,8 +172,8 @@ typedef struct socket_set
 /*
  * some configurable parameters
  */
-#define DEFAULT_INIT_STEPS "dtgvp"	/* default -I setting */
-#define ALL_INIT_STEPS "dtgGvpf"	/* all possible steps */
+#define DEFAULT_INIT_STEPS "dYtgvpy"	/* default -I setting */
+#define ALL_INIT_STEPS "dYtgGvpfy"	/* all possible steps */
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
@@ -800,6 +800,15 @@ static const BuiltinScript builtin_script[] =
 		"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
 		"END;\n"
 	},
+	{
+		"plpgsql-tpcb-like",
+		"",
+		"\\set aid random(1, " CppAsString2(naccounts) " * :scale)\n"
+		"\\set bid random(1, " CppAsString2(nbranches) " * :scale)\n"
+		"\\set tid random(1, " CppAsString2(ntellers) " * :scale)\n"
+		"\\set delta random(-5000, 5000)\n"
+		"SELECT 1 FROM pgbench_tpcb_like(:aid, :bid, :tid, :delta);\n"
+	},
 	{
 		"simple-update",
 		"",
@@ -813,6 +822,15 @@ static const BuiltinScript builtin_script[] =
 		"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
 		"END;\n"
 	},
+	{
+		"plpgsql-simple-update",
+		"",
+		"\\set aid random(1, " CppAsString2(naccounts) " * :scale)\n"
+		"\\set bid random(1, " CppAsString2(nbranches) " * :scale)\n"
+		"\\set tid random(1, " CppAsString2(ntellers) " * :scale)\n"
+		"\\set delta random(-5000, 5000)\n"
+		"SELECT 1 FROM pgbench_simple_update(:aid, :bid, :tid, :delta);\n"
+	},
 	{
 		"select-only",
 		"",
@@ -885,6 +903,7 @@ usage(void)
 		   "  -q, --quiet              quiet logging (one message each 5 seconds)\n"
 		   "  -s, --scale=NUM          scaling factor\n"
 		   "  --foreign-keys           create foreign key constraints between tables\n"
+		   "  --no-functions           do not create pl/pgsql functions for internal scripts\n"
 		   "  --index-tablespace=TABLESPACE\n"
 		   "                           create indexes in the specified tablespace\n"
 		   "  --partition-method=(range|hash)\n"
@@ -4836,6 +4855,54 @@ initTruncateTables(PGconn *con)
 					 "pgbench_tellers");
 }

+/*
+ * Create the functions needed for plpgsql-* builtin scripts
+ */
+static void
+initCreateFuntions(PGconn *con)
+{
+	fprintf(stderr, "creating functions...\n");
+
+	executeStatement(con,
+					 "CREATE FUNCTION pgbench_tpcb_like(_aid int, _bid int, _tid int, _delta int)\n"
+					 "RETURNS void\n"
+					 "LANGUAGE plpgsql\n"
+					 "AS $plpgsql$\n"
+					 "BEGIN\n"
+					 "UPDATE pgbench_accounts SET abalance = abalance + _delta WHERE aid = _aid;\n"
+					 "PERFORM abalance FROM pgbench_accounts WHERE aid = _aid;\n"
+					 "UPDATE pgbench_tellers SET
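The hunk above is cut off mid-statement. Based on the builtin tpcb-like script shown earlier in the diff, the complete function being created presumably looks something like this (a reconstruction for readability, not the literal patch text):

```sql
-- Sketch of the server-side tpcb-like transaction the patch creates;
-- the body mirrors the builtin "tpcb-like" script one statement per line
CREATE FUNCTION pgbench_tpcb_like(_aid int, _bid int, _tid int, _delta int)
RETURNS void
LANGUAGE plpgsql
AS $plpgsql$
BEGIN
    UPDATE pgbench_accounts SET abalance = abalance + _delta WHERE aid = _aid;
    PERFORM abalance FROM pgbench_accounts WHERE aid = _aid;
    UPDATE pgbench_tellers SET tbalance = tbalance + _delta WHERE tid = _tid;
    UPDATE pgbench_branches SET bbalance = bbalance + _delta WHERE bid = _bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
    VALUES (_tid, _bid, _aid, _delta, CURRENT_TIMESTAMP);
END;
$plpgsql$;
```

The client then issues a single `SELECT 1 FROM pgbench_tpcb_like(:aid, :bid, :tid, :delta);` per transaction, which is why the round-trip latency largely disappears from the measurement.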
Is there a way to use exported snapshots in autocommit mode ?
Hello hackers, Is there a way to use exported snapshots in autocommit mode ? Either something similar to defaults in default_transaction_deferrable, default_transaction_isolation, default_transaction_read_only Or something that could be set in the function's SET part. Context: I am working on speeding up large table copy via parallelism using pl/proxy SPLIT functionality and would like to have all the parallel sub-copies run with the same snapshot. Best Regards Hannu
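For context, SET TRANSACTION SNAPSHOT must currently be run inside an explicit transaction block, before any query, which is exactly what rules out autocommit mode:

```sql
-- session 1: export a snapshot (stays valid while this transaction is open)
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- returns an id such as '00000003-0000001B-1'

-- session 2: import it; this must be the first statement after an
-- explicit BEGIN at REPEATABLE READ or SERIALIZABLE isolation
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
-- ... queries here see exactly the same snapshot as session 1 ...
COMMIT;
```

So the question amounts to whether something like a default/session-level snapshot setting could be provided so that individually autocommitted statements (such as the parallel sub-copies) could all share one exported snapshot.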
Is anybody planning to release pglogical for v15 ?
So, is anybody planning to release pglogical for v15 ?

There are still a few things that one can do in pglogical but not in native / built-in replication ...

Best Regards
Hannu
Re: making relfilenodes 56 bits
Re: staticAssertStmt(MAX_FORKNUM <= INT8_MAX);

Have you really thought through making the ForkNum 8-bit ?

For example, this would limit a columnar storage with each column stored in its own fork (which I'd say is not entirely unreasonable) to having just about ~125 columns, since INT8_MAX is 127.

And there can easily be other use cases where we do not want to limit the number of forks so much

Cheers
Hannu

On Tue, Jul 12, 2022 at 10:36 PM Robert Haas wrote:
>
> On Tue, Jul 12, 2022 at 1:09 PM Andres Freund wrote:
> > What does currently happen if we exceed that?
>
> elog
>
> > > diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
> > > index b578e2ec75..5d3775ccde 100644
> > > --- a/src/include/utils/wait_event.h
> > > +++ b/src/include/utils/wait_event.h
> > > @@ -193,7 +193,7 @@ typedef enum
> > >          WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
> > >          WAIT_EVENT_LOGICAL_REWRITE_WRITE,
> > >          WAIT_EVENT_RELATION_MAP_READ,
> > > -        WAIT_EVENT_RELATION_MAP_SYNC,
> > > +        WAIT_EVENT_RELATION_MAP_RENAME,
> >
> > Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> > since it includes fsync etc?
>
> Sure, I had it that way for a while and changed it at the last minute.
> I can change it back.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
Re: Hardening PostgreSQL via (optional) ban on local file system access
And thanks to Robert and Bruce for bringing up good points about potential pitfalls! I think we do have a good discussion going on here :) --- Hannu On Fri, Jul 1, 2022 at 11:14 AM Hannu Krosing wrote: > > On Thu, Jun 30, 2022 at 7:25 PM Bruce Momjian wrote: > > > > On Thu, Jun 30, 2022 at 11:52:20AM -0400, Robert Haas wrote: > > > I don't think this would be very convenient in most scenarios, > > This is the eternal problem with security - more security always > includes more inconvenience. > > Unlocking your door when coming home is more inconvenient than not > unlocking it, and the least inconvenient thing would be not having a > door at all. > Imagine coming to your door with a heavy shopping bag in each hand - > at this moment almost anyone would prefer the door not being there :) > > This one would be for cases where you want best multi-layer > protections also against unknown threats and are ready to trade some > convenience for security. Also it would not be that bad once you use > automated deployment pipelines which just need an extra line to unlock > superuser for deployment. > > > > and I > > > think it would also be difficult to implement correctly. I don't think > > > you can get by with just having superuser() return false sometimes > > > despite pg_authid.rolsuper being true. There's a lot of subtle > > > assumptions in the code to the effect that the properties of a session > > > are basically stable unless some SQL is executed which changes things. > > A good barrier SQL for this could be SET ROLE=... . > Maybe just have a mode where a superuser can not log in _or_ SET ROLE > unless this is explicitly allowed in pg_superuser.conf > > > > I think if we start injecting hacks like this it may seem to work in > > > light testing but we'll never get to the end of the bug reports. > > In this case it looks like each of these bug reports would mean an > avoided security breach which for me looks preferable. 
> > Again, this would be all optional, opt-in, DBA-needs-to-set-it-up > feature for the professionally paranoid and not something we suddenly > force on people who would like to run all their databases using > user=postgres database=postgres with trust specified in the > pg_hba.conf "because the firewall takes care of security" :) > > > Yeah, seems it would have to be specified per-session, but how would you > > specify a specific session before the session starts? > > One often recommended way to do superuser'y things in a secure > production database is to have a non-privileged NOINHERIT user for > logging in and then do > SET ROLE=; > when needed, similar to using su/sudo in shell. This practice both > reduces the attack surface and also provides auditability by knowing > who logged in for superuser work. > > In this case one could easily get the pid and do the needed extra > setup before escalating privileges to superuser. > > --- > Hannu
Re: Hardening PostgreSQL via (optional) ban on local file system access
On Thu, Jun 30, 2022 at 7:25 PM Bruce Momjian wrote: > > On Thu, Jun 30, 2022 at 11:52:20AM -0400, Robert Haas wrote: > > I don't think this would be very convenient in most scenarios, This is the eternal problem with security - more security always includes more inconvenience. Unlocking your door when coming home is more inconvenient than not unlocking it, and the least inconvenient thing would be not having a door at all. Imagine coming to your door with a heavy shopping bag in each hand - at this moment almost anyone would prefer the door not being there :) This one would be for cases where you want best multi-layer protections also against unknown threats and are ready to trade some convenience for security. Also it would not be that bad once you use automated deployment pipelines which just need an extra line to unlock superuser for deployment. > > and I > > think it would also be difficult to implement correctly. I don't think > > you can get by with just having superuser() return false sometimes > > despite pg_authid.rolsuper being true. There's a lot of subtle > > assumptions in the code to the effect that the properties of a session > > are basically stable unless some SQL is executed which changes things. A good barrier SQL for this could be SET ROLE=... . Maybe just have a mode where a superuser can not log in _or_ SET ROLE unless this is explicitly allowed in pg_superuser.conf > > I think if we start injecting hacks like this it may seem to work in > > light testing but we'll never get to the end of the bug reports. In this case it looks like each of these bug reports would mean an avoided security breach which for me looks preferable. 
Again, this would be all optional, opt-in, DBA-needs-to-set-it-up feature for the professionally paranoid and not something we suddenly force on people who would like to run all their databases using user=postgres database=postgres with trust specified in the pg_hba.conf "because the firewall takes care of security" :) > Yeah, seems it would have to be specified per-session, but how would you > specify a specific session before the session starts? One often recommended way to do superuser'y things in a secure production database is to have a non-privileged NOINHERIT user for logging in and then do SET ROLE=; when needed, similar to using su/sudo in shell. This practice both reduces the attack surface and also provides auditability by knowing who logged in for superuser work. In this case one could easily get the pid and do the needed extra setup before escalating privileges to superuser. --- Hannu
Re: Hardening PostgreSQL via (optional) ban on local file system access
Ah, I see. My counterclaim was that there are lots of use cases where one would want to be extra sure that *only* they, as the owner of the database, can access the database as a superuser and not someone else. Even if there is some obscure way for that "someone else" to either connect as a superuser or to escalate privileges to superuser. And what I propose would be a means to achieve that at the expense of extra steps when starting to act as a superuser. In a nutshell this would be equivalent for two factor authentication for acting as a superuser - 1) you must be able to log in as a user with superuser attribute 2) you must present proof that you can access the underlying file system Cheers, Hannu Krosing On Wed, Jun 29, 2022 at 12:48 PM Laurenz Albe wrote: > > On Wed, 2022-06-29 at 00:05 -0700, Andres Freund wrote: > > On 2022-06-29 08:51:10 +0200, Laurenz Albe wrote: > > > On Tue, 2022-06-28 at 16:27 -0700, Andres Freund wrote: > > > > > Experience shows that 99% of the time one can run PostgreSQL just fine > > > > > without a superuser > > > > > > > > IME that's not at all true. It might not be needed interactively, but > > > > that's > > > > not all the same as not being needed at all. > > > > > > I also disagree with that. Not having a superuser is one of the pain > > > points with using a hosted database: no untrusted procedural languages, > > > no untrusted extensions (unless someone hacked up PostgreSQL or provided > > > a workaround akin to a SECURITY DEFINER function), etc. > > > > I'm not sure what exactly you're disagreeing with? I'm not saying that > > superuser isn't needed interactively in general, just that there are > > reasonably common scenarios in which that's the case. > > I was unclear, sorry. I agreed with you that you can't do without superuser > and disagreed with the claim that 99% of the time nobody needs superuser > access. > > Yours, > Laurenz Albe
Re: Hardening PostgreSQL via (optional) ban on local file system access
The idea is to allow superuser, but only in case you *already* have access to the file system. You could think of it as two-factor authentication for using superuser.

So in the simplest implementation it would be

touch $PGDATA/allow_superuser
psql
hannuk=# CREATE EXTENSION ...
rm $PGDATA/allow_superuser

and in a more sophisticated implementation it could be

terminal 1:
psql
hannuk=# select pg_backend_pid();
 pg_backend_pid
        1749025
(1 row)

terminal 2:
echo 1749025 > $PGDATA/allow_superuser

back to terminal 1, still connected to the backend with pid 1749025:
hannuk=# CREATE EXTENSION ...

.. and then clean up the sentinel file after, or just make it valid for N minutes from creation

Cheers,
Hannu Krosing

On Wed, Jun 29, 2022 at 8:51 AM Laurenz Albe wrote:
>
> On Tue, 2022-06-28 at 16:27 -0700, Andres Freund wrote:
> > > Experience shows that 99% of the time one can run PostgreSQL just fine
> > > without a superuser
> >
> > IME that's not at all true. It might not be needed interactively, but that's
> > not all the same as not being needed at all.
>
> I also disagree with that. Not having a superuser is one of the pain
> points with using a hosted database: no untrusted procedural languages,
> no untrusted extensions (unless someone hacked up PostgreSQL or provided
> a workaround akin to a SECURITY DEFINER function), etc.
>
> Yours,
> Laurenz Albe
Re: Hardening PostgreSQL via (optional) ban on local file system access
I was not after *completely* removing it, but just having an option which makes the superuser() function always return false. For known cases of needing a superuser there would be a way to enable it, perhaps via a sentinel file or a pg_hba-like configuration file.

And as a first cut I would advocate disabling SECURITY DEFINER functions, for simplicity and robustness of defense. A short term solution for these would be to re-write them in C (or Go or Rust or any other compiled language), as C functions have full access anyway. But C functions fall in the same category as the other defenses discussed at the start of this thread - namely, to use them you already need access to the file system.

Running production databases without superuser available is not as impossible as you may think - the Cloud SQL version of PostgreSQL has been in use with great success for years without exposing a real superuser to end users (there are some places where `cloudsqlsuperuser` gives you partial superuser'y abilities).

Letting the user turn off superuser access when no known need for it exists (which is the case 99.9% of the time in most use cases) would improve secondary defenses noticeably. It would also be a good start to figuring out the set of roles into which one can decompose superuser access in the longer run

-- Hannu

On Tue, Jun 28, 2022 at 8:30 PM Robert Haas wrote:
>
> On Mon, Jun 27, 2022 at 5:37 PM Hannu Krosing wrote:
> > My current thinking is (based on more insights from Andres) that we
> > should also have a startup flag to disable superuser altogether to
> > avoid bypasses via direct manipulation of pg_proc.
> >
> > Experience shows that 99% of the time one can run PostgreSQL just fine
> > without a superuser, so having a superuser available all the time is
> > kind of like leaving a loaded gun on the kitchen table because you
> > sometimes need to go hunting.
> >
> > I am especially waiting for Andres' feedback on the viability of this approach.
> > Well, I'm not Andres but I don't think not having a superuser at all > is in any way a viable approach. It's necessary to be able to > administer the database system, and the bootstrap superuser can't be > removed outright in any case because it owns a ton of objects. > > There are basically two ways of trying to solve this problem. On the > one hand we could try to create a mode in which the privileges of the > superuser are restricted enough that the superuser can't break out to > the operating system. The list of things that would need to be blocked > is, I think, more extensive than any list you've give so far. The > other is to stick with the idea of an unrestricted superuser but come > up with ways of giving a controlled subset of the superuser's > privileges to a non-superuser. I believe this is the more promising > approach, and there have been multiple discussion threads about it in > the last six months. > > -- > Robert Haas > EDB: http://www.enterprisedb.com
Re: Hardening PostgreSQL via (optional) ban on local file system access
My current thinking is (based on more insights from Andres) that we should also have a startup flag to disable superuser altogether to avoid bypasses via direct manipulation of pg_proc.

Experience shows that 99% of the time one can run PostgreSQL just fine without a superuser, so having a superuser available all the time is kind of like leaving a loaded gun on the kitchen table because you sometimes need to go hunting.

I am especially waiting for Andres' feedback on the viability of this approach.

Cheers
Hannu

On Mon, Jun 27, 2022 at 10:37 PM Jeff Davis wrote:
>
> On Sat, 2022-06-25 at 00:08 +0200, Hannu Krosing wrote:
> > Hi Pgsql-Hackers
> >
> > As part of ongoing work on PostgreSQL security hardening we have
> > added a capability to disable all file system access (COPY TO/FROM
> > [PROGRAM], pg_*file*() functions, lo_*() functions
> > accessing files, etc) in a way that can not be re-enabled without
> > already having access to the file system. That is via a flag which can
> > be set only in postgresql.conf or on the command line.
>
> How much of this can be done as a special extension already?
>
> For instance, a ProcessUtility_hook can prevent superuser from
> executing COPY TO/FROM PROGRAM.
>
> As others point out, that would still leave a lot of surface area for
> attacks, e.g. by manipulating the catalog. But it could be a starting
> place to make attacks "harder", without core postgres needing to make
> security promises that will be hard to keep.
>
> Regards,
> Jeff Davis
Re: [PoC] Improve dead tuple storage for lazy vacuum
> Another thought: for non-x86 platforms, the SIMD nodes degenerate to
> "simple loop", and looping over up to 32 elements is not great
> (although possibly okay). We could do binary search, but that has bad
> branch prediction.

I am not sure that SIMD / vector instructions would go unused on the relevant non-x86 platforms (though it would be a good idea to verify). Do you know of any modern platforms that do not have SIMD ?

I would definitely test before assuming binary search is better. Often other approaches, like a counting search over such small vectors, are much better when the vector fits in cache (or even a cache line) and you always visit all items, as this completely avoids branch mispredictions and allows the compiler to vectorize and/or unroll the loop as needed.

Cheers
Hannu
Re: Hardening PostgreSQL via (optional) ban on local file system access
Having a superuser.conf file could also be used to solve another issue - currently, if you remove the SUPERUSER attribute from all users your only option to get it back would be to run in single-user mode. With a conf file one could add an "always" option there which makes any matching user a superuser, to fix whatever needs fixing as superuser. Cheers Hannu On Sat, Jun 25, 2022 at 10:54 PM Hannu Krosing wrote: > > What are your ideas of applying a change similar to above to actually > being a superuser ? > > That is adding a check for "superuser being currently available" to > function superuser() in > ./src/backend/utils/misc/superuser.c ? > > It could be as simple as a flag that can be set only at startup for > maximum speed - though the places needing superuser check are never in > speed-critical path as far as I have seen. > > Or it could be more complex and dynamic, like having a file similar to > pg_hba.conf that defines the ability to run code as superuser based on > a set of attributes each of which could be * meaning "any" > > These could be > * database (to limit superuser to only certain databases) > * session user (to allow only some users to become superuser, > including via SECURITY DEFINER functions) > * backend pid (this would allow kind of 2-factor authentication - > connect, then use that session's pid to add a row to > pg_ok_to_be_sup.conf file, then continue with your superuser-y stuff) > * valid-until - this lets one allow superuser for a limited period > without fear of forgetting to disable it > > This approach would have the benefit of being very low-code while > delivering the extra protection of needing pre-existing access to > filesystem to enable/disable . > > For easiest usability the pg_ok_to_be_sup.conf file should be outside > pg_reload_conf() - either just read each time the superuser() check is > run, or watched via inotify and reloaded each time it changes.
> > Cheers, > Hannu > > P.S: - thanks Magnus for the "please don't top-post" notice - I also > need to remember to check if all the quoted mail history is left in > when I just write a reply without touching any of it. I hope it does > the right thing and leaves it out, but it just might under some > conditions bluntly append it anyway just in case :)
Re: Hardening PostgreSQL via (optional) ban on local file system access
What are your ideas of applying a change similar to above to actually being a superuser ?

That is, adding a check for "superuser being currently available" to the function superuser() in ./src/backend/utils/misc/superuser.c ?

It could be as simple as a flag that can be set only at startup for maximum speed - though the places needing a superuser check are never in a speed-critical path as far as I have seen.

Or it could be more complex and dynamic, like having a file similar to pg_hba.conf that defines the ability to run code as superuser based on a set of attributes, each of which could be * meaning "any"

These could be
* database (to limit superuser to only certain databases)
* session user (to allow only some users to become superuser, including via SECURITY DEFINER functions)
* backend pid (this would allow a kind of 2-factor authentication - connect, then use that session's pid to add a row to the pg_ok_to_be_sup.conf file, then continue with your superuser-y stuff)
* valid-until - this lets one allow superuser for a limited period without fear of forgetting to disable it

This approach would have the benefit of being very low-code while delivering the extra protection of needing pre-existing access to the filesystem to enable/disable.

For easiest usability the pg_ok_to_be_sup.conf file should be outside pg_reload_conf() - either just read each time the superuser() check is run, or watched via inotify and reloaded each time it changes.

Cheers,
Hannu

P.S. - thanks Magnus for the "please don't top-post" notice - I also need to remember to check if all the quoted mail history is left in when I just write a reply without touching any of it. I hope it does the right thing and leaves it out, but it just might under some conditions bluntly append it anyway just in case :)
Re: Hardening PostgreSQL via (optional) ban on local file system access
The old versions should definitely not have it turned on by default. I probably was not as clear as I thought in bringing out that point.. For upcoming next ones the distributors may want to turn it on for some more security-conscious ("enterprize") distributions. -- Hannu On Sat, Jun 25, 2022 at 2:08 AM David G. Johnston wrote: > > > > On Friday, June 24, 2022, Gurjeet Singh wrote: >> >> On Fri, Jun 24, 2022 at 4:13 PM Andres Freund wrote: >> > On 2022-06-25 00:08:13 +0200, Hannu Krosing wrote: >> >> > > 3) should this be back-patched (we can provide batches for all >> > > supported PgSQL versions) >> > >> > Err, what? >> >> Translation: Backpatching these changes to any stable versions will >> not be acceptable (per the project versioning policy [1]), since these >> changes would be considered new feature. These changes can break >> installations, if released in a minor version. >> > > No longer having the public schema in the search_path was a feature that got > back-patched, with known bad consequences, without any way for the DBA to > voice their opinion on the matter. This proposal seems similar enough to at > least ask the question, with full DBA control and no known bad consequences. > > David J. >
Re: Hardening PostgreSQL via (optional) ban on local file system access
My understanding was that unless activated by admin these changes would change nothing. And they would be (borderline :) ) security fixes. And the versioning policy link actually does not say anything about not adding features to older versions (I know this is the policy, just pointing out the info is not on that page). On Sat, Jun 25, 2022 at 1:46 AM Gurjeet Singh wrote: > > On Fri, Jun 24, 2022 at 4:13 PM Andres Freund wrote: > > On 2022-06-25 00:08:13 +0200, Hannu Krosing wrote: > > > > 3) should this be back-patched (we can provide batches for all > > > supported PgSQL versions) > > > > Err, what? > > Translation: Backpatching these changes to any stable versions will > not be acceptable (per the project versioning policy [1]), since these > changes would be considered new feature. These changes can break > installations, if released in a minor version. > > [1]: https://www.postgresql.org/support/versioning/ > > Best regards, > Gurjeet > http://Gurje.et
Re: Hardening PostgreSQL via (optional) ban on local file system access
On Sat, Jun 25, 2022 at 1:23 AM Hannu Krosing wrote: > My impression was that this was largely fixed via disabling the old > direct file calling convention, but then again I did not pay much > attention at that time :) I meant of course direct FUNCTION calling convention (Version 0 Calling Conventions) -- Hannu
Re: Hardening PostgreSQL via (optional) ban on local file system access
On Sat, Jun 25, 2022 at 1:13 AM Andres Freund wrote: > > Hi, > > On 2022-06-25 00:08:13 +0200, Hannu Krosing wrote: > > Currently the file system access is controlled via being a SUPREUSER > > or having the pg_read_server_files, pg_write_server_files and > > pg_execute_server_program roles. The problem with this approach is > > that it will not stop an attacker who has managed to become the > > PostgreSQL SUPERUSER from breaking out of the server to reading and > > writing files and running programs in the surrounding container, VM or > > OS. > > If a user has superuser rights, they automatically can execute arbitrary > code. It's that simple. Removing roles isn't going to change that. Our code > doesn't protect against C functions mismatching their SQL level > definitions. With that you can do a lot of things. Are you claiming that one can manipulate PostgreSQL to do any file writes directly by manipulating pg_proc to call the functions "in a wrong way" ? My impression was that this was largely fixed via disabling the old direct file calling convention, but then again I did not pay much attention at that time :) So your suggestion would be to also include disabling access to at least pg_proc for creating C and internal functions and possibly some other system tables to remove this threat ? Cheers Hannu
Hardening PostgreSQL via (optional) ban on local file system access
Hi Pgsql-Hackers

As part of ongoing work on PostgreSQL security hardening we have added a capability to disable all file system access (COPY TO/FROM [PROGRAM], pg_*file*() functions, lo_*() functions accessing files, etc) in a way that can not be re-enabled without already having access to the file system. That is via a flag which can be set only in postgresql.conf or on the command line.

Currently the file system access is controlled via being a SUPERUSER or having the pg_read_server_files, pg_write_server_files and pg_execute_server_program roles. The problem with this approach is that it will not stop an attacker who has managed to become the PostgreSQL SUPERUSER from breaking out of the server to reading and writing files and running programs in the surrounding container, VM or OS.

If we had had this then, for example, the infamous 2020 PgCrypto worm [1] would have been much less likely to succeed.

So here are a few questions to get discussion started.

1) would it be enough to just disable WRITING to the filesystem (COPY ... TO ..., COPY TO ... PROGRAM ...) or are some reading functions also potentially exploitable or at least making an attacker's life easier ?

2) should configuration be all-or-nothing or more fine-tunable (maybe a comma-separated list of allowed features) ?

3) should this be back-patched (we can provide patches for all supported PgSQL versions) ?

4) We should likely start with this flag off, but any paranoid (read - good and security conscious) DBA can turn it on.

5) Which file access functions should be in the unsafe list - pg_current_logfile is likely safe, as is pg_relation_filenode, but pg_ls_dir likely is not. Some variants might be ok again, like pg_ls_waldir ?

6) or should we control it via disabling the pg_*_server_* roles, for a different take on configuration from 5) ?

As I said, we are happy to provide patches (and docs, etc) for all the PostgreSQL versions we decide to support here.
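For the "flag which can be set only in postgresql.conf or on the command line" part, a GUC with PGC_POSTMASTER context gives exactly that behavior. A sketch of how such a flag could be declared from an extension's _PG_init() - the GUC name and variable here are illustrative, not from the actual patch:

```c
/* Illustrative sketch only: the GUC name and variable are hypothetical.
 * PGC_POSTMASTER means the value can only be set in postgresql.conf or
 * on the command line, never via SET or pg_reload_conf(). */
#include "postgres.h"
#include "utils/guc.h"

static bool allow_server_file_access = true;

void
_PG_init(void)
{
    DefineCustomBoolVariable("hardening.allow_server_file_access",
                             "Allows COPY TO/FROM files and programs and "
                             "server-side file access functions.",
                             NULL,
                             &allow_server_file_access,
                             true,
                             PGC_POSTMASTER,    /* startup-time only */
                             0,
                             NULL, NULL, NULL);
}
```

The affected code paths would then check the flag and error out regardless of role membership, which is what makes it impossible to re-enable from SQL.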
Best Regards
Hannu

[1] https://www.securityweek.com/pgminer-crypto-mining-botnet-abuses-postgresql-distribution
Which hook to use when overriding utility commands (COPY ...)
Hi Pgsql-Hackers

Which hook should I use when overriding the COPY command in an extension?

I am working on adding new functionality to COPY (compression, index management, various other transports in addition to stdin and file, other data formats, etc...) and while the aim is to contribute this to v15 I would also like to have much of it in earlier versions.

As the current policy is to back-port only bugfixes and not "features", the only way I can see to get it into earlier versions is to provide an extension which intercepts the COPY command and replaces it with my own implementation.

So my question is, which of the hooks would be easiest to use for this ?

At the syntax level it would still look the same

COPY ... FROM/TO ... WITH ( options )

and the extensibility will be in file names (using a URI scheme mapped to transports) and in the options part. So I hope to fully reuse the parsing part and get in before the existence checks.

Does anyone have experience with this and can point to samples?

Cheers
Hannu
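For what it is worth, ProcessUtility_hook is the usual way to intercept a utility command like COPY after parsing but before execution, which matches the requirement above. A rough sketch - my_DoCopy() is hypothetical, and the hook's exact signature differs between major versions (this is roughly the PG 13 form):

```c
/* Sketch of intercepting COPY via ProcessUtility_hook; the hypothetical
 * my_DoCopy() would inspect stmt->filename for a URI scheme and
 * stmt->options, returning true if it handled the command itself. */
#include "postgres.h"
#include "nodes/parsenodes.h"
#include "tcop/utility.h"

static ProcessUtility_hook_type prev_ProcessUtility = NULL;

static bool my_DoCopy(CopyStmt *stmt, const char *queryString,
                      QueryCompletion *qc);   /* hypothetical */

static void
my_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
                  ProcessUtilityContext context, ParamListInfo params,
                  QueryEnvironment *queryEnv, DestReceiver *dest,
                  QueryCompletion *qc)
{
    Node *parsetree = pstmt->utilityStmt;

    if (IsA(parsetree, CopyStmt) &&
        my_DoCopy((CopyStmt *) parsetree, queryString, qc))
        return;     /* handled by the extension */

    /* anything else falls through to the previous hook or the default */
    if (prev_ProcessUtility)
        prev_ProcessUtility(pstmt, queryString, context, params,
                            queryEnv, dest, qc);
    else
        standard_ProcessUtility(pstmt, queryString, context, params,
                                queryEnv, dest, qc);
}

void
_PG_init(void)
{
    prev_ProcessUtility = ProcessUtility_hook;
    ProcessUtility_hook = my_ProcessUtility;
}
```

Since the hook receives the already-parsed CopyStmt, the existing COPY grammar is fully reused and the extension gets control before any file-existence checks run.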
Re: logical decoding and replication of sequences
Just a note for some design decisions

> 1) By default, sequences are treated non-transactionally, i.e. sent to the > output plugin right away.

If our aim is just to make sure that all user-visible data in *transactional* tables is consistent with sequence state, then one very much simplified approach to this could be to track the results of nextval() calls in a transaction and at COMMIT put the latest sequence value in WAL (or just track the sequences affected and put the latest sequence state in WAL at commit, which needs an extra read of the sequence but protects against race conditions with parallel transactions which get rolled back later)

This avoids sending redundant changes for multiple nextval() calls (like loading a million-row table with a sequence-generated id column)

And one can argue that we can safely ignore anything in ROLLED BACK transactions. This is assuming that even if we did advance the sequence past the last value sent by the latest COMMITTED transaction it does not matter for database consistency. It can matter if customers just call nextval() in rolled-back transactions and somehow expect these values to be replicated, based on reasoning along "sequences are not transactional - so rollbacks should not matter" .

Or we may get away without most of the in-detail sequence tracking on the source if we just keep track of the xmin of the sequence and send the sequence info over at commit if it == current_transaction_id ?

----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.

On Fri, Sep 24, 2021 at 9:16 PM Tomas Vondra wrote: > > Hi, > > On 9/23/21 12:27 PM, Peter Eisentraut wrote: > > On 30.07.21 20:26, Tomas Vondra wrote: > >> Here's a an updated version of this patch - rebased to current master, > >> and fixing some of the issues raised in Peter's review. > > > > This patch needs an update, as various conflicts have arisen now.
> > > > As was discussed before, it might be better to present a separate patch > > for just the logical decoding part for now, since the replication and > > DDL stuff has the potential to conflict heavily with other patches being > > discussed right now. It looks like cutting this patch in two should be > > doable easily. > > > > Attached is the rebased patch, split into three parts: > > 1) basic decoding infrastructure (decoding, reorderbuffer etc.) > 2) support for sequences in test_decoding > 3) support for sequences in built-in replication (catalogs, syntax, ...) > > The last part is the largest one - I'm sure we'll have discussions about > the grammar, adding sequences automatically, etc. But as you said, let's > focus on the first part, which deals with the required decoding stuff. > > I've added a couple comments, explaining how we track sequences, why we > need the XID in nextval() etc. I've also added streaming support. > > > > I looked through the test cases in test_decoding again. It all looks > > pretty sensible. If anyone can think of any other tricky or dubious > > cases, we can add them there. It's easiest to discuss these things with > > concrete test cases rather than in theory. 
> > > > One slightly curious issue is that this can make sequence values go > > backwards, when seen by the logical decoding consumer, like in the test > > case: > > > > + BEGIN > > + sequence: public.test_sequence transactional: 1 created: 1 last_value: > > 1, log_cnt: 0 is_called: 0 > > + COMMIT > > + sequence: public.test_sequence transactional: 0 created: 0 last_value: > > 33, log_cnt: 0 is_called: 1 > > + BEGIN > > + sequence: public.test_sequence transactional: 1 created: 1 last_value: > > 4, log_cnt: 0 is_called: 1 > > + COMMIT > > + sequence: public.test_sequence transactional: 0 created: 0 last_value: > > 334, log_cnt: 0 is_called: 1 > > > > I suppose that's okay, since it's not really the intention that someone > > is concurrently consuming sequence values on the subscriber. Maybe > > something for the future. Fixing that would require changing the way > > transactional sequence DDL updates these values, so it's not directly > > the job of the decoding to address this. > > > > Yeah, that's due to how ALTER SEQUENCE does things, and I agree redoing > that seems well out of scope for this patch. What seems a bit suspicious > is that some of the ALTER SEQUENCE changes have "created: 1" - it's > probably correct, though, because those ALTER SEQUENCE statements can be > rolled-back, so we see it as if a new sequence is created. The flag name > might be a bit confusing, though. > > > A small thing I found: Maybe the text that test_decoding
Re: logical replication restrictions
On Wed, Sep 22, 2021 at 6:18 AM Amit Kapila wrote: > > On Tue, Sep 21, 2021 at 4:21 PM Marcos Pegoraro wrote: >> >> No, I´m talking about that configuration you can have on standby servers >> recovery_min_apply_delay = '8h' >> > > oh okay, I think this can be useful in some cases where we want to avoid data > loss similar to its use for physical standby. For example, if the user has by > mistake truncated the table (or deleted some required data) on the publisher, > we can always it from the subscriber if we have such a feature. > > Having said that, I am not sure if we can call it a restriction. It is more > of a TODO kind of thing. It doesn't sound advisable to me to keep growing the > current Restrictions page [1]. One could argue that not having delayed apply *is* a restriction compared to both physical replication and "the original upstream" pg_logical. I think therefore it should be mentioned in "Restrictions" so people considering moving from physical streaming to pg_logical or just trying to decide whether to use pg_logical are warned. Also, the Restrictions page starts with " These might be addressed in future releases." so there is no exclusivity of being either a restriction or TODO. > [1] - https://wiki.postgresql.org/wiki/Todo > [2] - > https://www.postgresql.org/docs/devel/logical-replication-restrictions.html - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
Re: prevent immature WAL streaming
Hi Alvaro, I just started reading this thread, but maybe you can confirm or refute my understanding of what was done. In the first email you write > As mentioned in the course of thread [1], we're missing a fix for streaming replication to avoid sending records that the primary hasn't fully flushed yet. This patch is a first attempt at fixing that problem by retreating the LSN reported as FlushPtr whenever a segment is registered, based on the understanding that if no registration exists then the LogwrtResult.Flush pointer can be taken at face value; but if a registration exists, then we have to stream only till the start LSN of that registered entry. So did we end up holding back the wal_sender to not send anything that is not confirmed as flushed on master? Are there measurements on how much this slows down replication compared to allowing sending the moment it is written in buffers but not necessarily flushed locally ? Did we investigate the possibility of sending as fast as possible and controlling the flush synchronisation by sending separate flush pointers *both* ways ? And maybe there was even an alternative considered where we are looking at a more general Durability, for example 2-out-of-3 where primary is one of the 3 and not necessarily the most durable one? ----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Fri, Sep 24, 2021 at 4:33 AM Alvaro Herrera wrote: > > On 2021-Sep-23, Alvaro Herrera wrote: > > > However, I notice now that the pg_rewind tests reproducibly fail in > > branch 14 for reasons I haven't yet understood. It's strange that no > > other branch fails, even when run quite a few times. > > Turns out that this is a real bug (setting EndOfLog seems insufficient). > I'm looking into it. > > -- > Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ > "No necesitamos banderas > No reconocemos fronteras" (Jorge González) > >
Re: WIP: System Versioned Temporal Table
On Mon, Sep 20, 2021 at 7:09 AM Corey Huinker wrote: > > On Sun, Sep 19, 2021 at 3:12 PM Hannu Krosing wrote: >> >> A side table has the nice additional benefit that we can very easily >> version the *table structure* so when we ALTER TABLE and the table >> structure changes we just make a new side table with now-currents >> structure. > > > It's true that would allow for perfect capture of changes to the table > structure, but how would you query the thing? > > If a system versioned table was created with a column foo that is type float, > and then we dropped that column, how would we ever query what the value of > foo was in the past? We can query that thing only in tables AS OF the time when the column was still there. We probably could get away with pretending the dropped columns are NULL in newer versions (and the versions before the column was added) An even trickier case would be changing the column data type. > > Would the columns returned from SELECT * change based on the timeframe > requested? If we want to emulate real table history, then it should. But the * thing was not really specified well even for original PostgreSQL inheritance. Maybe we could do SELECT (* AS OF 'yesterday afternoon'::timestamp) FROM ... :) > If we then later added another column that happened to also be named foo but > now was type JSONB, would we change the datatype returned based on the time > period being queried? Many databases do allow returning multiple result sets, and actually the PostgreSQL wire *protocol* also theoretically supports this, just that it is not supported by any current client library. With current libraries it would be possible to return a dynamic version of row_to_json(t.*) which changes based on actual historical table structure > Is the change in structure a system versioning which itself must be captured? We do capture it (kind of) for logical decoding. 
That is, we decode according to the structure in effect at the time of row creation, though we currently miss the actual DDL itself. So there is a lot to figure out and define, but at least storing the history in a separate table gives a good foundation to build upon. - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
Re: WIP: System Versioned Temporal Table
A side table has the nice additional benefit that we can very easily version the *table structure* so when we ALTER TABLE and the table structure changes we just make a new side table with the now-current structure. Also we may want a different set of indexes on the historic table(s) for whatever reason. And we may even want to partition history tables for speed, storage cost or just to drop very ancient history - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Sun, Sep 19, 2021 at 8:32 PM Simon Riggs wrote: > > On Sun, 19 Sept 2021 at 01:16, Corey Huinker wrote: > >> > >> 1. Much of what I have read about temporal tables seemed to imply or > >> almost assume that system temporal tables would be implemented as two > >> actual separate tables. Indeed, SQLServer appears to do it that way [1] > >> with syntax like > >> > >> WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.WebsiteUserInfoHistory)); > >> > >> > >> Q 1.1. Was that implementation considered and if so, what made this > >> implementation more appealing? > >> > > > > I've been digging some more on this point, and I've reached the conclusion > > that a separate history table is the better implementation. It would make > > the act of removing system versioning into little more than a DROP TABLE, > > plus adjusting the base table to reflect that it is no longer system > > versioned. > > Thanks for giving this a lot of thought. When you asked the question > the first time you hadn't discussed how that might work, but now we > have something to discuss. > > > 10. Queries that omit the FOR SYSTEM_TIME clause, as well as ones that use > > FOR SYSTEM_TIME AS OF CURRENT_TIMESTAMP, would simply use the base table > > directly with no quals to add. > > 11. Queries that use FOR SYSTEM_TIME and not FOR SYSTEM_TIME AS OF > > CURRENT_TIMESTAMP, then the query would do a union of the base table and > > the history table with quals applied to both. > > > > 14. 
DROP SYSTEM VERSIONING from a table would be quite straightforward - > > the history table would be dropped along with the triggers that reference > > it, setting relissystemversioned = 'f' on the base table. > > > > I think this would have some key advantages: > > > > 1. MVCC bloat is no worse than it was before. > > The number of row versions stored in the database is the same for > both, just it would be split across two tables in this form. > > > 2. No changes whatsoever to referential integrity. > > The changes were fairly minor, but I see your thinking about indexes > as a simplification. > > > 3. DROP SYSTEM VERSIONING becomes an O(1) operation. > > It isn't top of mind to make this work well. The whole purpose of the > history is to keep it, not to be able to drop it quickly. > > > > Thoughts? > > There are 3 implementation routes that I see, so let me explain so > that others can join the discussion. > > 1. Putting all data in one table. This makes DROP SYSTEM VERSIONING > effectively impossible. It requires access to the table to be > rewritten to add in historical quals for non-historical access and it > requires some push-ups around indexes. (The current patch adds the > historic quals by kludging the parser, which is wrong place, since it > doesn't work for joins etc.. However, given that issue, the rest seems > to follow on naturally). > > 2. Putting data in a side table. This makes DROP SYSTEM VERSIONING > fairly trivial, but it complicates many DDL commands (please make a > list?) and requires the optimizer to know about this and cater to it, > possibly complicating plans. Neither issue is insurmountable, but it > becomes more intrusive. > > The current patch could go in either of the first 2 directions with > further work. > > 3. Let the Table Access Method handle it. I call this out separately > since it avoids making changes to the rest of Postgres, which might be > a good thing, with the right TAM implementation. 
> > My preferred approach would be to do this "for free" in the table > access method, but we're a long way from this in terms of actual > implementation. When Corey suggested earlier that we just put the > syntax in there, this was the direction I was thinking. > > After waiting a day since I wrote the above, I think we should go with > (2) as Corey suggests, at least for now, and we can always add (3) > later. > > -- > Simon Riggshttp://www.EnterpriseDB.com/ > >
Re: The Free Space Map: Problems and Opportunities
On Wed, Sep 8, 2021 at 6:52 AM Peter Geoghegan wrote: > > On Tue, Sep 7, 2021 at 5:25 AM Hannu Krosing wrote: > > Are you speaking of just heap pages here or also index pages ? > > Mostly heap pages, but FWIW I think it could work for index tuples > too, with retail index tuple deletion. Because that allows you to even > remove would-be LP_DEAD item pointers. But it does need info from the heap page(s) the index entry points to, which can become quite expensive if that heap page is locked by other processes or even not cached at all. Also, parallel processes may have moved the index entry to other pages etc. > > Or are you expecting these to be kept in good-enoug shape by your > > earlier index manager work ? > > It's very workload dependent. Some things were very much improved by > bottom-up index deletion in Postgres 14, for example (non-hot updates > with lots of logically unchanged indexes). Other things weren't helped > at all, or were barely helped. I think it's important to cover or > cases. > > > A minimal useful patch emerging from that understanding could be > > something which just adds hysteresis to FSM management. (TBH, I > > actually kind of expected some hysteresis to be there already, as it > > is in my mental model of "how things should be done" for managing > > almost any resource :) ) > > I think that you need to do the FSM before the aborted-heap-tuple > cleanup. Otherwise you don't really know when or where to apply the > special kind of pruning that the patch invents, which targets aborts > specifically. An alternative is to not actually prune the page at all at this time (to avoid re-dirtying it and/or a WAL write) but just inspect it and add the *potential* free space to the FSM. And then do the pruning at the time of the next INSERT or UPDATE, when the page is dirtied and WAL written anyway. 
> > Adding hysteresis to FSM management can hopefully be done independent > > of all the other stuff and also seems to be something that is > > unobtrusive and non-controversial enough to fit in current release and > > possibly be even back-ported . > > I don't know about that! Seems kind of an invasive patch to me. Why do you think that changing the calculation when a page is added back to FSM is "kind of invasive" ? Not having looked at the code I would assume from the discussion here that we remove the page from FSM when free space goes above FILLFACTOR and we add it back when it goes back below FILLFACTOR. The code change would just change the "add it back" code to use FILLFACTOR - HYSTERESIS_FACTOR instead. What other parts does this need to touch ? Or is it just that the calculation source code is not localised well in code and written multiple times all over the place ? > > I did not mean CHECKPOINT as a command, but more the concept of > > writing back / un-dirtying the page. In this sense it *is* special > > because it is the last point in time where you are guaranteed to have > > the page available in buffercache and thus cheap to access for > > modifications plus you will avoid a second full-page writeback because > > of cleanup. Also you do not want to postpone the cleanup to actual > > page eviction, as that is usually in the critical path for some user > > query or command. > > I intend to do some kind of batching, but only at the level of small > groups of related transactions. Writing a page that was quickly opened > for new inserts, filled with newly inserted heap tuples, then closed, > and finally cleaned up doesn't seem like it needs to take any direct > responsibility for writeback. Agreed that this code does not need to care about writeback. 
I approached it from the other end and proposed the "just before writeback" as a strategy which is easy to understand conceptually and also is mostly "already there" codewise, that is there is a location in write-back code to plug the call to cleanup in . Cheers - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
Re: The Free Space Map: Problems and Opportunities
On Tue, Sep 7, 2021 at 2:29 AM Peter Geoghegan wrote: > > On Mon, Sep 6, 2021 at 4:33 PM Hannu Krosing wrote: > > When I have been thinking of this type of problem it seems that the > > latest -- and correct :) -- place which should do all kinds of > > cleanup like removing aborted tuples, freezing committed tuples and > > setting any needed hint bits would be background writer or CHECKPOINT. > > > > This would be more PostgreSQL-like, as it moves any work not > > immediately needed from the critical path, as an extension of how MVCC > > for PostgreSQL works in general. > > I think it depends. There is no need to do work in the background > here, with TPC-C. With my patch series each backend can know that it > just had an aborted transaction that affected a page that it more or > less still owns. And has very close at hand, for further inserts. It's > very easy to piggy-back the work once you have that sense of ownership > of newly allocated heap pages by individual backends/transactions. Are you speaking of just heap pages here or also index pages ? It seems indeed easy for heap, but index pages can be mixed up by other parallel work, especially things like Serial Primary Keys. Or are you expecting these to be kept in good-enough shape by your earlier index manager work ? > > This would be more PostgreSQL-like, as it moves any work not > > immediately needed from the critical path, as an extension of how MVCC > > for PostgreSQL works in general. > > I think that it also makes sense to have what I've called "eager > physical rollback" that runs in the background, as you suggest. > > I'm thinking of a specialized form of VACUUM that targets a specific > aborted transaction's known-dirtied pages. That's my long term goal, > actually. Originally I wanted to do this as a way of getting rid of > SLRUs and tuple freezing, by representing that all heap pages must > only have committed tuples implicitly. 
That seemed like a good enough > reason to split VACUUM into specialized "eager physical rollback > following abort" and "garbage collection" variants. > > The insight that making abort-related cleanup special will help free > space management is totally new to me -- it emerged from working > directly on this benchmark. But it nicely complements some of my > existing ideas about improving VACUUM. A minimal useful patch emerging from that understanding could be something which just adds hysteresis to FSM management. (TBH, I actually kind of expected some hysteresis to be there already, as it is in my mental model of "how things should be done" for managing almost any resource :) ) Adding hysteresis to FSM management can hopefully be done independent of all the other stuff, and also seems to be something that is unobtrusive and non-controversial enough to fit in the current release and possibly even be back-ported. > > But doing it as part of checkpoint probably ends up with less WAL > > writes in the end. > > I don't think that checkpoints are special in any way. They're very > important in determining the total number of FPIs we'll generate, and > so have huge importance today. But that seems accidental to me. I did not mean CHECKPOINT as a command, but more the concept of writing back / un-dirtying the page. In this sense it *is* special, because it is the last point in time where you are guaranteed to have the page available in the buffer cache and thus cheap to access for modifications, plus you will avoid a second full-page writeback because of the cleanup. Also you do not want to postpone the cleanup to actual page eviction, as that is usually in the critical path for some user query or command. Of course this becomes most important for workloads where the active working set is larger than fits in memory, and this is not a typical case for OLTP any more.
But at least freezing the page before write-out could have a very big impact on the need to freeze "old" pages available only on disk, and thus would be a cheap way to improve the problems around running out of transaction ids. > > There could be a possibility to do a small amount of cleanup -- enough > > for TPC-C-like workloads, but not larger ones -- while waiting for the > > next command to arrive from the client over the network. This of > > course assumes that we will not improve our feeder mechanism to have > > back-to-back incoming commands, which can already be done today, but > > which I have seen seldom used. > > That's what I meant, really. Doing the work of cleaning up a heap page > that a transaction inserts into (say pruning away aborted tuples or > setting hint bits) should ideally happen right after commit or abort > -- at least for OLTP-like workloads, which are the common case.
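The hysteresis idea above can be made concrete with a tiny sketch. This is purely hypothetical illustration -- the thresholds, names and structure are invented, not PostgreSQL's actual FSM code: a page is only advertised as having free space once its free space crosses a high-water mark, and is only retracted once it drops below a low-water mark, so pages near the boundary do not flip in and out of the map on every update.

```python
# Hypothetical sketch of hysteresis for free-space tracking.
# Thresholds and structure are invented for illustration only.

class HysteresisFSM:
    def __init__(self, advertise_at=1024, retract_at=256):
        # advertise_at > retract_at is what creates the hysteresis band
        assert advertise_at > retract_at
        self.advertise_at = advertise_at
        self.retract_at = retract_at
        self.advertised = set()

    def report_free(self, page, free_bytes):
        if free_bytes >= self.advertise_at:
            self.advertised.add(page)
        elif free_bytes < self.retract_at:
            self.advertised.discard(page)
        # between the two thresholds the previous state is kept

    def pages_with_space(self):
        return sorted(self.advertised)

fsm = HysteresisFSM()
fsm.report_free(1, 2000)   # plenty free -> advertised
fsm.report_free(1, 500)    # inside the band -> stays advertised
fsm.report_free(1, 100)    # below low-water mark -> retracted
fsm.report_free(2, 500)    # inside the band, never advertised -> stays out
print(fsm.pages_with_space())   # -> []
```

The point of the band between the two thresholds is exactly the "hysteresis for managing almost any resource" intuition: small back-and-forth changes in free space no longer cause the page to churn in and out of the map.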
Re: The Free Space Map: Problems and Opportunities
When I have been thinking of this type of problem it seems that the latest -- and correct :) -- place which should do all kinds of cleanup like removing aborted tuples, freezing committed tuples and setting any needed hint bits would be background writer or CHECKPOINT. This would be more PostgreSQL-like, as it moves any work not immediately needed from the critical path, as an extension of how MVCC for PostgreSQL works in general. Another option could be one more background cleaner with a "needs cleanup" bitmap which is updated by "real work" backends to let the cleaner know what needs to be done. But doing it as part of checkpoint probably ends up with less WAL writes in the end. That said, CHECKPOINT can efficiently clean up only Heap pages, unless we do some extra work to ensure buffer-cache eviction ordering so that Heap is (almost) always cleaned first and thus has an option to also clean index pointers which point to it. There could be a possibility to do a small amount of cleanup -- enough for TPC-C-like workloads, but not larger ones -- while waiting for the next command to arrive from the client over the network. This of course assumes that we will not improve our feeder mechanism to have back-to-back incoming commands, which can already be done today, but which I have seen seldom used. Cheers, - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Mon, Sep 6, 2021 at 11:59 PM Peter Geoghegan wrote: > > On Mon, Aug 16, 2021 at 5:15 PM Peter Geoghegan wrote: > > This is actually quite simple -- the chaos that follows is where it > > gets truly complicated (and not necessarily worth getting into just > > yet). The update of a given order and its order lines entries takes > > place hours later, within a delivery transaction. > > Update on this: aborted transactions play an important role in the > chaos that we see with BenchmarkSQL/TPC-C. > > This is surprising, in one way.
Only 1% of all transactions are > aborted (per the TPC-C spec). But in another way it's not surprising. > It's totally consistent with what I've been saying about open and > closed pages. When a page that's initially allocated and opened by a > backend becomes closed following some inserts, the page should be left > in a more or less pristine state -- owning backends that open and > close pages need to be "good citizens". > > The definition of "pristine" must also include "no already-aborted > tuples" -- aborted xact tuples are a problem that we can and should > nip in the bud. This is really an extension of what I've been saying > about not disrupting naturally occurring temporal and spatial locality. > We're far better off if the space reused by freeing aborted heap > tuples is reused by the very next transaction running in the very same > backend. Plus it's not that hard; we can know for sure that this space > is immediately available, which is not generally true of pruning. And > so this free space from aborted xact tuples is quite special. > > Even with the master branch the problems with the FSM are much less > noticeable once you configure BenchmarkSQL to not abort any > transactions [1]. You need to consider the whole picture to understand > how the issue of aborted transactions has such an outsized negative > impact on performance and space efficiency. > > It's a whole *chain* of events: > > * If just 1% of all transactions abort, that's enough to leave an > average of about 1 aborted tuple on every "order" table page (which > naturally has small heap tuples [2]), and about 10 on a minority of > the really big "new order" table's pages. Maybe 1 in 10 or 12 "order > line" table heap pages are affected in this way. > > * The overall impact is that small (but not tiny) pockets of free > space are left randomly strewn all over the place. Not just LP_DEAD > line pointer stubs -- whole entire aborted heap tuples.
> > * In theory we could address this problem using opportunistic pruning. > But in practice that simply cannot happen today, because we don't > touch the inserted rows until literally hours later. > > The same rows will eventually have updates and selects, but that won't > help -- too little, too late. That's just how opportunistic pruning > works: currently you need some kind of scan node for opportunistic > pruning to kick in, so continually inserting transactions (from new > orders) are currently fundamentally unable to do any opportunistic > pruning. Much less opportunistic pruning that kicks in at exactly the > right time. > > * When an autovacuum worker runs against these tables, it will of > course notice this diffuse free space from aborted tuples, and put it >
Re: Middleware Messages for FE/BE
Jesper, please don't confuse my ramblings with Simon's initial proposal. As I understand it, the original proposal was just about adding a new wire protocol message type, which could then be used for emitting custom messages by a middleware support extension - likely loaded as a preloaded shared library - and consumed by some middleware, which could either be some proxy or connection pooler, or something compiled as part of libpq, JDBC or another client driver. So it was just about providing extensibility to the protocol (the same way almost everything else in PostgreSQL is extensible). But at least initially each middleware would process only its own class, so a single if() and not a big switch/case :) The things I talked about were some forward-looking proposals on how to use this capability, and that some of the messages may at some point become used / usable by more than a single middleware component, at which point they should be standardised. - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Thu, Aug 19, 2021 at 8:29 PM Simon Riggs wrote: > > On Thu, 19 Aug 2021 at 19:04, Jesper Pedersen > wrote: > > > Should the middleware guess the client driver based on the startup > > message ? Or are you thinking about having a class identifier group for > > each client driver and then a massive "switch/case" inside each middleware ? > > The question is how does a client communicate with middleware? > > The message that would be sent would relate to the features of the > middleware, while being agnostic as to the client driver. > > -- > Simon Riggs http://www.EnterpriseDB.com/ > >
Re: Middleware Messages for FE/BE
Actually the way I have been thinking about this would be a message "Hey replica x.x.x.x, please take over the write head role", which would:

1. cause the replica x.x.x.x to self-promote at this point-in-WAL
   1A. optionally, the old write head becomes a replica (it should say so in the message if it plans to stay around as a replica)
2. tell the other replicas to switch over and start replicating from it
3. tell all connected clients to switch, if they need read-write access.

This could even be done without a timeline switch, as it should guarantee that there is no split-brain writing. Of course there are (and should be) ways to still use the WALs normally for cases where replica x.x.x.x does not exist, like PITR. And making this play nicely with Logical Decoding is another can of worms needing to be solved, but it is a start to becoming "Cloud Native" :) - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Thu, Aug 19, 2021 at 11:36 AM Hannu Krosing wrote: > > One more set of "standard middleware messages" clients/middleware > could turn on could be reporting LSNs > * various local LSNs for progress of WAL persisting > * reporting replication state of some or all replicas > > - > Hannu Krosing > Google Cloud - We have a long list of planned contributions and we are hiring. > Contact me if interested. > > On Thu, Aug 19, 2021 at 11:33 AM Hannu Krosing wrote: > > > > Need for this does come up quite often so I very much support this. > > > > In addition to keeping a registry there likely need to be some other > > "generally agreed" rules as well, like > > * it being safe to ignore any and all of the middleware messages (at > > least with no degradation from the state of not having them) > > * and maybe even a standard way to turn them on and off. > > > > On/Off switch could be of course done using flags for each > > individual use case, but it would be good to agree conventions.
> > > > Another thing to agree would be a set of standard messages, like "I am > > overloaded, consider moving some load away" or "Planning to switch > > over to replica x.x.x.x, please follow" > > > > - > > Hannu Krosing > > Google Cloud - We have a long list of planned contributions and we are > > hiring. > > Contact me if interested. > > > > > > On Thu, Aug 19, 2021 at 10:33 AM Simon Riggs > > wrote: > > > > > > The current FE/BE protocol assumes that the client is talking to the > > > server directly. If middleware wants to do something like load > > > balancing, the only current option is to inspect the incoming commands > > > and take action from that. That approach performs poorly and has > > > proven difficult to maintain, limiting the functionality in Postgres > > > ecosystem middleware. > > > > > > It would be useful to have a way to speak to middleware within a > > > session or prior to each command. There are ways to frig this and > > > obviously it can always be done out-of-core, but there is a clear > > > requirement for various client tools to agree a standard way for them > > > to send messages to middleware rather than the database server. If we > > > get PostgreSQL Project approval for this, then authors of client and > > > middleware tools will know how to interoperate and can begin adding > > > features to take advantage of this, allowing the Postgres ecosystem to > > > improve and extend its middleware. > > > > > > Byte1('M') > > > Identifies the message as a middleware message. (Or perhaps use 'U' > > > for User Message?). > > > > > > Int32 > > > Length of message contents in bytes, including self. > > > > > > Int64 > > > Routing/class identifier, allowing middleware to quickly ignore whole > > > classes of message if not appropriate. We would run some kind of > > > allocation scheme to ensure unique meaning, probably via the Postgres > > > Wiki. 
The first 2 billion values would be reserved for allocation by > > > the PostgreSQL Project itself, values beyond that open for external > > > allocation. > > > > > > Byten > > > The message itself, where n is the remaining bytes in the message. > > > > > > The message is intended for middleware only. The server always ignores > > > these messages, with an optional debug facility that can be enabled to > > > allow printing NOTICEs to allow testing. > > > > > > I will supply a patch to any agreed message format, together with a > > > new libpq call to utilise this. > > > > > > In summary: the patch is easy, the point is we need agreement to allow > > > and encourage interoperation between clients and middleware. > > > > > > Thoughts? > > > > > > -- > > > Simon Riggshttp://www.EnterpriseDB.com/ > > > > > >
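For concreteness, Simon's proposed framing -- Byte1('M'), Int32 length including itself, Int64 routing/class identifier, then the payload -- could be packed and parsed roughly as below. Only the field layout comes from the proposal above; the function names and the sample class id are invented for illustration. PostgreSQL's wire protocol uses network (big-endian) byte order, which is assumed here too.

```python
import struct

# Sketch of the proposed middleware message framing:
#   Byte1('M') | Int32 length (incl. self) | Int64 class id | payload
# Function names and the example class id are hypothetical.

def pack_middleware_message(class_id: int, payload: bytes) -> bytes:
    # the length field counts itself (4), the Int64 class id (8),
    # and the payload -- matching the FE/BE convention
    length = 4 + 8 + len(payload)
    return b"M" + struct.pack("!iq", length, class_id) + payload

def unpack_middleware_message(msg: bytes):
    assert msg[0:1] == b"M"
    length, class_id = struct.unpack("!iq", msg[1:13])
    payload = msg[13:13 + (length - 12)]
    return class_id, payload

msg = pack_middleware_message(42, b"please follow 10.0.0.5")
cid, body = unpack_middleware_message(msg)
print(cid, body)   # -> 42 b'please follow 10.0.0.5'
```

Note how a middleware only needs the first 13 bytes to decide whether the class id is one it handles, which is exactly the "quickly ignore whole classes of message" property the Int64 field is there for.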
Re: Middleware Messages for FE/BE
One more set of "standard middleware messages" clients/middleware could turn on could be reporting LSNs * various local LSNs for progress of WAL persisting * reporting replication state of some or all replicas ----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Thu, Aug 19, 2021 at 11:33 AM Hannu Krosing wrote: > > Need for this does come up quite often so I very much support this. > > In addition to keeping a registry there likely need to be some other > "generally agreed" rules as well, like > * it being safe to ignore any and all of the middleware messages (at > least with no degradation from the state of not having them) > * and maybe even a standard way to turn them on and off. > > On/Off switch could be of course done using flags for each > individual use case, but it would be good to agree conventions. > > Another thing to agree would be a set of standard messages, like "I am > overloaded, consider moving some load away" or "Planning to switch > over to replica x.x.x.x, please follow" > > - > Hannu Krosing > Google Cloud - We have a long list of planned contributions and we are hiring. > Contact me if interested. > > > On Thu, Aug 19, 2021 at 10:33 AM Simon Riggs > wrote: > > > > The current FE/BE protocol assumes that the client is talking to the > > server directly. If middleware wants to do something like load > > balancing, the only current option is to inspect the incoming commands > > and take action from that. That approach performs poorly and has > > proven difficult to maintain, limiting the functionality in Postgres > > ecosystem middleware. > > > > It would be useful to have a way to speak to middleware within a > > session or prior to each command. 
There are ways to frig this and > > obviously it can always be done out-of-core, but there is a clear > > requirement for various client tools to agree a standard way for them > > to send messages to middleware rather than the database server. If we > > get PostgreSQL Project approval for this, then authors of client and > > middleware tools will know how to interoperate and can begin adding > > features to take advantage of this, allowing the Postgres ecosystem to > > improve and extend its middleware. > > > > Byte1('M') > > Identifies the message as a middleware message. (Or perhaps use 'U' > > for User Message?). > > > > Int32 > > Length of message contents in bytes, including self. > > > > Int64 > > Routing/class identifier, allowing middleware to quickly ignore whole > > classes of message if not appropriate. We would run some kind of > > allocation scheme to ensure unique meaning, probably via the Postgres > > Wiki. The first 2 billion values would be reserved for allocation by > > the PostgreSQL Project itself, values beyond that open for external > > allocation. > > > > Byten > > The message itself, where n is the remaining bytes in the message. > > > > The message is intended for middleware only. The server always ignores > > these messages, with an optional debug facility that can be enabled to > > allow printing NOTICEs to allow testing. > > > > I will supply a patch to any agreed message format, together with a > > new libpq call to utilise this. > > > > In summary: the patch is easy, the point is we need agreement to allow > > and encourage interoperation between clients and middleware. > > > > Thoughts? > > > > -- > > Simon Riggshttp://www.EnterpriseDB.com/ > > > >
Re: Middleware Messages for FE/BE
Need for this does come up quite often so I very much support this. In addition to keeping a registry there likely need to be some other "generally agreed" rules as well, like * it being safe to ignore any and all of the middleware messages (at least with no degradation from the state of not having them) * and maybe even a standard way to turn them on and off. On/Off switch could be of course done using flags for each individual use case, but it would be good to agree conventions. Another thing to agree would be a set of standard messages, like "I am overloaded, consider moving some load away" or "Planning to switch over to replica x.x.x.x, please follow" - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Thu, Aug 19, 2021 at 10:33 AM Simon Riggs wrote: > > The current FE/BE protocol assumes that the client is talking to the > server directly. If middleware wants to do something like load > balancing, the only current option is to inspect the incoming commands > and take action from that. That approach performs poorly and has > proven difficult to maintain, limiting the functionality in Postgres > ecosystem middleware. > > It would be useful to have a way to speak to middleware within a > session or prior to each command. There are ways to frig this and > obviously it can always be done out-of-core, but there is a clear > requirement for various client tools to agree a standard way for them > to send messages to middleware rather than the database server. If we > get PostgreSQL Project approval for this, then authors of client and > middleware tools will know how to interoperate and can begin adding > features to take advantage of this, allowing the Postgres ecosystem to > improve and extend its middleware. > > Byte1('M') > Identifies the message as a middleware message. (Or perhaps use 'U' > for User Message?). > > Int32 > Length of message contents in bytes, including self. 
> > Int64 > Routing/class identifier, allowing middleware to quickly ignore whole > classes of message if not appropriate. We would run some kind of > allocation scheme to ensure unique meaning, probably via the Postgres > Wiki. The first 2 billion values would be reserved for allocation by > the PostgreSQL Project itself, values beyond that open for external > allocation. > > Byten > The message itself, where n is the remaining bytes in the message. > > The message is intended for middleware only. The server always ignores > these messages, with an optional debug facility that can be enabled to > allow printing NOTICEs to allow testing. > > I will supply a patch to any agreed message format, together with a > new libpq call to utilise this. > > In summary: the patch is easy, the point is we need agreement to allow > and encourage interoperation between clients and middleware. > > Thoughts? > > -- > Simon Riggshttp://www.EnterpriseDB.com/ > >
pgbench functions as extension
Has anybody worked on making the pgbench internal functions available as extensions? I mean the ones from "Built-In Functions" [*] Some of these are probably available in other extensions or even as builtins, but having things like hash_fnv1a() or random_zipfian() which are guaranteed to have the exact same behaviour as the ones in the bundled pgbench would enable more work to be pushed to the database and generally result in better interoperability for tests. -- [*] https://www.postgresql.org/docs/13/pgbench.html#PGBENCH-BUILTIN-FUNCTIONS ----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
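As a reference point for the "exact same behaviour" concern: hash_fnv1a() is based on the well-known FNV-1a algorithm, whose core is only a few lines. The sketch below is the textbook 64-bit FNV-1a over a byte string; pgbench's version operates on int64 values and mixes in a seed, so matching its output bit-for-bit would need checking against pgbench itself -- this only shows the underlying algorithm.

```python
# Textbook 64-bit FNV-1a over bytes. Not guaranteed to be bit-for-bit
# identical to pgbench's hash_fnv1a(), which hashes int64 values and
# takes a seed; the constants below are the standard FNV-1a parameters.

FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3
MASK64 = (1 << 64) - 1

def fnv1a_64(data: bytes) -> int:
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte                      # xor first ("1a" variant)...
        h = (h * FNV64_PRIME) & MASK64 # ...then multiply, mod 2^64
    return h

print(hex(fnv1a_64(b"")))   # offset basis: 0xcbf29ce484222325
```

Having this as an in-database extension function would let test data generation use the exact same distribution of hash values inside SQL as pgbench uses client-side.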
Re: NAMEDATALEN increase because of non-latin languages
Could we just make the limitation 64 (or 128) _characters_, not _bytes_? Memory sizes and processor speeds have grown by order(s) of magnitude since the 64 byte limit was decided, and supporting non-ASCII charsets properly seems like a prudent thing to do. Also - have we checked that at least the truncation does not cut UTF-8 characters in half? - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Wed, Aug 18, 2021 at 1:33 PM Julien Rouhaud wrote: > > On Wed, Aug 18, 2021 at 7:27 PM John Naylor > wrote: > > > > On Wed, Aug 18, 2021 at 7:15 AM Julien Rouhaud wrote: > > > > > > Unfortunately, the problem isn't really the additional disk space it > > > would require. The problem is the additional performance hit and > > > memory overhead, as the catalog names are part of the internal > > > syscache. > > > > Some actual numbers on recent hardware would show what kind of tradeoff is > > involved. No one has done that for a long time that I recall. > > Agreed, but I don't have access to such hardware. However this won't > influence the memory overhead part, and there is already frequent > problems with that, especially since declarative partitioning, so I > don't see how we could afford that without some kind of cache TTL or > similar. AFAIR the last discussion about it a few years ago didn't > lead anywhere :( > >
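Even with a byte limit, truncation can at least be made to never split a multibyte character. A sketch of what such a check could look like (this is an illustration of the idea, not PostgreSQL's actual name-truncation code, and the 63-byte default mirrors NAMEDATALEN-1):

```python
# Sketch: truncate a UTF-8 string to at most max_bytes without cutting
# a multibyte character in half. Not PostgreSQL's actual code; the
# default of 63 corresponds to NAMEDATALEN - 1.

def truncate_utf8(name: str, max_bytes: int = 63) -> str:
    raw = name.encode("utf-8")
    if len(raw) <= max_bytes:
        return name
    # walk back (at most 3 bytes) until the prefix is valid UTF-8 on
    # its own: a prefix ending mid-character fails to decode
    cut = max_bytes
    while cut > 0:
        try:
            return raw[:cut].decode("utf-8")
        except UnicodeDecodeError:
            cut -= 1
    return ""

# 32 two-byte characters = 64 bytes; a naive 63-byte cut would leave a
# dangling lead byte, so we back up to 62 bytes = 31 whole characters
print(truncate_utf8("ä" * 32) == "ä" * 31)   # -> True
```

A character-counted limit would sidestep this entirely, but byte-safe truncation is the minimal fix for the "cut in half" question.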
Is there now an official way to pass column projection to heap AM
When working on zedstore, the developers solved the lack of support for passing column projection into the AM as explained in the zedstore/README [*] -8<--8<--8<---8<--- Current table am API requires enhancement here to pass down column projection to AM. The patch showcases two different ways for the same. * For sequential scans added new beginscan_with_column_projection() API. Executor checks AM property and if it leverages column projection uses this new API else normal beginscan() API. * For index scans instead of modifying the begin scan API, added new API to specifically pass column projection list after calling begin scan to populate the scan descriptor but before fetching the tuples. -8<--8<--8<---8<--- This of course also requires patching the call sites in the PostgreSQL code itself to make use of these methods. As this was already almost a year ago, have there been any developments in the core PostgreSQL code for supporting this? Or is projection still supported only in FDWs? -- [*] https://github.com/greenplum-db/postgres/blob/zedstore/src/backend/access/zedstore/README - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
Re: [PoC] Improve dead tuple storage for lazy vacuum
e bit like a > hash join makes sense because hash joins do very well with high join > selectivity, and high join selectivity is common in the real world. > The intersection of TIDs from each leaf page with the in-memory TID > delete structure will often be very small indeed. The hard-to-optimize case is still when we have dead tuple counts in the hundreds of millions, or even billions, like on an HTAP database after a few hours of OLAP queries have accumulated loads of dead tuples in tables getting heavy OLTP traffic. There of course we could do a totally different optimisation, where we also allow reaping tuples newer than the OLAP query's snapshot if we can prove that when the snapshot moves forward next time, it has to jump over said transactions, making them indeed DEAD and not RECENTLY DEAD. Currently we let a single OLAP query ruin everything :) > > Then again it may be so much extra work that it starts to dominate > > some parts of profiles. > > > > For example see the work that was done in improving the mini-vacuum > > part where it was actually faster to copy data out to a separate > > buffer and then back in than shuffle it around inside the same 8k page > > Some of what I'm saying is based on the experience of improving > similar code used by index tuple deletion in Postgres 14. That did > quite a lot of sorting of TIDs and things like that. In the end the > sorting had no more than a negligible impact on performance. Good to know :) > What > really mattered was that we efficiently coordinate with distant heap > pages that describe which index tuples we can delete from a given leaf > page. Sorting hundreds of TIDs is cheap. Reading hundreds of random > locations in memory (or even far fewer) is not so cheap. It might even > be very slow indeed. Sorting in order to batch could end up looking > like cheap insurance that we should be glad to pay for. If the most expensive operation is sorting a few hundred tids, then this should be fast enough.
My worries were more that after the sorting we cannot do simple index lookups for them; each needs to be found via bsearch, or maybe even linear search if that is faster under some size limit, and these could add up. Or some other needed thing that also has to be done, like allocating extra memory or moving other data around in a way that the CPU does not like. Cheers - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
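The sorted-array-plus-bsearch lookup being discussed (what lazy_tid_reaped() effectively does) can be sketched in a few lines. This is an illustrative model, not the C implementation: TIDs are encoded so that integer order matches (block, offset) order, and membership is a single binary search. All names here are invented.

```python
import bisect

# Sketch of dead-TID lookup via a sorted array and binary search, in
# the spirit of lazy_tid_reaped(). Encoding a TID as
# block * MAX_OFFSET + offset keeps integer sort order identical to
# (block, offset) order. Names and constants are illustrative.

MAX_OFFSET = 292   # > MaxHeapTuplesPerPage (291), so offsets never wrap

def encode_tid(block: int, offset: int) -> int:
    return block * MAX_OFFSET + offset

def build_dead_tids(tids):
    # one sort up front, then every lookup is O(log n)
    return sorted(encode_tid(b, o) for b, o in tids)

def tid_is_dead(dead, block, offset):
    key = encode_tid(block, offset)
    i = bisect.bisect_left(dead, key)
    return i < len(dead) and dead[i] == key

dead = build_dead_tids([(10, 3), (10, 7), (42, 1)])
print(tid_is_dead(dead, 10, 7), tid_is_dead(dead, 10, 8))  # -> True False
```

Hannu's worry above is visible in this shape: every probe is a log-n search with data-dependent memory accesses, so the per-lookup cost, not the one-off sort, is what could add up over hundreds of millions of index tuples.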
Re: [PoC] Improve dead tuple storage for lazy vacuum
simple array of > TIDs. Maybe this will help the compiler and the CPU to fully > understand the *natural* data dependencies, so that they can be as > effective as possible in making the code run fast. It's possible that > a modern CPU will be able to *hide* the latency more intelligently > than what we have today. The latency is such a big problem that we may > be able to justify "wasting" other CPU resources, just because it > sometimes helps with hiding the latency. For example, it might > actually be okay to sort all of the TIDs on the page to make the bulk > processing work Then again it may be so much extra work that it starts to dominate some parts of profiles. For example see the work that was done in improving the mini-vacuum part where it was actually faster to copy data out to a separate buffer and then back in than shuffle it around inside the same 8k page :) So only testing will tell. > -- though you might still do a precheck that is > similar to the precheck inside lazy_tid_reaped() that was added by you > in commit bbaf315309e. > > Of course it's very easy to be wrong about stuff like this. But it > might not be that hard to prototype. You can literally copy and paste > code from _bt_delitems_delete_check() to do this. It does the same > basic thing already. Also a lot of testing would be needed to figure out which strategy fits best for which distribution of dead tuples, and possibly their relation to the order of tuples to check from indexes. Cheers -- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
Re: pgbench logging broken by time logic changes
On Thu, Jun 17, 2021 at 7:18 AM Kyotaro Horiguchi wrote: > > I'm not sure we have transaction lasts for very short time that > nanoseconds matters. > Nanoseconds may not matter yet, but they could be handy when for example we want to determine the order of parallel query executions. We are less than an order of magnitude away from being able to do 1M inserts/updates/deletes per second, so microseconds already are not always 100% reliable. We could possibly move to using LSNs fetched as part of the queries for this case, but this will surely introduce more problems than it solves :) Cheers - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
Re: [PoC] Improve dead tuple storage for lazy vacuum
Very nice results. I have been working on the same problem but with a bit different solution - a mix of binary search for (sub)pages and 32-bit bitmaps for tid-in-page. Even with the current allocation heuristics (allocate 291 tids per page) it initially allocates much less space: instead of the current 291*6=1746 bytes per page it needs to allocate 80 bytes. Also it can be laid out so that it is friendly to parallel SIMD searches doing up to 8 tid lookups in parallel. That said, for allocating the tid array, the best solution is to postpone it as much as possible and to do the initial collection into a file, which 1) postpones the memory allocation to the beginning of index cleanups 2) lets you select the correct size and structure as you know more about the distribution at that time 3) lets you do the first heap pass in one go and then advance frozenxmin *before* index cleanup Also, collecting dead tids into a file makes it trivial (well, almost :) ) to parallelize the initial heap scan, so more resources can be thrown at it if available. Cheers - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Thu, Jul 8, 2021 at 10:48 AM Masahiko Sawada wrote: > > On Thu, Jul 8, 2021 at 5:24 AM Peter Geoghegan wrote: > > > > On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada > > wrote: > > > Currently, the TIDs of dead tuples are stored in an array that is > > > collectively allocated at the start of lazy vacuum and TID lookup uses > > > bsearch(). There are the following challenges and limitations: > > > > > > 1. Don't allocate more than 1GB. There was a discussion to eliminate > > > this limitation by using MemoryContextAllocHuge() but there were > > > concerns about point 2[1]. > > > > I think that the main problem with the 1GB limitation is that it is > > surprising -- it can cause disruption when we first exceed the magical > > limit of ~174 million TIDs.
This can cause us to dirty index pages a > > second time when we might have been able to just do it once with > > sufficient memory for TIDs. OTOH there are actually cases where having > > less memory for TIDs makes performance *better* because of locality > > effects. This perverse behavior with memory sizing isn't a rare case > > that we can safely ignore -- unfortunately it's fairly common. > > > > My point is that we should be careful to choose the correct goal. > > Obviously memory use matters. But it might be more helpful to think of > > memory use as just a proxy for what truly matters, not a goal in > > itself. It's hard to know what this means (what is the "real goal"?), > > and hard to measure it even if you know for sure. It could still be > > useful to think of it like this. > > As I wrote in the first email, I think there are two important factors > in index vacuuming performance: the performance to check if heap TID > that an index tuple points to is dead, and the number of times to > perform index bulk-deletion. The flame graph I attached in the first > mail shows CPU spent much time on lazy_tid_reaped() but vacuum is a > disk-intensive operation in practice. Given that most index AM's > bulk-deletion does a full index scan and a table could have multiple > indexes, reducing the number of times to perform index bulk-deletion > really contributes to reducing the execution time, especially for > large tables. I think that a more compact data structure for dead > tuple TIDs is one of the ways to achieve that. > > > > > > A run container is selected in this test case, using 4 bytes for each > > > block. > > > > > > Execution Time Memory Usage > > > array 8,883.03600,008,248 > > > intset 7,358.23100,671,488 > > > tbm758.81 100,671,544 > > > rtbm 764.33 29,384,816 > > > > > > Overall, 'rtbm' has a much better lookup performance and good memory > > > usage especially if there are relatively many dead tuples. 
However, in > > > some cases, 'intset' and 'array' have a better memory usage. > > > > This seems very promising. > > > > I wonder how much you have thought about the index AM side. It makes > > sense to initially evaluate these techniques using this approach of > > separating the data structure from how it is used by VACUUM -- I think > > that that was a good idea. But at the same time there may be certain > > important theoretical questions that cannot be answered this way -- > > questions about how everything "fits together" in a real VACUUM might > > matter a lot. You've probably thought
Re: .ready and .done files considered harmful
How are you envisioning the shared-memory signaling should work in the original sample case, where the archiver had been failing for half a year ? Or should we perhaps have a system table for ready-to-archive WAL files, to get around the limitations of the file system in returning just the needed files with ORDER BY ... LIMIT, as we already know how to make lookups in the database fast ? Cheers Hannu On Thu, May 6, 2021 at 12:24 PM Robert Haas wrote: > > On Thu, May 6, 2021 at 3:23 AM Kyotaro Horiguchi > wrote: > > FWIW It's already done for v14 individually. > > > > Author: Fujii Masao > > Date: Mon Mar 15 13:13:14 2021 +0900 > > > > Make archiver process an auxiliary process. > > Oh, I hadn't noticed. Thanks. > > -- > Robert Haas > EDB: http://www.enterprisedb.com > >
Re: MaxOffsetNumber for Table AMs
- Hannu Krosing On Thu, May 6, 2021 at 4:53 AM Jeff Davis wrote: > > On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote: > > How hard would it be to declare TID as current ItemPointerData with > > some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe, > > MovedPartitionsOffsetNumber = 0xfffd, presumably also 0xffff ?). > > I don't think there's consensus in this thread that we want to do that, > but I'd be fine with it. Sure. I just proposed this as a Minimal Viable Change. Just hoping that we can agree on an interim solution which addresses the immediate problem and then continue arguing about the ideal way for the rest of the v15 cycle (and the v16 and v17 ;) ) > > It's possible but not trivial. tidbitmap.c would be the biggest > challenge, I think. I think all these places (tidbitmap, gin, brin) rely on "relatively small" MaxHeapTuplesPerPage as an upper bound for some allocations and then still allocate a lot more than needed. One can get to 291 tids / page only when you create a table with no columns, or with less than 8 columns which are all set to NULL. A table with a single non-null boolean column can already fit only 226 tuples per page. It is definitely a non-trivial amount of work to rewrite these three, but going to (almost) full 48 bits from the current theoretically-a-little-over-40-effective-bits would expand the number of addressable tuples 225 times. Of course it would be extra nice to also somehow encode the 3 special ItemPointerData values (NULL, 0xfffe, 0xfffd) "somewhere else" and get an option of uninterrupted 48-bit address space for non-heap table AMs, but this would likely be much more disruptive, if possible at all. We could still check if they are used outside of heapam and maybe just fix these places. > > Regards, > Jeff Davis > > >
Re: MaxOffsetNumber for Table AMs
On Thu, May 6, 2021 at 3:07 AM Robert Haas wrote: > > On Wed, May 5, 2021 at 3:43 PM Matthias van de Meent > > Storage gains for index-oriented tables can become as large as the > > size of the primary key by not having to store all primary key values > > in both the index and the table; which can thus be around 100% of a > > table in the least efficient cases of having a PK over all columns. > > > > Yes, this might be indeed only a 'small gain' for access latency, but > > not needing to store another copy of your data (and keeping it in > > cache, etc.) is a significant win in my book. > > This is a really good point. Also, if the table is ordered by a > synthetic logical TID, range scans on the primary key will be less > efficient than if the primary key is itself the TID. We have the > ability to CLUSTER on an index for good reasons, and "Automatically > maintain clustering on a table" has been on the todo list forever. > It's hard to imagine this will ever be achieved with the current heap, > though: the way to get there is to have a table AM for which this is > an explicit goal. But would this not have the downside that all the secondary indexes will blow up as they now need to have the full table row as the TID ? - Hannu Krosing
Re: MaxOffsetNumber for Table AMs
How hard would it be to declare TID as current ItemPointerData with some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe, MovedPartitionsOffsetNumber = 0xfffd, presumably also 0xffff ?). And then commit to fixing usage outside access/heap/ which depends on a small value for MaxHeapTuplesPerPage, currently only nodes/tidscan.c , access/gin/ginpostinglist.c and access/brin/brin_*.c there is also MaxHeapTuplesPerPage usage in ./backend/storage/page/bufpage.c but it seems to be all in heapam-dependent functions (PageRepairFragmentation(), PageGetHeapFreeSpace() and a few others) which most likely should be moved to access/heap/ Doing it this way would leave us with some manageable complexity in mapping from TID to 48-bit integer and/or 3 wanted positions in 2^32 Hannu Krosing On Wed, May 5, 2021 at 8:40 PM Peter Geoghegan wrote: > > On Wed, May 5, 2021 at 11:25 AM Andres Freund wrote: > > Agreed. And we can increase the fit a good bit without needing invasive > > all-over changes. With those changes often even helping heap. > > > > E.g. tidbitmap.c's hardcoded use of MaxHeapTuplesPerPage is a problem > > even for heap - we waste a lot of space that's not commonly used. A > > better datastructure (radix tree like I'd say, but several tree shaped > > approaches seem possible). > > Agreed -- even if we only cared about heapam we still ought to do > something about tidbitmap.c's use of MaxHeapTuplesPerPage. > > -- > Peter Geoghegan > >
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
It would be really convenient if user-visible serialisations of the query id had something that identifies the computation method. maybe prefix 'N' for internal, 'S' for pg_stat_statements etc. This would immediately show in logs at what point the id calculator was changed On Fri, Mar 19, 2021 at 5:54 PM Bruce Momjian wrote: > On Fri, Mar 19, 2021 at 10:27:51PM +0800, Julien Rouhaud wrote: > > On Fri, Mar 19, 2021 at 09:29:06AM -0400, Bruce Momjian wrote: > > > OK, that makes perfect sense. I think the best solution is to document > > > that compute_query_id just controls the built-in computation of the > > > query id, and that extensions can also compute it if this is off, and > > > pg_stat_activity and log_line_prefix will display built-in or extension > > > computed query ids. > > > > So the last version of the patch should implement that behavior right? > It's > > just missing some explicit guidance that third-party extensions should > only > > calculate a queryid if compute_query_id is off > > Yes, I think we are now down to just how the extensions should be told > to behave, and how we document this --- see the email I just sent. > > -- > Bruce Momjian https://momjian.us > EDB https://enterprisedb.com > > If only the physical world exists, free will is an illusion. > > > >
Re: shared memory stats: high level design decisions: consistency, dropping
> But now we could instead schedule stats to be removed at commit time. That's not trivial of course, as we'd need to handle cases where the commit fails after the commit record, but before processing the dropped stats. We likely can not remove them at commit time, but only after the oldest open snapshot moves past that commit ? Would an approach where we keep stats in a structure logically similar to MVCC we use for normal tables be completely unfeasible ? We would only need to keep one version per backend in a transaction. --- Hannu On Sat, Mar 20, 2021 at 12:51 AM Andres Freund wrote: > > Hi, > > I am working on Kyotaro Horiguchi's shared memory stats patch [1] with > the goal of getting it into a shape that I'd be happy to commit. That > thread is quite long and most are probably skipping over new messages in > it. > > There are two high-level design decisions / questions that I think > warrant a wider audience (I'll keep lower level discussion in the other > thread). > > In case it is not obvious, the goal of the shared memory stats patch is > to replace the existing statistics collector, to which new stats are > reported via an UDP socket, and where clients read data from the stats > collector by reading the entire database's stats from disk. > > The replacement is to put the statistics into a shared memory > segment. Fixed-size stats (e.g. bgwriter, checkpointer, wal activity, > etc) being stored directly in a struct in memory. Stats for objects > where a variable number exists, e.g. tables, are addressed via a dshash > table that points to the stats that are in turn allocated using dsa.h. > > > 1) What kind of consistency do we want from the pg_stats_* views? > > Right now the first access to stats in a transaction will trigger a read > of both the global and per-database stats from disk. If the on-disk > state is too old, we'll ask the stats collector to write out a new file > a couple times.
> > For the rest of the transaction that in-memory state is used unless > pg_stat_clear_snapshot() is called. Which means that multiple reads from > e.g. pg_stat_user_tables will show the same results as before [2]. > > That makes stats accesses quite expensive if there are lots of > objects. > > But it also means that separate stats accesses - which happen all the > time - return something repeatable and kind of consistent. > > Now, the stats aren't really consistent in the sense that they are > really accurate, UDP messages can be lost, or only some of the stats > generated by a TX might not yet have been received, other transactions > haven't yet sent them. Etc. > > > With the shared memory patch the concept of copying all stats for the > current database into local memory at the time of the first stats access > doesn't make sense to me. Horiguchi-san had actually implemented that, > but I argued that that would be cargo-culting an efficiency hack > required by the old storage model forward. > > The cost of doing this is substantial. On master, with a database that > contains 1 million empty tables, any stats access takes ~0.4s and > increases memory usage by 170MB. > > > 1.1) > > I hope everybody agrees with not requiring that stats don't need to be > the way they were at the time of first stat access in a transaction, > even if that first access was to a different stat object than the > currently accessed stat? > > > 1.2) > > Do we expect repeated accesses to the same stat to stay the same through > the transaction? The easiest way to implement stats accesses is to > simply fetch the stats from shared memory ([3]). That would obviously > result in repeated accesses to the same stat potentially returning > changing results over time. > > I think that's perfectly fine, desirable even, for pg_stat_*. > > > 1.3) > > What kind of consistency do we expect between columns of views like > pg_stat_all_tables? 
> > Several of the stats views aren't based on SRFs or composite-type > returning functions, but instead fetch each stat separately: > > E.g. pg_stat_all_tables: > SELECT c.oid AS relid, > n.nspname AS schemaname, > c.relname, > pg_stat_get_numscans(c.oid) AS seq_scan, > pg_stat_get_tuples_returned(c.oid) AS seq_tup_read, > sum(pg_stat_get_numscans(i.indexrelid))::bigint AS idx_scan, > sum(pg_stat_get_tuples_fetched(i.indexrelid))::bigint + > pg_stat_get_tuples_fetched(c.oid) AS idx_tup_fetch, > pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins, > ... > pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count >FROM pg_class c > LEFT JOIN pg_index i ON c.oid = i.indrelid > ... > > Which means that if we do not cache stats, additional stats updates > could have been applied between two stats accessors. E.g the seq_scan > from before some pgstat_report_stat() but the seq_tup_read from after. > > If we instead fetch all of a table's stats in one go, we would get > consistency between the columns. But obviously that'd require changing > all the stats views.
Re: shared memory stats: high level design decisions: consistency, dropping
On Sat, Mar 20, 2021 at 1:21 AM Andres Freund wrote: > > Hi, > > On 2021-03-20 01:16:31 +0100, Hannu Krosing wrote: > > > But now we could instead schedule stats to be removed at commit > > time. That's not trivial of course, as we'd need to handle cases where > > the commit fails after the commit record, but before processing the > > dropped stats. > > > > We likely can not remove them at commit time, but only after the > > oldest open snapshot moves parts that commit ? > > I don't see why? A dropped table is dropped, and cannot be accessed > anymore. Snapshots don't play a role here - the underlying data is gone > (minus a placeholder file to avoid reusing the oid, until the next > commit). If you run a vacuum on some unrelated table in the same > database, the stats for a dropped table will already be removed long > before there's no relation that could theoretically open the table. > > Note that table level locking would prevent a table from being dropped > if a long-running transaction has already accessed it. Yeah, just checked. DROP TABLE waits until the reading transaction finishes. > > > Would an approach where we keep stats in a structure logically similar > > to MVCC we use for normal tables be completely unfeasible ? > > Yes, pretty unfeasible. Stats should work on standbys too... I did not mean actually using MVCC and real transaction ids but rather similar approach, where (potentially) different stats rows are kept for each backend. This of course only is a win in case multiple backends can use the same stats row. Else it is easier to copy the backends version into backend local memory. But I myself do not see any problem with stats rows changing all the time. The only worry would be parts of the same row being out of sync. This can of course be solved by locking, but for large number of backends with tiny transactions this locking itself could potentially become a problem. 
Here alternating between two or more versions could help, and then it also starts to make sense to keep the copies in shared memory. > Regards, > > Andres
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Fri, Mar 19, 2021 at 2:29 PM Bruce Momjian wrote: > > OK, that makes perfect sense. I think the best solution is to document > that compute_query_id just controls the built-in computation of the > query id, and that extensions can also compute it if this is off, and > pg_stat_activity and log_line_prefix will display built-in or extension > computed query ids. > > It might be interesting someday to check if the hook changed a > pre-computed query id and warn the user in the logs, but that could > cause more log-spam problems than help. The log-spam could be mitigated by logging it just once per connection, the first time it is overridden. Also, we could ask the extensions to expose the "method name" in a read-only GUC, so one can do SHOW compute_query_id_method; and get the name of the method in use, e.g. compute_query_id_method --- builtin. And it may even dynamically change to indicate the overriding of the builtin: compute_query_id_method --- fancy_compute_query_id (overrides builtin)
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Thu, Mar 18, 2021 at 4:23 PM Pavel Stehule wrote: > But we don't support this feature. We are changing just a top scope's label. > So syntax "ALIAS FOR FUNCTION" is not good. The user can have false hopes In this case it looks like it should go together with other labels and have << label_here >> syntax ? And we are back to the question of where to put this top scope label :) Maybe we could still pull the function arguments into the outermost block's scope, and advise users to have an extra block if they want to have the same names in both ? CREATE OR REPLACE FUNCTION fx(a int, b int) RETURNS ... AS $$ << fnargs >> BEGIN << topblock >> DECLARE a int := fnargs.a * 2; b int := topblock.a + 2; BEGIN END; END; $$; and we could make the empty outer block optional and treat two consecutive labels as top scope label and outermost block label CREATE OR REPLACE FUNCTION fx(a int, b int) RETURNS ... AS $$ << fnargs >> << topblock >> DECLARE a int := fnargs.a * 2; b int := topblock.a + 2; BEGIN END; $$; But I agree that this is also not ideal. And CREATE OR REPLACE FUNCTION fx(a int, b int) WITH (TOPSCOPE_LABEL fnargs) RETURNS ... AS $$ << topblock >> DECLARE a int := fnargs.a * 2; b int := topblock.a + 2; BEGIN END; $$; is even more inconsistent than the #option syntax Cheers Hannu PS: > > For cleanness/orthogonality I would also prefer the block labels to be in > DECLARE > for each block, but this train has already left :) > Though we probably could add alternative syntax ALIAS FOR BLOCK ? > > why? Is it a joke? > > you are defining a block label, and you want to in the same block redefine > some outer label? I don't think it is a good idea.
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Thu, Mar 18, 2021 at 3:45 PM Pavel Stehule wrote: > > > > On Thu, Mar 18, 2021 at 15:19 Hannu Krosing wrote: ... >> Variation could be >> >> DECLARE >>fnarg ALIAS FOR FUNCTION a_function_with_an_inconveniently_long_name; >> >> so it is clear that we are aliasing current function name >> >> >> or even make the function name optional, as there can only be one, so >> >> DECLARE >>fnarg ALIAS FOR FUNCTION; > > > Why you think so it is better than just > > #routine_label fnarg > > ? It just adds a *third* way to (re)name things, in addition to << blocklabel >> and ALIAS FOR. For some reason it feels to me similar to having completely different syntax rules for function names, argument names and variable names instead of all having to follow the rules for "identifiers". For cleanness/orthogonality I would also prefer the block labels to be in DECLARE for each block, but this train has already left :) Though we probably could add alternative syntax ALIAS FOR BLOCK ? > Maybe the name of the option can be routine_alias? Is it better for you? I don't think the name itself is bad, as it is supposed to be used for both FUNCTION and PROCEDURE renaming > My objection to DECLARE is freedom of this clause. It can be used everywhere. > the effect of DECLARE is block limited. So theoretically somebody can write > > BEGIN > .. > DECLARE fnarg ALIAS FOR FUNCTION; > BEGIN > .. > END; > > END; > > It disallows simple implementation. We could limit ALIAS FOR FUNCTION to the outermost block only, and revisit this later if there are loud complaints (which I don't think there will be :) ) Cheers Hannu
Re: pl/pgsql feature request: shorthand for argument and local variable references
I went to the archives and read the whole discussion for this from the beginning. I did not really understand why using the ALIAS FOR syntax is significantly harder to implement than #routine_label. The things you mentioned as currently using #OPTION seem to be in a different category from the aliases and block labels - they are more like pragmas - so to me it still feels inconsistent to use #routine_label for renaming the outer namespace. Also checked the code, and indeed there is support for #OPTION DUMP and #VARIABLE_CONFLICT ERROR. Is it documented and just hard to find ? On Thu, Mar 18, 2021 at 3:19 PM Hannu Krosing wrote: > > On Thu, Mar 18, 2021 at 5:27 AM Pavel Stehule wrote: > > > > > > There are few main reasons: > > > > a) compile options are available already - so I don't need invent new > > syntax - #OPTION DUMP, #PRINT_STRICT ON, #VARIABLE_CONFLICT ERROR > > Are these documented anywhere ? > > At least a quick search for pl/pgsql + #OPTION DUMP, #PRINT_STRICT ON, > #VARIABLE_CONFLICT ERROR returned nothing relevant and also searching > in > > If pl/pgsql actually supports any of these, these should get documented! > > > > For me the most logical place for declaring a function name alias > would be to allow ALIAS FOR the function name in DECLARE section. > > plpgsql_test=# CREATE FUNCTION > a_function_with_an_inconveniently_long_name(i int , OUT o int) > LANGUAGE plpgsql AS $plpgsql$ > DECLARE >fnarg ALIAS FOR a_function_with_an_inconveniently_long_name; > BEGIN >fnarg.o = 2 * fnarg.i; > END; > $plpgsql$; > > but unfortunately this currently returns > > ERROR: 42704: variable "a_function_with_an_inconveniently_long_name" > does not exist > LINE 4:fnarg ALIAS FOR a_function_with_an_inconveniently_long_na...
>^ > LOCATION: plpgsql_yyparse, pl_gram.y:677 > > Variation could be > > DECLARE >fnarg ALIAS FOR FUNCTION a_function_with_an_inconveniently_long_name; > > so it is clear that we are aliasing current function name > > or even make the function name optional, as there can only be one, so > > DECLARE >fnarg ALIAS FOR FUNCTION; > > > Cheers > Hannu
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Thu, Mar 18, 2021 at 5:27 AM Pavel Stehule wrote: > > > There are few main reasons: > > a) compile options are available already - so I don't need invent new syntax > - #OPTION DUMP, #PRINT_STRICT ON, #VARIABLE_CONFLICT ERROR Are these documented anywhere ? At least a quick search for pl/pgsql + #OPTION DUMP, #PRINT_STRICT ON, #VARIABLE_CONFLICT ERROR returned nothing relevant and also searching in If pl/pgsql actually supports any of these, these should get documented! For me the most logical place for declaring a function name alias would be to allow ALIAS FOR the function name in DECLARE section. plpgsql_test=# CREATE FUNCTION a_function_with_an_inconveniently_long_name(i int , OUT o int) LANGUAGE plpgsql AS $plpgsql$ DECLARE fnarg ALIAS FOR a_function_with_an_inconveniently_long_name; BEGIN fnarg.o = 2 * fnarg.i; END; $plpgsql$; but unfortunately this currently returns ERROR: 42704: variable "a_function_with_an_inconveniently_long_name" does not exist LINE 4:fnarg ALIAS FOR a_function_with_an_inconveniently_long_na... ^ LOCATION: plpgsql_yyparse, pl_gram.y:677 Variation could be DECLARE fnarg ALIAS FOR FUNCTION a_function_with_an_inconveniently_long_name; so it is clear that we are aliasing current function name or even make the function name optional, as there can only be one, so DECLARE fnarg ALIAS FOR FUNCTION; Cheers Hannu
Re: pl/pgsql feature request: shorthand for argument and local variable references
why are you using yet another special syntax for this ? would it not be better to do something like this: CREATE FUNCTION a_reall_long_and_winding_function_name(i int, out o int) LANGUAGE plpgsql AS $plpgsql$ DECLARE args function_name_alias BEGIN args.o = 2 * args.i END; $plpgsql$; or at least do something based on block labels (see end of https://www.postgresql.org/docs/13/plpgsql-structure.html ) CREATE FUNCTION somefunc() RETURNS integer AS $$ << outerblock >> DECLARE ... maybe extend this to CREATE FUNCTION somefunc() RETURNS integer AS $$ << functionlabel:args >> << outerblock >> DECLARE ... or replace 'functionlabel' with something less ugly :) maybe <<< args >>> Cheers Hannu On Wed, Mar 17, 2021 at 5:05 PM Pavel Stehule wrote: > > > > st 17. 3. 2021 v 9:20 odesílatel Michael Paquier napsal: >> >> On Wed, Mar 17, 2021 at 02:06:57PM +0800, Julien Rouhaud wrote: >> > I also think that there should be a single usable top label, otherwise it >> > will >> > lead to confusing code and it can be a source of bug. >> >> Okay, that's fine by me. Could it be possible to come up with an >> approach that does not hijack the namespace list contruction and the >> lookup logic as much as it does now? I get it that the patch is done >> this way because of the ordering of operations done for the initial ns >> list creation and the grammar parsing that adds the routine label on >> top of the root one, but I'd like to believe that there are more solid >> ways to achieve that, like for example something that decouples the >> root label and its alias, or something that associates an alias >> directly with its PLpgSQL_nsitem entry? > > > I am checking it now, and I don't see any easy solution. The namespace is a > one direction tree - only backward iteration from leafs to roots is > supported. At the moment, when I try to replace the name of root ns_item, > this root ns_item is referenced by the function's arguments (NS_TYPE_VAR > type). 
So anytime I need some helper ns_item node, that can be a persistent > target for these nodes. In this case is a memory overhead of just one ns_item. > > orig_label <- argument1 <- argument2 > > The patch has to save the original orig_label because it can be referenced > from argument1 or by "FOUND" variable. New graphs looks like > > new_label <- invalidated orig_label <- argument1 <- ... > > This tree has a different direction than is usual, and then replacing the > root node is not simple. > > Second solution (significantly more simple) is an additional pointer in > ns_item structure. In almost all cases this pointer will not be used. Because > ns_item uses a flexible array, then union cannot be used. I implemented this > in a patch marked as "alias-implementation". > > What do you think about it? > > Pavel > > > > > > >> >> -- >> Michael
Re: Boundary value check in lazy_tid_reaped()
We could also go parallel in another direction - I have been mulling about writing a "vectorized" bsearch which would use AVX2, where you look up 64 (or more likely 32, so tids also fit in a 256-bit vector) tids at a time. The trickiest part is that the search can complete at a different iteration for each value. Of course it is possible that this has very bad RAM access behaviour and is no win at all even if it otherwise works. -- Hannu Krosing On Tue, Mar 16, 2021 at 10:08 PM Peter Geoghegan wrote: > On Sun, Mar 14, 2021 at 4:22 PM Thomas Munro > wrote: > > BTW I got around to trying this idea out for a specialised > > bsearch_itemptr() using a wide comparator, over here: > > Cool! > > I have another thing that should be considered when we revisit this > area in the future: maybe we should structure the binary search to > lookup multiple TIDs at once -- probably all of the TIDs from a given > leaf page, taken together. > > There is now an nbtree function used for tuple deletion (all tuple > deletion, not just bottom-up deletion) that works like this: > _bt_delitems_delete_check(). I suspect that it would make sense to > generalize it to do the same thing for regular VACUUM. Perhaps this > idea would have to be combined with other techniques to show a real > benefit. It would probably be necessary to sort the TIDs first (just > like index_delete_sort() does for the _bt_delitems_delete_check() code > today), but that's probably no big deal. > > It is typical to have 200 - 400 TIDs on an nbtree leaf page without > using deduplication. And with deduplication enabled you can have as > many as 1300 TIDs on a single 8KiB nbtree leaf page. It's easy to > imagine something like GCC's __builtin_prefetch() (or maybe just more > predictable access patterns in our "batch binary search") making > everything much faster through batching.
This will also naturally make > btvacuumpage() much less "branchy", since of course it will no longer > need to process the page one TID at a time -- that helps too. > > -- > Peter Geoghegan > > >
Re: Extensibility of the PostgreSQL wire protocol
On Thu, Mar 4, 2021 at 9:55 PM Jan Wieck wrote: > > Another possibility, and this is what is being used by the AWS team > implementing the TDS protocol for Babelfish, is to completely replace > the entire TCOP mainloop function PostgresMain(). I suspect this is the only reasonable way to do it for protocols which are not very close to libpq. > That is of course a > rather drastic move and requires a lot more coding on the extension > side, Not necessarily - if the new protocol is close to existing one, then it is copy/paste + some changes. If it is radically different, then trying to fit it into the current mainloop will be even harder than writing from scratch. And will very likely fail in the end anyway :) > but the whole thing was developed that way from the beginning and > it is working. I don't have a definitive date when that code will be > presented. Kuntal or Prateek may be able to fill in more details. Are you really fully replacing the main loop, or are you running a second main loop in parallel in the same database server instance, perhaps as a separate TDS_postmaster backend ? Will the data still also be accessible "as postgres" via port 5432 when TDS/SQLServer support is active ?
Re: PROXY protocol support
The current proposal seems to miss the case of transaction pooling (and statement pooling) where the same established connection multiplexes transactions / statements from multiple remote clients. What we would need for that case would be a function pg_set_remote_client_address(be_key, remote_ip, remote_hostname) where only be_key and remote_ip are required, but any string (up to a certain length) would be accepted as the hostname. It would be really nice if we could send this request at the protocol level, but if that is hard to do then having a function would get us half way there. The be_key in the function is the key from PGcancel, which is stored by libpq when making the connection, and it is there to make sure that only the directly connecting proxy can successfully call the function. Cheers Hannu On Fri, Mar 5, 2021 at 12:21 AM Jacob Champion wrote: > > On Thu, 2021-03-04 at 21:45 +0100, Magnus Hagander wrote: > > On Thu, Mar 4, 2021 at 9:07 PM Jacob Champion wrote: > > > Idle thought I had while setting up a local test rig: Are there any > > > compelling cases for allowing PROXY packets to arrive over Unix > > > sockets? (By which I mean, the proxy is running on the same machine as > > > Postgres, and connects to it using the .s.PGSQL socket file instead of > > > TCP.) Are there cases where you want some other software to interact > > > with the TCP stack instead of Postgres, but it'd still be nice to have > > > the original connection information available? > > > > I'm uncertain what that usecase would be for something like haproxy, > > tbh. It can't do connection pooling, so adding it on the same machine > > as postgres itself wouldn't really add anything, I think? > > Yeah, I wasn't thinking HAproxy so much as some unspecified software > appliance that's performing Some Task before allowing a TCP client to > speak to Postgres. But it'd be better to hear from someone that has an > actual use case, instead of me spitballing.
> > I did think about the other end, if you had a proxy on a different > machine accepting unix connections and passing them on over > PROXY-over-tcp. But I doubt it's useful to know it was unix in that > case (since it still couldn't do peer or such for the auth) -- > instead, that seems like an argument where it'd be better to proxy > without using PROXY and just letting the IP address be. > > You could potentially design a system that lets you proxy a "local all > all trust" setup from a different (trusted) machine, without having to > actually let people onto the machine that's running Postgres. That > would require some additional authentication on the PROXY connection > (i.e. something stronger than host-based auth) to actually be useful. > > -- other notes -- > > A small nitpick on the current separate-port PoC is that I'm forced to > set up a "regular" TCP port, even if I only want the PROXY behavior. > > The original-host logging isn't working for me: > > WARNING: pg_getnameinfo_all() failed: ai_family not supported > LOG: proxy connection from: host=??? port=??? > > and I think the culprit is this: > >/* Store a copy of the original address, for logging */ > >memcpy(&port->raddr_save, &port->raddr, port->raddr.salen); > > port->raddr.salen is the length of port->raddr.addr; we want the length > of the copy to be sizeof(port->raddr) here, no? > > --Jacob
Re: Extensibility of the PostgreSQL wire protocol
I have not looked at the actual patch, but does it allow an extension to set up its own channels to listen on? For example, if I wanted to set up a server to listen for incoming connections over QUIC [1] - a protocol which creates a connection over UDP and allows clients to move to new IP addresses (among other things) - then would the current extensibility proposal cover this? Maybe a correct approach would be to just start up a separate "postmaster" to listen to a different protocol? [1] https://en.wikipedia.org/wiki/QUIC Cheers Hannu
Re: We should stop telling users to "vacuum that database in single-user mode"
On Wed, Mar 3, 2021 at 11:33 AM David Rowley wrote: > > On Wed, 3 Mar 2021 at 21:44, Magnus Hagander wrote: ... > > I think we misunderstand each other. I meant this only as a comment > > about the idea of ignoring the cost limit in single user mode -- that > > is, it's a reason to *want* vacuum to not run as quickly as possible > > in single user mode. I should've trimmed the email better. > > I meant to ignore the cost limits if we're within a hundred million or > so of the stopLimit. Per what Hannu mentioned, there does not seem to > be a great need with current versions of PostgreSQL to restart the > instance in single-user mode. VACUUM still works once we're beyond the > stopLimit. It's just commands that need to generate a new XID that'll > fail with the error message mentioned by Hannu. I am investigating the possibility of introducing a special "Restricted Maintenance Mode" to let the admin mitigate the situation after xidStopLimit is reached, maybe for another 0.5M txids, by doing things like
* dropping an index - to make vacuum faster
* dropping a table - sometimes it is better to drop a table in order to get the production database functional again instead of waiting hours for the vacuum to finish, and then later restore it from backup or maybe access it from a read-only clone of the database via FDW
* dropping a stale replication slot which is holding back vacuum
To make sure that this will not accidentally just move xidStopLimit by 0.5M for users who run their main workloads as a superuser (they do exist!) this mode should be restricted to
* superuser only
* only a subset of commands / functions
* heavily throttled to avoid running out of TXIDs, maybe 1-10 xids per second
* maybe also requiring a GUC to be set, to be very explicit
> > I agree with your other idea, that of kicking in a more aggressive > > autovacuum if it's not dealing with things fast enough.
Maybe even in > > an incremental way - that is run with the default, then at another > > threshold drop them to half, and at yet another threshold drop them to > > 0. I agree that pretty much anything is better than forcing the user > > into single user mode. > > OK cool. I wondered if it should be reduced incrementally or just > switch off the cost limit completely once we're beyond > ShmemVariableCache->xidStopLimit. An abrupt change is more likely to make the user/DBA notice that something is going on. I have even been thinking about deliberate throttling to make the user notice / pay attention. > If we did want it to be incremental > then if we had say ShmemVariableCache->xidFastVacLimit, which was > about 100 million xids before xidStopLimit, then the code could adjust > the sleep delay down by the percentage through we are from > xidFastVacLimit to xidStopLimit. > > However, if we want to keep adjusting the sleep delay then we need to > make that work for vacuums that are running already. We don't want to > call ReadNextTransactionId() too often, but maybe if we did it once > per 10 seconds worth of vacuum_delay_point()s. That way we'd never do > it for vacuums already going at full speed. There are already examples of this in the code, for example the decision to force-start disabled autovacuum is considered after every 64k transactions. There is a related item in https://commitfest.postgresql.org/32/2983/ . When that gets done, we could drive the adjustments from autovacuum.c by adding the remaining-XID-range adjustment to the existing worker delay adjustment mechanism in autovac_balance_cost() and signalling the autovacuum backend to run the adjustment every few seconds once we are in the danger zone. Cheers Hannu
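As a starting point for any such adjustment logic, the distance to wraparound can already be watched from SQL with a standard monitoring query (this is existing functionality, not part of the proposal):

```sql
-- Age of the oldest unfrozen XID per database. Autovacuum tries to keep
-- this below autovacuum_freeze_max_age; the hard stop limit discussed
-- above is hit a few million XIDs short of the 2^31 wraparound point.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```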
We should stop telling users to "vacuum that database in single-user mode"
restart the database or just kill the vacuum process after reloading flags. After this it should complete in 15-30 min, after which the database is available for writes again. Cheers, Hannu Krosing
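The "kill the vacuum process after reloading flags" step can be done from SQL; a sketch, assuming the vacuum_cost_* settings were already relaxed in postgresql.conf:

```sql
-- Pick up the relaxed cost-limit settings without restarting the server
SELECT pg_reload_conf();

-- Cancel the currently running autovacuum worker(s) so the table is
-- re-vacuumed with the new, faster settings
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';
```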
Re: CREATE ROUTINE MAPPING
Hi Corey, Have you looked at pl/proxy? It does this and then some (sharding). It actually started out as a set of pl/pythonu functions, but then got formalized into a full extension language for defining remote (potentially sharded) function calls. Best Regards Hannu Krosing On Fri, 12 Jan 2018 at 03:38, Corey Huinker wrote: > A few months ago, I was researching ways for formalizing calling functions > on one postgres instance from another. RPC, basically. In doing so, I > stumbled across an obscure part of the SQL Standard called ROUTINE > MAPPING, which is exactly what I'm looking for. > > The syntax specified is, roughly: > > CREATE ROUTINE MAPPING local_routine_name FOR remote_routine_spec > SERVER my_server [ OPTIONS( ... ) ] > > > Which isn't too different from CREATE USER MAPPING. > > The idea here is that if I had a local query: > > SELECT t.x, remote_func1(), remote_func2(t.y) > > FROM remote_table t > > WHERE t.active = true; > > > that would become this query on the remote side: > > SELECT t.x, local_func1(), local_func2(t.y) > > FROM local_table t > > WHERE t.active = true; > > > > That was probably the main intention of this feature, but I see a > different possibility there. Consider the cases: > > SELECT remote_func(1,'a'); > > > and > > SELECT * FROM remote_srf(10, true); > > > Now we could have written remote_func() and remote_srf() in plpythonu, and > it could access whatever remote data that we wanted to see, but that > exposes our local server to the untrusted pl/python module as well as > python process overhead. > > We could create a specialized foreign data wrapper that requires a WHERE > clause to include all the required parameters as predicates, essentially > making every function a table, but that's awkward and unclear to an end > user.
> > Having the ability to import functions from other servers allows us to > write foreign servers that expose functions to the local database, and > those foreign servers handle the bloat and risks associated with accessing > that remote data. > > Moreover, it would allow hosted environments (AWS, etc) that restrict the > extensions that can be added to the database to still connect to those > foreign data sources. > > I'm hoping to submit a patch for this someday, but it touches on several > areas of the codebase where I have no familiarity, so I've put forth to > spark interest in the feature, to see if any similar work is underway, or > if anyone can offer guidance. > > Thanks in advance. >
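For reference, a minimal pl/proxy remote function looks roughly like this (the function and the 'userdb' cluster name are illustrative; a PL/Proxy cluster must be configured separately, see the PL/Proxy documentation):

```sql
-- Runs get_user_email() on the remote partition selected by hashing the
-- username; 'userdb' must be a configured PL/Proxy cluster.
CREATE FUNCTION get_user_email(i_username text)
RETURNS text AS $$
    CLUSTER 'userdb';
    RUN ON hashtext(i_username);
$$ LANGUAGE plproxy;
```

The local call `SELECT get_user_email('alice')` is then executed as the same-named function on the chosen remote node, which is essentially what CREATE ROUTINE MAPPING would formalize.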
Re: Proposal: http2 wire format
> > > * room for other resultset formats later. Like Damir, I really want to > add > > protobuf or json serializations of result sets at some point, mainly so > we > > can return "entity graphs" in graph representation rather than left-join > > projection. > > -1. I don't think this belongs in postgres. > Maybe the functionality does not belong in *core* postgres, but I sure would like it to be possible for an extension to do it. A bit similar to what logical decoding plugins do now, just more flexible in terms of protocol. Cheers Hannu
A Generic Question about Generic type subscripting
Sorry for being late to the party. I started looking at the thread about "Generic type subscripting" and am wondering: why does it take the approach of modifying pg_type and lots of internal functions, when instead it could be defined in a much lighter and less intrusive way as an operator, probably by reserving a dedicated operator name CREATE OPERATOR [...] (PROCEDURE = json_object_field, LEFTARG=jsonb, RIGHTARG=text); CREATE OPERATOR [...] (PROCEDURE = jsonb_array_element, LEFTARG=jsonb, RIGHTARG=int); This might put more work on the writers of actual subscripting operators, but if we are looking at diverse new types, it may also be that writing operator functions is even easier than learning a full new skillset for writing "subscripting functions". Defining "subscripting" as an operator does still require changes to how current subscripting operations are parsed, but even there I am not sure it would be more complex, as all the parser has to do is determine that it is a subscripting operation and then delegate to the corresponding operator. Also, there is the problem of what to do with element (or even slice) assignments, as these require three arguments. I see two possibilities: 1) add a third "ARG" to the CREATE OPERATOR syntax, maybe VALUEARG 2) use composite types - so for jsonb1[int1] = jsonb2 the operator would be defined by first defining CREATE TYPE intkey_and_jsonvalue AS (key int, value jsonb) and then using this in CREATE OPERATOR [...] (PROCEDURE = jsonb_set_key, LEFTARG=jsonb, RIGHTARG=intkey_and_jsonvalue) Cheers Hannu
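A sketch of the composite-type variant, with a made-up operator name since the [...] is deliberately left open above (~~> is purely illustrative, and jsonb_set_key is a hypothetical wrapper around the existing jsonb_set):

```sql
-- Composite type carrying both the subscript and the value to assign
CREATE TYPE intkey_and_jsonvalue AS (key int, value jsonb);

-- Hypothetical helper: set element kv.key of target to kv.value
CREATE FUNCTION jsonb_set_key(target jsonb, kv intkey_and_jsonvalue)
RETURNS jsonb AS $$
    SELECT jsonb_set(target, ARRAY[(kv).key::text], (kv).value)
$$ LANGUAGE sql;

-- ~~> stands in for whatever dedicated operator name would be reserved
CREATE OPERATOR ~~> (
    PROCEDURE = jsonb_set_key,
    LEFTARG   = jsonb,
    RIGHTARG  = intkey_and_jsonvalue
);
```

With this in place, `jsonb1[int1] = jsonb2` would be parsed as `jsonb1 ~~> ROW(int1, jsonb2)::intkey_and_jsonvalue`.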
Re: AS OF queries
On 20.12.2017 14:45, Konstantin Knizhnik wrote: > I wonder if Postgres community is interested in supporting time travel > queries in PostgreSQL (something like AS OF queries in Oracle: > https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm). > As far as I know something similar is now developed for MariaDB. > > It seems to me that it will be not so difficult to implement them in > Postgres - we already have versions of tuples. > Looks like we only need to do three things: > 1. Disable autovacuum (autovacuum = off) In the design of the original University Postgres (which was a full-history database geared towards WORM drives) it was the task of vacuum to move old tuples to "an archive" from where the AS OF queries would then fetch them as needed. This might also be a good place to do Commit LSN to Commit Timestamp translation. Hannu > 2. Enable commit timestamp (track_commit_timestamp = on) > 3. Add asofTimestamp to snapshot and patch XidInMVCCSnapshot to > compare commit timestamps when it is specified in snapshot. > > > Attached please find my prototype implementation of it. > Most of the efforts are needed to support asof timestamp in grammar > and add it to query plan. > I failed to support AS OF clause (as in Oracle) because of > shift-reduce conflicts with aliases, > so I have to introduce new ASOF keyword. May be yacc experts can > propose how to solve this conflict without introducing new keyword... > > Please notice that now ASOF timestamp is used only for data snapshot, > not for catalog snapshot. > I am not sure that it is possible (and useful) to travel through > database schema history...
> > Below is an example of how it works: > > postgres=# create table foo(pk serial primary key, ts timestamp > default now(), val text); > CREATE TABLE > postgres=# insert into foo (val) values ('insert'); > INSERT 0 1 > postgres=# insert into foo (val) values ('insert'); > INSERT 0 1 > postgres=# insert into foo (val) values ('insert'); > INSERT 0 1 > postgres=# select * from foo; > pk | ts | val > ++ > 1 | 2017-12-20 14:59:17.715453 | insert > 2 | 2017-12-20 14:59:22.933753 | insert > 3 | 2017-12-20 14:59:27.87712 | insert > (3 rows) > > postgres=# select * from foo asof timestamp '2017-12-20 14:59:25'; > pk | ts | val > ++ > 1 | 2017-12-20 14:59:17.715453 | insert > 2 | 2017-12-20 14:59:22.933753 | insert > (2 rows) > > postgres=# select * from foo asof timestamp '2017-12-20 14:59:20'; > pk | ts | val > ++ > 1 | 2017-12-20 14:59:17.715453 | insert > (1 row) > > postgres=# update foo set val='upd',ts=now() where pk=1; > UPDATE 1 > postgres=# select * from foo asof timestamp '2017-12-20 14:59:20'; > pk | ts | val > ++ > 1 | 2017-12-20 14:59:17.715453 | insert > (1 row) > > postgres=# select * from foo; > pk | ts | val > ++ > 2 | 2017-12-20 14:59:22.933753 | insert > 3 | 2017-12-20 14:59:27.87712 | insert > 1 | 2017-12-20 15:09:17.046047 | upd > (3 rows) > > postgres=# update foo set val='upd2',ts=now() where pk=1; > UPDATE 1 > postgres=# select * from foo asof timestamp '2017-12-20 15:10'; > pk | ts | val > ++ > 2 | 2017-12-20 14:59:22.933753 | insert > 3 | 2017-12-20 14:59:27.87712 | insert > 1 | 2017-12-20 15:09:17.046047 | upd > (3 rows) > > > Comments and feedback are welcome:) > -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability https://2ndquadrant.com/