Will there be https://wiki.postgresql.org/wiki/PgCon_2024_Developer_Unconference ?
Hello Everybody!

For at least the last two years we have had Developer Unconference notes in the PostgreSQL Wiki:

https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference
https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference

And I know that people took notes at least at the unconference sessions I attended. So is there a plan to collect them in https://wiki.postgresql.org/wiki/PgCon_2024_Developer_Unconference like in previous years?

-- Hannu
Re: DROP OWNED BY fails to clean out pg_init_privs grants
Hi Daniel,

pg_upgrade is just one important user of pg_dump, which is the one that generates REVOKE for a non-existent role. We should definitely also fix pg_dump, likely by just checking that the role exists when generating REVOKE commands (which may be good practice for other cases too, so instead of casting to ::regrole, do the actual join).

## Here is the fix for pg_dump

While flying to Vancouver I looked around in the pg_dump code, and it looks like the easiest way to mitigate the dangling pg_init_privs entries is to replace the query in pg_dump with one that filters out invalid entries.

The current query is at line 9336:

    /* Fetch initial-privileges data */
    if (fout->remoteVersion >= 90600)
    {
        printfPQExpBuffer(query,
                          "SELECT objoid, classoid, objsubid, privtype, initprivs "
                          "FROM pg_init_privs");
        res = ExecuteSqlQuery(fout, query->data, PGRES_TUPLES_OK);

And we need the same, but filtering out invalid aclitems from initprivs, something like this:

    WITH q AS (
        SELECT objoid, classoid, objsubid, privtype,
               unnest(initprivs) AS initpriv
          FROM pg_init_privs
    )
    SELECT objoid, classoid, objsubid, privtype,
           array_agg(initpriv) AS initprivs
      FROM q
     WHERE is_valid_value_for_type(initpriv::text, 'aclitem')
     GROUP BY 1, 2, 3, 4;

### The proposed replacement query

Unfortunately we do not have an existing is_valid_value_for_type(value text, type text, OUT res boolean) function, so for a read-only workaround the following seems to work.

Here I first collect the initprivs array elements which fail the conversion to text and back into an array and store it in the GUC pg_dump.bad_aclitems. Then I use this stored list to filter out the bad ones in the actual query.
    DO $$
    DECLARE
        aclitem_text text;
        bad_aclitems text[] = '{}';
    BEGIN
        FOR aclitem_text IN SELECT DISTINCT unnest(initprivs)::text FROM pg_init_privs LOOP
            BEGIN
                /* try to convert back to aclitem */
                PERFORM aclitem_text::aclitem;
            EXCEPTION WHEN OTHERS THEN
                /* collect bad aclitems */
                bad_aclitems := bad_aclitems || ARRAY[aclitem_text];
            END;
        END LOOP;
        IF bad_aclitems != '{}' THEN
            RAISE WARNING 'Ignoring bad aclitems "%" in pg_init_privs', bad_aclitems;
        END IF;
        PERFORM set_config('pg_dump.bad_aclitems', bad_aclitems::text, false); -- true for trx-local
    END;
    $$;

    WITH q AS (
        SELECT objoid, classoid, objsubid, privtype,
               unnest(initprivs) AS initpriv
          FROM pg_init_privs
    )
    SELECT objoid, classoid, objsubid, privtype,
           array_agg(initpriv) AS initprivs
      FROM q
     WHERE NOT initpriv::text = ANY (current_setting('pg_dump.bad_aclitems')::text[])
     GROUP BY 1, 2, 3, 4;

-- Hannu

On Sun, May 26, 2024 at 11:27 PM Daniel Gustafsson wrote:
>
> > On 26 May 2024, at 23:25, Tom Lane wrote:
> >
> > Hannu Krosing writes:
> >> Attached is a minimal patch to allow missing roles in REVOKE command
> >
> > FTR, I think this is a very bad idea.
>
> Agreed, this is papering over a bug. If we are worried about pg_upgrade it
> would be better to add a check to pg_upgrade which detects this case and
> advices the user how to deal with it.
>
> --
> Daniel Gustafsson
>
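The missing generic validity-check helper wished for above could be sketched in PL/pgSQL roughly like this (the function name and signature are made up for illustration; nothing like it exists in core):

```sql
CREATE OR REPLACE FUNCTION is_valid_value_for_type(value text, typename text)
RETURNS boolean
LANGUAGE plpgsql
AS $$
BEGIN
    /* attempt the text-to-type cast; any error means the value is invalid */
    EXECUTE format('SELECT %L::%s', value, typename);
    RETURN true;
EXCEPTION WHEN OTHERS THEN
    RETURN false;
END;
$$;

-- e.g. is_valid_value_for_type('16390=arwdDxtm/16390', 'aclitem')
-- would return false once role 16390 no longer exists
```

With a helper of this shape the one-off DO-block-plus-GUC workaround above would not be needed, at the cost of one subtransaction per checked value.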
Re: DROP OWNED BY fails to clean out pg_init_privs grants
Attached is a minimal patch to allow missing roles in the REVOKE command.

This should fix the pg_upgrade issue, and also the case where somebody has dropped a role you are trying to revoke privileges from:

smalltest=# create table revoketest();
CREATE TABLE
smalltest=# revoke select on revoketest from bob;
WARNING:  ignoring REVOKE FROM a missing role "bob"
REVOKE
smalltest=# create user bob;
CREATE ROLE
smalltest=# grant select on revoketest to bob;
GRANT
smalltest=# \du
                              List of roles
 Role name |                        Attributes
-----------+------------------------------------------------------------
 bob       |
 hannuk    | Superuser, Create role, Create DB, Replication, Bypass RLS

smalltest=# \dp
                                    Access privileges
 Schema |    Name    | Type  |    Access privileges    | Column privileges | Policies
--------+------------+-------+-------------------------+-------------------+----------
 public | revoketest | table | hannuk=arwdDxtm/hannuk +|                   |
        |            |       | bob=r/hannuk            |                   |
 public | vacwatch   | table |                         |                   |
(2 rows)

smalltest=# revoke select on revoketest from bob, joe;
WARNING:  ignoring REVOKE FROM a missing role "joe"
REVOKE
smalltest=# \dp
                                    Access privileges
 Schema |    Name    | Type  |   Access privileges    | Column privileges | Policies
--------+------------+-------+------------------------+-------------------+----------
 public | revoketest | table | hannuk=arwdDxtm/hannuk |                   |
 public | vacwatch   | table |                        |                   |
(2 rows)

On Sun, May 26, 2024 at 12:05 AM Hannu Krosing wrote:
>
> On Sat, May 25, 2024 at 4:48 PM Tom Lane wrote:
> >
> > Hannu Krosing writes:
> > > Having an pg_init_privs entry referencing a non-existing user is
> > > certainly of no practical use.
> >
> > Sure, that's not up for debate. What I think we're discussing
> > right now is
> >
> > 1. What other cases are badly handled by the pg_init_privs
> > mechanisms.
> >
> > 2. How much of that is practical to fix in v17, seeing that
> > it's all long-standing bugs and we're already past beta1.
> >
> > I kind of doubt that the answer to #2 is "all of it".
> > But perhaps we can do better than "none of it".
> Putting the fix either in pg_dump or making REVOKE tolerate
> non-existing users would definitely be the most practical / useful fixes,
> as these would actually allow pg_upgrade to v17 to work without
> changing anything in older versions.
>
> Currently one already can revoke a privilege that is not there in the
> first place, with the end state being that the privilege (still) does
> not exist.
> This does not even generate a warning.
>
> Extending this to revoking from users that do not exist does not seem
> any different on a conceptual level, though I understand that the
> implementation would be very different, as it needs to catch the user
> lookup error from a very different part of the code.
>
> That said, it would be better if we can have something that would be
> easy to backport -- something that would make pg_upgrade work for all
> supported versions.
> Making REVOKE silently ignore revoking from non-existing users would
> improve general robustness but could conceivably change behaviour if
> somebody relies on it in their workflows.
>
> Regards,
> Hannu

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 7abf3c2a74..aff91582d7 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -444,6 +444,9 @@ ExecuteGrantStmt(GrantStmt *stmt)
 	 * Convert the RoleSpec list into an Oid list. Note that at this point we
 	 * insert an ACL_ID_PUBLIC into the list if appropriate, so downstream
 	 * there shouldn't be any additional work needed to support this case.
+	 *
+	 * Allow missing grantees in case of REVOKE (!istmt.is_grant);
+	 * if no valid roles are found, return immediately.
 	 */
 	foreach(cell, stmt->grantees)
 	{
@@ -456,12 +459,20 @@ ExecuteGrantStmt(GrantStmt *stmt)
 				grantee_uid = ACL_ID_PUBLIC;
 				break;
 			default:
-				grantee_uid = get_rolespec_oid(grantee, false);
+				grantee_uid = get_rolespec_oid(grantee, !istmt.is_grant);
 				break;
 		}
-		istmt.grantees = lappend_oid(istmt.grantees, grantee_uid);
+		if (OidIsValid(grantee_uid))
+			istmt.grantees = lappend_oid(istmt.grantees, grantee_uid);
+		else
+
Re: DROP OWNED BY fails to clean out pg_init_privs grants
On Sat, May 25, 2024 at 4:48 PM Tom Lane wrote:
>
> Hannu Krosing writes:
> > Having an pg_init_privs entry referencing a non-existing user is
> > certainly of no practical use.
>
> Sure, that's not up for debate. What I think we're discussing
> right now is
>
> 1. What other cases are badly handled by the pg_init_privs
> mechanisms.
>
> 2. How much of that is practical to fix in v17, seeing that
> it's all long-standing bugs and we're already past beta1.
>
> I kind of doubt that the answer to #2 is "all of it".
> But perhaps we can do better than "none of it".

Putting the fix either in pg_dump or making REVOKE tolerate non-existing users would definitely be the most practical / useful fixes, as these would actually allow pg_upgrade to v17 to work without changing anything in older versions.

Currently one can already revoke a privilege that is not there in the first place, with the end state being that the privilege (still) does not exist. This does not even generate a warning.

Extending this to revoking from users that do not exist does not seem any different on a conceptual level, though I understand that the implementation would be very different, as it needs to catch the user lookup error from a very different part of the code.

That said, it would be better if we can have something that is easy to backport -- something that would make pg_upgrade work for all supported versions. Making REVOKE silently ignore revoking from non-existing users would improve general robustness, but could conceivably change behaviour if somebody relies on it in their workflows.

Regards,
Hannu
Re: DROP OWNED BY fails to clean out pg_init_privs grants
On Fri, May 24, 2024 at 10:00 PM Tom Lane wrote:
>
> Robert Haas writes:
> > On Fri, May 24, 2024 at 2:57 PM Tom Lane wrote:
> >> Doesn't seem right to me. That will give pg_dump the wrong idea
> >> of what the initial privileges actually were, and I don't see how
> >> it can construct correct delta GRANT/REVOKE on the basis of false
> >> information. During the dump reload, the extension will be
> >> recreated with the original owner (I think), causing its objects'
> >> privileges to go back to the original pg_init_privs values.
>
> > Oh! That does seem like it would make what I said wrong, but how would
> > it even know who the original owner was? Shouldn't we be recreating
> > the object with the owner it had at dump time?
>
> Keep in mind that the whole point here is for the pg_dump script to
> just say "CREATE EXTENSION foo", not to mess with the individual
> objects therein. So the objects are (probably) going to be owned by
> the user that issued CREATE EXTENSION.
>
> In the original conception, that was the end of it: what you got for
> the member objects was whatever state CREATE EXTENSION left behind.
> The idea of pg_init_privs is to support dump/reload of subsequent
> manual alterations of privileges for extension-created objects.
> I'm not, at this point, 100% certain that that's a fully realizable
> goal.

The issue became visible because pg_dump issued a bogus

REVOKE ALL ON TABLE public.pg_stat_statements FROM "16390";

Maybe the right place for a fix is in pg_dump, and the fix would be to *not* issue the REVOKE ALL ON ... FROM at all? Or alternatively change REVOKE to treat non-existing users as a no-op?

Also, the pg_init_privs entry should either go away or at least be changed at the point when the user referenced in it is dropped. Having a pg_init_privs entry referencing a non-existing user is certainly of no practical use. Or maybe we should change the user at that point to NULL or some special non-existing-user id?
> But I definitely think it's insane to expect that to work > without also tracking changes in the ownership of said objects. > > Maybe forbidding ALTER OWNER on extension-owned objects isn't > such a bad idea?
Re: DROP OWNED BY fails to clean out pg_init_privs grants
While the 'DROP OWNED BY fails to clean out pg_init_privs grants' issue is now fixed, we have a similar issue with REASSIGN OWNED BY that is still there.

Tested on a fresh git checkout from May 20th:

test=# create user privtestuser superuser;
CREATE ROLE
test=# set role privtestuser;
SET
test=# create extension pg_stat_statements;
CREATE EXTENSION
test=# select * from pg_init_privs where privtype = 'e';
 objoid | classoid | objsubid | privtype |                      initprivs
--------+----------+----------+----------+------------------------------------------------------
  16405 |     1259 |        0 | e        | {privtestuser=arwdDxtm/privtestuser,=r/privtestuser}
  16422 |     1259 |        0 | e        | {privtestuser=arwdDxtm/privtestuser,=r/privtestuser}
  16427 |     1255 |        0 | e        | {privtestuser=X/privtestuser}
(3 rows)

test=# reset role;
RESET
test=# reassign owned by privtestuser to hannuk;
REASSIGN OWNED
test=# select * from pg_init_privs where privtype = 'e';
 objoid | classoid | objsubid | privtype |                      initprivs
--------+----------+----------+----------+------------------------------------------------------
  16405 |     1259 |        0 | e        | {privtestuser=arwdDxtm/privtestuser,=r/privtestuser}
  16422 |     1259 |        0 | e        | {privtestuser=arwdDxtm/privtestuser,=r/privtestuser}
  16427 |     1255 |        0 | e        | {privtestuser=X/privtestuser}
(3 rows)

test=# drop user privtestuser;
DROP ROLE
test=# select * from pg_init_privs where privtype = 'e';
 objoid | classoid | objsubid | privtype |            initprivs
--------+----------+----------+----------+---------------------------------
  16405 |     1259 |        0 | e        | {16390=arwdDxtm/16390,=r/16390}
  16422 |     1259 |        0 | e        | {16390=arwdDxtm/16390,=r/16390}
  16427 |     1255 |        0 | e        | {16390=X/16390}
(3 rows)

This will cause pg_dump to produce something that can't be loaded back into the database:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements WITH SCHEMA public;
...
REVOKE ALL ON TABLE public.pg_stat_statements FROM "16390";
...

And this will, among other things, break pg_upgrade.

- Hannu

On Tue, Apr 30, 2024 at 6:40 AM David G. Johnston wrote:
>
> On Monday, April 29, 2024, Tom Lane wrote:
>>
>> "David G. Johnston" writes:
>> > My solution to this was to rely on the fact that the bootstrap superuser is
>> > assigned OID 10 regardless of its name.
>> >> Yeah, I wrote it that way to start with too, but reconsidered >> because >> >> (1) I don't like hard-coding numeric OIDs. We can avoid that in C >> code but it's harder to do in SQL. > > > If the tests don’t involve, e.g., the predefined role pg_monitor and its > grantor of the memberships in the other predefined roles, this indeed can be > avoided. So I think my test still needs to check for 10 even if some other > superuser is allowed to produce the test output since a key output in my case > was the bootstrap superuser and the initdb roles. > >> >> (2) It's not clear to me that this test couldn't be run by a >> non-bootstrap superuser. I think "current_user" is actually >> the correct thing for the role executing the test. > > > Agreed, testing against current_role is correct if the things being queried > were created while executing the test. I would need to do this as well to > remove the current requirement that my tests be run by the bootstrap > superuser. > > David J. >
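The dangling entries shown earlier in this thread could be spotted with a catalog query along these lines (a sketch, not from the original mails; it uses the built-in aclexplode() function, and grantee 0 means PUBLIC):

```sql
-- list pg_init_privs ACL entries whose grantor or grantee role no longer exists
SELECT p.objoid, p.classoid, p.objsubid,
       a.grantor, a.grantee, a.privilege_type
  FROM pg_init_privs p,
       LATERAL aclexplode(p.initprivs) AS a
 WHERE NOT EXISTS (SELECT 1 FROM pg_authid WHERE oid = a.grantor)
    OR (a.grantee <> 0
        AND NOT EXISTS (SELECT 1 FROM pg_authid WHERE oid = a.grantee));
```

A check of this shape is roughly what adding a detection step to pg_upgrade, as suggested upthread, would amount to.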
Re: Function and Procedure with same signature?
On Thu, Mar 7, 2024 at 5:46 PM Tom Lane wrote:
>
> Hannu Krosing writes:
> > On Sat, Feb 10, 2024 at 12:38 AM Tom Lane wrote:
> >> Worth noting perhaps that this is actually required by the SQL
> >> standard: per spec, functions and procedures are both "routines"
> >> and share the same namespace,
>
> > Can you point me to a place in the standard where it requires all
> > kinds of ROUTINES to be using the same namespace ?
>
> [ digs around a bit... ] Well, the spec is vaguer about this than
> I thought. It is clear on one thing: 11.60 conformance rules
> include ...

Thanks for the thorough analysis of the standard.

I went and looked at what the other relevant databases do in this space, based on their documentation.

TL;DR:

* MS SQL Server
  - no overloading allowed anywhere
* MySQL
  - no overloading
* Oracle
  - no overloading at top level
  - overloading and independent namespaces for functions and procedures
* Teradata
  - function overloading allowed
  - not clear from the documentation if this also applies to procedures
  - the function overloading docs do not mention possible clashes with procedure names anywhere
* DB2
  - function overloading fully supported
  - procedure overloading supported, but only for a distinct NUMBER OF ARGUMENTS

I'll try to get access to a Teradata instance to verify the above.

So at a high level, most other Serious Databases
- do support function overloading
- keep functions and procedures in separate namespaces

I still think that PostgreSQL having functions and procedures share the same namespace is an unneeded and unjustified restriction.

I plan to do some hands-on testing on Teradata and DB2 to understand this better, but my current thinking is that we should not be more restrictive than others unless there is a strong technical reason for it. And currently I do not see any.

> It could be argued that this doesn't prohibit having both a function
> and a procedure with the same data type list, only that you can't
> write ROUTINE when trying to drop or alter one.
> But I think that's
> just motivated reasoning. The paragraphs for being
> FUNCTION or PROCEDURE are exactly like the above except they say
> "exactly one function" or "exactly one procedure". If you argue that
> this text means we must allow functions and procedures with the same
> parameter lists, then you are also arguing that we must allow multiple
> functions with the same parameter lists, and it's just the user's
> tough luck if they need to drop one of them.

The main issue is not dropping them, but the inability to determine which one to call.

We already have this in the case of two overloaded functions with the same initial argument types where the rest have defaults:

---
hannuk=# create function get(i int, out o int) begin atomic select i; end;
CREATE FUNCTION
hannuk=# create function get(i int, j int default 0, out o int) begin atomic select i+j; end;
CREATE FUNCTION
hannuk=# select get(1);
ERROR:  function get(integer) is not unique
LINE 1: select get(1);
               ^
HINT:  Could not choose a best candidate function. You might need to add explicit type casts.
---

> A related point is that our tolerance for overloading routine
> names isn't unlimited: we don't allow duplicate names with the
> same list of input parameters, even if the output parameters are
> different.

This again has a good reason, as there would be many cases where you could not decide which one to call. Not allowing overloading based only on return types is common across all OO languages.

My point is that this does not apply to FUNCTION vs. PROCEDURE, as it is very clear from the CALL syntax which one is meant.

> This is not something that the SQL spec cares to
> address, but there are good ambiguity-avoidance reasons for it.
> I think limiting overloading so that a ROUTINE specification is
> unambiguous is also a good thing.

I think ROUTINE being unambiguous is not a very good goal. What if the next version of the standard introduces DROP DATABASE OBJECT?
> I remain unexcited about changing our definition of this.
> "Oracle allows it" is not something that has any weight in
> my view: they have made a bunch of bad design decisions
> as well as good ones, and I think this is a bad one.

Fully agree that "Oracle allows it" is a non-argument. My main point is that there is no strong reason to have objects which are distinct at the syntax level be in the same namespace.

# Oracle is actually much more restrictive in its top-level object namespace

All of the following share the same namespace: [Packages, Private synonyms, Sequences, Stand-alone procedures, Stand-alone stored functions, Tables, User-defined operators, User-defined types, Views]. (I guess this makes the parser easier to write.)

The equivalent in PostgreSQL would be [extensions, schemas, tabl
CREATE TABLE creates a composite type corresponding to the table row, which is and is not there
I could not find any explanation of the following behaviour in the docs.

Our documentation for CREATE TABLE says:

    CREATE TABLE also automatically creates a data type that represents the
    composite type corresponding to one row of the table. Therefore, tables
    cannot have the same name as any existing data type in the same schema.

But these composite types are only sometimes there:

hannuk=# CREATE TABLE pair(a int, b int);
CREATE TABLE
hannuk=# INSERT INTO pair VALUES(1,2);
INSERT 0 1
hannuk=# select pg_typeof(p) from pair as p;
 pg_typeof
-----------
 pair
hannuk=# select pg_typeof(pg_typeof(p)) from pair as p;
 pg_typeof
-----------
 regtype

# first case where I can not use the table-defined type

hannuk=# create table anoter_pair of pair;
ERROR:  type pair is not a composite type

# the type definitely is there as promised

hannuk=# create type pair as (a int, b int);
ERROR:  type "pair" already exists

# and I can create a similar type with another name and use it to create a table

hannuk=# create type pair2 as (a int, b int);
CREATE TYPE
hannuk=# create table anoter_pair of pair2;
CREATE TABLE

# and I can even use it in LIKE

hannuk=# CREATE TABLE pair3(like pair2);
CREATE TABLE

# the type is present in pg_type with typtype 'c' for Composite

hannuk=# select typname, typtype from pg_type where typname = 'pair';
 typname | typtype
---------+---------
 pair    | c
(1 row)

# and I can add a comment to the type

hannuk=# COMMENT ON TYPE pair is 'A Shroedingers type';
COMMENT

# but \dT does not show it (second case)

hannuk=# \dT pair
     List of data types
 Schema | Name | Description
--------+------+-------------
(0 rows)

---
Hannu
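For what it's worth, the difference between the two cases is visible via pg_class: a table's row type is tied to a relation of kind 'r' (ordinary table), while CREATE TABLE ... OF and \dT only accept free-standing composite types, whose backing relation has kind 'c'. A quick way to see this, assuming the pair/pair2 session above:

```sql
-- compare the relation kind behind each composite type
SELECT t.typname, t.typtype, c.relkind
  FROM pg_type t
  JOIN pg_class c ON c.oid = t.typrelid
 WHERE t.typname IN ('pair', 'pair2');
-- 'pair' should show relkind 'r' (table row type),
-- 'pair2' relkind 'c' (stand-alone composite type)
```

So both show typtype 'c' in pg_type, and the distinction that drives the errors above lives one join away, in relkind.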
Re: Function and Procedure with same signature?
Hi Tom,

On Sat, Feb 10, 2024 at 12:38 AM Tom Lane wrote:
>
> "David G. Johnston" writes:
> > On Fri, Feb 9, 2024, 12:05 Deepak M wrote:
> >> Folks, When tried to create a function with the same signature as
> >> procedure it fails.
>
> > That seems like a good hint you cannot do it. Specifically because they
> > get defined in the same internal catalog within which names must be unique.

The fact that we currently enforce it this way seems like an implementation deficiency that should be fixed.

Bringing this up again, as there seems to be no good reason to have this restriction: there is no place in the call syntax where it would be unclear whether a FUNCTION or a PROCEDURE is meant.

And at least Oracle does allow this (and possibly others do too), so having this unnecessary restriction in PostgreSQL makes migrating Oracle applications that use this feature needlessly complicated.

> Worth noting perhaps that this is actually required by the SQL
> standard: per spec, functions and procedures are both "routines"
> and share the same namespace,

Can you point me to a place in the standard where it requires all kinds of ROUTINES to be using the same namespace?

All I could find was that ROUTINES are either FUNCTIONS, PROCEDURES or METHODS, and then samples of their usage which made clear that all three are different and their usage is disjoint at the syntax level.

As for DROP ROUTINE, we could just raise an error and recommend using the more specific DROP FUNCTION or DROP PROCEDURE if there is ambiguity.

--
Best Regards
Hannu
Re: pgbench - adding pl/pgsql versions of tests
My justification for adding pl/pgsql tests as part of the immediately available tests is that pl/pgsql itself is always enabled, so having a no-effort way to test its performance would be really helpful.

We also should have "tpcb-like as an SQL function" to round out the "test what's available in the server" set.

---
Hannu

On Fri, Feb 2, 2024 at 9:44 PM Robert Haas wrote:
>
> On Tue, Aug 15, 2023 at 11:41 AM Nathan Bossart wrote:
> > Why's that? I'm not aware of any project policy that prohibits such
> > enhancements to pgbench. It might take some effort to gather consensus on
> > a proposal like this, but IMHO that doesn't mean we shouldn't try. If the
> > prevailing wisdom is that we shouldn't add more built-in scripts because
> > there is an existing way to provide custom ones, then it's not clear that
> > we should proceed with $SUBJECT, anyway.
>
> I don't think there's a policy against adding more built-in scripts to
> pgbench, but I'm skeptical of such efforts because I don't see how to
> decide which ones are worthy of inclusion and which are not. Adding
> everyone's favorite thing will be too cluttered, and adding nothing
> forecloses nothing because people can always provide their own. If we
> could establish that certain custom scripts are widely used across
> many people, then those might be worth adding.
>
> I have a vague recollection of someone proposing something similar to
> this in the past, possibly Jeff Davis. If there is in fact a paper
> trail showing that the same thing has been proposed more than once by
> unrelated people, that would be a point in favor of adding that
> particular thing.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
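For illustration, the tpcb-like transaction wrapped into a pl/pgsql function might look roughly like this (a sketch only, not the attached patch; pgbench's standard table names are assumed, and the function name is made up):

```sql
-- tpcb-like transaction body as a single pl/pgsql function
CREATE FUNCTION pgbench_tpcb_like(p_aid int, p_bid int, p_tid int, p_delta int)
RETURNS int LANGUAGE plpgsql AS $$
DECLARE
    v_abalance int;
BEGIN
    UPDATE pgbench_accounts SET abalance = abalance + p_delta WHERE aid = p_aid;
    SELECT abalance INTO v_abalance FROM pgbench_accounts WHERE aid = p_aid;
    UPDATE pgbench_tellers  SET tbalance = tbalance + p_delta WHERE tid = p_tid;
    UPDATE pgbench_branches SET bbalance = bbalance + p_delta WHERE bid = p_bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
    VALUES (p_tid, p_bid, p_aid, p_delta, now());
    RETURN v_abalance;
END;
$$;

-- a custom pgbench script would then shrink to something like:
--   \set aid random(1, 100000 * :scale)
--   \set bid random(1, 1 * :scale)
--   \set tid random(1, 10 * :scale)
--   \set delta random(-5000, 5000)
--   SELECT pgbench_tpcb_like(:aid, :bid, :tid, :delta);
```

The point of shipping such a variant built in would be measuring pl/pgsql call and execution overhead against the plain-SQL tpcb-like script, with zero setup effort.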
Re: pgbench - adding pl/pgsql versions of tests
Thanks for the update. I will give it another go over the weekend Cheers, Hannu On Thu, Feb 1, 2024 at 7:33 PM vignesh C wrote: > On Fri, 18 Aug 2023 at 23:04, Hannu Krosing wrote: > > > > I will address the comments here over this coming weekend. > > The patch which you submitted has been awaiting your attention for > quite some time now. As such, we have moved it to "Returned with > Feedback" and removed it from the reviewing queue. Depending on > timing, this may be reversible. Kindly address the feedback you have > received, and resubmit the patch to the next CommitFest. > > Regards, > Vignesh >
Re: Emitting JSON to file using COPY TO
> On Sat, Dec 2, 2023 at 4:11 PM Tom Lane wrote:
> > Joe Conway writes:
> >> I noticed that, with the PoC patch, "json" is the only format that must be
> >> quoted. Without quotes, I see a syntax error.

In the longer term we should move any format-specific COPY flag names and values out of the grammar, and their checking into the parts that actually implement whatever the flag is influencing.

Similar to what we do with OPTIONS at all levels of FDW definitions (the WRAPPER itself, SERVER, USER MAPPING, FOREIGN TABLE) [*].

[*] https://www.postgresql.org/docs/current/sql-createforeigndatawrapper.html

> >> I'm assuming there's a
> >> conflict with another json-related rule somewhere in gram.y, but I haven't
> >> tracked down exactly which one is causing it.
>
> While I've not looked too closely, I suspect this might be due to the
> FORMAT_LA hack in base_yylex:
>
>     /* Replace FORMAT by FORMAT_LA if it's followed by JSON */
>     switch (next_token)
>     {
>         case JSON:
>             cur_token = FORMAT_LA;
>             break;
>     }

My hope is that turning the WITH clause into a fully independent part with no grammar-defined keys or values would also solve the issue of quoting "json".

For backwards compatibility we may even go the route of keeping WITH as is, but adding OPTIONS, which can take any values at the grammar level.

I shared my "Pluggable Copy" talk slides from Berlin '22 in another thread.

-- Hannu
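For reference, the FDW-style OPTIONS clause being pointed to looks like this (an example in the spirit of the linked docs; the server name and option values are made up):

```sql
-- at the grammar level, OPTIONS is just free-form key/value pairs;
-- the FDW's validator function checks them, not gram.y
CREATE SERVER film_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'example.com', dbname 'films', port '5432');
```

Under such a scheme a new COPY format name like json would be an ordinary option value, with no need to quote it or teach the grammar about it.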
Why are wal_keep_size, max_slot_wal_keep_size requiring server restart?
Hello fellow Hackers,

Does anyone know why we have decided that the wal_keep_size and max_slot_wal_keep_size GUCs "can only be set in the postgresql.conf file or on the server command line" [1]?

It does not seem fundamentally needed, as they are "kind of guidance", especially the second one.

The first one -- wal_keep_size -- could be something that is directly relied on in some code paths, so setting it in a live database could involve a two-step process, where you first set the value and then wait for all current transactions to finish before doing any operations that depend on the new value, like removing the WAL files no longer kept because of the earlier, larger value. Moving it up should need no extra action. Moving it up and then down immediately after could cause some interesting race conditions when you move it lower than it was in the beginning, so "wait for all transactions to finish" should apply in all cases.

For the second one -- max_slot_wal_keep_size -- I can not immediately come up with a scenario where just setting it could cause any unexpected consequences. If you set it to a value below a current slot's value, you *do* expect the slot to be invalidated. If you set it to a larger-than-current value, then infringing slots get more time to correct themselves.

Both behaviours would be much more useful if you did not have to restart the whole server to make adjustments.

---
[1] https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-WAL-KEEP-SIZE

Best Regards
Hannu
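As an aside, the change-level that actually applies to a GUC can be checked from pg_settings: context 'postmaster' means a full restart is required, while 'sighup' only needs a configuration reload (pg_reload_conf() or SIGHUP):

```sql
-- inspect how these two GUCs may be changed
SELECT name, context, setting, unit
  FROM pg_settings
 WHERE name IN ('wal_keep_size', 'max_slot_wal_keep_size');
```

So whatever the answer is, it should show up directly in the context column.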
Re: Allowing TRUNCATE of FK target when session_replication_role=replica
Thanks for the pointers.

One thing though, re:

> The former is true but the latter is not. Logical replication requires
> wal_level = logical. That's also true for skipping FSM.

wal_level=logical is only needed *on the provider* side, at least when running pglogical. Also, even for native logical replication it is possible to disconnect the initial copy from CDC streaming, in which case again you can set wal_level=minimal on the target side.

Will check [1] and [2] and come back with a more detailed proposal.

---
Best regards,
Hannu

On Tue, Oct 31, 2023 at 5:56 PM Euler Taveira wrote:
>
> On Tue, Oct 31, 2023, at 5:09 AM, Hannu Krosing wrote:
>
> Currently we do not allow TRUNCATE of a table when any Foreign Keys
> point to that table.
>
> It is allowed iff you *also* truncate all tables referencing it.
>
> At the same time we do allow one to delete all rows when
> session_replication_role=replica
>
> That's true.
>
> This causes all kinds of pain when trying to copy in large amounts of
> data, especially at the start of logical replication set-up, as many
> optimisations to COPY require the table to be TRUNCATEd.
>
> The main two are the ability to FREEZE while copying and the skipping of
> WAL generation in case of wal_level=minimal, both of which can achieve
> significant benefits when data amounts are large.
>
> Is there any reason to not allow TRUNCATE when
> session_replication_role=replica ?
>
> That's basically the same proposal as [1]. That patch was rejected because it
> was implemented in a different way that doesn't require
> session_replication_role = replica to bypass the FK checks.
> > There are at least 3 cases that can benefit from this feature: > > 1) if your scenario includes an additional table only in the subscriber > side that contains a foreign key to a replicated table then you will break > your > replication like > > ERROR: cannot truncate a table referenced in a foreign key constraint > DETAIL: Table "foo" references "bar". > HINT: Truncate table "foo" at the same time, or use TRUNCATE ... CASCADE. > CONTEXT: processing remote data for replication origin "pg_16406" during > message type "TRUNCATE" in transaction 12880, finished at 0/297FE08 > > and you have to manually fix your replication. If we allow > session_replication_role = replica to bypass FK check for TRUNCATE commands, > we > wouldn't have an error. I'm not saying that it is a safe operation for logical > replication scenarios. Maybe it is not because table foo will contain invalid > references to table bar and someone should fix it in the subscriber side. > However, the current implementation already allows such orphan rows due to > session_replication_role behavior. > > 2) truncate table at subscriber side during the initial copy. As you > mentioned, > this feature should take advantage of the FREEZE and FSM optimizations. There > was a proposal a few years ago [2]. > > 3) resynchronize a table. Same advantages as item 2. > > Unless there are any serious objections, I will send a patch to also > allow TRUNCATE in this case. > > > You should start checking the previous proposal [1]. > > > [1] > https://www.postgresql.org/message-id/ff835f71-3c6c-335e-4c7b-b9e1646cf3d7%402ndquadrant.it > [2] > https://www.postgresql.org/message-id/CF3B6672-2A43-4204-A60A-68F359218A9B%40endpoint.com > > > -- > Euler Taveira > EDB https://www.enterprisedb.com/ >
Allowing TRUNCATE of FK target when session_replication_role=replica
Hi,

Currently we do not allow TRUNCATE of a table when any Foreign Keys point to that table. At the same time we do allow one to delete all rows when session_replication_role=replica.

This causes all kinds of pain when trying to copy in large amounts of data, especially at the start of logical replication set-up, as many optimisations to COPY require the table to be TRUNCATEd. The main two are the ability to FREEZE while copying and the skipping of WAL generation in the case of wal_level=minimal, both of which can achieve significant benefits when data amounts are large.

Is there any reason to not allow TRUNCATE when session_replication_role=replica?

Unless there are any serious objections, I will send a patch to also allow TRUNCATE in this case.

Best Regards
Hannu
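The asymmetry being described can be reproduced with a pair of throwaway tables (names are illustrative):

```sql
CREATE TABLE bar (id int PRIMARY KEY);
CREATE TABLE foo (bar_id int REFERENCES bar (id));

SET session_replication_role = replica;

DELETE FROM bar;   -- allowed: replica mode skips the FK enforcement triggers
TRUNCATE bar;      -- still rejected:
-- ERROR:  cannot truncate a table referenced in a foreign key constraint
```

Both statements leave bar empty (and foo potentially orphaned) in exactly the same way, which is what makes the TRUNCATE-only restriction feel arbitrary in replica mode.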
Re: Initdb-time block size specification
Sure, I was just hoping that somebody already knew without needing to specifically check :) And as I see in David's response, the tests are actually broken for other sizes. I'll see if I can (convince somebody to) set this up . Cheers Hannu On Tue, Sep 5, 2023 at 10:23 PM Andres Freund wrote: > > Hi, > > On 2023-09-05 21:52:18 +0200, Hannu Krosing wrote: > > Something I also asked at this years Unconference - Do we currently > > have Build Farm animals testing with different page sizes ? > > You can check that yourself as easily as anybody else. > > Greetings, > > Andres Freund
Re: Initdb-time block size specification
Something I also asked at this year's Unconference - Do we currently have Build Farm animals testing with different page sizes ? I'd say that testing all sizes from 4KB up (so 4, 8, 16, 32) should be done at least before each release if not continuously. -- Cheers Hannu On Tue, Sep 5, 2023 at 4:31 PM David Christensen wrote: > > On Tue, Sep 5, 2023 at 9:04 AM Robert Haas wrote: > > > > On Sat, Sep 2, 2023 at 3:09 PM Tomas Vondra > > wrote: > > > Except that no one is forcing you to actually go 130mph or 32mph, right? > > > You make it seem like this patch forces people to use some other page > > > size, but that's clearly not what it's doing - it gives you the option > > > to use smaller or larger block, if you chose to. Just like increasing > > > the speed limit to 130mph doesn't mean you can't keep going 65mph. > > > > > > The thing is - we *already* allow using different block size, except > > > that you have to do custom build. This just makes it easier. > > > > Right. Which is worth doing if it doesn't hurt performance and is > > likely to be useful to a lot of people, and is not worth doing if it > > will hurt performance and be useful to relatively few people. My bet > > is on the latter. > > Agreed that this doesn't make sense if there are major performance > regressions, however the goal here is patch evaluation to measure this > against other workloads and see if this is the case; from my localized > testing things were within acceptable noise levels with the latest > version. > > I agree with Tomas' earlier thoughts: we already allow different block > sizes, and if there are baked-in algorithmic assumptions about block > size (which there probably are), then identifying those or places in > the code where we need additional work or test coverage will only > improve things overall for those non-standard block sizes. > > Best, > > David > >
Re: pgbench - adding pl/pgsql versions of tests
I will address the comments here over this coming weekend. I think that in addition to the current "tpc-b like" test we could also have more modern "tpc-c like" and "tpc-h like" tests. And why not any other "*-like" from the rest of TPC-*, YCSB, sysbench, ... :) though maybe not as part of pgbench but as extensions ? --- Hannu On Wed, Aug 16, 2023 at 10:06 AM Fabien COELHO wrote: > > > Hello Nathan, > > >> I'm unclear about what variety of scripts that could be provided given the > >> tables made available with pgbench. ISTM that other scenari would involve > >> both an initialization and associated scripts, and any proposal would be > >> bared because it would open the door to anything. > > > > Why's that? > > Just a wild guess based on 19 years of occasional contributions to pg and > pgbench in particular:-) > > > I'm not aware of any project policy that prohibits such enhancements to > > pgbench. > > Attempts in extending pgbench often fall under "you can do it outside (eg > with a custom script) so there is no need to put that in pgbench as it > would add to the maintenance burden with a weak benefit proven by the fact > that it is not there already". > > > It might take some effort to gather consensus on a proposal like this, > > but IMHO that doesn't mean we shouldn't try. > > Done it in the past. Probably will do it again in the future:-) > > > If the prevailing wisdom is that we shouldn't add more built-in scripts > > because there is an existing way to provide custom ones, then it's not > > clear that we should proceed with $SUBJECT, anyway. > > I'm afraid there is that argument. I do not think that this policy is good > wrt $SUBJECT, ISTM that having an easy way to test something with a > PL/pgSQL function would help promote the language by advertising/showing > the potential performance benefit (or not, depending). Just one function > would be enough for that. > > -- > Fabien.
Re: How to build a new grammer for pg?
I would look at how Babelfish DB did it when adding SQL Server compatibility https://babelfishpg.org/ and https://github.com/babelfish-for-postgresql/ another source to inspect could be https://github.com/IvorySQL/IvorySQL for "oracle compatible PostgreSQL" On Tue, Aug 1, 2023 at 10:07 PM Jonah H. Harris wrote: > > On Tue, Aug 1, 2023 at 3:45 PM Andrew Dunstan wrote: >> >> Or to enable some language other than SQL (QUEL anyone?) > > > A few years ago, I got a minimal POSTQUEL working again to release as a patch > for April Fools' Day, which I never did. I should dig that up somewhere :) > > Anyway, as far as OP's original question regarding replacing the grammar, > there are a couple of such implementations floating around that have done > that. But, I actually think the pluggable parser patches were good examples > of how to integrate a replacement parser that generates the expected parse > tree nodes for anyone who wants to do their own custom parser. See Julien > Rouhaud's SQLOL in the "Hook for extensible parsing" thread and Jim > Mlodgenski's "Parser Hook" thread. > > -- > Jonah H. Harris >
Re: incremental-checkopints
On Wed, Jul 26, 2023 at 9:54 PM Matthias van de Meent wrote: > > Then you ignore the max_wal_size GUC as PostgreSQL so often already > does. At least, it doesn't do what I expect it to do at face value - > limit the size of the WAL directory to the given size. That would require stopping any new writes once WAL size == max_wal_size until the checkpoint is completed. I don't think anybody would want that. > But more reasonably, you'd keep track of the count of modified pages > that are yet to be fully WAL-logged, and keep that into account as a > debt that you have to the current WAL insert pointer when considering > checkpoint distances and max_wal_size. I think Peter Geoghegan has worked on a somewhat similar approach to account for "accumulated work needed until some desired outcome", though I think it was on the VACUUM side of things.
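Matthias's "debt" accounting idea can be sketched as a toy model. Everything below is illustrative only; the function name, the 0.9 trigger threshold, and the numbers are invented for the example and are not PostgreSQL internals:

```python
# Toy model: when deciding whether to trigger a checkpoint, count not just
# the WAL already written but also the full-page images we would still owe
# if every dirty-but-not-yet-FPI'd page had to be logged at write-out time.
BLCKSZ = 8192  # PostgreSQL's default block size

def wal_budget_used(wal_written, dirty_pages_without_fpi, max_wal_size):
    """Fraction of max_wal_size consumed, counting pending FPIs as debt."""
    fpi_debt = dirty_pages_without_fpi * BLCKSZ
    return (wal_written + fpi_debt) / max_wal_size

# 512 MB of WAL written, 100,000 dirty pages still owing an FPI,
# max_wal_size = 1 GB: the debt pushes us well past the naive 50% mark.
used = wal_budget_used(512 * 1024**2, 100_000, 1024**3)
trigger_checkpoint = used >= 0.9
```

With the debt included, the checkpointer would start early enough that the deferred FPIs cannot blow past max_wal_size later.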
Re: incremental-checkopints
Starting from incremental checkpoints is approaching the problem from the wrong end. What you actually want is atomic disk writes, which would allow turning off full_page_writes. Without this you really cannot do incremental checkpoints efficiently, as checkpoints are currently what is used to determine when "the first write to a page after a checkpoint" happens and thereby when a full page write is needed. On Wed, Jul 26, 2023 at 8:58 PM Tomas Vondra wrote: > > > > On 7/26/23 15:16, Matthias van de Meent wrote: > > On Wed, 26 Jul 2023 at 14:41, Alvaro Herrera > > wrote: > >> > >> Hello > >> > >> On 2023-Jul-26, Thomas wen wrote: > >> > >>> Hi Hackes: I found this page : > >>> https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints,PostgreSQL > >>> no incremental checkpoints have been implemented so far. When a > >>> checkpoint is triggered, the performance jitter of PostgreSQL is very > >>> noticeable. I think incremental checkpoints should be implemented as > >>> soon as possible > >> > >> I think my first question is why do you think that is necessary; there > >> are probably other tools to achieve better performance. For example, > >> you may want to try making checkpoint_completion_target closer to 1, and > >> the checkpoint interval longer (both checkpoint_timeout and > >> max_wal_size). Also, changing shared_buffers may improve things. You > >> can try adding more RAM to the machine. > > > > Even with all those tuning options, a significant portion of a > > checkpoint's IO (up to 50%) originates from FPIs in the WAL, which (in > > general) will most often appear at the start of each checkpoint due to > > each first update to a page after a checkpoint needing an FPI. > > Yeah, FPIs are certainly expensive and can represent huge part of the > WAL produced. But how would incremental checkpoints make that step > unnecessary?
> > > If instead we WAL-logged only the pages we are about to write to disk > > (like MySQL's double-write buffer, but in WAL instead of a separate > > cyclical buffer file), then a checkpoint_completion_target close to 1 > > would probably solve the issue, but with "WAL-logged torn page > > protection at first update after checkpoint" we'll probably always > > have higher-than-average FPI load just after a new checkpoint. > > > > So essentially instead of WAL-logging the FPI on the first change, we'd > only do that later when actually writing-out the page (either during a > checkpoint or because of memory pressure)? How would you make sure > there's enough WAL space until the next checkpoint? I mean, FPIs are a > huge write amplification source ... > > Imagine the system has max_wal_size set to 1GB, and does 1M updates > before writing 512MB of WAL and thus triggering a checkpoint. Now it > needs to write FPIs for 1M updates - easily 8GB of WAL, maybe more with > indexes. What then? > > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > >
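Tomas's back-of-the-envelope number is easy to verify. A quick sketch of the arithmetic (heap pages only, ignoring index pages and WAL record overhead, which only make it worse):

```python
# 1M updates touching 1M distinct heap pages, each page owing one
# full-page image of BLCKSZ bytes at the default 8 KB block size.
BLCKSZ = 8192
updates = 1_000_000

heap_fpi_bytes = updates * BLCKSZ
heap_fpi_gib = heap_fpi_bytes / 1024**3   # ~7.6 GiB of FPIs, heap alone
```

So with max_wal_size = 1 GB, deferring the FPIs to write-out time does not shrink them; roughly 8 GB of torn-page protection still has to land somewhere before the checkpoint can complete.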
Re: Example Table AM implementation
Thanks a lot Mark, I will take a look at this and get back to you if I find anything unclear --- Hannu On Tue, Jul 4, 2023 at 10:14 PM Mark Dilger wrote: > > Hackers, > > Over in [1], Hannu Krosing asked me to create and post several Table Access > Methods for testing/example purposes. I am fairly happy to do so, but each > one is large, and should be considered separately for inclusion/rejection in > contrib/, or in src/test/modules as Michael Paquier suggests. As such, I am > starting this new email thread for the first such TAM. I've named it "pile", > which is an English synonym of "heap", and which is also four characters in > length, making for easier side-by-side diffs with the heap code. The pile > code is a deep copy of the heap code, meaning that pile functions do not call > heap code, nor run the in-core regression tests, but rather pile's own > modified copy of the heap code, the regression tests, and even the test data. > Rather than creating a bare-bones skeleton which needs to be populated with > an implementation and regression tests, this patch instead offers a fully > fleshed out TAM which can be pruned down to something reasonably compact once > the user changes it into whatever they want it to be. To reiterate, the > patch is highly duplicative of in-core files. > > Hannu, I'm happy to post something like this three times again, for the named > TAMs you request, but could you first review this patch and maybe try turning > it into something else, such as the in memory temp tables, overlay tables, or > python based tables that you mentioned in [1]? Anything that needs to be > changed to make similar TAMs suitable for the community should be discussed > prior to spamming -hackers with more TAMs. Thanks. > > > [1] > https://www.postgresql.org/message-id/CAMT0RQQXtq8tgVPdFb0mk4v%2BcVuGvPWk1Oz9LDr0EgBfrV6e6w%40mail.gmail.com > > > > — > Mark Dilger > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > >
Including a sample Table Access Method with core code
At PgCon 2023 in Ottawa we had an Unconference session on Table Access Methods [1] One thing that was briefly mentioned (but is missing from the notes) is the need to have a sample API client in contrib/, both to have a 2nd user of the API (making it more likely that non-heap AMs are doable) and to serve as an easy starting point for someone interested in developing a new AM. There are a few candidates which could be lightweight enough for this * in-memory temp tables, especially if you specify the max table size at creation and/or limit the data types which can be used. * "overlay tables" - tables which "overlay" another - possibly read-only - table and store only changed rows and tombstones for deletions. (this would likely make more sense as an FDW, as the Table AM currently knows nothing about Primary Keys and these are likely needed for overlays) * Table AM as a (pl/)Python Class - this is inspired by the amazing Multicorn [2] FDW-in-Python tool which made it ridiculously easy to expose anything (mailbox, twitter feed, git commit history, you-name-it) as a Foreign Table Creating any of these seems to be a project of a size suitable for a student course project or maybe Google Summer of Code [3]. I have included Mark Dilger directly on this mail as he mentioned he has a Perl script that makes a functional copy of the heap AM that can be compiled and installed as a custom AM. @mark - maybe you can create 3 boilerplate Table AMs for the above, named `mem_am`, `overlay_am` and `py3_am`, and we could put them somewhere for interested parties to play with ? [1] https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference#Table_AMs [2] https://multicorn.org/ - unfortunately unmaintained since 2016, but there are some forks supporting later PostgreSQL versions [3] https://wiki.postgresql.org/wiki/GSoC - Google Summer of Code --- Best Regards Hannu
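None of these sample AMs exist yet, so as a thought experiment only, here is a deliberately naive Python sketch of the surface an in-memory/overlay AM would need to cover (insert, delete via tombstones, scan, truncate, and a size cap fixed at creation). The real Table AM interface is the C TableAmRoutine callback struct, not a class like this:

```python
class ToyMemTable:
    """Toy stand-in for an in-memory table AM: rows live only in RAM,
    with tombstones recording deletions (as an overlay AM would need)."""

    def __init__(self, max_rows=None):
        self.rows = {}            # tid -> row tuple
        self.tombstones = set()   # tids deleted since creation
        self.next_tid = 0
        self.max_rows = max_rows  # size fixed at creation, per the proposal

    def insert(self, row):
        if self.max_rows is not None and len(self.rows) >= self.max_rows:
            raise MemoryError("table full")
        tid = self.next_tid
        self.next_tid += 1
        self.rows[tid] = row
        return tid

    def delete(self, tid):
        self.rows.pop(tid, None)
        self.tombstones.add(tid)

    def scan(self):
        return [row for tid, row in sorted(self.rows.items())
                if tid not in self.tombstones]

    def truncate(self):
        self.rows.clear()
        self.tombstones.clear()

t = ToyMemTable(max_rows=2)
t.insert(("a",))
tid_b = t.insert(("b",))
t.delete(tid_b)
```

Even this toy shows why an overlay AM is awkward without primary-key awareness: the tombstones here are keyed by tid, whereas overlaying another table would need logical row identity.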
Re: Let's make PostgreSQL multi-threaded
One more unexpected benefit of having shared caches would be easing access to other databases. If the system caches are there for all databases anyway, then it becomes much easier to make queries using objects from multiple databases. Note that this does not strictly need threads, just shared caches. On Thu, Jun 15, 2023 at 11:04 AM Hannu Krosing wrote: > > On Thu, Jun 15, 2023 at 10:41 AM James Addison wrote: > > > > This is making me wonder about other performance/scalability areas > > that might not have been considered due to focus on the details of the > > existing codebase, but I'll save that for another thread and will try > > to learn more first. > > A gradual move to more shared structures seems to be a way forward > > It should get us all the benefits of threading minus the need for TLB > reloading and (in some cases) reduction of per-process virtual memory > mapping tables. > > In any case we would need to implement all the locking and parallelism > management of these shared structures that are not there in the > current process architecture. > > So a fair bit of work but also a clearly defined benefits of > 1) reduced memory usage > 2) no need to rebuild caches for each new connection > 3) no need to track PREPARE statements inside connection poolers. > > There can be extra complexity when different connections use the same > prepared statement name (say "PREP001") for different queries. > For this wel likely will need a good cooperation with connection > pooler where it passes some kind of client connection id along at the > transaction start
Re: Let's make PostgreSQL multi-threaded
On Thu, Jun 15, 2023 at 10:41 AM James Addison wrote: > > This is making me wonder about other performance/scalability areas > that might not have been considered due to focus on the details of the > existing codebase, but I'll save that for another thread and will try > to learn more first. A gradual move to more shared structures seems to be a way forward. It should get us all the benefits of threading minus the need for TLB reloading and (in some cases) a reduction of per-process virtual memory mapping tables. In any case we would need to implement all the locking and parallelism management of these shared structures that is not there in the current process architecture. So a fair bit of work, but also clearly defined benefits: 1) reduced memory usage 2) no need to rebuild caches for each new connection 3) no need to track PREPARE statements inside connection poolers. There can be extra complexity when different connections use the same prepared statement name (say "PREP001") for different queries. For this we will likely need good cooperation with the connection pooler, where it passes some kind of client connection id along at transaction start
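The naming collision described above can be illustrated with a toy cache. The sketch below is hypothetical throughout: the key layout and the client_id supplied by the pooler are assumptions for illustration, not an existing PostgreSQL or pooler mechanism:

```python
class SharedPreparedCache:
    """Toy sketch of a server-side prepared-statement cache shared across
    pooled connections.  Keying only by (database, user, name) breaks when
    two clients both use "PREP001" for different queries; including a
    client-connection id handed over by the pooler disambiguates them."""

    def __init__(self):
        self.cache = {}

    def prepare(self, db, user, client_id, name, query):
        self.cache[(db, user, client_id, name)] = query

    def lookup(self, db, user, client_id, name):
        return self.cache.get((db, user, client_id, name))

c = SharedPreparedCache()
# Two pooled clients reuse the same statement name for different queries.
c.prepare("appdb", "bob", "client-1", "PREP001", "SELECT 1")
c.prepare("appdb", "bob", "client-2", "PREP001", "SELECT 2")
```

Without the client_id component both PREPAREs would land on the same key and the second would silently overwrite the first.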
Re: Let's make PostgreSQL multi-threaded
On Thu, Jun 15, 2023 at 9:12 AM Konstantin Knizhnik wrote: > There are three different but related directions of improving current > Postgres: > 1. Replacing processes with threads Here we could likely start with making parallel query multi-threaded. This would also remove the big blocker for parallelizing things like CREATE TABLE AS SELECT ... where we are currently held back by the restriction that only the leader process can write. > 2. Builtin connection pooler Would definitely be a nice thing to have. And we could even start by integrating a non-threaded pooler like pgbouncer to run as a postgresql worker process (or two). > 3. Lightweight backends (shared catalog/relation/prepared statements caches) Shared prepared statement caches (which of course have to be per-user and per-database) would give the additional benefit of lightweight connection poolers not needing to track these. Currently the missing support for named prepared statements is one of the main hindrances to using pgbouncer with JDBC in transaction pooling mode (you can use it, but have to turn off automatic statement preparing) > > The motivation for such changes are also similar: > 1. Increase Postgres scalability > 2. Reduce memory consumption > 3. Make Postgres better fit cloud and serverless requirements The memory consumption reduction would be a big and clear win for many workloads. Also, just moving more things into shared memory will prepare us for a move to a threaded server (if it eventually happens) > I am not sure now which one should be addressed first or them can be done > together. Shared caches seem like a guaranteed win at least on memory usage. There could be performance (and complexity) downsides for specific workloads, but they would be the same as for the threaded model, so would also be a good learning opportunity. > Replacing static variables with thread-local is the first and may be the > easiest step.
I think we got our first patch doing this (as part of patches for running PG threaded on Solaris) quite early in the OSS development, possibly even in the last century :) > It requires more or less mechanical changes. More challenging thing is > replacing private per-backend data structures > with shared ones (caches, file descriptors,...) Indeed, sharing caches would also be part of the work needed for the threaded model, so anyone feeling strongly about moving to threads could start with this :) --- Hannu
Re: Let's make PostgreSQL multi-threaded
On Tue, Jun 13, 2023 at 9:55 AM Kyotaro Horiguchi wrote: > > At Tue, 13 Jun 2023 09:55:36 +0300, Konstantin Knizhnik > wrote in > > Postgres backend is "thick" not because of large number of local > > variables. > > It is because of local caches: catalog cache, relation cache, prepared > > statements cache,... > > If they are not rewritten, then backend still may consume a lot of > > memory even if it will be thread rather then process. > > But threads simplify development of global caches, although it can be > > done with DSM. > > With the process model, that local stuff are flushed out upon > reconnection. If we switch to the thread model, we will need an > expiration mechanism for those stuff. The part that cannot be so easily solved is that "the local stuff" can include some leakage that is not directly controlled by us. I remember a few times when memory leaks in some PostGIS packages caused slow memory exhaustion, and the simple fix was limiting connection lifetime to something between 15 min and an hour. The main problem here is that PostGIS uses a few tens of other GPL GIS-related packages which are all changing independently, and thus it is quite hard to be sure that none of these has developed a leak. And you also likely cannot just stop upgrading these, as they also contain security fixes. I have no idea what the fix could be in the case of a threaded server.
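The lifetime-capping workaround mentioned above is essentially what pooler settings such as pgbouncer's server_lifetime do. A minimal sketch of the recycling logic (illustrative only, not pgbouncer's actual implementation):

```python
import time

class RecyclingPool:
    """Close server connections older than max_lifetime_s so that slow
    leaks in backend-loaded libraries cannot accumulate indefinitely."""

    def __init__(self, max_lifetime_s, clock=time.monotonic):
        self.max_lifetime_s = max_lifetime_s
        self.clock = clock          # injectable for testing
        self.conns = []             # list of (created_at, conn)

    def add(self, conn):
        self.conns.append((self.clock(), conn))

    def reap(self):
        now = self.clock()
        expired = [c for t, c in self.conns
                   if now - t > self.max_lifetime_s]
        self.conns = [(t, c) for t, c in self.conns
                      if now - t <= self.max_lifetime_s]
        return expired  # caller closes these, freeing any leaked memory

# Demo with a fake clock: a 30-minute lifetime cap.
fake_now = [0]
pool = RecyclingPool(1800, clock=lambda: fake_now[0])
pool.add("conn-1")
fake_now[0] = 1000
pool.add("conn-2")
fake_now[0] = 2000
expired = pool.reap()   # conn-1 is now 2000 s old and gets recycled
```

The process model gets this cleanup for free on backend exit; a threaded server would need exactly this kind of explicit policy for anything a leaky library allocates.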
Re: Let's make PostgreSQL multi-threaded
On Mon, Jun 5, 2023 at 4:52 PM Heikki Linnakangas wrote: > > If there are no major objections, I'm going to update the developer FAQ, > removing the excuses there for why we don't use threads [1]. I think it is not wise to start the wholesale removal of the objections there. But I think it is worthwhile to revisit the section about threads, maybe split out the historic part which is no longer true, and provide both pros and cons for the rest. I started with this short summary from the discussion in this thread, feel free to expand, argue, fix :) * marks a current excuse -- marks a counterargument or ack As an example, threads are not yet used instead of multiple processes for backends because: * Historically, threads were poorly supported and buggy. -- yes they were; not relevant now that threads are well-supported and non-buggy * An error in one backend can corrupt other backends if they're threads within a single process -- still valid for silent corruption -- for detected crash - yes, but we are restarting all backends in case of crash anyway. * Speed improvements using threads are small compared to the remaining backend startup time. -- we now have some measurements that show significant performance improvements not related to startup time * The backend code would be more complex.
-- this is still the case -- even more worrisome is that all extensions also need to be rewritten -- and many incompatibilities will be silent and take potentially years to find * Terminating backend processes allows the OS to cleanly and quickly free all resources, protecting against memory and file descriptor leaks and making backend shutdown cheaper and faster -- still true * Debugging threaded programs is much harder than debugging worker processes, and core dumps are much less useful -- this was countered by claiming that -- by now we have reasonable debugger support for threads -- there is no direct debugger support for debugging the exact system set up like PostgreSQL processes + shared memory * Sharing of read-only executable mappings and the use of shared_buffers means processes, like threads, are very memory efficient -- this seems to say that the current process model is as good as threads ? -- there were a few counterarguments -- per-backend virtual memory mapping can add up to significant amount of extra RAM usage -- the discussion did not yet touch various per-backend caches (pg_catalog cache, statement cache) which are arguably easier to implement in threaded model -- TLB reload at each process switch is expensive and would be mostly avoided in case of threads * Regular creation and destruction of processes helps protect against memory fragmentation, which can be hard to manage in long-running processes -- probably still true -
Re: Let's make PostgreSQL multi-threaded
I discovered this thread from a Twitter post "PostgreSQL will finally be rewritten in Rust" :) On Mon, Jun 5, 2023 at 5:18 PM Tom Lane wrote: > > Heikki Linnakangas writes: > > I spoke with some folks at PGCon about making PostgreSQL multi-threaded, > > so that the whole server runs in a single process, with multiple > > threads. It has been discussed many times in the past, last thread on > > pgsql-hackers was back in 2017 when Konstantin made some experiments [0]. > > > I feel that there is now pretty strong consensus that it would be a good > > thing, more so than before. Lots of work to get there, and lots of > > details to be hashed out, but no objections to the idea at a high level. > > > The purpose of this email is to make that silent consensus explicit. If > > you have objections to switching from the current multi-process > > architecture to a single-process, multi-threaded architecture, please > > speak up. > > For the record, I think this will be a disaster. There is far too much > code that will get broken, largely silently, and much of it is not > under our control. > > regards, tom lane > >
Re: Let's make PostgreSQL multi-threaded
On Thu, Jun 8, 2023 at 4:56 PM Robert Haas wrote: > > On Thu, Jun 8, 2023 at 8:44 AM Hannu Krosing wrote: > > > That sounds like a bad idea, dynamic shared memory is more expensive > > > to maintain than our static shared memory systems, not in the least > > > because DSM is not guaranteed to share the same addresses in each > > > process' address space. > > > > Then this too needs to be fixed > > Honestly, I'm struggling to respond to this non-sarcastically. I mean, > I was the one who implemented DSM. Do you think it works the way that > it works because I considered doing something smart and decided to do > something dumb instead? No, I meant that this needs to be fixed at the OS level, by being able to use the same mapping. We should not shy away from asking the OS people to add the useful features still missing. It was mentioned in the Unconference Kernel Hacker AMA talk, and the said kernel hacker works for Oracle, and they also seemed to need this :)
Re: Let's make PostgreSQL multi-threaded
On Thu, Jun 8, 2023 at 2:15 PM Matthias van de Meent wrote: > > On Thu, 8 Jun 2023 at 11:54, Hannu Krosing wrote: > > > > On Wed, Jun 7, 2023 at 11:37 PM Andres Freund wrote: > > > > > > Hi, > > > > > > On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote: > > > > 2. While I wouldn't want to necessarily discourage a moonshot effort, I > > > > would ask if developer time could be better spent on tackling some of > > > > the > > > > other problems around vertical scalability? Per some PGCon discussions, > > > > there's still room for improvement in how PostgreSQL can best utilize > > > > resources available very large "commodity" machines (a 448-core / 24TB > > > > RAM > > > > instance comes to mind). > > > > > > I think we're starting to hit quite a few limits related to the process > > > model, > > > particularly on bigger machines. The overhead of cross-process context > > > switches is inherently higher than switching between threads in the same > > > process - and my suspicion is that that overhead will continue to > > > increase. Once you have a significant number of connections we end up > > > spending > > > a *lot* of time in TLB misses, and that's inherent to the process model, > > > because you can't share the TLB across processes. > > > > > > This part was touched in the "AMA with a Linux Kernale Hacker" > > Unconference session where he mentioned that the had proposed a > > 'mshare' syscall for this. > > > > So maybe a more fruitful way to fixing the perceived issues with > > process model is to push for small changes in Linux to overcome these > > avoiding a wholesale rewrite ? > > We support not just Linux, but also Windows and several (?) BSDs. I'm > not against pushing Linux to make things easier for us, but Linux is > an open source project, too, where someone need to put in time to get > the shiny things that you want. And I'd rather see our time spent in > PostgreSQL, as Linux is only used by a part of our user base. 
Do we have any statistics for the distribution of our user base ? My gut feeling says that for performance-critical use the non-Linux share is in the low single digits at best. My fascination for OpenSource started with the realisation that instead of workarounds you can actually fix the problem at the source. So if the specific problem is that the TLB is not shared, then the proper fix is making it shared instead of rewriting everything else to get around it. None of us is limited to writing code in PostgreSQL only. If the easiest and most generic fix can be done in Linux then so be it. It is also possible that Windows and *BSD already have a similar feature. > > > > The amount of duplicated code we have to deal with due to to the process > > > model > > > is quite substantial. We have local memory, statically allocated shared > > > memory > > > and dynamically allocated shared memory variants for some things. And > > > that's > > > just going to continue. > > > > Maybe we can already remove the distinction between static and dynamic > > shared memory ? > > That sounds like a bad idea, dynamic shared memory is more expensive > to maintain than our static shared memory systems, not in the least > because DSM is not guaranteed to share the same addresses in each > process' address space. Then this too needs to be fixed > > > Though I already heard some complaints at the conference discussions > > that having the dynamic version available has made some developers > > sloppy in using it resulting in wastefulness. > > Do you know any examples of this wastefulness? No. Just somebody mentioned it in a hallway conversation and the rest of the developers present mumbled approvingly :) > > > > I'm purposely giving a nonanswer on whether it's a worthwhile goal, but > > > > rather I'd be curious where it could stack up against some other > > > > efforts to > > > > continue to help PostgreSQL improve performance and handle very large > > > > workloads.
> > > > > > There's plenty of things we can do before, but in the end I think > > > tackling the > > > issues you mention and moving to threads are quite tightly linked. > > > > Still we should be focusing our attention at solving the issues and > > not at "moving to threads" and hoping this will fix the issues by > > itself. > > I suspect that it is much easier to solve some of the issues when > working in a shared address space. Probably. But it woul
Re: Use COPY for populating all pgbench tables
I guess that COPY will still be slower than generating the data server-side ( --init-steps=...G... ) ? What I'd really like to see is providing all the pgbench functions also on the server. Specifically the various random(...) functions - random_exponential(...), random_gaussian(...), random_zipfian(...) - so that custom data generation could also be done easily server-side with matching distributions. On Thu, Jun 8, 2023 at 7:34 AM David Rowley wrote: > > On Thu, 8 Jun 2023 at 07:16, Tristan Partin wrote: > > > > master: > > > > 5000 of 5000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 > > s)) > > vacuuming... > > creating primary keys... > > done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side > > generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s). > > > > patchset: > > > > 5000 of 5000 tuples (100%) of pgbench_accounts done (elapsed 243.82 > > s, remaining 0.00 s)) > > vacuuming... > > creating primary keys... > > done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side > > generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s). > > I've also previously found pgbench -i to be slow. It was a while ago, > and IIRC, it was due to the printfPQExpBuffer() being a bottleneck > inside pgbench. > > On seeing your email, it makes me wonder if PG16's hex integer > literals might help here. These should be much faster to generate in > pgbench and also parse on the postgres side. > > I wrote a quick and dirty patch to try that and I'm not really getting > the same performance increases as I'd have expected. I also tested > with your patch too and it does not look that impressive either when > running pgbench on the same machine as postgres. > > pgbench copy speedup > > ** master > drowley@amd3990x:~$ pgbench -i -s 1000 postgres > 1 of 1 tuples (100%) done (elapsed 74.15 s, remaining 0.00 s) > vacuuming... > creating primary keys...
> done in 95.71 s (drop tables 0.00 s, create tables 0.01 s, client-side > generate 74.45 s, vacuum 0.12 s, primary keys 21.13 s). > > ** David's Patched > drowley@amd3990x:~$ pgbench -i -s 1000 postgres > 1 of 1 tuples (100%) done (elapsed 69.64 s, remaining 0.00 s) > vacuuming... > creating primary keys... > done in 90.22 s (drop tables 0.00 s, create tables 0.01 s, client-side > generate 69.91 s, vacuum 0.12 s, primary keys 20.18 s). > > ** Tristan's patch > drowley@amd3990x:~$ pgbench -i -s 1000 postgres > 1 of 1 tuples (100%) of pgbench_accounts done (elapsed > 77.44 s, remaining 0.00 s) > vacuuming... > creating primary keys... > done in 98.64 s (drop tables 0.00 s, create tables 0.01 s, client-side > generate 77.47 s, vacuum 0.12 s, primary keys 21.04 s). > > I'm interested to see what numbers you get. You'd need to test on > PG16 however. I left the old code in place to generate the decimal > numbers for versions < 16. > > David
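For the server-side random_zipfian() case mentioned above, a bounded Zipf sampler is straightforward to write (e.g. as the body of a PL/Python function). The sketch below uses plain inverse-CDF sampling over the finite range; note this is not pgbench's actual algorithm, just a stand-in producing a 1/k^s distribution:

```python
import random

def random_zipfian(lb, ub, s, rng=random.random):
    """Draw an integer in [lb, ub] with P(k) proportional to 1/k**s,
    via inverse-CDF over the finite support (fine for modest ranges)."""
    n = ub - lb + 1
    weights = [1.0 / (k ** s) for k in range(1, n + 1)]
    total = sum(weights)
    u = rng() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if u <= acc:
            return lb + i
    return ub  # float-rounding safety net

random.seed(42)
samples = [random_zipfian(1, 100, 1.5) for _ in range(1000)]
```

With s = 1.5 over [1, 100], roughly 40% of the draws land on 1, matching the heavy skew pgbench users expect from this family of functions.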
Re: Let's make PostgreSQL multi-threaded
On Thu, Jun 8, 2023 at 11:54 AM Hannu Krosing wrote:
>
> On Wed, Jun 7, 2023 at 11:37 PM Andres Freund wrote:
> >
> > Hi,
> >
> > On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote:
> > > 2. While I wouldn't want to necessarily discourage a moonshot effort, I
> > > would ask if developer time could be better spent on tackling some of the
> > > other problems around vertical scalability? Per some PGCon discussions,
> > > there's still room for improvement in how PostgreSQL can best utilize
> > > resources available very large "commodity" machines (a 448-core / 24TB RAM
> > > instance comes to mind).
> >
> > I think we're starting to hit quite a few limits related to the process model,
> > particularly on bigger machines. The overhead of cross-process context
> > switches is inherently higher than switching between threads in the same
> > process - and my suspicion is that that overhead will continue to
> > increase. Once you have a significant number of connections we end up spending
> > a *lot* of time in TLB misses, and that's inherent to the process model,
> > because you can't share the TLB across processes.
>
> This part was touched in the "AMA with a Linux Kernel Hacker"
> Unconference session where he mentioned that he had proposed a
> 'mshare' syscall for this.

Also, the *static* huge pages already let you solve this problem now by sharing the page tables

Cheers
Hannu
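For reference, the static huge pages mentioned above are enabled through existing configuration; a minimal sketch (the page count below is illustrative - size it to shared_buffers):

```
# /etc/sysctl.conf - reserve static huge pages in the kernel first
# (illustrative value: 4096 x 2MB pages = 8GB)
vm.nr_hugepages = 4096

# postgresql.conf - with 'on', the server refuses to start if the
# reserved huge pages are unavailable ('try' is the default)
huge_pages = on
```

Besides cutting TLB pressure, backends mapping shared buffers from huge pages share much of the page-table structure, which is the effect being discussed here.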
Re: Let's make PostgreSQL multi-threaded
On Thu, Jun 8, 2023 at 12:09 AM Andres Freund wrote: ... > We could e.g. eventually decide that we > don't support parallel query without threading support - which would allow us > to get rid of a very significant amount of code and runtime overhead. Here I was hoping to go in the opposite direction and support parallel query across replicas. This looks much more doable based on the process model than the single process / multiple threads model. --- Cheers Hannu
Re: Let's make PostgreSQL multi-threaded
I think I remember that in the early days of development somebody did send a patch-set for making PostgreSQL threaded on Solaris. I don't remember why this did not catch on. On Wed, Jun 7, 2023 at 11:40 PM Thomas Kellerer wrote: > > Tomas Vondra schrieb am 07.06.2023 um 21:20: > > Also, which other projects did this transition? Is there something we > > could learn from them? Were they restricted to much smaller list of > > platforms? > > Firebird did this a while ago if I'm not mistaken. > > Not open source, but Oracle was historically multi-threaded on Windows and > multi-process on all other platforms. > I _think_ starting with 19c you can optionally run it multi-threaded on Linux > as well. > > But I doubt, they are willing to share any insights ;) > > >
Re: Let's make PostgreSQL multi-threaded
On Wed, Jun 7, 2023 at 11:37 PM Andres Freund wrote:
>
> Hi,
>
> On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote:
> > 2. While I wouldn't want to necessarily discourage a moonshot effort, I
> > would ask if developer time could be better spent on tackling some of the
> > other problems around vertical scalability? Per some PGCon discussions,
> > there's still room for improvement in how PostgreSQL can best utilize
> > resources available very large "commodity" machines (a 448-core / 24TB RAM
> > instance comes to mind).
>
> I think we're starting to hit quite a few limits related to the process model,
> particularly on bigger machines. The overhead of cross-process context
> switches is inherently higher than switching between threads in the same
> process - and my suspicion is that that overhead will continue to
> increase. Once you have a significant number of connections we end up spending
> a *lot* of time in TLB misses, and that's inherent to the process model,
> because you can't share the TLB across processes.

This part was touched in the "AMA with a Linux Kernel Hacker" Unconference session where he mentioned that he had proposed a 'mshare' syscall for this.

So maybe a more fruitful way of fixing the perceived issues with the process model is to push for small changes in Linux to overcome these, avoiding a wholesale rewrite ?

> The amount of duplicated code we have to deal with due to the process model
> is quite substantial. We have local memory, statically allocated shared memory
> and dynamically allocated shared memory variants for some things. And that's
> just going to continue.

Maybe we can already remove the distinction between static and dynamic shared memory ?

Though I have already heard some complaints at the conference discussions that having the dynamic version available has made some developers sloppy in using it, resulting in wastefulness.
> >
> > I'm purposely giving a nonanswer on whether it's a worthwhile goal, but
> > rather I'd be curious where it could stack up against some other efforts to
> > continue to help PostgreSQL improve performance and handle very large
> > workloads.
>
> There's plenty of things we can do before, but in the end I think tackling the
> issues you mention and moving to threads are quite tightly linked.

Still, we should be focusing our attention on solving the issues and not on "moving to threads" and hoping this will fix the issues by itself.

Cheers
Hannu
Re: Time to move pg_test_timing to measure in nanoseconds
Sure, will do. On Sun, Mar 26, 2023 at 11:40 PM Andres Freund wrote: > > Hi, > > On 2023-03-26 16:43:21 +0200, Hannu Krosing wrote: > > Currently pg_test_timing utility measures its timing overhead in > > microseconds, giving results like this > > I have a patch that does that and a bit more that's included in a larger > patchset by David Geier: > https://postgr.es/m/198ef658-a5b7-9862-2017-faf85d59e3a8%40gmail.com > > Could you review that part of the patchset? > > Greetings, > > Andres Freund
Re: Disable vacuuming to provide data history
There is also another blocker - our timestamp resolution is 1 microsecond, and we are dangerously close to speeds where one could update a row twice in the same microsecond.

I have been thinking about this, and what is needed is

1. a nanosecond-resolution "abstime" type - not absolutely necessary, but would help with corner cases.

2. VACUUM should be able to "freeze" by replacing xmin/xmax values with commit timestamps, or adding tmin/tmax where necessary.

3. Optionally VACUUM could move historic rows to archive tables with explicit tmin/tmax columns (this also solves the pg_dump problem)

Most of the above design - apart from the timestamp resolution and vacuum being the one doing the stamping in of commit timestamps - is not really new. Up to version 6.2 PostgreSQL had tmin/tmax instead of xmin/xmax and you could specify the timestamp you want to query any table at.

And the original Postgres design was a Full History Database where you could say

  SELECT name, population FROM cities['epoch' .. 'now']

to get all historic population values. And historic data was meant to be moved to the WORM optical drives which had just arrived on the market

---
Hannu

On Sat, Feb 25, 2023 at 3:11 AM Vik Fearing wrote:
>
> On 2/24/23 22:06, Corey Huinker wrote:
> > On Thu, Feb 23, 2023 at 6:04 AM wrote:
> >
> > [1] some implementations don't use null, they use an end-timestamp set to
> > a date implausibly far in the future ( 3999-12-31 for example ),
>
> The specification is, "At any point in time, all rows that have their
> system-time period end column set to the highest value supported by the
> data type of that column are known as current system rows; all other
> rows are known as historical system rows."
>
> I would like to see us use 'infinity' for this.
>
> The main design blocker for me is how to handle dump/restore. The
> standard does not bother thinking about that.
> --
> Vik Fearing
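A minimal sketch of point 3 above - an archive table with explicit tmin/tmax columns and the corresponding "as of" query (table and column names are made up for illustration):

```sql
-- Hypothetical archive table holding frozen historic rows;
-- tmin/tmax carry commit timestamps instead of transaction ids
CREATE TABLE cities_archive (
    LIKE cities,
    tmin timestamptz NOT NULL,  -- commit time of the inserting transaction
    tmax timestamptz NOT NULL   -- commit time of the deleting/updating transaction
);

-- "What was the population as of 2020-01-01?"
SELECT name, population
  FROM cities_archive
 WHERE tmin <= '2020-01-01'
   AND tmax >  '2020-01-01';
```

Because the archive table is an ordinary table with ordinary columns, pg_dump handles it without needing to know anything about historic visibility - which is what makes this variant solve the dump/restore problem raised downthread.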
Time to move pg_test_timing to measure in nanoseconds
Currently the pg_test_timing utility measures its timing overhead in microseconds, giving results like this

~$ /usr/lib/postgresql/15/bin/pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 18.97 ns
Histogram of timing durations:
   < us   % of total      count
      1     98.11132   155154419
      2      1.88756     2985010
      4      0.00040         630
      8      0.00012         184
     16      0.00058         919
     32      0.00003          40
     64      0.00000           6

I got curious and wanted to see how the 98.1% timings are distributed (raw uncleaned patch attached)

And this is what I got when I increased the measuring resolution to nanoseconds

hannuk@hannuk1:~/work/postgres15_uni_dist_on/src/bin/pg_test_timing$ ./pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 17.34 ns, min: 15, same: 0
Histogram of timing durations:
   < ns   % of total      count
      1      0.00000           0
      2      0.00000           0
      4      0.00000           0
      8      0.00000           0
     16      1.14387     1979085
     32     98.47924   170385392
     64      0.21666      374859
    128      0.15654      270843
    256      0.00297        5139
    512      0.00016         272
   1024      0.00004          73
   2048      0.00018         306
   4096      0.00022         375
   8192      0.00006          99
  16384      0.00005          80
  32768      0.00001          20
  65536      0.00000           6
 131072      0.00000           2

As most of the samples seem to be in the ranges 8..15 and 16..32 nanoseconds, the current way of measuring at microsecond resolution is clearly inadequate.

The attached patch is not meant to be applied as-is but is rather there as a helper to easily verify the above numbers.

QUESTIONS

1. Do you think it is ok to just change pg_test_timing to return the result in nanoseconds, or should there be a flag that asks for nanosecond resolution ?

2.
Should the output be changed to give ranges instead of

*** pg_test_timing.c.orig	2023-01-16 01:09:51.839251695 +0100
--- pg_test_timing.c	2023-01-16 22:24:49.768037225 +0100
***************
*** 11,16 ****
--- 11,20 ----
  #include "getopt_long.h"
  #include "portability/instr_time.h"

+ #define INSTR_TIME_GET_NANOSEC(t) \
+ 	(((uint64) (t).tv_sec * (uint64) 1000000000) + (uint64) ((t).tv_nsec))
+
+
  static const char *progname;

  static unsigned int test_duration = 3;
***************
*** 19,26 ****
  static uint64 test_timing(unsigned int duration);
  static void output(uint64 loop_count);

! /* record duration in powers of 2 microseconds */
! long long int histogram[32];

  int
  main(int argc, char *argv[])
--- 23,30 ----
  static uint64 test_timing(unsigned int duration);
  static void output(uint64 loop_count);

! /* record duration in powers of 2 nanoseconds */
! long long int histogram[64];

  int
  main(int argc, char *argv[])
***************
*** 124,139 ****
  	uint64		total_time;
  	int64		time_elapsed = 0;
  	uint64		loop_count = 0;
  	uint64		prev,
  				cur;
  	instr_time	start_time,
  				end_time,
  				temp;

! 	total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;

  	INSTR_TIME_SET_CURRENT(start_time);
! 	cur = INSTR_TIME_GET_MICROSEC(start_time);

  	while (time_elapsed < total_time)
  	{
--- 128,145 ----
  	uint64		total_time;
  	int64		time_elapsed = 0;
  	uint64		loop_count = 0;
+ 	uint64		same_count = 0;
+ 	uint64		min_diff = UINT64_MAX;
  	uint64		prev,
  				cur;
  	instr_time	start_time,
  				end_time,
  				temp;

! 	total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;

  	INSTR_TIME_SET_CURRENT(start_time);
! 	cur = INSTR_TIME_GET_NANOSEC(start_time);

  	while (time_elapsed < total_time)
  	{
***************
*** 141,157 ****
  		bits = 0;

  		prev = cur;
  		INSTR_TIME_SET_CURRENT(temp);
! 		cur = INSTR_TIME_GET_MICROSEC(temp);
  		diff = cur - prev;

  		/* Did time go backwards? */
! 		if (diff < 0)
  		{
  			fprintf(stderr, _("Detected clock going backwards in time.\n"));
  			fprintf(stderr, _("Time warp: %d ms\n"), diff);
  			exit(1);
  		}

  		/* What is the highest bit in the time diff? */
  		while (diff)
--- 147,172 ----
  		bits = 0;

  		prev = cur;
+ 		// INSTR_TIME_SET_CURRENT(temp);
+ 		// prev = INSTR_TIME_GET_NANOSEC(temp);
  		INSTR_TIME_SET_CURRENT(temp);
! 		cur = INSTR_TIME_GET_NANOSEC(temp);
  		diff = cur - prev;

  		/* Did time go backwards? */
! 		if (unlikely(diff <= 0))
  		{
+ 			if (diff == 0)
+ 				same_count++;
+ 			else
+ 			{
  			fprintf(stderr, _("Detected clock going backwards in time.\n"));
  			fprintf(stderr, _("Time warp: %d ms\n"), diff);
  			exit(1);
+ 			}
  		}
+ 		if (min_diff > diff)
+ 			min_diff = diff;

  		/* What is the highest bit in the time diff? */
  		while (diff)
***************
*** 165,179 ****
  		loop_count++;
  		INSTR_TIME_SUBTRACT(temp, start_time);
! 		time_elapsed = INSTR_TIME_GET_MICROSEC(temp);
  	}
pgbench - adding pl/pgsql versions of tests
Hello Hackers,

The attached patch adds pl/pgsql versions of the "tpcb-like" and "simple-update" internal test scripts.

The tests perform functionally exactly the same, but are generally faster as they avoid most client-server latency.

The reasons I'd like to have them as part of pgbench are two:

1. so I don't have to create the script and function manually each time I want to test mainly the database (instead of the client-database system)

2. so that new users of PostgreSQL can easily see how much better OLTP workloads perform when packaged up as a server-side function

The new user-visible functionality is two new built-in scripts, shown by -b list:

$ pgbench -b list
Available builtin scripts:
  tpcb-like:
  plpgsql-tpcb-like:
  simple-update:
  plpgsql-simple-update:
  select-only:

which one can run using the -b / --builtin= option

pgbench -b plpgsql-tpcb-like ...
or
pgbench -b plpgsql-simple-update ...

And a flag --no-functions which lets you skip creating the functions at init time.

There are also character flags for -I / --init: `Y` to drop the functions and `y` to create them. Creating is the default behaviour, but can be disabled via the long flag --no-functions.

I selected Y/y as they were unused and can be thought of as an "inverted lambda symbol" :)

If there are no strong objections, I'll add it to the commitfest as well

- Hannu Krosing

Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index c4a44debeb..1edcec2f5c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -172,8 +172,8 @@ typedef struct socket_set
 /*
  * some configurable parameters
  */
-#define DEFAULT_INIT_STEPS "dtgvp"	/* default -I setting */
-#define ALL_INIT_STEPS "dtgGvpf"	/* all possible steps */
+#define DEFAULT_INIT_STEPS "dYtgvpy"	/* default -I setting */
+#define ALL_INIT_STEPS "dYtgGvpfy"	/* all possible steps */
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
@@ -800,6 +800,15 @@ static const BuiltinScript builtin_script[] =
 		"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
 		"END;\n"
 	},
+	{
+		"plpgsql-tpcb-like",
+		"",
+		"\\set aid random(1, " CppAsString2(naccounts) " * :scale)\n"
+		"\\set bid random(1, " CppAsString2(nbranches) " * :scale)\n"
+		"\\set tid random(1, " CppAsString2(ntellers) " * :scale)\n"
+		"\\set delta random(-5000, 5000)\n"
+		"SELECT 1 FROM pgbench_tpcb_like(:aid, :bid, :tid, :delta);\n"
+	},
 	{
 		"simple-update",
 		"",
@@ -813,6 +822,15 @@ static const BuiltinScript builtin_script[] =
 		"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
 		"END;\n"
 	},
+	{
+		"plpgsql-simple-update",
+		"",
+		"\\set aid random(1, " CppAsString2(naccounts) " * :scale)\n"
+		"\\set bid random(1, " CppAsString2(nbranches) " * :scale)\n"
+		"\\set tid random(1, " CppAsString2(ntellers) " * :scale)\n"
+		"\\set delta random(-5000, 5000)\n"
+		"SELECT 1 FROM pgbench_simple_update(:aid, :bid, :tid, :delta);\n"
+	},
 	{
 		"select-only",
 		"",
@@ -885,6 +903,7 @@ usage(void)
 		   "  -q, --quiet              quiet logging (one message each 5 seconds)\n"
 		   "  -s, --scale=NUM          scaling factor\n"
 		   "  --foreign-keys           create foreign key constraints between tables\n"
+		   "  --no-functions           do not create pl/pgsql functions for internal scripts\n"
 		   "  --index-tablespace=TABLESPACE\n"
 		   "                           create indexes in the specified tablespace\n"
 		   "  --partition-method=(range|hash)\n"
@@ -4836,6 +4855,54 @@ initTruncateTables(PGconn *con)
 					 "pgbench_tellers");
 }

+/*
+ * Create the functions needed for plpgsql-* builtin scripts
+ */
+static void
+initCreateFuntions(PGconn *con)
+{
+	fprintf(stderr, "creating functions...\n");
+
+	executeStatement(con,
+					 "CREATE FUNCTION pgbench_tpcb_like(_aid int, _bid int, _tid int, _delta int)\n"
+					 "RETURNS void\n"
+					 "LANGUAGE plpgsql\n"
+					 "AS $plpgsql$\n"
+					 "BEGIN\n"
+					 "UPDATE pgbench_accounts SET abalance = abalance + _delta WHERE aid = _aid;\n"
+					 "PERFORM abalance FROM pgbench_accounts WHERE aid = _aid;\n"
+					 "UPDATE pgbench_tellers SET
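The hunk above is cut off mid-statement. Based on the builtin tpcb-like script shown earlier in the diff, the complete function being created presumably looks something like this (a reconstruction for readability, not the literal patch text):

```sql
-- Sketch of the server-side tpcb-like transaction the patch creates;
-- the body mirrors the builtin "tpcb-like" script one statement per line
CREATE FUNCTION pgbench_tpcb_like(_aid int, _bid int, _tid int, _delta int)
RETURNS void
LANGUAGE plpgsql
AS $plpgsql$
BEGIN
    UPDATE pgbench_accounts SET abalance = abalance + _delta WHERE aid = _aid;
    PERFORM abalance FROM pgbench_accounts WHERE aid = _aid;
    UPDATE pgbench_tellers SET tbalance = tbalance + _delta WHERE tid = _tid;
    UPDATE pgbench_branches SET bbalance = bbalance + _delta WHERE bid = _bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
    VALUES (_tid, _bid, _aid, _delta, CURRENT_TIMESTAMP);
END;
$plpgsql$;
```

The client then issues a single `SELECT 1 FROM pgbench_tpcb_like(:aid, :bid, :tid, :delta);` per transaction, which is why the round-trip latency largely disappears from the measurement.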
Is there a way to use exported snapshots in autocommit mode ?
Hello hackers, Is there a way to use exported snapshots in autocommit mode ? Either something similar to defaults in default_transaction_deferrable, default_transaction_isolation, default_transaction_read_only Or something that could be set in the function's SET part. Context: I am working on speeding up large table copy via parallelism using pl/proxy SPLIT functionality and would like to have all the parallel sub-copies run with the same snapshot. Best Regards Hannu
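For context, SET TRANSACTION SNAPSHOT must currently be run inside an explicit transaction block, before any query, which is exactly what rules out autocommit mode:

```sql
-- session 1: export a snapshot (stays valid while this transaction is open)
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- returns an id such as '00000003-0000001B-1'

-- session 2: import it; this must be the first statement after an
-- explicit BEGIN at REPEATABLE READ or SERIALIZABLE isolation
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
-- ... queries here see exactly the same snapshot as session 1 ...
COMMIT;
```

So the question amounts to whether something like a default/session-level snapshot setting could be provided so that individually autocommitted statements (such as the parallel sub-copies) could all share one exported snapshot.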
Is anybody planning to release pglogical for v15 ?
So, is anybody planning to release pglogical for v15 ?

There are still a few things that one can do in pglogical but not in native / built-in replication ...

Best Regards
Hannu
Re: making relfilenodes 56 bits
Re: staticAssertStmt(MAX_FORKNUM <= INT8_MAX);

Have you really thought through making the ForkNum 8-bit ?

For example, this would limit a columnar storage with each column stored in its own fork (which I'd say is not entirely unreasonable) to having just about ~125 columns, since INT8_MAX is 127.

And there can easily be other use cases where we do not want to limit the number of forks so much

Cheers
Hannu

On Tue, Jul 12, 2022 at 10:36 PM Robert Haas wrote:
>
> On Tue, Jul 12, 2022 at 1:09 PM Andres Freund wrote:
> > What does currently happen if we exceed that?
>
> elog
>
> > > diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
> > > index b578e2ec75..5d3775ccde 100644
> > > --- a/src/include/utils/wait_event.h
> > > +++ b/src/include/utils/wait_event.h
> > > @@ -193,7 +193,7 @@ typedef enum
> > >          WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
> > >          WAIT_EVENT_LOGICAL_REWRITE_WRITE,
> > >          WAIT_EVENT_RELATION_MAP_READ,
> > > -        WAIT_EVENT_RELATION_MAP_SYNC,
> > > +        WAIT_EVENT_RELATION_MAP_RENAME,
> >
> > Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> > since it includes fsync etc?
>
> Sure, I had it that way for a while and changed it at the last minute.
> I can change it back.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
Re: Hardening PostgreSQL via (optional) ban on local file system access
And thanks to Robert and Bruce for bringing up good points about potential pitfalls! I think we do have a good discussion going on here :) --- Hannu On Fri, Jul 1, 2022 at 11:14 AM Hannu Krosing wrote: > > On Thu, Jun 30, 2022 at 7:25 PM Bruce Momjian wrote: > > > > On Thu, Jun 30, 2022 at 11:52:20AM -0400, Robert Haas wrote: > > > I don't think this would be very convenient in most scenarios, > > This is the eternal problem with security - more security always > includes more inconvenience. > > Unlocking your door when coming home is more inconvenient than not > unlocking it, and the least inconvenient thing would be not having a > door at all. > Imagine coming to your door with a heavy shopping bag in each hand - > at this moment almost anyone would prefer the door not being there :) > > This one would be for cases where you want best multi-layer > protections also against unknown threats and are ready to trade some > convenience for security. Also it would not be that bad once you use > automated deployment pipelines which just need an extra line to unlock > superuser for deployment. > > > > and I > > > think it would also be difficult to implement correctly. I don't think > > > you can get by with just having superuser() return false sometimes > > > despite pg_authid.rolsuper being true. There's a lot of subtle > > > assumptions in the code to the effect that the properties of a session > > > are basically stable unless some SQL is executed which changes things. > > A good barrier SQL for this could be SET ROLE=... . > Maybe just have a mode where a superuser can not log in _or_ SET ROLE > unless this is explicitly allowed in pg_superuser.conf > > > > I think if we start injecting hacks like this it may seem to work in > > > light testing but we'll never get to the end of the bug reports. > > In this case it looks like each of these bug reports would mean an > avoided security breach which for me looks preferable. 
> > Again, this would be all optional, opt-in, DBA-needs-to-set-it-up > feature for the professionally paranoid and not something we suddenly > force on people who would like to run all their databases using > user=postgres database=postgres with trust specified in the > pg_hba.conf "because the firewall takes care of security" :) > > > Yeah, seems it would have to be specified per-session, but how would you > > specify a specific session before the session starts? > > One often recommended way to do superuser'y things in a secure > production database is to have a non-privileged NOINHERIT user for > logging in and then do > SET ROLE=; > when needed, similar to using su/sudo in shell. This practice both > reduces the attack surface and also provides auditability by knowing > who logged in for superuser work. > > In this case one could easily get the pid and do the needed extra > setup before escalating privileges to superuser. > > --- > Hannu
Re: Hardening PostgreSQL via (optional) ban on local file system access
On Thu, Jun 30, 2022 at 7:25 PM Bruce Momjian wrote: > > On Thu, Jun 30, 2022 at 11:52:20AM -0400, Robert Haas wrote: > > I don't think this would be very convenient in most scenarios, This is the eternal problem with security - more security always includes more inconvenience. Unlocking your door when coming home is more inconvenient than not unlocking it, and the least inconvenient thing would be not having a door at all. Imagine coming to your door with a heavy shopping bag in each hand - at this moment almost anyone would prefer the door not being there :) This one would be for cases where you want best multi-layer protections also against unknown threats and are ready to trade some convenience for security. Also it would not be that bad once you use automated deployment pipelines which just need an extra line to unlock superuser for deployment. > > and I > > think it would also be difficult to implement correctly. I don't think > > you can get by with just having superuser() return false sometimes > > despite pg_authid.rolsuper being true. There's a lot of subtle > > assumptions in the code to the effect that the properties of a session > > are basically stable unless some SQL is executed which changes things. A good barrier SQL for this could be SET ROLE=... . Maybe just have a mode where a superuser can not log in _or_ SET ROLE unless this is explicitly allowed in pg_superuser.conf > > I think if we start injecting hacks like this it may seem to work in > > light testing but we'll never get to the end of the bug reports. In this case it looks like each of these bug reports would mean an avoided security breach which for me looks preferable. 
Again, this would be all optional, opt-in, DBA-needs-to-set-it-up feature for the professionally paranoid and not something we suddenly force on people who would like to run all their databases using user=postgres database=postgres with trust specified in the pg_hba.conf "because the firewall takes care of security" :) > Yeah, seems it would have to be specified per-session, but how would you > specify a specific session before the session starts? One often recommended way to do superuser'y things in a secure production database is to have a non-privileged NOINHERIT user for logging in and then do SET ROLE=; when needed, similar to using su/sudo in shell. This practice both reduces the attack surface and also provides auditability by knowing who logged in for superuser work. In this case one could easily get the pid and do the needed extra setup before escalating privileges to superuser. --- Hannu
Re: Hardening PostgreSQL via (optional) ban on local file system access
Ah, I see. My counterclaim was that there are lots of use cases where one would want to be extra sure that *only* they, as the owner of the database, can access the database as a superuser and not someone else. Even if there is some obscure way for that "someone else" to either connect as a superuser or to escalate privileges to superuser. And what I propose would be a means to achieve that at the expense of extra steps when starting to act as a superuser. In a nutshell this would be equivalent for two factor authentication for acting as a superuser - 1) you must be able to log in as a user with superuser attribute 2) you must present proof that you can access the underlying file system Cheers, Hannu Krosing On Wed, Jun 29, 2022 at 12:48 PM Laurenz Albe wrote: > > On Wed, 2022-06-29 at 00:05 -0700, Andres Freund wrote: > > On 2022-06-29 08:51:10 +0200, Laurenz Albe wrote: > > > On Tue, 2022-06-28 at 16:27 -0700, Andres Freund wrote: > > > > > Experience shows that 99% of the time one can run PostgreSQL just fine > > > > > without a superuser > > > > > > > > IME that's not at all true. It might not be needed interactively, but > > > > that's > > > > not all the same as not being needed at all. > > > > > > I also disagree with that. Not having a superuser is one of the pain > > > points with using a hosted database: no untrusted procedural languages, > > > no untrusted extensions (unless someone hacked up PostgreSQL or provided > > > a workaround akin to a SECURITY DEFINER function), etc. > > > > I'm not sure what exactly you're disagreeing with? I'm not saying that > > superuser isn't needed interactively in general, just that there are > > reasonably common scenarios in which that's the case. > > I was unclear, sorry. I agreed with you that you can't do without superuser > and disagreed with the claim that 99% of the time nobody needs superuser > access. > > Yours, > Laurenz Albe
Re: Hardening PostgreSQL via (optional) ban on local file system access
The idea is to allow superuser, but only in case you *already* have access to the file system. You could think of it as two-factor authentication for using superuser.

So in the simplest implementation it would be

touch $PGDATA/allow_superuser
psql
hannuk=# CREATE EXTENSION ...
rm $PGDATA/allow_superuser

and in a more sophisticated implementation it could be

terminal 1:
psql
hannuk=# select pg_backend_pid();
 pg_backend_pid
        1749025
(1 row)

terminal 2:
echo 1749025 > $PGDATA/allow_superuser

back to terminal 1, still connected to the backend with pid 1749025:
hannuk=# CREATE EXTENSION ...

.. and then clean up the sentinel file after, or just make it valid for N minutes from creation

Cheers,
Hannu Krosing

On Wed, Jun 29, 2022 at 8:51 AM Laurenz Albe wrote:
>
> On Tue, 2022-06-28 at 16:27 -0700, Andres Freund wrote:
> > > Experience shows that 99% of the time one can run PostgreSQL just fine
> > > without a superuser
> >
> > IME that's not at all true. It might not be needed interactively, but that's
> > not all the same as not being needed at all.
>
> I also disagree with that. Not having a superuser is one of the pain
> points with using a hosted database: no untrusted procedural languages,
> no untrusted extensions (unless someone hacked up PostgreSQL or provided
> a workaround akin to a SECURITY DEFINER function), etc.
>
> Yours,
> Laurenz Albe
Re: Hardening PostgreSQL via (optional) ban on local file system access
I was not after *completely* removing it, but just having an option which makes the superuser() function always return false. For known cases of needing a superuser there would be a way to enable it, perhaps via a sentinel file or a pg_hba-like configuration file.

And as a first cut I would advocate disabling SECURITY DEFINER functions, for simplicity and robustness of defense. A short term solution for these would be to re-write them in C (or Go or Rust or any other compiled language), as C functions have full access anyway. But C functions fall in the same category as the other defenses discussed at the start of this thread - namely, to use them you already need access to the file system.

Running production databases without superuser available is not as impossible as you may think - the Cloud SQL version of PostgreSQL has been in use with great success for years without exposing a real superuser to end users (there are some places where `cloudsqlsuperuser` gives you partial superuser'y abilities).

Letting the user turn off superuser access when no known need for it exists (which is the case 99.9% of the time in most use cases) would improve secondary defenses noticeably. It would also be a good start to figuring out the set of roles into which one can decompose superuser access in the longer run

-- Hannu

On Tue, Jun 28, 2022 at 8:30 PM Robert Haas wrote:
>
> On Mon, Jun 27, 2022 at 5:37 PM Hannu Krosing wrote:
> > My current thinking is (based on more insights from Andres) that we
> > should also have a startup flag to disable superuser altogether to
> > avoid bypasses via direct manipulation of pg_proc.
> >
> > Experience shows that 99% of the time one can run PostgreSQL just fine
> > without a superuser, so having a superuser available all the time is
> > kind of like leaving a loaded gun on the kitchen table because you
> > sometimes need to go hunting.
> >
> > I am especially waiting for Andres' feedback on the viability of this approach.
> > Well, I'm not Andres but I don't think not having a superuser at all > is in any way a viable approach. It's necessary to be able to > administer the database system, and the bootstrap superuser can't be > removed outright in any case because it owns a ton of objects. > > There are basically two ways of trying to solve this problem. On the > one hand we could try to create a mode in which the privileges of the > superuser are restricted enough that the superuser can't break out to > the operating system. The list of things that would need to be blocked > is, I think, more extensive than any list you've give so far. The > other is to stick with the idea of an unrestricted superuser but come > up with ways of giving a controlled subset of the superuser's > privileges to a non-superuser. I believe this is the more promising > approach, and there have been multiple discussion threads about it in > the last six months. > > -- > Robert Haas > EDB: http://www.enterprisedb.com
Re: Hardening PostgreSQL via (optional) ban on local file system access
My current thinking is (based on more insights from Andres) that we should also have a startup flag to disable superuser altogether to avoid bypasses via direct manipulation of pg_proc.

Experience shows that 99% of the time one can run PostgreSQL just fine without a superuser, so having a superuser available all the time is kind of like leaving a loaded gun on the kitchen table because you sometimes need to go hunting.

I am especially waiting for Andres' feedback on the viability of this approach.

Cheers
Hannu

On Mon, Jun 27, 2022 at 10:37 PM Jeff Davis wrote:
>
> On Sat, 2022-06-25 at 00:08 +0200, Hannu Krosing wrote:
> > Hi Pgsql-Hackers
> >
> > As part of ongoing work on PostgreSQL security hardening we have
> > added a capability to disable all file system access (COPY TO/FROM
> > [PROGRAM], pg_*file*() functions, lo_*() functions
> > accessing files, etc) in a way that can not be re-enabled without
> > already having access to the file system. That is via a flag which can
> > be set only in postgresql.conf or on the command line.
>
> How much of this can be done as a special extension already?
>
> For instance, a ProcessUtility_hook can prevent superuser from
> executing COPY TO/FROM PROGRAM.
>
> As others point out, that would still leave a lot of surface area for
> attacks, e.g. by manipulating the catalog. But it could be a starting
> place to make attacks "harder", without core postgres needing to make
> security promises that will be hard to keep.
>
> Regards,
> Jeff Davis
Re: [PoC] Improve dead tuple storage for lazy vacuum
> Another thought: for non-x86 platforms, the SIMD nodes degenerate to
> "simple loop", and looping over up to 32 elements is not great
> (although possibly okay). We could do binary search, but that has bad
> branch prediction.

I am not sure that SIMD / vector instructions would go unused on the relevant non-x86 platforms (though it would be a good idea to verify). Do you know of any modern platforms that do not have SIMD ?

I would definitely test before assuming binary search is better. Often other approaches, like a counting search over such small vectors, are much better when the vector fits in cache (or even a cache line) and you always visit all items, as this completely avoids branch mispredictions and allows the compiler to vectorize and/or unroll the loop as needed.

Cheers
Hannu
Re: Hardening PostgreSQL via (optional) ban on local file system access
Having a superuser.conf file could also be used to solve another issue - currently, if you remove the SUPERUSER attribute from all users your only option to get it back would be to run in single-user mode. With a conf file one could add an "always" option there which makes any matching user a superuser, to fix whatever needs fixing as superuser. Cheers Hannu On Sat, Jun 25, 2022 at 10:54 PM Hannu Krosing wrote: > > What are your ideas of applying a change similar to above to actually > being a superuser ? > > That is adding a check for "superuser being currently available" to > function superuser() in > ./src/backend/utils/misc/superuser.c ? > > It could be as simple as a flag that can be set only at startup for > maximum speed - though the places needing superuser check are never in > speed-critical path as far as I have seen. > > Or it could be more complex and dynamic, like having a file similar to > pg_hba.conf that defines the ability to run code as superuser based on > a set of attributes each of which could be * meaning "any" > > These could be > * database (to limit superuser to only certain databases) > * session user (to allow only some users to become superuser, > including via SECURITY DEFINER functions) > * backend pid (this would allow kind of 2-factor authentication - > connect, then use that session's pid to add a row to > pg_ok_to_be_sup.conf file, then continue with your superuser-y stuff) > * valid-until - this lets one allow superuser for a limited period > without fear of forgetting to disable it > > This approach would have the benefit of being very low-code while > delivering the extra protection of needing pre-existing access to > filesystem to enable/disable . > > For easiest usability the pg_ok_to_be_sup.conf file should be outside > pg_reload_conf() - either just read each time the superuser() check is > run, or watched via inotify and reloaded each time it changes.
> > Cheers, > Hannu > > P.S: - thanks Magnus for the "please don't top-post" notice - I also > need to remember to check if all the quoted mail history is left in > when I just write a reply without touching any of it. I hope it does > the right thing and leaves it out, but it just might under some > conditions bluntly append it anyway just in case :)
Re: Hardening PostgreSQL via (optional) ban on local file system access
What are your ideas of applying a change similar to above to actually being a superuser ?

That is, adding a check for "superuser being currently available" to the function superuser() in ./src/backend/utils/misc/superuser.c ?

It could be as simple as a flag that can be set only at startup for maximum speed - though the places needing a superuser check are never in a speed-critical path as far as I have seen.

Or it could be more complex and dynamic, like having a file similar to pg_hba.conf that defines the ability to run code as superuser based on a set of attributes, each of which could be * meaning "any"

These could be
* database (to limit superuser to only certain databases)
* session user (to allow only some users to become superuser, including via SECURITY DEFINER functions)
* backend pid (this would allow a kind of 2-factor authentication - connect, then use that session's pid to add a row to the pg_ok_to_be_sup.conf file, then continue with your superuser-y stuff)
* valid-until - this lets one allow superuser for a limited period without fear of forgetting to disable it

This approach would have the benefit of being very low-code while delivering the extra protection of needing pre-existing access to the filesystem to enable/disable.

For easiest usability the pg_ok_to_be_sup.conf file should be outside pg_reload_conf() - either just read each time the superuser() check is run, or watched via inotify and reloaded each time it changes.

Cheers,
Hannu

P.S. - thanks Magnus for the "please don't top-post" notice - I also need to remember to check if all the quoted mail history is left in when I just write a reply without touching any of it. I hope it does the right thing and leaves it out, but it just might under some conditions bluntly append it anyway just in case :)
Re: Hardening PostgreSQL via (optional) ban on local file system access
The old versions should definitely not have it turned on by default. I probably was not as clear as I thought in bringing out that point.. For upcoming next ones the distributors may want to turn it on for some more security-conscious ("enterprize") distributions. -- Hannu On Sat, Jun 25, 2022 at 2:08 AM David G. Johnston wrote: > > > > On Friday, June 24, 2022, Gurjeet Singh wrote: >> >> On Fri, Jun 24, 2022 at 4:13 PM Andres Freund wrote: >> > On 2022-06-25 00:08:13 +0200, Hannu Krosing wrote: >> >> > > 3) should this be back-patched (we can provide batches for all >> > > supported PgSQL versions) >> > >> > Err, what? >> >> Translation: Backpatching these changes to any stable versions will >> not be acceptable (per the project versioning policy [1]), since these >> changes would be considered new feature. These changes can break >> installations, if released in a minor version. >> > > No longer having the public schema in the search_path was a feature that got > back-patched, with known bad consequences, without any way for the DBA to > voice their opinion on the matter. This proposal seems similar enough to at > least ask the question, with full DBA control and no known bad consequences. > > David J. >
Re: Hardening PostgreSQL via (optional) ban on local file system access
My understanding was that unless activated by admin these changes would change nothing. And they would be (borderline :) ) security fixes. And the versioning policy link actually does not say anything about not adding features to older versions (I know this is the policy, just pointing out the info is not on that page). On Sat, Jun 25, 2022 at 1:46 AM Gurjeet Singh wrote: > > On Fri, Jun 24, 2022 at 4:13 PM Andres Freund wrote: > > On 2022-06-25 00:08:13 +0200, Hannu Krosing wrote: > > > > 3) should this be back-patched (we can provide batches for all > > > supported PgSQL versions) > > > > Err, what? > > Translation: Backpatching these changes to any stable versions will > not be acceptable (per the project versioning policy [1]), since these > changes would be considered new feature. These changes can break > installations, if released in a minor version. > > [1]: https://www.postgresql.org/support/versioning/ > > Best regards, > Gurjeet > http://Gurje.et
Re: Hardening PostgreSQL via (optional) ban on local file system access
On Sat, Jun 25, 2022 at 1:23 AM Hannu Krosing wrote: > My impression was that this was largely fixed via disabling the old > direct file calling convention, but then again I did not pay much > attention at that time :) I meant of course direct FUNCTION calling convention (Version 0 Calling Conventions) -- Hannu
Re: Hardening PostgreSQL via (optional) ban on local file system access
On Sat, Jun 25, 2022 at 1:13 AM Andres Freund wrote: > > Hi, > > On 2022-06-25 00:08:13 +0200, Hannu Krosing wrote: > > Currently the file system access is controlled via being a SUPREUSER > > or having the pg_read_server_files, pg_write_server_files and > > pg_execute_server_program roles. The problem with this approach is > > that it will not stop an attacker who has managed to become the > > PostgreSQL SUPERUSER from breaking out of the server to reading and > > writing files and running programs in the surrounding container, VM or > > OS. > > If a user has superuser rights, they automatically can execute arbitrary > code. It's that simple. Removing roles isn't going to change that. Our code > doesn't protect against C functions mismatching their SQL level > definitions. With that you can do a lot of things. Are you claiming that one can manipulate PostgreSQL to do any file writes directly by manipulating pg_proc to call the functions "in a wrong way" ? My impression was that this was largely fixed via disabling the old direct file calling convention, but then again I did not pay much attention at that time :) So your suggestion would be to also include disabling access to at least pg_proc for creating C and internal functions and possibly some other system tables to remove this threat ? Cheers Hannu
Hardening PostgreSQL via (optional) ban on local file system access
Hi Pgsql-Hackers

As part of ongoing work on PostgreSQL security hardening we have added a capability to disable all file system access (COPY TO/FROM [PROGRAM], pg_*file*() functions, lo_*() functions accessing files, etc) in a way that can not be re-enabled without already having access to the file system. That is via a flag which can be set only in postgresql.conf or on the command line.

Currently the file system access is controlled via being a SUPERUSER or having the pg_read_server_files, pg_write_server_files and pg_execute_server_program roles. The problem with this approach is that it will not stop an attacker who has managed to become the PostgreSQL SUPERUSER from breaking out of the server to reading and writing files and running programs in the surrounding container, VM or OS.

If we had had this then, for example, the infamous 2020 PgCrypto worm [1] would have been much less likely to succeed.

So here are a few questions to get discussion started.

1) would it be enough to just disable WRITING to the filesystem (COPY ... TO ..., COPY TO ... PROGRAM ...) or are some reading functions also potentially exploitable or at least making an attacker's life easier ?

2) should configuration be all-or-nothing or more fine-tunable (maybe a comma-separated list of allowed features) ?

3) should this be back-patched (we can provide patches for all supported PgSQL versions) ?

4) We should likely start with this flag off, but any paranoid (read - good and security conscious) DBA can turn it on.

5) Which file access functions should be in the unsafe list - pg_current_logfile is likely safe, as is pg_relation_filenode, but pg_ls_dir likely is not. Some variants might be ok again, like pg_ls_waldir ?

6) or should we control it via disabling the pg_*_server_* roles, for a different take on configuration from 5) ?

As I said, we are happy to provide patches (and docs, etc) for all the PostgreSQL versions we decide to support here.
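For the "flag which can be set only in postgresql.conf or on the command line" part, a GUC with PGC_POSTMASTER context gives exactly that behavior. A sketch of how such a flag could be declared from an extension's _PG_init() - the GUC name and variable here are illustrative, not from the actual patch:

```c
/* Illustrative sketch only: the GUC name and variable are hypothetical.
 * PGC_POSTMASTER means the value can only be set in postgresql.conf or
 * on the command line, never via SET or pg_reload_conf(). */
#include "postgres.h"
#include "utils/guc.h"

static bool allow_server_file_access = true;

void
_PG_init(void)
{
    DefineCustomBoolVariable("hardening.allow_server_file_access",
                             "Allows COPY TO/FROM files and programs and "
                             "server-side file access functions.",
                             NULL,
                             &allow_server_file_access,
                             true,
                             PGC_POSTMASTER,    /* startup-time only */
                             0,
                             NULL, NULL, NULL);
}
```

The affected code paths would then check the flag and error out regardless of role membership, which is what makes it impossible to re-enable from SQL.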
Best Regards
Hannu

[1] https://www.securityweek.com/pgminer-crypto-mining-botnet-abuses-postgresql-distribution
Which hook to use when overriding utility commands (COPY ...)
Hi Pgsql-Hackers

Which hook should I use when overriding the COPY command in an extension?

I am working on adding new functionality to COPY (compression, index management, various other transports in addition to stdin and file, other data formats, etc...) and while the aim is to contribute this to v15 I would also like to have much of it in earlier versions.

As the current policy is to back-port only bugfixes and not "features", the only way I can see to get it into earlier versions is to provide an extension which intercepts the COPY command and replaces it with my own implementation.

So my question is, which of the hooks would be easiest to use for this ?

At the syntax level it would still look the same

COPY ... FROM/TO ... WITH ( options )

and the extensibility will be in file names (using a URI scheme mapped to transports) and in the options part. So I hope to fully reuse the parsing part and get in before the existence checks.

Does anyone have experience with this and can point to samples?

Cheers
Hannu
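For what it is worth, ProcessUtility_hook is the usual way to intercept a utility command like COPY after parsing but before execution, which matches the requirement above. A rough sketch - my_DoCopy() is hypothetical, and the hook's exact signature differs between major versions (this is roughly the PG 13 form):

```c
/* Sketch of intercepting COPY via ProcessUtility_hook; the hypothetical
 * my_DoCopy() would inspect stmt->filename for a URI scheme and
 * stmt->options, returning true if it handled the command itself. */
#include "postgres.h"
#include "nodes/parsenodes.h"
#include "tcop/utility.h"

static ProcessUtility_hook_type prev_ProcessUtility = NULL;

static bool my_DoCopy(CopyStmt *stmt, const char *queryString,
                      QueryCompletion *qc);   /* hypothetical */

static void
my_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
                  ProcessUtilityContext context, ParamListInfo params,
                  QueryEnvironment *queryEnv, DestReceiver *dest,
                  QueryCompletion *qc)
{
    Node *parsetree = pstmt->utilityStmt;

    if (IsA(parsetree, CopyStmt) &&
        my_DoCopy((CopyStmt *) parsetree, queryString, qc))
        return;     /* handled by the extension */

    /* anything else falls through to the previous hook or the default */
    if (prev_ProcessUtility)
        prev_ProcessUtility(pstmt, queryString, context, params,
                            queryEnv, dest, qc);
    else
        standard_ProcessUtility(pstmt, queryString, context, params,
                                queryEnv, dest, qc);
}

void
_PG_init(void)
{
    prev_ProcessUtility = ProcessUtility_hook;
    ProcessUtility_hook = my_ProcessUtility;
}
```

Since the hook receives the already-parsed CopyStmt, the existing COPY grammar is fully reused and the extension gets control before any file-existence checks run.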
Re: logical decoding and replication of sequences
Just a note for some design decisions

> 1) By default, sequences are treated non-transactionally, i.e. sent to the > output plugin right away.

If our aim is just to make sure that all user-visible data in *transactional* tables is consistent with sequence state, then one very much simplified approach to this could be to track the results of nextval() calls in a transaction and at COMMIT put the latest sequence value in WAL (or just track the sequences affected and put the latest sequence state in WAL at commit, which needs an extra read of the sequence but protects against race conditions with parallel transactions which get rolled back later)

This avoids sending redundant changes for multiple nextval() calls (like loading a million-row table with a sequence-generated id column)

And one can argue that we can safely ignore anything in ROLLED BACK transactions. This is assuming that even if we did advance the sequence past the last value sent by the latest COMMITTED transaction it does not matter for database consistency. It can matter if customers just call nextval() in rolled-back transactions and somehow expect these values to be replicated, based on reasoning along "sequences are not transactional - so rollbacks should not matter" .

Or we may get away without most of the in-detail sequence tracking on the source if we just keep track of the xmin of the sequence and send the sequence info over at commit if it == current_transaction_id ?

----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.

On Fri, Sep 24, 2021 at 9:16 PM Tomas Vondra wrote: > > Hi, > > On 9/23/21 12:27 PM, Peter Eisentraut wrote: > > On 30.07.21 20:26, Tomas Vondra wrote: > >> Here's a an updated version of this patch - rebased to current master, > >> and fixing some of the issues raised in Peter's review. > > > > This patch needs an update, as various conflicts have arisen now.
> > > > As was discussed before, it might be better to present a separate patch > > for just the logical decoding part for now, since the replication and > > DDL stuff has the potential to conflict heavily with other patches being > > discussed right now. It looks like cutting this patch in two should be > > doable easily. > > > > Attached is the rebased patch, split into three parts: > > 1) basic decoding infrastructure (decoding, reorderbuffer etc.) > 2) support for sequences in test_decoding > 3) support for sequences in built-in replication (catalogs, syntax, ...) > > The last part is the largest one - I'm sure we'll have discussions about > the grammar, adding sequences automatically, etc. But as you said, let's > focus on the first part, which deals with the required decoding stuff. > > I've added a couple comments, explaining how we track sequences, why we > need the XID in nextval() etc. I've also added streaming support. > > > > I looked through the test cases in test_decoding again. It all looks > > pretty sensible. If anyone can think of any other tricky or dubious > > cases, we can add them there. It's easiest to discuss these things with > > concrete test cases rather than in theory. 
> > > > One slightly curious issue is that this can make sequence values go > > backwards, when seen by the logical decoding consumer, like in the test > > case: > > > > + BEGIN > > + sequence: public.test_sequence transactional: 1 created: 1 last_value: > > 1, log_cnt: 0 is_called: 0 > > + COMMIT > > + sequence: public.test_sequence transactional: 0 created: 0 last_value: > > 33, log_cnt: 0 is_called: 1 > > + BEGIN > > + sequence: public.test_sequence transactional: 1 created: 1 last_value: > > 4, log_cnt: 0 is_called: 1 > > + COMMIT > > + sequence: public.test_sequence transactional: 0 created: 0 last_value: > > 334, log_cnt: 0 is_called: 1 > > > > I suppose that's okay, since it's not really the intention that someone > > is concurrently consuming sequence values on the subscriber. Maybe > > something for the future. Fixing that would require changing the way > > transactional sequence DDL updates these values, so it's not directly > > the job of the decoding to address this. > > > > Yeah, that's due to how ALTER SEQUENCE does things, and I agree redoing > that seems well out of scope for this patch. What seems a bit suspicious > is that some of the ALTER SEQUENCE changes have "created: 1" - it's > probably correct, though, because those ALTER SEQUENCE statements can be > rolled-back, so we see it as if a new sequence is created. The flag name > might be a bit confusing, though. > > > A small thing I found: Maybe the text that test_decoding
Re: logical replication restrictions
On Wed, Sep 22, 2021 at 6:18 AM Amit Kapila wrote: > > On Tue, Sep 21, 2021 at 4:21 PM Marcos Pegoraro wrote: >> >> No, I´m talking about that configuration you can have on standby servers >> recovery_min_apply_delay = '8h' >> > > oh okay, I think this can be useful in some cases where we want to avoid data > loss similar to its use for physical standby. For example, if the user has by > mistake truncated the table (or deleted some required data) on the publisher, > we can always it from the subscriber if we have such a feature. > > Having said that, I am not sure if we can call it a restriction. It is more > of a TODO kind of thing. It doesn't sound advisable to me to keep growing the > current Restrictions page [1]. One could argue that not having delayed apply *is* a restriction compared to both physical replication and "the original upstream" pg_logical. I think therefore it should be mentioned in "Restrictions" so people considering moving from physical streaming to pg_logical or just trying to decide whether to use pg_logical are warned. Also, the Restrictions page starts with " These might be addressed in future releases." so there is no exclusivity of being either a restriction or TODO. > [1] - https://wiki.postgresql.org/wiki/Todo > [2] - > https://www.postgresql.org/docs/devel/logical-replication-restrictions.html - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
Re: prevent immature WAL streaming
Hi Alvaro, I just started reading this thread, but maybe you can confirm or refute my understanding of what was done. In the first email you write > As mentioned in the course of thread [1], we're missing a fix for streaming replication to avoid sending records that the primary hasn't fully flushed yet. This patch is a first attempt at fixing that problem by retreating the LSN reported as FlushPtr whenever a segment is registered, based on the understanding that if no registration exists then the LogwrtResult.Flush pointer can be taken at face value; but if a registration exists, then we have to stream only till the start LSN of that registered entry. So did we end up holding back the wal_sender to not send anything that is not confirmed as flushed on master? Are there measurements on how much this slows down replication compared to allowing sending the moment it is written in buffers but not necessarily flushed locally ? Did we investigate the possibility of sending as fast as possible and controlling the flush synchronisation by sending separate flush pointers *both* ways ? And maybe there was even an alternative considered where we are looking at a more general Durability, for example 2-out-of-3 where primary is one of the 3 and not necessarily the most durable one? ----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Fri, Sep 24, 2021 at 4:33 AM Alvaro Herrera wrote: > > On 2021-Sep-23, Alvaro Herrera wrote: > > > However, I notice now that the pg_rewind tests reproducibly fail in > > branch 14 for reasons I haven't yet understood. It's strange that no > > other branch fails, even when run quite a few times. > > Turns out that this is a real bug (setting EndOfLog seems insufficient). > I'm looking into it. > > -- > Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ > "No necesitamos banderas > No reconocemos fronteras" (Jorge González) > >
Re: WIP: System Versioned Temporal Table
On Mon, Sep 20, 2021 at 7:09 AM Corey Huinker wrote: > > On Sun, Sep 19, 2021 at 3:12 PM Hannu Krosing wrote: >> >> A side table has the nice additional benefit that we can very easily >> version the *table structure* so when we ALTER TABLE and the table >> structure changes we just make a new side table with now-currents >> structure. > > > It's true that would allow for perfect capture of changes to the table > structure, but how would you query the thing? > > If a system versioned table was created with a column foo that is type float, > and then we dropped that column, how would we ever query what the value of > foo was in the past? We can query that thing only in tables AS OF the time when the column was still there. We probably could get away with pretending the dropped columns are NULL in newer versions (and the versions before the column was added) An even trickier case would be changing the column data type. > > Would the columns returned from SELECT * change based on the timeframe > requested? If we want to emulate real table history, then it should. But the * thing was not really specified well even for original PostgreSQL inheritance. Maybe we could do SELECT (* AS OF 'yesterday afternoon'::timestamp) FROM ... :) > If we then later added another column that happened to also be named foo but > now was type JSONB, would we change the datatype returned based on the time > period being queried? Many databases do allow returning multiple result sets, and actually the PostgreSQL wire *protocol* also theoretically supports this, just that it is not supported by any current client library. With current libraries it would be possible to return a dynamic version of row_to_json(t.*) which changes based on actual historical table structure > Is the change in structure a system versioning which itself must be captured? We do capture it (kind of) for logical decoding. 
That is, we decode according to the structure in effect at the time of row creation, though we currently miss the actual DDL itself. So there is a lot to figure out and define, but at least storing the history in a separate table gives a good foundation to build upon. - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
Re: WIP: System Versioned Temporal Table
A side table has the nice additional benefit that we can very easily version the *table structure* so when we ALTER TABLE and the table structure changes we just make a new side table with the now-current structure. Also we may want a different set of indexes on the historic table(s) for whatever reason. And we may even want to partition history tables for speed, storage cost or just to drop very ancient history - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Sun, Sep 19, 2021 at 8:32 PM Simon Riggs wrote: > > On Sun, 19 Sept 2021 at 01:16, Corey Huinker wrote: > >> > >> 1. Much of what I have read about temporal tables seemed to imply or > >> almost assume that system temporal tables would be implemented as two > >> actual separate tables. Indeed, SQLServer appears to do it that way [1] > >> with syntax like > >> > >> WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.WebsiteUserInfoHistory)); > >> > >> > >> Q 1.1. Was that implementation considered and if so, what made this > >> implementation more appealing? > >> > > > > I've been digging some more on this point, and I've reached the conclusion > > that a separate history table is the better implementation. It would make > > the act of removing system versioning into little more than a DROP TABLE, > > plus adjusting the base table to reflect that it is no longer system > > versioned. > > Thanks for giving this a lot of thought. When you asked the question > the first time you hadn't discussed how that might work, but now we > have something to discuss. > > > 10. Queries that omit the FOR SYSTEM_TIME clause, as well as ones that use > > FOR SYSTEM_TIME AS OF CURRENT_TIMESTAMP, would simply use the base table > > directly with no quals to add. > > 11. Queries that use FOR SYSTEM_TIME and not FOR SYSTEM_TIME AS OF > > CURRENT_TIMESTAMP, then the query would do a union of the base table and > > the history table with quals applied to both. > > > > 14. 
DROP SYSTEM VERSIONING from a table would be quite straightforward - > > the history table would be dropped along with the triggers that reference > > it, setting relissystemversioned = 'f' on the base table. > > > > I think this would have some key advantages: > > > > 1. MVCC bloat is no worse than it was before. > > The number of row versions stored in the database is the same for > both, just it would be split across two tables in this form. > > > 2. No changes whatsoever to referential integrity. > > The changes were fairly minor, but I see your thinking about indexes > as a simplification. > > > 3. DROP SYSTEM VERSIONING becomes an O(1) operation. > > It isn't top of mind to make this work well. The whole purpose of the > history is to keep it, not to be able to drop it quickly. > > > > Thoughts? > > There are 3 implementation routes that I see, so let me explain so > that others can join the discussion. > > 1. Putting all data in one table. This makes DROP SYSTEM VERSIONING > effectively impossible. It requires access to the table to be > rewritten to add in historical quals for non-historical access and it > requires some push-ups around indexes. (The current patch adds the > historic quals by kludging the parser, which is wrong place, since it > doesn't work for joins etc.. However, given that issue, the rest seems > to follow on naturally). > > 2. Putting data in a side table. This makes DROP SYSTEM VERSIONING > fairly trivial, but it complicates many DDL commands (please make a > list?) and requires the optimizer to know about this and cater to it, > possibly complicating plans. Neither issue is insurmountable, but it > becomes more intrusive. > > The current patch could go in either of the first 2 directions with > further work. > > 3. Let the Table Access Method handle it. I call this out separately > since it avoids making changes to the rest of Postgres, which might be > a good thing, with the right TAM implementation. 
> > My preferred approach would be to do this "for free" in the table > access method, but we're a long way from this in terms of actual > implementation. When Corey suggested earlier that we just put the > syntax in there, this was the direction I was thinking. > > After waiting a day since I wrote the above, I think we should go with > (2) as Corey suggests, at least for now, and we can always add (3) > later. > > -- > Simon Riggshttp://www.EnterpriseDB.com/ > >
Re: The Free Space Map: Problems and Opportunities
On Wed, Sep 8, 2021 at 6:52 AM Peter Geoghegan wrote: > > On Tue, Sep 7, 2021 at 5:25 AM Hannu Krosing wrote: > > Are you speaking of just heap pages here or also index pages ? > > Mostly heap pages, but FWIW I think it could work for index tuples > too, with retail index tuple deletion. Because that allows you to even > remove would-be LP_DEAD item pointers. But it does need info from the heap page(s) the index entry points to, which can become quite expensive if that heap page is locked by other processes or even not cached at all. Also, parallel processes may have moved the index entry to other pages etc. > > Or are you expecting these to be kept in good-enoug shape by your > > earlier index manager work ? > > It's very workload dependent. Some things were very much improved by > bottom-up index deletion in Postgres 14, for example (non-hot updates > with lots of logically unchanged indexes). Other things weren't helped > at all, or were barely helped. I think it's important to cover or > cases. > > > A minimal useful patch emerging from that understanding could be > > something which just adds hysteresis to FSM management. (TBH, I > > actually kind of expected some hysteresis to be there already, as it > > is in my mental model of "how things should be done" for managing > > almost any resource :) ) > > I think that you need to do the FSM before the aborted-heap-tuple > cleanup. Otherwise you don't really know when or where to apply the > special kind of pruning that the patch invents, which targets aborts > specifically. An alternative is to not actually prune the page at all at this time (to avoid re-dirtying it and/or a WAL write) but just inspect it and add the *potential* free space to the FSM. And then do the pruning at the time of the next INSERT or UPDATE, when the page is dirtied and WAL written anyway. 
> > Adding hysteresis to FSM management can hopefully be done independent > > of all the other stuff and also seems to be something that is > > unobtrusive and non-controversial enough to fit in current release and > > possibly be even back-ported . > > I don't know about that! Seems kind of an invasive patch to me. Why do you think that changing the calculation when a page is added back to FSM is "kind of invasive" ? Not having looked at the code I would assume from the discussion here that we remove the page from FSM when free space goes above FILLFACTOR and we add it back when it goes back below FILLFACTOR. The code change would just change the "add it back" code to use FILLFACTOR - HYSTERESIS_FACTOR instead. What other parts does this need to touch ? Or is it just that the calculation source code is not localised well in code and written multiple times all over the place ? > > I did not mean CHECKPOINT as a command, but more the concept of > > writing back / un-dirtying the page. In this sense it *is* special > > because it is the last point in time where you are guaranteed to have > > the page available in buffercache and thus cheap to access for > > modifications plus you will avoid a second full-page writeback because > > of cleanup. Also you do not want to postpone the cleanup to actual > > page eviction, as that is usually in the critical path for some user > > query or command. > > I intend to do some kind of batching, but only at the level of small > groups of related transactions. Writing a page that was quickly opened > for new inserts, filled with newly inserted heap tuples, then closed, > and finally cleaned up doesn't seem like it needs to take any direct > responsibility for writeback. Agreed that this code does not need to care about writeback. 
I approached it from the other end and proposed the "just before writeback" as a strategy which is easy to understand conceptually and also is mostly "already there" codewise, that is there is a location in write-back code to plug the call to cleanup in . Cheers - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
Re: The Free Space Map: Problems and Opportunities
On Tue, Sep 7, 2021 at 2:29 AM Peter Geoghegan wrote: > > On Mon, Sep 6, 2021 at 4:33 PM Hannu Krosing wrote: > > When I have been thinking of this type of problem it seems that the > > latest -- and correct :) -- place which should do all kinds of > > cleanup like removing aborted tuples, freezing committed tuples and > > setting any needed hint bits would be background writer or CHECKPOINT. > > > > This would be more PostgreSQL-like, as it moves any work not > > immediately needed from the critical path, as an extension of how MVCC > > for PostgreSQL works in general. > > I think it depends. There is no need to do work in the background > here, with TPC-C. With my patch series each backend can know that it > just had an aborted transaction that affected a page that it more or > less still owns. And has very close at hand, for further inserts. It's > very easy to piggy-back the work once you have that sense of ownership > of newly allocated heap pages by individual backends/transactions. Are you speaking of just heap pages here or also index pages ? It seems indeed easy for heap, but index pages can be mixed up by other parallel work, especially things like Serial Primary Keys. Or are you expecting these to be kept in good-enough shape by your earlier index manager work ? > > This would be more PostgreSQL-like, as it moves any work not > > immediately needed from the critical path, as an extension of how MVCC > > for PostgreSQL works in general. > > I think that it also makes sense to have what I've called "eager > physical rollback" that runs in the background, as you suggest. > > I'm thinking of a specialized form of VACUUM that targets a specific > aborted transaction's known-dirtied pages. That's my long term goal, > actually. Originally I wanted to do this as a way of getting rid of > SLRUs and tuple freezing, by representing that all heap pages must > only have committed tuples implicitly. 
That seemed like a good enough > reason to split VACUUM into specialized "eager physical rollback > following abort" and "garbage collection" variants. > > The insight that making abort-related cleanup special will help free > space management is totally new to me -- it emerged from working > directly on this benchmark. But it nicely complements some of my > existing ideas about improving VACUUM. A minimal useful patch emerging from that understanding could be something which just adds hysteresis to FSM management. (TBH, I actually kind of expected some hysteresis to be there already, as it is in my mental model of "how things should be done" for managing almost any resource :) ) Adding hysteresis to FSM management can hopefully be done independent of all the other stuff, and also seems to be something that is unobtrusive and non-controversial enough to fit in the current release and possibly even be back-ported. > > But doing it as part of checkpoint probably ends up with less WAL > > writes in the end. > > I don't think that checkpoints are special in any way. They're very > important in determining the total number of FPIs we'll generate, and > so have huge importance today. But that seems accidental to me. I did not mean CHECKPOINT as a command, but more the concept of writing back / un-dirtying the page. In this sense it *is* special, because it is the last point in time where you are guaranteed to have the page available in the buffer cache and thus cheap to access for modifications, plus you will avoid a second full-page writeback because of the cleanup. Also you do not want to postpone the cleanup to actual page eviction, as that is usually in the critical path for some user query or command. Of course this becomes most important for workloads where the active working set is larger than fits in memory, and this is not a typical case for OLTP any more.
But at least freezing the page before write-out could have a very big impact on the need to freeze "old" pages available only on disk, and thus would be a cheap way to improve the problems around running out of transaction ids. > > There could be a possibility to do a small amount of cleanup -- enough > > for TPC-C-like workloads, but not larger ones -- while waiting for the > > next command to arrive from the client over the network. This of > > course assumes that we will not improve our feeder mechanism to have > > back-to-back incoming commands, which can already be done today, but > > which I have seen seldom used. > > That's what I meant, really. Doing the work of cleaning up a heap page > that a transaction inserts into (say pruning away aborted tuples or > setting hint bits) should ideally happen right after commit or abort > -- at least for OLTP-like workloads, which are the common case.
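The hysteresis idea above can be made concrete with a tiny sketch. This is purely hypothetical illustration -- the thresholds, names and structure are invented, not PostgreSQL's actual FSM code: a page is only advertised as having free space once its free space crosses a high-water mark, and is only retracted once it drops below a low-water mark, so pages near the boundary do not flip in and out of the map on every update.

```python
# Hypothetical sketch of hysteresis for free-space tracking.
# Thresholds and structure are invented for illustration only.

class HysteresisFSM:
    def __init__(self, advertise_at=1024, retract_at=256):
        # advertise_at > retract_at is what creates the hysteresis band
        assert advertise_at > retract_at
        self.advertise_at = advertise_at
        self.retract_at = retract_at
        self.advertised = set()

    def report_free(self, page, free_bytes):
        if free_bytes >= self.advertise_at:
            self.advertised.add(page)
        elif free_bytes < self.retract_at:
            self.advertised.discard(page)
        # between the two thresholds the previous state is kept

    def pages_with_space(self):
        return sorted(self.advertised)

fsm = HysteresisFSM()
fsm.report_free(1, 2000)   # plenty free -> advertised
fsm.report_free(1, 500)    # inside the band -> stays advertised
fsm.report_free(1, 100)    # below low-water mark -> retracted
fsm.report_free(2, 500)    # inside the band, never advertised -> stays out
print(fsm.pages_with_space())   # -> []
```

The point of the band between the two thresholds is exactly the "hysteresis for managing almost any resource" intuition: small back-and-forth changes in free space no longer cause the page to churn in and out of the map.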
Re: The Free Space Map: Problems and Opportunities
When I have been thinking of this type of problem it seems that the latest -- and correct :) -- place which should do all kinds of cleanup like removing aborted tuples, freezing committed tuples and setting any needed hint bits would be background writer or CHECKPOINT. This would be more PostgreSQL-like, as it moves any work not immediately needed from the critical path, as an extension of how MVCC for PostgreSQL works in general. Another option could be one more background cleaner with a "needs cleanup" bitmap which is updated by "real work" backends to let the cleaner know what needs to be done. But doing it as part of checkpoint probably ends up with less WAL writes in the end. That said, CHECKPOINT can efficiently clean up only Heap pages, unless we do some extra work to ensure buffer-cache eviction ordering so that Heap is (almost) always cleaned first and thus has an option to also clean index pointers which point to it. There could be a possibility to do a small amount of cleanup -- enough for TPC-C-like workloads, but not larger ones -- while waiting for the next command to arrive from the client over the network. This of course assumes that we will not improve our feeder mechanism to have back-to-back incoming commands, which can already be done today, but which I have seen seldom used. Cheers, - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Mon, Sep 6, 2021 at 11:59 PM Peter Geoghegan wrote: > > On Mon, Aug 16, 2021 at 5:15 PM Peter Geoghegan wrote: > > This is actually quite simple -- the chaos that follows is where it > > gets truly complicated (and not necessarily worth getting into just > > yet). The update of a given order and its order lines entries takes > > place hours later, within a delivery transaction. > > Update on this: aborted transactions play an important role in the > chaos that we see with BenchmarkSQL/TPC-C. > > This is surprising, in one way.
Only 1% of all transactions are > aborted (per the TPC-C spec). But in another way it's not surprising. > It's totally consistent with what I've been saying about open and > closed pages. When a page that's initially allocated and opened by a > backend becomes closed following some inserts, the page should be left > in a more or less pristine state -- owning backends that open and > close pages need to be "good citizens". > > The definition of "pristine" must also include "no already-aborted > tuples" -- aborted xact tuples are a problem that we can and should > nip in the bud. This is really an extension of what I've been saying > about not disrupting naturally occurring temporal and spatial locality. > We're far better off if the space reused by freeing aborted heap > tuples is reused by the very next transaction running in the very same > backend. Plus it's not that hard; we can know for sure that this space > is immediately available, which is not generally true of pruning. And > so this free space from aborted xact tuples is quite special. > > Even with the master branch the problems with the FSM are much less > noticeable once you configure BenchmarkSQL to not abort any > transactions [1]. You need to consider the whole picture to understand > how the issue of aborted transactions has such an outsized negative > impact on performance and space efficiency. > > It's a whole *chain* of events: > > * If just 1% of all transactions abort, that's enough to leave an > average of about 1 aborted tuple on every "order" table page (which > naturally has small heap tuples [2]), and about 10 on a minority of > the really big "new order" table's pages. Maybe 1 in 10 or 12 "order > line" table heap pages are affected in this way. > > * The overall impact is that small (but not tiny) pockets of free > space are left randomly strewn all over the place. Not just LP_DEAD > line pointer stubs -- whole entire aborted heap tuples.
> > * In theory we could address this problem using opportunistic pruning. > But in practice that simply cannot happen today, because we don't > touch the inserted rows until literally hours later. > > The same rows will eventually have updates and selects, but that won't > help -- too little, too late. That's just how opportunistic pruning > works: currently you need some kind of scan node for opportunistic > pruning to kick in, so continually inserting transactions (from new > orders) are currently fundamentally unable to do any opportunistic > pruning. Much less opportunistic pruning that kicks in at exactly the > right time. > > * When an autovacuum worker runs against these tables, it will of > course notice this diffuse free space from aborted tuples, and put it >
Re: Middleware Messages for FE/BE
Jesper, please don't confuse my ramblings with Simon's initial proposal. As I understand it, the original proposal was just about adding a new wire protocol message type, which could then be used for emitting custom messages by a middleware support extension - likely loaded as a preloaded shared library - and consumed by some middleware, which could either be some proxy or connection pooler, or something compiled as part of libpq, JDBC or another client driver. So it was just about providing extensibility to the protocol (the same way almost everything else in PostgreSQL is extensible). But at least initially each middleware would process only its own class, so a single if() and not a big switch/case :) The things I talked about were some forward-looking proposals on how to use this capability, and that some of the messages may at some point become used / usable by more than a single middleware component, at which point they should be standardised. - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Thu, Aug 19, 2021 at 8:29 PM Simon Riggs wrote: > > On Thu, 19 Aug 2021 at 19:04, Jesper Pedersen > wrote: > > > Should the middleware guess the client driver based on the startup > > message ? Or are you thinking about having a class identifier group for > > each client driver and then a massive "switch/case" inside each middleware ? > > The question is how does a client communicate with middleware? > > The message that would be sent would relate to the features of the > middleware, while being agnostic as to the client driver. > > -- > Simon Riggs http://www.EnterpriseDB.com/ > >
Re: Middleware Messages for FE/BE
Actually the way I have been thinking about this would be a message "Hey replica x.x.x.x, please take over the write head role", which would:

1. cause the replica x.x.x.x to self-promote at this point-in-WAL
   1A. optionally, the old write head becomes a replica (it should say so in the message if it plans to stay around as a replica)
2. tell the other replicas to switch over and start replicating from it
3. tell all connected clients to switch, if they need read-write access.

This could even be done without a timeline switch, as it should guarantee that there is no split-brain writing. Of course there are (and should be) ways to still use the WALs normally for cases where replica x.x.x.x does not exist, like PITR. And making this play nicely with Logical Decoding is another can of worms needing to be solved, but it is a start to becoming "Cloud Native" :) - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Thu, Aug 19, 2021 at 11:36 AM Hannu Krosing wrote: > > One more set of "standard middleware messages" clients/middleware > could turn on could be reporting LSNs > * various local LSNs for progress of WAL persisting > * reporting replication state of some or all replicas > > - > Hannu Krosing > Google Cloud - We have a long list of planned contributions and we are hiring. > Contact me if interested. > > On Thu, Aug 19, 2021 at 11:33 AM Hannu Krosing wrote: > > > > Need for this does come up quite often so I very much support this. > > > > In addition to keeping a registry there likely need to be some other > > "generally agreed" rules as well, like > > * it being safe to ignore any and all of the middleware messages (at > > least with no degradation from the state of not having them) > > * and maybe even a standard way to turn them on and off. > > > > On/Off switch could be of course done using flags for each > > individual use case, but it would be good to agree conventions.
> > > > Another thing to agree would be a set of standard messages, like "I am > > overloaded, consider moving some load away" or "Planning to switch > > over to replica x.x.x.x, please follow" > > > > - > > Hannu Krosing > > Google Cloud - We have a long list of planned contributions and we are > > hiring. > > Contact me if interested. > > > > > > On Thu, Aug 19, 2021 at 10:33 AM Simon Riggs > > wrote: > > > > > > The current FE/BE protocol assumes that the client is talking to the > > > server directly. If middleware wants to do something like load > > > balancing, the only current option is to inspect the incoming commands > > > and take action from that. That approach performs poorly and has > > > proven difficult to maintain, limiting the functionality in Postgres > > > ecosystem middleware. > > > > > > It would be useful to have a way to speak to middleware within a > > > session or prior to each command. There are ways to frig this and > > > obviously it can always be done out-of-core, but there is a clear > > > requirement for various client tools to agree a standard way for them > > > to send messages to middleware rather than the database server. If we > > > get PostgreSQL Project approval for this, then authors of client and > > > middleware tools will know how to interoperate and can begin adding > > > features to take advantage of this, allowing the Postgres ecosystem to > > > improve and extend its middleware. > > > > > > Byte1('M') > > > Identifies the message as a middleware message. (Or perhaps use 'U' > > > for User Message?). > > > > > > Int32 > > > Length of message contents in bytes, including self. > > > > > > Int64 > > > Routing/class identifier, allowing middleware to quickly ignore whole > > > classes of message if not appropriate. We would run some kind of > > > allocation scheme to ensure unique meaning, probably via the Postgres > > > Wiki. 
The first 2 billion values would be reserved for allocation by > > > the PostgreSQL Project itself, values beyond that open for external > > > allocation. > > > > > > Byten > > > The message itself, where n is the remaining bytes in the message. > > > > > > The message is intended for middleware only. The server always ignores > > > these messages, with an optional debug facility that can be enabled to > > > allow printing NOTICEs to allow testing. > > > > > > I will supply a patch to any agreed message format, together with a > > > new libpq call to utilise this. > > > > > > In summary: the patch is easy, the point is we need agreement to allow > > > and encourage interoperation between clients and middleware. > > > > > > Thoughts? > > > > > > -- > > > Simon Riggshttp://www.EnterpriseDB.com/ > > > > > >
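For concreteness, Simon's proposed framing -- Byte1('M'), Int32 length including itself, Int64 routing/class identifier, then the payload -- could be packed and parsed roughly as below. Only the field layout comes from the proposal above; the function names and the sample class id are invented for illustration. PostgreSQL's wire protocol uses network (big-endian) byte order, which is assumed here too.

```python
import struct

# Sketch of the proposed middleware message framing:
#   Byte1('M') | Int32 length (incl. self) | Int64 class id | payload
# Function names and the example class id are hypothetical.

def pack_middleware_message(class_id: int, payload: bytes) -> bytes:
    # the length field counts itself (4), the Int64 class id (8),
    # and the payload -- matching the FE/BE convention
    length = 4 + 8 + len(payload)
    return b"M" + struct.pack("!iq", length, class_id) + payload

def unpack_middleware_message(msg: bytes):
    assert msg[0:1] == b"M"
    length, class_id = struct.unpack("!iq", msg[1:13])
    payload = msg[13:13 + (length - 12)]
    return class_id, payload

msg = pack_middleware_message(42, b"please follow 10.0.0.5")
cid, body = unpack_middleware_message(msg)
print(cid, body)   # -> 42 b'please follow 10.0.0.5'
```

Note how a middleware only needs the first 13 bytes to decide whether the class id is one it handles, which is exactly the "quickly ignore whole classes of message" property the Int64 field is there for.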
Re: Middleware Messages for FE/BE
One more set of "standard middleware messages" clients/middleware could turn on could be reporting LSNs * various local LSNs for progress of WAL persisting * reporting replication state of some or all replicas ----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Thu, Aug 19, 2021 at 11:33 AM Hannu Krosing wrote: > > Need for this does come up quite often so I very much support this. > > In addition to keeping a registry there likely need to be some other > "generally agreed" rules as well, like > * it being safe to ignore any and all of the middleware messages (at > least with no degradation from the state of not having them) > * and maybe even a standard way to turn them on and off. > > On/Off switch could be of course done using flags for each > individual use case, but it would be good to agree conventions. > > Another thing to agree would be a set of standard messages, like "I am > overloaded, consider moving some load away" or "Planning to switch > over to replica x.x.x.x, please follow" > > - > Hannu Krosing > Google Cloud - We have a long list of planned contributions and we are hiring. > Contact me if interested. > > > On Thu, Aug 19, 2021 at 10:33 AM Simon Riggs > wrote: > > > > The current FE/BE protocol assumes that the client is talking to the > > server directly. If middleware wants to do something like load > > balancing, the only current option is to inspect the incoming commands > > and take action from that. That approach performs poorly and has > > proven difficult to maintain, limiting the functionality in Postgres > > ecosystem middleware. > > > > It would be useful to have a way to speak to middleware within a > > session or prior to each command. 
There are ways to frig this and > > obviously it can always be done out-of-core, but there is a clear > > requirement for various client tools to agree a standard way for them > > to send messages to middleware rather than the database server. If we > > get PostgreSQL Project approval for this, then authors of client and > > middleware tools will know how to interoperate and can begin adding > > features to take advantage of this, allowing the Postgres ecosystem to > > improve and extend its middleware. > > > > Byte1('M') > > Identifies the message as a middleware message. (Or perhaps use 'U' > > for User Message?). > > > > Int32 > > Length of message contents in bytes, including self. > > > > Int64 > > Routing/class identifier, allowing middleware to quickly ignore whole > > classes of message if not appropriate. We would run some kind of > > allocation scheme to ensure unique meaning, probably via the Postgres > > Wiki. The first 2 billion values would be reserved for allocation by > > the PostgreSQL Project itself, values beyond that open for external > > allocation. > > > > Byten > > The message itself, where n is the remaining bytes in the message. > > > > The message is intended for middleware only. The server always ignores > > these messages, with an optional debug facility that can be enabled to > > allow printing NOTICEs to allow testing. > > > > I will supply a patch to any agreed message format, together with a > > new libpq call to utilise this. > > > > In summary: the patch is easy, the point is we need agreement to allow > > and encourage interoperation between clients and middleware. > > > > Thoughts? > > > > -- > > Simon Riggshttp://www.EnterpriseDB.com/ > > > >
Re: Middleware Messages for FE/BE
Need for this does come up quite often so I very much support this. In addition to keeping a registry there likely need to be some other "generally agreed" rules as well, like * it being safe to ignore any and all of the middleware messages (at least with no degradation from the state of not having them) * and maybe even a standard way to turn them on and off. On/Off switch could be of course done using flags for each individual use case, but it would be good to agree conventions. Another thing to agree would be a set of standard messages, like "I am overloaded, consider moving some load away" or "Planning to switch over to replica x.x.x.x, please follow" - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Thu, Aug 19, 2021 at 10:33 AM Simon Riggs wrote: > > The current FE/BE protocol assumes that the client is talking to the > server directly. If middleware wants to do something like load > balancing, the only current option is to inspect the incoming commands > and take action from that. That approach performs poorly and has > proven difficult to maintain, limiting the functionality in Postgres > ecosystem middleware. > > It would be useful to have a way to speak to middleware within a > session or prior to each command. There are ways to frig this and > obviously it can always be done out-of-core, but there is a clear > requirement for various client tools to agree a standard way for them > to send messages to middleware rather than the database server. If we > get PostgreSQL Project approval for this, then authors of client and > middleware tools will know how to interoperate and can begin adding > features to take advantage of this, allowing the Postgres ecosystem to > improve and extend its middleware. > > Byte1('M') > Identifies the message as a middleware message. (Or perhaps use 'U' > for User Message?). > > Int32 > Length of message contents in bytes, including self. 
> > Int64 > Routing/class identifier, allowing middleware to quickly ignore whole > classes of message if not appropriate. We would run some kind of > allocation scheme to ensure unique meaning, probably via the Postgres > Wiki. The first 2 billion values would be reserved for allocation by > the PostgreSQL Project itself, values beyond that open for external > allocation. > > Byten > The message itself, where n is the remaining bytes in the message. > > The message is intended for middleware only. The server always ignores > these messages, with an optional debug facility that can be enabled to > allow printing NOTICEs to allow testing. > > I will supply a patch to any agreed message format, together with a > new libpq call to utilise this. > > In summary: the patch is easy, the point is we need agreement to allow > and encourage interoperation between clients and middleware. > > Thoughts? > > -- > Simon Riggshttp://www.EnterpriseDB.com/ > >
pgbench functions as extension
Has anybody worked on making the pgbench internal functions available as extensions? I mean the ones from "Built-In Functions" [*] Some of these are probably available in other extensions or even as builtins, but having things like hash_fnv1a() or random_zipfian() which are guaranteed to have the exact same behaviour as the ones in the bundled pgbench would enable more work to be pushed to the database and generally result in better interoperability for tests. -- [*] https://www.postgresql.org/docs/13/pgbench.html#PGBENCH-BUILTIN-FUNCTIONS ----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
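As a reference point for the "exact same behaviour" concern: hash_fnv1a() is based on the well-known FNV-1a algorithm, whose core is only a few lines. The sketch below is the textbook 64-bit FNV-1a over a byte string; pgbench's version operates on int64 values and mixes in a seed, so matching its output bit-for-bit would need checking against pgbench itself -- this only shows the underlying algorithm.

```python
# Textbook 64-bit FNV-1a over bytes. Not guaranteed to be bit-for-bit
# identical to pgbench's hash_fnv1a(), which hashes int64 values and
# takes a seed; the constants below are the standard FNV-1a parameters.

FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3
MASK64 = (1 << 64) - 1

def fnv1a_64(data: bytes) -> int:
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte                      # xor first ("1a" variant)...
        h = (h * FNV64_PRIME) & MASK64 # ...then multiply, mod 2^64
    return h

print(hex(fnv1a_64(b"")))   # offset basis: 0xcbf29ce484222325
```

Having this as an in-database extension function would let test data generation use the exact same distribution of hash values inside SQL as pgbench uses client-side.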
Re: NAMEDATALEN increase because of non-latin languages
Could we just make the limitation 64 (or 128) _characters_, not _bytes_? Memory sizes and processor speeds have grown by order(s) of magnitude since the 64 byte limit was decided, and supporting non-ASCII charsets properly seems like a prudent thing to do. Also - have we checked that at least the truncation does not cut UTF-8 characters in half? - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Wed, Aug 18, 2021 at 1:33 PM Julien Rouhaud wrote: > > On Wed, Aug 18, 2021 at 7:27 PM John Naylor > wrote: > > > > On Wed, Aug 18, 2021 at 7:15 AM Julien Rouhaud wrote: > > > > > > Unfortunately, the problem isn't really the additional disk space it > > > would require. The problem is the additional performance hit and > > > memory overhead, as the catalog names are part of the internal > > > syscache. > > > > Some actual numbers on recent hardware would show what kind of tradeoff is > > involved. No one has done that for a long time that I recall. > > Agreed, but I don't have access to such hardware. However this won't > influence the memory overhead part, and there is already frequent > problems with that, especially since declarative partitioning, so I > don't see how we could afford that without some kind of cache TTL or > similar. AFAIR the last discussion about it a few years ago didn't > lead anywhere :( > >
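Even with a byte limit, truncation can at least be made to never split a multibyte character. A sketch of what such a check could look like (this is an illustration of the idea, not PostgreSQL's actual name-truncation code, and the 63-byte default mirrors NAMEDATALEN-1):

```python
# Sketch: truncate a UTF-8 string to at most max_bytes without cutting
# a multibyte character in half. Not PostgreSQL's actual code; the
# default of 63 corresponds to NAMEDATALEN - 1.

def truncate_utf8(name: str, max_bytes: int = 63) -> str:
    raw = name.encode("utf-8")
    if len(raw) <= max_bytes:
        return name
    # walk back (at most 3 bytes) until the prefix is valid UTF-8 on
    # its own: a prefix ending mid-character fails to decode
    cut = max_bytes
    while cut > 0:
        try:
            return raw[:cut].decode("utf-8")
        except UnicodeDecodeError:
            cut -= 1
    return ""

# 32 two-byte characters = 64 bytes; a naive 63-byte cut would leave a
# dangling lead byte, so we back up to 62 bytes = 31 whole characters
print(truncate_utf8("ä" * 32) == "ä" * 31)   # -> True
```

A character-counted limit would sidestep this entirely, but byte-safe truncation is the minimal fix for the "cut in half" question.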
Is there now an official way to pass column projection to heap AM
When working on zedstore, the developers solved the lack of support for passing column projection into the AM as explained in the zedstore/README [*] -8<--8<--8<---8<--- Current table am API requires enhancement here to pass down column projection to AM. The patch showcases two different ways for the same. * For sequential scans added new beginscan_with_column_projection() API. Executor checks AM property and if it leverages column projection uses this new API else normal beginscan() API. * For index scans instead of modifying the begin scan API, added new API to specifically pass column projection list after calling begin scan to populate the scan descriptor but before fetching the tuples. -8<--8<--8<---8<--- This of course also requires patching the call sites in the PostgreSQL code itself to make use of these methods. As this was already almost a year ago, have there been any developments in the core PostgreSQL code for supporting this? Or is projection still supported only in FDWs? -- [*] https://github.com/greenplum-db/postgres/blob/zedstore/src/backend/access/zedstore/README - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
Re: [PoC] Improve dead tuple storage for lazy vacuum
e bit like a > hash join makes sense because hash joins do very well with high join > selectivity, and high join selectivity is common in the real world. > The intersection of TIDs from each leaf page with the in-memory TID > delete structure will often be very small indeed. The hard-to-optimize case is still when we have dead tuple counts in the hundreds of millions, or even billions, like on an HTAP database after a few hours of OLAP queries have accumulated loads of dead tuples in tables getting heavy OLTP traffic. There of course we could do a totally different optimisation, where we also allow reaping tuples newer than the OLAP query's snapshot if we can prove that when the snapshot moves forward next time, it has to jump over said transactions, making them indeed DEAD and not RECENTLY DEAD. Currently we let a single OLAP query ruin everything :) > > Then again it may be so much extra work that it starts to dominate > > some parts of profiles. > > > > For example see the work that was done in improving the mini-vacuum > > part where it was actually faster to copy data out to a separate > > buffer and then back in than shuffle it around inside the same 8k page > > Some of what I'm saying is based on the experience of improving > similar code used by index tuple deletion in Postgres 14. That did > quite a lot of sorting of TIDs and things like that. In the end the > sorting had no more than a negligible impact on performance. Good to know :) > What > really mattered was that we efficiently coordinate with distant heap > pages that describe which index tuples we can delete from a given leaf > page. Sorting hundreds of TIDs is cheap. Reading hundreds of random > locations in memory (or even far fewer) is not so cheap. It might even > be very slow indeed. Sorting in order to batch could end up looking > like cheap insurance that we should be glad to pay for. If the most expensive operation is sorting a few hundred tids, then this should be fast enough.
My worries were more that after the sorting we cannot do simple index lookups for them; each needs to be found via bsearch, or maybe even linear search if that is faster under some size limit, and these could add up. Or some other needed thing that also has to be done, like allocating extra memory or moving other data around in a way that the CPU does not like. Cheers - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
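The sorted-array-plus-bsearch lookup being discussed (what lazy_tid_reaped() effectively does) can be sketched in a few lines. This is an illustrative model, not the C implementation: TIDs are encoded so that integer order matches (block, offset) order, and membership is a single binary search. All names here are invented.

```python
import bisect

# Sketch of dead-TID lookup via a sorted array and binary search, in
# the spirit of lazy_tid_reaped(). Encoding a TID as
# block * MAX_OFFSET + offset keeps integer sort order identical to
# (block, offset) order. Names and constants are illustrative.

MAX_OFFSET = 292   # > MaxHeapTuplesPerPage (291), so offsets never wrap

def encode_tid(block: int, offset: int) -> int:
    return block * MAX_OFFSET + offset

def build_dead_tids(tids):
    # one sort up front, then every lookup is O(log n)
    return sorted(encode_tid(b, o) for b, o in tids)

def tid_is_dead(dead, block, offset):
    key = encode_tid(block, offset)
    i = bisect.bisect_left(dead, key)
    return i < len(dead) and dead[i] == key

dead = build_dead_tids([(10, 3), (10, 7), (42, 1)])
print(tid_is_dead(dead, 10, 7), tid_is_dead(dead, 10, 8))  # -> True False
```

Hannu's worry above is visible in this shape: every probe is a log-n search with data-dependent memory accesses, so the per-lookup cost, not the one-off sort, is what could add up over hundreds of millions of index tuples.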
Re: [PoC] Improve dead tuple storage for lazy vacuum
simple array of > TIDs. Maybe this will help the compiler and the CPU to fully > understand the *natural* data dependencies, so that they can be as > effective as possible in making the code run fast. It's possible that > a modern CPU will be able to *hide* the latency more intelligently > than what we have today. The latency is such a big problem that we may > be able to justify "wasting" other CPU resources, just because it > sometimes helps with hiding the latency. For example, it might > actually be okay to sort all of the TIDs on the page to make the bulk > processing work Then again it may be so much extra work that it starts to dominate some parts of profiles. For example see the work that was done in improving the mini-vacuum part where it was actually faster to copy data out to a separate buffer and then back in than shuffle it around inside the same 8k page :) So only testing will tell. > -- though you might still do a precheck that is > similar to the precheck inside lazy_tid_reaped() that was added by you > in commit bbaf315309e. > > Of course it's very easy to be wrong about stuff like this. But it > might not be that hard to prototype. You can literally copy and paste > code from _bt_delitems_delete_check() to do this. It does the same > basic thing already. Also a lot of testing would be needed to figure out which strategy fits best for which distribution of dead tuples, and possibly their relation to the order of tuples to check from indexes. Cheers -- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
Re: pgbench logging broken by time logic changes
On Thu, Jun 17, 2021 at 7:18 AM Kyotaro Horiguchi wrote: > > I'm not sure we have transaction lasts for very short time that > nanoseconds matters. > Nanoseconds may not matter yet, but they could be handy when for example we want to determine the order of parallel query executions. We are less than an order of magnitude away from being able to do 1M inserts/updates/deletes per second, so microseconds already are not always 100% reliable. We could possibly move to using LSNs fetched as part of the queries for this case, but this will surely introduce more problems than it solves :) Cheers - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
Re: [PoC] Improve dead tuple storage for lazy vacuum
Very nice results. I have been working on the same problem but with a bit different solution - a mix of binary search for (sub)pages and 32-bit bitmaps for tid-in-page. Even with the current allocation heuristics (allocate 291 tids per page) it initially allocates much less space: instead of the current 291*6=1746 bytes per page it needs to allocate 80 bytes. Also it can be laid out so that it is friendly to parallel SIMD searches doing up to 8 tid lookups in parallel. That said, for allocating the tid array, the best solution is to postpone it as much as possible and to do the initial collection into a file, which 1) postpones the memory allocation to the beginning of index cleanups 2) lets you select the correct size and structure as you know more about the distribution at that time 3) lets you do the first heap pass in one go and then advance frozenxmin *before* index cleanup Also, collecting dead tids into a file makes it trivial (well, almost :) ) to parallelize the initial heap scan, so more resources can be thrown at it if available. Cheers - Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Thu, Jul 8, 2021 at 10:48 AM Masahiko Sawada wrote: > > On Thu, Jul 8, 2021 at 5:24 AM Peter Geoghegan wrote: > > > > On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada > > wrote: > > > Currently, the TIDs of dead tuples are stored in an array that is > > > collectively allocated at the start of lazy vacuum and TID lookup uses > > > bsearch(). There are the following challenges and limitations: > > > > > > 1. Don't allocate more than 1GB. There was a discussion to eliminate > > > this limitation by using MemoryContextAllocHuge() but there were > > > concerns about point 2[1]. > > > > I think that the main problem with the 1GB limitation is that it is > > surprising -- it can cause disruption when we first exceed the magical > > limit of ~174 million TIDs.
This can cause us to dirty index pages a > > second time when we might have been able to just do it once with > > sufficient memory for TIDs. OTOH there are actually cases where having > > less memory for TIDs makes performance *better* because of locality > > effects. This perverse behavior with memory sizing isn't a rare case > > that we can safely ignore -- unfortunately it's fairly common. > > > > My point is that we should be careful to choose the correct goal. > > Obviously memory use matters. But it might be more helpful to think of > > memory use as just a proxy for what truly matters, not a goal in > > itself. It's hard to know what this means (what is the "real goal"?), > > and hard to measure it even if you know for sure. It could still be > > useful to think of it like this. > > As I wrote in the first email, I think there are two important factors > in index vacuuming performance: the performance to check if heap TID > that an index tuple points to is dead, and the number of times to > perform index bulk-deletion. The flame graph I attached in the first > mail shows CPU spent much time on lazy_tid_reaped() but vacuum is a > disk-intensive operation in practice. Given that most index AM's > bulk-deletion does a full index scan and a table could have multiple > indexes, reducing the number of times to perform index bulk-deletion > really contributes to reducing the execution time, especially for > large tables. I think that a more compact data structure for dead > tuple TIDs is one of the ways to achieve that. > > > > > > A run container is selected in this test case, using 4 bytes for each > > > block. > > > > > > Execution Time Memory Usage > > > array 8,883.03600,008,248 > > > intset 7,358.23100,671,488 > > > tbm758.81 100,671,544 > > > rtbm 764.33 29,384,816 > > > > > > Overall, 'rtbm' has a much better lookup performance and good memory > > > usage especially if there are relatively many dead tuples. 
However, in > > > some cases, 'intset' and 'array' have a better memory usage. > > > > This seems very promising. > > > > I wonder how much you have thought about the index AM side. It makes > > sense to initially evaluate these techniques using this approach of > > separating the data structure from how it is used by VACUUM -- I think > > that that was a good idea. But at the same time there may be certain > > important theoretical questions that cannot be answered this way -- > > questions about how everything "fits together" in a real VACUUM might > > matter a lot. You've probably thought
Re: .ready and .done files considered harmful
How are you envisioning the shared-memory signaling should work in the original sample case, where the archiver had been failing for half a year ? Or should we perhaps have a system table for ready-to-archive WAL files, to get around the limitations of the file system in returning just the needed files with ORDER BY ... LIMIT, as we already know how to make lookups in the database fast ? Cheers Hannu On Thu, May 6, 2021 at 12:24 PM Robert Haas wrote: > > On Thu, May 6, 2021 at 3:23 AM Kyotaro Horiguchi > wrote: > > FWIW It's already done for v14 individually. > > > > Author: Fujii Masao > > Date: Mon Mar 15 13:13:14 2021 +0900 > > > > Make archiver process an auxiliary process. > > Oh, I hadn't noticed. Thanks. > > -- > Robert Haas > EDB: http://www.enterprisedb.com > >
Re: MaxOffsetNumber for Table AMs
- Hannu Krosing On Thu, May 6, 2021 at 4:53 AM Jeff Davis wrote: > > On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote: > > How hard would it be to declare TID as current ItemPointerData with > > some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe, > > MovedPartitionsOffsetNumber = 0xfffd, presumably also 0xffff ?). > > I don't think there's consensus in this thread that we want to do that, > but I'd be fine with it. Sure. I just proposed this as a Minimal Viable Change. Just hoping that we can agree on an interim solution which addresses the immediate problem and then continue arguing about the ideal way for the rest of the v15 cycle (and the v16 and v17 ;) ) > > It's possible but not trivial. tidbitmap.c would be the biggest > challenge, I think. I think all these places (tidbitmap, gin, brin) rely on "relatively small" MaxHeapTuplesPerPage as an upper bound for some allocations and then still allocate a lot more than needed. One can get to 291 tids / page only when you create a table with no columns, or with less than 8 columns which are all set to NULL. A table with a single non-null boolean column can already fit only 226 tuples per page. It is definitely a non-trivial amount of work to rewrite these three, but going to (almost) full 48 bits from the current theoretically-a-little-over-40-effective-bits would expand the number of addressable tuples 225 times. Of course it would be extra nice to also somehow encode the 3 special ItemPointerData values (NULL, 0xfffe, 0xfffd) "somewhere else" and get an option of uninterrupted 48-bit address space for non-heap table AMs, but this would likely be much more disruptive, if possible at all. We could still check if they are used outside of heapam and maybe just fix these places. > > Regards, > Jeff Davis > > >
Re: MaxOffsetNumber for Table AMs
On Thu, May 6, 2021 at 3:07 AM Robert Haas wrote: > > On Wed, May 5, 2021 at 3:43 PM Matthias van de Meent > > Storage gains for index-oriented tables can become as large as the > > size of the primary key by not having to store all primary key values > > in both the index and the table; which can thus be around 100% of a > > table in the least efficient cases of having a PK over all columns. > > > > Yes, this might be indeed only a 'small gain' for access latency, but > > not needing to store another copy of your data (and keeping it in > > cache, etc.) is a significant win in my book. > > This is a really good point. Also, if the table is ordered by a > synthetic logical TID, range scans on the primary key will be less > efficient than if the primary key is itself the TID. We have the > ability to CLUSTER on an index for good reasons, and "Automatically > maintain clustering on a table" has been on the todo list forever. > It's hard to imagine this will ever be achieved with the current heap, > though: the way to get there is to have a table AM for which this is > an explicit goal. But would this not have the downside that all the secondary indexes will blow up as they now need to have the full table row as the TID ? - Hannu Krosing
Re: MaxOffsetNumber for Table AMs
How hard would it be to declare TID as current ItemPointerData with some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe, MovedPartitionsOffsetNumber = 0xfffd, presumably also 0xffff ?). And then commit to fixing usage outside access/heap/ which depends on a small value for MaxHeapTuplesPerPage, currently only nodes/tidscan.c , access/gin/ginpostinglist.c and access/brin/brin_*.c there is also MaxHeapTuplesPerPage usage in ./backend/storage/page/bufpage.c but it seems to be all in heapam-dependent functions (PageRepairFragmentation(), PageGetHeapFreeSpace() and a few others) which most likely should be moved to access/heap/ Doing it this way would leave us with some manageable complexity in mapping from TID to 48-bit integer and/or 3 wanted positions in 2^32 Hannu Krosing On Wed, May 5, 2021 at 8:40 PM Peter Geoghegan wrote: > > On Wed, May 5, 2021 at 11:25 AM Andres Freund wrote: > > Agreed. And we can increase the fit a good bit without needing invasive > > all-over changes. With those changes often even helping heap. > > > > E.g. tidbitmap.c's hardcoded use of MaxHeapTuplesPerPage is a problem > > even for heap - we waste a lot of space that's not commonly used. A > > better datastructure (radix tree like I'd say, but several tree shaped > > approaches seem possible). > > Agreed -- even if we only cared about heapam we still ought to do > something about tidbitmap.c's use of MaxHeapTuplesPerPage. > > -- > Peter Geoghegan > >
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
It would be really convenient if user-visible serialisations of the query id had something that identifies the computation method. maybe prefix 'N' for internal, 'S' for pg_stat_statements etc. This would immediately show in logs at what point the id calculator was changed On Fri, Mar 19, 2021 at 5:54 PM Bruce Momjian wrote: > On Fri, Mar 19, 2021 at 10:27:51PM +0800, Julien Rouhaud wrote: > > On Fri, Mar 19, 2021 at 09:29:06AM -0400, Bruce Momjian wrote: > > > OK, that makes perfect sense. I think the best solution is to document > > > that compute_query_id just controls the built-in computation of the > > > query id, and that extensions can also compute it if this is off, and > > > pg_stat_activity and log_line_prefix will display built-in or extension > > > computed query ids. > > > > So the last version of the patch should implement that behavior right? > It's > > just missing some explicit guidance that third-party extensions should > only > > calculate a queryid if compute_query_id is off > > Yes, I think we are now down to just how the extensions should be told > to behave, and how we document this --- see the email I just sent. > > -- > Bruce Momjian https://momjian.us > EDB https://enterprisedb.com > > If only the physical world exists, free will is an illusion. > > > >
Re: shared memory stats: high level design decisions: consistency, dropping
> But now we could instead schedule stats to be removed at commit time. That's not trivial of course, as we'd need to handle cases where the commit fails after the commit record, but before processing the dropped stats. We likely can not remove them at commit time, but only after the oldest open snapshot moves past that commit ? Would an approach where we keep stats in a structure logically similar to MVCC we use for normal tables be completely unfeasible ? We would only need to keep one version per backend in a transaction. --- Hannu On Sat, Mar 20, 2021 at 12:51 AM Andres Freund wrote: > > Hi, > > I am working on Kyotaro Horiguchi's shared memory stats patch [1] with > the goal of getting it into a shape that I'd be happy to commit. That > thread is quite long and most are probably skipping over new messages in > it. > > There are two high-level design decisions / questions that I think > warrant a wider audience (I'll keep lower level discussion in the other > thread). > > In case it is not obvious, the goal of the shared memory stats patch is > to replace the existing statistics collector, to which new stats are > reported via an UDP socket, and where clients read data from the stats > collector by reading the entire database's stats from disk. > > The replacement is to put the statistics into a shared memory > segment. Fixed-size stats (e.g. bgwriter, checkpointer, wal activity, > etc) being stored directly in a struct in memory. Stats for objects > where a variable number exists, e.g. tables, are addressed via a dshash > table that points to the stats that are in turn allocated using dsa.h. > > > 1) What kind of consistency do we want from the pg_stats_* views? > > Right now the first access to stats in a transaction will trigger a read > of both the global and per-database stats from disk. If the on-disk > state is too old, we'll ask the stats collector to write out a new file > a couple times.
> > For the rest of the transaction that in-memory state is used unless > pg_stat_clear_snapshot() is called. Which means that multiple reads from > e.g. pg_stat_user_tables will show the same results as before [2]. > > That makes stats accesses quite expensive if there are lots of > objects. > > But it also means that separate stats accesses - which happen all the > time - return something repeatable and kind of consistent. > > Now, the stats aren't really consistent in the sense that they are > really accurate, UDP messages can be lost, or only some of the stats > generated by a TX might not yet have been received, other transactions > haven't yet sent them. Etc. > > > With the shared memory patch the concept of copying all stats for the > current database into local memory at the time of the first stats access > doesn't make sense to me. Horiguchi-san had actually implemented that, > but I argued that that would be cargo-culting an efficiency hack > required by the old storage model forward. > > The cost of doing this is substantial. On master, with a database that > contains 1 million empty tables, any stats access takes ~0.4s and > increases memory usage by 170MB. > > > 1.1) > > I hope everybody agrees with not requiring that stats don't need to be > the way they were at the time of first stat access in a transaction, > even if that first access was to a different stat object than the > currently accessed stat? > > > 1.2) > > Do we expect repeated accesses to the same stat to stay the same through > the transaction? The easiest way to implement stats accesses is to > simply fetch the stats from shared memory ([3]). That would obviously > result in repeated accesses to the same stat potentially returning > changing results over time. > > I think that's perfectly fine, desirable even, for pg_stat_*. > > > 1.3) > > What kind of consistency do we expect between columns of views like > pg_stat_all_tables? 
> > Several of the stats views aren't based on SRFs or composite-type > returning functions, but instead fetch each stat separately: > > E.g. pg_stat_all_tables: > SELECT c.oid AS relid, > n.nspname AS schemaname, > c.relname, > pg_stat_get_numscans(c.oid) AS seq_scan, > pg_stat_get_tuples_returned(c.oid) AS seq_tup_read, > sum(pg_stat_get_numscans(i.indexrelid))::bigint AS idx_scan, > sum(pg_stat_get_tuples_fetched(i.indexrelid))::bigint + > pg_stat_get_tuples_fetched(c.oid) AS idx_tup_fetch, > pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins, > ... > pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count >FROM pg_class c > LEFT JOIN pg_index i ON c.oid = i.indrelid > ... > > Which means that if we do not cache stats, additional stats updates > could have been applied between two stats accessors. E.g the seq_scan > from before some pgstat_report_stat() but the seq_tup_read from after. > > If we instead fetch all of a table's stats in one go, we would get > consistency between the columns. But obviously that'd require changing > all the stats views.
Re: shared memory stats: high level design decisions: consistency, dropping
On Sat, Mar 20, 2021 at 1:21 AM Andres Freund wrote: > > Hi, > > On 2021-03-20 01:16:31 +0100, Hannu Krosing wrote: > > > But now we could instead schedule stats to be removed at commit > > time. That's not trivial of course, as we'd need to handle cases where > > the commit fails after the commit record, but before processing the > > dropped stats. > > > > We likely can not remove them at commit time, but only after the > > oldest open snapshot moves parts that commit ? > > I don't see why? A dropped table is dropped, and cannot be accessed > anymore. Snapshots don't play a role here - the underlying data is gone > (minus a placeholder file to avoid reusing the oid, until the next > commit). If you run a vacuum on some unrelated table in the same > database, the stats for a dropped table will already be removed long > before there's no relation that could theoretically open the table. > > Note that table level locking would prevent a table from being dropped > if a long-running transaction has already accessed it. Yeah, just checked. DROP TABLE waits until the reading transaction finishes. > > > Would an approach where we keep stats in a structure logically similar > > to MVCC we use for normal tables be completely unfeasible ? > > Yes, pretty unfeasible. Stats should work on standbys too... I did not mean actually using MVCC and real transaction ids but rather similar approach, where (potentially) different stats rows are kept for each backend. This of course only is a win in case multiple backends can use the same stats row. Else it is easier to copy the backends version into backend local memory. But I myself do not see any problem with stats rows changing all the time. The only worry would be parts of the same row being out of sync. This can of course be solved by locking, but for large number of backends with tiny transactions this locking itself could potentially become a problem. 
Here alternating between two or more versions could help, and then it also starts to make sense to keep the copies in shared memory. > Regards, > > Andres
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Fri, Mar 19, 2021 at 2:29 PM Bruce Momjian wrote: > > OK, that makes perfect sense. I think the best solution is to document > that compute_query_id just controls the built-in computation of the > query id, and that extensions can also compute it if this is off, and > pg_stat_activity and log_line_prefix will display built-in or extension > computed query ids. > > It might be interesting someday to check if the hook changed a > pre-computed query id and warn the user in the logs, but that could > cause more log-spam problems than help. The log-spam could be mitigated by logging it just once per connection, the first time it is overridden. Also, we could ask the extensions to expose the "method name" in a read-only GUC, so one can do SHOW compute_query_id_method; and get the name of the method in use, e.g. compute_query_id_method --- builtin. And it may even dynamically change to indicate the overriding of the builtin: compute_query_id_method --- fancy_compute_query_id (overrides builtin)
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Thu, Mar 18, 2021 at 4:23 PM Pavel Stehule wrote: > But we don't support this feature. We are changing just a top scope's label. > So syntax "ALIAS FOR FUNCTION" is not good. The user can have false hopes In this case it looks like it should go together with other labels and have << label_here >> syntax ? And we are back to the question of where to put this top scope label :) Maybe we could still pull the function arguments into the outermost block's scope, and advise users to have an extra block if they want to have the same names in both ? CREATE OR REPLACE FUNCTION fx(a int, b int) RETURNS ... AS $$ << fnargs >> BEGIN << topblock >> DECLARE a int := fnargs.a * 2; b int := topblock.a + 2; BEGIN END; END; $$; and we could make the empty outer block optional and treat two consecutive labels as top scope label and outermost block label CREATE OR REPLACE FUNCTION fx(a int, b int) RETURNS ... AS $$ << fnargs >> << topblock >> DECLARE a int := fnargs.a * 2; b int := topblock.a + 2; BEGIN END; $$; But I agree that this is also not ideal. And CREATE OR REPLACE FUNCTION fx(a int, b int) WITH (TOPSCOPE_LABEL fnargs) RETURNS ... AS $$ << topblock >> DECLARE a int := fnargs.a * 2; b int := topblock.a + 2; BEGIN END; $$; is even more inconsistent than the #option syntax Cheers Hannu PS: > > For cleanness/orthogonality I would also prefer the block labels to be in > DECLARE > for each block, but this train has already left :) > Though we probably could add alternative syntax ALIAS FOR BLOCK ? > > why? Is it a joke? > > you are defining a block label, and you want to in the same block redefine > some outer label? I don't think it is a good idea.
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Thu, Mar 18, 2021 at 3:45 PM Pavel Stehule wrote: > > > > On Thu, Mar 18, 2021 at 15:19 Hannu Krosing wrote: ... >> Variation could be >> >> DECLARE >>fnarg ALIAS FOR FUNCTION a_function_with_an_inconveniently_long_name; >> >> so it is clear that we are aliasing current function name >> >> >> or even make the function name optional, as there can only be one, so >> >> DECLARE >>fnarg ALIAS FOR FUNCTION; > > > Why you think so it is better than just > > #routine_label fnarg > > ? It just adds a *third* way to (re)name things, in addition to << blocklabel >> and ALIAS FOR. For some reason it feels to me similar to having completely different syntax rules for function names, argument names and variable names instead of all having to follow the rules for "identifiers". For cleanness/orthogonality I would also prefer the block labels to be in DECLARE for each block, but this train has already left :) Though we probably could add alternative syntax ALIAS FOR BLOCK ? > Maybe the name of the option can be routine_alias? Is it better for you? I don't think the name itself is bad, as it is supposed to be used for both FUNCTION and PROCEDURE renaming > My objection to DECLARE is freedom of this clause. It can be used everywhere. > the effect of DECLARE is block limited. So theoretically somebody can write > > BEGIN > .. > DECLARE fnarg ALIAS FOR FUNCTION; > BEGIN > .. > END; > > END; > > It disallows simple implementation. We could limit ALIAS FOR FUNCTION to the outermost block only, and revisit this later if there are loud complaints (which I don't think there will be :) ) Cheers Hannu
Re: pl/pgsql feature request: shorthand for argument and local variable references
I went to the archives and read the whole discussion for this from the beginning. I did not really understand why using the ALIAS FOR syntax is significantly harder to implement than #routine_label. The things you mentioned as currently using #OPTION seem to be in a different category from the aliases and block labels - they are more like pragmas - so to me it still feels inconsistent to use #routine_label for renaming the outer namespace. Also checked the code, and indeed there is support for #OPTION DUMP and #VARIABLE_CONFLICT ERROR. Is it documented and just hard to find ? On Thu, Mar 18, 2021 at 3:19 PM Hannu Krosing wrote: > > On Thu, Mar 18, 2021 at 5:27 AM Pavel Stehule wrote: > > > > > > There are few main reasons: > > > > a) compile options are available already - so I don't need invent new > > syntax - #OPTION DUMP, #PRINT_STRICT ON, #VARIABLE_CONFLICT ERROR > > Are these documented anywhere ? > > At least a quick search for pl/pgsql + #OPTION DUMP, #PRINT_STRICT ON, > #VARIABLE_CONFLICT ERROR returned nothing relevant and also searching > in > > If pl/pgsql actually supports any of these, these should get documented! > > > > For me the most logical place for declaring a function name alias > would be to allow ALIAS FOR the function name in DECLARE section. > > plpgsql_test=# CREATE FUNCTION > a_function_with_an_inconveniently_long_name(i int , OUT o int) > LANGUAGE plpgsql AS $plpgsql$ > DECLARE >fnarg ALIAS FOR a_function_with_an_inconveniently_long_name; > BEGIN >fnarg.o = 2 * fnarg.i; > END; > $plpgsql$; > > but unfortunately this currently returns > > ERROR: 42704: variable "a_function_with_an_inconveniently_long_name" > does not exist > LINE 4:fnarg ALIAS FOR a_function_with_an_inconveniently_long_na...
>^ > LOCATION: plpgsql_yyparse, pl_gram.y:677 > > Variation could be > > DECLARE >fnarg ALIAS FOR FUNCTION a_function_with_an_inconveniently_long_name; > > so it is clear that we are aliasing current function name > > or even make the function name optional, as there can only be one, so > > DECLARE >fnarg ALIAS FOR FUNCTION; > > > Cheers > Hannu
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Thu, Mar 18, 2021 at 5:27 AM Pavel Stehule wrote: > > > There are few main reasons: > > a) compile options are available already - so I don't need invent new syntax > - #OPTION DUMP, #PRINT_STRICT ON, #VARIABLE_CONFLICT ERROR Are these documented anywhere ? At least a quick search for pl/pgsql + #OPTION DUMP, #PRINT_STRICT ON, #VARIABLE_CONFLICT ERROR returned nothing relevant and also searching in If pl/pgsql actually supports any of these, these should get documented! For me the most logical place for declaring a function name alias would be to allow ALIAS FOR the function name in DECLARE section. plpgsql_test=# CREATE FUNCTION a_function_with_an_inconveniently_long_name(i int , OUT o int) LANGUAGE plpgsql AS $plpgsql$ DECLARE fnarg ALIAS FOR a_function_with_an_inconveniently_long_name; BEGIN fnarg.o = 2 * fnarg.i; END; $plpgsql$; but unfortunately this currently returns ERROR: 42704: variable "a_function_with_an_inconveniently_long_name" does not exist LINE 4:fnarg ALIAS FOR a_function_with_an_inconveniently_long_na... ^ LOCATION: plpgsql_yyparse, pl_gram.y:677 Variation could be DECLARE fnarg ALIAS FOR FUNCTION a_function_with_an_inconveniently_long_name; so it is clear that we are aliasing current function name or even make the function name optional, as there can only be one, so DECLARE fnarg ALIAS FOR FUNCTION; Cheers Hannu
Re: pl/pgsql feature request: shorthand for argument and local variable references
why are you using yet another special syntax for this ? would it not be better to do something like this: CREATE FUNCTION a_reall_long_and_winding_function_name(i int, out o int) LANGUAGE plpgsql AS $plpgsql$ DECLARE args function_name_alias BEGIN args.o = 2 * args.i END; $plpgsql$; or at least do something based on block labels (see end of https://www.postgresql.org/docs/13/plpgsql-structure.html ) CREATE FUNCTION somefunc() RETURNS integer AS $$ << outerblock >> DECLARE ... maybe extend this to CREATE FUNCTION somefunc() RETURNS integer AS $$ << functionlabel:args >> << outerblock >> DECLARE ... or replace 'functionlabel' with something less ugly :) maybe <<< args >>> Cheers Hannu On Wed, Mar 17, 2021 at 5:05 PM Pavel Stehule wrote: > > > > st 17. 3. 2021 v 9:20 odesílatel Michael Paquier napsal: >> >> On Wed, Mar 17, 2021 at 02:06:57PM +0800, Julien Rouhaud wrote: >> > I also think that there should be a single usable top label, otherwise it >> > will >> > lead to confusing code and it can be a source of bug. >> >> Okay, that's fine by me. Could it be possible to come up with an >> approach that does not hijack the namespace list contruction and the >> lookup logic as much as it does now? I get it that the patch is done >> this way because of the ordering of operations done for the initial ns >> list creation and the grammar parsing that adds the routine label on >> top of the root one, but I'd like to believe that there are more solid >> ways to achieve that, like for example something that decouples the >> root label and its alias, or something that associates an alias >> directly with its PLpgSQL_nsitem entry? > > > I am checking it now, and I don't see any easy solution. The namespace is a > one direction tree - only backward iteration from leafs to roots is > supported. At the moment, when I try to replace the name of root ns_item, > this root ns_item is referenced by the function's arguments (NS_TYPE_VAR > type). 
So anytime I need some helper ns_item node, that can be a persistent > target for these nodes. In this case is a memory overhead of just one ns_item. > > orig_label <- argument1 <- argument2 > > The patch has to save the original orig_label because it can be referenced > from argument1 or by "FOUND" variable. New graphs looks like > > new_label <- invalidated orig_label <- argument1 <- ... > > This tree has a different direction than is usual, and then replacing the > root node is not simple. > > Second solution (significantly more simple) is an additional pointer in > ns_item structure. In almost all cases this pointer will not be used. Because > ns_item uses a flexible array, then union cannot be used. I implemented this > in a patch marked as "alias-implementation". > > What do you think about it? > > Pavel > > > > > > >> >> -- >> Michael
Re: Boundary value check in lazy_tid_reaped()
We could also go parallel in another direction - I have been mulling about writing a "vectorized" bsearch which would use AVX2, where you look up 64 (or more likely 32, so tids also fit in a 256-bit vector) tids at a time. The trickiest part is that the search can complete at a different iteration for each value. Of course it is possible that this has very bad RAM access behaviour and is no win at all even if it otherwise works. -- Hannu Krosing On Tue, Mar 16, 2021 at 10:08 PM Peter Geoghegan wrote: > On Sun, Mar 14, 2021 at 4:22 PM Thomas Munro > wrote: > > BTW I got around to trying this idea out for a specialised > > bsearch_itemptr() using a wide comparator, over here: > > Cool! > > I have another thing that should be considered when we revisit this > area in the future: maybe we should structure the binary search to > lookup multiple TIDs at once -- probably all of the TIDs from a given > leaf page, taken together. > > There is now an nbtree function used for tuple deletion (all tuple > deletion, not just bottom-up deletion) that works like this: > _bt_delitems_delete_check(). I suspect that it would make sense to > generalize it to do the same thing for regular VACUUM. Perhaps this > idea would have to be combined with other techniques to show a real > benefit. It would probably be necessary to sort the TIDs first (just > like index_delete_sort() does for the _bt_delitems_delete_check() code > today), but that's probably no big deal. > > It is typical to have 200 - 400 TIDs on an nbtree leaf page without > using deduplication. And with deduplication enabled you can have as > many as 1300 TIDs on a single 8KiB nbtree leaf page. It's easy to > imagine something like GCC's __builtin_prefetch() (or maybe just more > predictable access patterns in our "batch binary search") making > everything much faster through batching.
This will also naturally make > btvacuumpage() much less "branchy", since of course it will no longer > need to process the page one TID at a time -- that helps too. > > -- > Peter Geoghegan > > >
Re: Extensibility of the PostgreSQL wire protocol
On Thu, Mar 4, 2021 at 9:55 PM Jan Wieck wrote: > > Another possibility, and this is what is being used by the AWS team > implementing the TDS protocol for Babelfish, is to completely replace > the entire TCOP mainloop function PostgresMain(). I suspect this is the only reasonable way to do it for protocols which are not very close to libpq. > That is of course a > rather drastic move and requires a lot more coding on the extension > side, Not necessarily - if the new protocol is close to existing one, then it is copy/paste + some changes. If it is radically different, then trying to fit it into the current mainloop will be even harder than writing from scratch. And will very likely fail in the end anyway :) > but the whole thing was developed that way from the beginning and > it is working. I don't have a definitive date when that code will be > presented. Kuntal or Prateek may be able to fill in more details. Are you really fully replacing the main loop, or are you running a second main loop in parallel in the same database server instance, perhaps as a separate TDS_postmaster backend ? Will the data still also be accessible "as postgres" via port 5432 when TDS/SQLServer support is active ?
Re: PROXY protocol support
The current proposal seems to miss the case of transaction pooling (and statement pooling) where the same established connection multiplexes transactions / statements from multiple remote clients. What we would need for that case would be a function pg_set_remote_client_address(be_key, remote_ip, remote_hostname) where only be_key and remote_ip are required, but any string (up to a certain length) would be accepted as the hostname. It would be really nice if we could send this request at the protocol level, but if that is hard to do then having a function would get us half way there. The be_key in the function is the key from PGcancel, which is stored by libpq when making the connection, and it is there to make sure that only the directly connecting proxy can successfully call the function. Cheers Hannu On Fri, Mar 5, 2021 at 12:21 AM Jacob Champion wrote: > > On Thu, 2021-03-04 at 21:45 +0100, Magnus Hagander wrote: > > On Thu, Mar 4, 2021 at 9:07 PM Jacob Champion wrote: > > > Idle thought I had while setting up a local test rig: Are there any > > > compelling cases for allowing PROXY packets to arrive over Unix > > > sockets? (By which I mean, the proxy is running on the same machine as > > > Postgres, and connects to it using the .s.PGSQL socket file instead of > > > TCP.) Are there cases where you want some other software to interact > > > with the TCP stack instead of Postgres, but it'd still be nice to have > > > the original connection information available? > > > > I'm uncertain what that usecase would be for something like haproxy, > > tbh. It can't do connection pooling, so adding it on the same machine > > as postgres itself wouldn't really add anything, I think? > > Yeah, I wasn't thinking HAproxy so much as some unspecified software > appliance that's performing Some Task before allowing a TCP client to > speak to Postgres. But it'd be better to hear from someone that has an > actual use case, instead of me spitballing.
> > I did think about the other end, if you had a proxy on a different > machine accepting unix connections and passing them on over > PROXY-over-tcp. But I doubt it's useful to know it was unix in that > case (since it still couldn't do peer or such for the auth) -- > instead, that seems like an argument where it'd be better to proxy > without using PROXY and just letting the IP address be. > > You could potentially design a system that lets you proxy a "local all > all trust" setup from a different (trusted) machine, without having to > actually let people onto the machine that's running Postgres. That > would require some additional authentication on the PROXY connection > (i.e. something stronger than host-based auth) to actually be useful. > > -- other notes -- > > A small nitpick on the current separate-port PoC is that I'm forced to > set up a "regular" TCP port, even if I only want the PROXY behavior. > > The original-host logging isn't working for me: > > WARNING: pg_getnameinfo_all() failed: ai_family not supported > LOG: proxy connection from: host=??? port=??? > > and I think the culprit is this: > >/* Store a copy of the original address, for logging */ > >memcpy(&port->raddr_save, &port->raddr, port->raddr.salen); > > port->raddr.salen is the length of port->raddr.addr; we want the length > of the copy to be sizeof(port->raddr) here, no? > > --Jacob
Re: Extensibility of the PostgreSQL wire protocol
I have not looked at the actual patch, but does it allow an extension to set up its own channels to listen on? For example, if I wanted to set up a server to listen for incoming connections over QUIC [1] - a protocol which creates a connection over UDP and allows clients to move to new IP addresses (among other things) - then would the current extensibility proposal cover this? Maybe a correct approach would be to just start up a separate "postmaster" to listen to a different protocol? [1] https://en.wikipedia.org/wiki/QUIC Cheers Hannu
Re: We should stop telling users to "vacuum that database in single-user mode"
On Wed, Mar 3, 2021 at 11:33 AM David Rowley wrote: > > On Wed, 3 Mar 2021 at 21:44, Magnus Hagander wrote: ... > > I think we misunderstand each other. I meant this only as a comment > > about the idea of ignoring the cost limit in single user mode -- that > > is, it's a reason to *want* vacuum to not run as quickly as possible > > in single user mode. I should've trimmed the email better. > > I meant to ignore the cost limits if we're within a hundred million or > so of the stopLimit. Per what Hannu mentioned, there does not seem to > be a great need with current versions of PostgreSQL to restart the > instance in single-user mode. VACUUM still works once we're beyond the > stopLimit. It's just commands that need to generate a new XID that'll > fail with the error message mentioned by Hannu. I am investigating the possibility of introducing a special "Restricted Maintenance Mode" to let the admin mitigate the situation after xidStopLimit is reached, maybe for another 0.5M txids, by doing things like
* dropping an index - to make vacuum faster
* dropping a table - sometimes it is better to drop a table in order to get the production database functional again instead of waiting hours for the vacuum to finish, and then later restore it from backup or maybe access it from a read-only clone of the database via FDW
* dropping a stale replication slot which is holding back vacuum
To make sure that this will not accidentally just move xidStopLimit by 0.5M for users who run their main workloads as a superuser (they do exist!) this mode should be restricted to
* superuser only
* only a subset of commands / functions
* heavily throttled to avoid running out of TXIDs, maybe 1-10 xids per second
* maybe also requiring a GUC to be set, to be very explicit
> > I agree with your other idea, that of kicking in a more aggressive > > autovacuum if it's not dealing with things fast enough.
Maybe even in > > an incremental way - that is run with the default, then at another > > threshold drop them to half, and at yet another threshold drop them to > > 0. I agree that pretty much anything is better than forcing the user > > into single user mode. > > OK cool. I wondered if it should be reduced incrementally or just > switch off the cost limit completely once we're beyond > ShmemVariableCache->xidStopLimit. An abrupt change is more likely to make the user/DBA notice that something is going on. I have even been thinking about deliberate throttling to make the user notice / pay attention. > If we did want it to be incremental > then if we had say ShmemVariableCache->xidFastVacLimit, which was > about 100 million xids before xidStopLimit, then the code could adjust > the sleep delay down by the percentage through we are from > xidFastVacLimit to xidStopLimit. > > However, if we want to keep adjusting the sleep delay then we need to > make that work for vacuums that are running already. We don't want to > call ReadNextTransactionId() too often, but maybe if we did it once > per 10 seconds worth of vacuum_delay_point()s. That way we'd never do > it for vacuums already going at full speed. There are already examples of this in the code, for example the decision to force-start disabled autovacuum is considered after every 64k transactions. There is a related item in https://commitfest.postgresql.org/32/2983/ . When that gets done, we could drive the adjustments from autovacuum.c by adding the remaining-XID-range adjustment to the existing worker delay adjustment mechanism in autovac_balance_cost() and signalling the autovacuum backend to run the adjustment every few seconds once we are in the danger zone. Cheers Hannu
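As a starting point for any such adjustment logic, the distance to wraparound can already be watched from SQL with a standard monitoring query (this is existing functionality, not part of the proposal):

```sql
-- Age of the oldest unfrozen XID per database. Autovacuum tries to keep
-- this below autovacuum_freeze_max_age; the hard stop limit discussed
-- above is hit a few million XIDs short of the 2^31 wraparound point.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```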
We should stop telling users to "vacuum that database in single-user mode"
restart the database or just kill the vacuum process after reloading flags. After this it should complete in 15-30 min, after which the database is available for writes again. Cheers, Hannu Krosing
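The "kill the vacuum process after reloading flags" step can be done from SQL; a sketch, assuming the vacuum_cost_* settings were already relaxed in postgresql.conf:

```sql
-- Pick up the relaxed cost-limit settings without restarting the server
SELECT pg_reload_conf();

-- Cancel the currently running autovacuum worker(s) so the table is
-- re-vacuumed with the new, faster settings
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';
```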
Re: CREATE ROUTINE MAPPING
Hi Corey, Have you looked at pl/proxy? It does this and then some (sharding). It actually started out as a set of pl/pythonu functions, but then got formalized into a full extension language for defining remote (potentially sharded) function calls. Best Regards Hannu Krosing On Fri, 12 Jan 2018 at 03:38, Corey Huinker wrote: > A few months ago, I was researching ways for formalizing calling functions > on one postgres instance from another. RPC, basically. In doing so, I > stumbled across an obscure part of the SQL Standard called ROUTINE > MAPPING, which is exactly what I'm looking for. > > The syntax specified is, roughly: > > CREATE ROUTINE MAPPING local_routine_name FOR remote_routine_spec > SERVER my_server [ OPTIONS( ... ) ] > > > Which isn't too different from CREATE USER MAPPING. > > The idea here is that if I had a local query: > > SELECT t.x, remote_func1(), remote_func2(t.y) > > FROM remote_table t > > WHERE t.active = true; > > > that would become this query on the remote side: > > SELECT t.x, local_func1(), local_func2(t.y) > > FROM local_table t > > WHERE t.active = true; > > > > That was probably the main intention of this feature, but I see a > different possibility there. Consider the cases: > > SELECT remote_func(1,'a'); > > > and > > SELECT * FROM remote_srf(10, true); > > > Now we could have written remote_func() and remote_srf() in plpythonu, and > it could access whatever remote data that we wanted to see, but that > exposes our local server to the untrusted pl/python module as well as > python process overhead. > > We could create a specialized foreign data wrapper that requires a WHERE > clause to include all the required parameters as predicates, essentially > making every function a table, but that's awkward and unclear to an end > user.
> > Having the ability to import functions from other servers allows us to > write foreign servers that expose functions to the local database, and > those foreign servers handle the bloat and risks associated with accessing > that remote data. > > Moreover, it would allow hosted environments (AWS, etc) that restrict the > extensions that can be added to the database to still connect to those > foreign data sources. > > I'm hoping to submit a patch for this someday, but it touches on several > areas of the codebase where I have no familiarity, so I've put forth to > spark interest in the feature, to see if any similar work is underway, or > if anyone can offer guidance. > > Thanks in advance. >
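For reference, a minimal pl/proxy remote function looks roughly like this (the function and the 'userdb' cluster name are illustrative; a PL/Proxy cluster must be configured separately, see the PL/Proxy documentation):

```sql
-- Runs get_user_email() on the remote partition selected by hashing the
-- username; 'userdb' must be a configured PL/Proxy cluster.
CREATE FUNCTION get_user_email(i_username text)
RETURNS text AS $$
    CLUSTER 'userdb';
    RUN ON hashtext(i_username);
$$ LANGUAGE plproxy;
```

The local call `SELECT get_user_email('alice')` is then executed as the same-named function on the chosen remote node, which is essentially what CREATE ROUTINE MAPPING would formalize.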
Re: Proposal: http2 wire format
> > > * room for other resultset formats later. Like Damir, I really want to > add > > protobuf or json serializations of result sets at some point, mainly so > we > > can return "entity graphs" in graph representation rather than left-join > > projection. > > -1. I don't think this belongs in postgres. > Maybe the functionality does not belong in *core* postgres, but I sure would like it to be possible for an extension to do it. A bit similar to what logical decoding plugins do now, just more flexible in terms of protocol. Cheers Hannu
A Generic Question about Generic type subscripting
Sorry for being late to the party. I started looking at the thread about "Generic type subscripting" and am wondering: why does it take the approach of modifying pg_type and lots of internal functions, when instead it could be defined in a much lighter and less intrusive way as an operator, probably by reserving a dedicated operator name CREATE OPERATOR [...] (PROCEDURE = json_object_field, LEFTARG=jsonb, RIGHTARG=text); CREATE OPERATOR [...] (PROCEDURE = jsonb_array_element, LEFTARG=jsonb, RIGHTARG=int); This might put more work on the writers of actual subscripting operators, but if we are looking at diverse new types, it may also be that writing operator functions is even easier than learning a full new skillset for writing "subscripting functions". Defining "subscripting" as an operator does still require changes to how current subscripting operations are parsed, but even there I am not sure it would be more complex, as all the parser has to do is determine that it is a subscripting operation and then delegate to the corresponding operator. Also, there is the problem of what to do with element (or even slice) assignments, as these require three arguments. I see two possibilities: 1) add a third "ARG" to the CREATE OPERATOR syntax, maybe VALUEARG 2) use composite types - so for jsonb1[int1] = jsonb2 the operator would be defined by first defining CREATE TYPE intkey_and_jsonvalue AS (key int, value jsonb) and then using this in CREATE OPERATOR [...] (PROCEDURE = jsonb_set_key, LEFTARG=jsonb, RIGHTARG=intkey_and_jsonvalue) Cheers Hannu
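A sketch of the composite-type variant, with a made-up operator name since the [...] is deliberately left open above (~~> is purely illustrative, and jsonb_set_key is a hypothetical wrapper around the existing jsonb_set):

```sql
-- Composite type carrying both the subscript and the value to assign
CREATE TYPE intkey_and_jsonvalue AS (key int, value jsonb);

-- Hypothetical helper: set element kv.key of target to kv.value
CREATE FUNCTION jsonb_set_key(target jsonb, kv intkey_and_jsonvalue)
RETURNS jsonb AS $$
    SELECT jsonb_set(target, ARRAY[(kv).key::text], (kv).value)
$$ LANGUAGE sql;

-- ~~> stands in for whatever dedicated operator name would be reserved
CREATE OPERATOR ~~> (
    PROCEDURE = jsonb_set_key,
    LEFTARG   = jsonb,
    RIGHTARG  = intkey_and_jsonvalue
);
```

With this in place, `jsonb1[int1] = jsonb2` would be parsed as `jsonb1 ~~> ROW(int1, jsonb2)::intkey_and_jsonvalue`.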
Re: AS OF queries
On 20.12.2017 14:45, Konstantin Knizhnik wrote: > I wonder if Postgres community is interested in supporting time travel > queries in PostgreSQL (something like AS OF queries in Oracle: > https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm). > As far as I know something similar is now developed for MariaDB. > > It seems to me that it will be not so difficult to implement them in > Postgres - we already have versions of tuples. > Looks like we only need to do three things: > 1. Disable autovacuum (autovacuum = off) In the design of the original University Postgres (which was a full-history database geared towards WORM drives) it was the task of vacuum to move old tuples to "an archive" from where the AS OF queries would then fetch them as needed. This might also be a good place to do Commit LSN to Commit Timestamp translation. Hannu > 2. Enable commit timestamp (track_commit_timestamp = on) > 3. Add asofTimestamp to snapshot and patch XidInMVCCSnapshot to > compare commit timestamps when it is specified in snapshot. > > > Attached please find my prototype implementation of it. > Most of the efforts are needed to support asof timestamp in grammar > and add it to query plan. > I failed to support AS OF clause (as in Oracle) because of > shift-reduce conflicts with aliases, > so I have to introduce new ASOF keyword. May be yacc experts can > propose how to solve this conflict without introducing new keyword... > > Please notice that now ASOF timestamp is used only for data snapshot, > not for catalog snapshot. > I am not sure that it is possible (and useful) to travel through > database schema history...
> > Below is an example of how it works: > > postgres=# create table foo(pk serial primary key, ts timestamp > default now(), val text); > CREATE TABLE > postgres=# insert into foo (val) values ('insert'); > INSERT 0 1 > postgres=# insert into foo (val) values ('insert'); > INSERT 0 1 > postgres=# insert into foo (val) values ('insert'); > INSERT 0 1 > postgres=# select * from foo; > pk | ts | val > ++ > 1 | 2017-12-20 14:59:17.715453 | insert > 2 | 2017-12-20 14:59:22.933753 | insert > 3 | 2017-12-20 14:59:27.87712 | insert > (3 rows) > > postgres=# select * from foo asof timestamp '2017-12-20 14:59:25'; > pk | ts | val > ++ > 1 | 2017-12-20 14:59:17.715453 | insert > 2 | 2017-12-20 14:59:22.933753 | insert > (2 rows) > > postgres=# select * from foo asof timestamp '2017-12-20 14:59:20'; > pk | ts | val > ++ > 1 | 2017-12-20 14:59:17.715453 | insert > (1 row) > > postgres=# update foo set val='upd',ts=now() where pk=1; > UPDATE 1 > postgres=# select * from foo asof timestamp '2017-12-20 14:59:20'; > pk | ts | val > ++ > 1 | 2017-12-20 14:59:17.715453 | insert > (1 row) > > postgres=# select * from foo; > pk | ts | val > ++ > 2 | 2017-12-20 14:59:22.933753 | insert > 3 | 2017-12-20 14:59:27.87712 | insert > 1 | 2017-12-20 15:09:17.046047 | upd > (3 rows) > > postgres=# update foo set val='upd2',ts=now() where pk=1; > UPDATE 1 > postgres=# select * from foo asof timestamp '2017-12-20 15:10'; > pk | ts | val > ++ > 2 | 2017-12-20 14:59:22.933753 | insert > 3 | 2017-12-20 14:59:27.87712 | insert > 1 | 2017-12-20 15:09:17.046047 | upd > (3 rows) > > > Comments and feedback are welcome:) > -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability https://2ndquadrant.com/