Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Jan Wieck

On 6/8/23 15:57, Dave Cramer wrote:


Apparently this is coming in pgbouncer: Support of prepared statements by 
knizhnik · Pull Request #845 · pgbouncer/pgbouncer (github.com) 



I am quite interested in that patch. Considering how pgbouncer works 
internally I am very curious.



Jan





Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Jan Wieck

On 6/8/23 13:31, Dave Cramer wrote:


On Thu, 8 Jun 2023 at 11:22, Konstantin Knizhnik wrote:





So it will be the responsibility of the client to remember the text of a
prepared query to be able to resend it when the statement doesn't exist at
the server? IMHO a very strange decision. Why not handle it in the
connection pooler (external or embedded, it doesn't matter)?


I may be myopic, but in the JDBC world (and I assume others) we have a 
`PreparedStatement` object which has the text of the query.

The text is readily available to us.

Also, again from the JDBC point of view, we normally use unnamed statements 
and then name them after 5 uses, so we already have embedded 
logic for dealing with PreparedStatements.


The entire problem only surfaces when using a connection pool of one 
sort or another. Without one the session is persistent to the client.


At some point I created a "functional" proof of concept for a connection 
pool that did a mapping of the client-side name to a pool-managed server-side 
name. It kept track of which query was known by which server. It kept a 
hashtable of poolname+username+query MD5 sums. On each prepare request 
it would look up whether that query is known, add a query-client reference in 
another hashtable, and so on. On a Bind/Exec message it would check that 
the server has the query prepared and issue a P message if not. What was 
missing was keeping track of no-longer-needed queries and deallocating them.


As said, it was a POC. Since it was implemented in Tcl it performed 
miserably, but I got it to the point of being able to pause & resume, and 
the whole thing did work with prepared statements on the transaction 
level. So it was a fully functioning POC.


What makes this design appealing to me is that it is completely 
transparent to every existing client that uses the extended query 
protocol for server side prepared statements.



Jan





Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Jan Wieck

On 6/8/23 10:56, Dave Cramer wrote:





On Thu, 8 Jun 2023 at 10:31, Jan Wieck wrote:


On 6/8/23 09:53, Jan Wieck wrote:
 > On 6/8/23 09:21, Dave Cramer wrote:
 > The server doesn't know about all the clients of the pooler, does it? It
 > has no way of telling if/when a client disconnects from the pooler.

Another problem that complicates doing it in the server is that the
information required to (re-)prepare a statement in a backend that
currently doesn't have it needs to be kept in shared memory. This
includes the query string itself. Doing that without shared memory in a
pooler that is multi-threaded or based on async-IO is much simpler and
allows for easy ballooning.


I don't expect the server to re-prepare the statement. If the server 
responds with "statement doesn't exist" the client would send a prepare.


Are you proposing a new libpq protocol version?


Jan




Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Jan Wieck

On 6/8/23 09:53, Jan Wieck wrote:

On 6/8/23 09:21, Dave Cramer wrote:
The server doesn't know about all the clients of the pooler, does it? It
has no way of telling if/when a client disconnects from the pooler.


Another problem that complicates doing it in the server is that the 
information required to (re-)prepare a statement in a backend that 
currently doesn't have it needs to be kept in shared memory. This 
includes the query string itself. Doing that without shared memory in a 
pooler that is multi-threaded or based on async-IO is much simpler and 
allows for easy ballooning.



Jan





Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Jan Wieck

On 6/8/23 09:21, Dave Cramer wrote:



On Thu, Jun 8, 2023 at 8:43 AM Jan Wieck wrote:


On 6/8/23 02:15, Konstantin Knizhnik wrote:

 > There is a PR with support for prepared statements for pgbouncer:
 > https://github.com/pgbouncer/pgbouncer/pull/845
 > any feedback, reviews and suggestions are welcome.

I was about to say that the support would have to come from the pooler
as it is possible to have multiple applications in different languages
connecting to the same pool(s).


Why from the pooler? If it were done at the server every client could 
use it?


The server doesn't know about all the clients of the pooler, does it? It 
has no way of telling if/when a client disconnects from the pooler.



Jan




Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Jan Wieck

On 6/8/23 02:15, Konstantin Knizhnik wrote:


There is a PR with support for prepared statements for pgbouncer:
https://github.com/pgbouncer/pgbouncer/pull/845
any feedback, reviews and suggestions are welcome.


I was about to say that the support would have to come from the pooler 
as it is possible to have multiple applications in different languages 
connecting to the same pool(s).


I can certainly give this a try, possibly over the weekend. I have a 
TPC-C implementation that can use prepared statements plus pause/resume. 
That might be a good stress test for it.



Best Regards, Jan




Re: Add 64-bit XIDs into PostgreSQL 15

2022-11-29 Thread Jan Wieck

On 11/29/22 09:46, Bruce Momjian wrote:

As far as I know, all our freeze values are focused on avoiding XID
wraparound.  If XID wraparound is no longer an issue, we might find that
our freeze limits can be much higher than they are now.



I'd be careful in that direction, as those values, together with 
maintenance_work_mem, also keep a lid on excessive index cleanup rounds.



Regards, Jan




Re: PL/pgSQL cursors should get generated portal names by default

2022-11-07 Thread Jan Wieck
My comments were in no way meant as an argument for or against the 
change itself. Only to clearly document the side effect it will have.



Regards, Jan


On 11/7/22 11:57, Kirk Wolak wrote:



On Mon, Nov 7, 2022 at 11:10 AM Jan Wieck wrote:


On 11/4/22 19:46, Tom Lane wrote:
 > Jan Wieck writes:
 >> I need to do some testing on this. I seem to recall that the naming was
 >> originally done because a reference cursor is basically a named cursor
 >> that can be handed around between functions and even the top SQL level
 >> of the application. For the latter to work the application needs to know
 >> the name of the portal.
 >
 > Right.  With this patch, it'd be necessary to hand back the actual
 > portal name (by returning the refcursor value), or else manually
 > set the refcursor value before OPEN to preserve the previous behavior.
 > But as far as I saw, all our documentation examples show handing back
 > the portal name, so I'm hoping most people do it like that already.

I was mostly concerned that we may unintentionally break underdocumented
behavior that was originally implemented on purpose. As long as everyone
is aware that this is breaking backwards compatibility in the way it
does, that's fine.


I respect the concern, and applied some deeper thinking to it...

Here is the logic I am applying to this compatibility issue and what may 
break.

[FWIW, my motto is to be wrong out loud, as you learn faster]

At first pass, I thought "Well, since this does not break a refcursor, 
which is the obvious use case for RETURNING/PASSING, we are fine!"


But in trying to DEFEND this case, I have come up with an example of code 
(that makes some SENSE, but would break):


CREATE FUNCTION test() RETURNS refcursor LANGUAGE plpgsql AS $$
DECLARE
     cur_this CURSOR FOR SELECT 1;
     ref_cur refcursor;
BEGIN
     OPEN cur_this;
     ref_cur := 'cur_this';  -- Using the NAME of the cursor as the portal name.
                             -- Should do:  ref_cur := cur_this;  -- only works after OPEN

     RETURN ref_cur;
END;
$$;

As noted in the comments, if the code were:
   ref_cur := 'cur_this';  -- Now you can't just use ref_cur := cur_this;
   OPEN cur_this;
   RETURN ref_cur;
then it would break now...  And even the CORRECT syntax would break, 
since the cursor was not opened yet, so "cur_this" is null.


Now, I have NO IDEA if someone would actually do this.  It is almost 
pathological.  The use case would be a complex cursor with parameters,
and they changed the code to return a refcursor!
This was the ONLY use case I could think of that wasn't HACKY!

HACKY use cases involve a child routine setting:  local_ref_cursor := 
'cur_this'; in order to access a cursor that was NOT passed to the child.
FWIW, I tested this, and it works: I can FETCH in the child routine, 
and it affects the parent's LOOP as it should... WOW.  I would be HAPPY
to break such horrible code; it has to be a security concern at some level.

Personally (and my 2 cents really shouldn't matter much), I think this 
should still be fixed.
Because I believe this small use case is rare, it will break 
immediately, and the fix is trivial (just initialize cur_this := 
'cur_this' in this example),
and the fix removes the non-orthogonal behavior Tom pointed out, which led 
me to report this.


I think I have exhausted the examples of how this impacts a VALID 
refcursor implementation.  I believe any other such versions are 
variations of this!
And maybe we document that if a refcursor of a cursor is to be returned, 
the refcursor is ASSIGNED after the OPEN of the cursor, and that it is 
done without the quotes, as:

   ref_cursor := cur_this;  -- assign the name after opening.

Thanks!








Re: PL/pgSQL cursors should get generated portal names by default

2022-11-07 Thread Jan Wieck

On 11/4/22 19:46, Tom Lane wrote:

Jan Wieck  writes:
I need to do some testing on this. I seem to recall that the naming was 
originally done because a reference cursor is basically a named cursor 
that can be handed around between functions and even the top SQL level 
of the application. For the latter to work the application needs to know 
the name of the portal.


Right.  With this patch, it'd be necessary to hand back the actual
portal name (by returning the refcursor value), or else manually
set the refcursor value before OPEN to preserve the previous behavior.
But as far as I saw, all our documentation examples show handing back
the portal name, so I'm hoping most people do it like that already.


I was mostly concerned that we may unintentionally break underdocumented 
behavior that was originally implemented on purpose. As long as everyone 
is aware that this is breaking backwards compatibility in the way it 
does, that's fine.




I am currently down with Covid and have trouble focusing. But I hope to 
get to it some time next week.


Get well soon!


Thanks, Jan





Re: PL/pgSQL cursors should get generated portal names by default

2022-11-04 Thread Jan Wieck

On 11/4/22 03:22, Pavel Stehule wrote:

Hi


On Wed, 2 Nov 2022 at 0:39, Tom Lane wrote:


There's a complaint at [1] about how you can't re-use the same
cursor variable name within a routine called from another routine
that's already using that name.  The complaint is itself a bit
under-documented, but I believe it is referring to this ancient
bit of behavior:

          A bound cursor variable is initialized to the string value
          representing its name, so that the portal name is the same as
          the cursor variable name, unless the programmer overrides it
          by assignment before opening the cursor.

So if you try to nest usage of two bound cursor variables of the
same name, it blows up on the portal-name conflict.  But it'll work
fine if you use unbound cursors (i.e., plain "refcursor" variables):

          But an unbound cursor
          variable defaults to the null value initially, so it will receive
          an automatically-generated unique name, unless overridden.

I wonder why we did it like that; maybe it's to be bug-compatible with
some Oracle PL/SQL behavior or other?  Anyway, this seems non-orthogonal
and contrary to all principles of structured programming.  We don't even
offer an example of the sort of usage that would benefit from it, ie
that calling code could "just know" what the portal name is.

I propose that we should drop this auto initialization and let all
refcursor variables start out null, so that they'll get unique
portal names unless you take explicit steps to do something else.
As attached.

(Obviously this would be a HEAD-only fix, but maybe there's scope for
improving the back-branch docs along lines similar to these changes.)


I am sending a review of this patch:

1. Patching and compilation work without any problems.
2. All tests passed.
3. The implemented change is documented well.
4. Although this is a potential compatibility break, we want this feature. 
It allows using cursor variables in recursive calls by default, and it 
allows shadowing of cursor variables.

5. This patch is short and almost trivial; it just removes code.

I'll mark this patch as ready for commit.


I need to do some testing on this. I seem to recall that the naming was 
originally done because a reference cursor is basically a named cursor 
that can be handed around between functions and even the top SQL level 
of the application. For the latter to work the application needs to know 
the name of the portal.
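

For illustration, a minimal sketch (table and function names invented for 
this example) of the pattern I recall this being for: the bound cursor's 
portal gets the variable's name by default, so the top SQL level can fetch 
from it simply by knowing that name.

CREATE FUNCTION open_order_cursor() RETURNS refcursor
LANGUAGE plpgsql AS $$
DECLARE
    cur_orders CURSOR FOR SELECT * FROM orders;
BEGIN
    OPEN cur_orders;       -- portal is currently named "cur_orders" by default
    RETURN cur_orders;     -- hand the portal name back to the caller
END;
$$;

-- application side, within the same transaction:
BEGIN;
SELECT open_order_cursor();
FETCH ALL FROM cur_orders; -- works because the portal name is known
COMMIT;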


I am currently down with Covid and have trouble focusing. But I hope to 
get to it some time next week.



Regards, Jan





Re: Limiting memory allocation

2022-05-23 Thread Jan Wieck

On 5/20/22 19:08, Tomas Vondra wrote:

Well, we already have the memory-accounting built into the memory
context infrastructure. It kinda does the same thing as the malloc()
wrapper, except that it does not publish the information anywhere and
it's per-context (so we have to walk the context recursively).

So maybe we could make this on-request somehow? Say, we'd a signal to
the process, and it'd run MemoryContextMemAllocated() on the top memory
context and store the result somewhere.


One remaining problem with all this is that we don't get any feedback 
from calling free() telling us whether any memory has been returned to the 
OS or not.


How portable would using sbrk() with a zero size be? If that is an 
option then I could envision counting actual calls to malloc() and, 
whenever a GUC-configurable amount is reached, use sbrk() to find out the 
actual heap size, do the accounting in shared memory and react accordingly.


This way we are not creating another highly contended lock, and we use 
authoritative information.



Regards, Jan




Re: Limiting memory allocation

2022-05-18 Thread Jan Wieck

On 5/18/22 11:11, Alvaro Herrera wrote:

On 2022-May-18, Jan Wieck wrote:


Maybe I'm missing something, but what is it that you would actually consider
a solution? Knowing your current memory consumption doesn't make the need
for allocating some right now go away. What do you envision the response of
PostgreSQL to be if we had that information about resource pressure?


What was mentioned in the talk where this issue was presented, is that
people would like malloc() to return NULL when there's memory pressure,
even if Linux has been configured indicating that memory overcommit is
OK.  The reason they can't set overcommit off is that it prevents other
services in the same system from running properly.


Thank you Alvaro, that was the missing piece. Now I understand what we 
are trying to do.



As I understand, setrlimit() sets the memory limit for any single
process.  But that isn't useful -- the limit needed is for the whole set
of processes under postmaster.  Limiting any individual process does no
good.

Now that's where cgroup's memory limiting features would prove useful,
if they weren't totally braindead:
https://www.kernel.org/doc/Documentation/cgroup-v2.txt
Apparently, if the cgroup goes over the "high" limit, the processes are
*throttled*.  Then if the group goes over the "max" limit, OOM-killer is
invoked.

(I can't see any way to make this even more counterproductive to the
database use case.  Making the database work more slowly doesn't fix
anything.)

So ditch cgroups.


Agreed.


What they (Timescale) do, is have a LD_PRELOAD library that checks
status of memory pressure, and return NULL from malloc().  This then
leads to clean abort of transactions and all is well.  There's nothing
that Postgres needs to do different than today.

I suppose that what they would like, is a way to inquire into the memory
pressure status at MemoryContextAlloc() time and return NULL if it is
too high.  How exactly this would work is unclear to me; maybe one
process keeps an eye on it in an OS-specific manner, and if it does get
near the maximum, set a bit in shared memory that other processes can
examine when MemoryContextAlloc is called.  It doesn't have to be
exactly accurate; an approximation is probably okay.


Correct, it doesn't have to be accurate. Something /proc based setting a 
flag in shared memory WOULD be good enough, IF MemoryContextAlloc() had 
some way of figuring out that its process is actually the right one to 
abort.


On a high-transaction-throughput system, having such a background 
process be the only one setting and clearing a flag in shared memory 
could prove disastrous. If it only checks and sets/clears the flag every second 
... the whole system would throw malloc(3) failures for a whole second 
on every session. Not the system I would like to benchmark ... although 
the result charts would look hilarious.


However, once we are under memory pressure to the point of aborting 
transactions, it may be reasonable to have MemoryContextAlloc() calls 
work through a queue and return NULL one by one until the pressure is 
low enough again.


I'll roll this problem around in my head for a little longer. There 
certainly is a way to do this a bit more intelligently.



Thanks again, Jan




Re: Limiting memory allocation

2022-05-18 Thread Jan Wieck

On 5/17/22 18:30, Stephen Frost wrote:

Greetings,

On Tue, May 17, 2022 at 18:12 Tom Lane wrote:


Jan Wieck writes:
 > On 5/17/22 15:42, Stephen Frost wrote:
 >> Thoughts?

 > Using cgroups one can actually force a certain process (or user, or
 > service) to use swap if and when that service is using more memory than
 > it was "expected" to use.

I wonder if we shouldn't just provide documentation pointing to OS-level
facilities like that one.  The kernel has a pretty trivial way to check
the total memory used by a process.  We don't: it'd require tracking total
space used in all our memory contexts, and then extracting some number out
of our rear ends for allocations made directly from malloc.  In short,
anything we do here will be slow and unreliable, unless you want to depend
on platform-specific things like looking at /proc/self/maps.


This isn't actually a solution though, and that's the problem: you end up 
using swap, but if you use more than "expected" the OOM killer comes in 
and happily blows you up anyway. Cgroups are containers and exactly what 
kube is doing.


Maybe I'm missing something, but what is it that you would actually 
consider a solution? Knowing your current memory consumption doesn't 
make the need for allocating some right now go away. What do you 
envision the response of PostgreSQL to be if we had that information 
about resource pressure? I don't see us using mallopt(3) or 
malloc_trim(3) anywhere in the code, so I don't think any of our 
processes give back unused heap at this point (please correct me if I'm 
wrong). This means that even if we knew about the memory pressure of the 
system, adjusting things like work_mem on the fly may not do much at 
all, unless there is a constant turnover of backends.


So what do you propose PostgreSQL's response to high memory pressure to be?


Regards, Jan




Re: Limiting memory allocation

2022-05-17 Thread Jan Wieck

On 5/17/22 15:42, Stephen Frost wrote:

Thoughts?


Yes.

The main and foremost problem is a server that is used for multiple 
services and they behave differently when it comes to memory allocation. 
One service just allocates like we have petabytes of RAM, then uses 
little of it, while another one is doing precise accounting and uses all 
of that. These two types of services don't coexist well on one system 
without intervention.


Unfortunately swap space has been shunned as the ugly stepchild of 
memory in recent years. It could help in this regard to bring back swap 
space, while not really intending to use it.


Using cgroups one can actually force a certain process (or user, or 
service) to use swap if and when that service is using more memory than 
it was "expected" to use. So I have a server with 64G of RAM. I give 16G 
to Postgres as shared buffers and another 16G to work with. I assume 
another 16G of OS buffers, so I restrict the Apache-Tomcat stuff running 
on it to something like 8-12G. After that, it has to swap. Of course, my 
Postgres processes also will have to swap if they need more than 16G of 
overall work_mem ... but that is what I actually intended. I may have to 
reduce work_mem, or max_connections, or something else.



Regards, Jan




Re: Change the csv log to 'key:value' to facilitate the user to understanding and processing of logs

2022-03-15 Thread Jan Wieck

On 3/15/22 10:12, Andrew Dunstan wrote:


On 3/15/22 09:30, hubert depesz lubaczewski wrote:

On Tue, Mar 15, 2022 at 09:31:19AM +0800, lupeng wrote:

Dear Hackers
When I audit the Postgresql database recently, I found that after
configuring the log type as csv, the output log content is as follows:
"database ""lp_db1"" does not exist","DROP DATABASE
lp_db1;",,"dropdb, dbcommands.c:841","","client backend",,0 It is very
inconvenient to understand the real meaning of each field. And in the
log content," is escaped as "", which is not friendly to regular
expression matching. Therefore, I want to modify the csv log function,
change its format to key:value, assign the content of the non-existing
field to NULL, and at the same time, " will be escaped as  \" in the
log content. After the modification, the above log format is as
follows: Log_time:"2022-03-15 09:17:55.289
CST",User_name:"postgres",Database_name:"lp_db",Process_id:"17995",Remote_host:"192.168.88.130",Remote_port:"38402",Line_number:
"622fe941.464b",PS_display:"DROP
DATABASE",Session_start_timestamp:"2022-03-15 09:17:53
CST",Virtual_transaction_id:"3/2",Transaction_id:"NULL",Error_severity:"ERROR",SQL_state_code
:"3D000",Errmessage:"database \"lp_db1\" does not
exist",Errdetail:"NULL",Errhint:"NULL",Internal_query:"NULL",Internal_pos:"0",Errcontext:"NULL",User_query
:"DROP DATABASE lp_db1;",Cursorpos:"NULL",File_location:"dropdb,
dbcommands.c:841",Application_name:"NULL",Backend_type:"client
backend",Leader_PID:"0",Query_id:"0"

CSV format is well documented
(https://www.postgresql.org/docs/current/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-CSVLOG).

If you want named fields you can wait for pg15 and its jsonlog
(https://www.depesz.com/2022/01/17/waiting-for-postgresql-15-introduce-log_destinationjsonlog/).

I, for one, wouldn't want to have to deal with field names repeated in
every single record.



Indeed. And even if this were a good idea, which it's not, it would be
15 years too late.


Also, the CSV format, while human-readable to a degree, wasn't meant for 
direct human consumption. It was meant to be read by programs, and at 
the time CSV made the most sense.



Regards, Jan




Re: should we enable log_checkpoints out of the box?

2021-11-05 Thread Jan Wieck

On 11/5/21 10:50, Alvaro Herrera wrote:

On 2021-Nov-05, Michael Banck wrote:


Well that, and the fact those distinctions are only done for user-
facing events, whereas it seems to me we only distinguish between LOG
and PANIC for server-facing events; maybe we need one or more
additional levels here in order to make it easier for admins to see the
really bad things that are happening?


I think what we need is an orthogonal classification.  "This FATAL here
is routine; that ERROR there denotes a severe problem in the underlying
OS".  Additional levels won't help with that.  Maybe adding the concept
of "severity" or "criticality" to some messages would be useful to
decide what to keep and what to discard.



That would go a long way. I would add a third classification that is 
"area", indicating whether this is, for example, resource or application-logic 
related. An FK violation is app logic; running checkpoints too often is 
a resource problem. Allow the DBA to create some filter based on 
combinations of them and it should work well enough.



Regards, Jan Wieck




Re: should we enable log_checkpoints out of the box?

2021-11-03 Thread Jan Wieck

On 11/3/21 09:09, Robert Haas wrote:


For better or for worse, the
distinction between ERROR, FATAL, and PANIC is entirely based on what
we do after printing the message, and NOT on how serious the message
is.


THAT is a real problem with our error handling and logging system. Often 
using RAISE in a user-defined function or procedure is part of the 
communication between the application's code in the backend and the 
parts in the middleware. The information that a function rejected user 
input after a sanity check doesn't belong in the system log. It should 
produce an error message on the user's end and that is it. However, 
currently there is no way for a UDF to ensure the transaction gets 
rolled back without raising an ERROR and bloating the log.


For example, the BenchmarkSQL UDF implementation raises such an ERROR for 
the transactions that TPC-C requires to contain an input error 
(non-existing ITEM), which happens in 1% of all NEW-ORDER transactions. 
Running an 8-core system with plenty of RAM close to saturation will 
generate about 10MB of log just for that per hour. That is a quarter GB 
of useless garbage per day that no DBA is ever going to be interested in.
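

For illustration, a minimal sketch of the kind of sanity-check ERROR I am 
talking about (names invented, loosely modeled on the TPC-C ITEM case, not 
the actual BenchmarkSQL code):

CREATE FUNCTION check_item_exists(p_i_id integer) RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
    IF NOT EXISTS (SELECT 1 FROM item WHERE i_id = p_i_id) THEN
        -- application-level communication, but the only way to get the
        -- transaction rolled back, and every occurrence lands in the log
        RAISE EXCEPTION 'Item number is not valid';
    END IF;
END;
$$;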


If anybody is worried about producing too much log output, this should 
be their primary focus.



Regards, Jan





Re: should we enable log_checkpoints out of the box?

2021-11-02 Thread Jan Wieck

On 11/2/21 20:02, Tom Lane wrote:


My objection to log_checkpoints=on is that that's going to produce
a constant stream of messages even when *nothing at all* is wrong.
Worse yet, a novice DBA would likely have a hard time understanding
from those messages whether there was anything to worry about or not.
If we could design a variant of log_checkpoints that would produce
output only when the situation really needs attention, I'd be fine
with enabling that by default.



Making log_checkpoints an enum sort of thing, as already suggested, might 
do that. Or (also already suggested) elevating the log level of checkpoint 
messages once checkpoints have been triggered by WAL volume for a while.


The thing I don't want to see us doing is *nothing at all* when pretty 
much everyone with some customer experience in the field is saying "this 
is the information we want to see post-incident, and nobody has it, so we 
sit there waiting for the next time it happens."



Regards

--
Jan Wieck




Re: should we enable log_checkpoints out of the box?

2021-11-02 Thread Jan Wieck

On 11/2/21 12:09, Peter Geoghegan wrote:

On Tue, Nov 2, 2021 at 8:55 AM Robert Haas  wrote:

I think shipping with log_checkpoints=on and
log_autovacuum_min_duration=10m or so would be one of the best things
we could possibly do to allow ex-post-facto troubleshooting of
system-wide performance issues. The idea that users care more about
the inconvenience of a handful of extra log messages than they do
about being able to find problems when they have them matches no part
of my experience. I suspect that many users would be willing to incur
*way more* useless log messages than those settings would ever
generate if it meant that they could actually find problems when and
if they have them.


I fully agree.



+1

--
Jan Wieck




Re: should we enable log_checkpoints out of the box?

2021-10-31 Thread Jan Wieck

On 10/31/21 16:16, Andres Freund wrote:

Hi,

On 2021-10-31 15:43:57 -0400, Tom Lane wrote:

Andres Freund  writes:
> On 2021-10-31 10:59:19 -0400, Tom Lane wrote:
>> No DBA would be likely to consider it as anything but log spam.

> I don't agree at all. No postgres instance should be run without
> log_checkpoints enabled. Performance is poor if checkpoints are
> triggered by anything but time, and that can only be diagnosed if
> log_checkpoints is on.

This is complete nonsense.


Shrug. It's based on many years of doing or being around people doing
postgres support escalation shifts. And it's not like log_checkpoints
incurs meaningful overhead or causes that much log volume.


I agree with Andres 100%. Whenever I am called to diagnose any type of 
problem, this is on the usual checklist and very few customers have it 
turned on. The usefulness of this information very much outweighs the 
tiny amount of extra log created.







If we think that's a generic problem, we should be fixing the problem
(ie, making the checkpointer smarter);


We've made it less bad (checkpoint_segments -> max_wal_size, sorting IO
for checkpoints, forcing the OS to flush writes earlier). But it's still
a significant issue. It's not that easy to make it better.


And we kept the default for max_wal_size at 1GB. While it is a "soft" 
limit, it is the main reason why instances are running full bore with a 
huge percentage of full-page writes: it is way too small for 
their throughput and nothing in the logs warns them about it. I can run 
a certain TPC-C workload on an 8-core machine quite comfortably when 
max_wal_size is configured at 100G. The exact same TPC-C configuration 
will spiral the machine down if left with the default max_wal_size, and there 
is zero hint in the logs as to why.
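

Just to illustrate the two settings in question (the 100G value is from the 
TPC-C run above and obviously workload dependent, so take it as a sketch, 
not a recommendation):

ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET max_wal_size = '100GB';
SELECT pg_reload_conf();    -- both settings only need a reload, not a restart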





--
Jan Wieck




Re: PostgreSQL High Precision Support Extension.

2021-09-21 Thread Jan Wieck

On 9/21/21 6:20 PM, Thomas Munro wrote:

On Tue, Sep 21, 2021 at 2:58 PM Thomas Munro  wrote:

On Tue, Sep 21, 2021 at 1:30 PM A Z  wrote:
> -A library like GMP, written in C, is an appropriate basis to start from
> and to include, for all OS platforms involved.

Are you aware of Daniele Varrazzo's extension
https://github.com/dvarrazzo/pgmp/ ?  (Never looked into it myself,
but this seems like the sort of thing you might be looking for?)


[A Z replied off-list and mentioned areas where pgmp falls short, but
I'll reply on-list to try to increase the chance of useful discussion
here...]


This seems to have become a common pattern in open source communities. Not 
just PostgreSQL, I have seen it elsewhere. People make vague or just 
specification-level "proposals" and, as soon as anyone replies, try to 
drag it into private, off-list/off-forum conversations.

Most of the time this is because they aren't really proposing anything, 
but are just looking for someone else to implement what they need for 
their own paying customer. They are not willing or able to contribute 
anything but requirements.


As the original author of the NUMERIC data type I am definitely 
interested in this sort of stuff. And I would love contributing in the 
design of the on-disk and in-memory structures of these new data types 
created as an EXTENSION.


However, so far I only see a request for someone else to create this 
extension. What exactly is "A Z" (poweruserm) going to contribute to 
this effort?



Regards, Jan

--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Jan Wieck

On 7/30/21 1:05 PM, Tom Lane wrote:

I don't see the need for it anyway.  What is different from just
putting the update actions into an extension version upgrade
script created according to the current rules, and then setting
that new extension version as the default version in the extension
build you ship for the new server version?


You are right. The real fix should actually be that an extension that 
creates different objects depending on the major server version it is 
installed on should not use the same version number for itself on those 
two server versions. It is actually wrong to have DO blocks that execute 
server-version-dependent sections in the CREATE EXTENSION scripts. 
However similar the code may be, it is intended for different server 
versions, so it is not the same version of the extension.



Regards, Jan

--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Jan Wieck

On 7/30/21 7:40 AM, Julien Rouhaud wrote:

On Fri, Jul 30, 2021 at 07:33:55AM -0400, Dave Cramer wrote:


Ah, ok, got it, thanks. Well I'm not sure how to deal with this.
The only thing I can think of is that we add a post_upgrade function to the
API.

All extensions would have to implement this.


It seems like a really big hammer for a niche usage.  As far as I know I'm the
only one who wrote an extension that can create different objects depending on
the server version, so I'm entirely fine with dealing with that problem in my
extension rather than forcing everyone to implement an otherwise useless API.

Now if that API can be useful for other cases or if there are other extensions
with similar problems that would be different story.



I haven't worked on it for a while, but I think pl_profiler does the 
same thing, so you are not alone.



Jan

--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Jan Wieck

On 7/30/21 7:33 AM, Dave Cramer wrote:





Yes, but as I said twice, only if the currently installed version is different
from the default version.  Otherwise ALTER EXTENSION UPDATE is a no-op.


Ah, ok, got it, thanks. Well I'm not sure how to deal with this.
The only thing I can think of is that we add a post_upgrade function to 
the API.


All extensions would have to implement this.


An alternative to this would be a recommended version number scheme for 
extensions. If extensions were numbered


   MAJOR_SERVER.MAJOR_EXTENSION.MINOR_EXTENSION

then pg_upgrade could check the new cluster for the existence of an SQL 
script that upgrades the extension from the old cluster's version to the 
new current one. And since an extension cannot have the same version number 
on two major server versions, there is no ambiguity here.


The downside is that all the extensions that don't need anything done 
for those upgrades (which I believe is the majority of them) would have 
to provide empty SQL files. Not necessarily a bad thing, as it actually 
documents "yes, the extension developer checked this and there is 
nothing to do here."
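

As a purely hypothetical sketch of that scheme (the extension name and 
version numbers here are invented), the PostgreSQL 15 build of an extension 
could ship:

   myext--15.2.1.sql            -- install script of the PG 15 build
   myext--14.2.1--15.2.1.sql    -- applied by pg_upgrade when coming from PG 14

   -- Contents of myext--14.2.1--15.2.1.sql:
   -- intentionally empty; the author verified that no object changes are
   -- needed when moving from the PG 14 build to the PG 15 build.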



Regards, Jan

--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-29 Thread Jan Wieck

On 7/29/21 2:04 PM, Julien Rouhaud wrote:

On Thu, Jul 29, 2021 at 11:46:12AM -0400, Jan Wieck wrote:



> I don't have a ready example of such an extension, but if we ever would
> convert the backend parts of Slony into an extension, it would be one.



FWIW I have an example of such an extension: the powa-archivist extension
script runs an anonymous code block and, if needed, creates a custom
wrapper to emulate the current_setting(text, boolean) variant that
doesn't exist on pre-9.6 servers.



Thank you!

I presume that pg_upgrade on a database with that extension installed 
would silently succeed and have the pg_catalog as well as public (or 
wherever) version of that function present.



Regards, Jan

--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-29 Thread Jan Wieck

On 7/29/21 12:44 PM, David G. Johnston wrote:

... - but at present pg_upgrade simply 
doesn't care and hopes that the extension author's documentation deals 
with these possibilities in a sane manner.


pg_upgrade is not able to address this in a guaranteed, consistent 
fashion. At this point there is no way to even make sure that a 3rd 
party extension provides the scripts needed to upgrade from one 
extension version to the next. We don't have a mechanism to upgrade 
the same extension version from one server version to the next, in case 
any in-place modifications are necessary.


How exactly do you envision that we, the PostgreSQL project, make sure 
that extension developers provide those mechanisms in the future?



Regards, Jan

--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-29 Thread Jan Wieck

On 7/29/21 12:00 PM, David G. Johnston wrote:
Ok, looking at the flow again, where exactly would the user even be able 
to execute "CREATE EXTENSION" meaningfully?  The relevant databases do 
not exist (not totally sure what happens to the postgres database 
created during the initdb step...) so at the point where the user is 
"installing the extension" all they can reasonably do is a server-level 
install (they could maybe create extension in the postgres database, but 
does that even matter?).


So, I'd propose simplifying this all to something like:

Install extensions on the new server


Extensions are not installed on the server level. Their binary 
components (shared objects) are, but the actual catalog modifications 
that make them accessible are performed per database by CREATE 
EXTENSION, which executes the SQL files associated with the extension. 
And they can be performed differently per database, like for example 
placing one and the same extension into different schemas in different 
databases.
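

For example (database and schema names invented, and assuming the schemas 
already exist), the same extension can live in different schemas in two 
databases of the same cluster:

-- in database app1
CREATE EXTENSION pg_stat_statements SCHEMA ext;

-- in database app2
CREATE EXTENSION pg_stat_statements SCHEMA monitoring;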


pg_upgrade is not (and should not be) concerned with placing the 
extension's installation components into the new version's lib and share 
directories. But it is pg_upgrade's job to perform the correct catalog 
modification per database during the upgrade.


Any extensions that are used by the old cluster need to be installed 
into the new cluster.  Each database in the old cluster will have its 
current version of all extensions migrated to the new cluster as-is.  
You can use the ALTER EXTENSION command, on a per-database basis, to 
update its extensions post-upgrade.


That assumes that the extension SQL files are capable of detecting a 
server version change and performing the necessary (if any) steps to alter 
the extension's objects accordingly.


Off the top of my head I don't remember what happens when one executes 
ALTER EXTENSION ... UPDATE ... when it is already on the latest version 
*of the extension*. It might be an error or a no-op.


And to make matters worse, it is not possible to work around this with a 
DROP EXTENSION ... CREATE EXTENSION. There are extensions that create 
objects, like user defined data types and functions, that will be 
referenced by end user objects like tables and views.



Regards, Jan

--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-29 Thread Jan Wieck

On 7/29/21 11:10 AM, Bruce Momjian wrote:

On Thu, Jul 29, 2021 at 11:01:43AM -0400, Dave Cramer wrote:

Much better, however I'm unclear on whether CREATE EXTENSION is actually
executed on the upgraded server.

From what I could gather it is not, otherwise the new function definitions
should have been in place. I think what happens is that the function
definitions are copied as part of the schema and pg_extension is also copied.


Yes, the _effect_ of CREATE EXTENSION in the old cluster is copied to
the new cluster as object definitions.  CREATE EXTENSION runs the SQL
files associated with the extension.



This assumes that the scripts executed during CREATE EXTENSION have no 
conditional code in them that depends on the server version. Running the 
same SQL script on different server versions can have different effects.
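

A hypothetical fragment of such a script (invented here, not taken from any 
real extension) shows what I mean; on an older server it creates a 
compatibility function, on a newer server it creates nothing:

DO $$
BEGIN
    IF current_setting('server_version_num')::int < 140000 THEN
        -- pretend this emulates something newer servers provide built in
        CREATE FUNCTION myext_compat(text) RETURNS text
            LANGUAGE sql AS 'SELECT $1';
    END IF;
END;
$$;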


I don't have a ready example of such an extension, but if we ever would 
convert the backend parts of Slony into an extension, it would be one.



Regards, Jan

--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-15 Thread Jan Wieck

On 7/15/21 4:24 PM, David G. Johnston wrote:

OK, especially as this seems useful outside of pg_upgrade, and if done 
separately is something pg_upgrade could just run as part of its new 
cluster evaluation scripts.  Knowing whether an extension is outdated 
doesn't require the old cluster.


Knowing that (the extension is outdated) exactly how? Can you give us an 
example query, maybe a few SQL snippets explaining what exactly you are 
talking about? Because at this point you completely lost me.


Sorry, Jan

--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-15 Thread Jan Wieck

On 7/15/21 1:10 PM, David G. Johnston wrote:
... But while 
PostgreSQL doesn't really have a choice here - it cannot be expected to 
subdivide itself - extensions (at least external ones - PostGIS is one I 
have in mind presently) - can and often do attempt to support multiple 
versions of PostgreSQL for whatever major versions of their product they 
are offering.  For these it is possible to adhere to the "change one 
thing at a time principle" and to treat the extensions as not being part 
of "ALL the changes from one major version to the target version..."


You may make that exception for an external extension like PostGIS. But 
I don't think it is valid for one distributed in sync with the core 
system in the contrib package, like pg_stat_statements. Which happens to 
be the one named in the subject line of this entire discussion.



Regards, Jan

--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-15 Thread Jan Wieck

On 7/15/21 12:46 PM, David G. Johnston wrote:

I am an application developer who operates on the principle of "change 
only one thing at a time".


Which pg_upgrade by definition isn't. It is bringing ALL the changes 
from one major version to the target version, which may be multiple versions 
at once. Including, but not limited to, catalog schema changes, SQL 
language changes, extension behavior changes and utility command 
behavior changes.


On that principle, you should advocate against using pg_upgrade in the 
first place.



Regards, Jan

--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-15 Thread Jan Wieck

On 7/15/21 12:31 PM, Robert Eckhardt wrote:
  I don’t know if this is a terrible flaw in pg_upgrade, it is a 
terrible flaw in the overall Postgres experience.


+1 (that is the actual problem here)


--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-15 Thread Jan Wieck

On 7/15/21 12:25 PM, David G. Johnston wrote:

I would say that it probably should be "--upgrade-extension=aaa 
--upgrade_extension=bbb" though if we are going to the effort to offer 
something.


I am a bit confused here. From the previous exchange I get the feeling 
that you haven't created and maintained a single extension that survived 
a single version upgrade of itself or PostgreSQL (in the latter case 
requiring code changes to the extension due to internal API changes 
inside the PostgreSQL version).


I have. PL/Profiler to be explicit.

Can you please elaborate what experience your opinion is based on?


Regards, Jan


--
Jan Wieck




Re: Always bump PG_CONTROL_VERSION?

2021-05-13 Thread Jan Wieck

On 5/13/21 6:45 PM, Tom Lane wrote:

  Bumping the version in the commit that changes
things is not optional, because if you don't do that then you'll
probably burn some other developer also working on HEAD.  So
I don't want people thinking they can skip this because it was
done at the beginning of the development cycle.


And we make sure this is done how?


Regards, Jan

--
Jan Wieck
Postgres User since 1994




Re: Always bump PG_CONTROL_VERSION?

2021-05-13 Thread Jan Wieck

On 5/12/21 10:04 PM, Michael Paquier wrote:

On Wed, May 12, 2021 at 01:30:27PM -0700, Andres Freund wrote:

That said, I don't think it's a good practice to use the control file
version as an identifier for the major version. Who knows, it might be
necessary to add an optional new format in a minor version at some point
or such crazyness. And then there's the beta stuff you'd mentioned, etc.


Yes, PG_VERSION, as you wrote upthread already, is already fine for
the job, and FWIW, I have yet to see a case where being able to easily
detect the minor version in a data folder matters.


Regardless of PG_VERSION doing the job or not, shouldn't there be a bump 
in PG_CONTROL_VERSION whenever there is a structural or semantic change 
in the control file data? And wouldn't the easiest way to ensure that be 
to bump the version with every release?


Also, can someone give me a good reason NOT to bump the version?


Thanks, Jan

--
Jan Wieck
Postgres User since 1994




Re: SELECT INTO deprecation

2021-03-30 Thread Jan Wieck

On 12/15/20 5:13 PM, Bruce Momjian wrote:

On Wed, Dec  9, 2020 at 09:48:54PM +0100, Peter Eisentraut wrote:

On 2020-12-03 20:26, Peter Eisentraut wrote:
> On 2020-12-03 16:34, Tom Lane wrote:
> > As I recall, a whole lot of the pain we have with INTO has to do
> > with the semantics we've chosen for INTO in a set-operation nest.
> > We think you can write something like
> > 
> >  SELECT ... INTO foo FROM ... UNION SELECT ... FROM ...
> > 
> > but we insist on the INTO being in the first component SELECT.

> > I'd like to know exactly how much of that messiness is shared
> > by SQL Server.
> 
> On sqlfiddle.com, this works:
> 
> select a into t3 from t1 union select a from t2;
> 
> but this gets an error:
> 
> select a from t1 union select a into t4 from t2;
> 
> SELECT INTO must be the first query in a statement containing a UNION,

> INTERSECT or EXCEPT operator.

So, with that in mind, here is an alternative proposal that points out that
SELECT INTO has some use for compatibility.


Do we really want to carry around confusing syntax for compatibility?  I
doubt we would ever add INTO now, even for compatibility.



If memory serves, the INTO syntax is a result of the first incarnation 
of PL/pgSQL being based on the Oracle PL/SQL syntax. I think it has been 
there from the very beginning, which makes it likely that by now a lot 
of migrants are using it in rather old code.


I don't think it should be our business to throw wrenches into their 
existing applications.
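

To make concrete what such old code looks like, here is a sketch (table 
names invented) of the form being discussed and its modern equivalent:

-- old style, the form under discussion:
SELECT * INTO report_2020 FROM orders WHERE o_year = 2020;

-- the recommended equivalent:
CREATE TABLE report_2020 AS
    SELECT * FROM orders WHERE o_year = 2020;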



Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-24 Thread Jan Wieck

On 3/24/21 12:04 PM, Jan Wieck wrote:

In any case I changed the options so that they behave the same way the
existing -o and -O (for old/new postmaster options) work. I don't think
it would be wise to have option forwarding work differently between
options for the postmaster and options for pg_dump/pg_restore.


Attaching the actual diff might help.

--
Jan Wieck
Principle Database Engineer
Amazon Web Services
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index c7351a4..4a611d0 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -864,6 +864,11 @@ RunWorker(ArchiveHandle *AH, ParallelSlot *slot)
 	WaitForCommands(AH, pipefd);
 
 	/*
+	 * Close an eventually open BLOB batch transaction.
+	 */
+	CommitBlobTransaction((Archive *)AH);
+
+	/*
 	 * Disconnect from database and clean up.
 	 */
 	set_cancel_slot_archive(slot, NULL);
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 0296b9b..cd8a590 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -203,6 +203,8 @@ typedef struct Archive
 	int			numWorkers;		/* number of parallel processes */
 	char	   *sync_snapshot_id;	/* sync snapshot id for parallel operation */
 
+	int			blobBatchSize;	/* # of blobs to restore per transaction */
+
 	/* info needed for string escaping */
 	int			encoding;		/* libpq code for client_encoding */
 	bool		std_strings;	/* standard_conforming_strings */
@@ -269,6 +271,7 @@ extern void WriteData(Archive *AH, const void *data, size_t dLen);
 extern int	StartBlob(Archive *AH, Oid oid);
 extern int	EndBlob(Archive *AH, Oid oid);
 
+extern void	CommitBlobTransaction(Archive *AH);
 extern void CloseArchive(Archive *AH);
 
 extern void SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt);
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 1f82c64..8331e8a 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -68,6 +68,7 @@ typedef struct _parallelReadyList
 	bool		sorted;			/* are valid entries currently sorted? */
 } ParallelReadyList;
 
+static int		blobBatchCount = 0;
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 			   const int compression, bool dosync, ArchiveMode mode,
@@ -265,6 +266,8 @@ CloseArchive(Archive *AHX)
 	int			res = 0;
 	ArchiveHandle *AH = (ArchiveHandle *) AHX;
 
+	CommitBlobTransaction(AHX);
+
 	AH->ClosePtr(AH);
 
 	/* Close the output */
@@ -279,6 +282,23 @@ CloseArchive(Archive *AHX)
 
 /* Public */
 void
+CommitBlobTransaction(Archive *AHX)
+{
+	ArchiveHandle *AH = (ArchiveHandle *) AHX;
+
+	if (blobBatchCount > 0)
+	{
+		ahprintf(AH, "--\n");
+		ahprintf(AH, "-- End BLOB restore batch\n");
+		ahprintf(AH, "--\n");
+		ahprintf(AH, "COMMIT;\n\n");
+
+		blobBatchCount = 0;
+	}
+}
+
+/* Public */
+void
 SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt)
 {
 	/* Caller can omit dump options, in which case we synthesize them */
@@ -3531,6 +3551,57 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
 {
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/* We restore BLOBs in batches to reduce XID consumption */
+	if (strcmp(te->desc, "BLOB") == 0 && AH->public.blobBatchSize > 0)
+	{
+		if (blobBatchCount > 0)
+		{
+			/* We are inside a BLOB restore transaction */
+			if (blobBatchCount >= AH->public.blobBatchSize)
+			{
+				/*
+				 * We did reach the batch size with the previous BLOB.
+				 * Commit and start a new batch.
+				 */
+				ahprintf(AH, "--\n");
+				ahprintf(AH, "-- BLOB batch size reached\n");
+				ahprintf(AH, "--\n");
+				ahprintf(AH, "COMMIT;\n");
+				ahprintf(AH, "BEGIN;\n\n");
+
+				blobBatchCount = 1;
+			}
+			else
+			{
+				/* This one still fits into the current batch */
+				blobBatchCount++;
+			}
+		}
+		else
+		{
+			/* Not inside a transaction, start a new batch */
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "-- Start BLOB restore batch\n");
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "BEGIN;\n\n");
+
+			blobBatchCount = 1;
+		}
+	}
+	else
+	{
+		/* Not a BLOB. If we have a BLOB batch open, close it. */
+		if (blobBatchCount > 0)
+		{
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "-- End BLOB restore batch\n");
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "COMMIT;\n\n");
+
+			blobBatchCount = 0;
+		}
+	}
+
 	/* Select owner, schema, tablespace and default AM as necessary */
 	_becomeOwner(AH, te);
 	_selectOutputSchema(AH, te->namespace);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f8bec3f..f153f08 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -165,12 +165,20 @@ static void guessConstraintInheritance(TableInfo *tblinfo, int numTables);
 static void du

Re: pg_upgrade failing for 200+ million Large Objects

2021-03-24 Thread Jan Wieck

On 3/23/21 4:55 PM, Tom Lane wrote:

Jan Wieck  writes:
Have we even reached a consensus yet on whether doing it the way my patch
is proposing is the right way to go? Like whether emitting BLOB TOC
entries into SECTION_DATA when in binary upgrade mode is a good thing?
Or whether bunching all the SQL statements for creating the blob, changing
the ACL and COMMENT and SECLABEL all into one multi-statement query is.


Now you're asking for actual review effort, which is a little hard
to come by towards the tail end of the last CF of a cycle.  I'm
interested in this topic, but I can't justify spending much time
on it right now.


Understood.

In any case I changed the options so that they behave the same way the 
existing -o and -O (for old/new postmaster options) work. I don't think 
it would be wise to have option forwarding work differently between 
options for the postmaster and options for pg_dump/pg_restore.



Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Jan Wieck

On 3/23/21 3:35 PM, Tom Lane wrote:

Jan Wieck  writes:
The problem here is that pg_upgrade itself is invoking a shell again. It 
is not assembling an array of arguments to pass into exec*(). I'd be a 
happy camper if it did the latter. But as things are we'd have to add 
full shell escaping for arbitrary strings.


Surely we need that (and have it already) anyway?


There are functions to shell escape a single string, like

appendShellString()

but that is hardly enough when a single optarg for --restore-option 
could look like any of


--jobs 8
--jobs=8
--jobs='8'
--jobs '8'
--jobs "8"
--jobs="8"
--dont-bother-about-jobs

When placed into a shell string, those things have very different 
effects on your args[].


I also want to say that we are overengineering this whole thing. Yes, 
there is the problem of shell quoting possibly going wrong as it passes 
from one shell to another. But for now this is all about passing a few 
numbers down from pg_upgrade to pg_restore (and eventually pg_dump).


Have we even reached a consensus yet on whether doing it the way my patch 
is proposing is the right way to go? Like whether emitting BLOB TOC 
entries into SECTION_DATA when in binary upgrade mode is a good thing? 
Or whether bunching all the SQL statements for creating the blob, changing 
the ACL and COMMENT and SECLABEL all into one multi-statement query is.


Maybe we should focus on those details before getting into all the 
parameter naming stuff.



Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Jan Wieck

On 3/23/21 2:59 PM, Tom Lane wrote:

Jan Wieck  writes:

On 3/23/21 2:35 PM, Tom Lane wrote:

If you're passing multiple options, that is
--pg-dump-options "--foo=x --bar=y"
it seems just horribly fragile.  Lose the double quotes and suddenly
--bar is a separate option to pg_upgrade itself, not part of the argument
for the previous option.  That's pretty easy to do when passing things
through shell scripts, too.


... which would be all really easy if pg_upgrade wouldn't be assembling 
a shell script string to pass into parallel_exec_prog() by itself.


No, what I was worried about is shell script(s) that invoke pg_upgrade
and have to pass down some of these options through multiple levels of
option parsing.


The problem here is that pg_upgrade itself is invoking a shell again. It 
is not assembling an array of arguments to pass into exec*(). I'd be a 
happy camper if it did the latter. But as things are we'd have to add 
full shell escaping for arbitrary strings.




BTW, it doesn't seem like the "pg-" prefix has any value-add here,
so maybe "--dump-option" and "--restore-option" would be suitable
spellings.


Agreed.


Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Jan Wieck

On 3/23/21 2:35 PM, Tom Lane wrote:

Jan Wieck  writes:

So the question remains, how do we name this?



 --pg-dump-options ""
 --pg-restore-options ""


If you're passing multiple options, that is

--pg-dump-options "--foo=x --bar=y"

it seems just horribly fragile.  Lose the double quotes and suddenly
--bar is a separate option to pg_upgrade itself, not part of the argument
for the previous option.  That's pretty easy to do when passing things
through shell scripts, too.  So it'd likely be safer to write

--pg-dump-option=--foo=x --pg-dump-option=--bar=y

which requires pg_upgrade to allow aggregating multiple options,
but you'd probably want it to act that way anyway.


... which would be all really easy if pg_upgrade wouldn't be assembling 
a shell script string to pass into parallel_exec_prog() by itself.


But I will see what I can do ...


Regards, Jan


--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Jan Wieck

On 3/23/21 2:06 PM, Bruce Momjian wrote:

We have the postmaster which can pass arbitrary arguments to postgres
processes using -o.


Right, and -o is already taken in pg_upgrade for sending options to the 
old postmaster.


What we are looking for are options for sending options to pg_dump and 
pg_restore, which are not postmasters or children of postmaster, but 
rather clients. There is no option to send options to clients of 
postmasters.


So the question remains, how do we name this?

--pg-dump-options ""
--pg-restore-options ""

where "" could be something like "--whatever[=NUM] [...]" would 
be something unambiguous.



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Jan Wieck

On 3/23/21 10:56 AM, Bruce Momjian wrote:

On Tue, Mar 23, 2021 at 08:51:32AM -0400, Jan Wieck wrote:

On 3/22/21 7:18 PM, Jan Wieck wrote:
> On 3/22/21 5:36 PM, Zhihong Yu wrote:
> > Hi,
> > 
> > w.r.t. pg_upgrade_improvements.v2.diff.
> > 
> > +       blobBatchCount = 0;
> > +       blobInXact = false;
> > 
> > The count and bool flag are always reset in tandem. It seems
> > variable blobInXact is not needed.
> 
> You are right. I will fix that.


New patch v3 attached.


Would it be better to allow pg_upgrade to pass arbitrary arguments to
pg_restore, instead of just these specific ones?



That would mean arbitrary parameters to pg_dump as well as pg_restore. 
But yes, that would probably be better in the long run.


Any suggestion as to what that would actually look like? Unfortunately 
pg_restore already has -[dDoOr] in use, so it doesn't look like there will 
be any naturally intelligible short options for that.



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Jan Wieck

On 3/22/21 7:18 PM, Jan Wieck wrote:

On 3/22/21 5:36 PM, Zhihong Yu wrote:

Hi,

w.r.t. pg_upgrade_improvements.v2.diff.

+       blobBatchCount = 0;
+       blobInXact = false;

The count and bool flag are always reset in tandem. It seems 
variable blobInXact is not needed.


You are right. I will fix that.


New patch v3 attached.


Thanks, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index c7351a4..4a611d0 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -864,6 +864,11 @@ RunWorker(ArchiveHandle *AH, ParallelSlot *slot)
 	WaitForCommands(AH, pipefd);
 
 	/*
+	 * Close an eventually open BLOB batch transaction.
+	 */
+	CommitBlobTransaction((Archive *)AH);
+
+	/*
 	 * Disconnect from database and clean up.
 	 */
 	set_cancel_slot_archive(slot, NULL);
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 0296b9b..cd8a590 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -203,6 +203,8 @@ typedef struct Archive
 	int			numWorkers;		/* number of parallel processes */
 	char	   *sync_snapshot_id;	/* sync snapshot id for parallel operation */
 
+	int			blobBatchSize;	/* # of blobs to restore per transaction */
+
 	/* info needed for string escaping */
 	int			encoding;		/* libpq code for client_encoding */
 	bool		std_strings;	/* standard_conforming_strings */
@@ -269,6 +271,7 @@ extern void WriteData(Archive *AH, const void *data, size_t dLen);
 extern int	StartBlob(Archive *AH, Oid oid);
 extern int	EndBlob(Archive *AH, Oid oid);
 
+extern void	CommitBlobTransaction(Archive *AH);
 extern void CloseArchive(Archive *AH);
 
 extern void SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt);
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 1f82c64..8331e8a 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -68,6 +68,7 @@ typedef struct _parallelReadyList
 	bool		sorted;			/* are valid entries currently sorted? */
 } ParallelReadyList;
 
+static int		blobBatchCount = 0;
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 			   const int compression, bool dosync, ArchiveMode mode,
@@ -265,6 +266,8 @@ CloseArchive(Archive *AHX)
 	int			res = 0;
 	ArchiveHandle *AH = (ArchiveHandle *) AHX;
 
+	CommitBlobTransaction(AHX);
+
 	AH->ClosePtr(AH);
 
 	/* Close the output */
@@ -279,6 +282,23 @@ CloseArchive(Archive *AHX)
 
 /* Public */
 void
+CommitBlobTransaction(Archive *AHX)
+{
+	ArchiveHandle *AH = (ArchiveHandle *) AHX;
+
+	if (blobBatchCount > 0)
+	{
+		ahprintf(AH, "--\n");
+		ahprintf(AH, "-- End BLOB restore batch\n");
+		ahprintf(AH, "--\n");
+		ahprintf(AH, "COMMIT;\n\n");
+
+		blobBatchCount = 0;
+	}
+}
+
+/* Public */
+void
 SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt)
 {
 	/* Caller can omit dump options, in which case we synthesize them */
@@ -3531,6 +3551,57 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
 {
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/* We restore BLOBs in batches to reduce XID consumption */
+	if (strcmp(te->desc, "BLOB") == 0 && AH->public.blobBatchSize > 0)
+	{
+		if (blobBatchCount > 0)
+		{
+			/* We are inside a BLOB restore transaction */
+			if (blobBatchCount >= AH->public.blobBatchSize)
+			{
+				/*
+				 * We did reach the batch size with the previous BLOB.
+				 * Commit and start a new batch.
+				 */
+				ahprintf(AH, "--\n");
+				ahprintf(AH, "-- BLOB batch size reached\n");
+				ahprintf(AH, "--\n");
+				ahprintf(AH, "COMMIT;\n");
+				ahprintf(AH, "BEGIN;\n\n");
+
+				blobBatchCount = 1;
+			}
+			else
+			{
+				/* This one still fits into the current batch */
+				blobBatchCount++;
+			}
+		}
+		else
+		{
+			/* Not inside a transaction, start a new batch */
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "-- Start BLOB restore batch\n");
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "BEGIN;\n\n");
+
+			blobBatchCount = 1;
+		}
+	}
+	else
+	{
+		/* Not a BLOB. If we have a BLOB batch open, close it. */
+		if (blobBatchCount > 0)
+		{
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "-- End BLOB restore batch\n");
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "COMMIT;\n\n");
+
+			blobBatchCount = 0;
+		}
+	}
+
 	/* Select owner, schema, tablespace and default AM as necessary */
 	_becomeOwner(AH, te);
 	_selectOutputSchema(AH, te->namespace);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f8bec3f..f153f08 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -165,12 +165,20 @@ static void guessConstraintInheritance(TableInfo *tblinfo, int numTables);
 s

Re: pg_upgrade failing for 200+ million Large Objects

2021-03-22 Thread Jan Wieck

On 3/22/21 5:36 PM, Zhihong Yu wrote:

Hi,

w.r.t. pg_upgrade_improvements.v2.diff.

+       blobBatchCount = 0;
+       blobInXact = false;

The count and bool flag are always reset in tandem. It seems 
variable blobInXact is not needed.


You are right. I will fix that.


Thanks, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: Fix pg_upgrade to preserve datdba

2021-03-22 Thread Jan Wieck

On 3/21/21 3:56 PM, Tom Lane wrote:

Jan Wieck  writes:
So let's focus on the actual problem of running out of XIDs and memory 
while doing the upgrade involving millions of small large objects.


Right.  So as far as --single-transaction vs. --create goes, that's
mostly a definitional problem.  As long as the contents of a DB are
restored in one transaction, it's not gonna matter if we eat one or
two more XIDs while creating the DB itself.  So we could either
relax pg_restore's complaint, or invent a different switch that's
named to acknowledge that it's not really only one transaction.

That still leaves us with the lots-o-locks problem.  However, once
we've crossed the Rubicon of "it's not really only one transaction",
you could imagine that the switch is "--fewer-transactions", and the
idea is for pg_restore to commit after every (say) 10 operations.
That would both bound its lock requirements and greatly cut its XID
consumption.


It leaves us with three things:

1) a tremendous number of locks
2) a tremendous amount of memory needed
3) taking forever because it is single-threaded

I created a pathological case here on a VM with 24GB of RAM, 80GB of 
SWAP sitting on NVME. The database has 20 million large objects, each of 
which has 2 GRANTS, 1 COMMENT and 1 SECURITY LABEL (dummy). Each LO only 
contains a string "large object ", so the whole database in 9.5 is 
about 15GB in size.


A stock pg_upgrade to version 14devel using --link takes about 15 hours. 
This is partly because the pg_dump and pg_restore both grow to something 
like 50GB+ to hold the TOC. Which sounds out of touch considering that 
the entire system catalog on disk is less than 15GB. But aside from the 
ridiculous amount of swapping, the whole thing also suffers from 
consuming about 80 million transactions and apparently having just as 
many network round trips with a single client.




The work you described sounded like it could fit into that paradigm,
with the additional ability to run some parallel restore tasks
that are each consuming a bounded number of locks.


I have attached a POC patch that implements two new options for pg_upgrade.

  --restore-jobs=NUM --jobs parameter passed to pg_restore
  --restore-blob-batch-size=NUM  number of blobs restored in one xact

It does a bit more than just that. It rearranges the way large objects 
are dumped so that most of the commands are all in one TOC entry and the 
entry is emitted into SECTION_DATA when in binary upgrade mode (which 
guarantees that there isn't any actual BLOB data in the dump). This 
greatly reduces the number of network round trips and when using 8 
parallel restore jobs, almost saturates the 4-core VM. Reducing the 
number of TOC entries also reduces the total virtual memory need of 
pg_restore to 15G, so there is a lot less swapping going on.


It cuts down the pg_upgrade time from 15 hours to 1.5 hours. In that run 
I used --restore-jobs=8 and --restore-blob-batch-size=1 (with a 
max_locks_per_transaction=12000).


As said, this isn't a "one size fits all" solution. The pg_upgrade 
parameters for --jobs and --restore-jobs will really depend on the 
situation. Hundreds of small databases want --jobs, but one database 
with millions of large objects wants --restore-jobs.



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index c7351a4..4a611d0 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -864,6 +864,11 @@ RunWorker(ArchiveHandle *AH, ParallelSlot *slot)
 	WaitForCommands(AH, pipefd);
 
 	/*
+	 * Close an eventually open BLOB batch transaction.
+	 */
+	CommitBlobTransaction((Archive *)AH);
+
+	/*
 	 * Disconnect from database and clean up.
 	 */
 	set_cancel_slot_archive(slot, NULL);
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 0296b9b..cd8a590 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -203,6 +203,8 @@ typedef struct Archive
 	int			numWorkers;		/* number of parallel processes */
 	char	   *sync_snapshot_id;	/* sync snapshot id for parallel operation */
 
+	int			blobBatchSize;	/* # of blobs to restore per transaction */
+
 	/* info needed for string escaping */
 	int			encoding;		/* libpq code for client_encoding */
 	bool		std_strings;	/* standard_conforming_strings */
@@ -269,6 +271,7 @@ extern void WriteData(Archive *AH, const void *data, size_t dLen);
 extern int	StartBlob(Archive *AH, Oid oid);
 extern int	EndBlob(Archive *AH, Oid oid);
 
+extern void	CommitBlobTransaction(Archive *AH);
 extern void CloseArchive(Archive *AH);
 
 extern void SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt);
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 1f82c64..51a862a 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_arc

Re: Fix pg_upgrade to preserve datdba

2021-03-21 Thread Jan Wieck

On 3/21/21 2:34 PM, Tom Lane wrote:

and I see

--
-- Name: joe; Type: DATABASE; Schema: -; Owner: joe
--

CREATE DATABASE joe WITH TEMPLATE = template0 ENCODING = 'SQL_ASCII' LOCALE = 
'C';


ALTER DATABASE joe OWNER TO joe;

so at least in this case it's doing the right thing.  We need a bit
more detail about the context in which it's doing the wrong thing
for you.


After moving all of this to a pristine postgresql.org based repo I see 
the same. My best guess at this point is that the permission hoops that 
RDS and Aurora PostgreSQL are jumping through were messing with this. 
But that has nothing to do with the actual topic.


So let's focus on the actual problem of running out of XIDs and memory 
while doing the upgrade involving millions of small large objects.



Regards, Jan


--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: Fix pg_upgrade to preserve datdba

2021-03-21 Thread Jan Wieck

On 3/21/21 1:15 PM, Jan Wieck wrote:

On 3/21/21 12:57 PM, Tom Lane wrote:

Jan Wieck  writes:

On 3/20/21 12:39 AM, Jan Wieck wrote:

On the way pg_upgrade also mangles the pg_database.datdba
(all databases are owned by postgres after an upgrade; will submit a
separate patch for that as I consider that a bug by itself).



Patch attached.


Hmm, doesn't this lose all *other* database-level properties?

I think maybe what we have here is a bug in pg_restore, its
--create switch ought to be trying to update the database's
ownership.


Possibly. I didn't look into that route.


Thanks for that. I like this patch a lot better.


Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f8bec3f..19c1e71 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3030,6 +3030,8 @@ dumpDatabase(Archive *fout)
 	resetPQExpBuffer(creaQry);
 	resetPQExpBuffer(delQry);
 
+	appendPQExpBuffer(creaQry, "ALTER DATABASE %s OWNER TO %s;\n", qdatname, dba);
+
 	if (strlen(datconnlimit) > 0 && strcmp(datconnlimit, "-1") != 0)
 		appendPQExpBuffer(creaQry, "ALTER DATABASE %s CONNECTION LIMIT = %s;\n",
 		  qdatname, datconnlimit);


Re: Fix pg_upgrade to preserve datdba

2021-03-21 Thread Jan Wieck

On 3/21/21 12:57 PM, Tom Lane wrote:

Jan Wieck  writes:

On 3/20/21 12:39 AM, Jan Wieck wrote:

On the way pg_upgrade also mangles the pg_database.datdba
(all databases are owned by postgres after an upgrade; will submit a
separate patch for that as I consider that a bug by itself).



Patch attached.


Hmm, doesn't this lose all *other* database-level properties?

I think maybe what we have here is a bug in pg_restore, its
--create switch ought to be trying to update the database's
ownership.


Possibly. I didn't look into that route.


Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-21 Thread Jan Wieck

On 3/21/21 7:47 AM, Andrew Dunstan wrote:

One possible (probable?) source is the JDBC driver, which currently
treats all Blobs (and Clobs, for that matter) as LOs. I'm working on
improving that some: <https://github.com/pgjdbc/pgjdbc/pull/2093>


You mean the user is using OID columns pointing to large objects and the 
JDBC driver is mapping those for streaming operations?


Yeah, that would explain a lot.


Thanks, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Fix pg_upgrade to preserve datdba (was: Re: pg_upgrade failing for 200+ million Large Objects)

2021-03-21 Thread Jan Wieck

On 3/20/21 12:39 AM, Jan Wieck wrote:

On the way pg_upgrade also mangles the pg_database.datdba
(all databases are owned by postgres after an upgrade; will submit a
separate patch for that as I consider that a bug by itself).


Patch attached.


Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index 5d9a26c..38f7202 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -344,6 +344,7 @@ get_db_infos(ClusterInfo *cluster)
 	DbInfo	   *dbinfos;
 	int			i_datname,
 i_oid,
+i_datdba,
 i_encoding,
 i_datcollate,
 i_datctype,
@@ -351,9 +352,12 @@ get_db_infos(ClusterInfo *cluster)
 	char		query[QUERY_ALLOC];
 
 	snprintf(query, sizeof(query),
-			 "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, "
+			 "SELECT d.oid, d.datname, u.rolname, d.encoding, "
+			 "d.datcollate, d.datctype, "
 			 "%s AS spclocation "
 			 "FROM pg_catalog.pg_database d "
+			 " JOIN pg_catalog.pg_authid u "
+			 " ON d.datdba = u.oid "
 			 " LEFT OUTER JOIN pg_catalog.pg_tablespace t "
 			 " ON d.dattablespace = t.oid "
 			 "WHERE d.datallowconn = true "
@@ -367,6 +371,7 @@ get_db_infos(ClusterInfo *cluster)
 
 	i_oid = PQfnumber(res, "oid");
 	i_datname = PQfnumber(res, "datname");
+	i_datdba = PQfnumber(res, "rolname");
 	i_encoding = PQfnumber(res, "encoding");
 	i_datcollate = PQfnumber(res, "datcollate");
 	i_datctype = PQfnumber(res, "datctype");
@@ -379,6 +384,7 @@ get_db_infos(ClusterInfo *cluster)
 	{
 		dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
 		dbinfos[tupnum].db_name = pg_strdup(PQgetvalue(res, tupnum, i_datname));
+		dbinfos[tupnum].db_owner = pg_strdup(PQgetvalue(res, tupnum, i_datdba));
 		dbinfos[tupnum].db_encoding = atoi(PQgetvalue(res, tupnum, i_encoding));
 		dbinfos[tupnum].db_collate = pg_strdup(PQgetvalue(res, tupnum, i_datcollate));
 		dbinfos[tupnum].db_ctype = pg_strdup(PQgetvalue(res, tupnum, i_datctype));
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index e23b8ca..8fd9a13 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -378,18 +378,36 @@ create_new_objects(void)
 		 * propagate its database-level properties.
 		 */
 		if (strcmp(old_db->db_name, "postgres") == 0)
-			create_opts = "--clean --create";
+		{
+			parallel_exec_prog(log_file_name,
+			   NULL,
+			   "\"%s/pg_restore\" %s --exit-on-error "
+			   "--verbose --clean --create "
+			   "--dbname template1 \"%s\"",
+			   new_cluster.bindir,
+			   cluster_conn_opts(&new_cluster),
+			   sql_file_name);
+		}
 		else
-			create_opts = "--create";
-
-		parallel_exec_prog(log_file_name,
-		   NULL,
-		   "\"%s/pg_restore\" %s %s --exit-on-error --verbose "
-		   "--dbname template1 \"%s\"",
-		   new_cluster.bindir,
-		   cluster_conn_opts(&new_cluster),
-		   create_opts,
-		   sql_file_name);
+		{
+			exec_prog(log_file_name, NULL, true, true,
+	  "\"%s/createdb\" -O \"%s\" %s \"%s\"",
+	  new_cluster.bindir,
+	  old_db->db_owner,
+	  cluster_conn_opts(&new_cluster),
+	  old_db->db_name);
+			parallel_exec_prog(log_file_name,
+			   NULL,
+			   "\"%s/pg_restore\" %s --exit-on-error "
+			   "--verbose "
+			   "--dbname \"%s\" \"%s\"",
+			   new_cluster.bindir,
+			   cluster_conn_opts(&new_cluster),
+			   old_db->db_name,
+			   sql_file_name);
+		}
+
+
 	}
 
 	/* reap all children */
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 919a784..a3cda97 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -177,6 +177,7 @@ typedef struct
 {
 	Oid			db_oid;			/* oid of the database */
 	char	   *db_name;		/* database name */
+	char	   *db_owner;		/* database owner */
 	char		db_tablespace[MAXPGPATH];	/* database default tablespace
 			 * path */
 	char	   *db_collate;


Re: pg_upgrade failing for 200+ million Large Objects

2021-03-20 Thread Jan Wieck

On 3/20/21 11:23 AM, Tom Lane wrote:

Jan Wieck  writes:

All that aside, the entire approach doesn't scale.


Yeah, agreed.  When we gave large objects individual ownership and ACL
info, it was argued that pg_dump could afford to treat each one as a
separate TOC entry because "you wouldn't have that many of them, if
they're large".  The limits of that approach were obvious even at the
time, and I think now we're starting to see people for whom it really
doesn't work.


It actually looks more like some users have millions of "small objects". 
I am still wondering where that is coming from and why they are abusing 
LOs in that way, but that is more out of curiosity. The fact is that they 
are out there and that they cannot upgrade from their 9.5 databases, 
which are now past EOL.




I wonder if pg_dump could improve matters cheaply by aggregating the
large objects by owner and ACL contents.  That is, do

select distinct lomowner, lomacl from pg_largeobject_metadata;

and make just *one* BLOB TOC entry for each result.  Then dump out
all the matching blobs under that heading.


What I am currently experimenting with is moving the BLOB TOC entries 
into the parallel data phase of pg_restore "when doing binary upgrade". 
It seems to scale nicely with the number of cores in the system. In 
addition to that I have options for pg_upgrade and pg_restore that cause 
the restore to batch them into transactions, like 10,000 objects at a 
time. There was a separate thread for that, but I guess it is better to 
keep it all together here now.




A possible objection is that it'd reduce the ability to restore blobs
selectively, so maybe we'd need to make it optional.


I fully intend to make all this into new "options". I am afraid that 
there is no one-size-fits-all solution here.


Of course, that just reduces the memory consumption on the client
side; it does nothing for the locks.  Can we get away with releasing the
lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?


I'm not very fond of the idea of going lockless while at the same time 
trying to parallelize the restore phase. That can lead to really nasty 
race conditions. For now I'm aiming at batches in transactions.



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-19 Thread Jan Wieck

On 3/8/21 11:58 AM, Tom Lane wrote:

The answer up to now has been "raise max_locks_per_transaction enough
so you don't see the failure".  Having now consumed a little more
caffeine, I remember that that works in pg_upgrade scenarios too,
since the user can fiddle with the target cluster's postgresql.conf
before starting pg_upgrade.

So it seems like the path of least resistance is

(a) make pg_upgrade use --single-transaction when calling pg_restore

(b) document (better) how to get around too-many-locks failures.


That would first require fixing how pg_upgrade creates the 
databases. It uses "pg_restore --create", which is mutually exclusive 
with --single-transaction because we cannot create a database inside of 
a transaction. On the way pg_upgrade also mangles the pg_database.datdba 
(all databases are owned by postgres after an upgrade; will submit a 
separate patch for that as I consider that a bug by itself).


All that aside, the entire approach doesn't scale.

In a hacked-up pg_upgrade that does "createdb" first, before calling 
pg_restore with --single-transaction, I can upgrade 1M large objects with

max_locks_per_transaction = 5300
max_connections = 100

which contradicts the docs. I need to find out where that math went off 
the rails, because that config should only have room for 530,000 locks, 
not 1M. The same test fails with max_locks_per_transaction = 5200.
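
For reference, the documented sizing of the shared lock table, which is 
where the 530,000 figure comes from (assuming max_prepared_transactions = 0; 
the observed headroom beyond that is what needs explaining):

\[
\text{max\_locks\_per\_transaction} \times (\text{max\_connections} + \text{max\_prepared\_transactions}) = 5300 \times (100 + 0) = 530{,}000 \text{ lock slots}
\]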


But this would mean that one has to set max_locks_per_transaction in 
postgresql.conf to something like 530,000 (at 100 max_connections) 
in order to actually run a successful upgrade of 100M large objects. 
This config requires 26GB of memory just for locks. Add to that the 
memory pg_restore needs to load the entire TOC before even restoring a 
single object.


Not going to work. But tests are still ongoing ...


Regards, Jan


--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: Extensibility of the PostgreSQL wire protocol

2021-03-04 Thread Jan Wieck

On 3/4/21 7:38 PM, Hannu Krosing wrote:

On Thu, Mar 4, 2021 at 9:55 PM Jan Wieck  wrote:

but the whole thing was developed that way from the beginning and
it is working. I don't have a definitive date when that code will be
presented. Kuntal or Prateek may be able to fill in more details.


Are you really fully replacing the main loop, or are you running a second
main loop in parallel in the same database server instance, perhaps as
a separate TDS_postmaster backend ?

Will the data still also be accessible "as postgres" via port 5432 when
TDS/SQLServer support is active ?


The individual backend (session) is running a different main loop. A 
libpq based client will still get the regular libpq and the original 
PostgresMain() behavior on port 5432. The default port for TDS is 1433 
and with everything in place I can connect to the same database on that 
port with Microsoft's SQLCMD.


The whole point of all this is to allow the postmaster to listen to more 
than just 5432 and have different communication protocols on those 
*additional* ports. Nothing is really *replaced*. The parts of the 
backend, that do actual socket communication, are just routed through 
function pointers so that an extension can change their behavior.



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: PROXY protocol support

2021-03-04 Thread Jan Wieck

On 3/4/21 3:40 PM, Magnus Hagander wrote:

On Thu, Mar 4, 2021 at 9:29 PM Jan Wieck  wrote:

This looks like it would only need a few extra protocol messages to be
understood by the backend. It might be possible to implement that with
the loadable wire protocol extensions proposed here:

https://commitfest.postgresql.org/32/3018/


Actually the whole point of it is that it *doesn't* need any new
protocol messages. And that it *wraps* whatever is there, definitely
doesn't replace it. It should equally be wrapping whatever an
extension uses.

So while the base topic is not unrelated, I don't think there is any
overlap between these.


I might be missing something here, but isn't sending some extra, 
informational *header*, which is understood by the backend, in essence a 
protocol extension?



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: Extensibility of the PostgreSQL wire protocol

2021-03-04 Thread Jan Wieck

On 3/3/21 2:43 PM, Peter Eisentraut wrote:


I think, the way the abstractions are chosen in this patch, it is still
very much tied to how the libpq protocol works.  For example, there is a
cancel key and a ready-for-query message.  Some of the details of the
simple and the extended query are exposed.  So you could create a
protocol that has a different way of encoding the payloads, as your
telnet example does, but I don't believe that you could implement a
competitor's protocol through this.  Unless you have that done and want
to show it?



Correct, a lot of what this patch does is to allow a developer of such a 
protocol extension to just "extend" what the server-side libpq does, by 
building a wrapper around the functions they are interested in. That 
doesn't change the protocol, but rather allows additional functionality 
like the telemetry data gathering Fabrizio was talking about.


The telnet_srv tutorial extension (which needs more documentation) is an 
example of how far one can go by replacing those functions, in that it 
actually implements a very different wire protocol. This one still fits 
into the regular libpq message flow.


Another possibility, and this is what is being used by the AWS team 
implementing the TDS protocol for Babelfish, is to completely replace 
the entire TCOP mainloop function PostgresMain(). That is of course a 
rather drastic move and requires a lot more coding on the extension 
side, but the whole thing was developed that way from the beginning and 
it is working. I don't have a definitive date when that code will be 
presented. Kuntal or Prateek may be able to fill in more details.



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: PROXY protocol support

2021-03-04 Thread Jan Wieck

On 3/4/21 2:45 PM, Jacob Champion wrote:

On Thu, 2021-03-04 at 10:42 +0900, Tatsuo Ishii wrote:

Is there any formal specification for the "a protocol common and very
light weight in proxies"?


See

 https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt

which is maintained by HAProxy Technologies.

--Jacob



This looks like it would only need a few extra protocol messages to be 
understood by the backend. It might be possible to implement that with 
the loadable wire protocol extensions proposed here:


https://commitfest.postgresql.org/32/3018/


Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: Extensibility of the PostgreSQL wire protocol

2021-02-24 Thread Jan Wieck

On 2/19/21 10:13 AM, Jan Wieck wrote:


Give the function, that postmaster is calling to accept a connection
when a server_fd is ready, a return code that it can use to tell
postmaster "forget about it, don't fork or do anything else with it".
This function is normally calling StreamConnection() before the
postmaster then forks the backend. But it could instead hand over the
socket to the pool background worker (I presume Jonah is transferring
them from process to process via UDP packet). The pool worker is then
launching the actual backends which receive a requesting client via the
same socket transfer to perform one or more transactions, then hand the
socket back to the pool worker.


The function in question, StreamConnection(), which with this patch can be 
replaced with an extension function via the fn_accept pointer, already has 
that capability. If StreamConnection() or its replacement returns a NULL 
pointer, the postmaster just skips calling BackendStartup(). So everything 
is already in place for the above to work.



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: Extensibility of the PostgreSQL wire protocol

2021-02-22 Thread Jan Wieck

On 2/22/21 1:01 PM, Tom Lane wrote:

Yeah, and as I pointed out somewhere upthread, trying to replace the
whole parser will just end in a maintenance nightmare.  The constructs
that the parser has to emit are complex, Postgres-specific, and
constantly evolving.  We are NOT going to promise any sort of cross
version compatibility for parse trees.


Absolutely agreed. We cannot promise that the parsetree generated in one 
version will work with the planner, optimizer and executor of the next. 
These types of projects will need to pay close attention and more 
importantly, develop their own regression test suites that detect when 
something has changed in core. That said, discussion about the parser 
hook should happen in the other thread.


I don't even expect that we can guarantee that the functions I am trying 
to allow to be redirected for the wire protocol will be stable forever. 
libpq V4 may need to change some of the call signatures, which has 
happened before. For example, the function that sends the command 
completion message to the frontend (EndCommand() in tcop/dest.c) changed 
between versions 12 and 13.



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: Extensibility of the PostgreSQL wire protocol

2021-02-22 Thread Jan Wieck

On 2/10/21 1:10 PM, Tom Lane wrote:

What I'm actually more concerned about, in this whole line of development,
is the follow-on requests that will surely occur to kluge up Postgres
to make its behavior more like $whatever.  As in "well, now that we
can serve MySQL clients protocol-wise, can't we pretty please have a
mode that makes the parser act more like MySQL".


Those requests will naturally follow. But I don't see it as the main 
project's responsibility to satisfy them. It would be rather natural to 
develop the two things together. The same developer or group of 
developers, who are trying to connect a certain client, will want to 
have other compatibility features.


As Jim Mlodgenski just posted in [0], having the ability to also extend 
and/or replace the parser will give them the ability to do just that.



Regards, Jan

[0] 
https://www.postgresql.org/message-id/CAB_5SReoPJAPO26Z8+WN6ugfBb2UDc3c21rRz9=bzibmcap...@mail.gmail.com



--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: Extensibility of the PostgreSQL wire protocol

2021-02-22 Thread Jan Wieck

On 2/19/21 4:36 AM, Kuntal Ghosh wrote:

On Thu, Feb 18, 2021 at 9:32 PM Jan Wieck  wrote:



Few comments in the extension code (although experimental):

1. In telnet_srv.c,
+ static int    pe_port;
..
+   DefineCustomIntVariable("telnet_srv.port",
+   "Telnet server port.",
+   NULL,
+   &pe_port,
+   pe_port,
+   1024,
+   65536,
+   PGC_POSTMASTER,
+   0,
+   NULL,
+   NULL,
+   NULL);

The variable pe_port should be initialized to a value which is > 1024
and < 65536. Otherwise, the following assert will fail,
TRAP: FailedAssertion("newval >= conf->min", File: "guc.c", Line:
5541, PID: 12100)

2. The function pq_putbytes shouldn't be used by anyone other than
old-style COPY out.
+   pq_putbytes(msg, strlen(msg));
Otherwise, the following assert will fail in the same function:
 /* Should only be called by old-style COPY OUT */
 Assert(DoingCopyOut);



Attached are an updated patch and telnet_srv addressing the above problems.
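
Presumably the fix amounts to something like the following sketch 
(illustrative only, not the attached patch; the 54323 default simply 
matches the example configuration posted with the first patch):

#include "postgres.h"
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

/* boot value must lie inside the min/max range or the GUC Assert fires */
static int  pe_port = 54323;

void
_PG_init(void)
{
    DefineCustomIntVariable("telnet_srv.port",
                            "Telnet server port.",
                            NULL,
                            &pe_port,
                            54323,      /* bootValue, within the allowed range */
                            1024,       /* minValue */
                            65535,      /* maxValue; 65536 is past the TCP port range */
                            PGC_POSTMASTER,
                            0,
                            NULL,       /* check_hook */
                            NULL,       /* assign_hook */
                            NULL);      /* show_hook */
}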


Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services


telnet_srv_2021-02-22_1.tgz
Description: application/compressed-tar


wire_proto_2021-02-22_1.diff.gz
Description: application/gzip


Re: Extensibility of the PostgreSQL wire protocol

2021-02-22 Thread Jan Wieck

On 2/22/21 7:34 AM, Dave Cramer wrote:

Yes, when did it become a good idea to put a connection pooler in the 
backend???


Dave Cramer
www.postgres.rocks


As Alvaro said, different pool purposes lead to different pool 
architecture and placement.


However, the changes proposed here, aiming at the ability to load 
modified or entirely different wire protocol handlers, do not limit such 
connection pooling. To the contrary.


Any connection pool that wants to maintain more client connections than 
actual database backends must know when it is appropriate to do so. 
Usually the right moment to break the current client-backend association 
is when the backend is outside a transaction block and waiting for the 
next client request. To detect that, pools cannot blindly shovel data back 
and forth. They need to scan, one way or another, for the backend's 'Z' 
message, sent in tcop/dest.c ReadyForQuery(), where the backend also 
reports the current transaction state. IOW the pool must follow the flow 
of libpq messages on all connections, message by message, row by row, 
just for the purpose of seeing that one single bit. It is possible to 
transmit that information to the pool on a separate channel.
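
To make that concrete, a minimal sketch (not pgbouncer or PostgreSQL 
source) of walking a buffer of backend-to-client messages and pulling out 
the ReadyForQuery status byte; it assumes the buffer starts on a message 
boundary:

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/*
 * Returns the transaction status from the first ReadyForQuery ('Z')
 * message found: 'I' idle, 'T' in transaction, 'E' failed transaction.
 * Returns 0 if no complete 'Z' message is in the buffer.
 */
static char
scan_for_ready_for_query(const uint8_t *buf, size_t len)
{
    size_t  pos = 0;

    while (pos + 5 <= len)              /* type byte + 4-byte length */
    {
        uint8_t     type = buf[pos];
        uint32_t    msglen;

        memcpy(&msglen, buf + pos + 1, 4);
        msglen = ntohl(msglen);         /* length counts itself, not the type byte */

        if (pos + 1 + msglen > len)
            break;                      /* message not complete yet */

        if (type == 'Z')
            return (char) buf[pos + 5]; /* the single status byte */

        pos += 1 + msglen;
    }
    return 0;
}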



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Jan Wieck

On 2/19/21 12:18 PM, Damir Simunic wrote:



On 19 Feb 2021, at 14:48, Heikki Linnakangas  wrote:

For example, there has been discussion elsewhere about integrating connection 
pooling into the server itself. For that, you want to have a custom process 
that listens for incoming connections, and launches backends independently of 
the incoming connections. These hooks would not help with that.



Not clear how the connection polling in the core is linked to discussing 
pluggable wire protocols.


It isn't per se. But there are things pluggable wire protocols can help 
with in regard to connection pooling. For example, a connection pool 
like pgbouncer can be configured to switch the client-backend association 
at the transaction level. It therefore scans the traffic for the backend's 
transaction state. This however only works if an application uses 
identical session state across all connections in a pool. The JDBC 
driver, for example, only really prepares PreparedStatements after a 
number of executions and then assigns them a name based on a counter. 
So it is neither guaranteed that a certain backend has the same 
statements prepared, nor that they are named the same. Therefore 
JDBC-based applications cannot use PreparedStatements through pgbouncer 
in transaction mode.


An "extended" libpq protocol could allow the pool to give clients a 
unique ID. The protocol handler would then maintain maps with the SQL of 
prepared statements and what the client thinks their prepared statement 
name is. So when a client sends a P packet, the protocol handler would 
lookup the mapping and see if it already has that statement prepared. 
Just add the mapping info or actually create a new statement entry in 
the maps. These maps are of course shared across backends. So if then 
another client sends bind+execute and the backend doesn't have a plan 
for that query, it would internally create one.
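
A rough sketch of what such map entries could look like (names and sizes 
are made up for illustration; this is not pgbouncer or PostgreSQL code):

#include <stdint.h>

/* Key: which client, and what that client calls the statement. */
typedef struct PoolStmtKey
{
    uint32_t    client_id;              /* unique ID the pool handed to the client */
    char        client_stmt_name[64];   /* name used in the client's Parse ('P') message */
} PoolStmtKey;

/* Value: the statement text and the name a backend actually knows it by. */
typedef struct PoolStmtEntry
{
    PoolStmtKey key;
    char       *query_sql;              /* SQL text captured from the Parse message */
    char        server_stmt_name[64];   /* pool-managed server-side name */
} PoolStmtEntry;

/*
 * On Parse: look up (client_id, client_stmt_name) and add or reuse an entry.
 * On Bind/Execute: if the chosen backend has not prepared server_stmt_name
 * yet, replay a Parse with query_sql before forwarding the Bind.
 */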


There are security implications here, so things like the search path 
might have to be part of the maps, but those are implementation details.


At the end this would allow a project like pgbouncer to create an 
extended version of libpq protocol that caters to the very special needs 
of that pool.


Most of that would of course be possible on the pool side itself. But 
the internal structure of pgbouncer isn't suitable for that. It is very 
lightweight and for long SQL queries may never have the complete 'P' 
message in memory. It would also not have direct access to security 
related information like the search path, which would require extra 
round trips between the pool and the backend to retrieve it.


So while not suitable to create a built in pool by itself, loadable wire 
protocols can definitely help with connection pooling.


I also am not sure if building a connection pool into a background 
worker or the postmaster is a good idea to begin with. One of the important 
features of a pool is to be able to suspend traffic and make the server 
completely idle, for example in order to restart the postmaster without 
forcibly disconnecting all clients. A pool built into a background 
worker cannot do that.



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Jan Wieck

On 2/19/21 8:48 AM, Heikki Linnakangas wrote:

I can see value in supporting different protocols. I don't like the
approach discussed in this thread, however.

For example, there has been discussion elsewhere about integrating
connection pooling into the server itself. For that, you want to have a
custom process that listens for incoming connections, and launches
backends independently of the incoming connections. These hooks would
not help with that.


The two are not mutually exclusive. You are right that the current 
proposal would not help with that type of built-in connection pool, but 
it could be extended to cover that.


Give the function, that postmaster is calling to accept a connection 
when a server_fd is ready, a return code that it can use to tell 
postmaster "forget about it, don't fork or do anything else with it". 
This function is normally calling StreamConnection() before the 
postmaster then forks the backend. But it could instead hand over the 
socket to the pool background worker (I presume Jonah is transferring 
them from process to process via UDP packet). The pool worker is then 
launching the actual backends which receive a requesting client via the 
same socket transfer to perform one or more transactions, then hand the 
socket back to the pool worker.


All of that would still require a protocol extension that has special 
messages for "here is a client socket for you" and "you can have that 
back".




I would recommend this approach: write a separate program that sits
between the client and PostgreSQL, speaking custom protocol to the
client, and libpq to the backend. And then move that program into a
background worker process.


That is a classic protocol-converting proxy. It has been done in the 
past with not really good results, both performance-wise and with respect 
to protocol completeness.



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Jan Wieck
Thank you Kuntal,

On Fri, Feb 19, 2021 at 4:36 AM Kuntal Ghosh 
wrote:

> On Thu, Feb 18, 2021 at 9:32 PM Jan Wieck  wrote:
>
>
> Few comments in the extension code (although experimental):
>
> 1. In telnet_srv.c,
> + static int    pe_port;
> ..
> +   DefineCustomIntVariable("telnet_srv.port",
> +   "Telnet server
> port.",
> +   NULL,
> +   &pe_port,
> +   pe_port,
> +   1024,
> +   65536,
> +   PGC_POSTMASTER,
> +   0,
> +   NULL,
> +   NULL,
> +   NULL);
>
> The variable pe_port should be initialized to a value which is > 1024
> and < 65536. Otherwise, the following assert will fail,
> TRAP: FailedAssertion("newval >= conf->min", File: "guc.c", Line:
> 5541, PID: 12100)
>
>
Right, forgot to turn on Asserts.


> 2. The function pq_putbytes shouldn't be used by anyone other than
> old-style COPY out.
> +   pq_putbytes(msg, strlen(msg));
> Otherwise, the following assert will fail in the same function:
> /* Should only be called by old-style COPY OUT */
> Assert(DoingCopyOut);
>

I would argue that the Assert needs to be changed. It is obvious that the
Assert in place is meant to guard against direct usage of pq_putbytes() in
an attempt to force all code to use pq_putmessage() instead. This is good
when speaking the libpq wire protocol, since all messages there are prefixed
with a one-byte message type. It does not apply to other protocols.

I propose to create another global boolean, IsNonLibpqFrontend, which the
protocol extension will set to true when accepting the connection, and the
above then will change to

Assert(DoingCopyOut || IsNonLibpqFrontend);


Regards, Jan



>
> --
> Thanks & Regards,
> Kuntal Ghosh
> Amazon Web Services
>


-- 
Jan Wieck


Re: Extensibility of the PostgreSQL wire protocol

2021-02-18 Thread Jan Wieck
Attached are a first patch and a functioning extension that implements a
telnet protocol server.

The extension needs to be loaded via shared_preload_libraries and
configured for a port number and listen_addresses as follows:

shared_preload_libraries = 'telnet_srv'

telnet_srv.listen_addresses = '*'
telnet_srv.port = 54323

It is incomplete in that it doesn't address things like the COPY protocol.
But it is enough to give a more detailed idea of what this interface will
look like and what someone would do to implement their own protocol or
extend an existing one.

The overall idea here is to route all functions that communicate with the
frontend through function pointers that hang off of MyProcPort. Since we
are performing socket communication in them, I believe one extra function
pointer indirection is unlikely to have a significant performance impact.
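
A minimal sketch of the kind of indirection meant here (struct and member 
names are illustrative, not the actual patch):

#include <stddef.h>

/* Per-connection table of the routines that talk to the frontend. */
typedef struct WireProtocolRoutines
{
    int     (*read_command)(void *port, void *inbuf);      /* read next frontend message */
    int     (*put_message)(void *port, char msgtype,
                           const char *data, size_t len);   /* send one message */
    int     (*flush)(void *port);                           /* flush buffered output */
    void    (*comm_reset)(void *port);                      /* reset after an error */
} WireProtocolRoutines;

/*
 * Instead of calling the libpq routines directly, code that talks to the
 * frontend would go through the table hanging off the connection, e.g.
 *
 *     MyProcPort->proto->flush(MyProcPort);
 *
 * so a loaded protocol module can substitute its own implementations.
 */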

Best Regards, Jan
On behalf of Amazon Web Services





On Sun, Feb 14, 2021 at 12:36 PM Dave Cramer 
wrote:

>
>
> On Thu, 11 Feb 2021 at 09:28, Robert Haas  wrote:
>
>> On Wed, Feb 10, 2021 at 2:04 PM Tom Lane  wrote:
>> > That is a spot-on definition of where I do NOT want to end up.  Hooks
>> > everywhere and enormous extensions that break anytime we change anything
>> > in the core.  It's not really clear that anybody is going to find that
>> > more maintainable than a straight fork, except to the extent that it
>> > enables the erstwhile forkers to shove some of their work onto the PG
>> > community.
>>
>> +1.
>>
>> Making the lexer and parser extensible seems desirable to me. It would
>> be beneficial not only for companies like EDB and Amazon that might
>> want to extend the grammar in various ways, but also for extension
>> authors. However, it's vastly harder than Jan's proposal to make the
>> wire protocol pluggable. The wire protocol is pretty well-isolated
>> from the rest of the system. As long as you can get queries out of the
>> packets the client sends and package up the results to send back, it's
>> all good.
>
>
> I would have to disagree that the wire protocol is well-isolated. Sending
> and receiving are not in a single file
> The codes are not even named constants so trying to find a specific one is
> difficult.
>
> Anything that would clean this up would be a benefit
>
>
> That being said, I'm not in favor of transferring maintenance work to
>> the community for this set of hooks any more than I am for something
>> on the parsing side. In general, I'm in favor of as much extensibility
>> as we can reasonably create, but with a complicated proposal like this
>> one, the community should expect to be able to get something out of
>> it. And so far what I hear Jan saying is that these hooks could in
>> theory be used for things other than Amazon's proprietary efforts and
>> those things could in theory bring benefits to the community, but
>> there are no actual plans to do anything with this that would benefit
>> anyone other than Amazon. Which seems to bring us right back to
>> expecting the community to maintain things for the benefit of
>> third-party forks.
>>
>
> if this proposal brought us the ability stream results that would be a
> huge plus!
>
> Dave Cramer
> www.postgres.rocks
>
>>
>>

-- 
Jan Wieck


wire_proto_2021-02-18_1.diff.gz
Description: application/gzip


telnet_srv_2021-02-18_1.tgz
Description: application/compressed-tar


Re: Extensibility of the PostgreSQL wire protocol

2021-02-10 Thread Jan Wieck
On Wed, Feb 10, 2021 at 11:43 AM Robert Haas  wrote:

> On Mon, Jan 25, 2021 at 10:07 AM Jan Wieck  wrote:
> > Our current plan is to create a new set of API calls and hooks that
> allow to register additional wire protocols. The existing backend libpq
> implementation will be modified to register itself using the new API. This
> will serve as a proof of concept as well as ensure that the API definition
> is not slanted towards a specific protocol. It is also similar to the way
> table access methods and compression methods are added.
>
> If we're going to end up with an open source implementation of
> something useful in contrib or whatever, then I think this is fine.
> But, if not, then we're just making it easier for Amazon to do
> proprietary stuff without getting any benefit for the open-source
> project. In fact, in that case PostgreSQL would ensure have to somehow
> ensure that the hooks don't get broken without having any code that
> actually uses them, so not only would the project get no benefit, but
> it would actually incur a small tax. I wouldn't say that's an
> absolutely show-stopper, but it definitely isn't my first choice.
>

At this very moment there are several parts to this. One is the hooks to
make wire protocols into loadable modules, which is what this effort is
about. Another is the TDS protocol as it is being implemented for Babelfish,
and the third is the Babelfish extension itself. Both of those will require
additional hooks and APIs that I am not going to address here. I consider
them not material to my effort.

As for making the wire protocol itself expandable I really see a lot of
potential outside of what Amazon wants here. And I would not be advertising
it if it were for Babelfish alone. As I laid out, just the ability for
a third party to add additional messages for special connection pool
support would be enough to make it useful. There also have been discussions
in the JDBC subproject to combine certain messages into one single message.
Why not allow the JDBC project to develop their own, JDBC-optimized backend
side? Last but not least, what would be wrong with listening for MariaDB
clients?

I am planning on a follow-up project to this: demoting libpq itself to just
another loadable protocol, just the way procedural languages are all on the
same level, because that is how I developed the loadable procedural
language handler all those years ago.

Considering how spread out and quite frankly unorganized our wire protocol
handling is, this is not a small order.


Regards, Jan








>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>


-- 
Jan Wieck


Re: Extensibility of the PostgreSQL wire protocol

2021-01-25 Thread Jan Wieck
Hi Jonah,

On Mon, Jan 25, 2021 at 10:18 AM Jonah H. Harris 
wrote:

> On Mon, Jan 25, 2021 at 10:07 AM Jan Wieck  wrote:
>
>> The following is a request for discussion and comments, not a refined
>> proposal accompanied by a working patch.
>>
>
> After implementing this three different ways inside the backend over the
> years, I landed on almost this identical approach for handling the MySQL,
> TDS, MongoDB, and Oracle protocols for NEXTGRES.
>

Could any of that be open sourced? It would be an excellent addition to add
one of those as example code.


Regards, Jan



>
> Initially, each was implemented as an background worker extension which
> had to handle its own networking, passing the fd off to new
> protocol-specific connections, etc. This worked, but duplicate a good
> amount of logic. It would be great to have a standard, loadable, way to add
> support for a new protocol.
>
> --
> Jonah H. Harris
>
>

-- 
Jan Wieck


Extensibility of the PostgreSQL wire protocol

2021-01-25 Thread Jan Wieck
The following is a request for discussion and comments, not a refined
proposal accompanied by a working patch.

As recently publicly announced Amazon Web Services is working on Babelfish,
a set of extensions that will allow PostgreSQL to be compatible with other
database systems. One part of this will be an extension that allows
PostgreSQL to listen on a secondary port and process a different wire
protocol. The first extension we are creating in this direction is handling
of the Tabular Data Stream (TDS), used by Sybase and Microsoft SQL-Server
databases. It is more efficient to build an extension, that can handle the
TDS protocol inside the backend, than creating a proxy process that
translates from TDS to libpq protocol and back.

Creating the necessary infrastructure in the postmaster and backend will
open up more possibilities that are not tied to our compatibility efforts.
Possible use cases for wire protocol extensibility include the development
of a completely new, not backwards compatible PostgreSQL protocol or
extending the existing wire protocol with things like 3rd party connection
pool specific features (like transfer of file descriptors between pool and
working backend for example).

Our current plan is to create a new set of API calls and hooks that allow
to register additional wire protocols. The existing backend libpq
implementation will be modified to register itself using the new API. This
will serve as a proof of concept as well as ensure that the API definition
is not slanted towards a specific protocol. It is also similar to the way
table access methods and compression methods are added.

A wire protocol extension will be a standard PostgreSQL dynamic loadable
extension module. The wire protocol extensions to load will be listed in
the shared_preload_libraries GUC. The extension's Init function will
register a hook function to be called where the postmaster is currently
creating the libpq server sockets. This hook callback will then create the
server sockets and register them for monitoring via select(2) in the
postmaster main loop, using a new API function. Part of the registration
information are callback functions to invoke for accepting and
authenticating incoming connections, error reporting as well as a function
that will implement the TCOP loop for the protocol. Ongoing work on the TDS
protocol has shown us that different protocols make it desirable to have
separate implementations of the TCOP loop. The TCOP function will return
only after the connection has been terminated. Fortunately half the
interface already exists since the sending of result sets is implemented
via callback functions that are registered as the dest receiver, which
works pretty well in our current code.
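
In outline, a protocol extension's initialization could then look roughly
like this (RegisterWireProtocol() and the callback names are made up to
illustrate the description above, not an existing API):

/* Callback bundle a protocol module hands to the hypothetical registry. */
typedef struct WireProtocolCallbacks
{
    void    (*listen_init)(void);           /* create server sockets, register their fds */
    int     (*accept_conn)(int server_fd);  /* accept and authenticate a client */
    void    (*report_error)(int code, const char *message);
    void    (*main_loop)(int client_fd);    /* protocol-specific TCOP loop; returns
                                             * only when the connection terminates */
} WireProtocolCallbacks;

extern void RegisterWireProtocol(const char *name,
                                 const WireProtocolCallbacks *callbacks);

/* this module's implementations, defined elsewhere */
extern void tds_create_sockets(void);
extern int  tds_accept(int server_fd);
extern void tds_report_error(int code, const char *message);
extern void tds_main_loop(int client_fd);

void
_PG_init(void)
{
    static const WireProtocolCallbacks tds_callbacks = {
        .listen_init = tds_create_sockets,
        .accept_conn = tds_accept,
        .report_error = tds_report_error,
        .main_loop = tds_main_loop,
    };

    RegisterWireProtocol("tds", &tds_callbacks);
}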


Regards, Jan

-- 
Jan Wieck
Principal Database Engineer
Amazon Web Services


Re: Performance improvement of WAL writing?

2019-08-28 Thread Jan Wieck
On Wed, Aug 28, 2019 at 2:43 AM Moon, Insung 
wrote:

> So what about modifying the XLogWrite function only to write the size
> that should record?
> Can this idea benefit from WAL writing performance?
> If it's OK to improve, I want to do modification.
> How do you think of it?
>

I have performed tests in this direction several years ago.

The way it works now guarantees that the data of recycled WAL segments is
never read from disk, as long as the WAL block size is a multiple of the
filesystem's page/fragment size. The OS sees that the write() is on a
fragment boundary and full fragment(s) in size. If the write() were smaller,
the OS would be forced to combine the new data with the rest of the
fragment's old data on disk. To do so, the system has to wait until the old
fragment is actually in the OS buffer cache. Which means that once you have
enough WAL segments to cycle through, so that the checkpoint reason is never
XLOG and the blocks of the WAL segment being cycled in have been evicted
since that segment was last used, this causes real reads. On spinning media,
which are still an excellent choice for WAL, this turns into a total
disaster because it adds a rotational delay for every single WAL block that
is only partially overwritten.
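
A small illustration of the alignment property described above (not
XLogWrite() itself; WAL_BLCKSZ stands in for XLOG_BLCKSZ and is assumed to
match the filesystem fragment size, and seg_buf is assumed to hold the
segment's in-memory image from offset zero):

#include <stdint.h>
#include <unistd.h>

#define WAL_BLCKSZ  8192

/*
 * Write the byte range [start, end) of the segment, rounded out to whole
 * WAL blocks.  Because the write() covers only complete, block-aligned
 * fragments, the kernel never has to read the recycled segment's old
 * contents from disk to complete a partial block.
 */
static ssize_t
write_wal_aligned(int fd, const char *seg_buf, uint64_t start, uint64_t end)
{
    uint64_t    from = start - (start % WAL_BLCKSZ);
    uint64_t    upto = end + (WAL_BLCKSZ - 1);

    upto -= upto % WAL_BLCKSZ;

    return pwrite(fd, seg_buf + from, (size_t) (upto - from), (off_t) from);
}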

I believe that this could only work if we stopped recycling WAL segments
and instead delete old segments, then create new files as needed. This
however adds the overhead of constant metadata change to WAL writing and I
have no idea what performance or reliability impact that may have. There
were reasons we chose to implement WAL segment recycling many years ago.
These reasons may no longer be valid on modern filesystems, but it
definitely is not a performance question alone.

Regards, Jan

-- 
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info


Re: lazy detoasting

2018-04-10 Thread Jan Wieck
Maybe I'm missing something here, but let me put $.02 in anyway.

TOAST reuses entries. If a toasted value is unchanged on UPDATE (i.e. the
toast pointer didn't get replaced by a new value), the new tuple points to
the same toast slices as the old. If it is changed, the current transaction
DELETEs the old toast slices and INSERTs new ones (if needed).

If your session has a transaction snapshot that protects the old toast
slices from being vacuumed away, you are fine, even under READ COMMITTED
(that is why toast fetches use that other visibility rule, so the slices
don't go away when the concurrent UPDATE commits and rather behave like
REPEATABLE READ).

At least that is what the code was intended to do originally.
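
For reference, a simplified sketch of what an out-of-line toast pointer
actually carries (modeled on varatt_external in the PostgreSQL headers;
field names can differ between versions). There is nothing snapshot-related
in it, which is why the snapshot used when the slices are fetched matters:

#include <stdint.h>

typedef uint32_t Oid;

typedef struct varatt_external_sketch
{
    int32_t     va_rawsize;     /* original (uncompressed) data size */
    int32_t     va_extsize;     /* size of the external, possibly compressed, data */
    Oid         va_valueid;     /* unique ID of the value within the TOAST table */
    Oid         va_toastrelid;  /* OID of the TOAST table holding the slices */
} varatt_external_sketch;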


Regards, Jan



On Tue, Apr 10, 2018 at 6:24 PM, Chapman Flack <c...@anastigmatix.net>
wrote:

> On 04/10/2018 05:04 PM, Chapman Flack wrote:
> > ... I wonder if
> > there's at least a cheap way to check a particular snapshot
> > for suitability wrt a given toast pointer. Check a couple usual
> > suspects, find one most of the time, fall back to eager detoasting
> > otherwise?
> >
> > Guess I need to go back for a deeper dive on just what constitutes
> > a toast pointer. I was skimming last time
>
> I see all I have in a toast pointer is a relation id and a value
> oid, so probably no way to check that a given snapshot 'works' to
> find it, at least no more cheaply than by opening the toast relation
> and trying to find it.
>
> Welp, what tuptoaster does to construct its SnapshotToast is to
> grab GetOldestSnapshot():
>
> *  since we don't know which one to use, just use the oldest one.
> *  This is safe: at worst, we will get a "snapshot too old" error
> *  that might have been avoided otherwise.
>
> ... which means, I take it, that a more recent snapshot
> might work, but this conservative choice would only fail
> if the oldest snapshot has really been around for many days,
> under typical old_snapshot_threshold settings?
>
> ... and if I'm getting a value from a portal that happens to have
> a non-null holdSnapshot, then that would be the one to use, in
> preference to just conservatively grabbing the oldest?
>
> -Chap
>
>
> ... am I right that the init_toast_snapshot construction is really
> making only one change to the snapshot data, namely changing the
> 'satisfies' condition to HeapTupleSatisfiesToast ?
>
> The lsn and whenTaken seem to be fetched from the original data
> and stored right back without change.
>
>


-- 
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info