proposal: plpgsql: special comments that will be part of AST

2021-07-23 Thread Pavel Stehule
Hi,

As the author of plpgsql_check, I constantly have to work around a generic
problem: successful static validation requires external information that is
not available in the source code.

create or replace function foo(tablename text)
returns void as $$
declare r record;
begin
  execute format('select * from %I', tablename) into r;
  raise notice '%', r.x;
end;
$$ language plpgsql;

Static validation is not possible for even this very simple code.
Currently, there is no way to attach extra information to the source code
that helps with static analysis, but that neither breaks evaluation without
plpgsql_check nor causes any significant slowdown.

Earlier I proposed Ada-style language pragmas, but that was rejected.

I implemented fake pragmas in plpgsql_check via the arguments of a special
function:

PERFORM plpgsql_check_pragma('plpgsql_check: off');

It works, but it looks bizarre, and it requires the fake function
plpgsql_check_pragma to exist in production, so it is a somewhat dirty solution.

Now, I have two alternative proposals.

a) Can we introduce new syntax for comments that would be stored in the AST
as a non-executing statement?

something like:

//* plpgsql_check: OFF *//
RAISE NOTICE '%', r.x

or, as a second design:

b) Can we introduce a flag for the plpgsql parser that allows storing
comments in the AST (it would not be the default)?

Then "-- plpgsql_check: OFF" would work for my purpose, and I don't need any
special syntax.

Comments, notes?

Pavel

https://github.com/okbob/plpgsql_check


Re: A micro-optimisation for ProcSendSignal()

2021-07-23 Thread Soumyadeep Chakraborty
Hi Thomas,

On Tue, Jul 20, 2021 at 10:40 PM Thomas Munro  wrote:

> > Slightly tangential: we should add a comment to PGPROC.pgprocno, for more
> > immediate understandability:
> >
> > + int pgprocno; /* index of this PGPROC in ProcGlobal->allProcs */
>
> I wonder why we need this member anyway, when you can compute it from
> the address... #define GetPGProcNumber(p) ((p) - ProcGlobal->allProcs)
> or something like that?  Kinda wonder why we don't use
> GetPGProcByNumber() in more places instead of open-coding access to
> ProcGlobal->allProcs, too...

I tried this out. See attached v4 of your patch with these changes.
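The pointer-arithmetic identity behind a GetPGProcNumber()-style macro can be modeled outside of C, e.g. with Python's ctypes. This is purely illustrative: the PGPROC struct here is a made-up stand-in, not the real layout, and the helper name is hypothetical.

```python
import ctypes

# Stand-in for PostgreSQL's PGPROC; only the memory layout matters here.
class PGPROC(ctypes.Structure):
    _fields_ = [("pgprocno", ctypes.c_int), ("pid", ctypes.c_int)]

def get_pgproc_number(all_procs, proc_ptr):
    """Model of GetPGProcNumber(p): (p) - ProcGlobal->allProcs."""
    return (ctypes.addressof(proc_ptr.contents)
            - ctypes.addressof(all_procs)) // ctypes.sizeof(PGPROC)

all_procs = (PGPROC * 8)()        # model of ProcGlobal->allProcs
p = ctypes.pointer(all_procs[5])  # a pointer into the array
print(get_pgproc_number(all_procs, p))  # -> 5
```

The point is that the index is recoverable from the pointer alone, so a stored pgprocno member is redundant in principle.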

> > Also, why don't we take the opportunity to get rid of 
> > SERIALIZABLEXACT->pid? We
> > took a stab. Attached is v2 of your patch with these changes.
>
> SERIALIZABLEXACT objects can live longer than the backends that
> created them.  They hang around to sabotage other transactions' plans,
> depending on what else they overlapped with before they committed.
> With that change, the pg_locks view might show the pid of some
> unrelated session that moves into the same PGPROC.

I see.

>
> It's only an "informational" pid, and pids are imperfect information
> anyway because (1) they are themselves recycled, and (2) they won't be
> interesting in a hypothetical multi-threaded future.  One solution
> would be to hide the pids from the view after the backend disconnects
> (somehow -- add a generation number?), but they're also still kinda
> useful, despite the weaknesses.  I wonder what the ideal way would be
> to refer to sessions, anyway, including those that are no longer
> active; perhaps we could invent a new "session ID" concept.

Updating the pg_locks view:

Yes, the pids may be valuable for future debugging/audit purposes. Also, on
systems where pid_max is high enough to avoid wraparound, pids are not
recycled. I would lean towards showing the pid even after the backend has
exited.

Perhaps we could overload the stored pid by negating it (i.e. a backend
with pid 2 would be stored as -2) to indicate that the pid belongs to
a backend that has exited. Additionally, we could introduce a boolean
field in pg_locks, "backendisalive", so that the end user doesn't have
to reason about negative pids.
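The sign-flip encoding could look something like this (a sketch with hypothetical helper names, not actual PostgreSQL code):

```python
def mark_backend_exited(stored_pid: int) -> int:
    """Negate the pid to flag that the owning backend has exited."""
    return -stored_pid if stored_pid > 0 else stored_pid

def displayed_pid(stored_pid: int) -> int:
    """What the view would still show: the original pid."""
    return abs(stored_pid)

def backend_is_alive(stored_pid: int) -> bool:
    """The proposed 'backendisalive' boolean, derived from the sign."""
    return stored_pid > 0

stored = mark_backend_exited(4242)
print(stored, displayed_pid(stored), backend_is_alive(stored))  # -4242 4242 False
```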

Session ID:

Interesting, Greenplum uses the concept of session ID pervasively. Being
a distributed database, the session ID helps to tie individual backends
on each PG instance to the same session. The session ID of course comes
with its share of bookkeeping:

* These IDs are incrementally dished out with a counter
  (with pg_atomic_add_fetch_u32), in increments of 1, on the Greenplum
  coordinator PG instance in InitProcess.

* The counter is a part of ProcGlobal and itself is initialized to 0 in
  InitProcGlobal, which means that session IDs are reset on cluster
  restart.

* The session ID tied to each process is maintained in PGPROC.

* The session ID finds its way into PgBackendStatus, which is further
  reported by pg_stat_get_activity.

A session ID seems a bit heavy just to indicate whether a backend has
exited.
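In outline, the counter-based bookkeeping above amounts to the following sketch (names are made up; a lock stands in for pg_atomic_add_fetch_u32):

```python
import threading

class SessionCounter:
    """Cluster-wide session-ID dispenser, reset to 0 at 'restart'."""
    def __init__(self):
        self._value = 0                # initialized to 0, as in InitProcGlobal
        self._lock = threading.Lock()  # stand-in for the atomic fetch-and-add

    def next_session_id(self) -> int:
        with self._lock:
            self._value += 1           # dished out in increments of 1
            return self._value

counter = SessionCounter()
print([counter.next_session_id() for _ in range(3)])  # [1, 2, 3]
```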

Regards,
Soumyadeep
From c964a977ca41c057737ca34420e002dc85663eb0 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 11 Mar 2021 23:09:11 +1300
Subject: [PATCH v4 1/1] Optimize ProcSendSignal().

Instead of referring to target backends by pid, use pgprocno.  This
means that we don't have to scan the ProcArray, and we can drop some
special case code for dealing with the startup process.

Discussion: https://postgr.es/m/CA%2BhUKGLYRyDaneEwz5Uya_OgFLMx5BgJfkQSD%3Dq9HmwsfRRb-w%40mail.gmail.com
Reviewed-by: Soumyadeep Chakraborty 
Reviewed-by: Ashwin Agrawal 
---
 src/backend/access/transam/clog.c |  2 +-
 src/backend/access/transam/twophase.c |  3 +-
 src/backend/access/transam/xlog.c |  3 +-
 src/backend/postmaster/bgwriter.c |  2 +-
 src/backend/postmaster/pgarch.c   |  2 +-
 src/backend/storage/buffer/buf_init.c |  3 +-
 src/backend/storage/buffer/bufmgr.c   | 10 ++--
 src/backend/storage/ipc/procarray.c   |  6 +--
 src/backend/storage/lmgr/condition_variable.c | 12 ++---
 src/backend/storage/lmgr/lwlock.c |  6 +--
 src/backend/storage/lmgr/predicate.c  |  6 ++-
 src/backend/storage/lmgr/proc.c   | 50 ++-
 src/backend/utils/adt/lockfuncs.c |  1 +
 src/include/storage/buf_internals.h   |  2 +-
 src/include/storage/lock.h|  2 +-
 src/include/storage/predicate_internals.h |  1 +
 src/include/storage/proc.h| 12 ++---
 17 files changed, 42 insertions(+), 81 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 3ea16a270a8..d0557f6c555 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -464,7 +464,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 
 		if 

Re: Followup Timestamp to timestamp with TZ conversion

2021-07-23 Thread Robert Haas
On Fri, Jul 23, 2021 at 6:18 PM Tom Lane  wrote:
> For btree indexes, you need a compatible notion of ordering, not only
> equality.  That's really what's breaking my hypothetical case of a uint
> type.  But as long as you implement operators that behave in a consistent
> fashion, whether they interpret the same heap bitpattern the same is not
> something that matters for constructing a consistent operator family.
> datetime_ops (which includes timestamp and timestamptz) is already a
> counterexample, since unless the timezone is UTC, its operators *don't*
> all agree on what a particular bitpattern means.

Well, that depends a bit on what "means" means. I would argue that the
meaning does not depend on the timezone setting, and that the timezone
merely controls the way that values are printed. That is, I would say
that the meaning is the point in time which the timestamp represents,
considered as an abstract concept, and timezone is merely the user's
way of asking that point in time to be expressed in a way that's easy
for them to understand. Regardless of that philosophical point, I
think it must be true that if a and b are timestamps, a < b implies
a::timestamptz < b::timestamptz, and a > b implies a::timestamptz >
b::timestamptz, and the other way around, which is surely good enough,
but wasn't the case in your example. So the question I suppose is
whether in your example it's really legitimate to include your uint
type in the integer_ops opfamily. I said above that I didn't think so,
but I'm less sure now, because I realize that I hadn't read what you wrote
carefully enough, and it's not as artificial as I first thought.
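The order-preservation claim is easy to check mechanically: interpreting naive timestamps in a fixed session timezone is strictly monotonic. A Python sketch, with the conversion function standing in for a::timestamptz:

```python
from datetime import datetime, timezone, timedelta

def to_timestamptz(ts: datetime, utc_offset_hours: int) -> datetime:
    """Model of a::timestamptz: interpret a naive timestamp in a zone."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return ts.replace(tzinfo=tz)

a = datetime(2021, 7, 23, 12, 0)
b = datetime(2021, 7, 23, 18, 0)
# a < b implies a::timestamptz < b::timestamptz under one session timezone
print(a < b, to_timestamptz(a, -7) < to_timestamptz(b, -7))  # True True
```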

> >> ... I'm also a bit confused about how it ever succeeds at all.
>
> > Well, you can change just the typemod, for example, which was a case
> > that motivated this work originally.
>
> Ah, right.  I guess binary-compatible cases such as text and varchar
> would also fit into that.

Oh, right. That was another one of the cases that motivated that work.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: visibility map corruption

2021-07-23 Thread Bruce Momjian
On Fri, Jul 23, 2021 at 05:47:18PM -0700, Peter Geoghegan wrote:
> > I could perhaps see corruption happening if pg_control's oldest xid
> > value was closer to the current xid value than it should be, but I can't
> > see how having it 2-billion away could cause harm, unless perhaps
> > pg_upgrade itself used enough xids to cause the counter to wrap more
> > than 2^31 away from the oldest xid recorded in pg_control.
> >
> > What I am basically asking is how to document this and what it fixes.
> 
> ISTM that this is a little like commits 78db307bb2 and a61daa14. Maybe
> take a look at those?

Agreed.  I just wanted to make sure I wasn't missing an important aspect
of this patch.  Thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: visibility map corruption

2021-07-23 Thread Peter Geoghegan
On Fri, Jul 23, 2021 at 5:08 PM Bruce Momjian  wrote:
> However, I am now stuck on the commit message text, and I think this is
> the point Peter Geoghegan was trying to make earlier --- while we know
> that preserving the oldest xid in pg_control is the right thing to do,
> and that setting it to the current xid - 2 billion (the old behavior)
> causes vacuum freeze to run on all tables, but what else does this patch
> affect?

As far as I know the only other thing that it might affect is the
traditional use of pg_resetwal: recovering likely-corrupt data.
Getting the database to limp along for long enough to pg_dump. That is
the only interpretation that makes sense, because the code in question
predates pg_upgrade.

AFAICT that was the original spirit of the code that we're changing here.

> As far as I know, seeing a very low oldest xid causes autovacuum to
> check all objects and make sure their relfrozenxid is less than
> autovacuum_freeze_max_age, but isn't that just a check?  Would that
> cause any table scans?  I would think not.  And would this cause
> incorrect truncation of pg_xact or fsm or vm files?  I would think not
> too.

Tom actually wrote this code. I believe that he questioned the whole
basis of it himself quite recently.

Whether or not it's okay to change the behavior in contexts outside of
pg_upgrade (contexts where the user invokes pg_resetwal -x to get the
system to start) is perhaps debatable. It probably doesn't matter very
much if you preserve that behavior for non-pg_upgrade cases -- hard to
say. At the same time it's now easy to see that pg_upgrade shouldn't
be doing this.

> Even if the old and new cluster had mismatched autovacuum_freeze_max_age
> values, I don't see how that would cause any corruption either.

Sometimes the pg_control value for oldest XID is used as the oldest
non-frozen XID that's expected in the table. Other times it's
relfrozenxid itself IIRC.

> I could perhaps see corruption happening if pg_control's oldest xid
> value was closer to the current xid value than it should be, but I can't
> see how having it 2-billion away could cause harm, unless perhaps
> pg_upgrade itself used enough xids to cause the counter to wrap more
> than 2^31 away from the oldest xid recorded in pg_control.
>
> What I am basically asking is how to document this and what it fixes.

ISTM that this is a little like commits 78db307bb2 and a61daa14. Maybe
take a look at those?

-- 
Peter Geoghegan




Re: visibility map corruption

2021-07-23 Thread Bruce Momjian
On Thu, Jul  8, 2021 at 09:51:47AM -0400, Bruce Momjian wrote:
> On Thu, Jul  8, 2021 at 08:11:14AM -0500, Justin Pryzby wrote:
> > Also, the pg_upgrade status message still seems to be misplaced:
> > 
> > In 20210706190612.gm22...@telsasoft.com, Justin Pryzby wrote:
> > > I re-arranged the pg_upgrade output of that patch: it was in the middle 
> > > of the
> > > two halves: "Setting next transaction ID and epoch for new cluster"
> > 
> > +++ b/src/bin/pg_upgrade/pg_upgrade.c
> > @@ -473,6 +473,12 @@ copy_xact_xlog_xid(void)
> >   "\"%s/pg_resetwal\" -f -x %u \"%s\"",
> >   new_cluster.bindir, 
> > old_cluster.controldata.chkpnt_nxtxid,
> >   new_cluster.pgdata);
> > +   check_ok();
> > +   prep_status("Setting oldest XID for new cluster");
> > +   exec_prog(UTILITY_LOG_FILE, NULL, true, true,
> > + "\"%s/pg_resetwal\" -f -u %u \"%s\"",
> > + new_cluster.bindir, 
> > old_cluster.controldata.chkpnt_oldstxid,
> > + new_cluster.pgdata);
> > exec_prog(UTILITY_LOG_FILE, NULL, true, true,
> >   "\"%s/pg_resetwal\" -f -e %u \"%s\"",
> >   new_cluster.bindir, 
> > old_cluster.controldata.chkpnt_nxtepoch,
> 
> Wow, you are 100% correct.  Updated patch attached.

OK, I have the patch ready to apply to all supported Postgres versions,
and it passes all my cross-version pg_upgrade tests.

However, I am now stuck on the commit message text, and I think this is
the point Peter Geoghegan was trying to make earlier --- while we know
that preserving the oldest xid in pg_control is the right thing to do,
and that setting it to the current xid - 2 billion (the old behavior)
causes vacuum freeze to run on all tables, but what else does this patch
affect?

As far as I know, seeing a very low oldest xid causes autovacuum to
check all objects and make sure their relfrozenxid is less than
autovacuum_freeze_max_age, but isn't that just a check?  Would that
cause any table scans?  I would think not.  And would this cause
incorrect truncation of pg_xact or fsm or vm files?  I would think not
too.

Even if the old and new cluster had mismatched autovacuum_freeze_max_age
values, I don't see how that would cause any corruption either.

I could perhaps see corruption happening if pg_control's oldest xid
value was closer to the current xid value than it should be, but I can't
see how having it 2-billion away could cause harm, unless perhaps
pg_upgrade itself used enough xids to cause the counter to wrap more
than 2^31 away from the oldest xid recorded in pg_control.

What I am basically asking is how to document this and what it fixes.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Followup Timestamp to timestamp with TZ conversion

2021-07-23 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jul 23, 2021 at 5:47 PM Tom Lane  wrote:
>> Hmm.  Note that what this is checking for is same operator *class* not
>> same operator family (if it were doing the latter, Peter's case would
>> already work).  I think it has to do that.  Extending my previous
>> thought experiment about an unsigned integer type, if someone were to
>> invent one, it would make a lot of sense to include it in integer_ops,
>> and then the logic you suggest is toast.

> Mumble. I hadn't considered that sort of thing. I assumed that when
> the documentation and/or code comments talked about a compatible
> notion of equality, it was a strong enough notion of "compatible" to
> preclude this sort of case.

For btree indexes, you need a compatible notion of ordering, not only
equality.  That's really what's breaking my hypothetical case of a uint
type.  But as long as you implement operators that behave in a consistent
fashion, whether they interpret the same heap bitpattern the same is not
something that matters for constructing a consistent operator family.
datetime_ops (which includes timestamp and timestamptz) is already a
counterexample, since unless the timezone is UTC, its operators *don't*
all agree on what a particular bitpattern means.

>> ... I'm also a bit confused about how it ever succeeds at all.

> Well, you can change just the typemod, for example, which was a case
> that motivated this work originally.

Ah, right.  I guess binary-compatible cases such as text and varchar
would also fit into that.

regards, tom lane




Re: Configure with thread sanitizer fails the thread test

2021-07-23 Thread Tom Lane
Andres Freund  writes:
> On 2021-07-23 17:18:37 -0400, Tom Lane wrote:
>> Should we back-patch this, or just do it in HEAD?  Maybe HEAD+v14?

> I'm ok with all, with a preference for HEAD+v14, followed by HEAD, and
> all branches.

After a bit more thought, HEAD+v14 is also my preference.  I'm not
eager to change this in stable branches, but it doesn't seem too
late for v14.

regards, tom lane




Re: Followup Timestamp to timestamp with TZ conversion

2021-07-23 Thread Robert Haas
On Fri, Jul 23, 2021 at 5:47 PM Tom Lane  wrote:
> > You seemed to think my previous comments about comparing opfamilies
> > were hypothetical but I think we actually already have the
> > optimization Peter wants, and it just doesn't apply in this case for
> > lack of hacks.
>
> Hmm.  Note that what this is checking for is same operator *class* not
> same operator family (if it were doing the latter, Peter's case would
> already work).  I think it has to do that.  Extending my previous
> thought experiment about an unsigned integer type, if someone were to
> invent one, it would make a lot of sense to include it in integer_ops,
> and then the logic you suggest is toast.  (Obviously, the cross-type
> comparison operators you'd need to write would have to be careful,
> but you'd almost certainly wish to write them anyway.)

Mumble. I hadn't considered that sort of thing. I assumed that when
the documentation and/or code comments talked about a compatible
notion of equality, it was a strong enough notion of "compatible" to
preclude this sort of case. I'm not really sure why we shouldn't think
of it that way; the example you give here is reasonable, but
artificial.

> Given that we require the non-cross-type members of an opclass to be
> immutable, what this is actually doing may be safe.  At least I can't
> construct a counterexample after five minutes' thought.  On the other
> hand, I'm also a bit confused about how it ever succeeds at all.
> If we've changed the heap column's type, it should not be using the
> same opclass anymore (unless the opclass is polymorphic, but that
> case is rejected too).  I'm suspicious that this is just an expensive
> way of writing "we can only preserve indexes that aren't on the
> column that changed type".

Well, you can change just the typemod, for example, which was a case
that motivated this work originally. People wanted to be able to make
varchar(10) into varchar(20) without doing a ton of work, and I think
this lets that work. I seem to recall that there are at least a few
cases that involve actually changing the type as well, but at 6pm on a
Friday evening when I haven't looked at this in years, I can't tell
you what they are off the top of my head.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Configure with thread sanitizer fails the thread test

2021-07-23 Thread Andres Freund
Hi,

On 2021-07-23 17:18:37 -0400, Tom Lane wrote:
> I'm not prepared to go that far just yet; but certainly we can stop
> expending configure cycles on the case.
> 
> Should we back-patch this, or just do it in HEAD?  Maybe HEAD+v14?

I'm ok with all, with a preference for HEAD+v14, followed by HEAD, and
all branches.

Greetings,

Andres Freund




Re: CREATE SEQUENCE with RESTART option

2021-07-23 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hi

I have applied and run your patch, which works fine in my environment. 
Regarding your comments in the patch:

/*
 * Restarting a sequence while defining it doesn't make any sense
 * and it may override the START value. Allowing both START and
 * RESTART option for CREATE SEQUENCE may cause confusion to user.
 * Hence, we throw error for CREATE SEQUENCE if RESTART option is
 * specified. However, it can be used with ALTER SEQUENCE.
 */

I would remove the first sentence, because it seems like a personal opinion to 
me. I am sure someone, somewhere may think it makes total sense :).

I would rephrase like this:

/* 
 * Allowing both START and RESTART option for CREATE SEQUENCE 
 * could override the START value and cause confusion to user. Hence, 
 * we throw an error for CREATE SEQUENCE if RESTART option is
 * specified; it can only be used with ALTER SEQUENCE.
 */

just a thought.

thanks!

-
Cary Huang
HighGo Software Canada
www.highgo.ca

Re: Followup Timestamp to timestamp with TZ conversion

2021-07-23 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jul 23, 2021 at 2:07 PM Tom Lane  wrote:
>> However, for the reasons I explained before, there are no general-purpose
>> cases where we can skip an index build on a type-changed column, so
>> there is no place to insert a similar hack for the timestamp[tz] case.

> Wouldn't the hack just go into CheckIndexCompatible()?

Oh!  I went looking for code to skip rebuilding indexes during ALTER TYPE,
but I guess I looked in the wrong place, because I missed that somehow.

> You seemed to think my previous comments about comparing opfamilies
> were hypothetical but I think we actually already have the
> optimization Peter wants, and it just doesn't apply in this case for
> lack of hacks.

Hmm.  Note that what this is checking for is same operator *class* not
same operator family (if it were doing the latter, Peter's case would
already work).  I think it has to do that.  Extending my previous
thought experiment about an unsigned integer type, if someone were to
invent one, it would make a lot of sense to include it in integer_ops,
and then the logic you suggest is toast.  (Obviously, the cross-type
comparison operators you'd need to write would have to be careful,
but you'd almost certainly wish to write them anyway.)

Given that we require the non-cross-type members of an opclass to be
immutable, what this is actually doing may be safe.  At least I can't
construct a counterexample after five minutes' thought.  On the other
hand, I'm also a bit confused about how it ever succeeds at all.
If we've changed the heap column's type, it should not be using the
same opclass anymore (unless the opclass is polymorphic, but that
case is rejected too).  I'm suspicious that this is just an expensive
way of writing "we can only preserve indexes that aren't on the
column that changed type".

regards, tom lane




Re: .ready and .done files considered harmful

2021-07-23 Thread Bossart, Nathan
On 5/6/21, 1:01 PM, "Andres Freund"  wrote:
> If we leave history files and gaps in the .ready sequence aside for a
> second, we really only need an LSN or segment number describing the
> current "archive position". Then we can iterate over the segments
> between the "archive position" and the flush position (which we already
> know). Even if we needed to keep statting .ready/.done files (to handle
> gaps due to archive command mucking around with .ready/done), it'd still
> be a lot cheaper than what we do today.  It probably would even still be
> cheaper if we just statted all potentially relevant timeline history
> files all the time to send them first.

My apologies for chiming in so late to this thread, but a similar idea
crossed my mind while working on a bug where .ready files get created
too early [0].  Specifically, instead of maintaining a status file per
WAL segment, I was thinking we could narrow it down to a few files
to keep track of the boundaries we care about:

1. earliest_done: the oldest segment that has been archived and
   can be recycled/removed
2. latest_done: the newest segment that has been archived
3. latest_ready: the newest segment that is ready for archival

This might complicate matters for backup utilities that currently
modify the .ready/.done files, but it would simplify this archive
status stuff quite a bit and eliminate the need to worry about the
directory scans in the first place.
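A sketch of the three-boundary bookkeeping (a hypothetical structure, just to make the idea concrete):

```python
from dataclasses import dataclass

@dataclass
class ArchiveStatus:
    """Replaces per-segment .ready/.done files with three boundaries."""
    earliest_done: int  # oldest archived segment; can be recycled/removed
    latest_done: int    # newest segment that has been archived
    latest_ready: int   # newest segment that is ready for archival

    def segments_awaiting_archive(self):
        """Ready but not yet archived -- no directory scan needed."""
        return range(self.latest_done + 1, self.latest_ready + 1)

status = ArchiveStatus(earliest_done=3, latest_done=7, latest_ready=10)
print(list(status.segments_awaiting_archive()))  # [8, 9, 10]
```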

Nathan

[0] https://www.postgresql.org/message-id/flat/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com



Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-07-23 Thread Mark Dilger



> On Jul 23, 2021, at 2:04 PM, Mark Dilger  wrote:
> 
> If the GUC merely converts the event trigger into an error, then you have the 
> problem that the customer can create event triggers which the service 
provider will need to disable (because they cause the service provider's 
> legitimate actions to error rather than succeed).

I'd like to expound on this a little more.

Imagine the service provider has scripts that perform actions within the 
database, such as physical replication, or the creation and removal of database 
users in response to actions taken at the service portal web interface, and 
they don't want the actions performed by those scripts to be leveraged by the 
customer to break out of the jail.

The customer has event triggers which perform no illicit activities.  They 
don't try to break out of the jail.  But for compliance with HIPAA regulations 
(or whatever), they need to audit log everything, and they can't just have the 
service provider's actions unlogged.

What to do?  If the service provider disables the event triggers, then the 
customer will fail their regulation audit.  If the service provider allows the 
event triggers to fire, the customer might create a new event trigger embedding 
illicit actions.  The service provider is totally stuck.

OTOH, if there were a mechanism by which an event trigger could run with only 
the intersection of the privileges enjoyed by the service provider's scripts 
and the customer's event trigger owner, then the service provider can allow 
their own actions to be logged, without fear that any hijacking of their 
privilege will occur.
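The intersection idea can be stated very compactly (an abstract sketch of the semantics, not an implementation proposal; the privilege names are invented):

```python
def trigger_privileges(invoker_privs: set, owner_privs: set) -> set:
    """Run the event trigger with only privileges BOTH parties hold."""
    return invoker_privs & owner_privs

provider_script = {"CREATE ROLE", "DROP ROLE", "INSERT audit_log"}
trigger_owner = {"INSERT audit_log", "SELECT audit_log"}
# The trigger can still audit-log, but cannot hijack role management.
print(trigger_privileges(provider_script, trigger_owner))  # {'INSERT audit_log'}
```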

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Configure with thread sanitizer fails the thread test

2021-07-23 Thread Tom Lane
Andres Freund  writes:
> On 2021-07-23 13:42:41 -0400, Tom Lane wrote:
>> TBH, I wonder why we don't just nuke thread_test.c altogether.

> +1. And before long it might be time to remove support for systems
> without threads...

I'm not prepared to go that far just yet; but certainly we can stop
expending configure cycles on the case.

Should we back-patch this, or just do it in HEAD?  Maybe HEAD+v14?

regards, tom lane




Removing "long int"-related limit on hash table sizes

2021-07-23 Thread Tom Lane
Per the discussion at [1], users on Windows are seeing nasty performance
losses in v13/v14 (compared to prior releases) for hash aggregations that
required somewhat more than 2GB in the prior releases.  That's because
they spill to disk where they did not before.  The easy answer of "raise
hash_mem_multiplier" doesn't help, because on Windows the product of
work_mem and hash_mem_multiplier is clamped to 2GB, thanks to the ancient
decision to do a lot of memory-space-related calculations in "long int",
which is only 32 bits on Win64.
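The clamp is easy to reproduce arithmetically: with a 32-bit signed long, work_mem (in kB) times 1024 cannot represent more than 2GB-1 bytes. A simulation of the overflow (the real code does this in C long arithmetic):

```python
def to_int32(n: int) -> int:
    """Simulate wraparound of a 32-bit signed long (as on Win64)."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

work_mem_kb = 2 * 1024 * 1024        # work_mem = 2GB, expressed in kB
# hash_mem * 1024L with a 32-bit long: overflows past 2^31 - 1 bytes
print(to_int32(work_mem_kb * 1024))  # -> -2147483648
# the same computation in a 64-bit Size-style type is fine
print(work_mem_kb * 1024)            # -> 2147483648
```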

While I don't personally have the interest to fix that altogether,
it does seem like we've got a performance regression that we ought
to do something about immediately.  So I took a look at getting rid of
this restriction for calculations associated with hash_mem_multiplier,
and it doesn't seem to be too bad.  I propose the attached patch.
(This is against HEAD; there are minor conflicts in v13 and v14.)

A couple of notes:

* I did not change most of the comments referring to "hash_mem",
even though that's not really a thing anymore.  They seem readable
enough anyway, and I failed to think of a reasonably-short substitute.

* We should drop get_hash_mem() altogether in HEAD and maybe v14.
I figure we'd better leave it available in v13, though, in case
any outside code is using it.

Comments?

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/MN2PR15MB25601E80A9B6D1BA6F592B1985E39%40MN2PR15MB2560.namprd15.prod.outlook.com

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 5fd0b26cbc..c11427a1f6 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -165,14 +165,16 @@ BuildTupleHashTableExt(PlanState *parent,
 {
 	TupleHashTable hashtable;
 	Size		entrysize = sizeof(TupleHashEntryData) + additionalsize;
-	int			hash_mem = get_hash_mem();
+	Size		hash_mem_limit;
 	MemoryContext oldcontext;
 	bool		allow_jit;
 
 	Assert(nbuckets > 0);
 
 	/* Limit initial table size request to not more than hash_mem */
-	nbuckets = Min(nbuckets, (long) ((hash_mem * 1024L) / entrysize));
+	hash_mem_limit = get_hash_memory_limit() / entrysize;
+	if (nbuckets > hash_mem_limit)
+		nbuckets = hash_mem_limit;
 
 	oldcontext = MemoryContextSwitchTo(metacxt);
 
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 914b02ceee..39bea204d1 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1802,15 +1802,15 @@ hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 {
 	int			npartitions;
 	Size		partition_mem;
-	int			hash_mem = get_hash_mem();
+	Size		hash_mem_limit = get_hash_memory_limit();
 
 	/* if not expected to spill, use all of hash_mem */
-	if (input_groups * hashentrysize < hash_mem * 1024L)
+	if (input_groups * hashentrysize <= hash_mem_limit)
 	{
 		if (num_partitions != NULL)
 			*num_partitions = 0;
-		*mem_limit = hash_mem * 1024L;
-		*ngroups_limit = *mem_limit / hashentrysize;
+		*mem_limit = hash_mem_limit;
+		*ngroups_limit = hash_mem_limit / hashentrysize;
 		return;
 	}
 
@@ -1835,10 +1835,10 @@ hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 	 * minimum number of partitions, so we aren't going to dramatically exceed
 	 * work mem anyway.
 	 */
-	if (hash_mem * 1024L > 4 * partition_mem)
-		*mem_limit = hash_mem * 1024L - partition_mem;
+	if (hash_mem_limit > 4 * partition_mem)
+		*mem_limit = hash_mem_limit - partition_mem;
 	else
-		*mem_limit = hash_mem * 1024L * 0.75;
+		*mem_limit = hash_mem_limit * 0.75;
 
 	if (*mem_limit > hashentrysize)
 		*ngroups_limit = *mem_limit / hashentrysize;
@@ -1992,32 +1992,36 @@ static int
 hash_choose_num_partitions(double input_groups, double hashentrysize,
 		   int used_bits, int *log2_npartitions)
 {
-	Size		mem_wanted;
-	int			partition_limit;
+	Size		hash_mem_limit = get_hash_memory_limit();
+	double		partition_limit;
+	double		mem_wanted;
+	double		dpartitions;
 	int			npartitions;
 	int			partition_bits;
-	int			hash_mem = get_hash_mem();
 
 	/*
 	 * Avoid creating so many partitions that the memory requirements of the
 	 * open partition files are greater than 1/4 of hash_mem.
 	 */
 	partition_limit =
-		(hash_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+		(hash_mem_limit * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
 		HASHAGG_WRITE_BUFFER_SIZE;
 
 	mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
 
 	/* make enough partitions so that each one is likely to fit in memory */
-	npartitions = 1 + (mem_wanted / (hash_mem * 1024L));
+	dpartitions = 1 + (mem_wanted / hash_mem_limit);
+
+	if (dpartitions > partition_limit)
+		dpartitions = partition_limit;
 
-	if (npartitions > partition_limit)
-		npartitions = partition_limit;
+	if (dpartitions < HASHAGG_MIN_PARTITIONS)
+		dpartitions = HASHAGG_MIN_PARTITIONS;
+	if (dpartitions > HASHAGG_MAX_PARTITIONS)
+		dpartitions = 

Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-07-23 Thread Mark Dilger



> On Jul 23, 2021, at 1:57 PM, Mark Dilger  wrote:
> 
> What's the point in having these as separate roles if they can circumvent 
> each other's authority?

That was probably too brief a reply, so let me try again.  If the GUC 
circumvents the event trigger, then my answer above stands.  If the GUC merely 
converts the event trigger into an error, then you have the problem that the 
customer can create event triggers which the service provider will need to 
disable (because they cause the service provider's legitimate actions to error 
rather than succeed).  Presumably the service provider can disable them logged 
in as superuser.  But that means the service customer has their event trigger 
turned off, at least for some length of time, which is not good if the event 
trigger is performing audit logging for compliance purposes, etc.  Also, we 
can't say whether pg_network_security role has been given to the customer, or 
if that is being kept for the provider's use only, so we're not really sure 
whether pg_network_security should be able to do these sorts of things, but in 
the case that the service provider is keeping pg_network_security for themself, 
it seems they wouldn't want the customer to cause pg_network_security 
operations to fail.  We can't make too many assumptions about the exact 
relationship between those two roles.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-07-23 Thread Mark Dilger



> On Jul 23, 2021, at 1:54 PM, Robert Haas  wrote:
> 
> Yeah, but you're inventing a system for allowing the restriction on a
> GUC to be something other than is-superuser in the very patch we're
> talking about. So it could be something like is-database-security.
> Therefore I don't grok the objection.

I'm not objecting to how hard it would be to implement.  I'm objecting to the 
semantics.  If the only non-superuser who can set the GUC is 
pg_database_security, then it is absolutely worthless in preventing 
pg_database_security from trapping actions performed by pg_network_security 
members.  On the other hand, if pg_network_security can also set the GUC, then 
pg_network_security can circumvent audit logging that pg_database_security put 
in place.  What's the point in having these as separate roles if they can 
circumvent each other's authority?


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-07-23 Thread Robert Haas
On Fri, Jul 23, 2021 at 12:11 PM Mark Dilger
 wrote:
> A superuser-only GUC for this is also a bit too heavy handed.

Yeah, but you're inventing a system for allowing the restriction on a
GUC to be something other than is-superuser in the very patch we're
talking about. So it could be something like is-database-security.
Therefore I don't grok the objection.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Followup Timestamp to timestamp with TZ conversion

2021-07-23 Thread Robert Haas
On Fri, Jul 23, 2021 at 2:07 PM Tom Lane  wrote:
> Yes, I'm very well aware of that optimization.  While it's certainly
> a hack, it fits within a design that isn't a hack, ie that there are
> common, well-defined cases where we can skip the table rewrite.
> However, for the reasons I explained before, there are no general-purpose
> cases where we can skip an index build on a type-changed column, so
> there is no place to insert a similar hack for the timestamp[tz] case.

Wouldn't the hack just go into CheckIndexCompatible()?

You seemed to think my previous comments about comparing opfamilies
were hypothetical but I think we actually already have the
optimization Peter wants, and it just doesn't apply in this case for
lack of hacks.

Maybe I am missing something.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Have I found an interval arithmetic bug?

2021-07-23 Thread Bruce Momjian
On Fri, Jul 23, 2021 at 10:55:11AM -0700, Bryn Llewellyn wrote:
> SELECT
>   '1.2345 months 1.2345 days 1.2345 seconds'::interval = 
>   '1 month 1 day 1 second'::interval*1.2345;
> 
> In 13.3, the result is TRUE. (I know that this doesn’t guarantee that the
> internal representations of the two compared interval values are the same. But
> it’s a necessary condition for the outcome that I’m referring to and serves to
> indicate the point I’m making. A more careful test can be made.)

So you are saying fractional unit output should match multiplication
output?  It doesn't now for all units:

SELECT interval '1.3443 years';
    interval
---------------
 1 year 4 mons

SELECT interval '1 years' * 1.3443;
             ?column?
----------------------------------
 1 year 4 mons 3 days 22:45:07.2

It is true this patch is further reducing that matching.  Do people
think I should make them match as part of this patch?
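
For reference, the two code paths above reduce to this arithmetic (a rounding-model sketch, not the actual datetime.c code; DAYS_PER_MONTH = 30 per the PostgreSQL convention):

```python
MONTHS_PER_YEAR = 12
DAYS_PER_MONTH = 30
SECS_PER_DAY = 86400

def parse_years(y):
    # interval '1.3443 years': the fractional year is rounded to whole
    # months, with no further spill into days.
    return (round(y * MONTHS_PER_YEAR), 0, 0.0)

def multiply_years(y, factor):
    # interval '1 years' * 1.3443: the fraction cascades
    # months -> days -> seconds.
    total_months = y * MONTHS_PER_YEAR * factor
    months = int(total_months)
    day_part = (total_months - months) * DAYS_PER_MONTH
    days = int(day_part)
    secs = (day_part - days) * SECS_PER_DAY
    return (months, days, secs)
```

parse_years(1.3443) gives 16 whole months (1 year 4 mons), while multiply_years(1, 1.3443) keeps the remainder: 16 months, 3 days, and 81907.2 seconds, i.e. 22:45:07.2.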

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: [Patch] ALTER SYSTEM READ ONLY

2021-07-23 Thread Robert Haas
On Thu, Jun 17, 2021 at 1:23 AM Amul Sul  wrote:
> Attached is rebase for the latest master head.  Also, I added one more
> refactoring code that deduplicates the code setting database state in the
> control file. The same code set the database state is also needed for this
> feature.

I started studying 0001 today and found that it rearranged the order
of operations in StartupXLOG() more than I was expecting. It does, as
per previous discussions, move a bunch of things to the place where we
now call XLogParamters(). But, unsatisfyingly, InRecovery = false and
XLogReaderFree() then have to move down even further. Since the goal
here is to get to a situation where we sometimes XLogAcceptWrites()
after InRecovery = false, it didn't seem nice for this refactoring
patch to still end up with a situation where this stuff happens while
InRecovery = true. In fact, with the patch, the amount of code that
runs with InRecovery = true actually *increases*, which is not what I
think should be happening here. That's why the patch ends up having to
adjust SetMultiXactIdLimit to not Assert(!InRecovery).

And then I started to wonder how this was ever going to work as part
of the larger patch set, because as you have it here,
XLogAcceptWrites() takes arguments XLogReaderState *xlogreader,
XLogRecPtr EndOfLog, and TimeLineID EndOfLogTLI and if the
checkpointer is calling that at a later time after the user issues
pg_prohibit_wal(false), it's going to have none of those things. So I
had a quick look at that part of the code and found this in
checkpointer.c:

XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);

For those following along from home, the additional "true" is a bool
needChkpt argument added to XLogAcceptWrites() by 0003. Well, none of
this is very satisfying. The whole purpose of passing the xlogreader
is so we can figure out whether we need a checkpoint (never mind the
question of whether the existing algorithm for determining that is
really sensible) but now we need a second argument that basically
serves the same purpose since one of the two callers to this function
won't have an xlogreader. And then we're passing the EndOfLog and
EndOfLogTLI as dummy values which seems like it's probably just
totally wrong, but if for some reason it works correctly there sure
don't seem to be any comments explaining why.

So I started doing a bit of hacking myself and ended up with the
attached, which I think is not completely the right thing yet but I
think it's better than your version. I split this into three parts.
0001 splits up the logic that currently decides whether to write an
end-of-recovery record or a checkpoint record and if the latter how
the checkpoint ought to be performed into two functions.
DetermineRecoveryXlogAction() figures out what we want to do, and
PerformRecoveryXlogAction() does it. It also moves the code to run
recovery_end_command and related stuff into a new function
CleanupAfterArchiveRecovery(). 0002 then builds on this by postponing
UpdateFullPageWrites(), PerformRecoveryXLogAction(), and
CleanupAfterArchiveRecovery() to just before we
XLogReportParameters(). Because of the refactoring done by 0001, this
is only a small amount of code movement. Because of the separation
between DetermineRecoveryXlogAction() and PerformRecoveryXlogAction(),
the latter doesn't need the xlogreader. So we can do
DetermineRecoveryXlogAction() at the same time as now, while the
xlogreader is available, and then we don't need it later when we
PerformRecoveryXlogAction(), because we already know what we need to
know. I think this is all fine as far as it goes.
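
The decide-then-perform split in 0001 can be sketched roughly like this (the function names follow the description above, but the inputs and return values are simplified and hypothetical, not the actual patch):

```python
from enum import Enum, auto

class XLogAction(Enum):
    NONE = auto()
    END_OF_RECOVERY_RECORD = auto()
    CHECKPOINT = auto()

def determine_recovery_xlog_action(promoted, need_checkpoint):
    # Runs while the xlogreader is still available, so it can inspect
    # whatever it needs in order to decide what to do later.
    if promoted:
        return XLogAction.END_OF_RECOVERY_RECORD
    if need_checkpoint:
        return XLogAction.CHECKPOINT
    return XLogAction.NONE

def perform_recovery_xlog_action(action):
    # Runs later, possibly from the checkpointer, and needs only the
    # previously computed decision -- no xlogreader required.
    if action is XLogAction.END_OF_RECOVERY_RECORD:
        return "end-of-recovery record written"
    if action is XLogAction.CHECKPOINT:
        return "checkpoint performed"
    return "nothing to do"
```

The point of the split is exactly that the second function can run in a process that never had the xlogreader.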

My 0003 is where I see some lingering problems. It creates
XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
need the xlogreader. But it doesn't really solve the problem of how
checkpointer.c would be able to call this function with proper
arguments. It is at least better in not needing two arguments to
decide what to do, but how is checkpointer.c supposed to know what to
pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
what to pass for EndOfLogTLI and EndOfLog? Actually, EndOfLog doesn't
seem too problematic, because that value has been stored in four (!)
places inside XLogCtl by this code:

LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;

XLogCtl->LogwrtResult = LogwrtResult;

XLogCtl->LogwrtRqst.Write = EndOfLog;
XLogCtl->LogwrtRqst.Flush = EndOfLog;

Presumably we could relatively easily change things around so that we
fish one of those values ... probably one of the "write" values ...
back out of XLogCtl instead of passing it as a parameter. That would
work just as well from the checkpointer as from the startup process,
and there seems to be no way for the value to change until after
XLogAcceptWrites() has been called, so it seems fine. But that doesn't
help for the other arguments. What I'm thinking is that we should just
arrange to store EndOfLogTLI and xlogaction into XLogCtl also, and
then 

Re: Configure with thread sanitizer fails the thread test

2021-07-23 Thread Andres Freund
Hi,

On 2021-07-23 13:42:41 -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > ... that said, I wonder why would we do this in the thread_test program
> > rather than in configure itself.  Wouldn't it make more sense for the
> > configure test to be skipped altogether (marking the result as
> > thread-safe) when running under thread sanitizer, if there's a way to
> > detect that?
> 
> TBH, I wonder why we don't just nuke thread_test.c altogether.
> Is it still useful in 2021?  Machines that still need
> --disable-thread-safety can doubtless be counted without running
> out of fingers, and I think their owners can be expected to know
> that they need that.

+1. And before long it might be time to remove support for systems
without threads...

Greetings,

Andres Freund




Re: Fix memory leak when output postgres_fdw's "Relations"

2021-07-23 Thread Ranier Vilela
On Fri, Jul 23, 2021 at 11:32 AM, Tom Lane  wrote:

> Ranier Vilela  writes:
> > This is a minor leak, oversight in
> >
> https://github.com/postgres/postgres/commit/4526951d564a7eed512b4a0ac3b5893e0a115690#diff-e399f5c029192320f310a79f18c20fb18c8e916fee993237f6f82f05dad851c5
>
> I don't think you understand how Postgres memory management works.
>
Maybe not yet. Valgrind may not understand it yet, either.


> There's no permanent leak here, just till the end of the command;
> so it's pretty doubtful that there's any need to expend cycles on
> an explicit pfree.
>
Maybe.

==30691== 24 bytes in 1 blocks are definitely lost in loss record 123 of 469
==30691==at 0x8991F0: MemoryContextAlloc (mcxt.c:893)
==30691==by 0x899F29: MemoryContextStrdup (mcxt.c:1291)
==30691==by 0x864E09: RelationInitIndexAccessInfo (relcache.c:1419)
==30691==by 0x865F81: RelationBuildDesc (relcache.c:1175)
==30691==by 0x868575: load_critical_index (relcache.c:4168)
==30691==by 0x8684A0: RelationCacheInitializePhase3 (relcache.c:3980)
==30691==by 0x88047A: InitPostgres (postinit.c:1031)
==30691==by 0x773F12: PostgresMain (postgres.c:4081)
==30691==by 0x6F9C33: BackendRun (postmaster.c:4506)
==30691==by 0x6F96D8: BackendStartup (postmaster.c:4228)
==30691==by 0x6F8C08: ServerLoop (postmaster.c:1745)
==30691==by 0x6F747B: PostmasterMain (postmaster.c:1417)

Is this a false positive?

regards,
Ranier Vilela


Re: Followup Timestamp to timestamp with TZ conversion

2021-07-23 Thread Tom Lane
Peter Volk  writes:
> thanks for the reply, I do understand that if a rewrite of the table
> needs to be avoided the binary image needs to be the same. Since PG 12
> there is an optimisation to avoid a rewrite of timestamp columns if
> they are converted to timestamp with tz and the target tz offset is 0

Yes, I'm very well aware of that optimization.  While it's certainly
a hack, it fits within a design that isn't a hack, ie that there are
common, well-defined cases where we can skip the table rewrite.
However, for the reasons I explained before, there are no general-purpose
cases where we can skip an index build on a type-changed column, so
there is no place to insert a similar hack for the timestamp[tz] case.
I'm unwilling to kluge up ALTER TYPE to the extent that would be needed
if the result would only be to handle this one case.

regards, tom lane




Re: pg_stats and range statistics

2021-07-23 Thread Egor Rogov

Hi Tomas,

On 12.07.2021 16:04, Tomas Vondra wrote:

On 7/12/21 1:10 PM, Egor Rogov wrote:

Hi,

thanks for the review and corrections.

On 11.07.2021 21:54, Soumyadeep Chakraborty wrote:

Hello,

This should have been added with [1].

Excerpt from the documentation:
"pg_stats is also designed to present the information in a more readable
format than the underlying catalog — at the cost that its schema must
be extended whenever new slot types are defined for pg_statistic." [2]

So, I added a reminder in pg_statistic.h.

Good point.



Attached is v2 of this patch with some cosmetic changes.

I wonder why "TODO: catalog version bump"? This patch doesn't change
catalog structure, or I miss something?


It changes system_views.sql, which is catalog change, as it redefines
the pg_stats system view (it adds 3 more columns). So it changes what
you get after initdb, hence catversion has to be bumped.


Renamed the columns a
bit and updated the docs to be a bit more descriptive.
(range_length_empty_frac -> empty_range_frac, range_bounds_histogram ->
range_bounds_histograms)

I intended to make the same prefix ("range_") for all columns concerned
with range types, although I'm fine with the proposed naming.


Yeah, I'd vote to change empty_range_frac -> range_empty_frac.


One question:

We do have the option of representing the histogram of lower bounds
separately
from the histogram of upper bounds, as two separate view columns.
Don't know if
there is much utility though and there is a fair bit of added
complexity: see
below. Thoughts?

I thought about it too, and decided not to transform the underlying data
structure. As far as I can see, pg_stats never employed such
transformations. For example, STATISTIC_KIND_DECHIST is an array
containing the histogram followed by the average in its last element. It
is shown in pg_stats.elem_count_histogram as is, although it arguably
may be split into two fields. All in all, I believe pg_stats's job is
to "unpack" stavalues and stanumbers into meaningful fields, and not to
try to go deeper than that.


No firm opinion, but pg_stats is meant to be easier for humans to
read/understand. So far the transformations were simple
because all the data was fairly simple, but the range stuff may need
a more complex transformation.

For example we do quite a bit more in pg_stats_ext views, because it
deals with multi-column stats.



In pg_stats_ext, yes, but not in pg_stats (at least until now).

Since no one has expressed a strong desire for a more complex 
transformation, should we proceed with the proposed approach (with 
further renaming empty_range_frac -> range_empty_frac as you suggested)? 
Or should we wait more for someone to weigh in?






regards






Re: Configure with thread sanitizer fails the thread test

2021-07-23 Thread Bruce Momjian
On Fri, Jul 23, 2021 at 01:42:41PM -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > ... that said, I wonder why would we do this in the thread_test program
> > rather than in configure itself.  Wouldn't it make more sense for the
> > configure test to be skipped altogether (marking the result as
> > thread-safe) when running under thread sanitizer, if there's a way to
> > detect that?
> 
> TBH, I wonder why we don't just nuke thread_test.c altogether.
> Is it still useful in 2021?  Machines that still need
> --disable-thread-safety can doubtless be counted without running
> out of fingers, and I think their owners can be expected to know
> that they need that.

I think it is fine to remove it.  It was really designed to just try a
lot of flags to see what the compiler required, but things have become
much more stable since it was written.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Have I found an interval arithmetic bug?

2021-07-23 Thread Bryn Llewellyn
> On 23-Jul-2021, at 08:05, Bruce Momjian  wrote:
> 
> On Thu, Jul 22, 2021 at 03:17:52PM -0700, Zhihong Yu wrote:
>> On Thu, Jul 22, 2021 at 2:59 PM Zhihong Yu  wrote:
>>Hi,
>> 
>>-   tm->tm_mon += (fval * MONTHS_PER_YEAR);
>>+   tm->tm_mon += rint(fval * MONTHS_PER_YEAR);
>> 
>>Should the handling for year use the same check as that for month ?
>> 
>>-   AdjustFractDays(fval, tm, fsec, DAYS_PER_MONTH);
>>+   /* round to a full month? */
>>+   if (rint(fval * DAYS_PER_MONTH) == DAYS_PER_MONTH)
>> 
>>Cheers 
>> 
>> Hi,
>> I guess the reason for current patch was that year to months conversion is
>> accurate.
> 
> Our internal units are hours/days/seconds, so the spill _up_ from months
> to years happens automatically:
> 
>   SELECT INTERVAL '23.99 months';
>interval
>   --
>2 years
> 
>> On the new test:
>> 
>> +SELECT INTERVAL '1.16 months 01:00:00' AS "One mon 5 days one hour";
>> 
>> 0.16 * 31 = 4.96 < 5
>> 
>> I wonder why 5 days were chosen in the test output.
> 
> We use 30 days/month, not 31.  However, I think you are missing the
> changes in the patch and I am just understanding them fully now.  There
> are two big changes:
> 
> 1.  The amount of spill from months only to days
> 2.  The _rounding_ of the result once we stop spilling at months or days
> 
> #2 is the part I think you missed.
> 
> One thing missing from my previous patch was the handling of negative
> units, which is now handled properly in the attached patch:
> 
>   SELECT INTERVAL '-1.99 years';
>interval
>   --
>-2 years
> 
>   SELECT INTERVAL '-1.99 months';
>interval
>   --
>-2 mons
> 
> I ended up creating a function to handle this, which allowed me to
> simplify some of the surrounding code.
> 
> -- 
>  Bruce Momjian  
> https://www.google.com/url?q=https://momjian.us=gmail-imap=162765755400=AOvVaw2pMx7QBd3qSjHK1L9oUnl0
>  EDB  
> https://www.google.com/url?q=https://enterprisedb.com=gmail-imap=162765755400=AOvVaw2Q92apfhXmqqFYz7aN16YL
> 
>  If only the physical world exists, free will is an illusion.
> 
> 
> 

Will the same new spilldown rules hold in the same way for interval 
multiplication and division as they will for the interpretation of an interval 
literal?

The semantics here are (at least as far as my limited search skills have shown 
me) simply undocumented. But my tests in 13.3 have to date not disproved this 
hypothesis:

* considering "new_i ◄— i * f"

* notice that the internal representation is _months_, days, and seconds, at 
odds with "Our internal units are hours/days/seconds"

* let i’s internal representation be [mm, dd, ss]

* new_i’s “intermediate” internal representation is [mm*f, dd*f, ss*f]

* input these values to the same spilldown algorithm that is applied when these 
same intermediate values are used in an interval literal

* so the result is [new_mm, new_dd, new_ss]

Here’s an example:

select
  '1.2345 months 1.2345 days 1.2345 seconds'::interval = 
  '1 month 1 day 1 second'::interval*1.2345;

In 13.3, the result is TRUE. (I know that this doesn’t guarantee that the 
internal representations of the two compared interval values are the same. But 
it’s a necessary condition for the outcome that I’m referring to and serves to 
indicate the point I’m making. A more careful test can be made.)
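
The hypothesis can be expressed as a small model (this is the conjectured spilldown written out for illustration, not code verified against the PostgreSQL source):

```python
DAYS_PER_MONTH = 30
SECS_PER_DAY = 86400

def spilldown(months, days, secs):
    # Cascade the fractional part of months into days, and the fractional
    # part of days into seconds.
    whole_months = int(months)
    days += (months - whole_months) * DAYS_PER_MONTH
    whole_days = int(days)
    secs += (days - whole_days) * SECS_PER_DAY
    return (whole_months, whole_days, round(secs, 6))

def interval_literal(mm, dd, ss):
    # '1.2345 months 1.2345 days 1.2345 seconds'
    return spilldown(mm, dd, ss)

def interval_multiply(mm, dd, ss, f):
    # '1 month 1 day 1 second' * f, per the hypothesis: multiply each
    # field first, then apply the same spilldown.
    return spilldown(mm * f, dd * f, ss * f)
```

Under this model the two expressions in the SELECT above necessarily produce the same [months, days, seconds] triple, which is consistent with (though not proof of) the observed TRUE result.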

Re: Avoiding data loss with synchronous replication

2021-07-23 Thread Bossart, Nathan
On 7/23/21, 4:33 AM, "Andrey Borodin"  wrote:
> Thanks for you interest in the topic. I think in the thread [0] we almost 
> agreed on general design.
> The only left question is that we want to threat pg_ctl stop and kill SIGTERM 
> differently to pg_terminate_backend().

I didn't get the idea that there was a tremendous amount of support
for the approach to block canceling waits for synchronous replication.
FWIW this was my initial approach as well, but I've been trying to
think of alternatives.

If we can gather support for some variation of the block-cancels
approach, I think that would be preferred over my proposal from a
complexity standpoint.  Robert's idea to provide a way to understand
the intent of the cancellation/termination request [0] could improve
matters.  Perhaps adding an argument to pg_cancel/terminate_backend()
and using different signals to indicate that we want to cancel the
wait would be something that folks could get on board with.

Nathan

[0] 
https://www.postgresql.org/message-id/CA%2BTgmoaW8syC_wqQcsJ%3DsQ0gTbFVC6MqYmxbwNHk5w%3DxJ-McOQ%40mail.gmail.com



Re: Avoiding data loss with synchronous replication

2021-07-23 Thread Bossart, Nathan
On 7/23/21, 4:23 AM, "Laurenz Albe"  wrote:
> But that would mean that changes ostensibly rolled back (because the
> cancel request succeeded) will later turn out to be committed after all,
> just like it is now (only later).  Where is the advantage?

The advantage is that I can cancel waits for synchronous replication
without risking data loss.  The transactions would still be marked in-
progress until we get the proper acknowledgement from the standbys.

> Besides, there is no room for another transaction status in the
> commit log.

Right.  Like the existing synchronous replication functionality, the
commit log would be updated, but the transactions would still appear
to be in-progress.  Today, this is done via the procarray.
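
A toy model of that visibility rule (names are hypothetical; in PostgreSQL the "awaiting ack" side would live in the procarray, not a Python set):

```python
class SyncRepVisibility:
    """A commit is durable in the clog but stays 'in progress' until the
    standby acknowledges it, so canceling the wait cannot lose data."""

    def __init__(self):
        self.in_clog = set()        # locally durable commits
        self.awaiting_ack = set()   # not yet acknowledged by the standby

    def local_commit(self, xid):
        self.in_clog.add(xid)
        self.awaiting_ack.add(xid)

    def standby_ack(self, xid):
        self.awaiting_ack.discard(xid)

    def is_visible(self, xid):
        return xid in self.in_clog and xid not in self.awaiting_ack
```

The transaction becomes visible only after both steps: the local commit and the standby acknowledgement.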

Nathan



Re: Avoiding data loss with synchronous replication

2021-07-23 Thread Bossart, Nathan
On 7/23/21, 3:58 AM, "Amit Kapila"  wrote:
> On Fri, Jul 23, 2021 at 2:48 AM Bossart, Nathan  wrote:
>> Instead of blocking query cancellations and backend terminations, I
>> think we should allow them to proceed, but we should keep the
>> transactions marked in-progress so they do not yet become visible to
>> sessions on the primary.
>>
>
> One naive question, what if the primary gets some error while changing
> the status from in-progress to committed? Won't in such a case the
> transaction will be visible on standby but not on the primary?

Yes.  In this case, the transaction would remain in-progress on the
primary until it can be marked committed.

>>  Once replication has caught up to the
>> the necessary point, the transactions can be marked completed, and
>> they would finally become visible.
>>
>
> If the session issued the commit is terminated, will this work be done
> by some background process?

I think the way I'm imagining it is that a background process would be
responsible for handling all of the "offloaded" transactions.  I'm not
wedded to any particular design at this point, though.

Nathan



Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-07-23 Thread Mark Dilger



> On Jul 23, 2021, at 9:20 AM, Stephen Frost  wrote:
> 
> These considerations were addressed with row_security by allowing the
> GUC to be set by anyone, but throwing an ERROR if RLS would have been
> required by the query instead of just allowing it.  I don't see any
> obvious reason why that couldn't be the case for event triggers..?

Because a postgres-as-a-service provider may want to install their own event 
triggers as well as allowing the customer to do so, and it seems too coarse 
grained to either skip all of them or none of them.  It's perfectly reasonable 
to want to skip your customer's event triggers while not skipping your own.

>> A superuser-only GUC for this is also a bit too heavy handed.  The superuser 
>> may not want to circumvent all event triggers, just those put in place by 
>> the pg_database_security role.  If that sounds arbitrary, just consider the 
>> postgres-as-a-service case.  The superuser wants to be able to grant 
>> pg_database_security to the customer, but doesn't want the customer to be 
>> able to use that to trap the service provider.
> 
> Having a trust system for triggers, functions, etc, where you can say
> whose triggers you're ok running might be interesting but it also seems
> like an awful lot of work and I'm not sure that it's actually really
> that much better than a GUC similar to row_security.

My first impression was that it is too much work, which is why I put event 
trigger creation into the pg_host_security bucket.  It might be more sane to 
just leave it as superuser-only.  But if we're going to fix this and make it a 
pg_database_security usable feature, then I think we need to solve the problems 
a naive approach would create for service providers.

>> Instead of a GUC, how about checking permissions inside event triggers for 
>> both the user firing the trigger *and* the trigger owner.  That's a backward 
>> compatibility break, but maybe not a bad one, since until now only 
>> superusers have been allowed to create event triggers.  Systems which create 
>> an event trigger using a role that later has superuser revoked, or which 
>> change ownership to a non-superuser, will see a behavior change.  I'm not 
>> super happy with that, but I think it is better than the GUC based solution. 
>>  Event triggers owned by a superuser continue to work as they do now.  Event 
>> triggers owned by a non-superuser cannot be used to force a privileged user 
>> to run a command that the event trigger owner could not have run for 
>> themself.
> 
> I'm not following what this suggestion is, exactly.  What permissions
> are being checked inside the event trigger being run, exactly..?  Who
> would get to set those permissions?  Any object owner, today, can GRANT
> access to any other user in the system, we don't prevent that.

I don't think GRANT is really relevant here, as what I'm trying to avoid is a 
less privileged user trapping a more privileged user into running a function 
that the less privileged user can't directly run.  Certainly such a user cannot 
GRANT privilege on a function that they cannot even run, else your system has a 
privilege escalation hazard already. 

It's a substantial change to the security model, but the idea is that inside an 
event trigger, we'd SetUserIdAndSecContext to a new type of context similar to 
SECURITY_LOCAL_USERID_CHANGE but where instead of simply changing to the owner 
of the event trigger, we'd be changing to a virtual user who is defined to only 
have the privileges of the intersection of the current user and the event 
trigger owner.  That entails at least two problems, though I don't see that 
they are insoluble.

First, all places in the code that check permissions need 
to check in a way that works in this mode.  We might not be able to call 
GetUserId() as part of aclchecks any longer, and instead have to call some new 
function GetVirtualUserId() as part of aclchecks, reserving GetUserId() just 
for cases where you're not trying to perform a permissions check.

Second, 
since event triggers can cause other event triggers to fire, we'd need these 
virtual users to be able to nest, so we'd have to push a stack of users and 
check all of them in each case, and we'd have to think carefully about how to 
handle errors, since GetUserIdAndSecContext and SetUserIdAndSecContext are 
called inside transaction start and end, where errors must not be raised.  But 
since transaction start and end would never want to set or reset the state to a 
virtual user, those should never throw, and the calls elsewhere are at liberty 
to throw if they like, so we'd just have to be careful to use the right version 
of these operations in the right places.
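
A hypothetical sketch of the intersection semantics described above (illustrative only, not PostgreSQL code; privilege names are made up):

```python
def virtual_user_allowed(privilege, user_stack):
    # The 'virtual user' holds only the intersection of the privileges of
    # every identity on the stack: the invoking user plus the owner of
    # each (possibly nested) event trigger.
    return all(privilege in privs for privs in user_stack)
```

So a less privileged trigger owner cannot trap a more privileged invoker into doing anything the owner could not do directly, and nesting just pushes another privilege set onto the stack.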

This all sounds a bit much, but it has knock-on benefits.  We'd be able to 
preserve the historical behavior of table triggers while having a mode that 
behaves in this new way, which means that privileged users could be a bit more 
cavalier than they can now about DML against user 

Re: Configure with thread sanitizer fails the thread test

2021-07-23 Thread Tom Lane
Alvaro Herrera  writes:
> ... that said, I wonder why would we do this in the thread_test program
> rather than in configure itself.  Wouldn't it make more sense for the
> configure test to be skipped altogether (marking the result as
> thread-safe) when running under thread sanitizer, if there's a way to
> detect that?

TBH, I wonder why we don't just nuke thread_test.c altogether.
Is it still useful in 2021?  Machines that still need
--disable-thread-safety can doubtless be counted without running
out of fingers, and I think their owners can be expected to know
that they need that.

regards, tom lane




Configure with thread sanitizer fails the thread test

2021-07-23 Thread Mikhail Matrosov
This is a reply to an old thread with the same name:
https://www.postgresql.org/message-id/1513971675.5870501.1439797066345.javamail.ya...@mail.yahoo.com
I was not able to do a proper reply since I cannot download the raw
message: https://postgrespro.com/list/thread-id/2483942

Anyway, the problem with thread sanitizer is still present. If I try to
build libpq v13.3 with thread sanitizer, I get a configuration error like
this:

configure:18852: checking thread safety of required library functions
configure:18875: /usr/bin/clang-12 -o conftest -Wall -Wmissing-prototypes
-Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
-Wno-unused-command-line-argument -m64 -O3 -fsanitize=thread -pthread
-D_REENTRANT -D_THREAD_SAFE -D_POSIX_PTHREAD_SEMANTICS -DIN_CONFIGURE
-I/home/mmatrosov/.conan/data/zlib/1.2.11/_/_/package/98c3afaf7dd035538c92258b227714d9d4a19852/include
-DNDEBUG -D_GNU_SOURCE  -m64
-L/home/mmatrosov/.conan/data/zlib/1.2.11/_/_/package/98c3afaf7dd035538c92258b227714d9d4a19852/lib
 conftest.c -lz -lm -lz  >&5
configure:18875: $? = 0
configure:18875: ./conftest
==
WARNING: ThreadSanitizer: data race (pid=3413987)
  Write of size 4 at 0x00f1744c by thread T2:
#0 func_call_2  (conftest+0x4b5e51)

  Previous read of size 4 at 0x00f1744c by thread T1:
#0 func_call_1  (conftest+0x4b5d12)

  Location is global 'errno2_set' of size 4 at 0x00f1744c
(conftest+0x00f1744c)

  Thread T2 (tid=3413990, running) created by main thread at:
#0 pthread_create  (conftest+0x424b3b)
#1 main  (conftest+0x4b5b49)

  Thread T1 (tid=3413989, running) created by main thread at:
#0 pthread_create  (conftest+0x424b3b)
#1 main  (conftest+0x4b5b2e)

...

configure:18879: result: no
configure:18881: error: thread test program failed
This platform is not thread-safe.  Check the file 'config.log' or compile
and run src/test/thread/thread_test for the exact reason.
Use --disable-thread-safety to disable thread safety.


I am not sure what the proper fix for the issue is, but I suggest at least
disabling this test when building with thread sanitizer:
https://github.com/conan-io/conan-center-index/pull/6472/files#diff-b8639f81e30f36c8ba29a0878f1ef4d9f1552293bc9098ebb9b429ddb1f0935f

-
Best regards, Mikhail Matrosov


[Doc] Tiny fix for regression tests example

2021-07-23 Thread tanghy.f...@fujitsu.com
Hi



When reading PG DOC, found some example code not correct as it said.

https://www.postgresql.org/docs/devel/regress-run.html



Here's a tiny fix in regress.sgml.



-make check PGOPTIONS="-c log_checkpoints=on -c work_mem=50MB"

+make check PGOPTIONS="-c geqo=off -c work_mem=50MB"



log_checkpoints couldn't be set in PGOPTIONS.

Replace log_checkpoints with geqo in the example code.



-make check EXTRA_REGRESS_OPTS="--temp-config=test_postgresql.conf"

+make check EXTRA_REGRESS_OPTS="--temp-config=$(pwd)/test_postgresql.conf"



User needs to specify $(pwd) to let the command execute as expected.



The above example code was added by Peter in PG14, so I think we need to apply 
this fix to PG14/master.

I proposed this fix at 
pgsql-d...@lists.postgresql.org at [1], 
but there was no reply except from Craig, so please allow me to post the patch to the 
Hackers’ mailing list again in case the fix was missed.



[1] 
https://www.postgresql.org/message-id/OS0PR01MB6113FA937648B8F7A372359BFB009%40OS0PR01MB6113.jpnprd01.prod.outlook.com



Regards,

Tang



0001-minor-fix-for-regress-example.patch
Description: 0001-minor-fix-for-regress-example.patch


Re: Followup Timestamp to timestamp with TZ conversion

2021-07-23 Thread Peter Volk
Hi Tom,

thanks for the reply. I do understand that if a rewrite of the table
needs to be avoided, the binary image needs to be the same. Since PG 12
there is an optimisation to avoid a rewrite of timestamp columns if
they are converted to timestamp with tz and the target tz offset is 0.

I am referring to the function

ATColumnChangeRequiresRewrite(Node *expr, AttrNumber varattno)

in which the following is checked:

(b/src/backend/commands/tablecmds.c)

else if (IsA(expr, FuncExpr))
{
    FuncExpr   *f = (FuncExpr *) expr;

    switch (f->funcid)
    {
        case F_TIMESTAMPTZ_TIMESTAMP:
        case F_TIMESTAMP_TIMESTAMPTZ:
            if (TimestampTimestampTzRequiresRewrite())
                return true;
            else
                expr = linitial(f->args);
            break;
        default:
            return true;
    }


and TimestampTimestampTzRequiresRewrite checks if the offset is 0:

(b/src/backend/utils/adt/timestamp.c)

/*
 * TimestampTimestampTzRequiresRewrite()
 *
 * Returns false if the TimeZone GUC setting causes timestamp_timestamptz and
 * timestamptz_timestamp to be no-ops, where the return value has the same
 * bits as the argument.  Since project convention is to assume a GUC changes
 * no more often than STABLE functions change, the answer is valid that long.
 */
bool
TimestampTimestampTzRequiresRewrite(void)
{
    long        offset;

    if (pg_get_timezone_offset(session_timezone, &offset) && offset == 0)
        PG_RETURN_BOOL(false);
    PG_RETURN_BOOL(true);
}

So in this case it is already proven that there is binary equality
between the data types timestamp and timestamp with tz when the offset
is 0. Hence this type of optimisation should, or could,
also apply to indexes as well as to the columns used in partitioning.
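As a hedged illustration of the bit-compatibility claim (the constants and helper names below are invented for this sketch, not PostgreSQL internals): both types are stored as int64 microseconds since 2000-01-01, and reinterpreting a timestamp as timestamptz under a zero session offset leaves the stored value untouched:

```python
from datetime import datetime, timezone

# Assumption made explicit: both timestamp and timestamptz are stored as
# int64 microseconds since 2000-01-01; the helper names are illustrative.
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

def pg_micros(dt):
    """Encode an aware datetime as microseconds since the PostgreSQL epoch."""
    delta = dt - PG_EPOCH
    return (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds

def timestamp_to_timestamptz(stored_micros, utc_offset_secs):
    """Reinterpret a stored timestamp under a session offset: the UTC value
    is the local value minus the offset east of UTC."""
    return stored_micros - utc_offset_secs * 1_000_000

stored = pg_micros(datetime(2021, 7, 23, 12, 0, tzinfo=timezone.utc))
assert timestamp_to_timestamptz(stored, 0) == stored      # offset 0: no-op
assert timestamp_to_timestamptz(stored, 3600) != stored   # otherwise bits change
```

Whether comparison and sort order are also preserved is a separate question, which is what the quoted reply addresses.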

Thanks,
Peter


On Thu, Jul 22, 2021 at 5:29 PM Tom Lane  wrote:
>
> Peter Volk  writes:
> > The problem is that I have a 60TB+ PG installation for which we need to
> > modify all of the timestamp columns to timestamp with tz. The data in the
> > columns are already in UTC so we can benefit from the patch listed above.
> > Yet there are 2 cases in which we are having an issue.
>
> > 1) Index rebuilds: The patch is only avoiding a rewrite of the table data
> > but is not avoiding a rebuild of the indexes. Following the logic in the
> > patch above this should also be avoidable under the same condition
>
> I don't think that follows.  What we are concerned about when determining
> whether a heap rewrite can be skipped is whether the stored heap entries
> are bit-compatible between the two data types.  To decide that an index
> rebuild is not needed, you'd need to further determine that their sort
> orders are equivalent (for a btree index, or who-knows-what semantic
> condition for other types of indexes).  We don't attempt to do that,
> so index rebuilds are always needed.
>
> As a thought experiment to prove that this is an issue, suppose that
> somebody invented an unsigned integer type, and made the cast from
> regular int4 follow the rules of a C cast, so that e.g. -1 becomes
> 2^32-1.  Given that, an ALTER TYPE from int4 to the unsigned type
> could skip the heap rewrite.  But we absolutely would have to rebuild
> any btree index on the column, because the sort ordering of the two
> types is different.  OTOH, it's quite likely that a hash index would
> not really need to be rebuilt.  So this is a real can of worms and
> we've not cared to open it.
>
> > 2) Partitioned tables with the timestamp as partition column: In this case
> > the current version does not allow a modification of the column data type
> > at all.
>
> PG's partitioning features are still being built out, but I would not
> hold my breath for that specific thing to change.  Again, the issue
> is that bit-compatibility of individual values doesn't prove much
> about comparison semantics, so it's not clear that a change of
> data type still allows the value-to-partition assignment to be
> identical.  (This is clearly an issue for RANGE partitioning.  Maybe
> it could be finessed for HASH or LIST, but you'd still be needing
> semantic assumptions that go beyond mere bit-compatibility of values.)
>
> regards, tom lane
>
>




Re: Configure with thread sanitizer fails the thread test

2021-07-23 Thread Alvaro Herrera
On 2021-Jul-23, Alvaro Herrera wrote:

> On 2021-Jul-23, Mikhail Matrosov wrote:
> 
> > I am not sure what is the proper fix for the issue, but I at least suggest
> > to disable this test when building with thread sanitizer:
> > https://github.com/conan-io/conan-center-index/pull/6472/files#diff-b8639f81e30f36c8ba29a0878f1ef4d9f1552293bc9098ebb9b429ddb1f0935f
> 
> Here's the proposed patch.  Patches posted to the mailing list by their
> authors proposed for inclusion are considered to be under the PostgreSQL
> license, but this patch hasn't been posted by the author so let's assume
> they're not authorizing them to use it.  (Otherwise, why wouldn't they
> just post it here instead of doing the absurdly convoluted dance of a
> github PR?)

... that said, I wonder why would we do this in the thread_test program
rather than in configure itself.  Wouldn't it make more sense for the
configure test to be skipped altogether (marking the result as
thread-safe) when running under thread sanitizer, if there's a way to
detect that?

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"I don't remember, but it's not true.  It's not true, and if it were true,
 I wouldn't remember." (Augusto Pinochet to a court of justice)




Re: Configure with thread sanitizer fails the thread test

2021-07-23 Thread Alvaro Herrera
On 2021-Jul-23, Mikhail Matrosov wrote:

> I am not sure what is the proper fix for the issue, but I at least suggest
> to disable this test when building with thread sanitizer:
> https://github.com/conan-io/conan-center-index/pull/6472/files#diff-b8639f81e30f36c8ba29a0878f1ef4d9f1552293bc9098ebb9b429ddb1f0935f

Here's the proposed patch.  Patches posted to the mailing list by their
authors proposed for inclusion are considered to be under the PostgreSQL
license, but this patch hasn't been posted by the author so let's assume
they're not authorizing them to use it.  (Otherwise, why wouldn't they
just post it here instead of doing the absurdly convoluted dance of a
github PR?)

diff --git a/src/test/thread/thread_test.c b/src/test/thread/thread_test.c
index e1bec01..e4ffd78 100644
--- a/src/test/thread/thread_test.c
+++ b/src/test/thread/thread_test.c
@@ -61,6 +61,14 @@ main(int argc, char *argv[])
fprintf(stderr, "Perhaps rerun 'configure' using '--enable-thread-safety'.\n");
return 1;
 }
+
+#elif __has_feature(thread_sanitizer)
+int
+main(int argc, char *argv[])
+{
+   printf("Thread safety check is skipped since it does not work with thread sanitizer.\n");
+   return 0;
+}
 #else
 
 /* This must be down here because this is the code that uses threads. */

-- 
Álvaro Herrera




Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-07-23 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Jul 23, 2021, at 6:22 AM, Robert Haas  wrote:
> > And I think that's a good illustration of why it's a bad idea to
> > categorize things according to whether or not they have a certain
> > consequence.
> 
> Well, one very big reason for wanting to break superuser into separate roles 
> is to make postgres-as-a-service providers comfortable granting more 
> privileges to customers.  If we design any privilege escalations into one of 
> those roles, then no sensible service provider is ever going to grant it to 
> anybody, which fairly much defeats the purpose of this work.  The privilege 
> escalations we need to prevent are not just escalations to superuser, but 
> also escalations to other privileged roles.  Contrary to this design goal, 
> the "pg_host_security" role is a bit of a synonym for "superuser", since 
> being able to write files or execute shell commands is a path to superuser, 
> and we can't do too much about that.   "pg_database_security", 
> "pg_network_security", and "pg_logical_replication" are not synonyms for 
> "superuser".
> 
> I like your idea of designing some extra security around event triggers to 
> resolve their privilege escalation problems.  A GUC seems the wrong approach 
> to me.
> 
> I think a superuser-only GUC to suppress firing event triggers won't quite 
> cut it, because the other privileged roles would still be in danger of being 
> trapped by a clever pg_database_security event trigger author; but extending 
> permissions on the GUC to include the other roles would mean that they, and 
> not just superuser, could evade event trigger based auditing solutions.  That 
> is odd, because you wouldn't expect granting pg_network_security or 
> pg_logical_replication to have anything to do with granting privilege to 
> defeat audit logging.

These considerations were addressed with row_security by allowing the
GUC to be set by anyone, but throwing an ERROR if RLS would have been
required by the query instead of just allowing it.  I don't see any
obvious reason why that couldn't be the case for event triggers..?

> A superuser-only GUC for this is also a bit too heavy handed.  The superuser 
> may not want to circumvent all event triggers, just those put in place by the 
> pg_database_security role.  If that sounds arbitrary, just consider the 
> postgres-as-a-service case.  The superuser wants to be able to grant 
> pg_database_security to the customer, but doesn't want the customer to be 
> able to use that to trap the service provider.

Having a trust system for triggers, functions, etc, where you can say
whose triggers you're ok running might be interesting but it also seems
like an awful lot of work and I'm not sure that it's actually really
that much better than a GUC similar to row_security.

> Instead of a GUC, how about checking permissions inside event triggers for 
> both the user firing the trigger *and* the trigger owner.  That's a backward 
> compatibility break, but maybe not a bad one, since until now only superusers 
> have been allowed to create event triggers.  Systems which create an event 
> trigger using a role that later has superuser revoked, or which change 
> ownership to a non-superuser, will see a behavior change.  I'm not super 
> happy with that, but I think it is better than the GUC based solution.  Event 
> triggers owned by a superuser continue to work as they do now.  Event 
> triggers owned by a non-superuser cannot be used to force a privileged user 
> to run a command that the event trigger owner could not have run for themself.

I'm not following what this suggestion is, exactly.  What permissions
are being checked inside the event trigger being run, exactly..?  Who
would get to set those permissions?  Any object owner, today, can GRANT
access to any other user in the system, we don't prevent that.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-07-23 Thread Mark Dilger



> On Jul 23, 2021, at 6:22 AM, Robert Haas  wrote:
> 
> And I think that's a good illustration of why it's a bad idea to
> categorize things according to whether or not they have a certain
> consequence.

Well, one very big reason for wanting to break superuser into separate roles is 
to make postgres-as-a-service providers comfortable granting more privileges to 
customers.  If we design any privilege escalations into one of those roles, 
then no sensible service provider is ever going to grant it to anybody, which 
fairly much defeats the purpose of this work.  The privilege escalations we 
need to prevent are not just escalations to superuser, but also escalations to 
other privileged roles.  Contrary to this design goal, the "pg_host_security" 
role is a bit of a synonym for "superuser", since being able to write files or 
execute shell commands is a path to superuser, and we can't do too much about 
that.   "pg_database_security", "pg_network_security", and 
"pg_logical_replication" are not synonyms for "superuser".

I like your idea of designing some extra security around event triggers to 
resolve their privilege escalation problems.  A GUC seems the wrong approach to 
me.

I think a superuser-only GUC to suppress firing event triggers won't quite cut 
it, because the other privileged roles would still be in danger of being 
trapped by a clever pg_database_security event trigger author; but extending 
permissions on the GUC to include the other roles would mean that they, and not 
just superuser, could evade event trigger based auditing solutions.  That is 
odd, because you wouldn't expect granting pg_network_security or 
pg_logical_replication to have anything to do with granting privilege to defeat 
audit logging.

A superuser-only GUC for this is also a bit too heavy handed.  The superuser 
may not want to circumvent all event triggers, just those put in place by the 
pg_database_security role.  If that sounds arbitrary, just consider the 
postgres-as-a-service case.  The superuser wants to be able to grant 
pg_database_security to the customer, but doesn't want the customer to be able 
to use that to trap the service provider.

Instead of a GUC, how about checking permissions inside event triggers for both 
the user firing the trigger *and* the trigger owner.  That's a backward 
compatibility break, but maybe not a bad one, since until now only superusers 
have been allowed to create event triggers.  Systems which create an event 
trigger using a role that later has superuser revoked, or which change 
ownership to a non-superuser, will see a behavior change.  I'm not super happy 
with that, but I think it is better than the GUC based solution.  Event 
triggers owned by a superuser continue to work as they do now.  Event triggers 
owned by a non-superuser cannot be used to force a privileged user to run a 
command that the event trigger owner could not have run for themself.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Slightly improve initdb --sync-only option's help message

2021-07-23 Thread Bossart, Nathan
On 7/22/21, 6:31 PM, "Justin Pryzby"  wrote:
> On Thu, Jul 22, 2021 at 10:32:18PM +, Bossart, Nathan wrote:
>> IMO the note about the option being helpful after using the --no-sync
>> option would fit better in the docs, but I'm struggling to think of a
>> use case for using --no-sync and then calling initdb again with
>> --sync-only.  Why wouldn't you just leave out --no-sync the first
>> time?
>
> It's to allow safely running bulk loading with fsync=off - if the bulk load
> fails, you can wipe out the partially-loaded cluster and start over.
> But then transitioning to a durable state requires not just setting fsync=on,
> which enables future fsync calls.  It also requires syncing all dirty buffers.

Right.  Perhaps the documentation for --sync-only could mention this
use-case instead.

Safely write all database files to disk and exit. This does
not perform any of the normal initdb operations.  Generally,
this option is useful for ensuring reliable recovery after
changing fsync from off to on.
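That sync pass amounts to recursively flushing every file (and directory entry) in the data directory. A rough, POSIX-only Python sketch of the idea, not initdb's actual implementation:

```python
import os
import tempfile

def fsync_dir_tree(root):
    """Recursively fsync every file, then each directory, under root;
    roughly the durability pass initdb --sync-only performs (illustrative
    sketch only, not initdb's code)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            fd = os.open(os.path.join(dirpath, name), os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        # fsync the directory itself so creates/renames are durable too
        dfd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)

with tempfile.TemporaryDirectory() as datadir:
    with open(os.path.join(datadir, "base.dat"), "w") as f:
        f.write("bulk-loaded rows")   # written while fsync was off
    fsync_dir_tree(datadir)           # now make it all durable
```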

Nathan



Re: Corrected documentation of data type for the logical replication message formats.

2021-07-23 Thread vignesh C
On Fri, Jul 23, 2021 at 3:23 AM Peter Smith  wrote:
>
> I think the patch maybe is not quite correct for all the flags.
>
> For example,
>
> @@ -7607,44 +7615,44 @@ are available since protocol version 3.
>  
>  Int8
>  
> -Flags; currently unused (must be 0).
> +Flags (uint8); currently unused.
>  
>  
>
> AFAIK, even though the flags are "unused", the code still insists that
> most (or all? Please check the code) of these flag values MUST be 0,
> so I think that this zero value requirement ought to be indicated in
> the docs using the "Int8(0)" convention [1]. For example,
>
> BEFORE
> Int8
> Flags (uint8); currently unused.
>
> AFTER
> Int8(0)
> Flags (uint8); currently unused.

Thanks for the comments. Attached v6 patch has the changes for the same.

Regards,
Vignesh
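To make the base-type annotations concrete, here is a hedged sketch of packing a Begin message with its XLogRecPtr, TimestampTz, and TransactionId fields, following the field list in the protocol documentation (illustrative values, not captured wire traffic):

```python
import struct

def pack_begin(final_lsn, commit_ts_micros, xid):
    """Pack a logical-replication Begin message: byte 'B', then the final
    LSN (XLogRecPtr, int64), commit timestamp (TimestampTz, int64 microseconds
    since 2000-01-01), and xid (TransactionId, uint32), all big-endian."""
    return struct.pack(">cqqI", b"B", final_lsn, commit_ts_micros, xid)

msg = pack_begin(0x16B3748, 680_000_000_000_000, 745)
assert msg[0:1] == b"B"
assert len(msg) == 1 + 8 + 8 + 4
lsn, ts, xid = struct.unpack(">qqI", msg[1:])
assert (lsn, xid) == (0x16B3748, 745)
```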
From 398e67e672cb4b264dd8863f17caa431b2a82153 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Tue, 11 May 2021 20:29:43 +0530
Subject: [PATCH v6] Included the actual datatype used in logical replication
 message description.

Included the actual datatype used in logical replication message description.
---
 doc/src/sgml/protocol.sgml | 168 -
 1 file changed, 91 insertions(+), 77 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index e8cb78ff1f..5aad6dbf7a 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -6399,7 +6399,8 @@ This section describes the detailed format of each logical replication message.
 These messages are returned either by the replication slot SQL interface or are
 sent by a walsender. In case of a walsender they are encapsulated inside the replication
 protocol WAL messages as described in 
-and generally obey same message flow as physical replication.
+and generally obey same message flow as physical replication. The base data types
+used are described in .
 
 
 
@@ -6428,7 +6429,7 @@ Begin
 
 
 
-The final LSN of the transaction.
+The final LSN (XLogRecPtr) of the transaction.
 
 
 
@@ -6438,8 +6439,8 @@ Begin
 
 
 
-Commit timestamp of the transaction. The value is in number
-of microseconds since PostgreSQL epoch (2000-01-01).
+Commit timestamp (TimestampTz) of the transaction. The value is
+in number of microseconds since PostgreSQL epoch (2000-01-01).
 
 
 
@@ -6449,7 +6450,7 @@ Begin
 
 
 
-Xid of the transaction.
+Xid (TransactionId) of the transaction.
 
 
 
@@ -6483,8 +6484,9 @@ Message
 
 
 
-Xid of the transaction (only present for streamed transactions).
-This field is available since protocol version 2.
+Xid (TransactionId) of the transaction (only present for
+streamed transactions).  This field is available since protocol
+version 2.
 
 
 
@@ -6494,8 +6496,8 @@ Message
 
 
 
-Flags; Either 0 for no flags or 1 if the logical decoding
-message is transactional.
+Flags (uint8); Either 0 for no flags or 1 if the logical
+decoding message is transactional.
 
 
 
@@ -6505,7 +6507,7 @@ Message
 
 
 
-The LSN of the logical decoding message.
+The LSN (XLogRecPtr) of the logical decoding message.
 
 
 
@@ -6567,11 +6569,11 @@ Commit
 
 
 
-Int8
+Int8(0)
 
 
 
-Flags; currently unused (must be 0).
+Flags (uint8); currently unused.
 
 
 
@@ -6581,7 +6583,7 @@ Commit
 
 
 
-The LSN of the commit.
+The LSN (XLogRecPtr) of the commit.
 
 
 
@@ -6591,7 +6593,7 @@ Commit
 
 
 
-The end LSN of the transaction.
+The end LSN (XLogRecPtr) of the transaction.
 
 
 
@@ -6601,8 +6603,8 @@ Commit
 
 
 
-Commit timestamp of the transaction. The value is in number
-of microseconds since PostgreSQL epoch (2000-01-01).
+Commit timestamp (TimestampTz) of the transaction. The value is
+in number of microseconds since PostgreSQL epoch (2000-01-01).
 
 
 
@@ -6636,7 +6638,7 @@ Origin
 
 
 
-The LSN of the commit on the origin server.
+The LSN (XLogRecPtr) of the commit on the origin server.
 
 
 
@@ -6685,8 +6687,9 @@ Relation
 
 
 
-   Xid of the transaction (only present for streamed transactions).
-   This field is available since protocol version 2.
+   Xid (TransactionId) of the transaction (only present for
+   streamed transactions).  This field is available since
+   protocol version 2.
 
 
 
@@ -6696,7 +6699,7 @@ Relation
 
 
 
-ID of the relation.
+ID (Oid) of the relation.
 
 
 
@@ -6752,8 +6755,8 @@ Relation
 
 
 
-Flags for the column. Currently can be either 0 for no flags
-

Re: WIP: Relaxing the constraints on numeric scale

2021-07-23 Thread Tom Lane
Dean Rasheed  writes:
> All your other suggestions make sense too. Attached is a new version.

OK, I've now studied this more closely, and have some additional
nitpicks:

* I felt the way you did the documentation was confusing.  It seems
better to explain the normal case first, and then describe the two
extended cases.

* As long as we're encapsulating typmod construction/extraction, let's
also encapsulate the checks for valid typmods.

* Other places are fairly careful to declare typmod values as "int32",
so I think this code should too.

Attached is a proposed delta patch making those changes.

(I made the docs mention that the extension cases are allowed as of v15.
While useful in the short run, that will look like noise in ten years;
so I could go either way on whether to do that.)

If you're good with these, then I think it's ready to go.
I'll mark it RfC in the commitfest.

regards, tom lane
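As a rough model of the two extended cases the doc patch describes (negative scale, and scale larger than precision), with the caveat that this is illustrative Python rather than the numeric.c logic:

```python
from decimal import Decimal, ROUND_HALF_UP

def numeric_check(value, precision, scale):
    """Round as NUMERIC(precision, scale) would, then enforce the digit
    limit; a hedged model of the rules under discussion, not numeric.c."""
    q = Decimal(1).scaleb(-scale)        # e.g. scale -3 -> quantum 1E+3
    rounded = Decimal(value).quantize(q, rounding=ROUND_HALF_UP)
    digits_left = precision - scale      # max digits left of decimal point
    if abs(rounded) >= Decimal(10) ** digits_left:
        raise ValueError("numeric field overflow")
    return rounded

assert numeric_check("98765", 2, -3) == Decimal("99000")      # NUMERIC(2,-3)
assert numeric_check("0.004991", 3, 5) == Decimal("0.00499")  # NUMERIC(3,5)
```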

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 6abda2f1d2..d3c70667a3 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -545,8 +545,8 @@
 
 NUMERIC(precision, scale)
 
- The precision must be positive, the scale may be positive or negative
- (see below).  Alternatively:
+ The precision must be positive, while the scale may be positive or
+ negative (see below).  Alternatively:
 
 NUMERIC(precision)
 
@@ -569,8 +569,8 @@ NUMERIC
 
  
   The maximum precision that can be explicitly specified in
-  a NUMERIC type declaration is 1000.  An
-  unconstrained NUMERIC column is subject to the limits
+  a numeric type declaration is 1000.  An
+  unconstrained numeric column is subject to the limits
   described in .
  
 
@@ -578,38 +578,48 @@ NUMERIC
 
  If the scale of a value to be stored is greater than the declared
  scale of the column, the system will round the value to the specified
- number of fractional digits.  If the declared scale of the column is
- negative, the value will be rounded to the left of the decimal point.
- If, after rounding, the number of digits to the left of the decimal point
- exceeds the declared precision minus the declared scale, an error is
- raised.  Similarly, if the declared scale exceeds the declared precision
- and the number of zero digits to the right of the decimal point is less
- than the declared scale minus the declared precision, an error is raised.
+ number of fractional digits.  Then, if the number of digits to the
+ left of the decimal point exceeds the declared precision minus the
+ declared scale, an error is raised.
  For example, a column declared as
 
 NUMERIC(3, 1)
 
- will round values to 1 decimal place and be able to store values between
- -99.9 and 99.9, inclusive.  A column declared as
+ will round values to 1 decimal place and can store values between
+ -99.9 and 99.9, inclusive.
+
+
+
+ Beginning in PostgreSQL 15, it is allowed
+ to declare a numeric column with a negative scale.  Then
+ values will be rounded to the left of the decimal point.  The
+ precision still represents the maximum number of non-rounded digits.
+ Thus, a column declared as
 
 NUMERIC(2, -3)
 
- will round values to the nearest thousand and be able to store values
- between -99000 and 99000, inclusive.  A column declared as
+ will round values to the nearest thousand and can store values
+ between -99000 and 99000, inclusive.
+ It is also allowed to declare a scale larger than the declared
+ precision.  Such a column can only hold fractional values, and it
+ requires the number of zero digits just to the right of the decimal
+ point to be at least the declared scale minus the declared precision.
+ For example, a column declared as
 
 NUMERIC(3, 5)
 
- will round values to 5 decimal places and be able to store values between
+ will round values to 5 decimal places and can store values between
  -0.00999 and 0.00999, inclusive.
 
 
 
  
-  The scale in a NUMERIC type declaration may be any value in
-  the range -1000 to 1000.  (The SQL standard requires
-  the scale to be in the range 0 to precision.
-  Using values outside this range may not be portable to other database
-  systems.)
+  PostgreSQL permits the scale in
+  a numeric type declaration to be any value in the range
+  -1000 to 1000.  However, the SQL standard requires
+  the scale to be in the range 0
+  to precision.  Using scales outside that
+  range may not be portable to other database systems.
  
 
 
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 46cb37cea1..faff09f5d5 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -827,21 +827,31 @@ numeric_is_integral(Numeric num)
  *
  *	For purely historical reasons 

Re: badly calculated width of emoji in psql

2021-07-23 Thread Pavel Stehule
Hi

On Thu, Jul 22, 2021 at 0:12, Jacob Champion 
wrote:

> On Wed, 2021-07-21 at 00:08 +, Jacob Champion wrote:
> > I note that the doc comment for ucs_wcwidth()...
> >
> > >  *- Spacing characters in the East Asian Wide (W) or East Asian
> > >  *  FullWidth (F) category as defined in Unicode Technical
> > >  *  Report #11 have a column width of 2.
> >
> > ...doesn't match reality anymore. The East Asian width handling was
> > last updated in 2006, it looks like? So I wonder whether fixing the
> > code to match the comment would not only fix the emoji problem but also
> > a bunch of other non-emoji characters.
>
> Attached is my attempt at that. This adds a second interval table,
> handling not only the emoji range in the original patch but also
> correcting several non-emoji character ranges which are included in the
> 13.0 East Asian Wide/Fullwidth sets. Try for example
>
> - U+2329 LEFT POINTING ANGLE BRACKET
> - U+16FE0 TANGUT ITERATION MARK
> - U+18000 KATAKANA LETTER ARCHAIC E
>
> This should work reasonably well for terminals that depend on modern
> versions of Unicode's EastAsianWidth.txt to figure out character width.
> I don't know how it behaves on BSD libc or Windows.
>
> The new binary search isn't free, but my naive attempt at measuring the
> performance hit made it look worse than it actually is. Since the
> measurement function was previously returning an incorrect (too short)
> width, we used to get a free performance boost by not printing the
> correct number of alignment/border characters. I'm still trying to
> figure out how best to isolate the performance changes due to this
> patch.
>
> Pavel, I'd be interested to see what your benchmarks find with this
> code. Does this fix the original issue for you?
>

I can confirm that the original issue is fixed.

I tested performance

I had three data sets

1. typical data - mix ascii and utf characters typical for czech language -
25K lines - there is very small slowdown 2ms from 24 to 26ms (stored file
of this result has 3MB)

2. the worst case - this reports has only emoji 1000 chars * 10K rows -
there is more significant slowdown - from 160 ms to 220 ms (stored file has
39MB)

3. a little bit of obscure datasets generated by \x and select * from
pg_proc - it has 99K lines - and there are a lot of unicode decorations
(borders). The line has 17K chars. (the stored file has 1.7GB)
In this dataset I see a slowdown from 4300 to 4700 ms.

In all cases, the data are in memory (in filesystem cache). I tested load
to pspg.

9% looks too high, but in absolute time it is 400ms for 99K lines of very
untypical data, or 2ms for more typical results; 2ms is nothing (for
interactive work). Moreover, this is from a pspg perspective. In psql there
can be overhead of network, protocol processing, formatting, and more, and
psql doesn't need to calculate the display width of decorations
(borders), which is the reason for the slowdowns in pspg.

Pavel
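For reference, the interval-table lookup described upthread, a binary search over sorted wide-character ranges, can be sketched like this; the ranges below are a small illustrative subset, not the generated table from the patch:

```python
from bisect import bisect_right

# Illustrative subset of East Asian Wide/Fullwidth ranges (Unicode 13.0);
# the real table is generated from EastAsianWidth.txt.
WIDE_RANGES = [(0x2329, 0x232A), (0x4E00, 0x9FFF),
               (0x16FE0, 0x16FE4), (0x18000, 0x18AFF)]

def ucs_wcwidth(cp):
    """Return 2 for code points in a wide range, else 1 (control and
    combining character handling omitted).  Binary search over the
    sorted range starts."""
    i = bisect_right(WIDE_RANGES, (cp, 0x10FFFF)) - 1
    if i >= 0 and WIDE_RANGES[i][0] <= cp <= WIDE_RANGES[i][1]:
        return 2
    return 1

assert ucs_wcwidth(0x2329) == 2   # LEFT POINTING ANGLE BRACKET
assert ucs_wcwidth(0x18000) == 2  # KATAKANA LETTER ARCHAIC E
assert ucs_wcwidth(ord("a")) == 1
```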




> --Jacob
>


Re: Showing applied extended statistics in explain

2021-07-23 Thread Ronan Dunklau
On Saturday, March 27, 2021 at 01:50:54 CEST, Tomas Vondra wrote:
> The current implementation is a bit ugly PoC, with a couple annoying
> issues that need to be solved:
> 
Hello Tomas,

I haven't looked at the implementation at all but I think it's an interesting 
idea.


> 1) The information is stashed in multiple lists added to a Plan. Maybe
> there's a better place, and maybe we need to invent a better way to
> track the info (a new node stashed in a single List).

Yes this would probably be cleaner.

> 
> 2) The deparsing is modeled (i.e. copied) from how we deal with index
> quals, but it's having issues with nested OR clauses, because there are
> nested RestrictInfo nodes and the deparsing does not expect that.
> 
> 3) It does not work for functional dependencies, because we effectively
> "merge" all functional dependencies and apply the entries. Not sure how
> to display this, but I think it should show the individual dependencies
> actually applied.

Yes that would be useful when trying to understand where an estimation comes 
from. 

> 
> 4) The info is collected always, but I guess we should do that only when
> in explain mode. Not sure how expensive it is.

That would probably be better yes. 

> 
> 5) It includes just statistics name + clauses, but maybe we should
> include additional info (e.g estimate for that combination of clauses).

I'm not sure the estimate for the combination is that useful, as you have an 
associated estimated number of rows attached to the node.  I think to be able 
to really make sense of it, a GUC disabling the extended statistics could be 
useful for the curious DBA to compare the selectivity estimation for a plan 
with the statistics and a plan without. 

> 
> 6) The clauses in the grouping query are transformed to AND list, which
> is wrong. This is easy to fix, I was lazy to do that in a PoC patch.
> 
> 7) It does not show statistics for individual expressions. I suppose
> examine_variable could add it to the rel somehow, and maybe we could do
> that with index expressions too?

Yes this would be useful for single-expression extended statistics as well as 
statistics collected from a functional index.

I don't know if it's doable, but if we want to provide insights into how 
statistics are used, it could be nice to also display the statistics target 
used. Since the values at the time of the last analyze and the current value 
might be different, it could be nice to store it along with the stats. I 
remember having to troubleshoot queries where the problem was that an ALTER TABLE ... SET STATISTICS had not been run as expected, and having that 
information available in the plan for a complex query might help in diagnosing 
the problem quicker.

Regards,

-- 
Ronan Dunklau






Re: Have I found an interval arithmetic bug?

2021-07-23 Thread Bruce Momjian
On Thu, Jul 22, 2021 at 03:17:52PM -0700, Zhihong Yu wrote:
> On Thu, Jul 22, 2021 at 2:59 PM Zhihong Yu  wrote:
> Hi,
> 
> -                   tm->tm_mon += (fval * MONTHS_PER_YEAR);
> +                   tm->tm_mon += rint(fval * MONTHS_PER_YEAR);
> 
> Should the handling for year use the same check as that for month ?
> 
> -                   AdjustFractDays(fval, tm, fsec, DAYS_PER_MONTH);
> +                   /* round to a full month? */
> +                   if (rint(fval * DAYS_PER_MONTH) == DAYS_PER_MONTH)
> 
> Cheers 
> 
> Hi,
> I guess the reason for the current patch was that the year-to-months
> conversion is accurate.

Our internal units are hours/days/seconds, so the spill _up_ from months
to years happens automatically:

SELECT INTERVAL '23.99 months';
 interval
--
 2 years

> On the new test:
> 
> +SELECT INTERVAL '1.16 months 01:00:00' AS "One mon 5 days one hour";
> 
> 0.16 * 31 = 4.96 < 5
> 
> I wonder why 5 days were chosen in the test output.

We use 30 days/month, not 31.  However, I think you are missing the
changes in the patch and I am just understanding them fully now.  There
are two big changes:

1.  The amount of spill from months only to days
2.  The _rounding_ of the result once we stop spilling at months or days

#2 is the part I think you missed.

One thing missing from my previous patch was the handling of negative
units, which is now handled properly in the attached patch:

SELECT INTERVAL '-1.99 years';
 interval
--
 -2 years

SELECT INTERVAL '-1.99 months';
 interval
--
 -2 mons

I ended up creating a function to handle this, which allowed me to
simplify some of the surrounding code.
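To illustrate with the test case mentioned above (behavior under the proposed
patch, using our 30-days-per-month convention):

```sql
-- 0.16 month spills down to days: 0.16 * 30 = 4.8, which rounds to 5.
SELECT INTERVAL '1.16 months 01:00:00';
-- expected under the patch: 1 mon 5 days 01:00:00
```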

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.



interval.diff.gz
Description: application/gzip


Re: Fix memory leak when output postgres_fdw's "Relations"

2021-07-23 Thread Tom Lane
Ranier Vilela  writes:
> This is a minor leak, oversight in
> https://github.com/postgres/postgres/commit/4526951d564a7eed512b4a0ac3b5893e0a115690#diff-e399f5c029192320f310a79f18c20fb18c8e916fee993237f6f82f05dad851c5

I don't think you understand how Postgres memory management works.
There's no permanent leak here, just till the end of the command;
so it's pretty doubtful that there's any need to expend cycles on
an explicit pfree.

regards, tom lane




Fix memory leak when output postgres_fdw's "Relations"

2021-07-23 Thread Ranier Vilela
Hi,

This is a minor leak, oversight in
https://github.com/postgres/postgres/commit/4526951d564a7eed512b4a0ac3b5893e0a115690#diff-e399f5c029192320f310a79f18c20fb18c8e916fee993237f6f82f05dad851c5

ExplainPropertyText does not save the *relations->data* pointer, and the
relations variable goes out of scope.

No need to backpatch.

regards,
Ranier Vilela


fix_memory_leak_postgres_fdw.patch
Description: Binary data


Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-07-23 Thread Robert Haas
On Wed, Jul 21, 2021 at 11:55 PM Amit Kapila  wrote:
> I see here we have a mix of opinions from various people. Dilip seems
> to be favoring the approach where we provide some option to the user
> for partitioned tables and automatic behavior for non-partitioned
> tables but he also seems to have mild concerns about this behavior.
> OTOH, Greg and Hou-San seem to favor an approach where we can provide
> an option to the user for both partitioned and non-partitioned tables.
> I am also in favor of providing an option to the user for the sake of
> consistency in behavior and not trying to introduce a special kind of
> invalidation as it doesn't serve the purpose for partitioned tables.
> Robert seems to be in favor of automatic behavior but it is not very
> clear to me if he is fine with dealing differently for partitioned and
> non-partitioned relations. Robert, can you please provide your opinion
> on what do you think is the best way to move forward here?

I thought we had agreed on handling partitioned and unpartitioned
tables differently, but maybe I misunderstood the discussion.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-07-23 Thread Robert Haas
On Thu, Jul 22, 2021 at 5:35 PM Mark Dilger
 wrote:
> I agree with the need to document the theory behind all this.  Event triggers 
> are dangerous because they can trap a superuser into executing code they do 
> not intend:

That's true. Regular triggers and views can also do that, and so can
operators and functions that capture queries intended to reference
system-defined objects. It's already the case that a superuser who
executes any optimizable query potentially compromises the system
unless (a) they first sanitize their search_path and (b) the query is
a SELECT that involves no views or an INSERT, UPDATE, or DELETE on a
table that has no triggers. However, event triggers owned by
non-superusers would extend the hazard to nearly all DDL commands.

Classifying event triggers as "host" security doesn't seem like the
right solution, though. I think the only way that fits into the "host"
security category is if you use some kind of definition that works by
exclusion: things that are "database" security shouldn't have
consequence X, therefore anything that does must go in some other
category. I think that kind of definition is very hard for people to
understand, though. It's much nicer to have definitions that tell you
what does go into a category than what doesn't.

I suppose one alternative we could consider is just leaving some
things uncategorized. I had hoped we could put everything in a bucket
so that the superuser role was really divided into pieces and not just
having bits chipped off of it, but maybe that's too ambitious. The
more we drift into saying that some things like "well this has to be
host security because database security breaks the model" the more
useless host security is likely to be as a concept. It's not a related
collection of things any more; it's just whatever didn't fit in the
other bucket. And nobody wants to GRANT a_bunch_of_other_stuff.

However, I also wonder whether we should think about engineering a
solution to this problem. For example, we have a row_security GUC. If
you set it to off, no RLS policies will be applied; if applying one
would have been mandatory, you will get an error instead. I don't
think that exact design will work here because there's no such thing
as permission to bypass event triggers, but we could have a
fire_event_triggers GUC (or whatever we call it) which is normally
"on" but privileged users can turn it off. Now if you're worried about
this problem, you have an easy way around it.
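To sketch what that might look like for a worried superuser (the GUC name here
is purely illustrative — nothing like it exists today):

```sql
-- Hypothetical session-level escape hatch for DDL that might otherwise
-- fire event triggers owned by less-privileged users.
SET fire_event_triggers = off;
DROP TABLE some_suspect_table;
RESET fire_event_triggers;
```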

And I think that's a good illustration of why it's a bad idea to
categorize things according to whether or not they have a certain
consequence. Suppose we said, ah well, let's make event triggers
"host" security because it's too dangerous to make them "database"
security. Well then say someone comes along and implements the feature
I just described, reducing the danger. Do we then reclassify that
feature as "database" security? The original rationale for making it
something else is no longer valid, but on the other hand, what about
backward compatibility? Classifying things based on what they do,
rather than on the ultimate consequences that they may have, avoids
this kind of conundrum.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Next Steps with Hash Indexes

2021-07-23 Thread Simon Riggs
On Thu, 22 Jul 2021 at 06:10, Amit Kapila  wrote:

> It will surely work if we have an exclusive lock on both the buckets
> (old and new) in this case but I think it is better if we can avoid
> exclusive locking the old bucket (bucket_being_split) unless it is
> really required. We need an exclusive lock on the primary bucket where
> we are trying to insert to avoid any other inserter with the same key
> but I think we don't need it for the old bucket because no inserter
> with the same key can try to insert the key in an old bucket which
> would belong to the new bucket.

Agreed.

> > I don't think btree does that, so I'm not sure we do need that for
> > hash. Yes, there is a race condition, but someone will win. Do we care
> > who? Do we care enough to take the concurrency hit?
> >
>
> I think if we don't care we might end up with duplicates. I might be
> missing something here but let me explain the problem I see. Say,
> while doing a unique check we found the same hash value in the bucket
> we are trying to insert, we can't say unique key violation at this
> stage and error out without checking the actual value in heap. This is
> because there is always a chance that two different key values can map
> to the same hash value. Now, after checking in the heap if we found
> that the actual value doesn't match so we decide to insert the value
> in the hash index, and in the meantime, another insert of the same key
> value already performed these checks and ends up inserting the value
> in hash index and that would lead to a duplicate value in the hash
> index. I think btree doesn't have a similar risk so we don't need such
> a mechanism for btree.

Agreed, after thinking about it more while coding.

All of the above is implemented in the patches below:

Complete patch for hash_multicol.v3.patch attached, slightly updated
from earlier patch.
Docs, tests, passes make check.

WIP for hash_unique.v4.patch attached, patch-on-patch, to allow us to
discuss flow of logic and shape of code.
The actual comparison is not implemented yet. It's not trivial, but it can
wait until we decide on the main logic.
Passes make check and executes attached tests.

Tests in separate file also attached, will eventually be merged into
src/test/regress/sql/hash_index.sql

No tests yet for splitting or deferred uniqueness checks. The latter
is because there are parse analysis changes needed to undo the
assumption that only btrees support uniqueness, but nothing there
looks like an issue.

Thanks for your input, have a good weekend.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


hash_unique.v4.patch
Description: Binary data


hash_multicol.v3.patch
Description: Binary data


test_hash_unique.sql
Description: Binary data


Re: Added schema level support for publication.

2021-07-23 Thread vignesh C
On Fri, Jul 23, 2021 at 6:26 AM tanghy.f...@fujitsu.com
 wrote:
>
> On Thursday, July 22, 2021 1:30 AM vignesh C  wrote:
> >
> > Thanks for reporting this issue, this issue is fixed in the attached v13 
> > patch.
> > I have changed relation name pg_publication_schema to
> > pg_publication_sch so that the names are in similar lines with
> > pg_publication_rel relation and similar changes were done for variable
> > names too.
>
> Thanks for your fixing. The issue is fixed as you said.
>
> After applying your V13 patch, I noticed that if I specify duplicate schema 
> names when using "ALTER PUBLICATION ... SET SCHEMA ...", I would get the 
> following error message:
>
> postgres=# ALTER PUBLICATION pub1 SET SCHEMA s1,s1;
> ERROR:  duplicate key value violates unique constraint 
> "pg_publication_sch_psnspcid_pspubid_index"
> DETAIL:  Key (psnspcid, pspubid)=(16406, 16405) already exists.
>
> I think the error message is pretty hard to understand. Maybe we can do
> something to improve this scenario.
>
> Here are two proposals:
> 1. Don't report error message, just add some code to make the above command 
> to be executed successfully,
>just like "ALTER PUBLICATION ... SET TABLE ..." as follolws:
>
> postgres=# ALTER PUBLICATION pub2 SET TABLE t1,t1;
> ALTER PUBLICATION
> postgres=# \dRp+ pub2
>Publication pub2
>   Owner   | All tables | Inserts | Updates | Deletes | Truncates | Via root | 
> Pubtype
> --++-+-+-+---+--+-
>  postgres | f  | t   | t   | t   | t | f| 
> t
> Tables:
> "public.t1"
>
> 2. Report an easily understandable message like: Schema s1 specified more than 
> once
>

I have changed it to not report any error; this issue is fixed in the
v14 patch attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm3DTj535ezfmm8QHLOtOkcHF2ZcCfSjfR%3DVbTbLZXFRsA%40mail.gmail.com

Regards,
Vignesh




Re: Added schema level support for publication.

2021-07-23 Thread vignesh C
On Thu, Jul 22, 2021 at 9:38 AM Greg Nancarrow  wrote:
>
> On Thu, Jul 22, 2021 at 1:42 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > Personally, the new name pg_publication_sch is not very easy to
> > understand.
> > (Maybe it's because I am not a native English speaker. If others feel
> > ok, please ignore this comment)
> >
>
> I was actually thinking the same thing.
> I prefer the full SCHEMA/schema, even for all the internal
> variables/definitions which have been changed since the last patch
> version.
> I think Vignesh was trying to be consistent with pg_publication_rel
> and pg_subscription_rel, but maybe "rel" is better understood to be an
> abbreviation for "relation" than "sch" for "schema"?
> Thoughts from others?

I have changed it back to pg_publication_schema as earlier, since both you and
Hou-san felt pg_publication_schema was better. This is fixed in the v14 patch
attached at [1].
[1] -
https://www.postgresql.org/message-id/CALDaNm3DTj535ezfmm8QHLOtOkcHF2ZcCfSjfR%3DVbTbLZXFRsA%40mail.gmail.com

Regards,
Vignesh


Re: truncating timestamps on arbitrary intervals

2021-07-23 Thread John Naylor
On Thu, Jul 22, 2021 at 4:49 PM Bauyrzhan Sakhariyev <
baurzhansahar...@gmail.com> wrote:
>
> > No, the boundary is intentionally the earlier one:
>
> I found that commit in GitHub, thanks for pointing it out.
> When I test locally origin_in_the_future case I get different results for
> positive and negative intervals (see queries #1 and #2 from above, they
> have same timestamp, origin and interval magnitude, difference is only in
> interval sign) - can it be that the version I downloaded from
> https://www.enterprisedb.com/postgresql-early-experience doesn't include
> commit with that improvement?

Sorry, I wasn't clear. The intention is that the boundary is on the lower
side, but query #1 doesn't follow that, so that's a bug in my view. I found
while developing the feature that the sign of the stride didn't seem to
matter, but evidently I didn't try with the origin in the future.

> >  I wonder if we should just disallow negative intervals here.
>
> I cannot imagine somebody using negative as a constant argument but users
> can pass another column as a first argument date or some function(ts) - not
> likely but possible. A line in docs about the leftmost point of interval as
> start of the bin could be helpful.

In recent years there have been at least two attempts to add an absolute
value function for intervals, and both stalled over semantics, so I'd
rather just side-step the issue, especially as we're in beta.

> > In the case of full units (1 minute, 1 hour, etc.), it gives the same
> > result as the analogous date_trunc call,
> Was not obvious to me that we need to supply Monday origin to make
> date_bin(1 week, ts) produce same result with date_trunc

The docs for date_trunc() don't mention this explicitly, but it might be
worth mentioning ISO weeks. There is a nearby mention for EXTRACT():

https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT

"The number of the ISO 8601 week-numbering week of the year. By definition,
ISO weeks start on Mondays and the first week of a year contains January 4
of that year. In other words, the first Thursday of a year is in week 1 of
that year."
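Concretely, supplying an origin that falls on a Monday (2001-01-01 was one)
should make date_bin() line up with date_trunc() — a quick check against a
recent beta:

```sql
SELECT date_bin('7 days', TIMESTAMP '2021-07-23 12:00',
                TIMESTAMP '2001-01-01')
     = date_trunc('week', TIMESTAMP '2021-07-23 12:00');
-- both sides should land on Monday 2021-07-19 00:00:00
```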

> Sorry for the verbose report and thanks for the nice function - I know
> it's not yet released, was just playing around with beta as I want to align
> CrateDB date_bin with Postgresql

Thanks again for testing! This is good feedback.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: allow partial union-all and improve parallel subquery costing

2021-07-23 Thread Ronan Dunklau
On Monday, April 12, 2021 at 14:01:36 CEST, Luc Vlaming wrote:
> Here's an improved and rebased patch. Hope the description helps some
> people. I will resubmit it to the next commitfest.
> 

Hello Luc,

I've taken a look at this patch, and while I don't fully understand its
implications, here are a couple of remarks.

I think you should add a test demonstrating the use of the new partial append 
path you add, for example using your base query:

explain (costs off)
select sum(two) from
(   
select *, 1::int  from tenk1 a
union all
select *, 1::bigint  from tenk1 b 
) t
;

I'm not sure I understand why the subquery scan row estimate has not been
accounted for like you propose before, because the way it's done as of now
basically doubles the estimate for the subquery scan, since we account for it
already being divided by its number of workers, as mentioned in cost_append:

/*
 * Apply parallel divisor to subpaths.  Scale the number of rows
 * for each partial subpath based on the ratio of the parallel
 * divisor originally used for the subpath to the one we adopted.
 * Also add the cost of partial paths to the total cost, but
 * ignore non-partial paths for now.
 */

Do we have other nodes for which we make this assumption ?

Also, adding a partial path comprised only of underlying partial paths might
not be enough: maybe we should add a partial path even in the case of mixed
partial / non-partial paths, as is done in add_paths_to_append_rel?

Regards,

-- 
Ronan Dunklau






Re: Avoiding data loss with synchronous replication

2021-07-23 Thread Andrey Borodin
Hi Nathan!

Thanks for your interest in the topic. I think in the thread [0] we almost
agreed on a general design.
The only remaining question is whether we want to treat pg_ctl stop and kill
SIGTERM differently from pg_terminate_backend().

> On 23 Jul 2021, at 02:17, Bossart, Nathan wrote:
> 
> Hi hackers,
> 
> As previously discussed [0], canceling synchronous replication waits
> can have the unfortunate side effect of making transactions visible on
> a primary server before they are replicated.  A failover at this time
> would cause such transactions to be lost.  The proposed solution in
> the previous thread [0] involved blocking such cancellations, but many
> had concerns about that approach (e.g., backends could be
> unresponsive, server restarts were still affected by this problem).  I
> would like to propose something more like what Fujii-san suggested [1]
> that would avoid blocking cancellations while still preventing data
> loss.  I believe this is a key missing piece of the synchronous
> replication functionality in PostgreSQL.
> 
> AFAICT there are a variety of ways that the aforementioned problem may
> occur:
>  1. Server restarts: As noted in the docs [2], "waiting transactions
> will be marked fully committed once the primary database
> recovers."  I think there are a few options for handling this,
> but the simplest would be to simply failover anytime the primary
> server shut down.  My proposal may offer other ways of helping
> with this.
I think a simple check that no other primary exists would suffice.
Currently this is entirely the concern of the HA tool.

>  2. Backend crashes: If a backend crashes, the postmaster process
> will restart everything, leading to the same problem described in
> 1.  However, this behavior can be prevented with the
> restart_after_crash parameter [3].
>  3. Client disconnections: During waits for synchronous replication,
> interrupt processing is turned off, so disconnected clients
> actually don't seem to cause a problem.  The server will still
> wait for synchronous replication to complete prior to making the
> transaction visible on the primary.
+1.

>  4. Query cancellations and backend terminations: This appears to be
> the only gap where there is no way to avoid potential data loss,
> and it is the main target of my proposal.
> 
> Instead of blocking query cancellations and backend terminations, I
> think we should allow them to proceed, but we should keep the
> transactions marked in-progress so they do not yet become visible to
> sessions on the primary.  Once replication has caught up to the
> necessary point, the transactions can be marked completed, and
> they would finally become visible.
> 
> The main advantages of this approach are 1) it still allows for
> canceling waits for synchronous replication
You can cancel synchronous replication by
ALTER SYSTEM SET synchronous_standby_names TO 'new quorum';
SELECT pg_reload_conf();

All backends waiting for sync rep will proceed with new quorum.

> and 2) it provides an
> opportunity to view and manage waits for synchronous replication
> outside of the standard cancellation/termination functionality.  The
> tooling for 2 could even allow a session to begin waiting for
> synchronous replication again if it "inadvertently interrupted a
> replication wait..." [4].  I think the main disadvantage of this
> approach is that transactions committed by a session may not be
> immediately visible to the session when the command returns after
> canceling the wait for synchronous replication.  Instead, the
> transactions would become visible in the future once the change is
> replicated.  This may cause problems for an application if it doesn't
> handle this scenario carefully.
> 
> What are folks' opinions on this idea?  Is this something that is
> worth prototyping?

In fact you propose converting the transaction to 2PC if we get a CANCEL during
sync rep wait:
transferring locks and other stuff somewhere, acquiring a new VXid for our
backend, sending CommandComplete while it's not in fact complete, etc.
I think it's rather overly complex for the reasons given.

The ultimate purpose of synchronous replication is to make a client wait when
it's necessary to wait. If the client wishes to execute more commands, it can
open a new connection or set synchronous_commit to the desired level in the
first place. Canceling a locally committed transaction will not be possible
anyway.
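For example, a client that does not want to wait can lower the guarantee for
its own session up front:

```sql
-- Commit waits only for the local WAL flush, not for any standby;
-- 'off' would skip even the local flush wait.
SET synchronous_commit = local;
```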

Thanks!

Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/flat/6a052e81060824a8286148b1165bafedbd7c86cd.camel%40j-davis.com#415dc2f7d41b8a251b419256407bb64d



Re: Avoiding data loss with synchronous replication

2021-07-23 Thread Laurenz Albe
On Thu, 2021-07-22 at 21:17 +, Bossart, Nathan wrote:
> As previously discussed [0], canceling synchronous replication waits
> can have the unfortunate side effect of making transactions visible on
> a primary server before they are replicated.  A failover at this time
> would cause such transactions to be lost.
> 
> AFAICT there are a variety of ways that the aforementioned problem may
> occur:
>   4. Query cancellations and backend terminations: This appears to be
>  the only gap where there is no way to avoid potential data loss,
>  and it is the main target of my proposal.
> 
> Instead of blocking query cancellations and backend terminations, I
> think we should allow them to proceed, but we should keep the
> transactions marked in-progress so they do not yet become visible to
> sessions on the primary.  Once replication has caught up to the
> necessary point, the transactions can be marked completed, and
> they would finally become visible.
> 
> The main advantages of this approach are 1) it still allows for
> canceling waits for synchronous replication and 2) it provides an
> opportunity to view and manage waits for synchronous replication
> outside of the standard cancellation/termination functionality.  The
> tooling for 2 could even allow a session to begin waiting for
> synchronous replication again if it "inadvertently interrupted a
> replication wait..." [4].  I think the main disadvantage of this
> approach is that transactions committed by a session may not be
> immediately visible to the session when the command returns after
> canceling the wait for synchronous replication.  Instead, the
> transactions would become visible in the future once the change is
> replicated.  This may cause problems for an application if it doesn't
> handle this scenario carefully.
> 
> What are folks' opinions on this idea?  Is this something that is
> worth prototyping?

But that would mean that changes ostensibly rolled back (because the
cancel request succeeded) will later turn out to be committed after all,
just like it is now (only later).  Where is the advantage?

Besides, there is no room for another transaction status in the
commit log.

Yours,
Laurenz Albe





Re: Signed vs Unsigned (take 2) (src/backend/storage/ipc/procarray.c)

2021-07-23 Thread Ranier Vilela
On Fri, Jul 23, 2021 at 07:02, Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi hackers,
>
> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, passed
>> Implements feature:   tested, passed
>> Spec compliant:   tested, passed
>> Documentation:tested, passed
>>
>> The patch was tested on MacOS against master `80ba4bb3`.
>>
>> The new status of this patch is: Ready for Committer
>>
>
> The second patch seems fine too. I'm attaching both patches to trigger
> cfbot and to double-check them.
>
Thanks Aleksander, for reviewing this.

regards,
Ranier Vilela


Re: How is this possible "publication does not exist"

2021-07-23 Thread Andrey Borodin


> On 20 Dec 2019, at 06:39, Tomas Vondra 
> wrote:
> 
> While trying to reproduce this I however ran into a related issue with
> pgoutput/pg_logical_slot_get_binary_changes. If you call the function
> repeatedly (~10x) you'll get an error like this:
> 
> FATAL:  out of relcache_callback_list slots
> CONTEXT:  slot "slot", output plugin "pgoutput", in the startup callback
> server closed the connection unexpectedly
>   This probably means the server terminated abnormally
>   before or while processing the request.
> 
> The reason is very simple - each call executes pgoutput_startup, which
> does CacheRegisterRelcacheCallback in init_rel_sync_cache. And we do
> this on each pg_logical_slot_get_binary_changes() call and never remove
> the callbacks, so we simply run out of MAX_RELCACHE_CALLBACKS slots.
> 
> Not sure if this is a known issue/behavior, but it seems a bit annoying
> and possibly related to the issue reported by Dave.

Sorry for bumping an old thread.
I was involved in troubleshooting logical replication recently, and found that
its error reporting is sometimes really unhelpful.
While the source of the problem was allegedly a low max_replication_slots,
users were getting only this error about relcache_callback_list.

Maybe we could fix this particular error by deduplicating callbacks?


Best regards, Andrey Borodin.



0001-Avoid-duplication-in-relcache-and-syscache-callbacks.patch
Description: Binary data


Re: Avoiding data loss with synchronous replication

2021-07-23 Thread Amit Kapila
On Fri, Jul 23, 2021 at 2:48 AM Bossart, Nathan  wrote:
>
> Hi hackers,
>
> As previously discussed [0], canceling synchronous replication waits
> can have the unfortunate side effect of making transactions visible on
> a primary server before they are replicated.  A failover at this time
> would cause such transactions to be lost.  The proposed solution in
> the previous thread [0] involved blocking such cancellations, but many
> had concerns about that approach (e.g., backends could be
> unresponsive, server restarts were still affected by this problem).  I
> would like to propose something more like what Fujii-san suggested [1]
> that would avoid blocking cancellations while still preventing data
> loss.  I believe this is a key missing piece of the synchronous
> replication functionality in PostgreSQL.
>
> AFAICT there are a variety of ways that the aforementioned problem may
> occur:
>   1. Server restarts: As noted in the docs [2], "waiting transactions
>  will be marked fully committed once the primary database
>  recovers."  I think there are a few options for handling this,
>  but the simplest would be to simply failover anytime the primary
>  server shut down.  My proposal may offer other ways of helping
>  with this.
>   2. Backend crashes: If a backend crashes, the postmaster process
>  will restart everything, leading to the same problem described in
>  1.  However, this behavior can be prevented with the
>  restart_after_crash parameter [3].
>   3. Client disconnections: During waits for synchronous replication,
>  interrupt processing is turned off, so disconnected clients
>  actually don't seem to cause a problem.  The server will still
>  wait for synchronous replication to complete prior to making the
>  transaction visible on the primary.
>   4. Query cancellations and backend terminations: This appears to be
>  the only gap where there is no way to avoid potential data loss,
>  and it is the main target of my proposal.
>
> Instead of blocking query cancellations and backend terminations, I
> think we should allow them to proceed, but we should keep the
> transactions marked in-progress so they do not yet become visible to
> sessions on the primary.
>

One naive question: what if the primary gets an error while changing
the status from in-progress to committed? Won't the transaction then be
visible on the standby but not on the primary?

>  Once replication has caught up to the
> necessary point, the transactions can be marked completed, and
> they would finally become visible.
>

If the session that issued the commit is terminated, will this work be done
by some background process?

-- 
With Regards,
Amit Kapila.




Re: when the startup process doesn't (logging startup delays)

2021-07-23 Thread Nitin Jadhav
> I still don't see the need for two functions LogStartupProgress and
> LogStartupProgressComplete. Most of the code is duplicate. I think we
> can just do it with a single function something like [1]:

Initially I had written a common function for these two; you can see
that in the earlier version of the patch. I later separated it since it
was too much for one function. If others also agree to make it common,
I can make that change.

> Why isn't there a
> LogStartupProgressComplete(STARTUP_PROCESS_OP_REDO)? Is it because of
> the below existing log message?
> ereport(LOG,
> (errmsg("redo done at %X/%X system usage: %s",
> LSN_FORMAT_ARGS(ReadRecPtr),
> pg_rusage_show(;

Yes. Adding another log message would make it redundant.

>  I think it should be, "," after occurred instead of "."
> + * elapsed or not. TRUE if timeout occurred, FALSE otherwise.
> instead of
> + * elapsed or not. TRUE if timeout occurred. FALSE otherwise.

Fixed.

> I think walkdir() should only call LogStartupProgress(FSYNC_IN_PROGRESS, 
> path);
> when action == datadir_fsync_fname.

I agree and fixed it.

> ResetUnloggedRelations() is calling
> ResetUnloggedRelationsInTablespaceDir("base", op);
> before calling InitStartupProgress().

Fixed.

> This shows StartupXLOG() calling ResetUnloggedRelations() twice.
> Should they both be shown ?  If so, maybe they should be distinguished as 
> here:
>
>elog(DEBUG1, "resetting unlogged relations: cleanup %d init %d",
> (op & UNLOGGED_RELATION_CLEANUP) != 0,
> (op & UNLOGGED_RELATION_INIT) != 0);

Fixed. Added separate codes to distinguish.

Please find the patch attached.




On Wed, Jul 21, 2021 at 6:43 PM Justin Pryzby  wrote:
>
> I think walkdir() should only call LogStartupProgress(FSYNC_IN_PROGRESS, 
> path);
> when action == datadir_fsync_fname.
>
> ResetUnloggedRelations() is calling
> ResetUnloggedRelationsInTablespaceDir("base", op);
> before calling InitStartupProgress().
>
> This shows StartupXLOG() calling ResetUnloggedRelations() twice.
> Should they both be shown ?  If so, maybe they should be distinguished as 
> here:
>
> elog(DEBUG1, "resetting unlogged relations: cleanup %d init %d",
>  (op & UNLOGGED_RELATION_CLEANUP) != 0,
>  (op & UNLOGGED_RELATION_INIT) != 0);
>
> On Wed, Jul 21, 2021 at 12:52:24PM +0530, Nitin Jadhav wrote:
> > 2021-07-20 18:47:32.046 IST [102230] LOG:  listening on IPv4 address 
> > "127.0.0.1", port 5445
> > 2021-07-20 18:47:32.048 IST [102230] LOG:  listening on Unix socket 
> > "/tmp/.s.PGSQL.5445"
> > 2021-07-20 18:47:32.051 IST [102234] LOG:  database system was interrupted; 
> > last known up at 2021-07-20 18:46:27 IST
> > 2021-07-20 18:47:32.060 IST [102234] LOG:  data directory sync (fsync) 
> > complete after 0.00 s
> > 2021-07-20 18:47:32.060 IST [102234] LOG:  database system was not properly 
> > shut down; automatic recovery in progress
> > 2021-07-20 18:47:32.063 IST [102234] LOG:  unlogged relations reset after 
> > 0.00 s
> > 2021-07-20 18:47:32.063 IST [102234] LOG:  redo starts at 0/14EF418
> > 2021-07-20 18:47:33.063 IST [102234] LOG:  redo in progress, elapsed time: 
> > 1.00 s, current LSN: 0/5C13D28
> > 2021-07-20 18:47:34.063 IST [102234] LOG:  redo in progress, elapsed time: 
> > 2.00 s, current LSN: 0/A289160
> > 2021-07-20 18:47:35.063 IST [102234] LOG:  redo in progress, elapsed time: 
> > 3.00 s, current LSN: 0/EE2DE10
> > 2021-07-20 18:47:35.563 IST [102234] LOG:  invalid record length at 
> > 0/115C63E0: wanted 24, got 0
> > 2021-07-20 18:47:35.563 IST [102234] LOG:  redo done at 0/115C63B8 system 
> > usage: CPU: user: 3.58 s, system: 0.14 s, elapsed: 3.50 s
> > 2021-07-20 18:47:35.564 IST [102234] LOG:  unlogged relations reset after 
> > 0.00 s
> > 2021-07-20 18:47:35.706 IST [102230] LOG:  database system is ready to 
> > accept connections
>
> --
> Justin
From 85c0e711cc48bcf2f76ed8ed0257af3ecf2de306 Mon Sep 17 00:00:00 2001
From: Nitin 
Date: Fri, 23 Jul 2021 15:46:56 +0530
Subject: [PATCH] Shows the progress of the operations performed during startup
 process. The interval between each progress update is configurable and it
 should be mentioned in seconds

---
 src/backend/access/transam/xlog.c | 178 ++
 src/backend/postmaster/startup.c  |   1 +
 src/backend/storage/file/fd.c |  13 ++
 src/backend/storage/file/reinit.c |  18 +++
 src/backend/utils/misc/guc.c  |  13 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/access/xlog.h |  18 +++
 src/include/utils/timeout.h   |   1 +
 8 files changed, 247 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3479402..4f05205 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -79,6 +79,7 @@
 #include "utils/relmapper.h"
 #include "utils/pg_rusage.h"
 

Re: row filtering for logical replication

2021-07-23 Thread Amit Kapila
On Fri, Jul 23, 2021 at 2:27 PM Rahila Syed  wrote:
>
> On Fri, Jul 23, 2021 at 8:36 AM Amit Kapila  wrote:
>>
>> On Fri, Jul 23, 2021 at 8:29 AM Amit Kapila  wrote:
>> >
>> > On Thu, Jul 22, 2021 at 8:06 PM Dilip Kumar  wrote:
>> > >
>> > > On Thu, Jul 22, 2021 at 5:15 PM Amit Kapila  
>> > > wrote:
>> > > >
>> > > > On Tue, Jul 20, 2021 at 4:33 PM Dilip Kumar  
>> > > > wrote:
>> > > > >
>> > > > > On Tue, Jul 20, 2021 at 3:43 PM Tomas Vondra
>> > > > >  wrote:
>> > > > > >
>> > > > > > Do we log the TOAST-ed values that were not updated?
>> > > > >
>> > > > > No, we don't, I have submitted a patch sometime back to fix that [1]
>> > > > >
>> > > >
>> > > > That patch seems to log WAL for key unchanged columns. What about if
>> > > > unchanged non-key columns? Do they get logged as part of the new tuple
>> > > > or is there some other way we can get those? If not, then we need to
>> > > > probably think of restricting filter clause in some way.
>> > >
>> > > But what sort of restrictions? I mean we can not put based on data
>> > > type right that will be too restrictive,
>> > >
>> >
>> > Yeah, data type restriction sounds too restrictive and unless the data
>> > is toasted, the data will be anyway available. I think such kind of
>> > restriction should be the last resort but let's try to see if we can
>> > do something better.
>> >
>> > > other option is only to allow
>> > > replica identity keys columns in the filter condition?
>> > >
>> >
>> > Yes, that is what I had in mind because if key column(s) is changed
>> > then we will have data for both old and new tuples. But if it is not
>> > changed then we will have it probably for the old tuple unless we
>> > decide to fix the bug you mentioned in a different way in which case
>> > we might either need to log it for the purpose of this feature (but
>> > that will be any way for HEAD) or need to come up with some other
>> > solution here. I think we can't even fetch such columns data during
>> > decoding because we have catalog-only historic snapshots here. Do you
>> > have any better ideas?
>> >
>>
>> BTW, I wonder how pglogical can handle this because if these unchanged
>> toasted values are not logged in WAL for the new tuple then how the
>> comparison for such columns will work? Either they are forcing WAL in
>> some way or don't allow WHERE clause on such columns or maybe they
>> have dealt with it in some other way unless they are unaware of this
>> problem.
>>
>
> The column comparison for row filtering happens before the unchanged toast
> columns are filtered. Unchanged toast columns are filtered just before 
> writing the tuple
> to output stream.
>

To perform filtering, you need to use the tuple from WAL and that
tuple doesn't seem to have unchanged toast values, so how can we do
filtering? I think it is a good idea to test this once.
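To make the case concrete, a rough sketch of the scenario (the WHERE-clause publication syntax is the one proposed in this thread, so details may differ from the actual patch):

```sql
CREATE TABLE t (id int PRIMARY KEY, big text);
-- Force out-of-line TOAST storage for the filtered column.
ALTER TABLE t ALTER COLUMN big SET STORAGE EXTERNAL;

-- Row filter on a toastable column, using the proposed syntax.
CREATE PUBLICATION pub FOR TABLE t WHERE (big LIKE 'x%');

-- Large value gets TOASTed out of line.
INSERT INTO t VALUES (1, repeat('x', 1000000));

-- 'big' is unchanged, so its value is not part of the new tuple in WAL;
-- during decoding the filter has no way to evaluate big LIKE 'x%'.
UPDATE t SET id = 2 WHERE id = 1;
```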

-- 
With Regards,
Amit Kapila.




Re: logical replication empty transactions

2021-07-23 Thread Ajin Cherian
On Fri, Jul 23, 2021 at 7:38 PM Peter Smith  wrote:
>
> I have reviewed the v10 patch.
>
> Apply / build / test was all OK.
>
> Just one review comment:
>
> //
>
> 1. Typo
>
> @@ -130,6 +132,17 @@ typedef struct RelationSyncEntry
>   TupleConversionMap *map;
>  } RelationSyncEntry;
>
> +/*
> + * Maintain a per-transaction level variable to track whether the
> + * transaction has sent BEGIN or BEGIN PREPARE. BEGIN or BEGIN PREPARE
> + * is only sent when the first change in a transaction is processed.
> + * This make it possible to skip transactions that are empty.
> + */
>
> =>
>
> typo: "make it possible" --> "makes it possible"
>

fixed.

regards,
Ajin Cherian
Fujitsu Australia


v11-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: [HACKERS] logical decoding of two-phase transactions

2021-07-23 Thread Amit Kapila
On Tue, Jul 20, 2021 at 9:24 AM Peter Smith  wrote:
>
> Please find attached the latest patch set v98*
>

Review comments:

1.
/*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)

Let's extract this common functionality (common to current code and
the patch) as a separate patch? I think we can commit this as a
separate patch.

2.
apply_spooled_messages()
{
..
  elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
  nchanges, path);
..
}

You have this DEBUG1 message in apply_spooled_messages and its
callers. You can remove it from callers as the patch already has
another debug message to indicate whether it is stream prepare or
stream commit. Also, if this is the only reason to return nchanges
from apply_spooled_messages() then we can get rid of that as well.

3.
+ /*
+ * 2. Mark the transaction as prepared. - Similar code as for
+ * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+ */
+
+ /*
+ * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+ * called within the PrepareTransactionBlock below.
+ */
+ BeginTransactionBlock();
+ CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data.end_lsn;
+ replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+ PrepareTransactionBlock(gid);

I think you can move this part into a common function
apply_handle_prepare_internal. If that is possible then you might want
to move this part into a common functionality patch as mentioned in
point-1.
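To make point 3 concrete, the shared part could be factored roughly like this (a sketch only — the helper name and signature are made up, and it assumes the surrounding worker code from the patch):

```c
/*
 * Hypothetical common helper for apply_handle_prepare() and
 * apply_handle_stream_prepare(): mark the in-progress transaction
 * as prepared and record the replication origin progress.
 */
static void
apply_handle_prepare_internal(LogicalRepPreparedTxnData *prepare_data,
			      const char *gid)
{
	/*
	 * BeginTransactionBlock is necessary to balance the
	 * EndTransactionBlock called within PrepareTransactionBlock below.
	 */
	BeginTransactionBlock();
	CommitTransactionCommand();	/* completes the preceding Begin command */

	/*
	 * Update origin state so we can restart streaming from the correct
	 * position in case of a crash.
	 */
	replorigin_session_origin_lsn = prepare_data->end_lsn;
	replorigin_session_origin_timestamp = prepare_data->prepare_time;

	PrepareTransactionBlock(gid);
}
```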

4.
+ xid = logicalrep_read_stream_prepare(s, _data);
+ elog(DEBUG1, "received prepare for streamed transaction %u", xid);

It is better to have an empty line between the above code lines for
the sake of clarity.

5.
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData

How is this structure related to abort? Even if it is, why does this
comment belong to this patch?

6. Most of the code in logicalrep_write_stream_prepare() and
logicalrep_write_prepare() is same except for message. I think if we
want we can handle both of them with a single message by setting some
flag for stream case but probably there will be some additional
checking required on the worker-side. What do you think? I think if we
want to keep them separate then at least we should keep the common
functionality in logicalrep_write_*/logicalrep_read_* in separate
functions. This way we will avoid minor inconsistencies in-stream and
non-stream functions.

7.
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
Begin Prepare and Prepare messages belong to the same transaction.
It also sends changes of large in-progress transactions between a pair of
Stream Start and Stream Stop messages. The last stream of such a transaction
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit or Stream Abort message.

I am not sure if it is correct to mention Stream Prepare here because
after that we will send commit prepared as well for such a
transaction. So, I think we should remove this change.

8.
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+

+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);

Is there a reason for this change in the tests?

9.
I think this contains a lot of streaming tests in 023_twophase_stream.
Let's keep just one test for crash-restart scenario (+# Check that 2PC
COMMIT PREPARED is decoded properly on crash restart.) where both
publisher and subscriber get restarted. I think others are covered in
one or another way by other existing tests. Apart from that, I also
don't see the need for the below tests:
# Do DELETE after PREPARE but before COMMIT PREPARED.
This is mostly the same as the previous test where the patch is testing Insert
# Try 2PC transaction works using an empty GID literal
This is covered in 021_twophase.

10.
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.

In the above comment, you might want to say something about streaming.
In general, I am not sure if it is really adding value to have these
many streaming tests for cascaded setup and doing the whole setup
again after we have done in tests 022_twophase_cascade. I think it is
sufficient to do just one or two streaming tests by enhancing
022_twophase_cascade, you can alter subscription to enable streaming
after doing non-streaming tests.

11. Have you verified that all these tests went through the streaming
code path? If not, you can once enable DEBUG message in

Re: Signed vs Unsigned (take 2) (src/backend/storage/ipc/procarray.c)

2021-07-23 Thread Aleksander Alekseev
Hi hackers,

The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:tested, passed
>
> The patch was tested on MacOS against master `80ba4bb3`.
>
> The new status of this patch is: Ready for Committer
>

The second patch seems fine too. I'm attaching both patches to trigger
cfbot and to double-check them.

-- 
Best regards,
Aleksander Alekseev


0001-Reduce-Wsign-compare-warnings-from-clang-12-compiler.patch
Description: Binary data


0002-Promove-unshadowing-of-two-variables-PGPROC-type.patch
Description: Binary data


Re: Signed vs Unsigned (take 2) (src/backend/storage/ipc/procarray.c)

2021-07-23 Thread Aleksander Alekseev
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

The patch was tested on MacOS against master `80ba4bb3`.

The new status of this patch is: Ready for Committer


Re: logical replication empty transactions

2021-07-23 Thread Peter Smith
I have reviewed the v10 patch.

Apply / build / test was all OK.

Just one review comment:

//

1. Typo

@@ -130,6 +132,17 @@ typedef struct RelationSyncEntry
  TupleConversionMap *map;
 } RelationSyncEntry;

+/*
+ * Maintain a per-transaction level variable to track whether the
+ * transaction has sent BEGIN or BEGIN PREPARE. BEGIN or BEGIN PREPARE
+ * is only sent when the first change in a transaction is processed.
+ * This make it possible to skip transactions that are empty.
+ */

=>

typo: "make it possible" --> "makes it possible"

--

Kind Regards,
Peter Smith.
Fujitsu Australia




Logical replication error "no record found" /* shouldn't happen */

2021-07-23 Thread Andrey Borodin
Hi!

From time to time I observe $subj on clusters using logical replication.
In most cases there are a lot of other errors. Probably the $subj condition
should be more or less impossible without other problems.
I propose to enhance error logging of XLogReadRecord() in ReadPageInternal().

Thank you!

Best regards, Andrey Borodin.



0001-Improve-error-reporting-of-ReadPageInternal.patch
Description: Binary data


Re: row filtering for logical replication

2021-07-23 Thread Rahila Syed
On Fri, Jul 23, 2021 at 8:36 AM Amit Kapila  wrote:

> On Fri, Jul 23, 2021 at 8:29 AM Amit Kapila 
> wrote:
> >
> > On Thu, Jul 22, 2021 at 8:06 PM Dilip Kumar 
> wrote:
> > >
> > > On Thu, Jul 22, 2021 at 5:15 PM Amit Kapila 
> wrote:
> > > >
> > > > On Tue, Jul 20, 2021 at 4:33 PM Dilip Kumar 
> wrote:
> > > > >
> > > > > On Tue, Jul 20, 2021 at 3:43 PM Tomas Vondra
> > > > >  wrote:
> > > > > >
> > > > > > Do we log the TOAST-ed values that were not updated?
> > > > >
> > > > > No, we don't, I have submitted a patch sometime back to fix that
> [1]
> > > > >
> > > >
> > > > That patch seems to log WAL for key unchanged columns. What about if
> > > > unchanged non-key columns? Do they get logged as part of the new
> tuple
> > > > or is there some other way we can get those? If not, then we need to
> > > > probably think of restricting filter clause in some way.
> > >
> > > But what sort of restrictions? I mean we can not put based on data
> > > type right that will be too restrictive,
> > >
> >
> > Yeah, data type restriction sounds too restrictive and unless the data
> > is toasted, the data will be anyway available. I think such kind of
> > restriction should be the last resort but let's try to see if we can
> > do something better.
> >
> > > other option is only to allow
> > > replica identity keys columns in the filter condition?
> > >
> >
> > Yes, that is what I had in mind because if key column(s) is changed
> > then we will have data for both old and new tuples. But if it is not
> > changed then we will have it probably for the old tuple unless we
> > decide to fix the bug you mentioned in a different way in which case
> > we might either need to log it for the purpose of this feature (but
> > that will be any way for HEAD) or need to come up with some other
> > solution here. I think we can't even fetch such columns data during
> > decoding because we have catalog-only historic snapshots here. Do you
> > have any better ideas?
> >
>
> BTW, I wonder how pglogical can handle this because if these unchanged
> toasted values are not logged in WAL for the new tuple then how the
> comparison for such columns will work? Either they are forcing WAL in
> some way or don't allow WHERE clause on such columns or maybe they
> have dealt with it in some other way unless they are unaware of this
> problem.
>
>
The column comparison for row filtering happens before the unchanged toast
columns are filtered. Unchanged toast columns are filtered just before
writing the tuple to the output stream. I think this is the case both for
pglogical and the proposed patch.
So, I can't see why not logging the unchanged toast columns would be a
problem for row filtering. Am I missing something?


Thank you,
Rahila Syed


Re: proposal: enhancing plpgsql debug API - returns text value of variable content

2021-07-23 Thread Pavel Stehule
pá 23. 7. 2021 v 10:30 odesílatel Aleksander Alekseev <
aleksan...@timescale.com> napsal:

> Hi Pavel,
>
> > I know it. Attached patch tries to fix this issue
> >
> > I merged your patch (thank you)
>
> Thanks! I did some more minor changes, mostly in the comments. See the
> attached patch. Other than that it looks OK. I think it's Ready for
> Committer now.
>

looks well,

thank you very much

Pavel


> --
> Best regards,
> Aleksander Alekseev
>


Re: proposal: enhancing plpgsql debug API - returns text value of variable content

2021-07-23 Thread Aleksander Alekseev
Hi Pavel,

> I know it. Attached patch tries to fix this issue
>
> I merged your patch (thank you)

Thanks! I did some more minor changes, mostly in the comments. See the
attached patch. Other than that it looks OK. I think it's Ready for
Committer now.

-- 
Best regards,
Aleksander Alekseev


enhancing-plpgsql-dbgapi-20210723.patch
Description: Binary data


Re: psql - add SHOW_ALL_RESULTS option

2021-07-23 Thread Pavel Stehule
pá 23. 7. 2021 v 9:41 odesílatel Fabien COELHO  napsal:

>
> >>> I tested manually for the pager feature, which mostly works, although
> >>> "pspg --stream" does not seem to expect two tables, or maybe there is
> >>> a way to switch between these that I have not found.
> >>
> >> pspg doesn't support this feature.
>
> Sure. Note that it is not a feature yet:-)
>
> ISTM that having some configurable pager-targetted marker would greatly
> help parsing on the pager side, so this might be the way to go, if this
> finally becomes a feature.
>

Yes, it can help me a lot, and pspg can be less sensitive (or immune)
to synchronization errors.

Pavel


> --
> Fabien.
>


Re: psql - add SHOW_ALL_RESULTS option

2021-07-23 Thread Fabien COELHO



I tested manually for the pager feature, which mostly works, although
"pspg --stream" does not seem to expect two tables, or maybe there is
a way to switch between these that I have not found.


pspg doesn't support this feature.


Sure. Note that it is not a feature yet:-)

ISTM that having some configurable pager-targetted marker would greatly 
help parsing on the pager side, so this might be the way to go, if this

finally becomes a feature.

--
Fabien.




Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-07-23 Thread Greg Nancarrow
On Thu, Jul 22, 2021 at 2:15 AM Robert Haas  wrote:
>
> Thanks to Thomas Munro for drawing my attention to this thread. I
> wasn't intentionally ignoring it, but there's a lot of email in the
> world and only so much time.
>
> When I wrote this code originally, the idea that I had in mind was
> simply this: whatever state we have in the leader ought to be
> reproduced in each worker. So if there's an active snapshot in the
> leader, we ought to make that active in all the workers, and if
> there's a transaction snapshot in the leader, we ought to make that
> the transaction snapshot in all of the workers.
>
> But I see now that my thinking was fuzzy, and I'm going to blame that
> on the name GetTransactionSnapshot() being slightly misleading. If
> IsolationUsesXactSnapshot() is true, then there's really such a thing
> as a transaction snapshot and reproducing that in the worker is a
> sensible thing to do. But when !IsolationUsesXactSnapshot(),
> GetTransactionSnapshot() doesn't just "get the transaction snapshot",
> because there isn't any such thing. It takes a whole new snapshot, on
> the theory that you wouldn't be calling this function unless you had
> finished up with the snapshot you got the last time you called this
> function. And in the case of initiating parallel query, that is the
> wrong thing.
>
> I think that, at least in the case where IsolationUsesXactSnapshot()
> is true, we need to make sure that calling GetTransactionSnapshot() in
> a worker produces the same result that it would have produced in the
> leader. Say one of the workers calls an sql or plpgsql function and
> that function runs a bunch of SQL statements. It seems to me that
> there's probably a way for this to result in calls inside the worker
> to GetTransactionSnapshot(), and if that doesn't return the same
> snapshot as in the leader, then we've broken MVCC.
>
> What about when IsolationUsesXactSnapshot() is false? Perhaps it's OK
> to just skip this altogether in that case. Certainly what we're doing
> can't be right, because copying a snapshot that wouldn't have been
> taken without parallel query can't ever be the right thing to do.
> Perhaps we need to copy something else instead. I'm not really sure.
>
> So I think v2 is probably on the right track, but wrong when the
> transaction isolation level is REPEATABLE READ or SERIALIZABLE, and v3
> and v4 just seem like unprincipled hacks that try to avoid the
> assertion failure by lying about whether there's a problem.
>

Many thanks for taking time to respond to this (and thanks to Thomas Munro too).
It's much appreciated, as this is a complex area.
For the time being, I'll reinstate the v2 patch (say as v5) as a
partial solution, and then work on addressing the REPEATABLE READ and
SERIALIZABLE transaction isolation levels, which you point out are not
handled correctly by the patch.
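Schematically, the direction being discussed amounts to something like the following on the leader's serialization side (an illustrative fragment, not actual patch code):

```c
/*
 * Only REPEATABLE READ and SERIALIZABLE actually have a transaction
 * snapshot to reproduce in the workers; under READ COMMITTED,
 * GetTransactionSnapshot() would take a brand new snapshot, which is
 * the wrong thing to ship to parallel workers.
 */
if (IsolationUsesXactSnapshot())
    SerializeSnapshot(GetTransactionSnapshot(), tsnapspace);
```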

Regards,
Greg Nancarrow
Fujitsu Australia


v5-0001-PG15-Fix-parallel-worker-failed-assertion-and-coredump.patch
Description: Binary data