Re: parallel vacuum comments
On Fri, Dec 17, 2021 at 10:51 AM Masahiko Sawada wrote:
>
> I've attached updated patches. The first patch just moves common
> function for index bulk-deletion and cleanup to vacuum.c. And the
> second patch moves parallel vacuum code to vacuumparallel.c. The
> comments I got so far are incorporated into these patches. Please
> review them.
>

Thanks, it is helpful for the purpose of review.

Few comments:
=============
1.
+ * dead_items stores TIDs whose index tuples are deleted by index vacuuming.
+ * Each TID points to an LP_DEAD line pointer from a heap page that has been
+ * processed by lazy_scan_prune. Also needed by lazy_vacuum_heap_rel, which
+ * marks the same LP_DEAD line pointers as LP_UNUSED during second heap pass.
  */
- LVDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */

Isn't it better to keep these comments atop the structure VacDeadItems declaration?

2. What is the reason for not moving lazy_vacuum_one_index/lazy_cleanup_one_index to vacuum.c so that they can be called from vacuumlazy.c and vacuumparallel.c? Without this refactoring patch, won't both the leader and workers set the same error context phase (VACUUM_ERRCB_PHASE_VACUUM_INDEX) during index vacuuming? Is it because you want a separate context phase for a parallel vacuum? The other thing related to this is that if we have to do it the way you have it here then we don't need pg_rusage_init() in functions lazy_vacuum_one_index/lazy_cleanup_one_index.

3.
@@ -3713,7 +3152,7 @@ update_index_statistics(LVRelState *vacrel)
  int nindexes = vacrel->nindexes;
  IndexBulkDeleteResult **indstats = vacrel->indstats;

- Assert(!IsInParallelMode());
+ Assert(!ParallelVacuumIsActive(vacrel));

I think we can retain the older Assert. If we do that then I think we don't need to define ParallelVacuumIsActive in vacuumlazy.c.

--
With Regards,
Amit Kapila.
Re: sequences vs. synchronous replication
On 12/18/21 05:52, Tom Lane wrote:
> Tomas Vondra writes:
>> The problem is exactly the same as in [1] - the aborted transaction
>> generated WAL, but RecordTransactionAbort() ignores that and does not
>> update LogwrtResult.Write, with the reasoning that aborted transactions
>> do not matter. But sequences violate that, because we only write WAL
>> once every 32 increments, so the following nextval() gets "committed"
>> without waiting for the replica (because it did not produce WAL).
>
> Ugh.
>
>> I'm not sure this is a clear data corruption bug, but it surely walks
>> and quacks like one. My proposal is to fix this by tracking the LSN of
>> the last WAL record for a sequence increment, and then checking that LSN
>> in RecordTransactionCommit() before calling XLogFlush().
>
> (1) Does that work if the aborted increment was in a different session?
> I think it is okay but I'm tired enough to not be sure.

Good point - it doesn't :-( At least not by simply storing the LSN in a global variable or something like that. The second backend needs to know the LSN of the last WAL-logged sequence increment, but only the first backend knows that. So we'd need to share that between backends somehow.

I doubt we want to track the LSN for every individual sequence (because for clusters with many dbs / sequences that may be a lot). Perhaps we could track just a fixed number of LSN values in shared memory (say, 1024), and update/read just the element determined by hash(oid). That is, the backend WAL-logging a sequence with a given oid would set the current LSN in array[hash(oid) % 1024], and the backend doing nextval() would simply remember the LSN in that slot. Yes, if there are conflicts that'll flush more than needed.

Alternatively we could simply use the current insert LSN, but that's going to flush more stuff than needed all the time.

> (2) I'm starting to wonder if we should rethink the sequence logging
> mechanism altogether. It was cool when designed, but it seems really
> problematic when you start thinking about replication behaviors.
> Perhaps if wal_level > minimal, we don't do things the same way?

Maybe, but I have no idea how the reworked WAL logging should work. Any batching seems to have this issue, and logging individual increments is likely going to be slower. Of course, reworking how sequences are WAL-logged may invalidate the "sequence decoding" patch I've been working on :-(

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
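The fixed-size slot array described above could look roughly like the following minimal sketch. This is illustrative C only, not a patch: in a real implementation the array would live in shared memory behind a lock (or use atomics), and a proper hash function would replace the plain modulo; all names here are invented.

```c
#include <stdint.h>

#define SEQ_LSN_SLOTS 1024

/* Hypothetical slot array; a real implementation would put this in
 * shared memory and protect it against concurrent access. */
static uint64_t seq_last_lsn[SEQ_LSN_SLOTS];

/* Called by whichever backend WAL-logs a sequence increment: remember
 * the LSN of that WAL record in the slot for this sequence's OID. */
void
seq_record_increment_lsn(uint32_t seqoid, uint64_t lsn)
{
    int slot = seqoid % SEQ_LSN_SLOTS;  /* stand-in for hash(oid) % 1024 */

    if (lsn > seq_last_lsn[slot])
        seq_last_lsn[slot] = lsn;
}

/* Called from nextval(): the LSN the transaction would have to flush
 * (and wait for, under synchronous replication) at commit. A hash
 * collision can only return a higher LSN, i.e. cause flushing more
 * than strictly needed - never less. */
uint64_t
seq_lsn_to_flush(uint32_t seqoid)
{
    return seq_last_lsn[seqoid % SEQ_LSN_SLOTS];
}
```

The "flush more than needed on conflicts" behavior mentioned above corresponds to two OIDs landing in the same slot: the slot keeps the higher LSN, which is safe but possibly wasteful.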
Re: sequences vs. synchronous replication
Tomas Vondra writes:
> The problem is exactly the same as in [1] - the aborted transaction
> generated WAL, but RecordTransactionAbort() ignores that and does not
> update LogwrtResult.Write, with the reasoning that aborted transactions
> do not matter. But sequences violate that, because we only write WAL
> once every 32 increments, so the following nextval() gets "committed"
> without waiting for the replica (because it did not produce WAL).

Ugh.

> I'm not sure this is a clear data corruption bug, but it surely walks
> and quacks like one. My proposal is to fix this by tracking the LSN of
> the last WAL record for a sequence increment, and then checking that LSN
> in RecordTransactionCommit() before calling XLogFlush().

(1) Does that work if the aborted increment was in a different session? I think it is okay but I'm tired enough to not be sure.

(2) I'm starting to wonder if we should rethink the sequence logging mechanism altogether. It was cool when designed, but it seems really problematic when you start thinking about replication behaviors. Perhaps if wal_level > minimal, we don't do things the same way?

regards, tom lane
Re: Column Filtering in Logical Replication
On Fri, Dec 17, 2021 at 3:16 PM houzj.f...@fujitsu.com wrote:
>
> On Friday, December 17, 2021 1:55 AM Alvaro Herrera wrote:
> > On 2021-Dec-16, houzj.f...@fujitsu.com wrote:
> >
> > > The patch ensures all columns of RT are in the column list when
> > > CREATE/ALTER publication, but it seems it doesn't prevent the user
> > > from changing the replica identity or dropping the index used in the
> > > replica identity. Do we also need to check those cases?
> >
> > Yes, we do. As it happens, I spent a couple of hours yesterday writing
> > code for that, at least partially. I haven't yet checked what happens
> > with cases like REPLICA NOTHING, or REPLICA INDEX and then dropping
> > that index.
> >
> > My initial ideas were a bit wrong BTW: I thought we should check the
> > combination of column lists in all publications (a bitwise-OR of column
> > bitmaps, so to speak). But conceptually that's wrong: we need to check
> > the column list of each publication individually instead. Otherwise, if
> > you wanted to hide a column from some publication but that column was
> > part of the replica identity, there'd be no way to identify the tuple
> > in the replica. (Or, if the pgoutput code disobeys the column list and
> > sends the replica identity even if it's not in the column list, then
> > you'd be potentially publishing data that you wanted to hide.)
>
> Thanks for the explanation.
>
> Apart from ALTER REPLICA IDENTITY and DROP INDEX, I think there could be
> some other cases we need to handle for the replica identity check:
>
> 1)
> When adding a partitioned table with a column list to the publication, I
> think we need to check the RI of all its leaf partitions, because the RI
> on the partition is the one that actually takes effect.
>
> 2)
> ALTER TABLE ADD PRIMARY KEY;
> ALTER TABLE DROP CONSTRAINT "PRIMARY KEY";
>
> If the replica identity is default, it will use the primary key, so we
> might also need to prevent the user from adding or removing a primary
> key in this case.
>
> Based on the above cases, the RI check seems like it could bring a
> considerable amount of code. So, how about we follow what we already did
> in CheckCmdReplicaIdentity(): we can put the check for the RI in that
> function, so that we can cover all the cases and reduce the code change.
> And if we are worried about the cost of doing the check for UPDATE and
> DELETE every time, we can also save the result in the relcache. It's
> safe because every operation that changes the RI will invalidate the
> relcache. We are using this approach in the row filter patch to make
> sure all columns in the row filter expression are part of the RI.
>

Another point related to the RI is that this patch seems to restrict specifying the RI columns in the column filter list irrespective of the publish action. Do we need to have such a restriction if the publication only publishes 'insert' or 'truncate'?

--
With Regards,
Amit Kapila.
Re: Column Filtering in Logical Replication
On Sat, Dec 18, 2021 at 7:04 AM Alvaro Herrera wrote:
>
> On 2021-Dec-17, Tomas Vondra wrote:
>
> > On 12/17/21 22:07, Alvaro Herrera wrote:
> > > So I've been thinking about this as a "security" item (you can see my
> > > comments to that effect sprinkled all over this thread), in the sense
> > > that if a publication "hides" some column, then the replica just won't
> > > get access to it. But in reality that's mistaken: the filtering that
> > > this patch implements is done based on the queries that *the replica*
> > > executes at its own volition; if the replica decides to ignore the
> > > list of columns, it'll be able to get all columns. All it takes is an
> > > uncooperative replica in order for the lot of data to be exposed
> > > anyway.
> >
> > Interesting, I haven't really looked at this as a security feature. And
> > in my experience if something is not carefully designed to be secure
> > from the get go, it's really hard to add that bit later ...
>
> I guess the way to really harden replication is to use the GRANT system
> at the publisher's side to restrict access for the replication user.
> This would provide actual security. So you're right that I seem to be
> barking up the wrong tree ... maybe I need to give a careful look at
> the documentation for logical replication to understand what is being
> offered, and to make sure that we explicitly indicate that limiting the
> column list does not provide any actual security.
>

IIRC, the use cases as mentioned by other databases (like Oracle) are (a) this helps when the target table doesn't have the same set of columns, or (b) when the columns contain some sensitive information like a personal identification number, etc. I think there could also be a side benefit: less data flows across the network, which could lead to faster replication, especially when the user filters out large column data.

--
With Regards,
Amit Kapila.
Re: row filtering for logical replication
On Fri, Dec 17, 2021 at 5:29 PM Greg Nancarrow wrote:
>
> On Fri, Dec 17, 2021 at 7:20 PM Ajin Cherian wrote:
> >
> > On Fri, Dec 17, 2021 at 5:46 PM Greg Nancarrow wrote:
> > >
> > > So using the v47 patch-set, I still find that the UPDATE above results
> > > in publication of an INSERT of (2,1), rather than an UPDATE of (1,1)
> > > to (2,1).
> > > This is according to the 2nd UPDATE rule below, from patch 0003.
> > >
> > > + * old-row (no match)    new-row (no match)  -> (drop change)
> > > + * old-row (no match)    new-row (match)     -> INSERT
> > > + * old-row (match)       new-row (no match)  -> DELETE
> > > + * old-row (match)       new-row (match)     -> UPDATE
> > >
> > > This is because the old row (1,1) doesn't match the UPDATE filter
> > > "(a>1)", but the new row (2,1) does.
> > > This functionality doesn't seem right to me. I don't think it can be
> > > assumed that (1,1) was never published (and thus requires an INSERT
> > > rather than UPDATE) based on these checks, because in this example,
> > > (1,1) was previously published via a different operation - INSERT (and
> > > using a different filter too).
> > > I think the fundamental problem here is that these UPDATE rules assume
> > > that the old (current) row was previously UPDATEd (and published, or
> > > not published, according to the filter applicable to UPDATE), but this
> > > is not necessarily the case.
> > > Or am I missing something?
> >
> > But it need not be correct in assuming that the old-row was part of a
> > previous INSERT either (and published, or not published according to
> > the filter applicable to an INSERT).
> > For example, change the sequence of inserts and updates prior to the
> > last update:
> >
> > truncate tbl1;
> > insert into tbl1 values (1,5); ==> not replicated since insert and !(b < 2)
> > update tbl1 set b = 1; ==> not replicated since update and !(a > 1)
> > update tbl1 set a = 2; ==> replicated and update converted to insert since (a > 1)
> >
> > In this case, the last update "update tbl1 set a = 2;" is updating a
> > row that was previously updated and not inserted and not replicated to
> > the subscriber.
> > How does the replication logic differentiate between these two cases,
> > and decide if the update was previously published or not?
> > I think it's futile for the publisher side to try and figure out the
> > history of published rows. In fact, if this level of logic is required
> > then it is best implemented on the subscriber side, which then defeats
> > the purpose of a publication filter.
> >
>
> I think it's a concern, for such a basic example with only one row,
> getting unpredictable (and even wrong) replication results, depending
> upon the order of operations.
>

I am not sure how we can deduce that. The results are based on the current and new values of the row, which is what I think we are expecting here.

> Doesn't this problem result from allowing different WHERE clauses for
> different pubactions for the same table?
> My current thoughts are that this shouldn't be allowed, and also WHERE
> clauses for INSERTs should, like UPDATE and DELETE, be restricted to
> using only columns covered by the replica identity or primary key.
>

Hmm, even if we do that, one could have removed the insert row filter by the time we are evaluating the update, so we would get the same result. I think the behavior in your example is as we expect per the specs defined by the patch and I don't see any problem, in this case, w.r.t. replication results. Let us see what others think on this?

--
With Regards,
Amit Kapila.
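For reference, the four UPDATE rules quoted from patch 0003 are simple enough to state as code. The following is just a restatement of that table (the function and enum names are invented, this is not the actual pgoutput implementation): the chosen action depends only on whether the old and new row versions match the UPDATE row filter, with no knowledge of what was previously published.

```c
#include <stdbool.h>

typedef enum RowFilterAction
{
    RF_DROP_CHANGE,     /* publish nothing */
    RF_PUBLISH_INSERT,
    RF_PUBLISH_UPDATE,
    RF_PUBLISH_DELETE
} RowFilterAction;

/* Transformation table from the 0003 patch:
 *   old (no match), new (no match) -> drop change
 *   old (no match), new (match)    -> INSERT
 *   old (match),    new (no match) -> DELETE
 *   old (match),    new (match)    -> UPDATE */
RowFilterAction
transform_update(bool old_matches, bool new_matches)
{
    if (old_matches && new_matches)
        return RF_PUBLISH_UPDATE;
    if (old_matches)
        return RF_PUBLISH_DELETE;   /* row leaves the filtered set */
    if (new_matches)
        return RF_PUBLISH_INSERT;   /* row enters the filtered set */
    return RF_DROP_CHANGE;
}
```

Greg's example is the "row enters the filtered set" branch: old row (1,1) fails (a>1), new row (2,1) passes, so an INSERT is published regardless of whether (1,1) ever reached the subscriber.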
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada wrote:
> > My emphasis here has been on making non-aggressive VACUUMs *always*
> > advance relfrozenxid, outside of certain obvious edge cases. And so
> > with all the patches applied, up to and including the opportunistic
> > freezing patch, every autovacuum of every table manages to advance
> > relfrozenxid during benchmarking -- usually to a fairly recent value.
> > I've focussed on making aggressive VACUUMs (especially anti-wraparound
> > autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
> > user keeps canceling autovacuums, maybe due to automated script that
> > performs DDL). That has taken priority over other goals, for now.
>
> Great!

Maybe this is a good time to revisit basic questions about VACUUM. I wonder if we can get rid of some of the GUCs for VACUUM now. Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for physical blocks, but we use logical units (XIDs). We probably shouldn't be using any units, but using XIDs "feels wrong" to me.

Even with my patch, it is theoretically possible that we won't be able to advance relfrozenxid very much, because we cannot get a cleanup lock on one single heap page with one old XID. But even in this extreme case, how relevant is the "age" of this old XID, really? What really matters is whether or not we can advance relfrozenxid in time (with time to spare). And so the wraparound risk of the system is not affected all that much by the age of the single oldest XID. The risk mostly comes from how much total work we still need to do to advance relfrozenxid. If the single old XID is quite old indeed (~1.5 billion XIDs), but there is only one, then we just have to freeze one tuple to be able to safely advance relfrozenxid (maybe advance it by a huge amount!). How long can it take to freeze one tuple, with the freeze map, etc?

On the other hand, the risk may be far greater if we have *many* tuples that are still unfrozen, whose XIDs are only "middle aged" right now. The idea behind vacuum_freeze_min_age seems to be to be lazy about work (tuple freezing) in the hope that we'll never need to do it, but that seems obsolete now. (It probably made a little more sense before the visibility map.)

Using XIDs makes sense for things like autovacuum_freeze_max_age, because there we have to worry about wraparound and relfrozenxid (whether or not we like it). But with this patch, and with everything else (the failsafe, insert-driven autovacuums, everything we've done over the last several years) I think that it might be time to increase the autovacuum_freeze_max_age default. Maybe even to something as high as 800 million transaction IDs, but certainly to 400 million. What do you think? (Maybe don't answer just yet, something to think about.)

> +       vacrel->aggressive = aggressive;
>         vacrel->failsafe_active = false;
>         vacrel->consider_bypass_optimization = true;
>
> How about adding skipwithvm to LVRelState too?

Agreed -- it's slightly better that way. Will change this.

>          */
> -       if (skipping_blocks && !FORCE_CHECK_PAGE())
> +       if (skipping_blocks && blkno < nblocks - 1)
>
> Why do we always need to scan the last page even if heap truncation is
> disabled (or in the failsafe mode)?

My goal here was to keep the behavior from commit e8429082, "Avoid useless truncation attempts during VACUUM", while simplifying things around skipping heap pages via the visibility map (including removing the FORCE_CHECK_PAGE() macro). Of course you're right that this particular change that you have highlighted does change the behavior a little -- now we will always treat the final page as a "scanned page", except perhaps when 100% of all pages in the relation are skipped using the visibility map. This was a deliberate choice (and perhaps even a good choice!).

I think that avoiding accessing the last heap page like this isn't worth the complexity. Note that we may already access heap pages (making them "scanned pages") despite the fact that we know it's unnecessary: the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't even try to avoid wasting CPU cycles on these not-skipped-but-skippable pages). So I think that the performance cost for the last page isn't going to be noticeable.

However, now that I think about it, I wonder... what do you think of SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today? SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly after the original visibility map implementation was committed in 2009. The idea that it helps us to advance relfrozenxid outside of aggressive VACUUMs (per the commit message from bf136cf6e3) seems like it might no longer matter with the patch -- because now we won't ever set a page all-visible but not all-frozen. Plus the idea that we need to do all this work just
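To make the SKIP_PAGES_THRESHOLD discussion concrete, here is a simplified model of the skipping behavior (not the actual vacuumlazy.c code; it ignores aggressive mode and the last-page rule discussed above): runs of all-visible pages shorter than the threshold are read anyway, even though that work is known to be unnecessary.

```c
#include <stdbool.h>

#define SKIP_PAGES_THRESHOLD 32

/* Simplified model: given which pages are all-visible in the VM, count
 * how many pages VACUUM would actually read. Only runs of at least
 * SKIP_PAGES_THRESHOLD consecutive all-visible pages are skipped;
 * shorter runs become "scanned pages" despite being skippable. */
int
pages_scanned(const bool *all_visible, int nblocks)
{
    int scanned = 0;
    int blkno = 0;

    while (blkno < nblocks)
    {
        if (!all_visible[blkno])
        {
            scanned++;
            blkno++;
            continue;
        }

        /* measure the run of consecutive all-visible pages */
        int run = 0;
        while (blkno + run < nblocks && all_visible[blkno + run])
            run++;

        if (run < SKIP_PAGES_THRESHOLD)
            scanned += run;     /* short run: read the pages anyway */
        blkno += run;
    }
    return scanned;
}
```

In this model, a table whose all-visible pages are scattered in short runs gets no benefit from the visibility map at all, which is the "not-skipped-but-skippable" waste described above.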
Re: Column Filtering in Logical Replication
On 12/18/21 02:34, Alvaro Herrera wrote:
> On 2021-Dec-17, Tomas Vondra wrote:
>
>> On 12/17/21 22:07, Alvaro Herrera wrote:
>>> So I've been thinking about this as a "security" item (you can see my
>>> comments to that effect sprinkled all over this thread), in the sense
>>> that if a publication "hides" some column, then the replica just won't
>>> get access to it. But in reality that's mistaken: the filtering that
>>> this patch implements is done based on the queries that *the replica*
>>> executes at its own volition; if the replica decides to ignore the list
>>> of columns, it'll be able to get all columns. All it takes is an
>>> uncooperative replica in order for the lot of data to be exposed anyway.
>>
>> Interesting, I haven't really looked at this as a security feature. And
>> in my experience if something is not carefully designed to be secure
>> from the get go, it's really hard to add that bit later ...
>
> I guess the way to really harden replication is to use the GRANT system
> at the publisher's side to restrict access for the replication user.
> This would provide actual security. So you're right that I seem to be
> barking up the wrong tree ... maybe I need to give a careful look at
> the documentation for logical replication to understand what is being
> offered, and to make sure that we explicitly indicate that limiting the
> column list does not provide any actual security.
>
>> You say it's the replica making the decisions, but my mental model is
>> it's the publisher decoding the data for a given list of publications
>> (which indeed is specified by the subscriber). But the subscriber can't
>> tweak the definition of publications, right? Or what do you mean by
>> queries executed by the replica? What's the gap?
>
> I am thinking of somebody modifying the code that the replica runs, so
> that it ignores the column list that the publication has been configured
> to provide; instead of querying only those columns, it would query all
> columns.

>>> If the server has a *separate* security mechanism to hide the columns
>>> (per-column privs), it is that feature that will protect the data, not
>>> the logical-replication-feature to filter out columns.
>>
>> Right. Although I haven't thought about how logical decoding interacts
>> with column privileges. I don't think logical decoding actually checks
>> column privileges - I certainly don't recall any ACL checks in
>> src/backend/replication ...
>
> Well, in practice if you're confronted with a replica that's controlled
> by a malicious user that can tweak its behavior, then replica-side
> privilege checking won't do anything useful.

I don't follow. Surely the decoding happens on the primary node, right? Which is where the ACL checks would happen, using the role the replication connection is opened with.

>>> This led me to realize that the replica-side code in tablesync.c is
>>> totally oblivious to what's the publication through which a table is
>>> being received from in the replica. So we're not aware of a replica
>>> being exposed only a subset of columns through some specific
>>> publication; and a lot more hacking is needed than this patch does, in
>>> order to be aware of which publications are being used.
>>
>> Does that mean we currently sync all the columns in the initial sync,
>> and only start filtering columns later while decoding transactions?
>
> No, it does filter the list of columns in the initial sync. But the
> current implementation is bogus, because it obtains the list of *all*
> publications in which the table is published, not just the ones that the
> subscription is configured to get data from. And the sync code doesn't
> receive the list of publications. We need more thorough patching of the
> sync code to close that hole.

Ah, got it. Thanks for the explanation. Yeah, that makes no sense.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences
On 12/15/21 14:51, Tomas Vondra wrote:
> On 12/15/21 14:20, Amit Kapila wrote:
>> On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra wrote:
>>>
>>> Hi, here's an updated version of the patches, dealing with almost all
>>> of the issues (at least in the 0001 and 0002 parts). The main changes:
>>>
>>> 1) I've removed the 'created' flag from fill_seq_with_data, as
>>> discussed. I don't think it's needed by any of the parts (not even
>>> 0003, AFAICS). We still need it in xl_seq_rec, though.
>>>
>>> 2) The GetCurrentTransactionId() calls added to sequence.c are made
>>> only with wal_level=logical, to minimize the overhead.
>>>
>>> There's still one remaining problem, that I already explained in [1].
>>> The problem is that with this:
>>>
>>> BEGIN;
>>> SELECT nextval('s') FROM generate_series(1,100);
>>> ROLLBACK;
>>>
>>> The root cause is that pg_current_wal_lsn() uses LogwrtResult.Write,
>>> which is updated by XLogFlush() - but only in RecordTransactionCommit.
>>> Which makes sense, because only the committed stuff is "visible". But
>>> the non-transactional behavior of sequence decoding disagrees with
>>> this, because now some of the changes from aborted transactions may be
>>> replicated. Which means the wait_for_catchup() ends up not waiting for
>>> the sequence change to be replicated. This is an issue for tests in
>>> patch 0003, at least. My concern is this actually affects other places
>>> waiting for things getting replicated :-/
>>
>> By any chance, will this impact synchronous replication as well which
>> waits for commits to be replicated?
>
> Physical or logical replication? Physical is certainly not replicated.

Actually, I take that back. It does affect physical (sync) replication just as well, and I think it might be considered a data loss issue. I started a new thread to discuss that, so that it's not buried in this thread about logical replication.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
sequences vs. synchronous replication
Hi,

while working on logical decoding of sequences, I ran into an issue with nextval() in a transaction that rolls back, described in [1]. But after thinking about it a bit more (and chatting with Petr Jelinek), I think this issue affects physical sync replication too.

Imagine you have a primary <-> sync_replica cluster, and you do this:

CREATE SEQUENCE s;

-- shutdown the sync replica

BEGIN;
SELECT nextval('s') FROM generate_series(1,50);
ROLLBACK;

BEGIN;
SELECT nextval('s');
COMMIT;

The natural expectation would be that the COMMIT gets stuck, waiting for the sync replica (which is not running), right? But it does not.

The problem is exactly the same as in [1] - the aborted transaction generated WAL, but RecordTransactionAbort() ignores that and does not update LogwrtResult.Write, with the reasoning that aborted transactions do not matter. But sequences violate that, because we only write WAL once every 32 increments, so the following nextval() gets "committed" without waiting for the replica (because it did not produce WAL).

I'm not sure this is a clear data corruption bug, but it surely walks and quacks like one. My proposal is to fix this by tracking the LSN of the last WAL record for a sequence increment, and then checking that LSN in RecordTransactionCommit() before calling XLogFlush().

regards

[1] https://www.postgresql.org/message-id/ae3cab67-c31e-b527-dd73-08f196999ad4%40enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
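The proposed commit-time check could be sketched as follows. This is toy code, not actual xlog.c, and all names are invented: the point is only that the flush target at commit must consider both the transaction's own WAL and the LSN of the last WAL-logged sequence increment it consumed via nextval().

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the proposed RecordTransactionCommit() change: return
 * the LSN that XLogFlush() should be called with at commit time
 * (0 means nothing to flush, hence no sync-rep wait either). */
uint64_t
commit_flush_lsn(bool wrote_xlog, uint64_t xact_end_lsn,
                 uint64_t last_seq_increment_lsn)
{
    uint64_t flush_to = wrote_xlog ? xact_end_lsn : 0;

    /* The sequence increment may have been WAL-logged by an earlier
     * transaction (possibly aborted, possibly another session), so it
     * matters even when this transaction wrote no WAL itself. */
    if (last_seq_increment_lsn > flush_to)
        flush_to = last_seq_increment_lsn;

    return flush_to;
}
```

In the reproduction above, the second transaction wrote no WAL of its own, but it would carry a nonzero sequence-increment LSN from the rolled-back nextval() batch, so its COMMIT would flush and wait for the sync replica as expected.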
Re: Column Filtering in Logical Replication
On 2021-Dec-17, Tomas Vondra wrote:

> On 12/17/21 22:07, Alvaro Herrera wrote:
> > So I've been thinking about this as a "security" item (you can see my
> > comments to that effect sprinkled all over this thread), in the sense
> > that if a publication "hides" some column, then the replica just won't
> > get access to it. But in reality that's mistaken: the filtering that
> > this patch implements is done based on the queries that *the replica*
> > executes at its own volition; if the replica decides to ignore the list
> > of columns, it'll be able to get all columns. All it takes is an
> > uncooperative replica in order for the lot of data to be exposed anyway.
>
> Interesting, I haven't really looked at this as a security feature. And
> in my experience if something is not carefully designed to be secure
> from the get go, it's really hard to add that bit later ...

I guess the way to really harden replication is to use the GRANT system at the publisher's side to restrict access for the replication user. This would provide actual security. So you're right that I seem to be barking up the wrong tree ... maybe I need to give a careful look at the documentation for logical replication to understand what is being offered, and to make sure that we explicitly indicate that limiting the column list does not provide any actual security.

> You say it's the replica making the decisions, but my mental model is
> it's the publisher decoding the data for a given list of publications
> (which indeed is specified by the subscriber). But the subscriber can't
> tweak the definition of publications, right? Or what do you mean by
> queries executed by the replica? What's the gap?

I am thinking of somebody modifying the code that the replica runs, so that it ignores the column list that the publication has been configured to provide; instead of querying only those columns, it would query all columns.

> > If the server has a *separate* security mechanism to hide the columns
> > (per-column privs), it is that feature that will protect the data, not
> > the logical-replication-feature to filter out columns.
>
> Right. Although I haven't thought about how logical decoding interacts
> with column privileges. I don't think logical decoding actually checks
> column privileges - I certainly don't recall any ACL checks in
> src/backend/replication ...

Well, in practice if you're confronted with a replica that's controlled by a malicious user that can tweak its behavior, then replica-side privilege checking won't do anything useful.

> > This led me to realize that the replica-side code in tablesync.c is
> > totally oblivious to what's the publication through which a table is
> > being received from in the replica. So we're not aware of a replica
> > being exposed only a subset of columns through some specific
> > publication; and a lot more hacking is needed than this patch does, in
> > order to be aware of which publications are being used.
>
> Does that mean we currently sync all the columns in the initial sync,
> and only start filtering columns later while decoding transactions?

No, it does filter the list of columns in the initial sync. But the current implementation is bogus, because it obtains the list of *all* publications in which the table is published, not just the ones that the subscription is configured to get data from. And the sync code doesn't receive the list of publications. We need more thorough patching of the sync code to close that hole.

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Re: WIP: WAL prefetch (another approach)
Greg Stark writes:
> But the bigger question is: are we really concerned about this flaky
> problem? Is it worth investing time and money on? I can get money to
> go buy a G4 or G5 and spend some time on it. It just seems a bit...
> niche. But if it's a real bug that represents something broken on
> other architectures that just happens to be easier to trigger here it
> might be worthwhile.

TBH, I don't know. There seem to be three plausible explanations:

1. Flaky hardware in my unit.
2. Ancient macOS bug, as Andres suggested upthread.
3. Actual PG bug.

If it's #1 or #2 then we're just wasting our time here. I'm not sure how to estimate the relative probabilities, but I suspect #3 is the least likely of the lot.

FWIW, I did just reproduce the problem on that machine with current HEAD:

2021-12-17 18:40:40.293 EST [21369] FATAL: inconsistent page found, rel 1663/167772/2673, forknum 0, blkno 26
2021-12-17 18:40:40.293 EST [21369] CONTEXT: WAL redo at C/3DE3F658 for Btree/INSERT_LEAF: off 208; blkref #0: rel 1663/167772/2673, blk 26 FPW
2021-12-17 18:40:40.522 EST [21365] LOG: startup process (PID 21369) exited with exit code 1

That was after only five loops of the regression tests, so either I got lucky or the failure probability has increased again. In any case, it seems clear that the problem exists independently of Munro's patches, so I don't really think this question should be considered a blocker for those.

regards, tom lane
Re: WIP: WAL prefetch (another approach)
On Fri, 17 Dec 2021 at 18:40, Tom Lane wrote:
>
> Greg Stark writes:
> > Hm. I seem to have picked a bad checkout. I took the last one before
> > the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).
>
> FWIW, I think that's the first one *after* the revert.

Doh.

But the bigger question is: are we really concerned about this flaky problem? Is it worth investing time and money on? I can get money to go buy a G4 or G5 and spend some time on it. It just seems a bit... niche. But if it's a real bug that represents something broken on other architectures that just happens to be easier to trigger here, it might be worthwhile.

--
greg
Re: Add id's to various elements in protocol.sgml
On Dec 17, 2021 at 08:13, Peter Eisentraut wrote:
> On 15.12.21 16:59, Brar Piening wrote:
> > On Dec 15, 2021 at 15:49, Alvaro Herrera wrote:
> > > On 2021-Dec-15, Brar Piening wrote:
> > > > Since I can't argue towards some general utility for the xreflabels
> > > > and don't have any other solid argument in favor of adding more, I
> > > > will remove them from my current patch but leave the existing ones
> > > > intact.
> > >
> > > Yeah, I think not adding them until we have a use for them might be wisest.
> >
> > A new version of the patch that doesn't add xreflabels is attached.
>
> Now this patch adds a bunch of ids, but you can't use them to link to,
> because as soon as you do, you will get complaints about a missing
> xreflabel. So what is the remaining purpose?

The purpose is that you can directly link to the id in the public HTML docs, which still get generated (e.g. https://www.postgresql.org/docs/14/protocol-replication.html#PROTOCOL-REPLICATION-BASE-BACKUP). Essentially it gives people discussing the protocol and pointing to a certain command or message format the chance to link to the very thing they are discussing, instead of the top of the lengthy HTML page.
Re: WIP: WAL prefetch (another approach)
Greg Stark writes:
> Hm. I seem to have picked a bad checkout. I took the last one before
> the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).

FWIW, I think that's the first one *after* the revert.

> 2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel
> worker" (PID 54073) was terminated by signal 10: Bus error

I'm betting on a weird emulation issue. None of my real PPC machines showed such things.

regards, tom lane
Re: WIP: WAL prefetch (another approach)
On 12/17/21 23:56, Greg Stark wrote:
> Hm. I seem to have picked a bad checkout. I took the last one before
> the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).
>
> Or there's some incompatibility with the emulation and the IPC stuff
> parallel workers use.
>
> 2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel worker" (PID 54073) was terminated by signal 10: Bus error
> 2021-12-17 17:51:51.688 EST [50955] DETAIL: Failed process was running: SELECT variance(unique1::int4), sum(unique1::int8), regr_count(unique1::float8, unique1::float8) FROM (SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1) u;
> 2021-12-17 17:51:51.690 EST [50955] LOG: terminating any other active server processes
> 2021-12-17 17:51:51.748 EST [54078] FATAL: the database system is in recovery mode
> 2021-12-17 17:51:51.761 EST [50955] LOG: all server processes terminated; reinitializing

Interesting. In my experience SIGBUS on PPC tends to be due to incorrect alignment, but I'm not sure how that works with the emulation. Can you get a backtrace?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Hm. I seem to have picked a bad checkout. I took the last one before the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).

Or there's some incompatibility with the emulation and the IPC stuff parallel workers use.

2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel worker" (PID 54073) was terminated by signal 10: Bus error
2021-12-17 17:51:51.688 EST [50955] DETAIL: Failed process was running: SELECT variance(unique1::int4), sum(unique1::int8), regr_count(unique1::float8, unique1::float8) FROM (SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1) u;
2021-12-17 17:51:51.690 EST [50955] LOG: terminating any other active server processes
2021-12-17 17:51:51.748 EST [54078] FATAL: the database system is in recovery mode
2021-12-17 17:51:51.761 EST [50955] LOG: all server processes terminated; reinitializing
Re: Column Filtering in Logical Replication
On 12/17/21 22:07, Alvaro Herrera wrote:
> So I've been thinking about this as a "security" item (you can see my
> comments to that effect sprinkled all over this thread), in the sense
> that if a publication "hides" some column, then the replica just won't
> get access to it. But in reality that's mistaken: the filtering that
> this patch implements is done based on the queries that *the replica*
> executes at its own volition; if the replica decides to ignore the list
> of columns, it'll be able to get all columns. All it takes is an
> uncooperative replica in order for the lot of data to be exposed anyway.

Interesting, I haven't really looked at this as a security feature. And in my experience, if something is not carefully designed to be secure from the get-go, it's really hard to add that bit later ...

You say it's the replica making the decisions, but my mental model is that it's the publisher decoding the data for a given list of publications (which indeed is specified by the subscriber). But the subscriber can't tweak the definition of publications, right? Or what do you mean by queries executed by the replica? What is the gap?

> If the server has a *separate* security mechanism to hide the columns
> (per-column privs), it is that feature that will protect the data, not
> the logical-replication-feature to filter out columns.

Right. Although I haven't thought about how logical decoding interacts with column privileges. I don't think logical decoding actually checks column privileges - I certainly don't recall any ACL checks in src/backend/replication ...

AFAIK we only really check privileges during the initial sync (when creating the slot and copying data), but then we keep replicating data even if the privilege gets revoked for the table/column. In principle the replication role is pretty close to superuser.

> This led me to realize that the replica-side code in tablesync.c is
> totally oblivious to what's the publication through which a table is
> being received from in the replica.
> So we're not aware of a replica being exposed only a subset of columns
> through some specific publication; and a lot more hacking is needed
> than this patch does, in order to be aware of which publications are
> being used. I'm going to have a deeper look at this whole thing.

Does that mean we currently sync all the columns in the initial sync, and only start filtering columns later, while decoding transactions?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Adding CI to our tree
On 2021-12-17 11:31:59 -0800, Andres Freund wrote:
> > Don't we need the ulimit call for FreeBSD?
>
> I think the default core limits were different, I will check.

Yep, FreeBSD has -c unlimited by default:
https://cirrus-ci.com/task/6199382239346688?logs=sysinfo#L23
vs
https://cirrus-ci.com/task/4792007355793408?logs=sysinfo#L32
Re: Column Filtering in Logical Replication
So I've been thinking about this as a "security" item (you can see my comments to that effect sprinkled all over this thread), in the sense that if a publication "hides" some column, then the replica just won't get access to it. But in reality that's mistaken: the filtering that this patch implements is done based on the queries that *the replica* executes at its own volition; if the replica decides to ignore the list of columns, it'll be able to get all columns. All it takes is an uncooperative replica in order for the lot of data to be exposed anyway.

If the server has a *separate* security mechanism to hide the columns (per-column privs), it is that feature that will protect the data, not the logical-replication-feature to filter out columns.

This led me to realize that the replica-side code in tablesync.c is totally oblivious to what's the publication through which a table is being received from in the replica. So we're not aware of a replica being exposed only a subset of columns through some specific publication; and a lot more hacking is needed than this patch does, in order to be aware of which publications are being used. I'm going to have a deeper look at this whole thing.

--
Álvaro Herrera
PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: speed up text_position() for utf-8
Attached is a short patch series to develop some ideas of inlining pg_utf_mblen().

0001 puts the main implementation of pg_utf_mblen() into an inline function and uses this in pg_mblen(). This is somewhat faster in the strpos tests, so that gives some measure of the speedup expected for other callers. Text search seems to call this a lot, so this might have a noticeable benefit.

0002 refactors text_position_get_match_pos() to use pg_mbstrlen_with_len(). This itself is significantly faster when combined with 0001, likely because the latter can inline the call to pg_mblen(). The intention is to speed up more than just text_position.

0003 explicitly specializes pg_mbstrlen_with_len() for the inline version of pg_utf_mblen(), but turns out to be almost as slow as master for ascii. It doesn't help if I undo the previous change in pg_mblen(), and I haven't investigated why yet.

0002 looks good now, but the experience with 0003 makes me hesitant to propose this seriously until I can figure out what's going on there.

The test is as earlier, a worst-case substring search, times in milliseconds:

 patch  | no match | ascii | multibyte
--------+----------+-------+-----------
 PG11   |     1220 |  1220 |      1150
 master |      385 |  2420 |      1980
 0001   |      390 |  2180 |      1670
 0002   |      389 |  1330 |      1100
 0003   |      391 |  2100 |      1360

--
John Naylor
EDB: http://www.enterprisedb.com

v2-0002-Refactor-text_position_get_match_pos-to-use-pg_mb.patch
Description: Binary data

v2-0003-Specialize-pg_mbstrlen_with_len-for-UTF-8.patch
Description: Binary data

v2-0001-Move-the-implementation-of-pg_utf_mblen-to-an-inl.patch
Description: Binary data
Re: Adding CI to our tree
> On 17 Dec 2021, at 20:31, Andres Freund wrote:
> On 2021-12-17 12:34:36 +0100, Peter Eisentraut wrote:
>> I don't like that the -s option is used. I would like to see what commands
>> are executed.
>
> I can change it - but it makes it *much* harder to spot compiler warnings.

Having used Cirrus et al. a fair bit, I strongly agree with Andres; working with huge logs in the browser is painful, whereas -s makes it usable even on mobile devices.

--
Daniel Gustafsson
https://vmware.com/
Re: make tuplestore helper function
On Mon, Dec 13, 2021 at 05:27:52PM -0500, Melanie Plageman wrote:
> Done in attached v4

Thanks.

I think these comments can be removed from the callers:
/* it's a simple type, so don't use get_call_result_type */

You removed one call to tuplestore_donestoring(), but not the others. I guess you could remove them all (optionally as a separate patch).

The rest of these are questions more than comments, and I'm not necessarily suggesting to change the patch:

This isn't new in your patch, but for me, it's clearer if the callers have a separate boolean for randomAccess, rather than having the long expression in the function call:

+	tupstore = MakeFuncResultTuplestore(fcinfo, NULL,
+			rsinfo->allowedModes & SFRM_Materialize_Random);

vs

 	randomAccess = (rsinfo->allowedModes & SFRM_Materialize_Random) != 0;
-	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+	tupstore = MakeFuncResultTuplestore(fcinfo, NULL, randomAccess);

One could also argue for passing rsinfo->allowedModes, instead of (rsinfo->allowedModes & SFRM_Materialize_Random).

There are a couple of places where you're checking expectedDesc where it wasn't being checked before. Is that some kind of live bug?
pg_config()
text_to_table()

It'd be nice if the "expected tuple format" error didn't need to be duplicated for each SRF. I suppose the helper function could just take a boolean determining whether or not to throw an error (passing "expectedDesc==NULL"). You'd have to call the helper before CreateTupleDescCopy() - which you're already doing in a couple places for similar reasons.

I noticed that tuplestore.h isn't included in most places. I suppose it's included via execnodes.h. Your patch doesn't change that; arguably it should've always been included.

--
Justin
Re: WIP: WAL prefetch (another approach)
Greg Stark writes:
> I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1
> or maybe 4.0.1. What version is on your G4?

$ gcc -v
Using built-in specs.
Target: powerpc-apple-darwin9
Configured with: /var/tmp/gcc/gcc-5493~1/src/configure --disable-checking -enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib --build=i686-apple-darwin9 --program-prefix= --host=powerpc-apple-darwin9 --target=powerpc-apple-darwin9
Thread model: posix
gcc version 4.0.1 (Apple Inc. build 5493)

I see that gcc 4.2.1 is also present on this machine, but I've never used it.

regards, tom lane
Re: WIP: WAL prefetch (another approach)
I have

IBUILD:postgresql gsstark$ ls /usr/bin/*gcc*
/usr/bin/gcc
/usr/bin/gcc-4.0
/usr/bin/gcc-4.2
/usr/bin/i686-apple-darwin9-gcc-4.0.1
/usr/bin/i686-apple-darwin9-gcc-4.2.1
/usr/bin/powerpc-apple-darwin9-gcc-4.0.1
/usr/bin/powerpc-apple-darwin9-gcc-4.2.1

I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1 or maybe 4.0.1. What version is on your G4?
Re: Adding CI to our tree
Hi,

On 2021-12-17 09:36:05 -0500, Tom Lane wrote:
> Andrew Dunstan writes:
> > Maybe I have missed it, but why are we using ccache here? That seems a
> > bit pointless in an ephemeral instance.
>
> I believe Munro's cfbot tooling is able to save and re-use ccache
> across successive instantiations of a build instance. I've not
> looked at this code, but if it can do that there'd be point to it.

Yes, the ccache cache is persisted across runs (see the *_cache and upload_cache instructions). It makes a quite substantial difference.

One reason the windows runs are a lot slower than the others is just that visual studio isn't yet supported by ccache, and there don't seem to be good other tools. The ccache maintainers merged more of the msvc support last weekend! So I have quite a bit of hope of being able to use ccache there as well.

Greetings,

Andres Freund
Re: Adding CI to our tree
Hi,

On 2021-12-17 12:34:36 +0100, Peter Eisentraut wrote:
> On 13.12.21 22:12, Andres Freund wrote:
> > Attached is an updated version of the CI patches. An example of a test run
> > with the attached version of this
> > https://cirrus-ci.com/build/6501998521483264
>
> + only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*freebsd.*'
>
> I'm not in favor of this kind of thing. I don't understand how this is
> useful, other than perhaps for developing *this* patch. I don't think
> people would like having these tags in the mainline, and if it's for local
> use, then people can adjust the file locally.

Definitely not for mainline. But it's extremely useful for development. If you iteratively try to fix windows, running all the other tests can be slower - there's a concurrency limit in how many tests you can run for free...

> + CC="ccache cc" CFLAGS="-O0 -ggdb"'
>
> I don't think using -O0 is the right thing. It will miss some compiler
> warnings, and it will not thoroughly test the compiler. We should test
> using the configurations that are close to what users actually use.

Hm. I personally always end up using -O0 for the actual development tree, and it seems a lot of others do as well. Building with -O2 makes backtraces etc. just less useful.

> +- su postgres -c 'gmake -s -j3 && gmake -s -j3 -C contrib'
>
> Why doesn't this use make world (or world-bin, if you prefer).

I started working on this well before world-bin existed. And using 'world' as the target builds the docs, which is quite expensive... I happened to actually make the change to world-bin yesterday for the next version to send :)

> Why does this use -j3 if there are two CPUs configured? (Perhaps the number
> of CPUs should be put into a variable.)

I tried a few and that worked best.

> I don't like that the -s option is used. I would like to see what commands
> are executed.

I can change it - but it makes it *much* harder to spot compiler warnings.
> +cpu: 4
>
> Why does the Linux job use 4 CPUs and the FreeBSD job 2?

I'll add a comment about it. Two reasons:
1) the limits on cirrus are lower for linux than freebsd: https://cirrus-ci.org/faq/
2) There are some issues on freebsd where test performance regresses *very* substantially with higher concurrency. Thomas and I looked a bunch into it without figuring out the details.

> +- export
>
> I don't think that is portable to all shells.

Doesn't really need to be?

> +- su postgres -c 'time script test.log gmake -s -j2 ${CHECK} ${CHECKFLAGS}'
>
> +su postgres -c '\
> + ulimit -c unlimited; \
> + make -s ${CHECK} ${CHECKFLAGS} -j8 \
> + '
>
> Not clear why these are so different. Don't we need the test.log file for
> Linux?

There's a comment about the use of script:
# Use of script is to avoid make complaints about fcntl()
# https://savannah.gnu.org/bugs/?60774
That bug is specific to platforms that don't allow locking pipes, which linux does allow, but freebsd doesn't.

> Don't we need the ulimit call for FreeBSD?

I think the default core limits were different, I will check.

> Why the -j8 option even though 4 CPUs have been configured?

That might have been an accident.

> +brew install \
> + ccache \
> + coreutils \
> + icu4c \
> + krb5 \
> + llvm \
> + lz4 \
> + make \
> + openldap \
> + openssl \
> + python@3.10 \
> + tcl-tk
>
> Curious why coreutils and make are installed? The system-supplied tools
> ought to work.

make because newer versions of make have -Otarget, which makes concurrent check-world output at least kind-of readable.

> +brew cleanup -s
>
> Seems unnecessary?

It reduces the size of the cached downloads. Not much point in keeping older versions of the packages around. Populating the cache takes time.

> +PKG_CONFIG_PATH="/usr/local/opt/krb5/lib/pkgconfig:$PKG_CONFIG_PATH"
>
> AFAICT, we don't use pkg-config for the krb5 package.

I now converted this to a loop.
> +- gmake -s -j12 && gmake -s -j12 -C contrib
>
> These numbers should also be explained or configured somewhere. Possibly
> query the number of CPUs on the instance.

macOS instances have a fixed number of cores - 12. Might make sense to query it, but not sure what a good portable way there is.

> +PROVE_FLAGS: -j10
>
> Why only on Windows?

Because windows doesn't have a way to run tests in parallel in another way; prove-level concurrency is the only thing. Whereas other platforms can run tests in parallel via make. Combining both tends to not work very well in my experience.

> +# Installation on windows currently only completely works from src\tools\msvc
>
> If that is so, let's fix that.

I did report the problem - just haven't gotten around to fixing it. Note this is also how the buildfarm invokes installation... The problem is that Install.pm includes config.pl from the c
Re: In-place persistance change of a relation
On Fri, Dec 17, 2021 at 02:33:25PM +, Jakub Wartak wrote:
> > Justin wrote:
> > On Fri, Dec 17, 2021 at 09:10:30AM +, Jakub Wartak wrote:
> > > As the thread didn't get a lot of traction, I've registered it into the
> > > current commitfest, https://commitfest.postgresql.org/36/3461/, with you
> > > as the author and in 'Ready for review' state.
> >
> > The patch is failing:
> [..]
> > I think you ran "make check", but should run something like this:
> > make check-world -j8 >check-world.log 2>&1 && echo Success
>
> Hi Justin,
>
> I've repeated the check-world and it says to me all is ok locally (also with
> --enable-cassert --enable-debug, at least on Amazon Linux 2) and also
> installcheck on default params seems to be ok.
> I don't seem to understand why the testfarm reports errors for tests like
> "path, polygon, rowsecurity" e.g. on Linux/graviton2 and FreeBSD while not
> on MacOS(?).
> Could someone point to me where to start looking/fixing?
Since it says this, it looks a lot like a memory error, like a use-after-free - like in fsync_parent_path():

CREATE TABLE PATH_TBL (f1 path);
+ERROR: could not open file <> Pacific": No such file or directory

I see at least this one is still failing, though:

time make -C src/test/recovery check

From 676ecf794b2b0e98d8f31e4245f6f455da5e19cb Mon Sep 17 00:00:00 2001
From: Jakub Wartak
Date: Thu, 16 Dec 2021 12:03:42 +
Subject: [PATCH 1/2] In-place table persistence change with new command ALTER
 TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, currently it runs heap rewrite which causes large amount of file I/O. This patch makes the command run without heap rewrite. Addition to that, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of massive number of HEAP_INSERT's, which should be smaller. Also this allows for the cleanup of files left behind in the crash of the transaction that created it.

ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED:
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  52 +++
 src/backend/access/transam/README      |   8 +
 src/backend/access/transam/xlog.c      |  17 +
 src/backend/catalog/storage.c          | 518 +++--
 src/backend/commands/tablecmds.c       | 374 --
 src/backend/nodes/copyfuncs.c          |  16 +
 src/backend/nodes/equalfuncs.c         |  15 +
 src/backend/parser/gram.y              |  20 +
 src/backend/replication/basebackup.c   |   3 +-
 src/backend/storage/buffer/bufmgr.c    |  88 +
 src/backend/storage/file/fd.c          |   4 +-
 src/backend/storage/file/reinit.c      | 318 +++
 src/backend/storage/smgr/md.c          |  92 -
 src/backend/storage/smgr/smgr.c        |  32 ++
 src/backend/storage/sync/sync.c        |  20 +-
 src/backend/tcop/utility.c             |  11 +
 src/bin/pg_rewind/parsexlog.c          |  24 ++
 src/common/relpath.c                   |  47 ++-
 src/include/catalog/storage.h          |   2 +
 src/include/catalog/storage_xlog.h     |  42 +-
 src/include/commands/tablecmds.h       |   2 +
 src/include/common/relpath.h           |   9 +-
 src/include/nodes/nodes.h              |   1 +
 src/include/nodes/parsenodes.h         |   9 +
 src/include/storage/bufmgr.h           |   2 +
 src/include/storage/fd.h               |   1 +
 src/include/storage/md.h               |   8 +-
 src/include/storage/reinit.h           |  10 +-
 src/include/storage/smgr.h             |  17 +
 29 files changed, 1583 insertions(+), 179 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 			 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+										   xlrec->rnode.spcNode,
+										   xlrec->rnode.relNode,
+										   InvalidBackendId,
+										   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->a
Re: WIP: WAL prefetch (another approach)
Greg Stark writes:
> What tools and tool versions are you using to build? Is it just GCC for PPC?
> There aren't any special build processes to make a fat binary involved?

Nope, just "configure; make" using that macOS version's regular gcc.

regards, tom lane
Re: WIP: WAL prefetch (another approach)
What tools and tool versions are you using to build? Is it just GCC for PPC? There aren't any special build processes to make a fat binary involved?

On Thu, 16 Dec 2021 at 23:11, Tom Lane wrote:
>
> Greg Stark writes:
> > But if you're interested and can explain the tests to run I can try to
> > get the tests running on this machine:
>
> I'm not sure that machine is close enough to prove much, but by all
> means give it a go if you wish. My test setup was explained in [1]:
>
> >> To recap, the test lashup is:
> >> * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive)
> >> * Standard debug build (--enable-debug --enable-cassert)
> >> * Out-of-the-box configuration, except add wal_consistency_checking = all
> >> and configure a wal-streaming standby on the same machine
> >> * Repeatedly run "make installcheck-parallel", but skip the tablespace
> >> test to avoid issues with the standby trying to use the same directory
> >> * Delay long enough after each installcheck-parallel to let the
> >> standby catch up (the run proper is ~24 min, plus 2 min for catchup)
>
> Remember also that the code in question is not in HEAD; you'd
> need to apply Munro's patches, or check out some commit from
> around 2021-04-22.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/3502526.1619925367%40sss.pgh.pa.us

--
greg
[PATCH] Add reloption for views to enable RLS
Hi all!

As part of a customer project we are looking to implement a reloption for views which, when set, runs the subquery as invoked by the user rather than the view owner, as is currently the case. The rewrite rule's table references are then checked as if the user were referencing the table(s) directly. This feature is similar to so-called 'SECURITY INVOKER' views in other DBMS. Although such permission checking could be implemented using views which SELECT from a table function and further using triggers, that approach has obvious performance downsides.

Our initial thought on implementing this was to simply add another reloption for views, just like the already existing `security_barrier`. With this in place, we can then conditionally evaluate in RelationBuildRuleLock() if we need to call setRuleCheckAsUser() or not. The new reloption has been named `security`, which is an enum currently only supporting a single value: `relation_permissions`.

The code for fetching the rules and triggers in RelationBuildDesc() had to be moved after the parsing of the reloptions, since with this change RelationBuildRuleLock() now depends upon having relation->rd_options available. The current behavior of views without that new reloption set is unaltered. This is implemented as such in patch 0001.

Regression tests are included for both the new reloption of CREATE VIEW and the row level security side of this too, contained in patch 0002. All regression tests are passing without errors.

Finally, patch 0003 updates the documentation for this new reloption.
A simplified example on how this feature can be used could look like this:

CREATE TABLE people (id int, name text, company text);
ALTER TABLE people ENABLE ROW LEVEL SECURITY;
INSERT INTO people VALUES (1, 'alice', 'foo'), (2, 'bob', 'bar');
CREATE VIEW customers_no_security AS SELECT * FROM people;
CREATE VIEW customers WITH (security=relation_permissions) AS SELECT * FROM people;

-- We want carol to only see people from company 'foo'
CREATE ROLE carol;
CREATE POLICY company_foo_only ON people FOR ALL TO carol USING (company = 'foo');
GRANT SELECT ON people TO carol;
GRANT SELECT ON customers_no_security TO carol;
GRANT SELECT ON customers TO carol;

Now using these tables as carol:

postgres=# SET ROLE carol;
SET

For the `people` table, the policy is applied as expected:

postgres=> SELECT * FROM people;
 id | name  | company
----+-------+---------
  1 | alice | foo
(1 row)

If we now use the view with the new relopt set, the policy is applied too:

postgres=> SELECT * FROM customers;
 id | name  | company
----+-------+---------
  1 | alice | foo
(1 row)

But without the `security=relation_permissions` relopt, carol gets to see data they should not be able to, due to the policy not being applied, since the rules are checked against the view owner:

postgres=> SELECT * FROM customers_no_security;
 id | name  | company
----+-------+---------
  1 | alice | foo
  2 | bob   | bar
(2 rows)

Excluding regression tests and documentation, the changes boil down to this:

 src/backend/access/common/reloptions.c    | 20
 src/backend/nodes/copyfuncs.c             |  1
 src/backend/nodes/equalfuncs.c            |  1
 src/backend/nodes/outfuncs.c              |  1
 src/backend/nodes/readfuncs.c             |  1
 src/backend/optimizer/plan/subselect.c    |  1
 src/backend/optimizer/prep/prepjointree.c |  1
 src/backend/rewrite/rewriteHandler.c      |  1
 src/backend/utils/cache/relcache.c        | 62
 src/include/nodes/parsenodes.h            |  3
 src/include/utils/rel.h                   | 21
 11 files changed, 84 insertions(+), 29 deletions(-)

All patches are against current master.
Thanks,
Christoph Heiss

From 2ed6b63adcebfff14965b8c9913ae0fafbe904a2 Mon Sep 17 00:00:00 2001
From: Christoph Heiss
Date: Fri, 17 Dec 2021 17:17:54 +0100
Subject: [PATCH 3/3] Add documentation for new 'security' reloption on views

---
 doc/src/sgml/ddl.sgml             |  4
 doc/src/sgml/ref/alter_view.sgml  |  9 +
 doc/src/sgml/ref/create_view.sgml | 18 ++
 3 files changed, 31 insertions(+)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 64d9030652..760ea2f794 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2292,6 +2292,10 @@ GRANT SELECT (col1), UPDATE (col1) ON mytable TO miriam_rw;
    are not subject to row security.

+   For views, the policies are applied as being referenced through the
+   view owner by default, rather than the user referencing the view. To
+   apply row security policies as defined for the invoking user, the
+   security option can be set on views (see CREATE VIEW) to get the same
+   behavior.

    Row security policies can be specific to commands, or to roles, or
    to both. A policy can be specified to apply to ALL

diff --git a/doc/src/sgml/ref/alter_vie
Re: pg_upgrade should truncate/remove its logs before running
On Thu, Dec 16, 2021 at 12:23:08PM +0100, Daniel Gustafsson wrote:
> > On 16 Dec 2021, at 12:11, Peter Eisentraut wrote:
> >
> > Could we make it write just one log file? Is having multiple log files
> > better?
>
> Having individual .txt files from checks with additional information
> on how to handle the error are quite convenient when writing wrappers around
> pg_upgrade (speaking from experience of having written multiple pg_upgrade
> frontends). Parsing a single logfile is more work, and will break existing
> scripts.
>
> I'm in favor of a predictable by default logpath, with a parameter to
> override, as mentioned upthread.

I put this together in the simplest way, prefixing all the filenames with the configured path. Another option is to chdir() into the given path. But pg_upgrade takes (and requires) a bunch of other paths, like -d -D -b -B, and those are traditionally interpreted relative to the CWD. I could getcwd() and prefix all the -[dDbB] with that, but prefixing a handful of binary/data paths is hardly better than prefixing a handful of dump/logfile paths. I suppose that openat() isn't portable. I don't think it's worth prohibiting relative paths, so I can't think of any less-naive way to do this.

I didn't move the delete-old-cluster.sh, since that's intended to stay around even after a successful upgrade, as opposed to the other logs, which are typically removed at that point.

--
Justin

From 7b608a415645682a9853bb4fbfde54caa6ccc137 Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Sat, 11 Dec 2021 19:19:50 -0600
Subject: [PATCH] pg_upgrade: write logfiles and dumps in subdir..

The subdir must not exist when pg_upgrade is started. This avoids appending the logfiles which have pre-existing errors in them. And allows cleaning up after pg_upgrade more easily.
---
 src/bin/pg_upgrade/check.c      | 10 ++---
 src/bin/pg_upgrade/dump.c       |  8 ++--
 src/bin/pg_upgrade/exec.c       |  5 ++-
 src/bin/pg_upgrade/function.c   |  3 +-
 src/bin/pg_upgrade/option.c     | 21 +--
 src/bin/pg_upgrade/pg_upgrade.c | 65 +
 src/bin/pg_upgrade/pg_upgrade.h |  1 +
 src/bin/pg_upgrade/server.c     |  4 +-
 8 files changed, 77 insertions(+), 40 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 1b8ef79242..9bc4e106dd 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -552,7 +552,7 @@ create_script_for_old_cluster_deletion(char **deletion_script_file_name)
 	prep_status("Creating script to delete old cluster");

-	if ((script = fopen_priv(*deletion_script_file_name, "w")) == NULL)
+	if ((script = fopen_priv(*deletion_script_file_name, "w")) == NULL)	// OK ?
 		pg_fatal("could not open file \"%s\": %s\n",
 				 *deletion_script_file_name, strerror(errno));
@@ -784,7 +784,7 @@ check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster)
 		return;
 	}

-	snprintf(output_path, sizeof(output_path),
+	snprintf(output_path, sizeof(output_path), "%s/%s", log_opts.basedir,
 			 "contrib_isn_and_int8_pass_by_value.txt");

 	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
@@ -861,7 +861,7 @@ check_for_user_defined_postfix_ops(ClusterInfo *cluster)
 	prep_status("Checking for user-defined postfix operators");

-	snprintf(output_path, sizeof(output_path),
+	snprintf(output_path, sizeof(output_path), "%s/%s", log_opts.basedir,
 			 "postfix_ops.txt");

 	/* Find any user defined postfix operators */
@@ -960,7 +960,7 @@ check_for_tables_with_oids(ClusterInfo *cluster)
 	prep_status("Checking for tables WITH OIDS");

-	snprintf(output_path, sizeof(output_path),
+	snprintf(output_path, sizeof(output_path), "%s/%s", log_opts.basedir,
 			 "tables_with_oids.txt");

 	/* Find any tables declared WITH OIDS */
@@ -1215,7 +1215,7 @@ check_for_user_defined_encoding_conversions(ClusterInfo *cluster)
 	prep_status("Checking for user-defined encoding conversions");

-
snprintf(output_path, sizeof(output_path), + snprintf(output_path, sizeof(output_path), "%s/%s", log_opts.basedir, "encoding_conversions.txt"); /* Find any user defined encoding conversions */ diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c index 90060d0f8e..fe979e0df4 100644 --- a/src/bin/pg_upgrade/dump.c +++ b/src/bin/pg_upgrade/dump.c @@ -22,10 +22,10 @@ generate_old_dump(void) /* run new pg_dumpall binary for globals */ exec_prog(UTILITY_LOG_FILE, NULL, true, true, "\"%s/pg_dumpall\" %s --globals-only --quote-all-identifiers " - "--binary-upgrade %s -f %s", + "--binary-upgrade %s -f %s/%s", new_cluster.bindir, cluster_conn_opts(&old_cluster), log_opts.verbose ? "--verbose" : "", - GLOBALS_DUMP_FILE); + log_opts.basedir, GLOBALS_DUMP_FILE); check_ok(); prep_status("Creating dump of database schemas\n"); @@ -52,10 +52,10 @@ generate_old_dump(void) parallel_exec_prog(log_file_name, NULL, "\"%s/pg_dump\" %s --schema-only --quote
Re: Column Filtering in Logical Replication
On 2021-Dec-17, Rahila Syed wrote:

> > This means that we need to support changing the column list of a
> > table in a publication. I'm looking at implementing some form of
> > ALTER PUBLICATION for that.
>
> I think right now the patch contains support only for ALTER
> PUBLICATION .. ADD TABLE with column filters. In order to achieve
> changing the column lists of a published table, I think we can extend
> the ALTER TABLE .. SET TABLE syntax to support specification of column
> list.

Yeah, that's what I was thinking too.

> > So this whole thing can be reduced to just this:
> >
> >     if (att_map != NULL && !bms_is_member(att->attnum, att_map))
> >         continue;       /* that is, don't send this attribute */
>
> I agree the condition can be shortened now. The long if condition was
> included because initially the feature allowed specifying filters
> without replica identity columns (sent those columns internally without
> user having to specify).

Ah, true, I had forgotten that. Thanks.

> > 900 +    * the table is partitioned. Run a recursive query to iterate through all
> > 901 +    * the parents of the partition and retreive the record for the parent
> > 902 +    * that exists in pg_publication_rel.
> > 903 +    */
>
> The above comment in fetch_remote_table_info() can be changed as the
> recursive query is no longer used.

Oh, of course.

I'll finish some loose ends and submit a v10, but it's still not final.

-- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"Right now the sectors on the hard disk run clockwise, but I heard a
rumor that you can squeeze 0.2% more throughput by running them
counterclockwise. It's worth the effort. Recommended." (Gerry Pourwelle)
\d with triggers: more than one row returned by a subquery used as an expression
I want to mention that the 2nd problem I mentioned here is still broken.
https://www.postgresql.org/message-id/20210717010259.gu20...@telsasoft.com
It happens if non-inherited triggers on child and parent have the same name.

On Fri, Jul 16, 2021 at 08:02:59PM -0500, Justin Pryzby wrote:
> On Fri, Jul 16, 2021 at 06:01:12PM -0400, Alvaro Herrera wrote:
> > On 2021-Jul-16, Justin Pryzby wrote:
> > > CREATE TABLE p(i int) PARTITION BY RANGE(i);
> > > CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1)TO(2);
> > > CREATE FUNCTION foo() returns trigger LANGUAGE plpgsql AS $$begin end$$;
> > > CREATE TRIGGER x AFTER DELETE ON p1 EXECUTE FUNCTION foo();
> > > CREATE TRIGGER x AFTER DELETE ON p EXECUTE FUNCTION foo();
> >
> > Hmm, interesting -- those statement triggers are not cloned, so what is
> > going on here is just that the psql query to show them is tripping on
> > its shoelaces ... I'll try to find a fix.
> >
> > I *think* the problem is that the query matches triggers by name and
> > parent/child relationship; we're missing to ignore triggers by tgtype.
> > It's not great design that tgtype is a bitmask of unrelated flags ...
>
> I see it's the subquery Amit wrote and proposed here:
> https://www.postgresql.org/message-id/CA+HiwqEiMe0tCOoPOwjQrdH5fxnZccMR7oeW=f9fmgszjqb...@mail.gmail.com
>
> .. and I realize that I've accidentally succeeded in breaking what I first
> attempted to break 15 months ago:
>
> On Mon, Apr 20, 2020 at 02:57:40PM -0500, Justin Pryzby wrote:
> > I'm happy to see that this doesn't require a recursive cte, at least.
> > I was trying to think how to break it by returning multiple results or
> > results out of order, but I think that can't happen.
>
> If you assume that pg_partition_ancestors returns its results in order, I
> think you can fix it by adding LIMIT 1. Otherwise I think you need a
> recursive CTE, as I'd feared.
>
> Note also that I'd sent a patch to add newlines, to make psql -E look pretty.
> v6-0001-fixups-c33869cc3bfc42bce822251f2fa1a2a346f86cc5.patch
Re: Addition of --no-sync to pg_upgrade for test speedup
On Fri, Dec 17, 2021 at 10:21:04AM +0100, Peter Eisentraut wrote:
> On 16.12.21 07:50, Michael Paquier wrote:
> > As per $subject, avoiding the flush of the new cluster's data
> > directory shortens a bit the runtime of the test. In some of my slow
> > VMs, aka Windows, this shaves a couple of seconds even if the bulk of
> > the time is still spent on the main regression test suite.
> >
> > In pg_upgrade, we let the flush happen with initdb --sync-only, based
> > on the binary path of the new cluster, so I think that we are not
> > going to miss any test coverage by skipping that.
>
> I think that is reasonable.
>
> Maybe we could have some global option, like some environment variable,
> that enables the "sync" mode in all tests, so it's easy to test that once
> in a while. Not really a requirement for your patch, but an idea in case
> this is a concern.

Yes, I think it would be good to see all the places we might want to pass
the no-sync option.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
Re: Adding CI to our tree
Andrew Dunstan writes:
> Maybe I have missed it, but why are we using ccache here? That seems a
> bit pointless in an ephemeral instance.

I believe Munro's cfbot tooling is able to save and re-use ccache across
successive instantiations of a build instance. I've not looked at this
code, but if it can do that there'd be point to it.

regards, tom lane
RE: In-place persistence change of a relation
> Justin wrote:
> On Fri, Dec 17, 2021 at 09:10:30AM +, Jakub Wartak wrote:
> > As the thread didn't get a lot of traction, I've registered it into
> > current commitfest https://commitfest.postgresql.org/36/3461/ with You
> > as the author and in 'Ready for review' state.
>
> The patch is failing: [..]
> I think you ran "make check", but should run something like this:
> make check-world -j8 >check-world.log 2>&1 && echo Success

Hi Justin, I've repeated the check-world and it says to me all is ok
locally (also with --enable-cassert --enable-debug, at least on Amazon
Linux 2), and also installcheck on default params seems to be ok.

I don't seem to understand why the testfarm reports errors for tests like
"path, polygon, rowsecurity" e.g. on Linux/graviton2 and FreeBSD while not
on MacOS(?). Could someone point me to where to start looking/fixing?

-J.
Re: Adding CI to our tree
On 12/13/21 16:12, Andres Freund wrote:
> Hi,
>
> Attached is an updated version of the CI patches. An example of a test
> run with the attached version of this
> https://cirrus-ci.com/build/6501998521483264

Maybe I have missed it, but why are we using ccache here? That seems a
bit pointless in an ephemeral instance.

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Use generation context to speed up tuplesorts
Le vendredi 17 décembre 2021, 14:39:10 CET Tomas Vondra a écrit :
> I wasn't really suggesting to investigate those other allocators in this
> patch - it seems like a task requiring a pretty significant amount of
> work/time. My point was that we should make it reasonably easy to add
> tweaks for those other environments, if someone is interested enough to
> do the legwork.
>
> >> 2) In fact, I wonder if different glibc versions behave differently?
> >> Hopefully it's not changing that much, though. Ditto kernel versions,
> >> but the mmap/sbrk interface is likely more stable. We can test this.
> >
> > That could be tested, yes. As a matter of fact, a commit removing the
> > upper limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday
> > to glibc, which means we can service much bigger allocations without
> > mmap.
>
> Yeah, I noticed that commit too. Most systems stick to one glibc
> version, so it'll take time to reach most systems. Let's continue with
> just one glibc version and then maybe test other versions.

Yes, I also need to figure out how to detect we're using glibc, as I'm not
very familiar with configure.

> >> 3) If we bump the thresholds, won't that work against reusing the
> >> memory? I mean, if we free a whole block (from any allocator we have),
> >> glibc might return it to kernel, depending on mmap threshold value.
> >> It's not guaranteed, but increasing the malloc thresholds will make
> >> that even less likely. So we might just as well increase the minimum
> >> block size, with about the same effect, no?
> >
> > It is my understanding that malloc will try to compact memory by moving
> > it around. So the memory should actually be released to the kernel at
> > some point. In the meantime, malloc can reuse it for our next
> > invocation (which can be in a different memory context on our side).
> >
> > If we increase the minimum block size, this is memory we will actually
> > reserve, and it will not protect us against the ramping-up behaviour:
> > - the first allocation of a big block may be over mmap_threshold, and
> >   serviced by an expensive mmap
> > - when it's freed, the threshold is doubled
> > - the next invocation is serviced by an sbrk call
> > - freeing it will be above the trim threshold, and it will be returned
> >
> > After several "big" allocations, the thresholds will raise to their
> > maximum values (well, it used to; I need to check what happens with
> > that latest patch of glibc...)
> >
> > This will typically happen several times as malloc doubles the
> > threshold each time. This is probably the reason quadrupling the block
> > sizes was more effective.
>
> Hmmm, OK. Can we benchmark the case with large initial block size, at
> least for comparison?

The benchmark I called "fixed" was with a fixed block size of
ALLOCSET_DEFAULT_MAXSIZE (first proposed patch) and showed roughly the
same performance profile as the growing blocks + malloc tuning.

But if I understand correctly, you implemented the growing blocks logic
after concerns about wasting memory with a constant large block size. If
we tune malloc, that memory would not be wasted if we don't alloc it, just
not released as eagerly when it's allocated.

Or do you want a benchmark with an even bigger initial block size? With
the growing blocks patch with a large initial size? I can run either, I
just want to understand what is interesting to you.

One thing that would be interesting would be to trace the total amount of
memory allocated in the different cases. This is something I will need to
do anyway when I propose that patch.

Best regards,

-- Ronan Dunklau
Re: Use generation context to speed up tuplesorts
On 12/17/21 09:08, Ronan Dunklau wrote:
> Le jeudi 16 décembre 2021, 18:00:56 CET Tomas Vondra a écrit :
>> On 12/16/21 17:03, Ronan Dunklau wrote:
>>> Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit :
>>>> I will follow up with a benchmark of the test sorting a table with a
>>>> width varying from 1 to 32 columns.
>>>
>>> So please find attached another benchmark for that case.
>>>
>>> The 3 different patchsets tested are:
>>> - master
>>> - fixed (David's original patch)
>>> - adjust (Thomas growing blocks patch)
>>
>> Presumably Thomas is me, right?
>
> I'm really sorry for this typo... Please accept my apologies.

Nah, no apology needed. I was just wondering if I missed some patch from
Thomas Munro ...

>>> So it looks like tuning malloc for this would be very beneficial for
>>> any kind of allocation, and by doing so we reduce the problems seen
>>> with the growing blocks patch to next to nothing, while keeping the
>>> ability to not allocate too much memory from the get go.
>>
>> Thanks for running those tests and investigating the glibc behavior! I
>> find those results very interesting. My conclusion from this is that
>> the interaction between "our" allocator and the allocator in malloc
>> (e.g. glibc) can be problematic. Which makes benchmarking and
>> optimization somewhat tricky because code changes may trigger behavior
>> change in glibc (or whatever allocator backs malloc).
>>
>> I think it's worth exploring if we can tune this in a reasonable way,
>> but I have a couple concerns related to that:
>>
>> 1) I wonder how glibc-specific this is - I'd bet it applies to other
>> allocators (either on another OS or just a different allocator on
>> Linux) too. Tweaking glibc parameters won't affect those systems, of
>> course, but maybe we should allow tweaking those systems too ...
>
> I agree, finding their specific problems and seeing if we can work
> around them would be interesting. I suppose glibc's malloc is the most
> commonly used allocator in production, as it is the default for most
> Linux distributions.

I wasn't really suggesting to investigate those other allocators in this
patch - it seems like a task requiring a pretty significant amount of
work/time. My point was that we should make it reasonably easy to add
tweaks for those other environments, if someone is interested enough to
do the legwork.

>> 2) In fact, I wonder if different glibc versions behave differently?
>> Hopefully it's not changing that much, though. Ditto kernel versions,
>> but the mmap/sbrk interface is likely more stable. We can test this.
>
> That could be tested, yes. As a matter of fact, a commit removing the
> upper limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday
> to glibc, which means we can service much bigger allocations without
> mmap.

Yeah, I noticed that commit too. Most systems stick to one glibc version,
so it'll take time to reach most systems. Let's continue with just one
glibc version and then maybe test other versions.

>> 3) If we bump the thresholds, won't that work against reusing the
>> memory? I mean, if we free a whole block (from any allocator we have),
>> glibc might return it to kernel, depending on mmap threshold value.
>> It's not guaranteed, but increasing the malloc thresholds will make
>> that even less likely. So we might just as well increase the minimum
>> block size, with about the same effect, no?
>
> It is my understanding that malloc will try to compact memory by moving
> it around. So the memory should actually be released to the kernel at
> some point. In the meantime, malloc can reuse it for our next invocation
> (which can be in a different memory context on our side).
>
> If we increase the minimum block size, this is memory we will actually
> reserve, and it will not protect us against the ramping-up behaviour:
> - the first allocation of a big block may be over mmap_threshold, and
>   serviced by an expensive mmap
> - when it's freed, the threshold is doubled
> - the next invocation is serviced by an sbrk call
> - freeing it will be above the trim threshold, and it will be returned
>
> After several "big" allocations, the thresholds will raise to their
> maximum values (well, it used to; I need to check what happens with that
> latest patch of glibc...)
>
> This will typically happen several times as malloc doubles the threshold
> each time. This is probably the reason quadrupling the block sizes was
> more effective.

Hmmm, OK. Can we benchmark the case with large initial block size, at
least for comparison?

>>> I would like to try to implement some dynamic glibc malloc tuning, if
>>> that is something we don't reject on principle from the get go.
>>
>> +1 to that
>
> Ok, I'll work on a patch for this and submit a new thread.

Cool!

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise
PostgreSQL Company
Re: speed up verifying UTF-8
I plan to push v25 early next week, unless there are further comments. -- John Naylor EDB: http://www.enterprisedb.com
Re: In-place persistence change of a relation
On Fri, Dec 17, 2021 at 09:10:30AM +, Jakub Wartak wrote: > I'm especially worried if I didn't screw up something/forgot something > related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests > passed". > As the thread didn't get a lot of traction, I've registered it into current > commitfest https://commitfest.postgresql.org/36/3461/ with You as the author > and in 'Ready for review' state. > I think it behaves as almost finished one and apparently after reading all > those discussions that go back over 10years+ time span about this feature, > and lot of failed effort towards wal_level=noWAL I think it would be nice to > finally start getting some of that of it into the core. The patch is failing: http://cfbot.cputube.org/kyotaro-horiguchi.html https://api.cirrus-ci.com/v1/artifact/task/5564333871595520/regress_diffs/src/bin/pg_upgrade/tmp_check/regress/regression.diffs I think you ran "make check", but should run something like this: make check-world -j8 >check-world.log 2>&1 && echo Success -- Justin
Re: Column Filtering in Logical Replication
Hi,

Thank you for updating the patch. The regression tests and tap tests pass
with the v9 patch.

> After working on this a little bit more, I realized that this is a bad
> idea overall. It causes lots of complications and it's just not worth
> it. So I'm back at my original thought that we need to throw an ERROR
> at ALTER TABLE .. DROP COLUMN time if the column is part of a
> replication column filter, and suggest the user to remove the column
> from the filter first and reattempt the DROP COLUMN.
>
> This means that we need to support changing the column list of a table
> in a publication. I'm looking at implementing some form of ALTER
> PUBLICATION for that.

I think right now the patch contains support only for ALTER PUBLICATION ..
ADD TABLE with column filters. In order to achieve changing the column
lists of a published table, I think we can extend the ALTER TABLE .. SET
TABLE syntax to support specification of a column list.

> So this whole thing can be reduced to just this:
>
>     if (att_map != NULL && !bms_is_member(att->attnum, att_map))
>         continue;       /* that is, don't send this attribute */

I agree the condition can be shortened now. The long if condition was
included because initially the feature allowed specifying filters without
replica identity columns (sent those columns internally without the user
having to specify them).

> 900 +    * the table is partitioned. Run a recursive query to iterate through all
> 901 +    * the parents of the partition and retreive the record for the parent
> 902 +    * that exists in pg_publication_rel.
> 903 +    */

The above comment in fetch_remote_table_info() can be changed as the
recursive query is no longer used.

Thank you,
Rahila Syed
Re: row filtering for logical replication
On Fri, Dec 17, 2021 at 7:20 PM Ajin Cherian wrote:
>
> On Fri, Dec 17, 2021 at 5:46 PM Greg Nancarrow wrote:
>
> > So using the v47 patch-set, I still find that the UPDATE above results
> > in publication of an INSERT of (2,1), rather than an UPDATE of (1,1)
> > to (2,1). This is according to the 2nd UPDATE rule below, from patch
> > 0003.
> >
> > + * old-row (no match)  new-row (no match)  -> (drop change)
> > + * old-row (no match)  new-row (match)     -> INSERT
> > + * old-row (match)     new-row (no match)  -> DELETE
> > + * old-row (match)     new-row (match)     -> UPDATE
> >
> > This is because the old row (1,1) doesn't match the UPDATE filter
> > "(a>1)", but the new row (2,1) does.
> > This functionality doesn't seem right to me. I don't think it can be
> > assumed that (1,1) was never published (and thus requires an INSERT
> > rather than UPDATE) based on these checks, because in this example,
> > (1,1) was previously published via a different operation - INSERT (and
> > using a different filter too).
> > I think the fundamental problem here is that these UPDATE rules assume
> > that the old (current) row was previously UPDATEd (and published, or
> > not published, according to the filter applicable to UPDATE), but this
> > is not necessarily the case.
> > Or am I missing something?
>
> But it need not be correct in assuming that the old-row was part of a
> previous INSERT either (and published, or not published according to
> the filter applicable to an INSERT).
> For example, change the sequence of inserts and updates prior to the
> last update:
>
> truncate tbl1;
> insert into tbl1 values (1,5);  ==> not replicated since insert and ! (b < 2)
> update tbl1 set b = 1;          ==> not replicated since update and ! (a > 1)
> update tbl1 set a = 2;          ==> replicated and update converted to insert since (a > 1)
>
> In this case, the last update "update tbl1 set a = 2;" is updating a
> row that was previously updated and not inserted and not replicated to
> the subscriber.
> How does the replication logic differentiate between these two cases,
> and decide if the update was previously published or not?
> I think it's futile for the publisher side to try and figure out the
> history of published rows. In fact, if this level of logic is required
> then it is best implemented on the subscriber side, which then defeats
> the purpose of a publication filter.

I think it's a concern that, for such a basic example with only one row,
we get unpredictable (and even wrong) replication results depending upon
the order of operations.
Doesn't this problem result from allowing different WHERE clauses for
different pubactions for the same table?
My current thoughts are that this shouldn't be allowed, and also that
WHERE clauses for INSERTs should, like UPDATE and DELETE, be restricted to
using only columns covered by the replica identity or primary key.

Regards,
Greg Nancarrow
Fujitsu Australia
Re: Adding CI to our tree
On 13.12.21 22:12, Andres Freund wrote: Attached is an updated version of the CI patches. An example of a test run with the attached version of this https://cirrus-ci.com/build/6501998521483264 + only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*freebsd.*' I'm not in favor of this kind of thing. I don't understand how this is useful, other than perhaps for developing *this* patch. I don't think people would like having these tags in the mainline, and if it's for local use, then people can adjust the file locally. + CC="ccache cc" CFLAGS="-O0 -ggdb"' I don't think using -O0 is the right thing. It will miss some compiler warnings, and it will not thoroughly test the compiler. We should test using the configurations that are close to what users actually use. +- su postgres -c 'gmake -s -j3 && gmake -s -j3 -C contrib' Why doesn't this use make world (or world-bin, if you prefer). Why does this use -j3 if there are two CPUs configured? (Perhaps the number of CPUs should be put into a variable.) I don't like that the -s option is used. I would like to see what commands are executed. +cpu: 4 Why does the Linux job use 4 CPUs and the FreeBSD job 2? +- export I don't think that is portable to all shells. +- su postgres -c 'time script test.log gmake -s -j2 ${CHECK} ${CHECKFLAGS}' +su postgres -c '\ + ulimit -c unlimited; \ + make -s ${CHECK} ${CHECKFLAGS} -j8 \ + ' Not clear why these are so different. Don't we need the test.log file for Linux? Don't we need the ulimit call for FreeBSD? Why the -j8 option even though 4 CPUs have been configured? +brew install \ + ccache \ + coreutils \ + icu4c \ + krb5 \ + llvm \ + lz4 \ + make \ + openldap \ + openssl \ + python@3.10 \ + tcl-tk Curious why coreutils and make are installed? The system-supplied tools ought to work. +brew cleanup -s Seems unnecessary? 
+PKG_CONFIG_PATH="/usr/local/opt/krb5/lib/pkgconfig:$PKG_CONFIG_PATH" AFAICT, we don't use pkg-config for the krb5 package. +- gmake -s -j12 && gmake -s -j12 -C contrib These numbers should also be explained or configured somewhere. Possibly query the number of CPUs on the instance. +PROVE_FLAGS: -j10 Why only on Windows? +# Installation on windows currently only completely works from src\tools\msvc If that is so, let's fix that. But see that install.pl contains if (-e "src/tools/msvc/buildenv.pl") etc. it seems to want to be able to be invoked from the top level. +- cd src\tools\msvc && perl .\install.pl %CIRRUS_WORKING_DIR%\tmp_install Confusing mix of forward and backward slashes in the Windows section. I think forward slashes should work everywhere. + test_plcheck_script: +- perl src/tools/msvc/vcregress.pl plcheck etc. Couldn't we enhance vcregress.pl to take multiple arguments or take a "check-world" argument or something. Otherwise, this will be tedious to keep updated. + test_subscriptioncheck_script: +- perl src/tools/msvc/vcregress.pl taptest .\src\test\subscription\ This is even worse. I don't want to have to hand-register every new TAP test. + always: +gcc_warning_script: | + time ./configure \ +--cache gcc.cache CC="ccache gcc" \ +--enable-dtrace I don't know why we wouldn't need the full set of options here. It's not like optional code never has compiler warnings. + # cross-compile to windows + always: +mingw_cross_warning_script: | I would welcome a native mingw build with full options and test suite run. This cross-compiling stuff is of course interesting, but I'm not sure why it is being preferred over a full native run. --- /dev/null +++ b/.dockerignore @@ -0,0 +1,7 @@ +# Ignore everything, except ci/ I wonder whether this would interfere with other uses of docker. I suppose people might have their custom setups for building docker images from PostgreSQL sources. 
It seems weird that this file gets this prominence, saying that the canonical use of docker inside PostgreSQL sources is for Cirrus CI. It would be useful if the README explained the use of docker. As I mentioned before, it's not immediately clear why docker is used at all in this. The docker file for Windows contains a lot of hardcoded version numbers. This should at least be refactored a little bit so that it is clear which version numbers should be updated and how. Or better yet, avoid the need to constantly update version numbers. For example, the Python patch release changes every few weeks (e.g., 3.10.0 isn't current anymore). Also, the way OpenSSL is installed looks a bit fishy. Is this what people actually use in practice? How can we make it match actual practice better? +# So that tests using the "manually" started postgres on windows can use +# prepared statements +max_prepared_transactions
Re: Skipping logical replication transactions on subscriber side
On Fri, Dec 17, 2021 at 7:12 PM Amit Kapila wrote: > > On Fri, Dec 17, 2021 at 3:23 PM Peter Eisentraut > wrote: > > > > On 13.12.21 04:12, Greg Nancarrow wrote: > > > (ii) "Setting -1 means to reset the transaction ID" > > > > > > Shouldn't it be explained what resetting actually does and when it can > > > be, or is needed to be, done? Isn't it automatically reset? > > > I notice that negative values (other than -1) seem to be regarded as > > > valid - is that right? > > > Also, what happens if this option is set multiple times? Does it just > > > override and use the latest setting? (other option handling errors out > > > with errorConflictingDefElem()). > > > e.g. alter subscription sub skip (xid = 721, xid = 722); > > > > Let's not use magic numbers and instead use a syntax that is more > > explicit, like SKIP (xid = NONE) or RESET SKIP or something like that. > > > > +1 for using SKIP (xid = NONE) because otherwise first we need to > introduce RESET syntax for this command. Agreed. Thank you for the comment! Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: row filtering for logical replication
On Fri, Dec 17, 2021 at 1:50 PM Ajin Cherian wrote:
>
> On Fri, Dec 17, 2021 at 5:46 PM Greg Nancarrow wrote:
>
> > So using the v47 patch-set, I still find that the UPDATE above results
> > in publication of an INSERT of (2,1), rather than an UPDATE of (1,1)
> > to (2,1). This is according to the 2nd UPDATE rule below, from patch
> > 0003.
> >
> > + * old-row (no match)  new-row (no match)  -> (drop change)
> > + * old-row (no match)  new-row (match)     -> INSERT
> > + * old-row (match)     new-row (no match)  -> DELETE
> > + * old-row (match)     new-row (match)     -> UPDATE
> >
> > This is because the old row (1,1) doesn't match the UPDATE filter
> > "(a>1)", but the new row (2,1) does.
> > This functionality doesn't seem right to me. I don't think it can be
> > assumed that (1,1) was never published (and thus requires an INSERT
> > rather than UPDATE) based on these checks, because in this example,
> > (1,1) was previously published via a different operation - INSERT (and
> > using a different filter too).
> > I think the fundamental problem here is that these UPDATE rules assume
> > that the old (current) row was previously UPDATEd (and published, or
> > not published, according to the filter applicable to UPDATE), but this
> > is not necessarily the case.
> > Or am I missing something?
>
> But it need not be correct in assuming that the old-row was part of a
> previous INSERT either (and published, or not published according to
> the filter applicable to an INSERT).
> For example, change the sequence of inserts and updates prior to the
> last update:
>
> truncate tbl1;
> insert into tbl1 values (1,5);  ==> not replicated since insert and ! (b < 2)
> update tbl1 set b = 1;          ==> not replicated since update and ! (a > 1)
> update tbl1 set a = 2;          ==> replicated and update converted to insert since (a > 1)
>
> In this case, the last update "update tbl1 set a = 2;" is updating a
> row that was previously updated and not inserted and not replicated to
> the subscriber.
> How does the replication logic differentiate between these two cases,
> and decide if the update was previously published or not?
> I think it's futile for the publisher side to try and figure out the
> history of published rows.

I also think so. One more thing, even if we want we might not be able to
apply the insert filter as the corresponding values may not be logged.

-- With Regards, Amit Kapila.
Re: Use generation context to speed up tuplesorts
Le vendredi 17 décembre 2021, 09:08:06 CET Ronan Dunklau a écrit :
> It is my understanding that malloc will try to compact memory by moving
> it around. So the memory should actually be released to the kernel at
> some point. In the meantime, malloc can reuse it for our next invocation
> (which can be in a different memory context on our side).

I've been told off-list this comment wasn't clear: I meant that it
compacts *free* memory, consolidating it into larger blocks that will
eventually reach the threshold and be released.

-- Ronan Dunklau
Re: Skipping logical replication transactions on subscriber side
On Fri, Dec 17, 2021 at 3:23 PM Peter Eisentraut wrote: > > On 13.12.21 04:12, Greg Nancarrow wrote: > > (ii) "Setting -1 means to reset the transaction ID" > > > > Shouldn't it be explained what resetting actually does and when it can > > be, or is needed to be, done? Isn't it automatically reset? > > I notice that negative values (other than -1) seem to be regarded as > > valid - is that right? > > Also, what happens if this option is set multiple times? Does it just > > override and use the latest setting? (other option handling errors out > > with errorConflictingDefElem()). > > e.g. alter subscription sub skip (xid = 721, xid = 722); > > Let's not use magic numbers and instead use a syntax that is more > explicit, like SKIP (xid = NONE) or RESET SKIP or something like that. > +1 for using SKIP (xid = NONE) because otherwise first we need to introduce RESET syntax for this command. -- With Regards, Amit Kapila.
Re: row filtering for logical replication
On Fri, Dec 17, 2021 at 4:11 AM Peter Smith wrote:
>
> PSA the v47* patch set.
>

Few comments on v47-0002:
===
1. The handling to find the row filter for ancestors in RelationGetInvalidRowFilterCol seems complex. It seems you are accumulating non-partition relations as well in toprelid_in_pub. Can we simplify such that we find the ancestor only for 'pubviaroot' publications?

2. I think the name RelationGetInvalidRowFilterCol is confusing because the same function is also used to get publication actions. Can we name it GetRelationPublicationInfo() and pass a bool parameter to indicate whether the row filter info needs to be built? We can get the invalid row filter column as output from that function.

3.
+GetRelationPublicationActions(Relation relation)
{
..
+ if (!relation->rd_pubactions)
+ (void) RelationGetInvalidRowFilterCol(relation);
+
+ return memcpy(pubactions, relation->rd_pubactions,
+ sizeof(PublicationActions));
..
..
}

I think here we can reverse the check: if the actions are already set, just do the memcpy and return; otherwise, build the relation's publication actions info first.

4. invalid_rowfilter_column_walker
{
..
/*
 * If pubviaroot is true, we need to convert the column number of
 * parent to the column number of child relation first.
 */
if (context->pubviaroot)
{
    char *colname = get_attname(context->parentid, attnum, false);
    attnum = get_attnum(context->relid, colname);
}

Here, in the comments, you can tell why you need this conversion. Can we name this function rowfilter_column_walker()?

5.
+/* For invalid_rowfilter_column_walker.
*/
+typedef struct {
+ AttrNumber invalid_rfcolnum;  /* invalid column number */
+ Bitmapset *bms_replident;     /* bitset of replica identity col indexes */
+ bool pubviaroot;              /* true if we are validating the parent
+                                * relation's row filter */
+ Oid relid;                    /* relid of the relation */
+ Oid parentid;                 /* relid of the parent relation */
+} rf_context;

Normally, we declare structs at the beginning of the file, and for the formatting of struct declarations, see other nearby structs like RelIdCacheEnt.

6. Can we name IsRowFilterSimpleNode() as IsRowFilterSimpleExpr()?

--
With Regards,
Amit Kapila.
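As a side note for reviewers, the validation being discussed in points 1-5 — remapping a parent's attribute numbers to the child by column name when 'pubviaroot' is set, then flagging any row-filter column that is not part of the replica identity — can be sketched roughly as follows. This is hypothetical Python with invented names, a restatement of the idea rather than the actual C implementation:

```python
def invalid_rowfilter_column(filter_columns, replident_columns,
                             parent_attnames=None, child_attnames=None):
    """Return the first row-filter column not covered by the replica
    identity, or None if the filter is valid.

    filter_columns: attribute numbers referenced by the row filter.
    replident_columns: attribute numbers of the replica identity.
    When the publication is pubviaroot, filter_columns are the parent's
    attnums; translate each to the child's attnum via the column name,
    mirroring the get_attname()/get_attnum() dance in the walker.
    """
    for attnum in filter_columns:
        if parent_attnames is not None:
            # pubviaroot case: parent attnum -> column name -> child attnum
            colname = parent_attnames[attnum]
            attnum = child_attnames.index(colname)
        if attnum not in replident_columns:
            return attnum  # invalid: filter uses a non-RI column
    return None
```

For example, a filter on a column outside the replica identity is reported as invalid, and a parent/child attnum mismatch is resolved by name before the check.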
Re: psql format output
On 15.12.21 20:58, Florian Koch wrote: I realized that the output of "\df+ func_name" has a formatting problem when a lot of arguments are used. The field 'Arguments data types' gets very long and destroys the whole formatting in the console. The field 'Source code' is most of the time multi-line and I thought that the output for the field 'Arguments data types' could also be multiline with one line for each argument. That's a reasonable idea. I wonder if it would work in general. If someone had a C function (so no source code) with three arguments, they might be annoyed if it now displayed as three lines by default.
Re: Skipping logical replication transactions on subscriber side
On 13.12.21 04:12, Greg Nancarrow wrote: (ii) "Setting -1 means to reset the transaction ID" Shouldn't it be explained what resetting actually does and when it can be, or is needed to be, done? Isn't it automatically reset? I notice that negative values (other than -1) seem to be regarded as valid - is that right? Also, what happens if this option is set multiple times? Does it just override and use the latest setting? (other option handling errors out with errorConflictingDefElem()). e.g. alter subscription sub skip (xid = 721, xid = 722); Let's not use magic numbers and instead use a syntax that is more explicit, like SKIP (xid = NONE) or RESET SKIP or something like that.
RE: Column Filtering in Logical Replication
On Friday, December 17, 2021 1:55 AM Alvaro Herrera wrote:
> On 2021-Dec-16, houzj.f...@fujitsu.com wrote:
>
> > The patch ensures all columns of RT are in the column list when
> > CREATE/ALTER publication, but it seems it doesn't prevent users from
> > changing the replica identity or dropping the index used in the replica
> > identity. Do we also need to check those cases?
>
> Yes, we do. As it happens, I spent a couple of hours yesterday writing code for
> that, at least partially. I haven't yet checked what happens with cases like
> REPLICA NOTHING, or REPLICA INDEX and then dropping that index.
>
> My initial ideas were a bit wrong BTW: I thought we should check the
> combination of column lists in all publications (a bitwise-OR of column bitmaps,
> so to speak). But conceptually that's wrong: we need to check the column list
> of each publication individually instead. Otherwise, if you wanted to hide a
> column from some publication but that column was part of the replica identity,
> there'd be no way to identify the tuple in the replica. (Or, if the pgoutput code
> disobeys the column list and sends the replica identity even if it's not in the
> column list, then you'd be potentially publishing data that you wanted to
> hide.)

Thanks for the explanation.

Apart from ALTER REPLICA IDENTITY and DROP INDEX, I think there could be some other cases we need to handle for the replica identity check:

1) When adding a partitioned table with a column list to the publication, I think we need to check the RI of all its leaf partitions, because the RI on the partition is the one that actually takes effect.

2) ALTER TABLE ADD PRIMARY KEY; ALTER TABLE DROP CONSTRAINT "PRIMARY KEY"; If the replica identity is default, it will use the primary key, so we might also need to prevent users from adding or removing a primary key in this case.

Based on the above cases, the RI check seems like it could require a considerable amount of code.
So, how about we follow what we already did in CheckCmdReplicaIdentity(): we can put the RI check in that function, so that we cover all the cases and reduce the code changes. And if we are worried about the cost of doing the check for every UPDATE and DELETE, we can also save the result in the relcache. That's safe because every operation that changes the RI will invalidate the relcache. We are using this approach in the row filter patch to make sure all columns in the row filter expression are part of the RI.

Best regards,
Hou zj
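Alvaro's point above — that each publication's column list must be checked individually against the replica identity, rather than OR-ing the column bitmaps of all publications together — can be illustrated with a tiny sketch (hypothetical Python with invented names, purely to show why the union is not enough):

```python
def column_lists_cover_replident(publication_column_lists, replident_cols):
    """Return True only if *every* publication's column list contains all
    replica identity columns. Checking the union of the lists would
    wrongly accept a publication that hides an RI column, as long as some
    other publication still publishes it."""
    return all(replident_cols <= cols for cols in publication_column_lists)
```

With a replica identity of {"id"}, the pair of lists [{"id", "val"}, {"val"}] fails the check even though their union contains "id" — the second publication could not identify tuples on the subscriber.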
Re: Addition of --no-sync to pg_upgrade for test speedup
On 16.12.21 07:50, Michael Paquier wrote:

As per $subject, avoiding the flush of the new cluster's data directory shortens the runtime of the test a bit. In some of my slow VMs, aka Windows, this shaves a couple of seconds even if the bulk of the time is still spent on the main regression test suite. In pg_upgrade, we let the flush happen with initdb --sync-only, based on the binary path of the new cluster, so I think that we are not going to miss any test coverage by skipping that.

I think that is reasonable. Maybe we could have some global option, like some environment variable, that enables the "sync" mode in all tests, so it's easy to test that once in a while. Not really a requirement for your patch, but an idea in case this is a concern.
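The environment-variable idea could look something like the sketch below. The variable name PG_TEST_FORCE_SYNC is invented here for illustration, not an existing knob: tests would default to --no-sync for speed, but setting the variable would re-enable flushing everywhere so the fsync paths get exercised once in a while.

```shell
# Decide which sync-related flags a test harness passes to initdb.
choose_sync_opts() {
    if [ "${PG_TEST_FORCE_SYNC:-0}" = "1" ]; then
        echo ""            # occasionally exercise the real fsync paths
    else
        echo "--no-sync"   # default: fast test runs
    fi
}

echo "would run: initdb $(choose_sync_opts) -D tmp_check/data"
```

Running it with PG_TEST_FORCE_SYNC unset prints the --no-sync variant; exporting PG_TEST_FORCE_SYNC=1 drops the flag.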
Re: Transparent column encryption
On 17.12.21 01:41, Jacob Champion wrote: (And I think the client should be able to enforce encryption in the first place, before I distract you too much with other stuff.) Yes, this is a useful point that I have added to my notes.
RE: In-place persistence change of a relation
> Kyotaro wrote:
> > In this version, I got rid of the "CLEANUP FORK"s, and added a new
> > system "Smgr marks". The mark files have the name of the
> > corresponding fork file followed by ".u" (which means Uncommitted.).
> > An "Uncommitted"-marked main fork means the same as the CLEANUP2_FORKNUM,
> > and an uncommitted-marked init fork means the same as the CLEANUP_FORKNUM
> > in the previous version.
> >
> > I noticed that the previous version of the patch still leaves an
> > orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
> > before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
> > removed at rollback. In this version the responsibility to remove the
> > mark files is moved to SyncPostCheckpoint, where main fork files are
> > actually removed.
>
> For the record, I noticed that basebackup could be confused by the mark files
> but I haven't looked at that yet.
>

Good morning Kyotaro, the patch didn't apply cleanly (it's from March; some hunks were failing), so I've fixed it and the combined git format-patch is attached. It did conflict with the following:

b0483263dda - Add support for SET ACCESS METHOD in ALTER TABLE
7b565843a94 - Add call to object access hook at the end of table rewrite in ALTER TABLE
9ce346eabf3 - Report progress of startup operations that take a long time.
f10f0ae420 - Replace RelationOpenSmgr() with RelationGetSmgr().

I'm especially worried that I screwed up or forgot something related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests passed".
Basic demonstration of this patch (with wal_level=minimal):

create unlogged table t6 (id bigint, t text);
-- produces 110GB table, takes ~5mins
insert into t6 select nextval('s1'), repeat('A', 1000) from generate_series(1, 1);
alter table t6 set logged;

on baseline, SET LOGGED takes: ~7min10s
on patched, SET LOGGED takes: 25s

So basically one can - thanks to this patch - use one's application (performing classic INSERTs/UPDATEs/DELETEs, so without the need to rewrite it to use COPY), perform a literal batch upload, and then just switch the tables to LOGGED.

Some more intensive testing also looks good, assuming a table prepared to put pressure on WAL:

create unlogged table t_unlogged (id bigint, t text) partition by hash (id);
create unlogged table t_unlogged_h0 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 0);
[..]
create unlogged table t_unlogged_h3 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 3);

The workload would still be pretty heavy on LWLock/BufferContent, WALInsert and Lock/extend.

t_logged.sql = insert into t_logged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes generates ~1MB of WAL
t_unlogged.sql = insert into t_unlogged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes generates ~3kB of WAL

so using: pgbench -f .sql -T 30 -P 1 -c 32 -j 3 t

with synchronous_commit=on (default):
with t_logged.sql: tps = 229 (lat avg = 138ms)
with t_unlogged.sql: tps = 283 (lat avg = 112ms) # almost all on LWLock/WALWrite

with synchronous_commit=off:
with t_logged.sql: tps = 413 (lat avg = 77ms)
with t_unlogged.sql: tps = 782 (lat avg = 40ms)

Afterwards, switching the unlogged ~16GB partitions takes 5s per partition.

As the thread didn't get a lot of traction, I've registered it in the current commitfest https://commitfest.postgresql.org/36/3461/ with you as the author and in 'Ready for review' state.
I think it behaves as an almost-finished feature, and after reading all those discussions about it, which go back over a 10-year-plus time span, along with a lot of failed effort towards wal_level=noWAL, I think it would be nice to finally start getting some of it into the core.

-Jakub Wartak.

v7-0001-In-place-table-persistence-change-with-new-comman.patch
Re: [PoC] Delegating pg_ident to a third party
On 17.12.21 00:48, Jacob Champion wrote: WDYT? (My responses here will be slower than usual. Hope you all have a great end to the year!) Looks interesting. I wonder whether putting this into pg_ident.conf is sensible. I suspect people will want to eventually add more features around this, like automatically creating roles or role memberships, at which point pg_ident.conf doesn't seem appropriate anymore. Should we have a new file for this? Do you have any further ideas?
Re: row filtering for logical replication
On Fri, Dec 17, 2021 at 5:46 PM Greg Nancarrow wrote:
> So using the v47 patch-set, I still find that the UPDATE above results in
> publication of an INSERT of (2,1), rather than an UPDATE of (1,1) to (2,1).
> This is according to the 2nd UPDATE rule below, from patch 0003.
>
> + * old-row (no match)  new-row (no match)  -> (drop change)
> + * old-row (no match)  new row (match)     -> INSERT
> + * old-row (match)     new-row (no match)  -> DELETE
> + * old-row (match)     new row (match)     -> UPDATE
>
> This is because the old row (1,1) doesn't match the UPDATE filter "(a>1)",
> but the new row (2,1) does.
> This functionality doesn't seem right to me. I don't think it can be assumed
> that (1,1) was never published (and thus requires an INSERT rather than
> UPDATE) based on these checks, because in this example, (1,1) was previously
> published via a different operation - INSERT (and using a different filter
> too).
> I think the fundamental problem here is that these UPDATE rules assume that
> the old (current) row was previously UPDATEd (and published, or not
> published, according to the filter applicable to UPDATE), but this is not
> necessarily the case.
> Or am I missing something?

But it need not be correct in assuming that the old-row was part of a previous INSERT either (and published, or not published, according to the filter applicable to an INSERT). For example, change the sequence of inserts and updates prior to the last update:

truncate tbl1;
insert into tbl1 values (1,5); ==> not replicated since insert and ! (b < 2)
update tbl1 set b = 1; ==> not replicated since update and ! (a > 1)
update tbl1 set a = 2; ==> replicated and update converted to insert since (a > 1)

In this case, the last update "update tbl1 set a = 2;" is updating a row that was previously updated and not inserted and not replicated to the subscriber. How does the replication logic differentiate between these two cases, and decide if the update was previously published or not?
I think it's futile for the publisher side to try and figure out the history of published rows. In fact, if this level of logic is required then it is best implemented on the subscriber side, which then defeats the purpose of a publication filter. regards, Ajin Cherian Fujitsu Australia
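For reference, the four-case table from patch 0003 quoted above boils down to the following decision function (a plain Python restatement for illustration, not the patch's code). The dispute in this thread is precisely about whether "matches the filter" is a safe proxy for "exists on the subscriber":

```python
def transform_update(old_matches, new_matches):
    """Patch 0003's publish decision for an UPDATE, given whether the
    old and new row versions match the row filter expression."""
    if not old_matches and not new_matches:
        return None          # drop the change entirely
    if not old_matches and new_matches:
        return "INSERT"      # assume the row was never published before
    if old_matches and not new_matches:
        return "DELETE"      # assume the row currently exists downstream
    return "UPDATE"          # both versions pass the filter
```

In Greg's example, old row (1,1) fails "(a>1)" while new row (2,1) passes it, so the function returns "INSERT" — even though (1,1) may in fact already be on the subscriber, which is the objection raised above.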
Re: Use generation context to speed up tuplesorts
On Thursday, December 16, 2021, 18:00:56 CET, Tomas Vondra wrote:
> On 12/16/21 17:03, Ronan Dunklau wrote:
> > On Thursday, December 16, 2021, 11:56:15 CET, Ronan Dunklau wrote:
> >> I will follow up with a benchmark of the test sorting a table with a
> >> width varying from 1 to 32 columns.
> >
> > So please find attached another benchmark for that case.
> >
> > The 3 different patchsets tested are:
> > - master
> > - fixed (David's original patch)
> > - adjust (Thomas growing blocks patch)
>
> Presumably Thomas is me, right?

I'm really sorry for this typo... Please accept my apologies.

> > So it looks like tuning malloc for this would be very beneficial for any
> > kind of allocation, and by doing so we reduce the problems seen with the
> > growing blocks patch to next to nothing, while keeping the ability to not
> > allocate too much memory from the get go.
>
> Thanks for running those tests and investigating the glibc behavior! I
> find those results very interesting. My conclusion from this is that
> the interaction between "our" allocator and the allocator in
> malloc (e.g. glibc) can be problematic. Which makes benchmarking and
> optimization somewhat tricky because code changes may trigger behavior
> change in glibc (or whatever allocator backs malloc).
>
> I think it's worth exploring if we can tune this in a reasonable way,
> but I have a couple concerns related to that:
>
> 1) I wonder how glibc-specific this is - I'd bet it applies to other
> allocators (either on another OS or just a different allocator on Linux)
> too. Tweaking glibc parameters won't affect those systems, of course,
> but maybe we should allow tweaking those systems too ...

I agree, finding their specific problems and seeing if we can work around them would be interesting. I suppose glibc's malloc is the most commonly used allocator in production, as it is the default for most Linux distributions.

> 2) In fact, I wonder if different glibc versions behave differently?
> Hopefully it's not changing that much, though. Ditto kernel versions,
> but the mmap/sbrk interface is likely more stable. We can test this.

That could be tested, yes. As a matter of fact, a change removing the upper limit for MALLOC_MMAP_THRESHOLD was committed to glibc just yesterday, which means much bigger allocations can be serviced without mmap.

> 3) If we bump the thresholds, won't that work against reusing the
> memory? I mean, if we free a whole block (from any allocator we have),
> glibc might return it to the kernel, depending on the mmap threshold
> value. It's not guaranteed, but increasing the malloc thresholds will
> make that even less likely. So we might just as well increase the
> minimum block size, with about the same effect, no?

It is my understanding that malloc will try to compact memory by moving it around. So the memory should actually be released to the kernel at some point. In the meantime, malloc can reuse it for our next invocation (which can be in a different memory context on our side).

If we increase the minimum block size, this is memory we will actually reserve, and it will not protect us against the ramping-up behaviour:
- the first allocation of a big block may be over mmap_threshold, and is serviced by an expensive mmap
- when it's freed, the threshold is doubled
- the next invocation is serviced by an sbrk call
- freeing it will be above the trim threshold, and the memory will be returned to the kernel

After several "big" allocations, the thresholds will rise to their maximum values (well, they used to; I need to check what happens with that latest patch of glibc...). This will typically happen several times, as malloc doubles the threshold each time. This is probably the reason quadrupling the block sizes was more effective.

> > I would like to try to implement some dynamic glibc malloc tuning, if that
> > is something we don't reject on principle from the get go.
>
> +1 to that

Ok, I'll work on a patch for this and submit a new thread.

--
Ronan Dunklau
Re: pg_dump versus ancient server versions
On Fri, 22 Oct 2021 at 19:27, Robert Haas wrote:
>
> Another thing to think about in that regard: how likely is it that
> PostgreSQL 7.4 and PostgreSQL 15 both compile and run on the same
> operating system? I suspect the answer is "not very." I seem to recall
> Greg Stark trying to compile really old versions of PostgreSQL for a
> conference talk some years ago, and he got back to a point where it
> just became impossible to make work on modern toolchains even with a
> decent amount of hackery.

That was when I compared sorting performance over time. I was able to get Postgres to build back to the point where 64-bit architecture support was added. From Andrew Dunstan's comment later in this thread I'm guessing that was 7.2.

It looks like the earliest date on the graphs in the talk is 2002-11-27, which matches the 7.3 release date. I think building earlier versions would have been doable if I had built them in 32-bit mode.

--
greg