Re: pg_rewind WAL segments deletion pitfall

2022-08-29 Thread Kyotaro Horiguchi
Hello, Alex.

At Fri, 26 Aug 2022 10:57:25 +0200, Alexander Kukushkin  
wrote in 
> On Fri, 26 Aug 2022 at 10:04, Kyotaro Horiguchi 
> wrote:
> > What I don't still understand is why pg_rewind doesn't work for the
> > old primary in that case. When archive_mode=on, the old primary has
> > the complete set of WAL files counting both pg_wal and its archive. So
> > as the same to the previous repro, pg_rewind -c ought to work (but it
> > uses its own archive this time).  In that sense the proposed solution
> > is still not needed in this case.
> >
> 
> The pg_rewind finishes successfully. But as a result it removes some files
> from pg_wal that are required to perform recovery because they are missing
> on the new primary.

AFAICS pg_rewind doesn't. On the contrary, the -c option restores all
the segments after the last (common) checkpoint, and all of them are
left alone after pg_rewind finishes.  postgres itself removes the WAL
files after recovery: the after-promotion cleanup and checkpoints remove
the files on the previous timeline.

Before pg_rewind runs in the repro below, the old primary has the
following segments.

TLI1:  2 8 9 A B C D

Just after pg_rewind finishes, the old primary has the following
segments.

TLI1:  2 3   5 6 7
TLI2:  4(and 0002.history)

pg_rewind copied 1-2 to 1-3, plus 2-4 and the history file, from the new
primary, and restored 1-4 to 1-7 from the archive. After the rewind
finished, 1-4 and 1-8 to 1-D had been removed since the new primary
didn't have them.

Recovery starts from 1-3 and promotes at 0/4_00. postgres removes
1-5 to 1-7 by post-promotion cleanup and removes 1-2 to 1-4 by a
restartpoint. All of the segments are useless after the old primary
promotes.

When the old primary starts, it uses 1-3 and 2-4 for recovery and
fails to fetch 2-5 from the new primary.  But that is not an issue with
pg_rewind at all.

> > A bit harder situation comes after the server successfully rewound; if
> > the new primary goes so far that the old primary cannot connect. Even
> > in that case, you can copy-in the required WAL files or configure
> > restore_command of the old primary so that it finds required WAL files
> > there.
> >
> 
> Yes, we can do the backup of pg_wal before running pg_rewind, but it feels

So, if I understand you correctly, the issue you are complaining about
is not the WAL segments on the old timeline but those on the new
timeline, which have nothing to do with what pg_rewind does. As in the
pg_basebackup case, the missing segments need to be copied somehow from
the new primary, since the old primary never had the chance to have
them before.

> very ugly, because we will also have to clean this "backup" after a
> successful recovery.

What do you mean by the "backup" here? Concretely, which WAL segments do
you feel need to be removed, for example, in the repro case?  Or could
you show your issue with something like the repro below?

> It would be much better if pg_rewind didn't remove WAL files between the
> last common checkpoint and diverged LSN in the first place.

Thus I don't follow this..

regards.


(Fixed a bug and slightly modified)

# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'on'">> oldprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start

# the last common checkpoint
psql -p 5432 -c 'checkpoint'

# record approx. diverging WAL segment
start_wal=`psql -p 5433 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where name = 'wal_segment_size')::int);"`
psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'
pg_ctl -D newprim promote

# old primary loses diverging WAL segment
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint;'
psql -p 5433 -c 'checkpoint;'

# old primary cannot archive any more
echo "archive_command = 'false'">> oldprim/postgresql.conf
pg_ctl -D oldprim reload
pg_ctl -D oldprim stop

# rewind the old primary, using its own archive
# pg_rewind -D oldprim --source-server='port=5433' # should fail
echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
pg_rewind -D oldprim --source-server='port=5433' -c

# advance WAL on the new primary; new primary loses the launching WAL seg
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> 

Re: shadow variables - pg15 edition

2022-08-29 Thread Justin Pryzby
On Wed, Aug 24, 2022 at 10:47:31PM +1200, David Rowley wrote:
> I really think #2s should be done last. I'm not as comfortable with
> the renaming and we might want to discuss tactics on that. We could
> either opt to rename the shadowed or shadowing variable, or both.  If
> we rename the shadowing variable, then pending patches or forward
> patches could use the wrong variable.  If we rename the shadowed
> variable then it's not impossible that backpatching could go wrong
> where the new code intends to reference the outer variable using the
> newly named variable, but when that's backpatched it uses the variable
> with the same name in the inner scope.  Renaming both would make the
> problem more obvious.

The most *likely* outcome of renaming the *outer* variable is that
*every* cherry-pick involving that variable would fail to compile,
which is an *obvious* failure (good) but also kind of annoying if it
could've worked fine if it weren't renamed.  I think most of the renames
should be applied to the inner var, because it's of narrower scope, and
more likely to cause a conflict (good) rather than appearing to apply
cleanly but then misbehave.  But it seems reasonable to consider
renaming both if the inner scope is longer than a handful of lines.
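
To make that concrete, here is a made-up example (not code from the
tree) of the kind of shadow -Wshadow=compatible-local flags; assignments
inside the inner scope never reach the outer variable, which is exactly
the failure mode being discussed:

#include <stdio.h>

int
main(void)
{
	int			count = 0;		/* outer variable */

	for (int i = 0; i < 3; i++)
	{
		/* shadows the outer "count"; -Wshadow=compatible-local warns here */
		int			count = i * 2;

		printf("inner count: %d\n", count);
	}

	/* still 0: the loop only ever touched the inner variable */
	printf("outer count: %d\n", count);
	return 0;
}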

> Would you be able to write a patch for #4. I'll do #5 now. You could
> do a draft patch for #2 as well, but I think it should be committed
> last, if we decide it's a good move to make. It may be worth having
> the discussion about if we actually want to run
> -Wshadow=compatible-local as a standard build flag before we rename
> anything.

I'm afraid the discussion about default flags would distract from fixing
the individual warnings, which themselves preclude use of the flag by
individual developers or the buildfarm, even as a local setting.

It can't be enabled until *all* the shadows are gone, due to -Werror on
the buildfarm and cirrusci.  Unless perhaps we used -Wno-error=shadow.
I suppose we're only talking about enabling it for gcc?

The biggest benefit comes if we fix *all* the local shadow vars, since
that allows someone to make use of the option and thereby avoid future
such issues.  Enabling the option could conceivably avoid issues when
cherry-picking into a back branch - if an inner var is re-introduced
during conflict resolution, then a new warning would be issued, and
hopefully the developer would look more closely.

Would you check if any of these changes are good enough?

-- 
Justin
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5887166061a..8a06b73948d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6256,45 +6256,45 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
return multi;
}
 
	/*
	 * Do a more thorough second pass over the multi to figure out which
	 * member XIDs actually need to be kept.  Checking the precise status of
	 * individual members might even show that we don't need to keep anything.
	 */
	nnewmembers = 0;
	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
	has_lockers = false;
	update_xid = InvalidTransactionId;
	update_committed = false;
	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_RETURN_IS_MULTI */

	for (i = 0; i < nmembers; i++)
	{
		/*
		 * Determine whether to keep this member or ignore it.
		 */
		if (ISUPDATE_from_mxstatus(members[i].status))
		{
-			TransactionId xid = members[i].xid;
+			xid = members[i].xid;

			Assert(TransactionIdIsValid(xid));
			if (TransactionIdPrecedes(xid, relfrozenxid))
				ereport(ERROR,
						(errcode(ERRCODE_DATA_CORRUPTED),
						 errmsg_internal("found update xid %u from before relfrozenxid %u",
										 xid, relfrozenxid)));

			/*
			 * It's an update; should we keep it?  If the transaction is known
			 * aborted or crashed then it's okay to ignore it, otherwise not.
			 * Note that an updater older than cutoff_xid cannot possibly be
			 * committed, because HeapTupleSatisfiesVacuum would have returned
			 * HEAPTUPLE_DEAD and we would not be trying to freeze the tuple.
			 *
			 * As with all tuple visibility routines, it's critical to test
			 * TransactionIdIsInProgress before TransactionIdDidCommit,
			 * because of race conditions explained in detail in
			 * heapam_visibility.c.

Re: wal_sync_method=fsync_writethrough

2022-08-29 Thread Thomas Munro
On Tue, Aug 30, 2022 at 3:44 AM Magnus Hagander  wrote:
> On Fri, Aug 26, 2022 at 11:29 PM Thomas Munro  wrote:
> > Now it looks strange: we have both "fsync" and "fsync_writethrough"
> > doing exactly the same thing while vaguely implying otherwise, and the
> > contrast with other operating systems (if I divined that aspect
> > correctly) mostly doesn't apply.  How flush commands affect various
> > caches in modern storage stacks is also not really OS-specific AFAIK.
> >
> > (Obviously macOS is a different story...)
>
> Given that it does vary (because macOS is actually an OS :D), we might
> need to start from a matrix of exactly what happens in different
> states, and then try to map that to a set? I fully agree that if
> things actually behave the same, they should be called the same.

Thanks, I'll take that as a +1 for dropping the redundant level for
Windows.  (Of course it stays for macOS).

I like that our current levels are the literal names of standard
interfaces we call, since the rest is out of our hands.  I'm not sure
what you could actually *do* with the information that some OS doesn't
flush write caches, other than document it and suggest a remedy (e.g.
turn it off).  I would even prefer it if fsync_writethrough were
called F_FULLFSYNC, following that just-say-what-it-does-directly
philosophy, but that horse is already over the horizon.

> And it may also be that there is no longer a difference between
> direct-drive and RAID-with-battery-or-flash, which used to be the huge
> difference back then, where you had to tune for it. For many cases
> that has been negated by just not using that (and using NVME and
> possibly software raid instead), but there are certainly still people
> using such systems...

I believe modern systems are a lot better at negotiating the need for
flushes (i.e. for *volatile* caches).  In contrast, the FUA situation
(as used for FILE_FLAG_WRITE_THROUGH) seems like a multi-level
dumpster fire on anything but high-end gear, from what I've been able
to figure out so far, though I'm no expert.




Re: introduce bufmgr hooks

2022-08-29 Thread Kyotaro Horiguchi
At Mon, 29 Aug 2022 15:24:49 -0700, Nathan Bossart  
wrote in 
> I'd like to propose some new hooks for the buffer manager.  My primary goal
> is to allow users to create an additional caching mechanism between the
...
> The attached patch is a first attempt at introducing these hooks with
> acceptable names, placements, arguments, etc.
> 
> Thoughts?

smgr is an abstract interface originally intended to allow choosing one
implementation among several (though not dynamically). Even though the
patch intends to replace specific (but, in practice, most) uses of
smgrread/smgrwrite, it still sounds somewhat strange to me to add hooks
that replace smgr functions in that respect.  I'm not sure whether we
still regard smgr as just an interface, though..

As for the names, bufmgr_read_hook sounds as if it were called in
addition to the normal operation performed by smgrread, just after it
completes or just before. (planner_hook already doesn't sound that way
to me, though :p)  "bufmgr_alt_smgrread" works for me, but I'm not sure
it follows project policy.

I think that the INSTR_* section should enclose the hook call, as it is
still an I/O operation from the core's point of view.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




[PATCH v1] [doc] polish the comments of reloptions

2022-08-29 Thread Junwang Zhao
When adding an option, we have 5 choices (bool, integer, real, enum, string),
so the comments seem stale.

Some sentences are missing *at ShareUpdateExclusiveLock*; this patch
adds it to make those sentences complete.

One thing I'm not sure about is whether we should use *at
ShareUpdateExclusiveLock* or *with ShareUpdateExclusiveLock*; please take
a look.

 src/backend/access/common/reloptions.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 609329bb21..9e99868faa 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -42,9 +42,9 @@
  *
  * To add an option:
  *
- * (i) decide on a type (integer, real, bool, string), name, default value,
- * upper and lower bounds (if applicable); for strings, consider a validation
- * routine.
+ * (i) decide on a type (bool, integer, real, enum, string), name, default
+ * value, upper and lower bounds (if applicable); for strings, consider a
+ * validation routine.
  * (ii) add a record below (or use add_<type>_reloption).
  * (iii) add it to the appropriate options struct (perhaps StdRdOptions)
  * (iv) add it to the appropriate handling routine (perhaps
@@ -68,24 +68,24 @@
  * since they are only used by the AV procs and don't change anything
  * currently executing.
  *
- * Fillfactor can be set because it applies only to subsequent changes made to
- * data blocks, as documented in hio.c
+ * Fillfactor can be set at ShareUpdateExclusiveLock because it applies only to
+ * subsequent changes made to data blocks, as documented in hio.c
  *
  * n_distinct options can be set at ShareUpdateExclusiveLock because they
  * are only used during ANALYZE, which uses a ShareUpdateExclusiveLock,
  * so the ANALYZE will not be affected by in-flight changes. Changing those
  * values has no effect until the next ANALYZE, so no need for stronger lock.
  *
- * Planner-related parameters can be set with ShareUpdateExclusiveLock because
+ * Planner-related parameters can be set at ShareUpdateExclusiveLock because
  * they only affect planning and not the correctness of the execution. Plans
  * cannot be changed in mid-flight, so changes here could not easily result in
  * new improved plans in any case. So we allow existing queries to continue
  * and existing plans to survive, a small price to pay for allowing better
  * plans to be introduced concurrently without interfering with users.
  *
- * Setting parallel_workers is safe, since it acts the same as
- * max_parallel_workers_per_gather which is a USERSET parameter that doesn't
- * affect existing plans or queries.
+ * Setting parallel_workers at ShareUpdateExclusiveLock is safe, since it acts
+ * the same as max_parallel_workers_per_gather which is a USERSET parameter
+ * that doesn't affect existing plans or queries.
  *
  * vacuum_truncate can be set at ShareUpdateExclusiveLock because it
  * is only used during VACUUM, which uses a ShareUpdateExclusiveLock,
-- 
2.33.0


-- 
Regards
Junwang Zhao


0001-doc-polish-the-comments-of-reloptions.patch
Description: Binary data


Re: configure openldap crash warning

2022-08-29 Thread Andres Freund
Hi,

On 2022-08-29 23:18:23 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Why do we continue to link the backend to ldap when we find ldap_r, given 
> > that
> > we know that it can cause problems for extension libraries using libpq?
>
> Uh ... if we know that, it's news to me.

Isn't that what the configure warning Peter mentioned upthread is about?

# PGAC_LDAP_SAFE
# --
# PostgreSQL sometimes loads libldap_r and plain libldap into the same
# process.  Check for OpenLDAP versions known not to tolerate doing so; assume
# non-OpenLDAP implementations are safe.  The dblink test suite exercises the
# hazardous interaction directly.


The patch applied as a result of this thread dealt with a different version of
the problem, with -lldap_r picking up a different library version than -lldap.

Leaving that aside, it also doesn't seem like a great idea, for efficiency
reasons, to have two different copies of nearly the same library loaded,
not that it'll make a large difference...

Greetings,

Andres Freund




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
David Rowley  writes:
> I can revert df0f4feef, but would prefer just to get the green light
> for d5ee4db0e from those 32-bit arm animals before doing so.

I have a check-world pass on my RPI3 (Fedora 30 armv7l image).
PPC test still running, but I don't doubt it will pass; it's
finished contrib/test_decoding.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread David Rowley
On Tue, 30 Aug 2022 at 15:15, Tom Lane  wrote:
> AFAICS that could only happen if "double" has 8-byte alignment
> requirement but int64 does not.  I recall some discussion about
> that possibility a month or two back, but I think we concluded
> that we weren't going to support it.

ok

> I guess what I mostly don't like about df0f4feef is the hardwired "8"
> constants.  Yeah, it's hard to see how sizeof(uint64) isn't 8, but
> it's not very readable like this IMO.

Yeah, that was just down to lack of any SIZEOF_* macro to tell me
uint64 was 8 bytes.

I can revert df0f4feef, but would prefer just to get the green light
for d5ee4db0e from those 32-bit arm animals before doing so.

David




Re: Making Vars outer-join aware

2022-08-29 Thread Richard Guo
On Mon, Aug 29, 2022 at 2:30 PM Richard Guo  wrote:

>
> On Fri, Aug 19, 2022 at 2:45 AM Tom Lane  wrote:
>
>> Here's a rebase up to HEAD, mainly to get the cfbot back in sync
>> as to what's the live patch.
>
>
> Noticed another different behavior from previous. When we try to reduce
> JOIN_LEFT to JOIN_ANTI, we want to know if the join's own quals are
> strict for any Var that was forced null by higher qual levels. We do
> that by checking whether local_nonnullable_vars and forced_null_vars
> overlap. However, the same Var from local_nonnullable_vars and
> forced_null_vars may be labeled with different varnullingrels. If that
> is the case, currently we would fail to tell they actually overlap.
>

I wonder why this is not noticed by the regression tests. So I did some
searching, and it seems we do not have any test cases covering the
transformation we apply to reduce outer joins. I think maybe we should
add such cases to the regression tests.

Thanks
Richard


Re: configure openldap crash warning

2022-08-29 Thread Tom Lane
Andres Freund  writes:
> Why do we continue to link the backend to ldap when we find ldap_r, given that
> we know that it can cause problems for extension libraries using libpq?

Uh ... if we know that, it's news to me.

I think we might've avoided ldap_r for fear of pulling libpthread into
the backend; per recent discussion, it's not clear that avoiding that
is possible anyway.  But you didn't make a case for changing this.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
David Rowley  writes:
> On Tue, 30 Aug 2022 at 03:39, Tom Lane  wrote:
>> I'd suggest reverting df0f4feef as it seems to be
>> a red herring.

> I think it's useless providing that a 64-bit variable will always be
> aligned to 8 bytes on all of our supported 32-bit platforms as,
> without the padding, the uint64 hdrmask in MemoryChunk will always be
> aligned to 8 bytes meaning the memory following that will be aligned
> too.  If we have a platform where a uint64 isn't aligned to 8 bytes
> then we might need the padding.

It's not so much "8 bytes".  The question is this: is there any
platform on which uint64 has less than MAXALIGN alignment
requirement?  If it is maxaligned then the compiler will insert any
required padding automatically, so the patch accomplishes little.

AFAICS that could only happen if "double" has 8-byte alignment
requirement but int64 does not.  I recall some discussion about
that possibility a month or two back, but I think we concluded
that we weren't going to support it.

I guess what I mostly don't like about df0f4feef is the hardwired "8"
constants.  Yeah, it's hard to see how sizeof(uint64) isn't 8, but
it's not very readable like this IMO.
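
A throwaway C11 program (shown purely as an illustration, not code from
this thread) can probe that property directly on a given platform:

#include <stdio.h>
#include <stddef.h>

int
main(void)
{
	/* the case discussed above would show alignof(double) = 8 but a
	 * smaller alignment requirement for long long / int64 */
	printf("alignof(long long)   = %zu\n", _Alignof(long long));
	printf("alignof(double)      = %zu\n", _Alignof(double));
	printf("alignof(max_align_t) = %zu\n", _Alignof(max_align_t));
	return 0;
}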

regards, tom lane




Re: configure openldap crash warning

2022-08-29 Thread Andres Freund
Hi,

On 2022-05-06 14:00:43 -0400, Tom Lane wrote:
> I wrote:
> > Oh, I have a theory about this: I bet your Homebrew installation
> > has a recent OpenLDAP version that only supplies libldap not libldap_r.
> > In that case, configure will still find libldap_r available and will
> > bind libpq to it, and you get the observed result.  The configure
> > check is not sophisticated enough to realize that it's finding chunks
> > of two different OpenLDAP installations.
> 
> After thinking about this for awhile, it seems like the best solution
> is to make configure proceed like this:
> 
> 1. Find libldap.
> 2. Detect whether it's OpenLDAP 2.5 or newer.
> 3. If not, try to find libldap_r.
> 
> There are various ways we could perform step 2, but I think the most
> reliable is to try to link to some function that's present in 2.5
> but not before.  (In particular, this doesn't require any strong
> assumptions about whether the installation's header files match the
> library.)  After a quick dig in 2.4 and 2.5, it looks like
> ldap_verify_credentials() would serve.

Why do we continue to link the backend to ldap when we find ldap_r, given that
we know that it can cause problems for extension libraries using libpq? I did
a cursory search of the archives without finding an answer. ISTM that we'd be
more robust if we used your check from above to decide when to use ldap (so we
don't use an older ldap_r), and use ldap_r for both FE and BE otherwise.

Greetings,

Andres Freund




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread David Rowley
On Tue, 30 Aug 2022 at 03:39, Tom Lane  wrote:
> I'd suggest reverting df0f4feef as it seems to be
> a red herring.

I think it's useless providing that a 64-bit variable will always be
aligned to 8 bytes on all of our supported 32-bit platforms as,
without the padding, the uint64 hdrmask in MemoryChunk will always be
aligned to 8 bytes meaning the memory following that will be aligned
too.  If we have a platform where a uint64 isn't aligned to 8 bytes
then we might need the padding.

long long seems to align to 8 bytes on my 32-bit Raspberry Pi, giving
the struct a size of 16 bytes rather than 12.

drowley@raspberrypi:~ $ cat struct.c
#include <stdio.h>

typedef struct test
{
int a;
long long b;
} test;

int main(void)
{
printf("%d\n", sizeof(test));
return 0;
}
drowley@raspberrypi:~ $ gcc struct.c -o struct
drowley@raspberrypi:~ $ ./struct
16
drowley@raspberrypi:~ $ uname -m
armv7l

Is that the case for your 32-bit PPC too?

David




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-29 Thread John Naylor
On Tue, Aug 30, 2022 at 12:17 AM Nathan Bossart
 wrote:
> Thanks!  I've attached a follow-up patch with a couple of small
> suggestions.

Pushed, thanks!

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread David Rowley
On Tue, 30 Aug 2022 at 13:58, Tomas Vondra
 wrote:
>
> On 8/30/22 03:16, David Rowley wrote:
> > Any chance you could run make check-world on your 32-bit Raspberry PI?
> >
>
> Will do, but I think the sentinel fix should be

Thank you.  I think Tom is also running make check-world.  I now have
an old RPI3 setup with a 32-bit OS, which is currently busy compiling
Postgres. I'll kick off a make check-world run once that's done.

>if (!sentinel_ok(chunk, Slab_CHUNKHDRSZ + slab->chunkSize))

Agreed. I've changed the patch to do it that way.  Since the 32-bit
ARM animals are already broken and per what Tom mentioned in [1], I've
pushed the patch.

Thanks again for taking the time to look at this.

David

[1] https://www.postgresql.org/message-id/3455754.1661823...@sss.pgh.pa.us




Re: Postmaster self-deadlock due to PLT linkage resolution

2022-08-29 Thread Noah Misch
On Mon, Aug 29, 2022 at 03:43:55PM -0400, Tom Lane wrote:
> I'd originally intended to make this code "#ifdef __NetBSD__",
> but on looking into the FreeBSD sources I find much the same locking
> logic in their dynamic loader, and now I'm wondering if such behavior
> isn't pretty standard.

I doubt it's standard.  POSIX specifies select() to be async-signal-safe.
This NetBSD bug makes select() not be async-signal-safe.

> The added calls should have negligible cost,
> so it doesn't seem unreasonable to do them everywhere.

Agreed.  I would make the comment mention the NetBSD version that prompted
this, so we have a better chance of removing the workaround in a few decades.




Re: Support logical replication of global object commands

2022-08-29 Thread Zheng Li
> > I think a publication of ALL OBJECTS sounds intuitive. Does it mean we'll
> > publish all DDL commands, all commit and abort operations in every
> > database if there is such publication of ALL OBJECTS?
> >
>
> Actually, I intend something for global objects. But the main thing
> that is worrying me about this is that we don't have a clean way to
> untie global object replication from database-specific object
> replication.

I think ultimately we need a clean and efficient way to publish (and
subscribe to) any changes in all databases, preferably in one logical
replication slot.

--
Regards,
Zheng




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread David Rowley
On Tue, 30 Aug 2022 at 13:55, Tom Lane  wrote:
> I wonder if slab ought to artificially bump up such requests when
> MEMORY_CONTEXT_CHECKING is enabled, so there's room for a sentinel.
> I think it's okay for aset.c to not do that, because its power-of-2
> behavior means there usually is room for a sentinel; but slab's
> policy makes it much more likely that there won't be.

I think it's fairly likely that small allocations are a power of 2,
and I think most of our allocations are small, so I imagine that if we
didn't do that for aset.c, we'd miss out on most of the benefits.

David




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread David Rowley
On Tue, 30 Aug 2022 at 13:58, Tomas Vondra
 wrote:
> armv7l (32-bit rpi4)
>
> +WARNING:  chunkSize 216 fullChunkSize 232 header 16
> +WARNING:  chunkSize 64 fullChunkSize 80 header 16
>
> aarch64 (64-bit rpi4)
>
> +WARNING:  chunkSize 304 fullChunkSize 320 header 16
> +WARNING:  chunkSize 80 fullChunkSize 96 header 16
>
> So indeed, those are *perfect* matches and thus the sentinel_ok() never
> executed. So no failures until now. On x86-64 I get the same thing as on
> aarch64. I guess that explains why it never failed. Seems like a pretty
> amazing coincidence ...

hmm, I'm not so sure I agree that it's an amazing coincidence. Isn't
it quite likely that the chunksize being given to SlabContextCreate()
is the same as MAXALIGN(chunksize)? Isn't that all it would take?

David




RE: Data is copied twice when specifying both child and parent table in publication

2022-08-29 Thread houzj.f...@fujitsu.com
On Wednesday, August 10, 2022 7:45 AM Peter Smith  wrote:
> 
> Here are some more review comments for the HEAD_v8 patch:
> 
> ==
> 
> 1. Commit message
> 
> If there are two publications, one of them publish a parent table with
> (publish_via_partition_root = true) and another publish child table, 
> subscribing
> to both publications from one subscription results in two initial 
> replications. It
> should only be copied once.
> 
> ~
> 
> I took a 2nd look at that commit message and it seemed slightly backwards to
> me - e.g. don't you really mean for the 'publish_via_partition_root' parameter
> to be used when publishing the
> *child* table?

I'm not sure about this; I think we are trying to fix the bug that occurs
when 'publish_via_partition_root' is used when publishing the parent table.

For this case(via_root used when publishing parent):

CREATE PUBLICATION pub1 for TABLE parent with(publish_via_partition_root);
CREATE PUBLICATION pub2 for TABLE child;
CREATE SUBSCRIPTION sub connect xxx PUBLICATION pub1,pub2;

The expected behavior is that only the parent table is published and all the
changes are replicated using the parent table's identity. So we should do the
initial sync only once, for the parent table, but we currently do table sync
for both the parent and the child, which we think is a bug.

For another case you mentioned(via_root used when publishing child)

CREATE PUBLICATION pub1 for TABLE parent;
CREATE PUBLICATION pub2 for TABLE child with (publish_via_partition_root);
CREATE SUBSCRIPTION sub connect xxx PUBLICATION pub1,pub2;

The expected behavior is that only the child table is published and all the
changes are replicated using the child table's identity. We should do table
sync only for the child table, which is the same as the current behavior on
HEAD. So I think there is no bug in this case.

Best regards,
Hou zj



Re: replacing role-level NOINHERIT with a grant-level option

2022-08-29 Thread Robert Haas
On Mon, Aug 29, 2022 at 6:04 PM Nathan Bossart  wrote:
> On Mon, Aug 29, 2022 at 03:38:57PM -0400, Robert Haas wrote:
> > Here is a patch to rearrange the logic slightly and also add a test
> > case memorializing the intended behavior. Without this change, the
> > regression test included in the patch fails like this:
> >
> > ERROR:  no possible grantors
> >
> > ...which is never supposed to happen.
>
> The grantor column in the expected test output refers to "rhaas", so I
> imagine the test only passes on your machines.  Should we create a new
> grantor role for this test?

That's the second time I've made that exact mistake *on this thread*.

I'm super good at this, I promise.

I'll figure out a fix tomorrow when I'm less tired. Thanks for catching it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tomas Vondra
On 8/30/22 03:55, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 8/30/22 02:45, David Rowley wrote:
>>> I think the existing sentinel check looks wrong:
>>> if (!sentinel_ok(chunk, slab->chunkSize))
>>> shouldn't that be passing the pointer rather than the chunk?
> 
>> I agree the check in SlabCheck() looks wrong, as it's ignoring the chunk
>> header (unlike the other contexts).
> 
>> But yeah, I ran "make check-world" and it passed just fine, so my only
>> explanation is that the check never actually executes because there's no
>> space for the sentinel thanks to alignment, and the tweak you did breaks
>> that. Strange ...
> 
> A quick code-coverage check confirms that the sentinel_ok() line
> is not reached in core or test_decoding tests as of HEAD
> (on a 64-bit machine anyway).  So we just happen to be using
> only allocation requests that are already maxaligned.
> 
> I wonder if slab ought to artificially bump up such requests when
> MEMORY_CONTEXT_CHECKING is enabled, so there's room for a sentinel.
> I think it's okay for aset.c to not do that, because its power-of-2
> behavior means there usually is room for a sentinel; but slab's
> policy makes it much more likely that there won't be.
> 

+1 to that

For aset that's fine not just because of the power-of-2 behavior, but
because we use it for chunks of many different sizes - so at least some
of those will have a sentinel.

But Slab is used only for changes and txns in reorderbuffer, and it just
so happens that both structs are maxaligned on 32-bit and 64-bit machines
(rpi and x86-64). We're unlikely to use slab in many other places, the
structs don't change very often, and they'd probably grow to another
maxaligned size anyway. So it may be pretty rare to have a sentinel.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Add native windows on arm64 support

2022-08-29 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Aug 29, 2022 at 05:33:56PM -0700, Andres Freund wrote:
>> The weirdest part is that it only happens as part of the the pg_upgrade test.

> make check has just failed:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird=2022-08-30%2001%3A15%3A13

So it *is* probabilistic, which is pretty much what you'd expect
if ASLR triggers it.  That brings us no closer to understanding
what the mechanism is, though.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tomas Vondra
On 8/30/22 03:16, David Rowley wrote:
> On Tue, 30 Aug 2022 at 12:45, David Rowley  wrote:
>> I think the existing sentinel check looks wrong:
>>
>> if (!sentinel_ok(chunk, slab->chunkSize))
>>
>> shouldn't that be passing the pointer rather than the chunk?
> 
> Here's v2 of the slab-fix patch.
> 
> I've included the sentinel check fix.  This passes make check-world
> for me when do a 32-bit build on my x86_64 machine and adjust
> pg_config.h to set MAXIMUM_ALIGNOF to 8.
> 
> Any chance you could run make check-world on your 32-bit Raspberry PI?
> 

Will do, but I think the sentinel fix should be

   if (!sentinel_ok(chunk, Slab_CHUNKHDRSZ + slab->chunkSize))

which is what the other contexts do. However, considering check-world
passed even before the sentinel_ok fix, I'm a bit skeptical about that
proving anything.

FWIW I added a WARNING to SlabCheck before the condition guarding the
sentinel check, printing the (full) chunk size and header size, and this
is what I got in test_decoding (deduplicated):

armv7l (32-bit rpi4)

+WARNING:  chunkSize 216 fullChunkSize 232 header 16
+WARNING:  chunkSize 64 fullChunkSize 80 header 16

aarch64 (64-bit rpi4)

+WARNING:  chunkSize 304 fullChunkSize 320 header 16
+WARNING:  chunkSize 80 fullChunkSize 96 header 16

So indeed, those are *perfect* matches and thus the sentinel_ok() never
executed. So no failures until now. On x86-64 I get the same thing as on
aarch64. I guess that explains why it never failed. Seems like a pretty
amazing coincidence ...
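
A tiny standalone sketch (MAXALIGN reimplemented locally just for
illustration, using the chunkSize values from the warnings above) shows
why those request sizes leave no padding byte for a sentinel:

#include <stdio.h>
#include <stdint.h>

/* local stand-in for MAXALIGN with MAXIMUM_ALIGNOF = 8 */
#define MY_MAXALIGN(len)	(((uintptr_t) (len) + 7) & ~((uintptr_t) 7))

int
main(void)
{
	/* chunkSize values reported by the WARNINGs above */
	size_t		sizes[] = {64, 216, 80, 304};

	for (int i = 0; i < 4; i++)
	{
		size_t		padded = MY_MAXALIGN(sizes[i]);

		printf("chunkSize %zu -> MAXALIGN %zu, sentinel room: %s\n",
			   sizes[i], padded, padded > sizes[i] ? "yes" : "no");
	}
	return 0;
}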


> I'm also wondering if this should also be backpatched back to v10,
> providing the build farm likes it well enough on master.

I'd say the sentinel fix may need to be backpatched.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
Tomas Vondra  writes:
> On 8/30/22 02:45, David Rowley wrote:
>> I think the existing sentinel check looks wrong:
>> if (!sentinel_ok(chunk, slab->chunkSize))
>> shouldn't that be passing the pointer rather than the chunk?

> I agree the check in SlabCheck() looks wrong, as it's ignoring the chunk
> header (unlike the other contexts).

> But yeah, I ran "make check-world" and it passed just fine, so my only
> explanation is that the check never actually executes because there's no
> space for the sentinel thanks to alignment, and the tweak you did breaks
> that. Strange ...

A quick code-coverage check confirms that the sentinel_ok() line
is not reached in core or test_decoding tests as of HEAD
(on a 64-bit machine anyway).  So we just happen to be using
only allocation requests that are already maxaligned.

I wonder if slab ought to artificially bump up such requests when
MEMORY_CONTEXT_CHECKING is enabled, so there's room for a sentinel.
I think it's okay for aset.c to not do that, because its power-of-2
behavior means there usually is room for a sentinel; but slab's
policy makes it much more likely that there won't be.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
David Rowley  writes:
> Here's v2 of the slab-fix patch.

> I've included the sentinel check fix.  This passes make check-world
> for me when do a 32-bit build on my x86_64 machine and adjust
> pg_config.h to set MAXIMUM_ALIGNOF to 8.

> Any chance you could run make check-world on your 32-bit Raspberry PI?

I have clean core and test_decoding passes on both 32-bit ARM and
32-bit PPC.  It'll take awhile (couple hours) to finish a full
check-world, but I'd say that's good enough evidence to commit.

regards, tom lane




Re: [PATCH] Add native windows on arm64 support

2022-08-29 Thread Michael Paquier
On Mon, Aug 29, 2022 at 05:33:56PM -0700, Andres Freund wrote:
> The weirdest part is that it only happens as part of the pg_upgrade test.

make check has just failed:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird=2022-08-30%2001%3A15%3A13

> I just tested it in my test windows vm (win 10, vs 2019), with a build of
> libxml I had around (2.9.7), and the regression tests passed both "normally"
> and within pg_upgrade.

bowerbird is feeding from c:\\prog\\3p64\\include\\libxml2 and
c:\\prog\\3p64\\lib\\libxml2.lib.  I am not sure which version of
libxml this is, and the other MSVC animals don't use libxml, so it
is not possible to correlate that only to VS 2017, either.
--
Michael




Re: Bump MIN_WINNT to 0x0600 (Vista) as minimal runtime in 16~

2022-08-29 Thread Thomas Munro
On Tue, Aug 30, 2022 at 12:57 PM Michael Paquier  wrote:
> On Sat, Aug 27, 2022 at 02:35:25PM +0900, Michael Paquier wrote:
> > There may be a point in enforcing CreateProcess() if
> > CreateRestrictedToken() cannot be loaded, but that would be a security
> > issue if Windows goes crazy as we should always expect the function,
> > so this had better return an error.
>
> And applied as of b1ec7f4.

This reminds me of 24c3ce8f1, which replaced a dlopen()-ish thing with
a direct function call.  Can you just call all these functions
directly these days?




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tomas Vondra



On 8/30/22 02:45, David Rowley wrote:
> On Tue, 30 Aug 2022 at 03:39, Tomas Vondra
>  wrote:
>> The attached patch seems to fix the issue for me - at least it seems
>> like that. This probably will need to get backpatched, I guess. Maybe we
>> should add an assert to MemoryChunkGetPointer to check alignment?
> 
> Hi Tomas,
> 
> I just wanted to check with you if you ran the full make check-world
> with this patch?
> 
> I don't yet have a working ARM 32-bit environment to test, but on
> trying it with x86 32-bit and adjusting MAXIMUM_ALIGNOF to 8, I'm
> getting failures in test_decoding. Namely:
> 
> test twophase ... FAILED   51 ms
> test twophase_stream  ... FAILED   25 ms
> 
>  INSERT INTO test_prepared2 VALUES (5);
>  SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> +WARNING:  problem in slab Change: detected write past chunk end in
> block 0x58b8ecf0, chunk 0x58b8ed08
> +WARNING:  problem in slab Change: detected write past chunk end in
> block 0x58b8ecf0, chunk 0x58b8ed58
> +WARNING:  problem in slab Change: detected write past chunk end in
> block 0x58b8ecf0, chunk 0x58b8eda8
> +WARNING:  problem in slab Change: detected write past chunk end in
> block 0x58b8ecf0, chunk 0x58b8edf8
> +WARNING:  problem in slab Change: detected write past chunk end in
> block 0x58b8ecf0, chunk 0x58b8ee48
> 
> I think the existing sentinel check looks wrong:
> 
> if (!sentinel_ok(chunk, slab->chunkSize))
> 
> shouldn't that be passing the pointer rather than the chunk?
> 

I agree the check in SlabCheck() looks wrong, as it's ignoring the chunk
header (unlike the other contexts).

But yeah, I ran "make check-world" and it passed just fine, so my only
explanation is that the check never actually executes because there's no
space for the sentinel thanks to alignment, and the tweak you did breaks
that. Strange ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: effective_multixact_freeze_max_age issue

2022-08-29 Thread Peter Geoghegan
On Mon, Aug 29, 2022 at 2:20 AM Matthias van de Meent
 wrote:
> Apart from the message that this behaviour is changing, I'd prefer
> some more description in the commit message as to why this needs
> changing.

I usually only write a full commit message before posting a patch when
it's a full patch series, where it can be helpful to be very explicit
about how the parts fit together. The single line commit message is
just a placeholder -- I'll definitely write a better one before
commit.

> Then, on to the patch itself:
>
> > + * XXX We don't do push back oldestMxact here, which is not 
> > ideal
>
> Do you intend to commit this marker, or is this leftover from the
> development process?

Ordinarily I would never commit an XXX comment, and probably wouldn't
even leave one in early revisions of patches that I post to the list.
This is a special case, though -- it involves the "snapshot too old"
feature, which has many similar XXX/FIXME/TODO comments. I think I
might leave it like that when committing.

The background here is that the snapshot too old code still has lots
of problems -- there is a FIXME comment that gives an overview of this
in TransactionIdLimitedForOldSnapshots(). We're going to have to live
with the fact that that feature isn't in good shape for the
foreseeable future. I can only really work around it.

> > +if (*multiXactCutoff < FirstMultiXactId)
> [...]
> > +if (safeOldestMxact < FirstMultiXactId)
> [...]
> > +if (aggressiveMXIDCutoff < FirstMultiXactId)
>
> I prefer !TransactionId/MultiXactIdIsValid() over '< First
> [MultiXact/Transaction]Id', even though it is the same in
> functionality, because it clarifies the problem we're trying to solve.
> I understand that the use of < is pre-existing, but since we're
> touching this code shouldn't we try to get this new code up to current
> standards?

I agree in principle, but there are already 40+ other places that use
the same idiom in places like multixact.c. Perhaps you can propose a
patch to change all of them at once, together?
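
(As a standalone restatement, with the relevant multixact.h definitions
reproduced locally purely for illustration: for an unsigned MultiXactId
the two spellings test the same condition, so this really is only a
readability question.)

#include <stdio.h>
#include <stdint.h>

typedef uint32_t MultiXactId;	/* stand-in for the real typedef */

#define InvalidMultiXactId	((MultiXactId) 0)
#define FirstMultiXactId	((MultiXactId) 1)
#define MultiXactIdIsValid(multi)	((multi) != InvalidMultiXactId)

int
main(void)
{
	for (MultiXactId m = 0; m < 3; m++)
		printf("m=%u: (m < FirstMultiXactId)=%d  !MultiXactIdIsValid(m)=%d\n",
			   m, m < FirstMultiXactId, !MultiXactIdIsValid(m));
	return 0;
}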

-- 
Peter Geoghegan




Re: pg_stat_wal: tracking the compression effect

2022-08-29 Thread Ken Kato

On 2022-08-27 16:48, Bharath Rupireddy wrote:

On Fri, Aug 26, 2022 at 8:39 AM Kyotaro Horiguchi
 wrote:


At Fri, 26 Aug 2022 11:55:27 +0900 (JST), Kyotaro Horiguchi 
 wrote in

> At Thu, 25 Aug 2022 16:04:50 +0900, Ken Kato  wrote 
in
> > Accumulating the values, which indicates how much space is saved by
> > each compression (size before compression - size after compression),
> > and keep track of how many times compression has happened. So that one
> > can know how much space is saved on average.
>
> Honestly, I don't think its useful much.
> How about adding them to pg_waldump and pg_walinspect instead?
>
> # It further widens the output of pg_waldump, though..

Sorry, that was apparently too short.

I know you already see that in per-record output of pg_waldump, but
maybe we need the summary of saved bytes in "pg_waldump -b -z" output
and the corresponding output of pg_walinspect.


+1 for adding compression stats such as type and saved bytes to
pg_waldump and pg_walinspect given that the WAL records already have
the saved bytes info. Collecting them in the server via pg_stat_wal
will require some extra effort, for instance, every WAL record insert
requires that code to be executed. When users want to analyze the
compression efforts they can either use pg_walinspect or pg_waldump
and change the compression type if required.


Thank you for all the comments!

I will go with adding the compression stats in pg_waldump and 
pg_walinspect.


Regards,
--
Ken Kato
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread David Rowley
On Tue, 30 Aug 2022 at 12:45, David Rowley  wrote:
> I think the existing sentinel check looks wrong:
>
> if (!sentinel_ok(chunk, slab->chunkSize))
>
> shouldn't that be passing the pointer rather than the chunk?

Here's v2 of the slab-fix patch.

I've included the sentinel check fix.  This passes make check-world
for me when do a 32-bit build on my x86_64 machine and adjust
pg_config.h to set MAXIMUM_ALIGNOF to 8.

Any chance you could run make check-world on your 32-bit Raspberry PI?

I'm also wondering if this should also be backpatched back to v10,
providing the build farm likes it well enough on master.
diff --git a/src/backend/utils/mmgr/slab.c b/src/backend/utils/mmgr/slab.c
index ae1a735b8c..02f365ff80 100644
--- a/src/backend/utils/mmgr/slab.c
+++ b/src/backend/utils/mmgr/slab.c
@@ -58,6 +58,8 @@
 #include "utils/memutils_memorychunk.h"
 #include "utils/memutils_internal.h"
 
+#define Slab_BLOCKHDRSZ	MAXALIGN(sizeof(SlabBlock))
+
 /*
  * SlabContext is a specialized implementation of MemoryContext.
  */
@@ -102,10 +104,10 @@ typedef struct SlabBlock
 #define SlabChunkGetPointer(chk)   \
((void *)(((char *)(chk)) + sizeof(MemoryChunk)))
 #define SlabBlockGetChunk(slab, block, idx) \
-   ((MemoryChunk *) ((char *) (block) + sizeof(SlabBlock)  \
+	((MemoryChunk *) ((char *) (block) + Slab_BLOCKHDRSZ	\
+ (idx * slab->fullChunkSize)))
 #define SlabBlockStart(block)  \
-   ((char *) block + sizeof(SlabBlock))
+   ((char *) block + Slab_BLOCKHDRSZ)
 #define SlabChunkIndex(slab, block, chunk) \
(((char *) chunk - SlabBlockStart(block)) / slab->fullChunkSize)
 
@@ -146,12 +148,12 @@ SlabContextCreate(MemoryContext parent,
fullChunkSize = Slab_CHUNKHDRSZ + MAXALIGN(chunkSize);
 
/* Make sure the block can store at least one chunk. */
-   if (blockSize < fullChunkSize + sizeof(SlabBlock))
+   if (blockSize < fullChunkSize + Slab_BLOCKHDRSZ)
elog(ERROR, "block size %zu for slab is too small for %zu 
chunks",
 blockSize, chunkSize);
 
/* Compute maximum number of chunks per block */
-   chunksPerBlock = (blockSize - sizeof(SlabBlock)) / fullChunkSize;
+   chunksPerBlock = (blockSize - Slab_BLOCKHDRSZ) / fullChunkSize;
 
/* The freelist starts with 0, ends with chunksPerBlock. */
freelistSize = sizeof(dlist_head) * (chunksPerBlock + 1);
@@ -733,6 +735,7 @@ SlabCheck(MemoryContext context)
{
			MemoryChunk *chunk = SlabBlockGetChunk(slab, block, j);
			SlabBlock  *chunkblock = (SlabBlock *) MemoryChunkGetBlock(chunk);
+			void	   *pointer = MemoryChunkGetPointer(chunk);

			/*
			 * check the chunk's blockoffset correctly points back to
@@ -744,7 +747,7 @@ SlabCheck(MemoryContext context)
 
			/* there might be sentinel (thanks to alignment) */
			if (slab->chunkSize < (slab->fullChunkSize - Slab_CHUNKHDRSZ))
-				if (!sentinel_ok(chunk, slab->chunkSize))
+				if (!sentinel_ok(pointer, slab->chunkSize))
					elog(WARNING, "problem in slab %s: detected write past chunk end in block %p, chunk %p",
						 name, block, chunk);
}


Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread David Rowley
On Tue, 30 Aug 2022 at 12:22, Tomas Vondra
 wrote:
> I also suggested doing a similar check in MemoryChunkGetPointer, so that
> we catch the issue earlier - right after we allocate the chunk. Any
> opinion on that?

I think it's probably a good idea. However, I'm not yet sure if we can
keep it as a macro or if it would need to become a static inline
function to do that.

What I'd really have wished for is a macro like AssertPointersEqual()
that spat out the two pointer values. That would probably have saved
more time on this issue.

David




Re: Bump MIN_WINNT to 0x0600 (Vista) as minimal runtime in 16~

2022-08-29 Thread Michael Paquier
On Sat, Aug 27, 2022 at 02:35:25PM +0900, Michael Paquier wrote:
> There may be a point in enforcing CreateProcess() if
> CreateRestrictedToken() cannot be loaded, but that would be a security
> issue if Windows goes crazy as we should always expect the function,
> so this had better return an error.

And applied as of b1ec7f4.
--
Michael




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread David Rowley
On Tue, 30 Aug 2022 at 03:39, Tomas Vondra
 wrote:
> The attached patch seems to fix the issue for me - at least it seems
> like that. This probably will need to get backpatched, I guess. Maybe we
> should add an assert to MemoryChunkGetPointer to check alignment?

Hi Tomas,

I just wanted to check with you if you ran the full make check-world
with this patch?

I don't yet have a working ARM 32-bit environment to test, but on
trying it with x86 32-bit and adjusting MAXIMUM_ALIGNOF to 8, I'm
getting failures in test_decoding. Namely:

test twophase ... FAILED   51 ms
test twophase_stream  ... FAILED   25 ms

 INSERT INTO test_prepared2 VALUES (5);
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+WARNING:  problem in slab Change: detected write past chunk end in
block 0x58b8ecf0, chunk 0x58b8ed08
+WARNING:  problem in slab Change: detected write past chunk end in
block 0x58b8ecf0, chunk 0x58b8ed58
+WARNING:  problem in slab Change: detected write past chunk end in
block 0x58b8ecf0, chunk 0x58b8eda8
+WARNING:  problem in slab Change: detected write past chunk end in
block 0x58b8ecf0, chunk 0x58b8edf8
+WARNING:  problem in slab Change: detected write past chunk end in
block 0x58b8ecf0, chunk 0x58b8ee48

I think the existing sentinel check looks wrong:

if (!sentinel_ok(chunk, slab->chunkSize))

shouldn't that be passing the pointer rather than the chunk?

David




Re: [PATCH] Add native windows on arm64 support

2022-08-29 Thread Andres Freund
Hi,

On 2022-08-29 19:58:31 -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > This is amazing.  The issue has showed up a second time in a row in
> > bowerbird, as of:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird=2022-08-29%2013%3A30%3A32
> 
> That is fascinating.

> > I don't know what to think about ASLR that manipulates the comment in
> > this XML object under VS 2017 (perhaps a compiler issue?), but it
> > should be possible to go back to green simply by removing ""
> > from the input string.  Creating an extra output pattern here would be
> > very costly, as xml_2.out and xml.out have outputs for --with-libxml.
> 
> > Would people object if I do that for now?
> 
> Let's let it go for a few more runs.  I want to know whether it
> reproduces 100% or only sometimes.

The weirdest part is that it only happens as part of the pg_upgrade test.

I just tested it in my test windows vm (win 10, vs 2019), with a build of
libxml I had around (2.9.7), and the regression tests passed both "normally"
and within pg_upgrade.

Greetings,

Andres Freund




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tomas Vondra



On 8/29/22 23:57, Tom Lane wrote:
> David Rowley  writes:
>> I've read over the emails and glanced at Tomas' patch. I think that
>> seems good. I think I'd rather see us do that than pad the struct out
>> further as Tomas' method is more aligned to what we do in aset.c
>> (ALLOC_BLOCKHDRSZ) and generation.c (Generation_BLOCKHDRSZ).
> 
>> I can adjust Tomas' patch to #define Slab_BLOCKHDRSZ
> 
> WFM
> 

Same here.

I also suggested doing a similar check in MemoryChunkGetPointer, so that
we catch the issue earlier - right after we allocate the chunk. Any
opinion on that? With an assert only in GetMemoryChunkMethodID() we'd
notice the issue much later, when it may not be obvious if it's a memory
corruption or what. But maybe it's overkill.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Add native windows on arm64 support

2022-08-29 Thread Andres Freund
Hi,

On 2022-08-30 08:48:38 +0900, Michael Paquier wrote:
> On Mon, Aug 29, 2022 at 10:12:20AM +0900, Michael Paquier wrote:
> > I have noticed that yesterday, but cannot think much about it.  This
> > basically changes the position of "" for the first record,
> > leaving the second one untouched:
> > 
> > 
> > 
> > I am not used to xmltable(), but I wonder if there is something in one
> > of these support functions in xml.c that gets influenced by the
> > randomization.  That sounds a bit hairy as make check passed in
> > bowerbird, and I have noticed at least two other Windows hosts running
> > TAP that passed.  Or that's just something with libxml itself.
> 
> This is amazing.  The issue has showed up a second time in a row in
> bowerbird, as of:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird=2022-08-29%2013%3A30%3A32
> 
> I don't know what to think about ASLR that manipulates the comment in
> this XML object under VS 2017 (perhaps a compiler issue?), but it
> should be possible to go back to green simply by removing ""
> from the input string.  Creating an extra output pattern here would be
> very costly, as xml_2.out and xml.out have outputs for --with-libxml.
> 
> Would people object if I do that for now?

I do. Clearly something is wrong, what do we gain by averting our eyes? An
alternative output file strikes me as an equally bad idea - it's not like this
is a case where there are legitimately different outputs?

Bowerbird is the only animal testing libxml with visual studio afaics. And CI
doesn't have libxml either.  So we have no idea if it's the msvc version that
matters.

Andrew, what version of libxml is installed? Afaict we don't see that anywhere
with msvc.

Greetings,

Andres Freund




Re: [PATCH] Add native windows on arm64 support

2022-08-29 Thread Tom Lane
Michael Paquier  writes:
> This is amazing.  The issue has showed up a second time in a row in
> bowerbird, as of:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird=2022-08-29%2013%3A30%3A32

That is fascinating.

> I don't know what to think about ASLR that manipulates the comment in
> this XML object under VS 2017 (perhaps a compiler issue?), but it
> should be possible to go back to green simply by removing ""
> from the input string.  Creating an extra output pattern here would be
> very costly, as xml_2.out and xml.out have outputs for --with-libxml.

> Would people object if I do that for now?

Let's let it go for a few more runs.  I want to know whether it
reproduces 100% or only sometimes.

regards, tom lane




Re: [PATCH] Add native windows on arm64 support

2022-08-29 Thread Michael Paquier
On Mon, Aug 29, 2022 at 10:12:20AM +0900, Michael Paquier wrote:
> I have noticed that yesterday, but cannot think much about it.  This
> basically changes the position of "" for the first record,
> leaving the second one untouched:
> 
> 
> 
> I am not used to xmltable(), but I wonder if there is something in one
> of these support functions in xml.c that gets influenced by the
> randomization.  That sounds a bit hairy as make check passed in
> bowerbird, and I have noticed at least two other Windows hosts running
> TAP that passed.  Or that's just something with libxml itself.

This is amazing.  The issue has showed up a second time in a row in
bowerbird, as of:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird=2022-08-29%2013%3A30%3A32

I don't know what to think about ASLR that manipulates the comment in
this XML object under VS 2017 (perhaps a compiler issue?), but it
should be possible to go back to green simply by removing ""
from the input string.  Creating an extra output pattern here would be
very costly, as xml_2.out and xml.out have outputs for --with-libxml.

Would people object if I do that for now?
--
Michael




Re: identifying the backend that owns a temporary schema

2022-08-29 Thread Nathan Bossart
On Tue, Aug 23, 2022 at 10:29:05AM +0100, Greg Stark wrote:
> Having this function would be great (I admit I never responded because
> I never figured out if your suggestion was right or not:). But should
> it also be added to the pg_stat_activity view? Perhaps even just in
> the SQL view using the function.
> 
> Alternately should pg_stat_activity show the actual temp schema name
> instead of the id? I don't recall if it's visible outside the backend
> but if it is, could pg_stat_activity show whether the temp schema is
> actually attached or not?

I'm open to adding the backend ID or the temp schema name to
pg_stat_activity, but I wouldn't be surprised to learn that others aren't.
It'd be great to hear a few more opinions on the idea before I spend too
much time on the patches.  IMO we should still adjust the
pg_stat_get_backend_*() functions even if we do end up adjusting
pg_stat_activity.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: archive modules

2022-08-29 Thread Nathan Bossart
On Thu, Aug 25, 2022 at 01:06:00PM -0700, Nathan Bossart wrote:
> On Thu, Aug 25, 2022 at 03:29:41PM +0200, talk to ben wrote:
>> Here is a patch with the proposed wording.
> 
> Here is the same patch with a couple more links.

I would advise that you create a commitfest entry for your patch so that it
isn't forgotten.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: effective_multixact_freeze_max_age issue

2022-08-29 Thread Nathan Bossart
On Mon, Aug 29, 2022 at 10:25:50AM -0700, Peter Geoghegan wrote:
> On Sun, Aug 28, 2022 at 4:14 PM Nathan Bossart  
> wrote:
>> The idea seems sound to me, and IMO your patch simplifies things nicely,
>> which might be reason enough to proceed with it.
> 
> It is primarily a case of making things simpler. Why would it ever
> make sense to interpret age differently in the presence of a long
> running transaction, though only for the FreezeLimit/MultiXactCutoff
> cutoff calculation? And not for the closely related
> freeze_table_age/multixact_freeze_table_age calculation? It's hard to
> imagine that that was ever a deliberate choice.
> 
> vacuum_set_xid_limits() didn't contain the logic for determining if
> its caller's VACUUM should be an aggressive VACUUM until quite
> recently. Postgres 15 commit efa4a9462a put the logic for determining
> aggressiveness right next to the logic for determining FreezeLimit,
> which made the inconsistency much more noticeable. It is easy to
> believe that this was really just an oversight, all along.

Agreed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




introduce bufmgr hooks

2022-08-29 Thread Nathan Bossart
Hi hackers,

I'd like to propose some new hooks for the buffer manager.  My primary goal
is to allow users to create an additional caching mechanism between the
shared buffers and disk for evicted buffers.  For example, some EC2
instance classes have ephemeral disks that are physically attached to the
host machine that might be useful for such a cache.  Presumably there are
other uses (e.g., gathering more information about the buffer cache), but
this is the main use-case I have in mind.  I am proposing the following new
hooks:

 * bufmgr_read_hook: called in place of smgrread() in ReadBuffer_common().
   It is expected that such hooks would call smgrread() as necessary.

 * bufmgr_write_hook: called before smgrwrite() in FlushBuffer().  The hook
   indicates whether the buffer is being evicted.  Hook functions must
   gracefully handle concurrent hint bit updates to the page.

 * bufmgr_invalidate_hook: called within InvalidateBuffer().

The attached patch is a first attempt at introducing these hooks with
acceptable names, placements, arguments, etc.
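
For illustration, here is a minimal sketch of how an extension might install
the read hook from its _PG_init().  This assumes the hook variable/type added
by the attached patch (declared in storage/buf_internals.h there) and a hook
signature matching the call site in ReadBuffer_common(); the secondary-cache
lookup is a hypothetical placeholder, not part of the proposal.

#include "postgres.h"

#include "fmgr.h"
#include "storage/buf_internals.h"
#include "storage/smgr.h"

PG_MODULE_MAGIC;

static bufmgr_read_hook_type prev_bufmgr_read_hook = NULL;

/*
 * Hypothetical placeholder: return true if the block was found in the
 * extension's own cache (e.g. on an ephemeral local disk).
 */
static bool
secondary_cache_lookup(SMgrRelation smgr, ForkNumber forkNum,
					   BlockNumber blockNum, char *buffer)
{
	return false;
}

static void
my_bufmgr_read(SMgrRelation smgr, ForkNumber forkNum,
			   BlockNumber blockNum, char *buffer)
{
	if (secondary_cache_lookup(smgr, forkNum, blockNum, buffer))
		return;

	/* Fall back to any previously installed hook, or to plain smgrread(). */
	if (prev_bufmgr_read_hook)
		(*prev_bufmgr_read_hook) (smgr, forkNum, blockNum, buffer);
	else
		smgrread(smgr, forkNum, blockNum, buffer);
}

void
_PG_init(void)
{
	prev_bufmgr_read_hook = bufmgr_read_hook;
	bufmgr_read_hook = my_bufmgr_read;
}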

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From d1683d73df8930927a464555f256c8c91c7cf24e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Aug 2022 16:24:26 -0700
Subject: [PATCH v1 1/1] Introduce bufmgr hooks.

These hooks can be used for maintaining a secondary buffer cache
outside of the regular shared buffers.  In theory, there are many
other potential uses.
---
 src/backend/storage/buffer/bufmgr.c | 54 -
 src/include/storage/buf_internals.h | 14 
 2 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9c1bd508d3..365272d139 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -166,6 +166,15 @@ static bool IsForInput;
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
+/* hook for plugins to get control when reading in a page */
+bufmgr_read_hook_type bufmgr_read_hook = NULL;
+
+/* hook for plugins to get control when evicting a page */
+bufmgr_write_hook_type bufmgr_write_hook = NULL;
+
+/* hook for plugins to get control when invalidating a page */
+bufmgr_invalidate_hook_type bufmgr_invalidate_hook = NULL;
+
 /*
  * Backend-Private refcount management:
  *
@@ -482,7 +491,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
 			   BlockNumber blockNum,
 			   BufferAccessStrategy strategy,
 			   bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, bool for_eviction);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 	   ForkNumber forkNum,
 	   BlockNumber nForkBlock,
@@ -1015,17 +1024,22 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			instr_time	io_start,
 		io_time;
 
-			if (track_io_timing)
-INSTR_TIME_SET_CURRENT(io_start);
+			if (bufmgr_read_hook)
+(*bufmgr_read_hook) (smgr, forkNum, blockNum, (char *) bufBlock);
+			else
+			{
+if (track_io_timing)
+	INSTR_TIME_SET_CURRENT(io_start);
 
-			smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
 
-			if (track_io_timing)
-			{
-INSTR_TIME_SET_CURRENT(io_time);
-INSTR_TIME_SUBTRACT(io_time, io_start);
-pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
-INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+if (track_io_timing)
+{
+	INSTR_TIME_SET_CURRENT(io_time);
+	INSTR_TIME_SUBTRACT(io_time, io_start);
+	pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+	INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+}
 			}
 
 			/* check for garbage data */
@@ -1269,7 +1283,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		  smgr->smgr_rlocator.locator.dbOid,
 		  smgr->smgr_rlocator.locator.relNumber);
 
-FlushBuffer(buf, NULL);
+FlushBuffer(buf, NULL, true);
 LWLockRelease(BufferDescriptorGetContentLock(buf));
 
 ScheduleBufferTagForWriteback(,
@@ -1544,6 +1558,9 @@ retry:
 		goto retry;
 	}
 
+	if (bufmgr_invalidate_hook)
+		(*bufmgr_invalidate_hook) (buf);
+
 	/*
 	 * Clear out the buffer's tag and flags.  We must do this to ensure that
 	 * linear scans of the buffer array don't think the buffer is valid.
@@ -2573,7 +2590,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	PinBuffer_Locked(bufHdr);
 	LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
 
-	FlushBuffer(bufHdr, NULL);
+	FlushBuffer(bufHdr, NULL, false);
 
 	LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 
@@ -2822,7 +2839,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
  * as the second parameter.  If not, pass 

windows resource files, bugs and what do we actually want

2022-08-29 Thread Andres Freund
Hi,

A few things about the windows resource files we generate

1) For make based builds, all libraries that are built with MODULES rather
   than MODULES_big have the wrong "FILETYPE", because Makefile.win32 checks
   $(shlib), which is only set for MODULES_big.

   This used to be even more widely wrong until recently:

   commit 16a4a3d59cd5574fdc697ea16ef5692ce34c54d5
   Author: Peter Eisentraut 
   Date:   2020-01-15 10:15:06 +0100

   Remove libpq.rc, use win32ver.rc for libpq

   Afaict before that we only set it correctly for pgevent.


2) For make based builds, we only set InternalName and OriginalFilename when
   $(shlib) is set, but InternalName and OriginalFilename are required.

   https://docs.microsoft.com/en-us/windows/win32/menurc/versioninfo-resource


3) We don't add an icon to postgres ("This is a daemon process, which is why
   it is not labeled as an executable"), but we do add icons to several
   libraries, at least snowball, pgevent, libpq.

   We should probably just remove the icon from the libraries?


4) We include the date, excluding 0 for some mysterious reason, in the version
   number. This seems to unnecessarily contribute to making the build not
   reproducible. Hails from long ago:

   commit 9af932075098bd3c143993386288a634d518713c
   Author: Bruce Momjian 
   Date:   2004-12-19 02:16:31 +

   Add Win32 version stamps that increment each day for proper SYSTEM32
   DLL pginstaller installs.

5) We have a PGFILEDESC for (nearly?) every binary/library. They largely don't
   seem more useful descriptions than the binary's name. Why don't we just
   drop most of them and just set the description as something like
   "PostgreSQL $name (binary|library)"? I doubt anybody ever looks into these
   details except to perhaps check the version number or such.

Greetings,

Andres Freund




Re: replacing role-level NOINHERIT with a grant-level option

2022-08-29 Thread Nathan Bossart
On Mon, Aug 29, 2022 at 03:38:57PM -0400, Robert Haas wrote:
> Here is a patch to rearrange the logic slightly and also add a test
> case memorializing the intended behavior. Without this change, the
> regression test included in the patch fails like this:
> 
> ERROR:  no possible grantors
> 
> ...which is never supposed to happen.

The grantor column in the expected test output refers to "rhaas", so I
imagine the test only passes on your machines.  Should we create a new
grantor role for this test?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
David Rowley  writes:
> I've read over the emails and glanced at Tomas' patch. I think that
> seems good. I think I'd rather see us do that than pad the struct out
> further as Tomas' method is more aligned to what we do in aset.c
> (ALLOC_BLOCKHDRSZ) and generation.c (Generation_BLOCKHDRSZ).

> I can adjust Tomas' patch to #define Slab_BLOCKHDRSZ

WFM

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread David Rowley
On Tue, 30 Aug 2022 at 03:43, Tom Lane  wrote:
> I think adding a padding field to SlabBlock would be a less messy
> solution than your patch.

Thank you both for looking at this while I was sleeping.

I've read over the emails and glanced at Tomas' patch. I think that
seems good. I think I'd rather see us do that than pad the struct out
further as Tomas' method is more aligned to what we do in aset.c
(ALLOC_BLOCKHDRSZ) and generation.c (Generation_BLOCKHDRSZ).

I can adjust Tomas' patch to #define Slab_BLOCKHDRSZ

David




Re: SQL/JSON features for v15

2022-08-29 Thread Andrew Dunstan


On 2022-08-29 Mo 09:35, Jonathan S. Katz wrote:
> On 8/29/22 8:56 AM, Amit Langote wrote:
>> On Sat, Aug 27, 2022 at 5:11 AM Nikita Glukhov
>>  wrote:
>
>> I am not sure if it's OK to eval_const_expressions() on a Query
>> sub-expression during parse-analysis.  IIUC, it is only correct to
>> apply it after the rewriting phase.
>>
>>>     Maybe it would be better to simply remove DEFAULT ON EMPTY.
>>
>> So +1 to this for now.
>
> +1, if this simplifies the patch and makes it acceptable for v15
>
>>> It is possible to easily split this patch into several subpatches,
>>> I will do it if needed.
>>
>> That would be nice indeed.
>
> With RMT hat on, the RMT has its weekly meetings on Tuesdays. Based on
> the timing of the Beta 4 commit freeze[1] and how both
> including/reverting are nontrivial operations (e.g. we should ensure
> we're confident in both and that they pass through the buildfarm), we
> are going to have to make a decision on how to proceed at the next
> meeting.
>
> Can folks please chime in on what they think of the current patchset
> and if this is acceptable for v15?
>
>

I think at a pinch we could probably go with it, but it's a close call.
I think it deals with the most pressing issues that have been raised. If
people are still worried I think it would be trivial to add in calls
that error out if the DEFAULT clauses are used at all.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: logical decoding and replication of sequences

2022-08-29 Thread Thomas Munro
On Tue, Aug 30, 2022 at 6:04 AM Tomas Vondra
 wrote:
> On 8/29/22 12:21, Thomas Munro wrote:
> > 0001:  Instead of figuring out when to invalidate the cache, let's
> > just invalidate it before every read attempt.  It is only marked valid
> > after success (ie state->readLen > 0).  No need to worry about error
> > cases.
>
> Maybe I misunderstand how all this works, but won't this have a really
> bad performance impact. If not, why do we need the cache at all?

It's a bit confusing because there are several levels of "read".  The
cache remains valid as long as the caller of ReadPageInternal() keeps
asking for data that is in range (see early return after comment "/*
check whether we have all the requested data already */").  As soon as
the caller asks for something not in range, this patch marks the cache
invalid before calling the page_read() callback (= XLogPageRead()).
It is only marked valid again after that succeeds.  Here's a new
version with no code change, just a better commit message to try to
explain that more clearly.
From d02d851ab515d4a1f883f657bc43a9cae600a5d8 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 29 Aug 2022 19:56:27 +1200
Subject: [PATCH v2 1/3] Fix cache invalidation in rare recovery_prefetch
 cases.

XLogPageRead() can retry internally in rare cases after a read has
succeeded, overwriting the page buffer.  Namely, short reads, and page
validation failures while in standby mode (see commit 0668719801).  Due
to an oversight in commit 3f1ce973, these cases could leave stale data
in the internal cache of xlogreader.c without marking it invalid.  The
main defense against stale cached data was in the error handling path of
ReadPageInternal(), but that wasn't quite enough for errors handled
internally by XLogPageRead()'s retry loop.

Instead of trying to invalidate the cache in the relevant failure cases,
ReadPageInternal() now invalidates it before even calling the page_read
callback (ie XLogPageRead()).  It is only marked valid by setting
state->readLen etc after page_read() succeeds.  It remains valid until
the caller of ReadPageInternal() eventually asks for something out of
range and we have to go back to page_read() for more.

The removed line that was setting errormsg_deferred was redundant
because it's already set by report_invalid_record().

Back-patch to 15.

Reported-by: Noah Misch 
Reviewed-by:
Discussion: https://postgr.es/m/20220807003627.GA4168930%40rfd.leadboat.com
---
 src/backend/access/transam/xlogreader.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f17e80948d..e883ade607 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -987,6 +987,13 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		targetPageOff == state->segoff && reqLen <= state->readLen)
 		return state->readLen;
 
+	/*
+	 * Invalidate contents of internal buffer before read attempt.  Just set
+	 * the length to 0, rather than a full XLogReaderInvalReadState(), so we
+	 * don't forget the segment we last read.
+	 */
+	state->readLen = 0;
+
 	/*
 	 * Data is not in our buffer.
 	 *
@@ -1067,11 +1074,6 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	if (state->errormsg_buf[0] != '\0')
-	{
-		state->errormsg_deferred = true;
-		XLogReaderInvalReadState(state);
-	}
 	return XLREAD_FAIL;
 }
 
-- 
2.30.2

From 6439f5ed5b9cb6524c307f06a43c9c54f51d3b86 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 29 Aug 2022 21:08:25 +1200
Subject: [PATCH v2 2/3] Improve xlogrecovery.c/xlogreader.c boundary.

Create a function XLogReaderResetError() to clobber the error message
inside an XLogReaderState, instead of doing it directly from
xlogrecovery.c.  This is older code from commit 0668719801, but commit
5dc0418f probably should have tidied this up because it leaves the
internal state a bit out of sync (not a live bug, but a little
confusing).
---
 src/backend/access/transam/xlogreader.c   | 10 ++
 src/backend/access/transam/xlogrecovery.c |  2 +-
 src/include/access/xlogreader.h   |  3 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e883ade607..c100e18333 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1325,6 +1325,16 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	return true;
 }
 
+/*
+ * Forget about an error produced by XLogReaderValidatePageHeader().
+ */
+void
+XLogReaderResetError(XLogReaderState *state)
+{
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
+}
+
 /*
  * Find the first record with an lsn >= RecPtr.
  *
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 

Re: New strategies for freezing, advancing relfrozenxid early

2022-08-29 Thread Peter Geoghegan
On Mon, Aug 29, 2022 at 11:47 AM Jeff Davis  wrote:
> Sounds like a good goal, and loosely follows the precedent of
> checkpoint targets and vacuum cost delays.

Right.

> Why is the threshold per-table? Imagine someone who has a bunch of 4GB
> partitions that add up to a huge amount of deferred freezing work.

I think it's possible that our cost model will eventually become very
sophisticated, and weigh all kinds of different factors, and work as
one component of a new framework that dynamically schedules autovacuum
workers. My main goal in posting this v1 was validating the *general
idea* of strategies with cost models, and the related question of how
we might use VM snapshots for that. After all, even the basic concept
is totally novel.

> The initial problem you described is a system-level problem, so it
> seems we should track the overall debt in the system in order to keep
> up.

I agree that the problem is fundamentally a system-level problem. One
reason why vacuum_freeze_strategy_threshold works at the table level
right now is to get the ball rolling. In any case the specifics of how
we trigger each strategy are far from settled. That's not the only
reason why we think about things at the table level in the patch set,
though.

There *are* some fundamental reasons why we need to care about
individual tables, rather than caring about unfrozen pages at the
system level *exclusively*. This is something that
vacuum_freeze_strategy_threshold kind of gets right already, despite
its limitations. There are 2 aspects of the design that seemingly have
to work at the whole table level:

1. Concentration matters when it comes to wraparound risk.

Fundamentally, each VACUUM still targets exactly one heap rel, and
advances relfrozenxid at most once per VACUUM operation. While the
total number of "unfrozen heap pages" across the whole database is the
single most important metric, it's not *everything*.

As a general rule, there is much less risk in having a certain fixed
number of unfrozen heap pages spread fairly evenly among several
larger tables, compared to the case where the same number of unfrozen
pages are all concentrated in one particular table -- right now it'll
often be one particular table that is far larger than any other table.
Right now the pain is generally felt with large tables only.

2. We need to think about things at the table level to manage costs
*over time* holistically. (Closely related to #1.)

The ebb and flow of VACUUM for one particular table is a big part of
the picture here -- and will be significantly affected by table size.
We can probably always afford to risk falling behind on
freezing/relfrozenxid (i.e. we should prefer laziness) if we know that
we'll almost certainly be able to catch up later when things don't
quite work out. That makes small tables much less trouble, even when
there are many more of them (at least up to a point).

As you know, my high level goal is to avoid ever having to make huge
balloon payments to catch up on freezing, which is a much bigger risk
with a large table -- this problem is mostly a per-table problem (both
now and in the future).

A large table will naturally require fewer, larger VACUUM operations
than a small table, no matter what approach is taken with the strategy
stuff. We therefore have fewer VACUUM operations in a given
week/month/year/whatever to spread out the burden -- there will
naturally be fewer opportunities. We want to create the impression
that each autovacuum does approximately the same amount of work (or at
least the same per new heap page for large append-only tables).

It also becomes much more important to only dirty each heap page
during vacuuming ~once with larger tables. With a smaller table, there
is a much higher chance that the pages we modify will already be dirty
from user queries.

> > for this table, at this time: Is it more important to advance
> > relfrozenxid early (be eager), or to skip all-visible pages instead
> > (be lazy)? If it's the former, then we must scan every single page
> > that isn't all-frozen according to the VM snapshot (including every
> > all-visible page).
>
> This feels too absolute, to me. If the goal is to freeze more
> incrementally, well in advance of wraparound limits, then why can't we
> just freeze 1000 out of 1 freezable pages on this run, and then
> leave the rest for a later run?

My remarks here applied only to the question of relfrozenxid
advancement -- not to freezing. Skipping strategy (relfrozenxid
advancement) is a distinct though related concept to freezing
strategy. So I was making a very narrow statement about
invariants/basic correctness rules -- I wasn't arguing against
alternative approaches to freezing beyond the 2 freezing strategies
(not to be confused with skipping strategies) that appear in v1.
That's all I meant -- there is definitely no point in scanning only a
subset of the table's all-visible pages, as far as relfrozenxid
advancement is concerned (and 

Postmaster self-deadlock due to PLT linkage resolution

2022-08-29 Thread Tom Lane
Buildfarm member mamba (NetBSD-current on prairiedog's former hardware)
has failed repeatedly since I set it up.  I have now run the cause of
that to ground [1], and here's what's happening: if the postmaster
receives a signal just before it first waits at the select() in
ServerLoop, it can self-deadlock.  During the postmaster's first use of
select(), the dynamic loader needs to resolve the PLT branch table entry
that the core executable uses to reach select() in libc.so, and it locks
the loader's internal data structures while doing that.  If we enter
a signal handler while the lock is held, and the handler needs to do
anything that also requires the lock, the postmaster is frozen.

The probability of this happening seems remarkably small, since there's
only one narrow window per postmaster lifetime, and there's just not
that many potential signal causes active at that time either.
Nonetheless I have traces showing it happening (1) because we receive
SIGCHLD for startup process termination and (2) because we receive
SIGUSR1 from the startup process telling us to start walreceivers.
I guess that mamba's slow single-CPU hardware interacts with the
NetBSD scheduler in just the right way to make it more probable than
you'd think.  On typical modern machines, the postmaster would almost
certainly manage to wait before the startup process is able to signal
it.  Still, "almost certainly" is not "certainly".

The attached patch seems to fix the problem, by forcing resolution of
the PLT link before we unblock signals.  It depends on the assumption
that another select() call appearing within postmaster.c will share
the same PLT link, which seems pretty safe.

I'd originally intended to make this code "#ifdef __NetBSD__",
but on looking into the FreeBSD sources I find much the same locking
logic in their dynamic loader, and now I'm wondering if such behavior
isn't pretty standard.  The added calls should have negligible cost,
so it doesn't seem unreasonable to do them everywhere.

(Of course, a much better answer is to get out of the business of
doing nontrivial stuff in signal handlers.  But even if we get that
done soon, we'd surely not back-patch it.)

Thoughts?

regards, tom lane

[1] https://gnats.netbsd.org/56979

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1664fcee2a..f9070da574 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1728,6 +1728,33 @@ ServerLoop(void)
 
 	nSockets = initMasks();
 
+	/*
+	 * Issue a dummy select() call before we unblock signals.  The point of
+	 * this is that select() may be reached via a PLT indirection, and that
+	 * indirection will need to be resolved on first use, and on some
+	 * platforms the dynamic loader's resolution code takes a lock.  If that
+	 * lock is taken while signals are unblocked then we risk self-deadlock
+	 * should a signal handler do anything that also needs the loader lock.
+	 * This is a low-probability failure, since it can only happen if a signal
+	 * arrives with just the right timing during the first pass through the
+	 * server loop.  However, it has been seen repeatedly on slow single-CPU
+	 * machines.
+	 *
+	 * Likewise, issue a dummy pg_usleep() to ensure we've resolved any PLT
+	 * indirections needed in the PM_WAIT_DEAD_END path.  (The PG_SETMASK
+	 * calls could also be at hazard, had we not issued one of those when
+	 * blocking signals to begin with.)
+	 */
+	{
+		struct timeval timeout0;
+
+		timeout0.tv_sec = 0;
+		timeout0.tv_usec = 0;
+		(void) select(0, NULL, NULL, NULL, &timeout0);
+		pg_usleep(0);
+	}
+
+	/* Main server loop begins here */
 	for (;;)
 	{
 		fd_set		rmask;
@@ -1750,6 +1777,7 @@ ServerLoop(void)
 		{
 			PG_SETMASK(&UnBlockSig);
 
+			/* Add nothing else in this unblocked-signals section */
 			pg_usleep(100000L); /* 100 msec seems reasonable */
 			selres = 0;
 
@@ -1765,6 +1793,7 @@ ServerLoop(void)
 
 			PG_SETMASK(&UnBlockSig);
 
+			/* Add nothing else in this unblocked-signals section */
 			selres = select(nSockets, &rmask, NULL, NULL, &timeout);
 
 			PG_SETMASK(&BlockSig);


Re: replacing role-level NOINHERIT with a grant-level option

2022-08-29 Thread Robert Haas
On Mon, Aug 29, 2022 at 10:17 AM Robert Haas  wrote:
> Good catch. Thanks for the review. Committed with that correction.

Argh, I found a bug, and one that I should have caught during testing,
too. I modelled the new function select_best_grantor() on
is_admin_of_role(), but it differs in that it calls
roles_is_member_of() with ROLERECURSE_PRIVS rather than
ROLECURSE_MEMBERS. Sadly, roles_is_member_of() handles
ROLERECURSE_PRIVS by completely ignoring non-inherited grants, which
is wrong, because then calls to select_best_grantor() treat a member
of a role with INHERIT FALSE, ADMIN TRUE as if they were not an admin
at all, which is incorrect.

Here is a patch to rearrange the logic slightly and also add a test
case memorializing the intended behavior. Without this change, the
regression test included in the patch fails like this:

ERROR:  no possible grantors

...which is never supposed to happen.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Fix-a-bug-in-roles_is_member_of.patch
Description: Binary data


Re: New strategies for freezing, advancing relfrozenxid early

2022-08-29 Thread Jeff Davis
On Thu, 2022-08-25 at 14:21 -0700, Peter Geoghegan wrote:
> The main high level goal of this work is to avoid painful, disruptive
> antiwraparound autovacuums (and other aggressive VACUUMs) that do way
> too much "catch up" freezing, all at once, causing significant
> disruption to production workloads.

Sounds like a good goal, and loosely follows the precedent of
checkpoint targets and vacuum cost delays.

> A new GUC/reloption called vacuum_freeze_strategy_threshold is added
> to control freezing strategy (also influences our choice of skipping
> strategy). It defaults to 4GB, so tables smaller than that cutoff
> (which are usually the majority of all tables) will continue to freeze
> in much the same way as today by default. Our current lazy approach to
> freezing makes sense there, and should be preserved for its own sake.

Why is the threshold per-table? Imagine someone who has a bunch of 4GB
partitions that add up to a huge amount of deferred freezing work.

The initial problem you described is a system-level problem, so it
seems we should track the overall debt in the system in order to keep
up.

> for this table, at this time: Is it more important to advance
> relfrozenxid early (be eager), or to skip all-visible pages instead
> (be lazy)? If it's the former, then we must scan every single page
> that isn't all-frozen according to the VM snapshot (including every
> all-visible page).

This feels too absolute, to me. If the goal is to freeze more
incrementally, well in advance of wraparound limits, then why can't we
just freeze 1000 out of 1 freezable pages on this run, and then
leave the rest for a later run?

> Thoughts?

What if we thought about this more like a "background freezer"? It
would keep track of the total number of unfrozen pages in the system,
and freeze them at some kind of controlled/adaptive rate.

Regular autovacuum's job would be to keep advancing relfrozenxid for
all tables and to do other cleanup, and the background freezer's job
would be to keep the absolute number of unfrozen pages under some
limit. Conceptually those two jobs seem different to me.

Also, regarding patch v1-0001-Add-page-level-freezing, do you think
that narrows the conceptual gap between an all-visible page and an all-
frozen page?

Regards,
Jeff Davis





Re: logical decoding and replication of sequences

2022-08-29 Thread Tomas Vondra



On 8/29/22 12:21, Thomas Munro wrote:
> On Mon, Aug 8, 2022 at 8:56 PM Kyotaro Horiguchi
>  wrote:
>> At Mon, 08 Aug 2022 17:33:22 +0900 (JST), Kyotaro Horiguchi 
>>  wrote in
>>> If WaitForWALToBecomeAvailable returned by promotion, ReadPageInternal
>>> misses the chance to invalidate reader-state.  That state is not an
>>> error while in StandbyMode.
>>
>> Mmm... Maybe I wanted to say:  (Still I'm not sure the rewrite works..)
>>
>> If WaitForWALToBecomeAvailable returned by promotion, ReadPageInternal
>> would miss the chance to invalidate reader-state.  When XLogPageRead
>> is called in blocking mode while in StandbyMode (that is, the
>> traditional condition), the function continues retrying until it
>> succeeds, or returns XLREAD_FAIL if promotion is triggered.  In other
>> words, it was not supposed to return non-failure while the header
>> validation is failing while in standby mode.  But while in nonblocking
>> mode, the function can return non-failure with lastSourceFailed =
>> true, which seems wrong.
> 
> New ideas:
> 
> 0001:  Instead of figuring out when to invalidate the cache, let's
> just invalidate it before every read attempt.  It is only marked valid
> after success (ie state->readLen > 0).  No need to worry about error
> cases.
> 

Maybe I misunderstand how all this works, but won't this have a really
bad performance impact. If not, why do we need the cache at all?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: effective_multixact_freeze_max_age issue

2022-08-29 Thread Peter Geoghegan
On Sun, Aug 28, 2022 at 4:14 PM Nathan Bossart  wrote:
> The idea seems sound to me, and IMO your patch simplifies things nicely,
> which might be reason enough to proceed with it.

It is primarily a case of making things simpler. Why would it ever
make sense to interpret age differently in the presence of a long
running transaction, though only for the FreezeLimit/MultiXactCutoff
cutoff calculation? And not for the closely related
freeze_table_age/multixact_freeze_table_age calculation? It's hard to
imagine that that was ever a deliberate choice.

vacuum_set_xid_limits() didn't contain the logic for determining if
its caller's VACUUM should be an aggressive VACUUM until quite
recently. Postgres 15 commit efa4a9462a put the logic for determining
aggressiveness right next to the logic for determining FreezeLimit,
which made the inconsistency much more noticeable. It is easy to
believe that this was really just an oversight, all along.

> However, I'm struggling
> to understand when this change would help much in practice.  IIUC it will
> cause vacuums to freeze a bit more, but outside of extreme cases (maybe
> when vacuum_freeze_min_age is set very high and there are long-running
> transactions), ISTM it might not have tremendously much impact.  Is the
> intent to create some sort of long-term behavior change for autovacuum, or
> is this mostly aimed towards consistency among the cutoff calculations?

I agree that this will have only a negligible impact on the majority
(perhaps even the vast majority) of applications. The primary
justification for this patch (simplification) seems sufficient, all
things considered. Even still, it's possible that it will help in
extreme cases. Cases with pathological performance issues,
particularly those involving MultiXacts.

We already set FreezeLimit to the most aggressive possible value of
OldestXmin when OldestXmin has itself already crossed a quasi
arbitrary XID-age threshold of autovacuum_freeze_max_age XIDs (i.e.
when OldestXmin < safeLimit), with analogous rules for
MultiXactCutoff/OldestMxact. Consequently, the way that we set the
cutoffs for freezing in pathological cases where (say) there is a
leaked replication slot will see a sharp discontinuity in how
FreezeLimit is set, within and across tables. And for what?

Initially, these pathological cases will end up using exactly the same
FreezeLimit for every VACUUM against every table (assuming that we're
using a system-wide min_freeze_age setting) -- every VACUUM operation
will use a FreezeLimit of `OldestXmin - vacuum_freeze_min_age`. At a
certain point that'll suddenly flip -- now every VACUUM operation will
use a FreezeLimit of `OldestXmin`. OTOH with the patch they'd all have
a FreezeLimit that is tied to the time that each VACUUM started --
which is exactly the FreezeLimit behavior that we'd get if there was
no leaked replication slot (at least until OldestXmin finally attains
an age of vacuum_freeze_min_age, when it finally becomes unavoidable,
even with the patch).
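
To make that discontinuity concrete, here is a simplified sketch of the
current rule just described -- plain unsigned arithmetic standing in for the
real wraparound-aware XID comparisons, not the actual vacuum_set_xid_limits()
code:

#include <stdint.h>

/* Illustration only: ignores XID wraparound and clamping details. */
static uint32_t
sketch_freeze_limit(uint32_t oldest_xmin, uint32_t next_xid,
                    uint32_t vacuum_freeze_min_age,
                    uint32_t autovacuum_freeze_max_age)
{
    uint32_t    safe_limit = next_xid - autovacuum_freeze_max_age;

    if (oldest_xmin < safe_limit)
        return oldest_xmin;     /* OldestXmin held back: snap FreezeLimit to
                                 * the most aggressive value */

    return oldest_xmin - vacuum_freeze_min_age;     /* normal case */
}

The sudden flip between the two return values is the discontinuity in
question.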

There is something to be said for preserving the "natural diversity"
of the relfrozenxid values among tables, too. The FreezeLimit we use
is (at least for now) almost always going to be very close to (if not
exactly) the same value as the NewFrozenXid value that we set
relfrozenxid to at the end of VACUUM (at least in larger tables).
Without the patch, a once-off problem with a leaked replication slot
can accidentally result in lasting problems where all of the largest
tables get their antiwraparound autovacuums at exactly the same time.
The current behavior increases the risk that we'll accidentally
"synchronize" the relfrozenxid values for large tables that had an
antiwraparound vacuum during the time when OldestXmin was held back.

Needlessly using the same FreezeLimit across each VACUUM operation
risks disrupting the natural ebb and flow of things. It's hard to say
how much of a problem that is in the real world. But why take any
chances?

--
Peter Geoghegan




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-29 Thread Nathan Bossart
On Mon, Aug 29, 2022 at 05:49:46PM +0700, John Naylor wrote:
> Bowerbird just reported the same error, so I went ahead and pushed a
> fix with this.

Thanks!  I've attached a follow-up patch with a couple of small
suggestions.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 0ff1549083..745890f77f 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,9 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
 static inline bool vector8_has_zero(const Vector8 v);
 static inline bool vector8_has_le(const Vector8 v, const uint8 c);
 static inline bool vector8_is_highbit_set(const Vector8 v);
+#ifndef USE_NO_SIMD
+static inline bool vector32_is_highbit_set(const Vector32 v);
+#endif
 
 /* arithmetic operations */
 static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
@@ -88,7 +91,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
 /*
  * comparisons between vectors
  *
- * Note: These return a vector rather than booloan, which is why we don't
+ * Note: These return a vector rather than boolean, which is why we don't
  * have non-SIMD implementations.
  */
 #ifndef USE_NO_SIMD


Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-08-29 Thread Robert Haas
On Mon, Aug 29, 2022 at 12:54 PM Robert Haas  wrote:
> On Wed, Aug 10, 2022 at 4:37 AM Kyotaro Horiguchi
>  wrote:
> > So, it seems that the *standby* received the inconsistent WAL stream
> > (aborted-contrecord not followed by a overwriting-missing-contrecord)
> > from the primary.  Thus the inconsistency happened on the primary, not
> > on the standby.
> >
> > So... I'm still failing to draw the big picture of what is happening
> > here.
>
> For the benefit of anyone who may be looking at this thread in the
> archive later, I believe this commit will have fixed this issue:

Err, this commit here, I mean:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6672d7913929b82ec723a54381773d9cdc20fe9d

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-08-29 Thread Robert Haas
On Wed, Aug 10, 2022 at 4:37 AM Kyotaro Horiguchi
 wrote:
> So, it seems that the *standby* received the inconsistent WAL stream
> (aborted-contrecord not followed by a overwriting-missing-contrecord)
> from the primary.  Thus the inconsistency happened on the primary, not
> on the standby.
>
> So... I'm still failing to draw the big picture of what is happening
> here.

For the benefit of anyone who may be looking at this thread in the
archive later, I believe this commit will have fixed this issue:


-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: standby promotion can create unreadable WAL

2022-08-29 Thread Robert Haas
On Mon, Aug 29, 2022 at 12:21 AM Kyotaro Horiguchi
 wrote:
> Mmm. That seems wrong. So forget about that.  The proposed patch looks
> fine to me.

Thanks for thinking it over. Committed and back-patched as far as v10,
since that's the oldest supported release.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tomas Vondra



On 8/29/22 17:43, Tom Lane wrote:
> Tomas Vondra  writes:
>> I suspect it's a pre-existing bug in Slab allocator, because it does this:
> 
>> #define SlabBlockGetChunk(slab, block, idx) \
>>  ((MemoryChunk *) ((char *) (block) + sizeof(SlabBlock)  \
>>  + (idx * slab->fullChunkSize)))
> 
>> and SlabBlock is only 20B, i.e. not a multiple of 8B. Which would mean
>> that even if we allocate block and size the chunks carefully (with all
>> the MAXALIGN things), we ultimately slice the block incorrectly.
> 
> Right, same conclusion I just came to.  But it's not a "pre-existing"
> bug, because sizeof(SlabBlock) *was* maxaligned until David added
> another field to it.
> 

Yeah, that's true. Still, there was an implicit expectation the size is
maxaligned, but it wasn't mentioned anywhere. I don't even recall if I
was aware of it when I wrote that code, or if I was just lucky.

> I think adding a padding field to SlabBlock would be a less messy
> solution than your patch.

Maybe, although I find it a bit annoying that we do MAXALIGN() for a
bunch of structs, and then in other places we add padding. Maybe not for
Slab, but e.g. for Generation. Maybe we should try doing the same thing
in all those places.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
Andres Freund  writes:
> On 2022-08-29 11:43:14 -0400, Tom Lane wrote:
>> I think adding a padding field to SlabBlock would be a less messy
>> solution than your patch.

> That just seems to invite the same problem happening again later and it's
> harder to ensure that the padding is correct across platforms.

Yeah, I just tried and failed to write a general padding computation
--- you can't use sizeof() in the #if, which makes it a lot more
fragile than I was expecting.  Tomas' way is probably the best.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Andres Freund
Hi,

On 2022-08-29 11:43:14 -0400, Tom Lane wrote:
> Tomas Vondra  writes:
> > I suspect it's a pre-existing bug in Slab allocator, because it does this:
> 
> > #define SlabBlockGetChunk(slab, block, idx) \
> > ((MemoryChunk *) ((char *) (block) + sizeof(SlabBlock)  \
> > + (idx * slab->fullChunkSize)))
> 
> > and SlabBlock is only 20B, i.e. not a multiple of 8B. Which would mean
> > that even if we allocate block and size the chunks carefully (with all
> > the MAXALIGN things), we ultimately slice the block incorrectly.
> 
> Right, same conclusion I just came to.  But it's not a "pre-existing"
> bug, because sizeof(SlabBlock) *was* maxaligned until David added
> another field to it.
> 
> I think adding a padding field to SlabBlock would be a less messy
> solution than your patch.

That just seems to invite the same problem happening again later and it's
harder to ensure that the padding is correct across platforms.

Greetings,

Andres Freund




Re: wal_sync_method=fsync_writethrough

2022-08-29 Thread Magnus Hagander
On Fri, Aug 26, 2022 at 11:29 PM Thomas Munro  wrote:
>
> On Sat, Aug 27, 2022 at 12:17 AM Magnus Hagander  wrote:
> > So, I don't know how it works now, but the history at least was this:
> > it was not about the disk caches, it was about raid controller caches.
> > Basically, we determined that windows didn't fsync it all the way. But
> > it would with  But if we changed wal_sync_method=fsync to actually
> > *do* that, then people who had paid big money for raid controllers
> > with flash or battery backed cache would lose a ton of performance. So
> > we needed one level that would sync out of the OS but not through the
> > RAID cache, and another one that would sync it out of the RAID cache
> > as well. Which would/could be different from the drive caches
> > themselves, and they often behaved differently. And I think it may
> > have even been dependent on the individual RAID drivers what the
> > default would  be.
>
> Thanks for the background.  Yeah, that makes sense to motivate
> open_datasync for Windows.  Not sure what you meant about fsync or
> meant to write after "would with".

That's a good question indeed :) I think I meant it would with
FILE_FLAG_WRITE_THROUGH.


> It seems like the 2005 discussions were primarily about open_datasync
> but also had the by-product of introducing the name
> fsync_writethrough.  If I'm reading between the lines[1] correctly,
> perhaps the logic went like this:
>
> 1.  We noticed that _commit() AKA FlushFileBuffers() issued
> SYNCHRONIZE CACHE (or equivalent) on Windows.
>
> 2.  At that time in history, Linux (and other Unixes) probably did not
> issue SYNCHRONIZE CACHE when you called fsync()/fdatasync().

I think it may have been driver dependent there (as well), at the time.


> 3.  We concluded therefore that Windows was strange and we needed to
> use a different level name for the setting to reflect this extra
> effect.

It was certainly strange to us :)


> Now it looks strange: we have both "fsync" and "fsync_writethrough"
> doing exactly the same thing while vaguely implying otherwise, and the
> contrast with other operating systems (if I divined that aspect
> correctly) mostly doesn't apply.  How flush commands affect various
> caches in modern storage stacks is also not really OS-specific AFAIK.
>
> (Obviously macOS is a different story...)

Given that it does vary (because macOS is actually an OS :D), we might
need to start from a matrix of exactly what happens in different
states, and then try to map that to a set? I fully agree that if
things actually behave the same, they should be called the same.

And it may also be that there is no longer a difference between
direct-drive and RAID-with-battery-or-flash, which used to be the huge
difference back then, where you had to tune for it. For many cases
that has been negated by just not using that (and using NVME and
possibly software raid instead), but there are certainly still people
using such systems...

//Magnus




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
Tomas Vondra  writes:
> I suspect it's a pre-existing bug in Slab allocator, because it does this:

> #define SlabBlockGetChunk(slab, block, idx) \
>   ((MemoryChunk *) ((char *) (block) + sizeof(SlabBlock)  \
>   + (idx * slab->fullChunkSize)))

> and SlabBlock is only 20B, i.e. not a multiple of 8B. Which would mean
> that even if we allocate block and size the chunks carefully (with all
> the MAXALIGN things), we ultimately slice the block incorrectly.

Right, same conclusion I just came to.  But it's not a "pre-existing"
bug, because sizeof(SlabBlock) *was* maxaligned until David added
another field to it.

I think adding a padding field to SlabBlock would be a less messy
solution than your patch.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tomas Vondra
On 8/29/22 17:27, Tomas Vondra wrote:
> ...
>
> I suspect it's a pre-existing bug in Slab allocator, because it does this:
> 
> #define SlabBlockGetChunk(slab, block, idx) \
>   ((MemoryChunk *) ((char *) (block) + sizeof(SlabBlock)  \
>   + (idx * slab->fullChunkSize)))
> 
> and SlabBlock is only 20B, i.e. not a multiple of 8B. Which would mean
> that even if we allocate block and size the chunks carefully (with all
> the MAXALIGN things), we ultimately slice the block incorrectly.
> 

The attached patch seems to fix the issue for me - at least it seems
like that. This probably will need to get backpatched, I guess. Maybe we
should add an assert to MemoryChunkGetPointer to check alignment?
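
Something along these lines, perhaps (a sketch only: it assumes the macro
currently just adds sizeof(MemoryChunk) to the chunk address, the same way
SlabChunkGetPointer() in the attached patch does, and uses AssertMacro() so
the check can live inside an expression):

/* Sketch: have MemoryChunkGetPointer() verify the pointer it hands back */
#define MemoryChunkGetPointer(c) \
	((void *) (AssertMacro((char *) MAXALIGN((char *) (c) + sizeof(MemoryChunk)) == \
						   (char *) (c) + sizeof(MemoryChunk)), \
			   (char *) (c) + sizeof(MemoryChunk)))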



> This would explain the 4B difference I reported before, I think. But I'm
> just as astonished we got this far in the tests - regular regression
> tests don't do much logical decoding, and we only use slab for changes,
> but I see the failure in 006 test in src/test/recovery, so the first
> five completed fine.
> 

I got confused - the first 5 tests in src/test/recovery don't do any
logical decoding, so it's not surprising it's the 006 that fails.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/utils/mmgr/slab.c b/src/backend/utils/mmgr/slab.c
index ae1a735b8c..67fbede836 100644
--- a/src/backend/utils/mmgr/slab.c
+++ b/src/backend/utils/mmgr/slab.c
@@ -102,10 +102,10 @@ typedef struct SlabBlock
 #define SlabChunkGetPointer(chk)	\
 	((void *)(((char *)(chk)) + sizeof(MemoryChunk)))
 #define SlabBlockGetChunk(slab, block, idx) \
-	((MemoryChunk *) ((char *) (block) + sizeof(SlabBlock)	\
+	((MemoryChunk *) ((char *) (block) + MAXALIGN(sizeof(SlabBlock))	\
 	+ (idx * slab->fullChunkSize)))
 #define SlabBlockStart(block)	\
-	((char *) block + sizeof(SlabBlock))
+	((char *) block + MAXALIGN(sizeof(SlabBlock)))
 #define SlabChunkIndex(slab, block, chunk)	\
 	(((char *) chunk - SlabBlockStart(block)) / slab->fullChunkSize)
 
@@ -146,12 +146,12 @@ SlabContextCreate(MemoryContext parent,
 	fullChunkSize = Slab_CHUNKHDRSZ + MAXALIGN(chunkSize);
 
 	/* Make sure the block can store at least one chunk. */
-	if (blockSize < fullChunkSize + sizeof(SlabBlock))
+	if (blockSize < fullChunkSize + MAXALIGN(sizeof(SlabBlock)))
 		elog(ERROR, "block size %zu for slab is too small for %zu chunks",
 			 blockSize, chunkSize);
 
 	/* Compute maximum number of chunks per block */
-	chunksPerBlock = (blockSize - sizeof(SlabBlock)) / fullChunkSize;
+	chunksPerBlock = (blockSize - MAXALIGN(sizeof(SlabBlock))) / fullChunkSize;
 
 	/* The freelist starts with 0, ends with chunksPerBlock. */
 	freelistSize = sizeof(dlist_head) * (chunksPerBlock + 1);


Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
Got it: sizeof(SlabBlock) isn't a multiple of MAXALIGN,

(gdb) p sizeof(SlabBlock)
$4 = 20

but this coding requires it to be:

#define SlabBlockGetChunk(slab, block, idx) \
((MemoryChunk *) ((char *) (block) + sizeof(SlabBlock)  \
+ (idx * slab->fullChunkSize)))

So what you actually need to do is add some alignment padding to
SlabBlock.  I'd suggest reverting df0f4feef as it seems to be
a red herring.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
Robert Haas  writes:
> I've got another problem with this patch here on macOS:

> In file included from aset.c:52:
> ../../../../src/include/utils/memutils_memorychunk.h:170:18: error:
> comparison of constant 7 with expression of type
> 'MemoryContextMethodID' (aka 'enum MemoryContextMethodID') is always
> true [-Werror,-Wtautological-constant-out-of-range-compare]
> Assert(methodid <= MEMORY_CONTEXT_METHODID_MASK);
> ^~~~
> ../../../../src/include/c.h:827:9: note: expanded from macro 'Assert'
> if (!(condition)) \
>   ^

Huh.  My macOS buildfarm animals aren't showing that, nor can I
repro it on my laptop.  Which compiler version are you using exactly?

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tomas Vondra



On 8/29/22 17:20, Tom Lane wrote:
> Tomas Vondra  writes:
>> I can reproduce it on my system (rpi4 running 32-bit raspbian).
> 
> Yeah, more or less same as what I'm testing on.
> 
> Seeing that the complaint is about pfree'ing a non-maxaligned
> ReorderBufferChange pointer, I tried adding
> 
> diff --git a/src/backend/replication/logical/reorderbuffer.c 
> b/src/backend/replication/logical/reorderbuffer.c
> index 89cf9f9389..dfa9b6c9ee 100644
> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -472,6 +472,8 @@ ReorderBufferGetChange(ReorderBuffer *rb)
> change = (ReorderBufferChange *)
> MemoryContextAlloc(rb->change_context, 
> sizeof(ReorderBufferChange));
>  
> +   Assert(change == (void *) MAXALIGN(change));
> +
> memset(change, 0, sizeof(ReorderBufferChange));
> return change;
>  }
> 
> and this failed!
> 
> (gdb) f 3
> #3  0x003cb888 in ReorderBufferGetChange (rb=0x24ed820) at reorderbuffer.c:475
> 475 Assert(change == (void *) MAXALIGN(change));
> (gdb) p change
> $1 = (ReorderBufferChange *) 0x24aaa14
> 
> So the bug is in fact in David's changes, and it consists in palloc
> sometimes handing back non-maxaligned pointers.  I find it mildly
> astonishing that we managed to get through core regression tests
> without such a fault surfacing, but there you have it.
> 
> This machine has MAXALIGN 8 but 4-byte pointers, so there's something
> wrong with that situation.
> 

I suspect it's a pre-existing bug in Slab allocator, because it does this:

#define SlabBlockGetChunk(slab, block, idx) \
((MemoryChunk *) ((char *) (block) + sizeof(SlabBlock)  \
+ (idx * slab->fullChunkSize)))

and SlabBlock is only 20B, i.e. not a multiple of 8B. Which would mean
that even if we allocate block and size the chunks carefully (with all
the MAXALIGN things), we ultimately slice the block incorrectly.
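
To make the arithmetic concrete, here's a tiny standalone sketch (the 20-byte
sizeof(SlabBlock) and MAXALIGN of 8 are the values from this build; the
24-byte fullChunkSize is just an assumed example):

#include <stdio.h>

int
main(void)
{
	size_t		slabblock_sz = 20;	/* sizeof(SlabBlock) on this build */
	size_t		full_chunk_sz = 24; /* assumed MAXALIGNed fullChunkSize */

	/* offsets at which SlabBlockGetChunk() places chunks in the block */
	for (int idx = 0; idx < 3; idx++)
	{
		size_t		off = slabblock_sz + idx * full_chunk_sz;

		printf("chunk %d starts at block + %zu (offset %% 8 = %zu)\n",
			   idx, off, off % 8);
	}
	return 0;
}

Every chunk starts 4 bytes off a MAXALIGN boundary, and everything derived
from those chunk addresses inherits the same 4-byte offset.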

This would explain the 4B difference I reported before, I think. But I'm
just as astonished we got this far in the tests - regular regression
tests don't do much logical decoding, and we only use slab for changes,
but I see the failure in 006 test in src/test/recovery, so the first
five completed fine.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
I wrote:
> So the bug is in fact in David's changes, and it consists in palloc
> sometimes handing back non-maxaligned pointers.  I find it mildly
> astonishing that we managed to get through core regression tests
> without such a fault surfacing, but there you have it.

Oh!  I just noticed that the troublesome context (rb->change_context)
is a SlabContext, so it may be that this only happens in non-aset
contexts.  It's a lot easier to believe that the core tests never
exercise the case of pfree'ing a slab chunk.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
Tomas Vondra  writes:
> I can reproduce it on my system (rpi4 running 32-bit raspbian).

Yeah, more or less same as what I'm testing on.

Seeing that the complaint is about pfree'ing a non-maxaligned
ReorderBufferChange pointer, I tried adding

diff --git a/src/backend/replication/logical/reorderbuffer.c 
b/src/backend/replication/logical/reorderbuffer.c
index 89cf9f9389..dfa9b6c9ee 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -472,6 +472,8 @@ ReorderBufferGetChange(ReorderBuffer *rb)
change = (ReorderBufferChange *)
MemoryContextAlloc(rb->change_context, 
sizeof(ReorderBufferChange));
 
+   Assert(change == (void *) MAXALIGN(change));
+
memset(change, 0, sizeof(ReorderBufferChange));
return change;
 }

and this failed!

(gdb) f 3
#3  0x003cb888 in ReorderBufferGetChange (rb=0x24ed820) at reorderbuffer.c:475
475 Assert(change == (void *) MAXALIGN(change));
(gdb) p change
$1 = (ReorderBufferChange *) 0x24aaa14

So the bug is in fact in David's changes, and it consists in palloc
sometimes handing back non-maxaligned pointers.  I find it mildly
astonishing that we managed to get through core regression tests
without such a fault surfacing, but there you have it.

This machine has MAXALIGN 8 but 4-byte pointers, so there's something
wrong with that situation.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Robert Haas
Hi,

I've got another problem with this patch here on macOS:

ccache clang -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla
-Werror=unguarded-availability-new -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -Wno-unused-command-line-argument -g -O2 -Wall -Werror
-Wno-unknown-warning-option -fno-omit-frame-pointer
-I../../../../src/include  -isysroot
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk
-I/opt/local/include/libxml2 -I/opt/local/include -I/opt/local/include
 -I/opt/local/include  -c -o aset.o aset.c -MMD -MP -MF .deps/aset.Po
In file included from aset.c:52:
../../../../src/include/utils/memutils_memorychunk.h:170:18: error:
comparison of constant 7 with expression of type
'MemoryContextMethodID' (aka 'enum MemoryContextMethodID') is always
true [-Werror,-Wtautological-constant-out-of-range-compare]
Assert(methodid <= MEMORY_CONTEXT_METHODID_MASK);
^~~~
../../../../src/include/c.h:827:9: note: expanded from macro 'Assert'
if (!(condition)) \
  ^
In file included from aset.c:52:
../../../../src/include/utils/memutils_memorychunk.h:186:18: error:
comparison of constant 7 with expression of type
'MemoryContextMethodID' (aka 'enum MemoryContextMethodID') is always
true [-Werror,-Wtautological-constant-out-of-range-compare]
Assert(methodid <= MEMORY_CONTEXT_METHODID_MASK);
^~~~
../../../../src/include/c.h:827:9: note: expanded from macro 'Assert'
if (!(condition)) \
  ^

I'm not sure what to do about that, but every file that includes
memutils_memorychunk.h produces those warnings (which become errors
due to -Werror).

...Robert




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
Amit Kapila  writes:
> Yeah, I also thought that way but couldn't find a reason. I think if
> David is able to reproduce it on one of his systems then he can try
> locally reverting both the commits one by one.

It seems to repro easily on any 32-bit platform.  Aside from the
buildfarm results, I've now duplicated it on 32-bit ARM (which
eliminates the possibility that it's big-endian specific).

"bt full" from the first crash gives

#0  0xb6d7126c in raise () from /lib/libc.so.6
No symbol table info available.
#1  0xb6d5c360 in abort () from /lib/libc.so.6
No symbol table info available.
#2  0x00572430 in ExceptionalCondition (
conditionName=conditionName@entry=0x745110 "pointer == (void *) 
MAXALIGN(pointer)", errorType=errorType@entry=0x5d18d0 "FailedAssertion", 
fileName=fileName@entry=0x7450dc 
"../../../../src/include/utils/memutils_internal.h", 
lineNumber=lineNumber@entry=120) at assert.c:69
No locals.
#3  0x005a0d90 in GetMemoryChunkMethodID (pointer=)
at ../../../../src/include/utils/memutils_internal.h:120
header = 
#4  0x005a231c in GetMemoryChunkMethodID (pointer=)
at ../../../../src/include/utils/memutils_internal.h:119
header = 
header = 
#5  pfree (pointer=) at mcxt.c:1242
No locals.
#6  0x003c8fdc in ReorderBufferCleanupTXN (rb=0x7450dc, rb@entry=0x1f, 
txn=0x745110, txn@entry=0xe1f800) at reorderbuffer.c:1493
change = 
found = false
iter = {cur = , next = 0xddcb18, end = 0xde4e64}
#7  0x003ca968 in ReorderBufferProcessTXN (rb=0x1f, rb@entry=0xe1f800, 
txn=0xe1f800, commit_lsn=, snapshot_now=, 
command_id=, streaming=false) at reorderbuffer.c:2514
change = 0x0
_save_exception_stack = 0x0
_save_context_stack = 0x1
_local_sigjmp_buf = {{__jmpbuf = {-552117431, 1640676937, -1092521468, 
  14569068, 14569068, 14568932, 729, 0, 228514434, 166497, 0, 
  8195960, 8196088, 7, 0 , 13847224, 0, 196628, 
  0, 1844909312, 5178548, -1092520644, 13979392, 13979388, 
  14801472, 0, 8195960, 8196088, 7, 0, 8108342, 8108468, 1, 
  715827883, -1030792151, 0, 8195960, 729, 7, 0, 14809088, 
  14809088, 14542436, 0, 26303888, 0, 14568932, 3978796, 156, 0, 
  1, 0}, __mask_was_saved = 0, __saved_mask = {__val = {64, 
7242472, 14740696, 7244624, 14809088, 14741512, 5, 14740696, 
7244624, 7245864, 3982844, 14542436, 0, 14809092, 0, 26303888, 
729, 14569112, 14809092, 7242472, 729, 14741512, 32, 14809088, 
0, 228514434, 166497, 0, 0, 26303888, 3980968, 0
_do_rethrow = 
using_subtxn = 228
ccxt = 0x7aae68 
iterstate = 0x0
prev_lsn = 26303888
specinsert = 0x0
stream_started = false
curtxn = 0x0
__func__ = "ReorderBufferProcessTXN"
#8  0x003cb460 in ReorderBufferReplay (txn=, 
rb=rb@entry=0xe1f800, commit_lsn=, end_lsn=, 
commit_time=715099169882112, origin_id=0, origin_lsn=0, xid=729)
at reorderbuffer.c:2641
snapshot_now = 
#9  0x003cbedc in ReorderBufferCommit (rb=rb@entry=0xe1f800, 
xid=xid@entry=729, commit_lsn=, end_lsn=, 
commit_time=, commit_time@entry=715099398396546, 
origin_id=, origin_id@entry=0, origin_lsn=0, 
origin_lsn@entry=5906902891454464) at reorderbuffer.c:2665
txn = 
#10 0x003bb19c in DecodeCommit (two_phase=false, xid=729, parsed=0xbee17478, 
buf=, ctx=0xe1d7f8) at decode.c:682
origin_lsn = 
commit_time = 
origin_id = 0
i = 
origin_lsn = 
commit_time = 
origin_id = 
i = 
#11 xact_decode (ctx=0xe1d7f8, buf=) at decode.c:216
xlrec = 
parsed = {xact_time = 715099398396546, xinfo = 73, dbId = 16384, 
  tsId = 1663, nsubxacts = 0, subxacts = 0x0, nrels = 0, 
  xlocators = 0x0, nstats = 0, stats = 0x0, nmsgs = 70, 
  msgs = 0xe2b828, twophase_xid = 0, 
  twophase_gid = '\000' , nabortrels = 0, 
  abortlocators = 0x0, nabortstats = 0, abortstats = 0x0, 
  origin_lsn = 0, origin_timestamp = 0}
xid = 
two_phase = false
builder = 
reorder = 
r = 
info = 
__func__ = "xact_decode"
#12 0x003babf0 in LogicalDecodingProcessRecord (ctx=ctx@entry=0xe1d7f8, 
record=0xe1dad0) at decode.c:119
buf = {origptr = 26303888, endptr = 26305088, record = 0xe1dad0}
txid = 
rmgr = 
#13 0x003c0390 in pg_logical_slot_get_changes_guts (fcinfo=0x0, 
confirm=, binary=) at logicalfuncs.c:271
record = 
errm = 0x0
_save_exception_stack = 0xd60380
_save_context_stack = 0x1
_local_sigjmp_buf = {{__jmpbuf = {-552117223, 1640638557, 4, 14770368, 
  -1092519792, -1, 14770320, 8, 1, 14691520, 0, 8195960, 8196088, 
  7, 0 , 3398864, 0, 13849464, 13848680, 
  14650600, 3399100, 4, 

Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tomas Vondra



On 8/29/22 16:02, Amit Kapila wrote:
> On Mon, Aug 29, 2022 at 7:17 PM Tom Lane  wrote:
>>
>> David Rowley  writes:
>>> I suspect, going by all 3 failing animals being 32-bit which have a
>>> MAXIMUM_ALIGNOF 8 and SIZEOF_SIZE_T of 4 that this is due to the lack
>>> of padding in the MemoryChunk struct.
>>> AllocChunkData and GenerationChunk had padding to account for
>>> sizeof(Size) being 4 and sizeof(void *) being 8, I didn't add that to
>>> MemoryChunk, so I'll do that now.
>>
>> Doesn't seem to have fixed it.  IMO, the fact that we can get through
>> core regression tests and pg_upgrade is a strong indicator that
>> there's not anything fundamentally wrong with memory context
>> management.  I'm inclined to think the problem is in d2169c9985,
>> instead ... though I can't see anything wrong with it.
>>
> 
> Yeah, I also thought that way but couldn't find a reason. I think if
> David is able to reproduce it on one of his systems then he can try
> locally reverting both the commits one by one.
> 

I can reproduce it on my system (rpi4 running 32-bit raspbian). I can't
grant access very easily at the moment, so I'll continue investigating and
do more debugging, or perhaps I can grant access to the system later.

So far all I know is that it doesn't happen on d2169c9985 (so ~5 commits
back), and then it starts failing on c6e0fe1f2a. The extra padding added
by df0f4feef8 makes no difference, because the struct looked like this:

struct MemoryChunk {
Size   requested_size;  /* 0 4 */

/* XXX 4 bytes hole, try to pack */

uint64 hdrmask; /* 8 8 */

/* size: 16, cachelines: 1, members: 2 */
/* sum members: 12, holes: 1, sum holes: 4 */
/* last cacheline: 16 bytes */
};

and the padding makes it look like this:

struct MemoryChunk {
Size   requested_size;  /* 0 4 */
char   padding[4];  /* 4 4 */
uint64 hdrmask; /* 8 8 */

/* size: 16, cachelines: 1, members: 3 */
/* last cacheline: 16 bytes */
};

so it makes no difference.

I did look at the pointers in GetMemoryChunkMethodID, and it looks like
this (p1 is the result of MAXALIGN(pointer)):

(gdb) p pointer
$1 = (void *) 0x1ca1d2c
(gdb) p p1
$2 = 0x1ca1d30 ""
(gdb) p p1 - pointer
$3 = 4
(gdb) p (long int) pointer
$4 = 30022956
(gdb) p (long int) p1
$5 = 30022960
(gdb) p 30022956 % 8
$6 = 4

So the input pointer is not actually aligned to MAXIMUM_ALIGNOF (8B),
but only to 4B. That seems a bit strange.


>> Another possibility is that there's a pre-existing bug in the
>> logical decoding stuff that your changes accidentally exposed.
>>
> 
> Yeah, this is another possibility.

No idea.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Expand palloc/pg_malloc API

2022-08-29 Thread Robert Haas
On Fri, Aug 12, 2022 at 3:31 AM Peter Eisentraut
 wrote:
> (Personally, I have always been a bit suspicious about using the name
> palloc() without memory context semantics in frontend code, but I guess
> this is wide-spread now.)

I think it would be a good thing to add memory context support to the
frontend. We could just put everything in a single context for
starters, and then frontend utilities that wanted to create other
contexts could do so.

There are difficulties, though. For instance, memory contexts are
nodes, and have a NodeTag. And I'm pretty sure we don't want frontend
code to know about all the backend node types. My suspicion is that
memory context types really shouldn't be node types, but right now,
they are. Whether that's the correct view or not, this kind of problem
means it's not a simple lift-and-shift to move the memory context code
into src/common. Someone would need to spend some time thinking about
how to engineer it.
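
To make the shape of the problem concrete, here is a purely speculative
sketch (nothing like this exists in the tree, and all names are
hypothetical) of a frontend-friendly context header with a tag owned by
the memory-context code instead of the backend-wide NodeTag:

/* hypothetical tag private to the memory-context code */
typedef enum McxtKind
{
    MCXT_KIND_ASET,
    MCXT_KIND_GENERATION,
    MCXT_KIND_SLAB
} McxtKind;

/* hypothetical frontend-safe context header, no NodeTag required */
typedef struct FeMemoryContextData
{
    McxtKind    kind;           /* which implementation this is */
    struct FeMemoryContextData *parent;
    struct FeMemoryContextData *firstchild;
    const char *name;           /* for debugging output */
} FeMemoryContextData;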

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: replacing role-level NOINHERIT with a grant-level option

2022-08-29 Thread Robert Haas
On Sun, Aug 28, 2022 at 5:14 PM Nathan Bossart  wrote:
> nitpick: I believe there should be a period before "After."
>
> Otherwise, LGTM.

Good catch. Thanks for the review. Committed with that correction.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add semi-join pushdown to postgres_fdw

2022-08-29 Thread Ashutosh Bapat
Hi Alexander,
Thanks for working on this. It's great to see FDW join pushdown scope
being expanded to more complex cases.

I am still figuring out the implementation. It's been a while since I
last looked at the join pushdown code.

But the following change strikes me as odd:
 -- subquery using immutable function (can be sent to remote)
 PREPARE st3(int) AS SELECT * FROM ft1 t1 WHERE t1.c1 < $2 AND t1.c3
IN (SELECT c3 FROM ft2 t2 WHERE c1 > $1 AND date(c5) =
'1970-01-17'::date) ORDER BY c1;
 EXPLAIN (VERBOSE, COSTS OFF) EXECUTE st3(10, 20);
-  QUERY PLAN

- Sort
+

QUERY PLAN
+---
+ Foreign Scan
Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
-   Sort Key: t1.c1
-   ->  Nested Loop Semi Join
- Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
- Join Filter: (t1.c3 = t2.c3)
- ->  Foreign Scan on public.ft1 t1
-   Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
-   Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8
FROM "S 1"."T 1" WHERE (("C 1" < 20))
- ->  Materialize
-   Output: t2.c3
-   ->  Foreign Scan on public.ft2 t2
- Output: t2.c3
- Remote SQL: SELECT c3 FROM "S 1"."T 1" WHERE
(("C 1" > 10)) AND ((date(c5) = '1970-01-17'::date))
-(14 rows)
+   Relations: (public.ft1 t1) SEMI JOIN (public.ft2 t2)
+   Remote SQL: SELECT r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6,
r1.c7, r1.c8 FROM "S 1"."T 1" r1 WHERE ((r1."C 1" < 20)) AND (EXISTS
(SELECT NULL FROM "S 1"."T 1" r3 WHERE ((r3."C 1" > 10)) AND
((date(r3.c5) = '1970-01-17'::date)) AND ((r1.c3 = r3.c3 ORDER BY
r1."C 1" ASC NULLS LAST
+(4 rows)

date_in  | s   |1 | [0:0]={cstring}
date_in, which will be used to cast the text to date, is not immutable (it
is marked stable, as shown above). So the query shouldn't be pushed down.
It may not be a problem with your patch. Can you please check?
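
(For context, the kind of check this hinges on is whether every function in
the expression is immutable; contain_mutable_functions() is the existing
backend helper that reports stable and volatile functions alike. The wrapper
below is only an illustration, not code from the patch.)

#include "postgres.h"
#include "optimizer/optimizer.h"    /* contain_mutable_functions() */

/* illustrative helper, not from the patch */
static bool
expr_safe_to_push_down(Node *expr)
{
    /* true only if every function referenced by the expression is immutable */
    return !contain_mutable_functions(expr);
}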

On Wed, Aug 24, 2022 at 12:55 PM Alexander Pyhalov
 wrote:
>
> Hi.
>
> It's possible to extend deparsing in postgres_fdw, so that we can push
> down semi-joins, which doesn't refer to inner reltarget. This allows
> us to push down joins in queries like
>
> SELECT * FROM ft1 t1 WHERE t1.c1 < 10 AND t1.c3 IN (SELECT c3 FROM ft2
> t2 WHERE date(c5) = '1970-01-17'::date);
>
>
> EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 < 10 AND
> t1.c3 IN (SELECT c3 FROM ft2 t2 WHERE date(c5) = '1970-01-17'::date);
>
>  QUERY PLAN
> ---
> Foreign Scan
> Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
> Relations: (public.ft1 t1) SEMI JOIN (public.ft2 t2)
> Remote SQL: SELECT r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6,
> r1.c7, r1.c8 FROM "S 1"."T 1" r1 WHERE ((r1."C 1" < 10)) AND (EXISTS
> (SELECT NULL FROM "S 1"."T 1" r3 WHERE ((date(r3.c5) =
> '1970-01-17'::date)) AND ((r1.c3 = r3.c3
>


Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Amit Kapila
On Mon, Aug 29, 2022 at 7:17 PM Tom Lane  wrote:
>
> David Rowley  writes:
> > I suspect, going by all 3 failing animals being 32-bit which have a
> > MAXIMUM_ALIGNOF 8 and SIZEOF_SIZE_T of 4 that this is due to the lack
> > of padding in the MemoryChunk struct.
> > AllocChunkData and GenerationChunk had padding to account for
> > sizeof(Size) being 4 and sizeof(void *) being 8, I didn't add that to
> > MemoryChunk, so I'll do that now.
>
> Doesn't seem to have fixed it.  IMO, the fact that we can get through
> core regression tests and pg_upgrade is a strong indicator that
> there's not anything fundamentally wrong with memory context
> management.  I'm inclined to think the problem is in d2169c9985,
> instead ... though I can't see anything wrong with it.
>

Yeah, I also thought that way but couldn't find a reason. I think if
David is able to reproduce it on one of his systems then he can try
locally reverting both the commits one by one.

> Another possibility is that there's a pre-existing bug in the
> logical decoding stuff that your changes accidentally exposed.
>

Yeah, this is another possibility.

-- 
With Regards,
Amit Kapila.




Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread Tom Lane
David Rowley  writes:
> I suspect, going by all 3 failing animals being 32-bit which have a
> MAXIMUM_ALIGNOF 8 and SIZEOF_SIZE_T of 4 that this is due to the lack
> of padding in the MemoryChunk struct.
> AllocChunkData and GenerationChunk had padding to account for
> sizeof(Size) being 4 and sizeof(void *) being 8, I didn't add that to
> MemoryChunk, so I'll do that now.

Doesn't seem to have fixed it.  IMO, the fact that we can get through
core regression tests and pg_upgrade is a strong indicator that
there's not anything fundamentally wrong with memory context
management.  I'm inclined to think the problem is in d2169c9985,
instead ... though I can't see anything wrong with it.

Another possibility is that there's a pre-existing bug in the
logical decoding stuff that your changes accidentally exposed.
I wonder if valgrind would show anything interesting.

regards, tom lane




Re: SQL/JSON features for v15

2022-08-29 Thread Jonathan S. Katz

On 8/29/22 8:56 AM, Amit Langote wrote:

On Sat, Aug 27, 2022 at 5:11 AM Nikita Glukhov  wrote:



I am not sure if it's OK to eval_const_expressions() on a Query
sub-expression during parse-analysis.  IIUC, it is only correct to
apply it to after the rewriting phase.


Maybe it would be better to simply remove DEFAULT ON EMPTY.


So +1 to this for now.


+1, if this simplifies the patch and makes it acceptable for v15


It is possible to easily split this patch into several subpatches,
I will do it if needed.


That would be nice indeed.


With RMT hat on, the RMT has its weekly meetings on Tuesdays. Based on 
the timing of the Beta 4 commit freeze[1] and how both 
including/reverting are nontrivial operations (e.g. we should ensure 
we're confident in both and that they pass through the buildfarm), we 
are going to have to make a decision on how to proceed at the next meeting.


Can folks please chime in on what they think of the current patchset and 
if this is acceptable for v15?


Thanks,

Jonathan

[1] 
https://www.postgresql.org/message-id/9d251aec-cea2-bc1a-5ed8-46ef0bcf6...@postgresql.org




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2022-08-29 Thread torikoshia

On 2022-08-25 01:54, Damir Belyalov wrote:

+   /* Buffer was filled, commit subtransaction and

prepare to replay */

+   ReleaseCurrentSubTransaction();

What is actually being committed by this
ReleaseCurrentSubTransaction()?
It seems to me that just safecstate->replay_buffer is fulfilled
before
this commit.


All tuples are collected in replay_buffer, which is created in
''oldcontext'' of CopyFrom() (not context of a subtransaction). That's
why it didn't clean up when you used
RollbackAndReleaseCurrentSubTransaction().
Subtransactions are needed for better safety. There is no error when
copying from a file to the replay_buffer, but an error may occur at
the next stage - when adding a tuple to the table. Also there may be
other errors that are not obvious at first glance.


Thanks for the explanation and updating patch.
I now understand that the data being COPYed are not the target of 
COMMIT.


In past discussions[1] it seems the data to be COPYed were also subject to
COMMIT, but I understand this patch has adopted another design.
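
(For readers following along, here is a rough sketch of the per-batch
subtransaction pattern being discussed. The function and the
SafeCopyFromState type are hypothetical simplifications, not the patch
itself; the subtransaction calls are the real APIs from access/xact.h.)

/* needs postgres.h, access/xact.h, utils/memutils.h, utils/resowner.h */
static void
replay_buffered_tuples(SafeCopyFromState *safecstate)  /* hypothetical type */
{
    MemoryContext oldcontext = CurrentMemoryContext;
    ResourceOwner oldowner = CurrentResourceOwner;

    BeginInternalSubTransaction(NULL);

    PG_TRY();
    {
        /* ... insert the buffered tuples into the target table ... */

        /* no error: commit the subtransaction */
        ReleaseCurrentSubTransaction();
    }
    PG_CATCH();
    {
        /*
         * Error while inserting: roll back this batch only, then log and
         * skip it.  A real implementation would also copy and flush the
         * error data here.
         */
        RollbackAndReleaseCurrentSubTransaction();
    }
    PG_END_TRY();

    MemoryContextSwitchTo(oldcontext);
    CurrentResourceOwner = oldowner;
}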


350 +   /* Buffer was filled, commit subtransaction and 
prepare to replay */

351 +   ReleaseCurrentSubTransaction();
352 +   CurrentResourceOwner = safecstate->oldowner;
353 +
354 +   safecstate->replay_is_active = true;

BTW in v4 patch, data are loaded into the buffer one by one, and when 
the buffer fills up, the data in the buffer are 'replayed' also one by 
one, right?

Wouldn't this have more overhead than a normal COPY?

As a test, I COPYed slightly larger data with and without ignore_errors 
option.

There might be other reasons, but I found a performance difference.

```
=# copy test from '/tmp/1000.data' ;
COPY 1000
Time: 6001.325 ms (00:06.001)

=# copy test from '/tmp/1000.data' with (ignore_errors);
NOTICE:  FIND 0 ERRORS
COPY 1000
Time: 7711.555 ms (00:07.712)
```


I feel it might be better to have it as a parameter rather than
fixed at
1000.


Yes, I thought about it too. So I created another version with the GUC
parameter - replay_buffer_size. Attached it. But I think users won't
need to change replay_buffer_size.
Also replay_buffer does the same thing as MultiInsert buffer does and
MultiInsert buffer defined by const = 1000.


Yeah, when the data being COPYed are not the target of COMMIT, I also
think users won't need to change it.



NextCopyFrom() is in copyfromparse.c while safeNextCopyFrom() is in
copyfrom.c.
Since safeNextCopyFrom() is analog of NextCopyFrom() as commented,
would
it be natural to put them in the same file?


Sure, corrected it.


Thanks.




188 +   safecstate->errors++;
189 +   if (safecstate->errors <= 100)
190 +   ereport(WARNING,
191 +   (errcode(errdata->sqlerrcode),
192 +   errmsg("%s", errdata->context)));


It would be better to state in the manual that errors exceeding 100
are
not displayed.
Or, it might be acceptable to output all errors that exceed 100.


It'll be too complicated to create a new parameter just for showing
the given number of errors. I think 100 is an optimal size.


Yeah, I may have misled you, but I also don't think this needs a new 
parameter.

I just thought either of the following would be better:
- State in the documentation that warnings are not output beyond the first
100 errors.

- Output all the warnings.


It would be helpful to add comments about skip_row, etc.


Corrected it.


Thanks.




WARNING:  FIND 0 ERRORS
When there were no errors, this WARNING seems not necessary.


Corrected it.


Thanks.


I applied the v4 patch, and when I canceled the COPY, there was a case where
I found myself left in an aborted transaction.

Should this situation be prevented from occurring?

```
=# copy test from '/tmp/1000.data' with (ignore_errors );

^CCancel request sent
ERROR:  canceling statement due to user request

=# truncate test;
ERROR:  current transaction is aborted, commands ignored until end of 
transaction block

```


[1] 
https://www.postgresql.org/message-id/1197677930.1536.18.camel%40dell.linuxdev.us.dell.com


--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION




Re: SQL/JSON features for v15

2022-08-29 Thread Amit Langote
On Sat, Aug 27, 2022 at 5:11 AM Nikita Glukhov  wrote:
> On 26.08.2022 22:25, Andrew Dunstan wrote:
>
> On 2022-08-24 We 20:05, Nikita Glukhov wrote:
>
> v8 - is a highly WIP patch, which I failed to finish today.
> Even some test cases fail now, and they simply show unfinished
> things like casts to bytea (they can be simply removed) and missing
> safe input functions.
>
> Thanks for your work, please keep going.
>
> I have completed in v9 all the things I previously planned:
>
>  - Added missing safe I/O and type conversion functions for
>datetime, float4, varchar, bpchar.  This introduces a lot
>of boilerplate code for returning errors and also maybe
>adds some overhead.

Didn't know that we have done similar things in the past for jsonpath, as in:

commit 16d489b0fe058e527619f5e9d92fd7ca3c6c2994
Author: Alexander Korotkov 
Date:   Sat Mar 16 12:21:19 2019 +0300

Numeric error suppression in jsonpath

BTW, maybe the following hunk in boolin_opt_error() is unnecessary?

-   len = strlen(str);
+   len -= str - in_str;

>  - Added JSON_QUERY coercion to UTF8 bytea using pg_convert_to().
>
>  - Added immutability checks that were missed with elimination
>of coercion expressions.
>Coercions text::datetime, datetime1::datetime2 and even
>datetime::text for some datetime types are mutable.
>datetime::text can be made immutable by passing ISO date
>style into output functions (like in jsonpath).
>
>  - Disabled non-Const expressions in DEFAULT ON EMPTY in non
>ERROR ON ERROR case.  Non-constant expressions are tried to
>evaluate into Const directly inside transformExpr().

I am not sure if it's OK to eval_const_expressions() on a Query
sub-expression during parse-analysis.  IIUC, it is only correct to
apply it to after the rewriting phase.

>Maybe it would be better to simply remove DEFAULT ON EMPTY.

So +1 to this for now.

> It is possible to easily split this patch into several subpatches,
> I will do it if needed.

That would be nice indeed.

I'm wondering if you're going to change the PASSING values
initialization to add the steps into the parent JsonExpr's ExprState,
like the previous patch was doing?

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: Fix possible bogus array out of bonds (src/backend/access/brin/brin_minmax_multi.c)

2022-08-29 Thread Ranier Vilela
On Sun, Aug 28, 2022 at 22:06, Kyotaro Horiguchi <horikyota@gmail.com> wrote:

> At Fri, 26 Aug 2022 10:28:50 -0300, Ranier Vilela 
> wrote in
> > At function has_matching_range, if variable ranges->nranges == 0,
> > we exit quickly with a result equal to false.
> >
> > This means that nranges can be zero.
> > It is then possible for an array out-of-bounds access to occur in the
> > initialization of the variable maxvalue.
> > So if nranges is equal to zero, there is no need to initialize minvalue
> and
> > maxvalue.
> >
> > The patch tries to fix it, avoiding possible errors by using maxvalue.
>
> However it seems that nranges will never be zero, still the fix looks
> good to me since it is generally allowed to be zero. I don't find a
> similar mistake related to Range.nranges.
>
Thanks, Kyotaro, for taking a look at this.

regards,
Ranier Vilela


RE: patch: Add missing descriptions for rmgr APIs

2022-08-29 Thread kuroda.hay...@fujitsu.com
> Your observation seems correct to me but you have not updated the
> comment for the mask. Is there a reason for the same?

Oh, it seems that I attached the wrong one; there was no particular reason.
PSA the newer version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v2-0001-add-a-missing-comment.patch
Description: v2-0001-add-a-missing-comment.patch


Re: Make #else/#endif comments more consistent

2022-08-29 Thread Anton Voloshin

On 29/08/2022 14:50, Peter Eisentraut wrote:
I usually try to follow the guidelines in 
, which 
pretty much match what you are proposing.


Thank you for the link, it's a useful one and the wording is better than 
mine.


Note that for _MSC_VER in particular there is some trickiness: We 
generally use it to tell apart different MSVC compiler versions.


That's certainly true in branches <= 15, but in master, to my surprise, 
I don't see any numerical comparisons of _MSC_VER since the recent 
6203583b7.


I'm not sure explicit !defined(_MSC_VER) is all that much clearer
than !_MSC_VER in the commentary. After all, we would probably
never (?) see an actual
#if (!_MSC_VER)
in real code.

So I have mixed feelings about forcing defined() on _MSC_VER, but if you 
insist, I don't mind much either way.


What about other changes? Are there any obviously wrong or missed ones?

--
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru





Re: patch: Add missing descriptions for rmgr APIs

2022-08-29 Thread Amit Kapila
On Mon, Aug 29, 2022 at 8:48 AM kuroda.hay...@fujitsu.com
 wrote:
>
> Hi hackers,
>
> While reading code related to logical decoding,
> I thought that the following comment in rmgrlist.h is not consistent.
>
> > /* symbol name, textual name, redo, desc, identify, startup, cleanup */
>
> This comment describes a set of APIs that the resource manager should have, 
> but the functions for {mask, decode} are missing here.
>

Your observation seems correct to me but you have not updated the
comment for the mask. Is there a reason for the same?

-- 
With Regards,
Amit Kapila.




[PATCH] Enable using llvm jitlink as an alternative llvm jit linker of old Rtdyld.

2022-08-29 Thread Alex Fan
This brings the bonus of supporting JIT on riscv64 (included in this patch)
and other platforms Rtdyld doesn't support, e.g. Windows COFF.

Currently, LLVM doesn't expose JITLink (ObjectLinkingLayer) via the C API, so
a wrapper is added. This also adds a minor LLVM 15 compat fix that is needed.
---
 config/llvm.m4|  1 +
 src/backend/jit/llvm/llvmjit.c| 67 +--
 src/backend/jit/llvm/llvmjit_wrap.cpp | 35 ++
 src/include/jit/llvmjit.h |  9 
 4 files changed, 108 insertions(+), 4 deletions(-)

diff --git a/config/llvm.m4 b/config/llvm.m4
index 3a75cd8b4d..a31b8b304a 100644
--- a/config/llvm.m4
+++ b/config/llvm.m4
@@ -75,6 +75,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
   engine) pgac_components="$pgac_components $pgac_component";;
   debuginfodwarf) pgac_components="$pgac_components $pgac_component";;
   orcjit) pgac_components="$pgac_components $pgac_component";;
+  jitlink) pgac_components="$pgac_components $pgac_component";;
   passes) pgac_components="$pgac_components $pgac_component";;
   native) pgac_components="$pgac_components $pgac_component";;
   perfjitevents) pgac_components="$pgac_components $pgac_component";;
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 6c72d43beb..d8b840da8c 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -229,6 +229,11 @@ llvm_release_context(JitContext *context)
 LLVMModuleRef
 llvm_mutable_module(LLVMJitContext *context)
 {
+#ifdef __riscv
+   const char* abiname;
+   const char* target_abi = "target-abi";
+   LLVMMetadataRef abi_metadata;
+#endif
llvm_assert_in_fatal_section();
 
/*
@@ -241,6 +246,40 @@ llvm_mutable_module(LLVMJitContext *context)
context->module = LLVMModuleCreateWithName("pg");
LLVMSetTarget(context->module, llvm_triple);
LLVMSetDataLayout(context->module, llvm_layout);
+#ifdef __riscv
+#if __riscv_xlen == 64
+#ifdef __riscv_float_abi_double
+   abiname = "lp64d";
+#elif defined(__riscv_float_abi_single)
+   abiname = "lp64f";
+#else
+   abiname = "lp64";
+#endif
+#elif __riscv_xlen == 32
+#ifdef __riscv_float_abi_double
+   abiname = "ilp32d";
+#elif defined(__riscv_float_abi_single)
+   abiname = "ilp32f";
+#else
+   abiname = "ilp32";
+#endif
+#else
+   elog(ERROR, "unsupported riscv xlen %d", __riscv_xlen);
+#endif
+   /*
+* set this manually to avoid llvm defaulting to soft float and
+* resulting in linker error: `can't link double-float modules
+* with soft-float modules`
+* we could set this for TargetMachine via MCOptions, but there
+* is no C API for it
+* ref: 
https://github.com/llvm/llvm-project/blob/afa520ab34803c82587ea6759bfd352579f741b4/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp#L90
+*/
+   abi_metadata = LLVMMDStringInContext2(
+   LLVMGetModuleContext(context->module),
+   abiname, strlen(abiname));
+   LLVMAddModuleFlag(context->module, 
LLVMModuleFlagBehaviorOverride,
+   target_abi, strlen(target_abi), abi_metadata);
+#endif
}
 
return context->module;
@@ -786,6 +825,8 @@ llvm_session_initialize(void)
char   *error = NULL;
char   *cpu = NULL;
char   *features = NULL;
+   LLVMRelocMode reloc=LLVMRelocDefault;
+   LLVMCodeModel codemodel=LLVMCodeModelJITDefault;
LLVMTargetMachineRef opt0_tm;
LLVMTargetMachineRef opt3_tm;
 
@@ -820,16 +861,21 @@ llvm_session_initialize(void)
elog(DEBUG2, "LLVMJIT detected CPU \"%s\", with features \"%s\"",
 cpu, features);
 
+#ifdef __riscv
+   reloc=LLVMRelocPIC;
+   codemodel=LLVMCodeModelMedium;
+#endif
+
opt0_tm =
LLVMCreateTargetMachine(llvm_targetref, llvm_triple, cpu, 
features,

LLVMCodeGenLevelNone,
-   
LLVMRelocDefault,
-   
LLVMCodeModelJITDefault);
+   reloc,
+   codemodel);
opt3_tm =
LLVMCreateTargetMachine(llvm_targetref, llvm_triple, cpu, 
features,

LLVMCodeGenLevelAggressive,
-   
LLVMRelocDefault,
-   
LLVMCodeModelJITDefault);
+   reloc,
+

Re: Handle infinite recursion in logical replication setup

2022-08-29 Thread Amit Kapila
On Mon, Aug 29, 2022 at 11:59 AM Peter Smith  wrote:
>
> Here are some review comments for patch v42-0001:
>
> ~~
>
> 6.
>
> +   warning to notify user to check the publisher tables. The user can ensure
> +   that publisher tables do not have data which has an origin associated 
> before
> +   continuing with any other operations to prevent inconsistent data being
> +   replicated.
> +  
>
> 6a.
>
> I'm not so sure about this. IMO the warning is not about the
> replication – really it is about the COPY which has already happened
> anyway. So the user can't really prevent anything from going wrong;
> instead, they have to take some action to clean up if anything did go
> wrong.
>
> SUGGESTION
> Before continuing with other operations the user should check that
> publisher tables did not have data with different origins, otherwise
> inconsistent data may have been copied.
>
> ~

I would like to avoid the use of 'may' here. So, let's change it to
something like: "Before continuing with other operations the user
should check that publisher tables did not have data with different
origins to prevent data inconsistency issues on the subscriber."
>
> 6b.
>
> I am also wondering what can the user do now. Assuming there was bad
> COPY then the subscriber table has already got unwanted stuff copied
> into it. Is there any advice we can give to help users fix this mess?
>

I don't think the user can do much in that case. She will probably
need to re-create the replication.

>
> 11.
>
> + ereport(WARNING,
> + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("publisher has subscribed table \"%s.%s\" from some other publisher",
> +nspname, relname),
> + errdetail("Publisher might have subscribed one or more tables from
> some other publisher."),
> + errhint("Verify that these publisher tables do not have data that
> has an origin associated before proceeding to avoid inconsistency."));
> + break;
>
> 11a.
> I'm not sure having that "break" code is correct logic. Won't that
> mean the user will only get a warning for the first potential problem
> encountered but then other potential problems tables will have no
> warnings at all so the user may not be made aware of them? Perhaps you
> need to first gather the list of all the suspicious tables before
> logging a single warning which shows that list? Or perhaps you need to
> log multiple warnings – one warning per suspicious table?
>
> ~
>
> 11b.
> The WARNING message seems a bit inside-out because the errmsg is
> giving more details than the errdetail.
>
> SUGGESTIONS (or something similar)
> errmsg - "subscription XXX requested origin=NONE but may have copied
> data that had a different origin."
> errdetail – Publisher YYY has subscribed table \"%s.%s\" from some
> other publisher"
>

Your suggestion is better but we can't say 'copied' because the copy
may be started afterwards by the tablesync worker. Also, using 'may' is not
advisable in error messages. How about: "subscription XXX requested
origin=NONE but might copy data that had a different origin."?

-- 
With Regards,
Amit Kapila.




Re: Make #else/#endif comments more consistent

2022-08-29 Thread Peter Eisentraut

On 29.08.22 11:38, Anton Voloshin wrote:
I propose making them more consistent. Would the following guidelines be 
acceptable?


I usually try to follow the guidelines in 
, which 
pretty much match what you are proposing.



And this:
#else   /* !_MSC_VER */
over
#else   /* !defined(_MSC_VER) */


Note that for _MSC_VER in particular there is some trickiness: We 
generally use it to tell apart different MSVC compiler versions.  But it 
is not present with MinGW.  So !_MSC_VER and !defined(_MSC_VER) have 
different meanings.  So in this particular case, more precision in the 
comments might be better.
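
As an illustration of the two idioms being distinguished (the 1900 cutoff is
just an arbitrary example value):

/* version comparison: only meaningful when building with MSVC */
#if defined(_MSC_VER) && _MSC_VER < 1900
/* work around a quirk of older MSVC releases */
#endif

/* presence check: also true under MinGW, where _MSC_VER is not defined */
#ifndef _MSC_VER
/* not building with the MSVC compiler at all */
#endif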





Re: pg_checksum: add test for coverage

2022-08-29 Thread Daniel Gustafsson
> On 29 Aug 2022, at 13:26, Dong Wook Lee  wrote:

> I add a tiny test to pg_checksum for coverage.
> I checked it improve test coverage 77.9% -> 87.7%.

+# Checksums are verified if --progress arguments are specified
+command_ok(
+   [ 'pg_checksums', '--progress', '-D', $pgdata ],
+   "verifies checksums as default action with --progress option");
+
+# Checksums are verified if --verbose arguments are specified
+command_ok(
+   [ 'pg_checksums', '--verbose', '-D', $pgdata ],
+   "verifies checksums as default action with --verbose option");

This isn't really true, --progress or --verbose doesn't enable checksum
verification, it just happens to be the default and thus is invoked when called
without a mode parameter.

As written these tests aren't providing more coverage, they run more code but
they don't ensure that the produced output is correct.  If you write these
tests with validation on the output they will be a lot more interesting.

--
Daniel Gustafsson   https://vmware.com/





RE: Perform streaming logical transactions by background workers and parallel apply

2022-08-29 Thread houzj.f...@fujitsu.com
On Tues, Aug 24, 2022 16:41 PM Kuroda, Hayato/黒田 隼人  
wrote:
> Dear Wang,
> 
> Followings are my comments about v23-0003. Currently I do not have any 
> comments about 0002 and 0004.

Thanks for your comments.

> 09. general
> 
> It seems that logicalrep_rel_mark_parallel_apply() is always called 
> when relations are opened on the subscriber side, but is it really 
> needed? These checks are required only for streaming parallel apply, 
> so it may not be needed in the case of streaming = 'on' or 'off'.

Improved.
This check is only performed when using apply background workers.

> 10. commit message
> 
> 2) There cannot be any non-immutable functions used by the subscriber-side
> replicated table. Look for functions in the following places:
> * a. Trigger functions
> * b. Column default value expressions and domain constraints
> * c. Constraint expressions
> * d. Foreign keys
> 
> "Foreign key" should not be listed here because it is not related with 
> the mutability. I think it should be listed as 3), not d..

Improved.

> 11. create_subscription.sgml
> 
> The constraint about foreign key should be described here.
> 
> 11. relation.c
> 
> 11.a
> 
> +   CacheRegisterSyscacheCallback(PROCOID,
> + 
> logicalrep_relmap_reset_parallel_cb,
> + 
> + (Datum) 0);
> 
> Isn't it needed another syscache callback for pg_type?
> Users can add any constraints via ALTER DOMAIN command, but the added 
> constraint may be not checked.
> I checked AlterDomainAddConstraint(), and it invalidates only the 
> relcache for pg_type.
> 
> 11.b
> 
> +   /*
> +* If the column is of a DOMAIN type, determine whether
> +* that domain has any CHECK expressions that are not
> +* immutable.
> +*/
> +   if (get_typtype(att->atttypid) == TYPTYPE_DOMAIN)
> +   {
> 
> I think the default value of *domain* must be also checked here.
> I tested like followings.
> 
> ===
> 1. created a domain that has a default value CREATE DOMAIN tmp INT 
> DEFAULT 1 CHECK (VALUE > 0);
> 
> 2. created a table
> CREATE TABLE foo (id tmp PRIMARY KEY);
> 
> 3. checked pg_attribute and pg_class
> select oid, relname, attname, atthasdef from pg_attribute, pg_class 
> where pg_attribute.attrelid = pg_class.oid and pg_class.relname = 
> 'foo' and attname = 'id';
>   oid  | relname | attname | atthasdef
> ---+-+-+---
>  16394 | foo | id  | f
> (1 row)
> 
> It meant that functions might not be checked because the if-statement
> `if (att->atthasdef)` became false.
> >atthasdef)` became false.
> ===

Fixed.
In addition, to reduce duplicate validation, only the flag 
"parallel_apply_safe" is reset when pg_proc or pg_type changes.

> 12. 015_stream.pl, 016_stream_subxact.pl, 022_twophase_cascade.pl, 
> 023_twophase_stream.pl
> 
> -   my ($node_publisher, $node_subscriber, $appname, $is_parallel) = @_;
> +   my ($node_publisher, $node_subscriber, $appname) = @_;
> 
> Why the parameter is removed? I think the test that waits the output 
> from the apply background worker is meaningful.

Revert this change.
In addition, I made some modifications to the log checks in these test files
to ensure the streamed transactions complete as expected when using an apply
background worker.

> 13. 032_streaming_apply.pl
> 
> The filename seems too general because apply background workers are 
> tested in above tests.
> How about "streaming_apply_constraint" or something?

Renamed to 032_streaming_parallel_safety.

Best regards,
Hou zj


pg_checksum: add test for coverage

2022-08-29 Thread Dong Wook Lee
Hi,
I add a tiny test to pg_checksum for coverage.
I checked it improve test coverage 77.9% -> 87.7%.

---
Regards,
DongWook Lee.


v1_add_test_to_pg_checksum.patch
Description: Binary data


Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread David Rowley
On Mon, 29 Aug 2022 at 21:37, Amit Kapila  wrote:
> 2022-08-29 03:29:56.911 EDT [1056:67] pg_regress/ddl STATEMENT:
> SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> TRAP: FailedAssertion("pointer == (void *) MAXALIGN(pointer)", File:
> "../../../../src/include/utils/memutils_internal.h", Line: 120, PID:
> 1056)

I suspect, going by all 3 failing animals being 32-bit which have a
MAXIMUM_ALIGNOF 8 and SIZEOF_SIZE_T of 4 that this is due to the lack
of padding in the MemoryChunk struct.

AllocChunkData and GenerationChunk had padding to account for
sizeof(Size) being 4 and sizeof(void *) being 8, I didn't add that to
MemoryChunk, so I'll do that now.

David




Re: Handle infinite recursion in logical replication setup

2022-08-29 Thread Amit Kapila
On Wed, Aug 24, 2022 at 7:27 PM vignesh C  wrote:
>
> Since there was no objections to change it to throw a warning, I have
> made the changes for the same.
>

Review comments for v42-0001*
==
1. Can we improve the query in check_pub_table_subscribed() so that it
doesn't fetch any table that is already part of the subscription on
the subscriber? This may be better than what the patch is doing which
is to first fetch such information and then skip it. If forming this
query turns out to be too complex then we can retain your method as
well but I feel it is worth trying to optimize the query used in the
patch.

2. I thought that it may be better to fetch all the tables that have
the possibility to have data from more than one origin but maybe the
list will be too long especially for FOR ALL TABLE types of cases. Is
this the reason you have decided to give a WARNING as soon as you see
any such table? If so, probably adding a comment for it can be good.

3.
+ $node_publisher->wait_for_catchup($sub_name);
+
+ # also wait for initial table sync to finish
+ $node_subscriber->poll_query_until('postgres', $synced_query)
+   or die "Timed out while waiting for subscriber to synchronize data";



+$node_B->safe_psql(
+ 'postgres', "
+ALTER SUBSCRIPTION $subname_BA REFRESH PUBLICATION");
+
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";

You can replace this and any similar code in the patch with a method
used in commit 0c20dd33db.

4.
+###
+# Join 3rd node (node_C) to the existing 2 nodes(node_A & node_B) bidirectional
+# replication setup when the existing nodes (node_A & node_B) has pre-existing
+# data and the new node (node_C) does not have any data.
+###
+$result = $node_A->safe_psql('postgres', "SELECT * FROM tab ORDER BY 1;");
+is($result, qq(), 'Check existing data');
+
+$result = $node_B->safe_psql('postgres', "SELECT * FROM tab ORDER BY 1;");
+is($result, qq(), 'Check existing data');
+
+$result = $node_C->safe_psql('postgres', "SELECT * FROM tab ORDER BY 1;");
+is($result, qq(), 'Check existing data');

The comments say that for this test, two of the nodes have some
pre-existing data but I don't see the same in the test results. The
test following this one will also have a similar effect. BTW, I am not
sure all the new three-node tests added by this patch are required,
because I don't see that they add code coverage or test anything which
is not tested without them. This has doubled the runtime of this test
file, which is okay if these tests make a useful contribution, but
otherwise it may be better to remove them. Am I missing something?

-- 
With Regards,
Amit Kapila.




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-29 Thread John Naylor
On Mon, Aug 29, 2022 at 4:28 PM John Naylor
 wrote:
>
> Here's the simplest fix I can think of:
>
> /*
>  * Exactly like vector8_is_highbit_set except for the input type, so
> it still looks
>  * at each _byte_ separately.
>  *
>  * XXX x86 uses the same underlying type for vectors with 8-bit,
> 16-bit, and 32-bit
>  * integer elements, but Arm does not, hence the need for a separate function.
>  * We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e. check each
>  * 32-bit element, but that would require an additional mask operation on x86.
>  */
> static inline bool
> vector32_is_highbit_set(const Vector32 v)
> {
> #if defined(USE_NEON)
> return vector8_is_highbit_set((Vector8) v);
> #else
> return vector8_is_highbit_set(v);
> #endif
> }

Bowerbird just reported the same error, so I went ahead and pushed a
fix with this.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: [PATCH] Optimize json_lex_string by batching character copying

2022-08-29 Thread Jelte Fennema
> I wonder why copyable_characters_length is not reset after flushing.

It's not necessary because of the break statement right after. But this part 
of the code was refactored away in John's improved patch that's actually 
merged: 
https://github.com/postgres/postgres/commit/3838fa269c15706df2b85ce2d6af8aacd5611655
 




Re: logical decoding and replication of sequences

2022-08-29 Thread Thomas Munro
On Mon, Aug 8, 2022 at 8:56 PM Kyotaro Horiguchi
 wrote:
> At Mon, 08 Aug 2022 17:33:22 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > If WaitForWALToBecomeAvailable returned by promotion, ReadPageInternal
> > misses the chance to invalidate reader-state.  That state is not an
> > error while in StandbyMode.
>
> Mmm... Maybe I wanted to say:  (Still I'm not sure the rewrite works..)
>
> If WaitForWALToBecomeAvailable returned by promotion, ReadPageInternal
> would miss the chance to invalidate reader-state.  When XLogPageRead
> is called in blocking mode while in StandbyMode (that is, the
> traditional condition) , the function continues retrying until it
> succeeds, or returns XLREAD_FAIL if promotion is triggered.  In other
> words, it was not supposed to return non-failure while the header
> validation is failing while in standby mode.  But while in nonblocking
> mode, the function can return non-failure with lastSourceFailed =
> true, which seems wrong.

New ideas:

0001:  Instead of figuring out when to invalidate the cache, let's
just invalidate it before every read attempt.  It is only marked valid
after success (ie state->readLen > 0).  No need to worry about error
cases.

0002:  While here, I don't like xlogrecovery.c clobbering
xlogreader.c's internal error state, so I think we should have a
function for that with a documented purpose.  It was also a little
inconsistent that it didn't clear a flag (but not buggy AFAICS; kinda
wondering if I should just get rid of that flag, but that's for
another day).

0003:  Thinking about your comments above made me realise that I don't
really want XLogReadPage() to be internally retrying for obscure
failures while reading ahead.  I think I prefer to give up on
prefetching as soon as anything tricky happens, and deal with
complexities once recovery catches up to that point.  I am still
thinking about this point.

Here's the patch set I'm testing.
From 9f547088e54fcf279c74fbd03e4afaeffec69ea3 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 29 Aug 2022 19:56:27 +1200
Subject: [PATCH 1/3] Fix cache invalidation in rare recovery_prefetch case.

Since commit 0668719801 we've had an extra page validation and retry
loop inside XLogPageRead() to handle a rare special case.  The new
recovery prefetch code from commit 5dc0418f could lead to a PANIC in
this case, because the WAL page cached internally by xlogreader.c was
not invalidated when it should have been.

Make tracking of the cached page's status more robust by invalidating it
just before we attempt any read.  Because it is only marked valid when a
read succeeds, now we don't have to worry about invalidation in error
code paths.

The removed line that was setting errormsg_deferred was redundant
because it's already set by report_invalid_record().

Back-patch to 15.

Reported-by: Noah Misch 
Reviewed-by:
Discussion: https://postgr.es/m/20220807003627.GA4168930%40rfd.leadboat.com
---
 src/backend/access/transam/xlogreader.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f17e80948d..e883ade607 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -987,6 +987,13 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		targetPageOff == state->segoff && reqLen <= state->readLen)
 		return state->readLen;
 
+	/*
+	 * Invalidate contents of internal buffer before read attempt.  Just set
+	 * the length to 0, rather than a full XLogReaderInvalReadState(), so we
+	 * don't forget the segment we last read.
+	 */
+	state->readLen = 0;
+
 	/*
 	 * Data is not in our buffer.
 	 *
@@ -1067,11 +1074,6 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	if (state->errormsg_buf[0] != '\0')
-	{
-		state->errormsg_deferred = true;
-		XLogReaderInvalReadState(state);
-	}
 	return XLREAD_FAIL;
 }
 
-- 
2.30.2

From 72c55dfec6729e0f26a9fbf3ff184d3b4308a025 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 29 Aug 2022 21:08:25 +1200
Subject: [PATCH 2/3] Improve xlogrecovery.c/xlogreader.c boundary.

Create a function XLogReaderResetError() to clobber the error message
inside an XLogReaderState, instead of doing it directly from
xlogrecovery.c.  This is older code from commit 0668719801, but commit
5dc0418f probably should have tidied this up because it leaves the
internal state a bit out of sync (not a live bug, but a little
confusing).
---
 src/backend/access/transam/xlogreader.c   | 10 ++
 src/backend/access/transam/xlogrecovery.c |  2 +-
 src/include/access/xlogreader.h   |  3 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e883ade607..c100e18333 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1325,6 

Re: standby promotion can create unreadable WAL

2022-08-29 Thread Dilip Kumar
On Fri, Aug 26, 2022 at 6:14 PM Dilip Kumar  wrote:
>
> On Tue, Aug 23, 2022 at 12:06 AM Robert Haas  wrote:
> >
> However, if anything
> > did try to look at file #4 it would get confused. Maybe that can
> > happen if this is a streaming standby, where we only write an
> > end-of-recovery record upon promotion, rather than a checkpoint, or
> > maybe if there are cascading standbys someone could try to actually
> > use the 00020004 file for something. I'm not sure. But
> > unless I'm missing something, that file is bogus, and our only hope of
> > not having problems is that perhaps no one will ever look at it.

I tried to see the problem with a cascading standby; basically the
setup is like below:
pgprimary->pgstandby(archive only)->pgcascade(streaming + archive).

The second node has to be archive-only because this zero-filled gap is
created in archive-only mode.  With that, I have noticed that when the
cascading standby gets that zero-filled gap it reports the same error
we saw with pg_waldump[1] and then keeps waiting forever on that
file.  I have attached a test case, but the timing is not handled
perfectly in this test, so before the cascading standby is set up some of
the WAL files get removed by pgstandby; I just put a direct return
in RemoveOldXlogFiles() to test this[2].  And this problem is
resolved with the patch given by Robert upthread.

[1]
2022-08-25 16:21:26.413 IST [18235] LOG:  invalid record length at
0/EA8: wanted 24, got 0

[2]
diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index eb5115f..990a879 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3558,6 +3558,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr
lastredoptr, XLogRecPtr endptr,
XLogSegNo   endlogSegNo;
XLogSegNo   recycleSegNo;

+   return;
/* Initialize info about where to try to recycle to */
XLByteToSeg(endptr, endlogSegNo, wal_segment_size);
recycleSegNo = XLOGfileslop(lastredoptr);

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
#!/bin/bash

PG_PRIMARY_PORT=5432
PG_STANDBY_PORT=5433
PG_CASCADE_PORT=5434

# Cleanup anything left over from previous runs.
for d in pgprimary pgstandby pgcascade; do
if test -d $d; then
		pg_ctl stop -D $d;
		rm -rf $d
	fi
	rm -f $d.log
done
rm -rf wal_archive
rm -rf wal_archive1
# Echo commands from this point onward and exit on failure.
set -ex

# Create empty WAL archive.
mkdir wal_archive
mkdir wal_archive1

# Initialize and start primary.
# enable archiving
initdb -D pgprimary
cat >> pgprimary/postgresql.auto.conf <> pgstandby/postgresql.auto.conf <> pgcascade/postgresql.auto.conf <

Re: Reducing the chunk header sizes on all memory context types

2022-08-29 Thread David Rowley
On Mon, 29 Aug 2022 at 21:37, Amit Kapila  wrote:
> I am not completely sure if the failure is due to your commit but it
> was showing the line added by this commit. Note that I had also
> committed (commit id: d2169c9985) one patch today but couldn't
> correlate the failure with that so thought of checking with you. There
> are other similar failures[2][3] as well but [1] shows the stack
> trace. Any idea?

I'm currently suspecting that I broke it.  I'm thinking it was just an
unfortunate coincidence that you made some changes to test_decoding
shortly before.

I'm currently seeing if I can recreate this on a Raspberry Pi.

David

> [1] - 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba=2022-08-29%2005%3A53%3A57
> [2] - 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison=2022-08-29%2008%3A13%3A09
> [3] - 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skate=2022-08-29%2006%3A13%3A24




Make #else/#endif comments more consistent

2022-08-29 Thread Anton Voloshin

Hello,

while reading the postgres code, occasionally I see a little bit of 
inconsistency in the comments after #else (and corresponding #endif).


In some places the #else/#endif comment expresses the condition for the else
block to be active:

#ifdef HAVE_UUID_OSSP
...
#else   /* !HAVE_UUID_OSSP */
...
#endif  /* HAVE_UUID_OSSP */

and in others -- just the opposite:

#ifdef SHA2_UNROLL_TRANSFORM
...
#else   /* SHA2_UNROLL_TRANSFORM */
...
#endif  /* SHA2_UNROLL_TRANSFORM */

Also, the #endif comment after an #else might express the condition for the
else block to be active:

#ifdef USE_ICU
...
#else   /* !USE_ICU */
...
#endif  /* !USE_ICU */

or it might be just the opposite, like in HAVE_UUID_OSSP and 
SHA2_UNROLL_TRANSFORM examples above.



I propose making them more consistent. Would the following guidelines be 
acceptable?



1. #else/#elif/#endif's comment, if present, should reflect the
condition of the #else/#elif block as opposed to always being a copy
of #if/ifdef/ifndef condition.

e.g. prefer this:
#if LLVM_VERSION_MAJOR > 11
...
#else   /* LLVM_VERSION_MAJOR <= 11 */
...
#endif  /* LLVM_VERSION_MAJOR <= 11 */

over this:

#if LLVM_VERSION_MAJOR > 11
...
#else   /* LLVM_VERSION_MAJOR > 11 */
...
#endif  /* LLVM_VERSION_MAJOR > 11 */


2. In #else/#elif/#endif comments, prefer A to defined(A).

E.g. prefer this:
#endif  /* DMETAPHONE_MAIN */
over
#endif  /* defined DMETAPHONE_MAIN */

And this:
#else   /* !_MSC_VER */
over
#else   /* !defined(_MSC_VER) */


3. Textual hand-crafted condition comments are perfectly fine.
Like this:
#else   /* no ppoll(), so use select() */


4. #else/#endif condition comment, if present, should reflect the
*effective* condition, i.e. condition taking into account previous
#if/#elif-s.

E.g. do this:
#if defined(HAVE_INT128)
...
#elif defined(HAS_64_BIT_INTRINSICS)
...
#else   /* !HAVE_INT128 && !HAS_64_BIT_INTRINSICS */
...
#endif  /* !HAVE_INT128 && !HAS_64_BIT_INTRINSICS */


5. Comment of the form "!A && !B", if deemed complicated enough, may
also be expressed as "neither A nor B" for easier reading.

Example:
#if (defined(HAVE_LANGINFO_H) && defined(CODESET)) || defined(WIN32)
...
#else   /* neither (HAVE_LANGINFO_H && CODESET) 
nor WIN32 */

...
#endif  /* neither (HAVE_LANGINFO_H && CODESET) 
nor WIN32 */



6. Use "!" as opposed to "not" to be consistent. E.g. do this:
#ifdef LOCK_DEBUG
...
#else   /* !LOCK_DEBUG */
...
#endif  /* !LOCK_DEBUG */

as opposed to:

#ifdef LOCK_DEBUG
...
#else   /* not LOCK_DEBUG */
...
#endif  /* not LOCK_DEBUG */


The draft of proposed changes is attached as
0001-Make-else-endif-comments-more-consistent.patch
In the patch I've also cleaned up some minor things, like removing
occasional "//" comments within "/* */" ones.

Any thoughts?
--
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru

From a118eb1c4caa1a140cd9a8c4230b91c7bfb91773 Mon Sep 17 00:00:00 2001
From: Anton Voloshin 
Date: Sat, 27 Aug 2022 15:56:11 +0300
Subject: [PATCH] Make else/endif comments more consistent

This only changes condition comments after some preprocessor directives
(mostly else and endif).

1. #else/#elif/#endif's comment, if present, should reflect the condition
of the #else/#elif block as opposed to always being a copy of #if/ifdef/ifndef
condition.

e.g. do this:
#if LLVM_VERSION_MAJOR > 11
...
#else   /* LLVM_VERSION_MAJOR <= 11 */
...
#endif  /* LLVM_VERSION_MAJOR <= 11 */

as opposed to

#if LLVM_VERSION_MAJOR > 11
...
#else   /* LLVM_VERSION_MAJOR > 11 */
...
#endif  /* LLVM_VERSION_MAJOR > 11 */


2. In #else/#elif/#endif comments, prefer A to defined(A).

E.g. prefer this:
#endif  /* DMETAPHONE_MAIN */
over
#endif  /* defined DMETAPHONE_MAIN */

And this:
#else   /* !_MSC_VER */
over
#else   /* !defined(_MSC_VER) */


3. Textual hand-crafted condition comments are perfectly fine.
Like this:
#else   /* no ppoll(), so use select() */


4. #else/#endif condition comment, if present, should reflect the *effective*
condition, i.e. condition taking into account previous #if/#elif-s.

E.g. do this:
#if defined(HAVE_INT128)
...
#elif defined(HAS_64_BIT_INTRINSICS)
...
#else   /* !HAVE_INT128 && !HAS_64_BIT_INTRINSICS */
...
#endif
