Re: WAL Insertion Lock Improvements

2023-07-20 Thread Michael Paquier
On Thu, Jul 20, 2023 at 02:38:29PM +0530, Bharath Rupireddy wrote:
> On Fri, Jul 14, 2023 at 4:17 AM Andres Freund  wrote:
>> I think this commit does too many things at once.
> 
> I've split the patch into three - 1) Make insertingAt 64-bit atomic.
> 2) Have better commenting on why there's no memory barrier or spinlock
> in and around LWLockWaitForVar call sites. 3) Have a quick exit for
> LWLockUpdateVar.

FWIW, I was kind of already OK with 0001, as it shows most of the
gains observed while 0003 had a limited impact:
https://www.postgresql.org/message-id/CALj2ACWgeOPEKVY-TEPvno%3DVatyzrb-vEEP8hN7QqrQ%3DyPRupA%40mail.gmail.com

It is kind of a no-brainer to replace the spinlocks with atomic reads
and writes there.
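
For reference, the "quick exit" in 0003 amounts to skipping the wait-list
lock when nobody is waiting on the variable.  A minimal sketch of the idea
(not the actual patch; LW_FLAG_HAS_WAITERS is the existing waiter flag in
lwlock.c):

void
LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
{
	/*
	 * Publish the new value.  An atomic exchange also acts as a full
	 * memory barrier, so waiters that enqueue later see the update.
	 */
	pg_atomic_exchange_u64(valptr, val);

	/* Quick exit when nobody is waiting: there is no one to wake up. */
	if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
		return;

	/* ... otherwise take the wait list lock and wake up waiters ... */
}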

>> I don't think:
>>  * NB: LWLockConflictsWithVar (which is called from
>>  * LWLockWaitForVar) relies on the spinlock used above in this
>>  * function and doesn't use a memory barrier.
>>
>> helps to understand why any of this is safe to a meaningful degree.
>>
>> The existing comments obviously aren't sufficient to explain this, but
>> the reason it's, I think, safe today, is that we are only waiting for
>> insertions that started before WaitXLogInsertionsToFinish() was called.  The
>> lack of memory barriers in the loop means that we might see locks as "unused"
>> that have *since* become used, which is fine, because they only can be for
>> later insertions that we wouldn't want to wait on anyway.
> 
> Right.

FWIW, I always have a hard time coming back to this code and seeing it
rely on undocumented assumptions, with code in lwlock.c while we need
to keep an eye on xlog.c (we take a spinlock there while the variable
wait logic relies on it for ordering @-@).  So the proposal of getting
more documentation in place via 0002 goes in the right direction.

>> This comment seems off to me. Using atomics doesn't guarantee not needing
>> locking. It just guarantees that we are reading a non-torn value.
> 
> Modified the comment.

-   /*
-    * Read value using the lwlock's wait list lock, as we can't generally
-    * rely on atomic 64 bit reads/stores.  TODO: On platforms with a way to
-    * do atomic 64 bit reads/writes the spinlock should be optimized away.
-    */
-   LWLockWaitListLock(lock);
-   value = *valptr;
-   LWLockWaitListUnlock(lock);
+   /* Reading atomically avoids getting a torn value */
+   value = pg_atomic_read_u64(valptr);

Should this specify that this is specifically important for platforms
where reading a uint64 could lead to a torn value read, if we apply
this term in this context?  At the risk of sounding picky, I would
make that a bit longer, say something like this:
"Reading this value atomically is safe even on platforms where uint64
cannot be read without observing a torn value."

By the way, only xlogprefetcher.c uses the term "torn" for a value, and
there it refers to a write.
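
To spell out the guarantee being relied on: a "torn" read observes half of
an in-progress 8-byte write, which is possible on platforms without native
64-bit single-copy atomicity.  A minimal sketch of what the pg_atomic_*
API promises (variable names here are illustrative):

static pg_atomic_uint64 shared_pos;	/* would live in shared memory,
					 * initialized elsewhere with
					 * pg_atomic_init_u64(&shared_pos, 0) */

static void
writer(void)
{
	pg_atomic_write_u64(&shared_pos, UINT64CONST(0x100000001));
}

static uint64
reader(void)
{
	/*
	 * May observe the old or the new value, but never a mix of the two
	 * 32-bit halves.  On platforms without native 64-bit atomic loads,
	 * the implementation falls back to a compare-and-exchange loop, so
	 * callers do not need a spinlock just to avoid torn values.
	 */
	return pg_atomic_read_u64(&shared_pos);
}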

>>> @@ -1605,9 +1602,14 @@ LWLockConflictsWithVar(LWLock *lock,
>>>   *
>>>   * Note: this function ignores shared lock holders; if the lock is held
>>>   * in shared mode, returns 'true'.
>>> + *
>>> + * Be careful that LWLockConflictsWithVar() does not include a memory barrier,
>>> + * hence the caller of this function may want to rely on an explicit barrier or
>>> + * a spinlock to avoid memory ordering issues.
>>>   */
>>
>> s/careful/aware/?
>>
>> s/spinlock/implied barrier via spinlock or lwlock/?
> 
> Done.

It is okay to mention an LWLock here, even if the sole caller of this
routine relies on a spinlock.

0001 looks OK-ish seen from here.  Thoughts?
--
Michael


signature.asc
Description: PGP signature


Re: Support worker_spi to execute the function dynamically.

2023-07-20 Thread Bharath Rupireddy
On Fri, Jul 21, 2023 at 8:38 AM Michael Paquier  wrote:
>
> On Thu, Jul 20, 2023 at 03:44:12PM +0530, Bharath Rupireddy wrote:
> > I don't think that change is correct. The worker_spi module essentially
> > shows how to start bg workers with RegisterBackgroundWorker and dynamic
> > bg workers with RegisterDynamicBackgroundWorker. If
> > shared_preload_libraries = worker_spi is not specified, the workers
> > registered with RegisterBackgroundWorker will never be started. Does
> > giving an initdb-time database name to worker_spi.database work there?
> > If the database for the bg workers doesn't exist, changing
> > bgw_restart_time from BGW_NEVER_RESTART to, say, 1 will help to see the
> > bg workers coming up eventually.
>
> Yeah, it does not move the needle by much.  I think that we are
> looking at switching this module to use a TAP test in the long term,
> instead, where it would be possible to test the scenarios we want to
> look at *with* and *without* shared_preload_libraries, especially with
> the custom wait events for extensions in mind if we add our tests in
> this module.

Okay. Here's a quick patch for adding TAP tests to the worker_spi
module. We can change it to taste.

Thoughts?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Add-TAP-tests-to-worker_spi-module.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-20 Thread Amit Kapila
On Fri, Jul 21, 2023 at 7:30 AM Peter Smith  wrote:
>
> ~~~
>
> 2. StartLogRepWorker
>
> /* Common function to start the leader apply or tablesync worker. */
> void
> StartLogRepWorker(int worker_slot)
> {
> /* Attach to slot */
> logicalrep_worker_attach(worker_slot);
>
> /* Setup signal handling */
> pqsignal(SIGHUP, SignalHandlerForConfigReload);
> pqsignal(SIGTERM, die);
> BackgroundWorkerUnblockSignals();
>
> /*
> * We don't currently need any ResourceOwner in a walreceiver process, but
> * if we did, we could call CreateAuxProcessResourceOwner here.
> */
>
> /* Initialise stats to a sanish value */
> MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
> MyLogicalRepWorker->reply_time = GetCurrentTimestamp();
>
> /* Load the libpq-specific functions */
> load_file("libpqwalreceiver", false);
>
> InitializeLogRepWorker();
>
> /* Connect to the origin and start the replication. */
> elog(DEBUG1, "connecting to publisher using connection string \"%s\"",
> MySubscription->conninfo);
>
> /*
> * Setup callback for syscache so that we know when something changes in
> * the subscription relation state.
> */
> CacheRegisterSyscacheCallback(SUBSCRIPTIONRELMAP,
>   invalidate_syncing_table_states,
>   (Datum) 0);
> }
>
> ~
>
> 2a.
> The function name seems a bit misleading because it is not really
> "starting" anything here - it is just more "initialization" code,
> right? Nor is it common to all kinds of LogRepWorker. Maybe the
> function could be named something else like 'InitApplyOrSyncWorker()'.
> -- see also #2c
>

How about SetupLogRepWorker? The other thing I noticed is that we
don't seem to be consistent in naming functions in these files. For
example, shall we make all exposed functions follow camel case (like
InitializeLogRepWorker) and static functions follow the underscore
style (like run_apply_worker)? The other possibility is to use the
underscore style for all functions except maybe the entry functions
like ApplyWorkerMain(). I don't know if there is already a pattern,
but if not then let's form one now, so that the code looks consistent.

> ~
>
> 2b.
> Should this have Assert to ensure this is only called from leader
> apply or tablesync? -- see also #2c
>
> ~
>
> 2c.
> IMO maybe the best/tidiest way to do this is not to introduce a new
> function at all. Instead, just put all this "common init" code into
> the existing "common init" function ('InitializeLogRepWorker') and
> execute it only if (am_tablesync_worker() || am_leader_apply_worker())
> { }.
>

I don't like 2c much because it will make InitializeLogRepWorker()
have two kinds of initializations.

> ==
> src/include/replication/worker_internal.h
>
> 3.
>  extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
>  XLogRecPtr remote_lsn);
> +extern void set_stream_options(WalRcvStreamOptions *options,
> +char *slotname,
> +XLogRecPtr *origin_startpos);
> +
> +extern void start_apply(XLogRecPtr origin_startpos);
> +extern void DisableSubscriptionAndExit(void);
> +extern void StartLogRepWorker(int worker_slot);
>
> This placement (esp. with the missing whitespace) seems to be grouping
> the set_stream_options with the other 'pa' externs, which are all
> under the comment "/* Parallel apply worker setup and interactions
> */".
>
> Putting all these up near the other "extern void
> InitializeLogRepWorker(void)" might be less ambiguous.
>

+1. Also, note that they should be in the same order as they are in .c files.

-- 
With Regards,
Amit Kapila.




PATCH: Add REINDEX tag to event triggers

2023-07-20 Thread Garrett Thornburg
I've attached my v1 patch to add REINDEX to event triggers.

I originally built this against pg15 but rebased to master to hopefully
make it easier for maintainers to merge. The rebase was automatic, so it
should be easy to apply to any recent version. I'd love for this to land
in pg16 if possible.

I added regression tests and they are passing. I also formatted the code
using the project tools. Docs are included too.

A few notes:
1. I tried to not touch the function parameters too much and opted to
create a convenience function that makes it easy to attach index or table
OIDs to the current event trigger list.
2. I made sure both concurrent and regular reindex of an index, a table,
and a database work as expected.
3. The ddl_command_end event will make available all indexes that were
modified by the reindex command. This is a large list when you run "reindex
database", but it works really well. I debated returning records for the
table or DB that was reindexed, but that doesn't really make sense. Returning
each index fits my use case of building an audit record around the index
lifecycle (hence the motivation for the patch).

Thanks for your time,

Garrett Thornburg


v1-0001-Add-REINDEX-tag-to-event-triggers.patch
Description: Binary data


Re: POC: GROUP BY optimization

2023-07-20 Thread Andrey Lepikhov

On 20/7/2023 18:46, Tomas Vondra wrote:

On 7/20/23 08:37, Andrey Lepikhov wrote:

On 3/10/2022 21:56, Tom Lane wrote:

Revert "Optimize order of GROUP BY keys".

This reverts commit db0d67db2401eb6238ccc04c6407a4fd4f985832 and
several follow-on fixes.
...
Since we're hard up against the release deadline for v15, let's
revert these changes for now.  We can always try again later.


It may be time to restart the project. As a first step, I rebased the
patch on the current master. It wasn't trivial because of some recent
optimizations (a29eab, 1349d27 and 8d83a5d).
Now let's repeat the review and rewrite the current patch according to
the reasons given in the revert commit.


I think the fundamental task is to make the costing more reliable, and
the commit message of 443df6e2db points out a couple of challenges in this
area. Not sure how feasible it is to address enough of them ...

1) procost = 1.0 - I guess we could make this more realistic by doing
some microbenchmarks and tuning the costs for the most expensive cases.

2) estimating quicksort comparisons - This relies on ndistinct
estimates, and I'm not sure how much more reliable we can make those.
Probably not much :-( Not sure what to do about this, the only thing I
can think of is to track "reliability" of the estimates and only do the
reordering if we have high confidence in the estimates. That means we'll
miss some optimization opportunities, but it should limit the risk.

For me personally, the most challenging issue is:
3. The imbalance induced by the cost_sort() changes. It may increase or
decrease the contribution of sorting to the total cost compared to other
factors and change the choice of sorted paths.


--
regards,
Andrey Lepikhov
Postgres Professional





Re: Buildfarm failures on chipmunk

2023-07-20 Thread Kyotaro Horiguchi
At Fri, 21 Jul 2023 09:10:54 +0530, vignesh C  wrote in 
> postmaster.log shows the following error:
> 2023-07-20 08:21:00.343 GMT [27063] LOG:  unrecognized configuration
> parameter "force_parallel_mode" in file
> "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/tmp_check/data/postgresql.conf"
> line 835
> 2023-07-20 08:21:00.344 GMT [27063] FATAL:  configuration file
> "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/tmp_check/data/postgresql.conf"
> contains errors
> 
> Thoughts?

[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-07-20%2006%3A48%3A23

> $Script_Config = {
> ...
>'extra_config' => {
>'DEFAULT' => [
> ...
>'HEAD' => [
>'force_parallel_mode = regress'
>  ]
>  },

The BF config needs a fix for HEAD.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Buildfarm failures on chipmunk

2023-07-20 Thread Mingli Zhang
Hi,

> On Jul 21, 2023, at 11:40, vignesh C  wrote:
> 
> Hi,
> 
> Recently chipmunk has failed with the following errors at [1]:
> 
> make -C '../../..'
> DESTDIR='/home/pgbfarm/buildroot/HEAD/pgsql.build'/tmp_install install
>> '/home/pgbfarm/buildroot/HEAD/pgsql.build'/tmp_install/log/install.log
> 2>&1
> make -j1  checkprep
>>> '/home/pgbfarm/buildroot/HEAD/pgsql.build'/tmp_install/log/install.log
> 2>&1
> echo "# +++ regress check in src/test/regress +++" &&
> PATH="/home/pgbfarm/buildroot/HEAD/pgsql.build/tmp_install/home/pgbfarm/buildroot/HEAD/inst/bin:/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress:$PATH"
> LD_LIBRARY_PATH="/home/pgbfarm/buildroot/HEAD/pgsql.build/tmp_install/home/pgbfarm/buildroot/HEAD/inst/lib"
> ../../../src/test/regress/pg_regress --temp-instance=./tmp_check
> --inputdir=. --bindir=
> --temp-config=/tmp/buildfarm-BewC4d/bfextra.conf  --no-locale
> --port=5678 --dlpath=. --max-concurrent-tests=20 --port=5678
> --schedule=./parallel_schedule
> # +++ regress check in src/test/regress +++
> # postmaster failed, examine
> "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/log/postmaster.log"
> for the reason
> Bail out!GNUmakefile:118: recipe for target 'check' failed
> make: *** [check] Error 2
> log files for step check:
> 
> postmaster.log shows the following error:
> 2023-07-20 08:21:00.343 GMT [27063] LOG:  unrecognized configuration
> parameter "force_parallel_mode" in file
> "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/tmp_check/data/postgresql.conf"
> line 835
> 2023-07-20 08:21:00.344 GMT [27063] FATAL:  configuration file
> "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/tmp_check/data/postgresql.conf"
> contains errors
> 
> [1] - 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-07-20%2006%3A48%3A23
> 
> Thoughts?
> 
> Regards,
> Vignesh
> 
> 


It seems `force_parallel_mode` has been removed in commit 5352ca22e.

The config should use `debug_parallel_query` instead.
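
Concretely, the HEAD entry in the buildfarm's extra_config quoted elsewhere
in the thread would become something like:

   'HEAD' => [
               'debug_parallel_query = regress'
             ]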




Zhang Mingli
https://www.hashdata.xyz



Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-20 Thread Peter Smith
Some review comments for v21-0002.

On Thu, Jul 20, 2023 at 11:41 PM Melih Mutlu  wrote:
>
> Hi,
>
> Attached the updated patches with recent reviews addressed.
>
> See below for my comments:
>
> Peter Smith , on Wed, 19 Jul 2023 at 06:08, wrote:
>>
>> 5.
>> + /* Found a table for next iteration */
>> + finish_sync_worker(true);
>> +
>> + StartTransactionCommand();
>> + ereport(LOG,
>> + (errmsg("logical replication worker for subscription \"%s\" will be
>> reused to sync table \"%s\" with relid %u.",
>> + MySubscription->name,
>> + get_rel_name(MyLogicalRepWorker->relid),
>> + MyLogicalRepWorker->relid)));
>> + CommitTransactionCommand();
>> +
>> + done = false;
>> + break;
>> + }
>> + LWLockRelease(LogicalRepWorkerLock);
>>
>>
>> 5b.
>> Isn't there a missing call to that LWLockRelease, if the 'break' happens?
>
>
> Lock is already released before break, if that's the lock you meant:
>
>> /* Update worker state for the next table */
>> MyLogicalRepWorker->relid = rstate->relid;
>> MyLogicalRepWorker->relstate = rstate->state;
>> MyLogicalRepWorker->relstate_lsn = rstate->lsn;
>> LWLockRelease(LogicalRepWorkerLock);
>>
>>
>> /* Found a table for next iteration */
>> finish_sync_worker(true);
>> done = false;
>> break;
>
>

Sorry, I misread the code. You are right.

==
src/backend/replication/logical/tablesync.c

1.
+ if (!reuse_worker)
+ {
+ ereport(LOG,
+ (errmsg("logical replication table synchronization worker for
subscription \"%s\" has finished",
+ MySubscription->name)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("logical replication worker for subscription \"%s\" will be
reused to sync table \"%s\" with relid %u.",
+ MySubscription->name,
+ get_rel_name(MyLogicalRepWorker->relid),
+ MyLogicalRepWorker->relid)));
+ }

1a.
We know this must be a tablesync worker, so I think that second errmsg
should also be saying "logical replication table synchronization
worker".

~

1b.
Since this is if/else anyway, is it simpler to be positive and say "if
(reuse_worker)" instead of the negative "if (!reuse_worker)"?

~~~

2. run_tablesync_worker
 {
+ MyLogicalRepWorker->relsync_completed = false;
+
+ /* Start table synchronization. */
  start_table_sync(origin_startpos, );

This still contains the added comment that I'd previously posted I
thought wasn't adding anything useful. Also, I don't think this comment
exists in the HEAD code.

==
src/backend/replication/logical/worker.c

3. LogicalRepApplyLoop

+ /*
+ * apply_dispatch() may have gone into apply_handle_commit()
+ * which can call process_syncing_tables_for_sync.
+ *
+ * process_syncing_tables_for_sync decides whether the sync of
+ * the current table is completed. If it is completed,
+ * streaming must be already ended. So, we can break the loop.
+ */
+ if (am_tablesync_worker() &&
+ MyLogicalRepWorker->relsync_completed)
+ {
+ endofstream = true;
+ break;
+ }
+

Maybe just personal taste, but IMO it is better to rearrange like
below because then there is no reason to read the long comment except
for tablesync workers.

if (am_tablesync_worker())
{
/*
 * apply_dispatch() may have gone into apply_handle_commit()
 * which can call process_syncing_tables_for_sync.
 *
 * process_syncing_tables_for_sync decides whether the sync of
 * the current table is completed. If it is completed,
 * streaming must be already ended. So, we can break the loop.
*/
if (MyLogicalRepWorker->relsync_completed)
{
endofstream = true;
break;
}
}

~~~

4. LogicalRepApplyLoop

+
+ /*
+ * If relsync_completed is true, this means that the tablesync
+ * worker is done with synchronization. Streaming has already been
+ * ended by process_syncing_tables_for_sync. We should move to the
+ * next table if needed, or exit.
+ */
+ if (am_tablesync_worker() &&
+ MyLogicalRepWorker->relsync_completed)
+ endofstream = true;

Ditto the same comment about rearranging the condition, as #3 above.

==
src/include/replication/worker_internal.h

5.
+ /*
+ * Indicates whether tablesync worker has completed syncing its assigned
+ * table.
+ */
+ bool relsync_completed;
+

Isn't it better to arrange this to be adjacent to other relXXX fields,
so they all clearly belong to that "Used for initial table
synchronization." group?

For example, something like:

/* Used for initial table synchronization. */
Oid  relid;
char relstate;
XLogRecPtr relstate_lsn;
slock_t relmutex;
bool relsync_completed; /* has tablesync finished syncing
the assigned table? */

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Buildfarm failures on chipmunk

2023-07-20 Thread vignesh C
Hi,

Recently chipmunk has failed with the following errors at [1]:

make -C '../../..'
DESTDIR='/home/pgbfarm/buildroot/HEAD/pgsql.build'/tmp_install install
>'/home/pgbfarm/buildroot/HEAD/pgsql.build'/tmp_install/log/install.log
2>&1
make -j1  checkprep
>>'/home/pgbfarm/buildroot/HEAD/pgsql.build'/tmp_install/log/install.log
2>&1
echo "# +++ regress check in src/test/regress +++" &&
PATH="/home/pgbfarm/buildroot/HEAD/pgsql.build/tmp_install/home/pgbfarm/buildroot/HEAD/inst/bin:/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress:$PATH"
LD_LIBRARY_PATH="/home/pgbfarm/buildroot/HEAD/pgsql.build/tmp_install/home/pgbfarm/buildroot/HEAD/inst/lib"
 ../../../src/test/regress/pg_regress --temp-instance=./tmp_check
--inputdir=. --bindir=
--temp-config=/tmp/buildfarm-BewC4d/bfextra.conf  --no-locale
--port=5678 --dlpath=. --max-concurrent-tests=20 --port=5678
--schedule=./parallel_schedule
# +++ regress check in src/test/regress +++
# postmaster failed, examine
"/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/log/postmaster.log"
for the reason
Bail out!GNUmakefile:118: recipe for target 'check' failed
make: *** [check] Error 2
log files for step check:

postmaster.log shows the following error:
2023-07-20 08:21:00.343 GMT [27063] LOG:  unrecognized configuration
parameter "force_parallel_mode" in file
"/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/tmp_check/data/postgresql.conf"
line 835
2023-07-20 08:21:00.344 GMT [27063] FATAL:  configuration file
"/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/tmp_check/data/postgresql.conf"
contains errors

[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-07-20%2006%3A48%3A23

Thoughts?

Regards,
Vignesh




Re: Use of additional index columns in rows filtering

2023-07-20 Thread Peter Geoghegan
On Thu, Jul 20, 2023 at 4:35 AM Tomas Vondra
 wrote:
> I think the SAOP patch may need to be much more careful about this, but
> for this patch it's simpler because it doesn't really change any of the
> index internals, or the indexscan in general.

It's true that the SAOP patch needs relatively complicated
infrastructure to assess whether or not the technique is safe with a
given set of quals. You cannot safely get an ordered index scan for
something like "select * from table where a < 5 and b in (1,2,3) order
by a, b". With or without my patch. My patch isn't really changing all
that much about the behavior in nbtree, as these things go. It's
*surprising* how little has to change about the high level structure
of index scans, in fact.

(Actually, I'm glossing over a lot. The MDAM paper describes
techniques that'd make even the really challenging cases safe, through
a process of converting quals from conjunctive normal form into
disjunctive normal form. This is more or less the form that the state
machine implemented by _bt_advance_array_keys() produces already,
today. But even with all this practically all of the heavy lifting
takes place before the index scan even begins, during preprocessing --
so you still require surprisingly few changes to index scans
themselves.)

> If I simplify that a bit, we're just reordering the clauses in a way to
> maybe eliminate the heap fetch. The main risk seems to be that this will
> force an expensive qual to the front of the list, just because it can be
> evaluated on the index tuple.

My example query might have been poorly chosen, because it involved a
limit. What I'm thinking of is more general than that.

> But the difference would need to be worse
> than what we save by not doing the I/O - considering how expensive the
> I/O is, that seems unlikely. Could happen for expensive quals that don't
> really eliminate many rows, I guess.

That sounds like the same principle that I was talking about. I think
that it can be pushed quite far, though. I am mostly talking about the
worst case, and it seems like you might not be.

You can easily construct examples where some kind of skew causes big
problems with a multi-column index. I'm thinking of indexes whose
leading columns are low cardinality, and queries where including the
second column as an index qual looks kind of marginal to the
optimizer. Each grouping represented in the most significant index
column might easily have its own unique characteristics; the
distribution of values in subsequent columns might be quite
inconsistent across each grouping, in whatever way.

Since nothing stops a qual on a lower-order column from having a wildly
different selectivity for one particular grouping, it might not make
sense to say that a problem in this area is due to a bad selectivity
estimate. Even if we have perfect estimates, what good can they do if
the optimal strategy is to vary our strategy at runtime, *within* an
individual index scan, as different parts of the key space (different
groupings) are traversed through? To skip or not to skip, say. This
isn't about picking the cheapest plan, really.

That's another huge advantage of index quals -- they can (at least in
principle) allow you skip over big parts of the index when it just
ends up making sense, in whatever way, for whatever reason. In the
index, and in the heap. Often both. You'd likely always prefer to err
in the direction of having more index quals rather than fewer, when
doing so doesn't substantially change the plan itself. It could be
very cheap insurance, even without any limit. (It would probably also
be a lot faster, but it needn't be.)

> Anyway, I see this as extension of what order_qual_clauses() does. The
> main issue is that even order_qual_clauses() admits the estimates are
> somewhat unreliable, so we can't expect to make perfect decisions.

The attribute value independence assumption is wishful thinking, in no
small part -- it's quite surprising that it works as well as it does,
really.

-- 
Peter Geoghegan




Re: Support worker_spi to execute the function dynamically.

2023-07-20 Thread Michael Paquier
On Thu, Jul 20, 2023 at 03:44:12PM +0530, Bharath Rupireddy wrote:
> I don't think that change is correct. The worker_spi module essentially
> shows how to start bg workers with RegisterBackgroundWorker and dynamic
> bg workers with RegisterDynamicBackgroundWorker. If
> shared_preload_libraries = worker_spi is not specified, the workers
> registered with RegisterBackgroundWorker will never be started. Does
> giving an initdb-time database name to worker_spi.database work there?
> If the database for the bg workers doesn't exist, changing
> bgw_restart_time from BGW_NEVER_RESTART to, say, 1 will help to see the
> bg workers coming up eventually.

Yeah, it does not move the needle by much.  I think that we are
looking at switching this module to use a TAP test in the long term,
instead, where it would be possible to test the scenarios we want to
look at *with* and *without* shared_preload_libraries, especially with
the custom wait events for extensions in mind if we add our tests in
this module.

It does not change the fact that Ikeda-san is right about the launch
of dynamic workers with this module being broken, so I have applied v1
with the comment I suggested.  This will ease the implementation of
any follow-up test scenarios a bit, while avoiding an incorrect
pattern in this template module.
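
(For anyone picking this up for the TAP work: the dynamic launch path the
module is meant to demonstrate follows the standard bgworker pattern,
roughly as sketched below; the worker_spi names are assumed.)

    BackgroundWorker worker;
    BackgroundWorkerHandle *handle;
    BgwHandleStatus status;
    pid_t       pid;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
        BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    sprintf(worker.bgw_library_name, "worker_spi");
    sprintf(worker.bgw_function_name, "worker_spi_main");
    snprintf(worker.bgw_name, BGW_MAXLEN, "worker_spi dynamic worker");
    worker.bgw_main_arg = Int32GetDatum(0);     /* e.g., the worker's index */
    /* set bgw_notify_pid so we can wait for the worker to start */
    worker.bgw_notify_pid = MyProcPid;

    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
        ereport(ERROR,
                (errmsg("could not register background worker")));

    status = WaitForBackgroundWorkerStartup(handle, &pid);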
--
Michael


signature.asc
Description: PGP signature


Re: Use COPY for populating all pgbench tables

2023-07-20 Thread Michael Paquier
On Thu, Jul 20, 2023 at 02:22:51PM -0500, Tristan Partin wrote:
> Thanks for your testing Michael. I went ahead and added a test to make sure
> that this behavior doesn't regress accidentally, but I am struggling to get
> the test to fail using the previous version of this patch. Do you have any
> advice? This is my first time writing a test for Postgres. I can recreate
> the issue outside of the test script, but not within it for whatever reason.

We're all here to learn, and writing TAP tests is important these days
for a lot of patches.

+# Check that the pgbench_branches and pgbench_tellers filler columns are filled
+# with NULLs
+foreach my $table ('pgbench_branches', 'pgbench_tellers') {
+   my ($ret, $out, $err) = $node->psql(
+   'postgres',
+   "SELECT COUNT(1) FROM $table;
+SELECT COUNT(1) FROM $table WHERE filler is NULL;",
+   extra_params => ['--no-align', '--tuples-only']);
+
+   is($ret, 0, "psql $table counts status is 0");
+   is($err, '', "psql $table counts stderr is empty");
+   if ($out =~ m/^(\d+)\n(\d+)$/g) {
+   is($1, $2, "psql $table filler column filled with NULLs");
+   } else {
+   fail("psql $table stdout m/^(\\d+)\n(\\d+)$/g");
+   }
+}

This is IMO hard to parse, and I'd rather add some checks for the
accounts and history tables as well.  Let's use four simple SQL
queries like what I posted upthread (no data for history is inserted
after initialization), as in the attached.  I'd be tempted to apply
that first as a separate patch, because we've never added coverage for
that, and the code has specific expectations of this filler column for
TPC-B.  And this needs to cover both client-side and server-side data
generation.

Note that the indentation was a bit incorrect, so fixed while on it.

Attached is a v7, with these tests (they should be a patch of their own,
but I'm too lazy to split them this morning) and some more adjustments
that I have made while going through the patch.  What do you think?
--
Michael
From 211d66fc39338d522a72722fc3674e306826fd37 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 21 Jul 2023 11:07:31 +0900
Subject: [PATCH v7] Use COPY instead of INSERT for populating all tables

COPY is a better interface for bulk loading and/or high latency
connections. Previously COPY was only used for the pgbench_accounts
table because the amount of data was so much larger than the other
tables. However, as already said there are also gains to be had in the
high latency case, such as loading data across continents.
---
 src/bin/pgbench/pgbench.c| 155 +++
 src/bin/pgbench/t/001_pgbench_with_server.pl |  35 +
 doc/src/sgml/ref/pgbench.sgml|   9 +-
 3 files changed, 133 insertions(+), 66 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 087baa7d57..539c2795e2 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -835,6 +835,8 @@ static void add_socket_to_set(socket_set *sa, int fd, int idx);
 static int	wait_on_socket_set(socket_set *sa, int64 usecs);
 static bool socket_has_input(socket_set *sa, int fd, int idx);
 
+/* callback used to build rows for COPY during data loading */
+typedef void (*initRowMethod) (PQExpBufferData *sql, int64 curr);
 
 /* callback functions for our flex lexer */
 static const PsqlScanCallbacks pgbench_callbacks = {
@@ -4859,17 +4861,45 @@ initTruncateTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
-/*
- * Fill the standard tables with some data generated and sent from the client
- */
 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PQExpBufferData *sql, int64 curr)
 {
-	PQExpBufferData sql;
+	/* "filler" column uses NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t0\t\\N\n",
+	  curr + 1);
+}
+
+static void
+initTeller(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column uses NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\\N\n",
+	  curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to blank padded empty string */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / naccounts + 1);
+}
+
+static void
+initPopulateTable(PGconn *con, const char *table, int64 base,
+  initRowMethod init_row)
+{
+	int			n;
+	int			k;
+	int			chars = 0;
 	PGresult   *res;
-	int			i;
-	int64		k;
-	char	   *copy_statement;
+	PQExpBufferData sql;
+	char		copy_statement[256];
+	const char *copy_statement_fmt = "copy %s from stdin";
+	int64		total = base * scale;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4878,50 +4908,24 @@ initGenerateDataClientSide(PGconn *con)
 	/* Stay on the same line if reporting to a terminal */
 	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';
 
-	fprintf(stderr, "generating data (client-side)...\n");
-
-	/*
-	 * we do all of this in one transaction to 

Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-20 Thread Peter Smith
Some review comments for v21-0001

==
src/backend/replication/logical/worker.c

1. InitializeLogRepWorker

  if (am_tablesync_worker())
  ereport(LOG,
- (errmsg("logical replication table synchronization worker for
subscription \"%s\", table \"%s\" has started",
+ (errmsg("logical replication worker for subscription \"%s\", table
\"%s\" has started",
  MySubscription->name,
  get_rel_name(MyLogicalRepWorker->relid;

I think this should not be changed. IIUC, the decision to use the
generic worker name for translations applied only when the errmsg was
in shared code where the worker type was not clear from existing
conditions. See also previous review comments [1].

~~~

2. StartLogRepWorker

/* Common function to start the leader apply or tablesync worker. */
void
StartLogRepWorker(int worker_slot)
{
/* Attach to slot */
logicalrep_worker_attach(worker_slot);

/* Setup signal handling */
pqsignal(SIGHUP, SignalHandlerForConfigReload);
pqsignal(SIGTERM, die);
BackgroundWorkerUnblockSignals();

/*
* We don't currently need any ResourceOwner in a walreceiver process, but
* if we did, we could call CreateAuxProcessResourceOwner here.
*/

/* Initialise stats to a sanish value */
MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
MyLogicalRepWorker->reply_time = GetCurrentTimestamp();

/* Load the libpq-specific functions */
load_file("libpqwalreceiver", false);

InitializeLogRepWorker();

/* Connect to the origin and start the replication. */
elog(DEBUG1, "connecting to publisher using connection string \"%s\"",
MySubscription->conninfo);

/*
* Setup callback for syscache so that we know when something changes in
* the subscription relation state.
*/
CacheRegisterSyscacheCallback(SUBSCRIPTIONRELMAP,
  invalidate_syncing_table_states,
  (Datum) 0);
}

~

2a.
The function name seems a bit misleading because it is not really
"starting" anything here - it is just more "initialization" code,
right? Nor is it common to all kinds of LogRepWorker. Maybe the
function could be named something else like 'InitApplyOrSyncWorker()'.
-- see also #2c

~

2b.
Should this have Assert to ensure this is only called from leader
apply or tablesync? -- see also #2c

~

2c.
IMO maybe the best/tidiest way to do this is not to introduce a new
function at all. Instead, just put all this "common init" code into
the existing "common init" function ('InitializeLogRepWorker') and
execute it only if (am_tablesync_worker() || am_leader_apply_worker())
{ }.

==
src/include/replication/worker_internal.h

3.
 extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
 XLogRecPtr remote_lsn);
+extern void set_stream_options(WalRcvStreamOptions *options,
+char *slotname,
+XLogRecPtr *origin_startpos);
+
+extern void start_apply(XLogRecPtr origin_startpos);
+extern void DisableSubscriptionAndExit(void);
+extern void StartLogRepWorker(int worker_slot);

This placement (esp. with the missing whitespace) seems to be grouping
the set_stream_options with the other 'pa' externs, which are all
under the comment "/* Parallel apply worker setup and interactions
*/".

Putting all these up near the other "extern void
InitializeLogRepWorker(void)" might be less ambiguous.

--
[1] worker name in errmsg -
https://www.postgresql.org/message-id/CAA4eK1%2B%2BwkxxMjsPh-z2aKa9ZjNhKsjv0Tnw%2BTVX-hCBkDHusw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-20 Thread Peter Smith
On Thu, Jul 20, 2023 at 11:41 PM Melih Mutlu  wrote:
>
> Hi,
>
> Attached the updated patches with recent reviews addressed.
>
> See below for my comments:
>
> Peter Smith , on Wed, 19 Jul 2023 at 06:08, wrote:
>>
>> Some review comments for v19-0001
>>
>> 2. LogicalRepSyncTableStart
>>
>> /*
>> * Finally, wait until the leader apply worker tells us to catch up and
>> * then return to let LogicalRepApplyLoop do it.
>> */
>> wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
>>
>> ~
>>
>> Should LogicalRepApplyLoop still be mentioned here, since that is
>> static in worker.c? Maybe it is better to refer instead to the common
>> 'start_apply' wrapper? (see also #5a below)
>
>
> Isn't LogicalRepApplyLoop static on HEAD, and also mentioned in the exact 
> same comment in tablesync.c, while the common "start_apply" function also 
> exists? I'm not sure how such a change would be related to this patch.
>

Fair enough. I thought it was questionable for one module to refer to
another module's static functions, but you are correct - it is not
really related to your patch. Sorry for the noise.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: doc: improve the restriction description of using indexes on REPLICA IDENTITY FULL table.

2023-07-20 Thread Masahiko Sawada
On Wed, Jul 19, 2023 at 5:09 PM Önder Kalacı  wrote:
>
> Hi Masahiko, Amit, all
>
>> I've updated the patch.
>
>
> I think the flow is much nicer now compared to the HEAD. I really don't have 
> any
> comments regarding the accuracy of the code changes, all looks good to me.
> Overall, I cannot see any behavioral changes as you already alluded to.

Thank you for reviewing the patch.

>
> Maybe few minor notes regarding the comments:
>
>>  /*
>> + * And must reference the remote relation column. This is because if it
>> + * doesn't, the sequential scan is favorable over index scan in most
>> + * cases..
>> + */
>
>
> I think the reader might have lost the context (or might lose it, say, in
> the future due to another refactor etc.). So maybe start with:
>
>> /* And the leftmost index field must refer to the  ...

Fixed.

>
>
> Also, now in IsIndexUsableForReplicaIdentityFull() some of the conditions
> have comments and some don't. Should we comment on the missing ones as
> well, maybe such as:
>
>> /* partial indexes are not supported */
>> if (indexInfo->ii_Predicate != NIL)
>
> and,
>>
>> /* all indexes must have at least one attribute */
>> Assert(indexInfo->ii_NumIndexAttrs >= 1);

Agreed. But I don't think the latter comment is necessary as it's obvious.

>
>
>
>>
>>>
>>>
>>> BTW, IsIndexOnlyExpression() is not necessary but the current code
>>> still works fine. So do we need to backpatch it to PG16? I'm thinking
>>> we can apply it to only HEAD.
>>
>> Either way is fine but I think if we backpatch it then the code
>> remains consistent and the backpatching would be easier.
>
>
> Yeah, I also have a slight preference for backporting. It could make it
> easier to maintain the code in the future in case of further backports.
> With the cost of making it slightly harder for you now :)

Agreed.

I've attached the updated patch. I'll push it early next week, barring
any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v4-0001-Remove-unnecessary-checks-for-indexes-for-REPLICA.patch
Description: Binary data


Re: logicalrep_message_type throws an error

2023-07-20 Thread Masahiko Sawada
On Fri, Jul 21, 2023 at 1:39 AM Ashutosh Bapat
 wrote:
>
> On Thu, Jul 20, 2023 at 9:11 AM Amit Kapila  wrote:
> > >
> > > Okay, changed it accordingly.
> > >
> >
> > oops, forgot to attach the patch.
> >
>
> WFM
>
> @@ -3367,7 +3367,7 @@ apply_dispatch(StringInfo s)
>  default:
>  ereport(ERROR,
>  (errcode(ERRCODE_PROTOCOL_VIOLATION),
> - errmsg("invalid logical replication message type
> \"%c\"", action)));
> + errmsg("invalid logical replication message type
> \"??? (%d)\"", action)));
>
> I think we should report the character here since that's what is visible
> in the code, and also the message types are communicated as characters,
> not integers. That makes debugging easier. The context may report the
> integer as an additional data point.

I think it could confuse us if an invalid message is not a printable character.
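
One way to address both concerns (an illustrative sketch, not a committed
wording) would be to include the character only when it is printable:

    default:
        if (isprint((unsigned char) action))
            ereport(ERROR,
                    (errcode(ERRCODE_PROTOCOL_VIOLATION),
                     errmsg("invalid logical replication message type \"%c\" (%d)",
                            action, action)));
        else
            ereport(ERROR,
                    (errcode(ERRCODE_PROTOCOL_VIOLATION),
                     errmsg("invalid logical replication message type %d",
                            action)));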

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Row pattern recognition

2023-07-20 Thread Vik Fearing

On 7/21/23 01:36, Jacob Champion wrote:

There's also the issue of backtracking in the face of reclassification,
as I think Vik was alluding to upthread.


We definitely need some kind of backtracking or other reclassification 
method.



(I've attached two failing tests against v2, to hopefully better
illustrate, along with what I _think_ should be the correct results.)


Almost.  You are matching 07-01-2023 but the condition is "price > 100".
--
Vik Fearing





Re: Row pattern recognition

2023-07-20 Thread Jacob Champion
Hi Ishii-san,

On 7/19/23 22:15, Tatsuo Ishii wrote:
> Currently my patch has a limitation for the sake of simple
> implementation: a pattern like "A+" is parsed and analyzed in the raw
> parser. This makes subsequent process much easier because the pattern
> element variable (in this case "A") and the quantifier (in this case
> "+") is already identified by the raw parser. However there are much
> more cases are allowed in the standard as you already pointed out. For
> those cases probably we should give up to parse PATTERN items in the
> raw parser, and instead the raw parser just accepts the elements as
> Sconst?

Is there a concern that the PATTERN grammar can't be represented in
Bison? I thought it was all context-free... Or is the concern that the
parse tree of the pattern is hard to feed into a regex engine?

> Any comments, especially on the PREV/NEXT implementation part, are
> welcome. Currently a DEFINE expression like "price > PREV(price)" is
> prepared in ExecInitWindowAgg using ExecInitExpr, tweaking var->varno
> in the Var node so that PREV uses OUTER_VAR and NEXT uses INNER_VAR.
> Then the expression is evaluated in ExecWindowAgg using ExecEvalExpr,
> setting the previous row's TupleSlot to ExprContext->ecxt_outertuple,
> and the next row's TupleSlot to ExprContext->ecxt_innertuple. I think
> this is a temporary hack and should be gotten rid of before v1 is
> committed. Better idea?

I'm not familiar enough with this code yet to offer very concrete
suggestions, sorry... But at some point in the future, we need to be
able to skip forward and backward from arbitrary points, like

DEFINE B AS B.price > PREV(FIRST(A.price), 3)

so there won't be just one pair of "previous and next" tuples. Maybe
that can help clarify the design? It feels like it'll need to eventually
be a "real" function that operates on the window state, even if it
doesn't support all of the possible complexities in v1.

--

Taking a closer look at the regex engine:

It looks like the + quantifier has trouble when it meets the end of the
frame. For instance, try removing the last row of the 'stock' table in
rpr.sql; some of the final matches will disappear unexpectedly. Or try a
pattern like

PATTERN ( a+ )
 DEFINE a AS TRUE

which doesn't seem to match anything in my testing.

There's also the issue of backtracking in the face of reclassification,
as I think Vik was alluding to upthread. The pattern

PATTERN ( a+ b+ )
 DEFINE a AS col = 2,
b AS col = 2

doesn't match a sequence of values (2 2 ...) with the current
implementation, even with a dummy row at the end to avoid the
end-of-frame bug.

(I've attached two failing tests against v2, to hopefully better
illustrate, along with what I _think_ should be the correct results.)

I'm not quite understanding the match loop in evaluate_pattern(). It
looks like we're building up a string to pass to the regex engine, but
by the time we call regexp_instr, don't we already know whether or not the
pattern will match based on the expression evaluation we've done?

Thanks,
--Jacob

diff --git a/src/test/regress/expected/rpr.out b/src/test/regress/expected/rpr.out
index 6bf8818911..68e9a98684 100644
--- a/src/test/regress/expected/rpr.out
+++ b/src/test/regress/expected/rpr.out
@@ -230,6 +230,79 @@ SELECT company, tdate, price, rpr(price) OVER w FROM stock
  company2 | 07-10-2023 |  1300 | 
 (20 rows)
 
+-- match everything
+SELECT company, tdate, price, rpr(price) OVER w FROM stock
+ WINDOW w AS (
+ PARTITION BY company
+ ORDER BY tdate
+ ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
+ AFTER MATCH SKIP TO NEXT ROW
+ INITIAL
+ PATTERN (A+)
+ DEFINE
+  A AS TRUE
+);
+ company  |   tdate| price | rpr 
+--++---+-
+ company1 | 07-01-2023 |   100 | 100
+ company1 | 07-02-2023 |   200 |
+ company1 | 07-03-2023 |   150 |
+ company1 | 07-04-2023 |   140 |
+ company1 | 07-05-2023 |   150 |
+ company1 | 07-06-2023 |90 |
+ company1 | 07-07-2023 |   110 |
+ company1 | 07-08-2023 |   130 |
+ company1 | 07-09-2023 |   120 |
+ company1 | 07-10-2023 |   130 |
+ company2 | 07-01-2023 |50 |  50
+ company2 | 07-02-2023 |  2000 |
+ company2 | 07-03-2023 |  1500 |
+ company2 | 07-04-2023 |  1400 |
+ company2 | 07-05-2023 |  1500 |
+ company2 | 07-06-2023 |60 |
+ company2 | 07-07-2023 |  1100 |
+ company2 | 07-08-2023 |  1300 |
+ company2 | 07-09-2023 |  1200 |
+ company2 | 07-10-2023 |  1300 |
+(20 rows)
+
+-- backtracking with reclassification of rows
+SELECT company, tdate, price, rpr(price) OVER w FROM stock
+ WINDOW w AS (
+ PARTITION BY company
+ ORDER BY tdate
+ ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
+ AFTER MATCH SKIP TO NEXT ROW
+ INITIAL
+ PATTERN (A+ B+)
+ DEFINE
+  A AS price > 100,
+  B AS price > 100
+);
+ company  |   tdate| price | rpr  
+--++---+--
+ company1 | 07-01-2023 |   100 |  100
+ company1 | 07-02-2023 |   200 | 
+ 

Re: Atomic ops for unlogged LSN

2023-07-20 Thread John Morris
Based on your feedback, I’ve updated the patch with additional comments.

  1.  Explain the two cases when writing to the control file, and
  2.  put a bit more emphasis on unloggedLSNs not being valid after a crash.

Thanks to y’all.
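
(For context, the mechanical core of the patch: the spinlock-protected
counter in GetFakeLSNForUnloggedRel() becomes a 64-bit atomic fetch-add,
roughly as sketched below, assuming the existing XLogCtl field name.)

XLogRecPtr
GetFakeLSNForUnloggedRel(void)
{
	/* The 64-bit atomic fetch-add replaces the ulsn_lck spinlock. */
	return pg_atomic_fetch_add_u64(&XLogCtl->unloggedLSN, 1);
}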

  *   John


v2-0001-Use-atomic-ops-for-unloggedLSNs.patch
Description: v2-0001-Use-atomic-ops-for-unloggedLSNs.patch


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-20 Thread David Rowley
On Thu, 20 Jul 2023 at 20:37, Dean Rasheed  wrote:
>
> On Thu, 20 Jul 2023 at 00:56, David Rowley  wrote:
> I agree with the principal though. In the attached updated patch, I
> replaced that test with a simpler one:
>
> +/*
> + * Process the number's digits. We optimize for decimal input (expected 
> to
> + * be the most common case) first. Anything that doesn't start with a 
> base
> + * prefix indicator must be decimal.
> + */
> +
> +   /* process decimal digits */
> +   if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))
>
> So hopefully any compiler should only need to do the one comparison
> against '0' for any non-zero decimal input.
>
> Doing that didn't give any measurable performance improvement for me,
> but it did at least not make it noticeably worse, and seems more
> likely to generate better code with not-so-smart compilers. I'd be
> interested to know how that performs for you (and if your compiler
> really does generate 3 "first digit is '0'" tests for the unpatched
> code).

That seems better.  I benchmarked it on two machines:

gcc12.2/linux/amd3990x
create table a (a int);
insert into a select x from generate_series(1,1000)x;
copy a to '/tmp/a.dump';

master @ ab29a7a9c
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 2226.912 ms (00:02.227)
Time: 2214.168 ms (00:02.214)
Time: 2206.481 ms (00:02.206)
Time: 2219.640 ms (00:02.220)
Time: 2205.004 ms (00:02.205)

master + pg_strtoint32_base_10_first.v2.patch
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 2067.108 ms (00:02.067)
Time: 2070.401 ms (00:02.070)
Time: 2073.423 ms (00:02.073)
Time: 2065.407 ms (00:02.065)
Time: 2066.536 ms (00:02.067) (~7% faster)

apple m2 pro/clang

master @ 9089287a
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 1286.369 ms (00:01.286)
Time: 1300.534 ms (00:01.301)
Time: 1295.661 ms (00:01.296)
Time: 1296.404 ms (00:01.296)
Time: 1268.361 ms (00:01.268)
Time: 1259.321 ms (00:01.259)

master + pg_strtoint32_base_10_first.v2.patch
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 1253.519 ms (00:01.254)
Time: 1235.256 ms (00:01.235)
Time: 1269.501 ms (00:01.270)
Time: 1267.801 ms (00:01.268)
Time: 1275.758 ms (00:01.276)
Time: 1261.478 ms (00:01.261) (a bit noisy but avg of ~1.8% faster)

David




Re: Bytea PL/Perl transform

2023-07-20 Thread Ivan Panchenko
>Friday, 14 July 2023, 23:27 +03:00, from Tom Lane :
> 
>Ivan Panchenko < w...@mail.ru > writes:
>> Thursday, 6 July 2023, 14:48 +03:00, from Peter Eisentraut <  
>> pe...@eisentraut.org >:
>>> If the transform deals with a built-in type, then they should just be
>>> added to the respective pl extension directly.
>
>> The new extension bytea_plperl can be easily moved into plperl now, but what 
>> should we do with the existing ones, namely jsonb_plperl and bool_plperl?
>> If we leave them where they are, it would be hard to explain why some 
>> transforms are inside plperl while other ones live separately. If we move 
>> them into plperl also, wouldn’t it break some compatibility?
>
>It's kind of a mess, indeed. But I think we could make plperl 1.1
>contain the additional transforms and just tell people they have
>to drop the obsolete extensions before they upgrade to 1.1.
>Fortunately, it doesn't look like functions using a transform
>have any hard dependency on the transform, so "drop extension
>jsonb_plperl" followed by "alter extension plperl update" should
>work without cascading to all your plperl functions.
>
>Having said that, we'd still be in the position of having to
>explain why some transforms are packaged with plperl and others
>not. The distinction between built-in and contrib types might
>be obvious to us hackers, but I bet a lot of users see it as
>pretty artificial. So maybe the existing packaging design is
>fine and we should just look for a way to reduce the code
>duplication.
The code duplication between different transforms is not in the C code, but
mostly in the copy-pasted Makefile, *.control, *.sql and meson.build files.
These files could be generated from some universal templates. But, keeping in
mind Windows compatibility and the presence of three build systems, this
hardly looks like a simplification.
Probably, for the present, it would be better to keep the existing code
duplication until plperl 1.1.
Nevertheless, dealing with code duplication is a wider task than the bytea
transform, so let me suggest keeping this extension in its present form. If we
decide what to do about the duplication, that would be another patch.

The mesonified and rebased version of the transform patch is attached.
>
>regards, tom lane
> 
Regards, Ivandiff --git a/contrib/Makefile b/contrib/Makefile
index bbf220407b..bb997dda69 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -78,9 +78,9 @@ ALWAYS_SUBDIRS += sepgsql
 endif
 
 ifeq ($(with_perl),yes)
-SUBDIRS += bool_plperl hstore_plperl jsonb_plperl
+SUBDIRS += bool_plperl bytea_plperl hstore_plperl jsonb_plperl
 else
-ALWAYS_SUBDIRS += bool_plperl hstore_plperl jsonb_plperl
+ALWAYS_SUBDIRS += bool_plperl bytea_plperl hstore_plperl jsonb_plperl
 endif
 
 ifeq ($(with_python),yes)
diff --git a/contrib/bytea_plperl/bytea_plperl--1.0.sql b/contrib/bytea_plperl/bytea_plperl--1.0.sql
new file mode 100644
index 00..6544b2ac85
--- /dev/null
+++ b/contrib/bytea_plperl/bytea_plperl--1.0.sql
@@ -0,0 +1,19 @@
+/* contrib/bytea_plperl/bytea_plperl--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bytea_plperl" to load this file. \quit
+
+CREATE FUNCTION bytea_to_plperl(val internal) RETURNS internal
+LANGUAGE C STRICT IMMUTABLE
+AS 'MODULE_PATHNAME';
+
+CREATE FUNCTION plperl_to_bytea(val internal) RETURNS bytea
+LANGUAGE C STRICT IMMUTABLE
+AS 'MODULE_PATHNAME';
+
+CREATE TRANSFORM FOR bytea LANGUAGE plperl (
+FROM SQL WITH FUNCTION bytea_to_plperl(internal),
+TO SQL WITH FUNCTION plperl_to_bytea(internal)
+);
+
+COMMENT ON TRANSFORM FOR bytea LANGUAGE plperl IS 'transform between bytea and Perl';
diff --git a/contrib/bytea_plperl/bytea_plperl.c b/contrib/bytea_plperl/bytea_plperl.c
new file mode 100644
index 00..7ff16040c9
--- /dev/null
+++ b/contrib/bytea_plperl/bytea_plperl.c
@@ -0,0 +1,36 @@
+/*
+ * contrib/bytea_plperl/bytea_plperl.c
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "plperl.h"
+#include "varatt.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(bytea_to_plperl);
+PG_FUNCTION_INFO_V1(plperl_to_bytea);
+
+Datum
+bytea_to_plperl(PG_FUNCTION_ARGS)
+{
+	dTHX;
+	bytea *in = PG_GETARG_BYTEA_PP(0);
+	return PointerGetDatum(newSVpvn_flags( (char *) VARDATA_ANY(in), VARSIZE_ANY_EXHDR(in), 0 ));
+}
+
+Datum
+plperl_to_bytea(PG_FUNCTION_ARGS)
+{
+	dTHX;
+	bytea *result;
+	STRLEN len;
+	SV *in = (SV *) PG_GETARG_POINTER(0);
+	char *ptr = SvPVbyte(in, len);
+	result = palloc(VARHDRSZ + len );
+	SET_VARSIZE(result, VARHDRSZ + len );
+	memcpy(VARDATA_ANY(result), ptr,len );
+	PG_RETURN_BYTEA_P(result);
+}
diff --git a/contrib/bytea_plperl/bytea_plperl.control b/contrib/bytea_plperl/bytea_plperl.control
new file mode 100644
index 00..9ff0f2a8dd
--- /dev/null
+++ b/contrib/bytea_plperl/bytea_plperl.control
@@ -0,0 +1,7 @@
+# bytea_plperl extension
+comment = 'transform between bytea and plperl'
+default_version = '1.0'
+module_pathname = 

Re: In Postgres 16 BETA, should the ParseNamespaceItem have the same index as it's RangeTableEntry?

2023-07-20 Thread Farias de Oliveira
Hello,

Thank you for the help, guys, and I'm so sorry for my late response. Indeed,
the error lies in the ResultRelInfo. The GetResultRTEPermissionInfo()
function does a check on the relinfo->ri_RootResultRelInfo variable.
I believe that it should go inside this scope:


if (relinfo->ri_RootResultRelInfo)
{
/*
 * For inheritance child result relations (a partition routing 
target
 * of an INSERT or a child UPDATE target), this returns the root
 * parent's RTE to fetch the RTEPermissionInfo because that's 
the only
 * one that has one assigned.
 */
rti = relinfo->ri_RootResultRelInfo->ri_RangeTableIndex;
}

The relinfo contains:

{type = T_ResultRelInfo, ri_RangeTableIndex = 5, ri_RelationDesc =
0x7f44e3308cc8, ri_NumIndices = 0, ri_IndexRelationDescs = 0x0,
ri_IndexRelationInfo = 0x0, ri_RowIdAttNo = 0,
  ri_extraUpdatedCols = 0x0, ri_projectNew = 0x0, ri_newTupleSlot =
0x0, ri_oldTupleSlot = 0x0, ri_projectNewInfoValid = false,
ri_TrigDesc = 0x0, ri_TrigFunctions = 0x0,
  ri_TrigWhenExprs = 0x0, ri_TrigInstrument = 0x0, ri_ReturningSlot =
0x0, ri_TrigOldSlot = 0x0, ri_TrigNewSlot = 0x0, ri_FdwRoutine = 0x0,
ri_FdwState = 0x0,
  ri_usesFdwDirectModify = false, ri_NumSlots = 0,
ri_NumSlotsInitialized = 0, ri_BatchSize = 0, ri_Slots = 0x0,
ri_PlanSlots = 0x0, ri_WithCheckOptions = 0x0,
  ri_WithCheckOptionExprs = 0x0, ri_ConstraintExprs = 0x0,
ri_GeneratedExprsI = 0x0, ri_GeneratedExprsU = 0x0,
ri_NumGeneratedNeededI = 0, ri_NumGeneratedNeededU = 0,
  ri_returningList = 0x0, ri_projectReturning = 0x0,
ri_onConflictArbiterIndexes = 0x0, ri_onConflict = 0x0,
ri_matchedMergeAction = 0x0, ri_notMatchedMergeAction = 0x0,
  ri_PartitionCheckExpr = 0x0, ri_ChildToRootMap = 0x0,
ri_ChildToRootMapValid = false, ri_RootToChildMap = 0x0,
ri_RootToChildMapValid = false, ri_RootResultRelInfo = 0x0,
  ri_PartitionTupleSlot = 0x0, ri_CopyMultiInsertBuffer = 0x0,
ri_ancestorResultRels = 0x0}

Since relinfo->ri_RootResultRelInfo is 0x0, rti never gets assigned,
and Postgres will interpret that the ResultRelInfo must have been
created only for filtering triggers and that the relation is not being
inserted into.
The relinfo variable is created with the
create_entity_result_rel_info() function:

ResultRelInfo *create_entity_result_rel_info(EState *estate, char *graph_name,
 char *label_name)
{
RangeVar *rv;
Relation label_relation;
ResultRelInfo *resultRelInfo;

ParseState *pstate = make_parsestate(NULL);

resultRelInfo = palloc(sizeof(ResultRelInfo));

if (strlen(label_name) == 0)
{
rv = makeRangeVar(graph_name, AG_DEFAULT_LABEL_VERTEX, -1);
}
else
{
rv = makeRangeVar(graph_name, label_name, -1);
}

label_relation = parserOpenTable(pstate, rv, RowExclusiveLock);

// initialize the resultRelInfo
InitResultRelInfo(resultRelInfo, label_relation,
  list_length(estate->es_range_table), NULL,
  estate->es_instrument);

    // open the indexes
ExecOpenIndices(resultRelInfo, false);

free_parsestate(pstate);

return resultRelInfo;
}

In this case, how can we get the relinfo->ri_RootResultRelInfo to
store the appropriate data?
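
(One possibly relevant detail: ri_RootResultRelInfo is normally filled in
by InitResultRelInfo() itself via its fourth parameter, which the call
above passes as NULL. For reference, the PG16 signature is:)

void
InitResultRelInfo(ResultRelInfo *resultRelInfo,
                  Relation resultRelationDesc,
                  Index resultRelationIndex,
                  ResultRelInfo *partition_root_rri,
                  int instrument_options);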

Thank you,

Matheus Farias


On Tue, Jul 18, 2023 at 06:58, Amit Langote  wrote:

> Hello,
>
> On Sat, Jul 15, 2023 at 4:43 AM Farias de Oliveira
>  wrote:
> > I believe I have found something interesting that might be the root of
> the problem with RTEPermissionInfo. But I do not know how to fix it
> exactly. In AGE's code, the execution of it goes through a function called
> analyze_cypher_clause() which does the following:
> >
> > It ends up going inside other functions and changing it more a bit, but
> at the end of one of these functions it assigns values to some members of
> the query:
> >
> > query->targetList = lappend(query->targetList, tle);
> > query->rtable = pstate->p_rtable;
> > query->jointree = makeFromExpr(pstate->p_joinlist, NULL);
> >
> > I assume that what is missing here is the assignment of query->rteperminfos
> > to be the same as pstate->p_rteperminfos, but the pstate has the following
> > values:
> >
> > {pstate = {parentParseState = 0x0, p_sourcetext = 0x2b06ef0 "MATCH (n)
> SET n.i = 3", p_rtable = 0x2c6e590,
> > p_rteperminfos = 0x0, p_joinexprs = 0x0, p_nullingrels = 0x0,
> p_joinlist = 0x2c6e678, p_namespace = 0x2c6e6c8,
> > p_lateral_active = false, p_ctenamespace = 0x0, p_future_ctes = 0x0,
> p_parent_cte = 0x0, p_target_relation = 0x0,
> > p_target_nsitem = 0x0, p_is_insert = false, p_windowdefs = 0x0,
> p_expr_kind = EXPR_KIND_NONE, p_next_resno = 3,
> > p_multiassign_exprs = 0x0, p_locking_clause = 0x0,
> p_locked_from_parent = false, p_resolve_unknowns = true,
> > p_queryEnv = 0x0, p_hasAggs = false, p_hasWindowFuncs = false,
> p_hasTargetSRFs = false, 

Re: Use COPY for populating all pgbench tables

2023-07-20 Thread Tristan Partin

On Wed Jul 19, 2023 at 10:07 PM CDT, Michael Paquier wrote:

So this patch causes pgbench to not stick with its historical
behavior, and the change is incompatible with the comments because the
tellers and branches tables don't use NULL for their filler attribute
anymore.


Great find. This was a problem of me just not understanding the COPY 
command properly. Relevant documentation snippet:


Specifies the string that represents a null value. The default is \N 
(backslash-N) in text format, and an unquoted empty string in CSV 
format. You might prefer an empty string even in text format for cases 
where you don't want to distinguish nulls from empty strings. This 
option is not allowed when using binary format.


This new revision populates the column with the NULL value.
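
In text format, that NULL is spelled \N on the wire. A minimal libpq
sketch (assuming a reachable server and an existing pgbench_branches
table) of the row pgbench now effectively sends:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");	/* default connection settings */
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	res = PQexec(conn, "copy pgbench_branches from stdin");
	if (PQresultStatus(res) != PGRES_COPY_IN)
	{
		fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
		return 1;
	}
	PQclear(res);

	/* bid = 1, bbalance = 0, filler = NULL: \N, not an empty string */
	PQputCopyData(conn, "1\t0\t\\N\n", 7);
	PQputCopyEnd(conn, NULL);

	res = PQgetResult(conn);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
		fprintf(stderr, "COPY end failed: %s", PQerrorMessage(conn));
	PQclear(res);
	PQfinish(conn);
	return 0;
}

An empty field, by contrast, loads as an empty string, which is what the
accounts filler still uses.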


psql (17devel)
Type "help" for help.

tristan957=# select count(1) from pgbench_branches;
 count 
-------
     1
(1 row)

tristan957=# select count(1) from pgbench_branches where filler is null;
 count 
-------
     1
(1 row)


Thanks for your testing, Michael. I went ahead and added a test to make 
sure that this behavior doesn't regress accidentally, but I am 
struggling to get the test to fail using the previous version of this 
patch. Do you have any advice? This is my first time writing a test for 
Postgres. I can recreate the issue outside of the test script, but not 
within it, for whatever reason.


--
Tristan Partin
Neon (https://neon.tech)
From cf84b3ea2d7b583aaee3f80807b26ef4521db0f6 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 23 May 2023 11:48:16 -0500
Subject: [PATCH v6] Use COPY instead of INSERT for populating all tables

COPY is a better interface for bulk loading and/or high latency
connections. Previously COPY was only used for the pgbench_accounts
table because the amount of data was so much larger than the other
tables. However, as already said there are also gains to be had in the
high latency case, such as loading data across continents.
---
 doc/src/sgml/ref/pgbench.sgml|   8 +-
 src/bin/pgbench/pgbench.c| 151 +++
 src/bin/pgbench/t/001_pgbench_with_server.pl |  18 +++
 3 files changed, 110 insertions(+), 67 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 850028557d..4424595cc6 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -231,10 +231,10 @@ pgbench  options  d
 extensively through a COPY.
 pgbench uses the FREEZE option with version 14 or later
 of PostgreSQL to speed up
-subsequent VACUUM, unless partitions are enabled.
-Using g causes logging to print one message
-every 100,000 rows while generating data for the
-pgbench_accounts table.
+subsequent VACUUM. If partitions are enabled for
+the pgbench_accounts table, the FREEZE option is
+not enabled. Using g causes logging to print one
+message every 100,000 rows while generating data for all tables.


 With G (server-side data generation),
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 087baa7d57..ef01d049a1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4859,17 +4859,44 @@ initTruncateTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
-/*
- * Fill the standard tables with some data generated and sent from the client
- */
+typedef void initRow(PQExpBufferData *sql, int64 curr);
+
 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PQExpBufferData *sql, int64 curr)
+{
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t0\t\\N\n",
+	  curr + 1);
+}
+
+static void
+initTeller(PQExpBufferData *sql, int64 curr)
+{
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\\N\n",
+	  curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PQExpBufferData *sql, int64 curr)
 {
+	/* "filler" column defaults to blank padded empty string */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / naccounts + 1);
+}
+
+static void
+initPopulateTable(PGconn *con, const char *table, int64 base, initRow *init_row)
+{
+	int			n;
+	int			k;
+	int			chars = 0;
+	PGresult	*res;
 	PQExpBufferData sql;
-	PGresult   *res;
-	int			i;
-	int64		k;
-	char	   *copy_statement;
+	char 		copy_statement[256];
+	const char *copy_statement_fmt = "copy %s from stdin";
+	const int64 total = base * scale;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4878,50 +4905,24 @@ initGenerateDataClientSide(PGconn *con)
 	/* Stay on the same line if reporting to a terminal */
 	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';
 
-	fprintf(stderr, "generating data (client-side)...\n");
-
-	/*
-	 * we do all of this in one transaction to enable the backend's
-	 * data-loading optimizations
-	 */
-	
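
The hunk is truncated here; given the signatures above, the per-table
call sites presumably reduce to something like the following (a sketch
inferred from the definitions, not taken verbatim from the patch):

	initPopulateTable(con, "pgbench_branches", nbranches, initBranch);
	initPopulateTable(con, "pgbench_tellers", ntellers, initTeller);
	initPopulateTable(con, "pgbench_accounts", naccounts, initAccount);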

Re: Inefficiency in parallel pg_restore with many tables

2023-07-20 Thread Nathan Bossart
Here is a work-in-progress patch set for converting ready_list to a
priority queue.  On my machine, Tom's 100k-table example [0] takes 11.5
minutes without these patches and 1.5 minutes with them.

One item that requires more thought is binaryheap's use of Datum.  AFAICT
the Datum definitions live in postgres.h and aren't available to frontend
code.  I think we'll either need to move the Datum definitions to c.h or to
adjust binaryheap to use "void *".

[0] https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
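
For reference, the min-heap "keep the top K" pattern that both pgarch.c
and this ready_list conversion rely on looks roughly like this (a
sketch; the comparator and item type are invented for the example):

#include "postgres.h"
#include "lib/binaryheap.h"

/*
 * Inverted comparison turns binaryheap's max-heap convention into a
 * min-heap: the smallest kept value sits at the root.
 */
static int
int_min_cmp(Datum a, Datum b, void *arg)
{
	int32		va = DatumGetInt32(a);
	int32		vb = DatumGetInt32(b);

	return (va < vb) ? 1 : (va > vb) ? -1 : 0;
}

/*
 * Keep the K largest values seen so far: add until the heap is full,
 * then replace the root (the current minimum) whenever a larger value
 * arrives.  The heap comes from binaryheap_allocate(k, int_min_cmp, NULL).
 */
static void
keep_top_k(binaryheap *heap, int k, int32 value)
{
	if (heap->bh_size < k)
		binaryheap_add(heap, Int32GetDatum(value));
	else if (DatumGetInt32(binaryheap_first(heap)) < value)
		binaryheap_replace_first(heap, Int32GetDatum(value));
}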

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From bbf7b32f0abd0e9e2f7a0759cc79fcccbb5f9281 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 19 Jul 2023 15:54:00 -0700
Subject: [PATCH v2 1/4] misc binaryheap fixes

---
 src/backend/postmaster/pgarch.c | 12 ++--
 src/backend/replication/logical/reorderbuffer.c |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 46af349564..c28e6f2070 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -108,7 +108,7 @@ static ArchiveModuleState *archive_module_state;
  * archive_status.  Minimizing the number of directory scans when there are
  * many files to archive can significantly improve archival rate.
  *
- * arch_heap is a max-heap that is used during the directory scan to track
+ * arch_heap is a min-heap that is used during the directory scan to track
  * the highest-priority files to archive.  After the directory scan
  * completes, the file names are stored in ascending order of priority in
  * arch_files.  pgarch_readyXlog() returns files from arch_files until it
@@ -248,7 +248,7 @@ PgArchiverMain(void)
 	arch_files = palloc(sizeof(struct arch_files_state));
 	arch_files->arch_files_size = 0;
 
-	/* Initialize our max-heap for prioritizing files to archive. */
+	/* Initialize our min-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
 ready_file_comparator, NULL);
 
@@ -624,7 +624,7 @@ pgarch_readyXlog(char *xlog)
 		basename[basenamelen] = '\0';
 
 		/*
-		 * Store the file in our max-heap if it has a high enough priority.
+		 * Store the file in our min-heap if it has a high enough priority.
 		 */
 		if (arch_files->arch_heap->bh_size < NUM_FILES_PER_DIRECTORY_SCAN)
 		{
@@ -644,15 +644,15 @@ pgarch_readyXlog(char *xlog)
 			 * Remove the lowest priority file and add the current one to the
 			 * heap.
 			 */
-			arch_file = DatumGetCString(binaryheap_remove_first(arch_files->arch_heap));
+			arch_file = DatumGetCString(binaryheap_first(arch_files->arch_heap));
 			strcpy(arch_file, basename);
-			binaryheap_add(arch_files->arch_heap, CStringGetDatum(arch_file));
+			binaryheap_replace_first(arch_files->arch_heap, CStringGetDatum(arch_file));
 		}
 	}
 	FreeDir(rldir);
 
 	/* If no files were found, simply return. */
-	if (arch_files->arch_heap->bh_size == 0)
+	if (binaryheap_empty(arch_files->arch_heap))
 		return false;
 
 	/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 26d252bd87..05bbcecd29 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1381,7 +1381,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	int32		off;
 
 	/* nothing there anymore */
-	if (state->heap->bh_size == 0)
+	if (binaryheap_empty(state->heap))
 		return NULL;
 
 	off = DatumGetInt32(binaryheap_first(state->heap));
-- 
2.25.1

>From 8f0339263dd21bb04c1b96a0e850a9b57a2e3f8b Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 20 Jul 2023 09:45:03 -0700
Subject: [PATCH v2 2/4] make binaryheap available to frontend

---
 src/backend/lib/Makefile |  1 -
 src/backend/lib/meson.build  |  1 -
 src/common/Makefile  |  1 +
 src/{backend/lib => common}/binaryheap.c | 15 
 src/common/meson.build   |  1 +
 src/include/lib/binaryheap.h | 29 
 6 files changed, 46 insertions(+), 2 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (96%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git 

Re: There should be a way to use the force flag when restoring databases

2023-07-20 Thread Gurjeet Singh
On Thu, Jul 20, 2023 at 2:10 AM Daniel Gustafsson  wrote:
>
> > On 19 Jul 2023, at 19:28, Gurjeet Singh  wrote:
> >
> > The docs for 'pg_restore --create` say "Create the database before
> > restoring into it. If --clean is also specified, drop and recreate the
> > target database before connecting to it."
> >
> > If we provided a force option, it may then additionally say: "If the
> > --clean and --force options are specified, DROP DATABASE ... WITH
> > FORCE command will be used to drop the database."
>
> pg_restore --clean refers to dropping any pre-existing database objects and 
> not
> just databases, but --force would only apply to databases.
>
> I wonder if it's worth complicating pg_restore with that when running dropdb
> --force before pg_restore is an option for those wanting to use WITH FORCE.

Fair point. But the same argument could've been applied to the --clean
option as well; why overload the meaning of --clean and make it drop the
database, when running dropdb before pg_restore was an option?

IMHO, if pg_restore offers to drop database, providing an option to
the user to do it forcibly is not that much of a stretch, and within
reason for the user to expect it to be there, like Joan did.
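
For illustration, the pg_restore side could be a small addition to how
the DROP statement is built (a sketch with invented names, not a patch
against pg_restore):

#include "postgres_fe.h"
#include "fe_utils/string_utils.h"

/* "force_given" is a hypothetical flag a --force option would set */
static void
append_drop_database(PQExpBuffer buf, const char *dbname, bool force_given)
{
	appendPQExpBuffer(buf, "DROP DATABASE %s", fmtId(dbname));
	if (force_given)
		appendPQExpBufferStr(buf, " WITH (FORCE)");
	appendPQExpBufferStr(buf, ";\n");
}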

Best regards,
Gurjeet
http://Gurje.et




Re: logical decoding and replication of sequences, take 2

2023-07-20 Thread Tomas Vondra
FWIW there's two questions related to the switch to XLOG_SMGR_CREATE.

1) Does smgr_decode() need to do the same block as sequence_decode()?

/* Skip the change if already processed (per the snapshot). */
if (transactional &&
!SnapBuildProcessChange(builder, xid, buf->origptr))
return;
else if (!transactional &&
 (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
  SnapBuildXactNeedsSkip(builder, buf->origptr)))
return;

I don't think it does. Also, we don't have any transactional flag here.
Or rather, everything is transactional ...


2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
I was thinking maybe we should have it in the transaction (because we
need to do cleanup at the end). It seems a bit inconvenient, because then
we'd need to either search the htabs in all subxacts, or transfer the
entries to the top-level xact (OTOH, we already do that with snapshots),
and clean up on abort.

What do you think?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logicalrep_message_type throws an error

2023-07-20 Thread Ashutosh Bapat
On Thu, Jul 20, 2023 at 9:11 AM Amit Kapila  wrote:
> >
> > Okay, changed it accordingly.
> >
>
> oops, forgot to attach the patch.
>

WFM

@@ -3367,7 +3367,7 @@ apply_dispatch(StringInfo s)
 default:
 ereport(ERROR,
 (errcode(ERRCODE_PROTOCOL_VIOLATION),
- errmsg("invalid logical replication message type \"%c\"", action)));
+ errmsg("invalid logical replication message type \"??? (%d)\"", action)));

I think we should report character here since that's what is visible
in the code and also the message types are communicated as characters
not integers. Makes debugging easier. Context may report integer as an
additional data point.
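
Combining the two might look like this (a sketch, not a patch):

ereport(ERROR,
		(errcode(ERRCODE_PROTOCOL_VIOLATION),
		 errmsg("invalid logical replication message type \"%c\" (%d)",
				action, action)));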

-- 
Best Wishes,
Ashutosh Bapat




Re: Atomic ops for unlogged LSN

2023-07-20 Thread John Morris

> what happens if …  reader here stores the old unloggedLSN value
> to control file and the server restarts (clean exit). So, the old
>value is loaded back to unloggedLSN upon restart and the callers of
> GetFakeLSNForUnloggedRel() will see an old/repeated value too. Is it a
> problem?

First, a clarification. The value being saved is the “next” unlogged LSN,
not one which has already been used.
(we are doing “fetch and add”,  not “add and fetch”)

You have a good point about shutdown and startup.  It is vital we
don’t repeat an unlogged LSN. This situation could easily happen
if other readers were active while we were shutting down.

>With an atomic variable, it is guaranteed that the readers
>don't see a torn-value, but no synchronization is provided.

The atomic increment also ensures the sequence
of values is correct, specifically we don’t see
repeated values like we might with a conventional increment.
As a side effect, the instruction enforces a memory barrier, but we are not
relying on a barrier in this case.
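
In code form, the semantics being described are roughly (a sketch; the
variable and function names are assumed, not taken from the patch):

#include "postgres.h"
#include "port/atomics.h"

/* holds the *next* unlogged LSN to hand out */
static pg_atomic_uint64 unloggedLSN;

static uint64
get_fake_lsn(void)
{
	/* fetch-and-add returns the pre-increment value; each caller is unique */
	return pg_atomic_fetch_add_u64(&unloggedLSN, 1);
}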



Re: remaining sql/json patches

2023-07-20 Thread Alvaro Herrera
On 2023-Jul-21, Amit Langote wrote:

> I’m thinking of pushing 0001 and 0002 tomorrow barring objections.

0001 looks reasonable to me.  I think you asked whether to squash that
one with the other bugfix commit for the same code that you already
pushed to master; I think there's no point in committing as separate
patches, because the first one won't show up in the git_changelog output
as a single entity with the one in 16, so it'll just be additional
noise.

I've looked at 0002 at various points in time and I think it looks
generally reasonable.  I think your removal of a couple of newlines
(where originally two appear in sequence) is unwarranted; that the name
to_json[b]_worker is ugly for exported functions (maybe "datum_to_json"
would be better, or you may have better ideas); and that the omission of
the stock comment in the new stanzas in FigureColnameInternal() is
strange.  But I don't have anything serious.  Do add some ecpg tests ...

Also, remember to pgindent and bump catversion, if you haven't already.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"




Re: pg15.3: dereference null *plan in CachedPlanIsSimplyValid/plpgsql

2023-07-20 Thread Tom Lane
Justin Pryzby  writes:
> A production instance crashed like so.

Oh, interesting.  Apparently we got past the "Has cache invalidation fired
on this plan?" check:

if (!plansource->is_valid || plan != plansource->gplan || !plan->is_valid)
return false;

because "plan" and "plansource->gplan" were *both* null, which is
a case the code wasn't expecting.  plansource.c seems to be paranoid
about gplan possibly being null everywhere else but here :-(

> Note that the prior query seems to have timed out in the same / similar 
> plpgsql
> statement:

Hm.  Perhaps that helps explain how we got into this state.  It must
be some weird race condition, given the lack of prior reports.
Anyway, adding a check for plan not being null seems like the
obvious fix.
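
Sketched against the check quoted above, the fix might be as simple as
(the idea only, not necessarily the committed form):

	if (plan == NULL || !plansource->is_valid ||
		plan != plansource->gplan || !plan->is_valid)
		return false;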

regards, tom lane




Re: remaining sql/json patches

2023-07-20 Thread Amit Langote
On Thu, Jul 20, 2023 at 17:19 Amit Langote  wrote:

> On Wed, Jul 19, 2023 at 5:17 PM Amit Langote 
> wrote:
> > On Wed, Jul 19, 2023 at 12:53 AM Alvaro Herrera 
> wrote:
> > > On 2023-Jul-18, Amit Langote wrote:
> > >
> > > > Attached updated patches.  In 0002, I removed the mention of the
> > > > RETURNING clause in the JSON(), JSON_SCALAR() documentation, which I
> > > > had forgotten to do in the last version which removed its support in
> > > > code.
> > >
> > > > I think 0001 looks ready to go.  Alvaro?
> > >
> > > It looks reasonable to me.
> >
> > Thanks for taking another look.
> >
> > I will push this tomorrow.
>
> Pushed.
>
> > > > Also, I've been wondering if it isn't too late to apply the following
> > > > to v16 too, so as to make the code look similar in both branches:
> > >
> > > Hmm.
> > >
> > > > 785480c953 Pass constructName to transformJsonValueExpr()
> > >
> > > I think 785480c953 can easily be considered a bugfix on 7081ac46ace8,
> so
> > > I agree it's better to apply it to 16.
> >
> > OK.
>
> Pushed to 16.
>
> > > > b6e1157e7d Don't include CaseTestExpr in JsonValueExpr.formatted_expr
> > >
> > > I feel a bit uneasy about this one.  It seems to assume that
> > > formatted_expr is always set, but at the same time it's not obvious
> that
> > > it is.  (Maybe this aspect just needs some more commentary).
> >
> > Hmm, I agree that the comments about formatted_expr could be improved
> > further, for which I propose the attached.  Actually, staring some
> > more at this, I'm inclined to change makeJsonValueExpr() to allow
> > callers to pass it the finished 'formatted_expr' rather than set it by
> > themselves.
> >
> > >  I agree
> > > that it would be better to make both branches identical, because if
> > > there's a problem, we are better equipped to get a fix done to both.
> > >
> > > As for the removal of makeCaseTestExpr(), I agree -- of the six callers
> > > of makeNode(CastTestExpr), only two of them would be able to use the
> new
> > > function, so it doesn't look of general enough usefulness.
> >
> > OK, so you agree with back-patching this one too, though perhaps only
> > after applying something like the aforementioned patch.
>
> I looked at this some more and concluded that it's fine to think that
> all JsonValueExpr nodes leaving the parser have their formatted_expr
> set.  I've updated the commentary some more in the patch attached as
> 0001.
>
> Rebased SQL/JSON patches also attached.  I've fixed the JSON_TABLE
> syntax synopsis in the documentation as mentioned in my other email.


I’m thinking of pushing 0001 and 0002 tomorrow barring objections.

> --
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


pg15.3: dereference null *plan in CachedPlanIsSimplyValid/plpgsql

2023-07-20 Thread Justin Pryzby
A production instance crashed like so.

< 2023-07-20 05:41:36.557 MDT  >LOG:  server process (PID 106001) was 
terminated by signal 11: Segmentation fault

[168417.649549] postmaster[106001]: segfault at 12 ip 008f1d40 sp 
7ffe885502e0 error 4 in postgres[40+7dc000]

Core was generated by `postgres: main main 127.0.0.1(51936) SELECT'.

(gdb) bt
#0  CachedPlanIsSimplyValid (plansource=0x2491398, plan=0x0, owner=0x130a198) 
at plancache.c:1441
#1  0x7fcab2e93d15 in exec_eval_simple_expr (rettypmod=0x7ffe88550374, 
rettype=0x7ffe88550370, isNull=0x7ffe8855036f, result=, 
expr=0x2489dc8, estate=0x7ffe885507f0) at pl_exec.c:6067
#2  exec_eval_expr (estate=0x7ffe885507f0, expr=0x2489dc8, 
isNull=0x7ffe8855036f, rettype=0x7ffe88550370, rettypmod=0x7ffe88550374) at 
pl_exec.c:5696
#3  0x7fcab2e97d42 in exec_assign_expr (estate=estate@entry=0x7ffe885507f0, 
target=0x16250f8, expr=0x2489dc8) at pl_exec.c:5028
#4  0x7fcab2e98675 in exec_stmt_assign (stmt=0x2489d80, stmt=0x2489d80, 
estate=0x7ffe885507f0) at pl_exec.c:2155
#5  exec_stmts (estate=estate@entry=0x7ffe885507f0, stmts=0x2489ad8) at 
pl_exec.c:2019
#6  0x7fcab2e9b672 in exec_stmt_block (estate=estate@entry=0x7ffe885507f0, 
block=block@entry=0x248af48) at pl_exec.c:1942
#7  0x7fcab2e9b6cb in exec_toplevel_block 
(estate=estate@entry=0x7ffe885507f0, block=0x248af48) at pl_exec.c:1633
#8  0x7fcab2e9bf84 in plpgsql_exec_function (func=func@entry=0x1035388, 
fcinfo=fcinfo@entry=0x40a76b8, simple_eval_estate=simple_eval_estate@entry=0x0, 
simple_eval_resowner=simple_eval_resowner@entry=0x0,
procedure_resowner=procedure_resowner@entry=0x0, atomic=atomic@entry=true) 
at pl_exec.c:622
#9  0x7fcab2ea7454 in plpgsql_call_handler (fcinfo=0x40a76b8) at 
pl_handler.c:277
#10 0x00658307 in ExecAggPlainTransByRef (setno=, 
aggcontext=, pergroup=0x41a3358, pertrans=, 
aggstate=) at execExprInterp.c:4379
#11 ExecInterpExpr (state=0x40a7870, econtext=0x1843a58, isnull=) at execExprInterp.c:1770
#12 0x00670282 in ExecEvalExprSwitchContext (isNull=0x7ffe88550b47, 
econtext=, state=) at 
../../../src/include/executor/executor.h:341
#13 advance_aggregates (aggstate=, aggstate=) at 
nodeAgg.c:824
#14 agg_fill_hash_table (aggstate=0x1843630) at nodeAgg.c:2544
#15 ExecAgg (pstate=0x1843630) at nodeAgg.c:2154
#16 0x00662c58 in ExecProcNodeInstr (node=0x1843630) at 
execProcnode.c:480
#17 0x0067c7f3 in ExecProcNode (node=0x1843630) at 
../../../src/include/executor/executor.h:259
#18 ExecHashJoinImpl (parallel=false, pstate=0x1843390) at nodeHashjoin.c:268
#19 ExecHashJoin (pstate=0x1843390) at nodeHashjoin.c:621
#20 0x00662c58 in ExecProcNodeInstr (node=0x1843390) at 
execProcnode.c:480
#21 0x0067c7f3 in ExecProcNode (node=0x1843390) at 
../../../src/include/executor/executor.h:259
#22 ExecHashJoinImpl (parallel=false, pstate=0x18430f0) at nodeHashjoin.c:268
#23 ExecHashJoin (pstate=0x18430f0) at nodeHashjoin.c:621
#24 0x00662c58 in ExecProcNodeInstr (node=0x18430f0) at 
execProcnode.c:480
#25 0x0068de76 in ExecProcNode (node=0x18430f0) at 
../../../src/include/executor/executor.h:259
#26 ExecSort (pstate=0x1842ee0) at nodeSort.c:149
#27 0x00662c58 in ExecProcNodeInstr (node=0x1842ee0) at 
execProcnode.c:480
#28 0x0066ced9 in ExecProcNode (node=0x1842ee0) at 
../../../src/include/executor/executor.h:259
#29 fetch_input_tuple (aggstate=aggstate@entry=0x1842908) at nodeAgg.c:563
#30 0x006701b2 in agg_retrieve_direct (aggstate=0x1842908) at 
nodeAgg.c:2346
#31 ExecAgg (pstate=0x1842908) at nodeAgg.c:2161
#32 0x00662c58 in ExecProcNodeInstr (node=0x1842908) at 
execProcnode.c:480
#33 0x0065be12 in ExecProcNode (node=0x1842908) at 
../../../src/include/executor/executor.h:259
#34 ExecutePlan (execute_once=, dest=0x5fb39f8, 
direction=, numberTuples=0, sendTuples=true, 
operation=CMD_SELECT, use_parallel_mode=, planstate=0x1842908, 
estate=0x1842298)
at execMain.c:1636
#35 standard_ExecutorRun (queryDesc=0x2356ab8, direction=, 
count=0, execute_once=) at execMain.c:363
#36 0x7fcab32bd67d in pgss_ExecutorRun (queryDesc=0x2356ab8, 
direction=ForwardScanDirection, count=0, execute_once=) at 
pg_stat_statements.c:1010
#37 0x7fcab30b7781 in explain_ExecutorRun (queryDesc=0x2356ab8, 
direction=ForwardScanDirection, count=0, execute_once=) at 
auto_explain.c:320
#38 0x007d010e in PortalRunSelect (portal=portal@entry=0xf79598, 
forward=forward@entry=true, count=0, count@entry=9223372036854775807, 
dest=dest@entry=0x5fb39f8) at pquery.c:924
#39 0x007d15bf in PortalRun (portal=, 
count=9223372036854775807, isTopLevel=, run_once=, dest=0x5fb39f8, altdest=0x5fb39f8, qc=0x7ffe885512d0) at pquery.c:768
#40 0x007cd417 in exec_simple_query (
query_string=0x15ee788 "-- report: thresrept: None: E-UTRAN/eNB KPIs: 
['2023-07-19 2023-07-20', '0 23', '4', '{\"report_type\": \"hourly\", 

Re: sslinfo extension - add notbefore and notafter timestamps

2023-07-20 Thread Daniel Gustafsson
> On 17 Jul 2023, at 20:26, Cary Huang  wrote:

>>> Perhaps calling "tm2timestamp(_time, 0, NULL, )" without checking 
>>> the return code would be just fine. I see some other usages of 
>>> tm2timstamp() in other code areas also skip checking the return code.
>> 
>> I think we want to know about any failures, btu we can probably make it into 
>> an
>> elog() instead, as it should never fail.
> 
> Yes, sure. I have corrected the error message to elog(ERROR, "timestamp out 
> of range") on a rare tm2timestamp() failure.

I went over this again and ended up pushing it along with a catversion bump.
Due to a mistake in my testing, however, I didn't catch that it was using an
API only present in OpenSSL 1.1.1 and higher, which caused build failures
when using older OpenSSL versions, so I ended up reverting it again (leaving
the certificate changes in place) to keep the buildfarm green.

Will look closer at an implementation which works across all supported versions
of OpenSSL when I have more time.

--
Daniel Gustafsson





Re: Printing backtrace of postgres processes

2023-07-20 Thread Daniel Gustafsson
> On 11 Jan 2023, at 15:44, Bharath Rupireddy 
>  wrote:
> 
> On Wed, Nov 30, 2022 at 11:43 AM Bharath Rupireddy
>  wrote:
>> 
>> I'm attaching the v22 patch set for further review.
> 
> Needed a rebase due to 216a784829c2c5f03ab0c43e009126cbb819e9b2.
> Attaching v23 patch set for further review.

This thread has stalled for well over 6 months with the patch going from CF to
CF.  From skimming the thread it seems that a lot of the details have been
ironed out with most (all?) objections addressed.  Is any committer interested
in picking this up?  If not we should probably mark it returned with feedback.

--
Daniel Gustafsson





Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-20 Thread Daniel Gustafsson
This patch no longer applies and needs a rebase.

Given where we are in the commitfest, do you think this patch has the potential
to go in or should it be moved?

--
Daniel Gustafsson





Re: psql: Add role's membership options to the \du+ command

2023-07-20 Thread Jonathan S. Katz

On 7/19/23 1:44 PM, Pavel Luzanov wrote:

On 19.07.2023 19:47, Tom Lane wrote:

And done, with some minor editorialization.


Thanks to everyone who participated in the work.
Special thanks to David for moving this patch forward for a long time, 
and to Tom for taking on the commit responsibilities.


[RMT]

+1; thanks to everyone for seeing this through!

Jonathan





Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-20 Thread Melih Mutlu
Hi,

Attached the updated patches with recent reviews addressed.

See below for my comments:

On Wed, Jul 19, 2023 at 06:08, Peter Smith wrote:

> Some review comments for v19-0001
>
> 2. LogicalRepSyncTableStart
>
> /*
> * Finally, wait until the leader apply worker tells us to catch up and
> * then return to let LogicalRepApplyLoop do it.
> */
> wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
>
> ~
>
> Should LogicalRepApplyLoop still be mentioned here, since that is
> static in worker.c? Maybe it is better to refer instead to the common
> 'start_apply' wrapper? (see also #5a below)


Isn't LogicalRepApplyLoop static on HEAD and also mentioned in the exact
comment in tablesync.c, while the common "start_apply" function also exists?
I'm not sure how such a change would be related to this patch.

---

5.
> + /* Found a table for next iteration */
> + finish_sync_worker(true);
> +
> + StartTransactionCommand();
> + ereport(LOG,
> + (errmsg("logical replication worker for subscription \"%s\" will be
> reused to sync table \"%s\" with relid %u.",
> + MySubscription->name,
> + get_rel_name(MyLogicalRepWorker->relid),
> + MyLogicalRepWorker->relid)));
> + CommitTransactionCommand();
> +
> + done = false;
> + break;
> + }
> + LWLockRelease(LogicalRepWorkerLock);


> 5b.
> Isn't there a missing call to that LWLockRelease, if the 'break' happens?


Lock is already released before break, if that's the lock you meant:

/* Update worker state for the next table */
> MyLogicalRepWorker->relid = rstate->relid;
> MyLogicalRepWorker->relstate = rstate->state;
> MyLogicalRepWorker->relstate_lsn = rstate->lsn;
> LWLockRelease(LogicalRepWorkerLock);


> /* Found a table for next iteration */
> finish_sync_worker(true);
> done = false;
> break;


---

2.
> As for the publisher node, this patch allows to reuse logical
> walsender processes
> after the streaming is done once.


> ~


> Is this paragraph even needed? Since the connection is reused then it
> already implies the other end (the walsender) is being reused, right?


I actually see no harm in explaining this explicitly.


Thanks,
-- 
Melih Mutlu
Microsoft


v21-0001-Refactor-to-split-Apply-and-Tablesync-Workers.patch
Description: Binary data


v21-0002-Reuse-Tablesync-Workers.patch
Description: Binary data


v21-0003-Reuse-connection-when-tablesync-workers-change-t.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-20 Thread Amit Kapila
On Thu, Jul 20, 2023 at 5:12 PM Melih Mutlu  wrote:
>
> On Thu, Jul 20, 2023 at 05:41, Peter Smith wrote:
>>
>> 7. InitializeLogRepWorker
>>
>>   if (am_tablesync_worker())
>>   ereport(LOG,
>> - (errmsg("logical replication worker for subscription \"%s\", table
>> \"%s\" has started",
>> + (errmsg("logical replication worker for subscription \"%s\", table
>> \"%s\" with relid %u has started",
>>   MySubscription->name,
>> - get_rel_name(MyLogicalRepWorker->relid;
>> + get_rel_name(MyLogicalRepWorker->relid),
>> + MyLogicalRepWorker->relid)));
>>
>> But this is certainly a tablesync worker so the message here should
>> say "logical replication table synchronization worker" like the HEAD
>> code used to do.
>>
>> It seems this mistake was introduced in patch v20-0001.
>
>
> I'm a bit confused here. Wasn't it decided to use "logical replication worker"
> regardless of the worker's type [1]? That's why I made this change. If that's
> not the case here, I'll put it back.
>

I feel where the worker type is clear, it is better to use it unless
the same can lead to translation issues.

-- 
With Regards,
Amit Kapila.




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-07-20 Thread Andrew Dunstan


On 2023-07-20 Th 05:52, Alvaro Herrera wrote:

On 2023-Jul-19, Andrew Dunstan wrote:


The result you report suggests to me that somehow the old version is no
longer a PostgreSQL::Version object.  Here's the patch I suggest:

Ahh, okay, that makes more sense; and yes, it does work.



Your patch LGTM


cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2023-07-20 Thread stephane tachoires
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

It is just a rebase.
I checked it with make and meson,
ran manual SPLIT and MERGE on list and range partitions,
and the documentation fits.

The new status of this patch is: Ready for Committer


Re: POC: GROUP BY optimization

2023-07-20 Thread Tomas Vondra
On 7/20/23 08:37, Andrey Lepikhov wrote:
> On 3/10/2022 21:56, Tom Lane wrote:
>> Revert "Optimize order of GROUP BY keys".
>>
>> This reverts commit db0d67db2401eb6238ccc04c6407a4fd4f985832 and
>> several follow-on fixes.
>> ...
>> Since we're hard up against the release deadline for v15, let's
>> revert these changes for now.  We can always try again later.
> 
> It may be time to restart the project. As a first step, I rebased the
> patch on the current master. It wasn't trivial because of some recent
> optimizations (a29eab, 1349d27 and 8d83a5d).
> Now, let's repeat the review and rewrite the current patch according to
> the reasons given in the revert commit.

I think the fundamental task is to make the costing more reliable, and
the commit message 443df6e2db points out a couple challenges in this
area. Not sure how feasible it is to address enough of them ...

1) procost = 1.0 - I guess we could make this more realistic by doing
some microbenchmarks and tuning the costs for the most expensive cases.

2) estimating quicksort comparisons - This relies on ndistinct
estimates, and I'm not sure how much more reliable we can make those.
Probably not much :-( Not sure what to do about this, the only thing I
can think of is to track "reliability" of the estimates and only do the
reordering if we have high confidence in the estimates. That means we'll
miss some optimization opportunities, but it should limit the risk.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-20 Thread Melih Mutlu
Hi,

On Thu, Jul 20, 2023 at 05:41, Peter Smith wrote:

> 7. InitializeLogRepWorker
>
>   if (am_tablesync_worker())
>   ereport(LOG,
> - (errmsg("logical replication worker for subscription \"%s\", table
> \"%s\" has started",
> + (errmsg("logical replication worker for subscription \"%s\", table
> \"%s\" with relid %u has started",
>   MySubscription->name,
> - get_rel_name(MyLogicalRepWorker->relid;
> + get_rel_name(MyLogicalRepWorker->relid),
> + MyLogicalRepWorker->relid)));
>
> But this is certainly a tablesync worker so the message here should
> say "logical replication table synchronization worker" like the HEAD
> code used to do.
>
> It seems this mistake was introduced in patch v20-0001.
>

I'm a bit confused here. Wasn't it decided to use "logical replication
worker" regardless of the worker's type [1]? That's why I made this change.
If that's not the case here, I'll put it back.

[1]
https://www.postgresql.org/message-id/flat/CAHut%2BPt1xwATviPGjjtJy5L631SGf3qjV9XUCmxLu16cHamfgg%40mail.gmail.com

Thanks,
-- 
Melih Mutlu
Microsoft


Re: Use of additional index columns in rows filtering

2023-07-20 Thread Tomas Vondra



On 7/19/23 23:38, Peter Geoghegan wrote:
> On Wed, Jul 19, 2023 at 1:46 PM Jeff Davis  wrote:
>> On Wed, 2023-07-19 at 20:03 +0200, Tomas Vondra wrote:
>>> Makes sense, I also need to think about maybe not having duplicate
>>> clauses in the two lists. What annoys me on that it partially
>>> prevents
>>> the cost-based reordering done by order_qual_clauses(). So maybe we
>>> should have three lists ... Also, some of the expressions count be
>>> fairly expensive.
>>
>> Can we just calculate the costs of the pushdown and do it when it's a
>> win? If the random_page_cost savings exceed the costs from evaluating
>> the clause earlier, then push down.
> 
> My patch that teaches nbtree to execute ScalarArrayOps intelligently
> (by dynamically choosing to not re-descend the btree to perform
> another "primitive index scan" when the data we need is located on the
> same leaf page as the current ScalarArrayOps arrays) took an
> interesting turn recently -- one that seems related.
> 
> I found that certain kinds of queries are dramatically faster once you
> teach the optimizer to accept that multi-column ScalarArrayOps can be
> trusted to return tuples in logical/index order, at least under some
> circumstances. For example:
> 
> pg@regression: [583930]=# create index order_by_saop on
> tenk1(two,four,twenty);
> CREATE INDEX
> 
> pg@regression: [583930]=# EXPLAIN (ANALYZE, BUFFERS)
> select ctid, thousand from tenk1
> where two in (0,1) and four in (1,2) and twenty in (1,2)
> order by two, four, twenty limit 20;
> 
> This shows "Buffers: shared hit=1377" on HEAD, versus "Buffers: shared
> hit=13" with my patch. All because we can safely terminate the scan
> early now. The vast majority of the buffer hits the patch will avoid
> are against heap pages, even though I started out with the intention
> of eliminating unnecessary repeat index page accesses.
> 
> Note that build_index_paths() currently refuses to allow SAOP clauses
> to be recognized as ordered with a multi-column index and a query with
> a clause for more than the leading column -- that is something that
> the patch needs to address (to get this particular improvement, at
> least). Allowing such an index path to have useful pathkeys is
> typically safe (in the sense that it won't lead to wrong answers to
> queries), but we still make a conservative assumption that they can
> lead to wrong answers. There are comments about "equality constraints"
> that describe the restrictions right now.
> 
> But it's not just the question of basic correctness -- the optimizer
> is very hesitant to use multi-column SAOPs, even in cases that don't
> care about ordering. So it's also (I think, implicitly) a question of
> *risk*. The risk of getting very inefficient SAOP execution in nbtree,
> when it turns out that a huge number of "primitive index scans" are
> needed. But, if nbtree is taught to "coalesce together" primitive
> index scans at runtime, that risk goes way down.
> 
> Anyway, this seems related to what you're talking about because the
> relationship between selectivity and ordering seems particularly
> important in this context. And because it suggests that there is at
> least some scope for adding "run time insurance" to the executor,
> which is valuable in the optimizer if it bounds the potential
> downside. If you can be practically certain that there is essentially
> zero risk of serious problems when the costing miscalculates (for a
> limited subset of cases), then life might be a lot easier -- clearly
> we should be biased in one particular direction with a case that has
> that kind of general profile.
> 
> My current understanding of the optimizer side of things --
> particularly things like costing for "filter quals/pqquals" versus
> costing for "true index quals" -- is rather basic. That will have to
> change. Curious to hear your thoughts (if any) on how what you're
> discussing may relate to what I need to do with my patch. Right now my
> patch assumes that making SAOP clauses into proper index quals (that
> usually preserve index ordering) is an unalloyed good (when safe!).
> This assumption is approximately true on average, as far as I can
> tell. But it's probably quite untrue in various specific cases, that
> somebody is bound to care about.
> 

I think the SAOP patch may need to be much more careful about this, but
for this patch it's simpler because it doesn't really change any of the
index internals, or the indexscan in general.

If I simplify that a bit, we're just reordering the clauses so as to
maybe eliminate the heap fetch. The main risk seems to be that this will
force an expensive qual to the front of the list just because it can be
evaluated on the index tuple. But the added evaluation cost would need to
be worse than what we save by not doing the I/O; considering how expensive
the I/O is, that seems unlikely. It could happen for expensive quals that
don't really eliminate many rows, I guess.
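
A back-of-the-envelope version of that tradeoff (all names invented):

#include "postgres.h"
#include "nodes/nodes.h"

/*
 * Pushing a qual into the index filter pays off when the extra
 * per-index-tuple evaluation costs less than the heap I/O it avoids.
 */
static bool
qual_pushdown_pays_off(Cost qual_cost_per_tuple, double index_tuples,
					   double heap_fetches_avoided, Cost random_page_cost)
{
	Cost		extra_eval = qual_cost_per_tuple * index_tuples;
	Cost		io_saved = heap_fetches_avoided * random_page_cost;

	return extra_eval < io_saved;
}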

Anyway, I see this as extension of what 

Re: Synchronizing slots from primary to standby

2023-07-20 Thread shveta malik
On Fri, Jun 16, 2023 at 3:26 PM Amit Kapila  wrote:
>
> On Mon, Apr 17, 2023 at 7:37 PM Drouvot, Bertrand
>  wrote:
> >

> 3. As mentioned in the initial email, I think it would be better to
> replace LIST_SLOTS command with a SELECT query.
>

I had a look at this thread. I am interested in working on this and can
spend some time addressing the comments given here.

I tried to replace LIST_SLOTS command with a SELECT query. Attached
rebased patch and PoC patch for LIST_SLOTS removal. For LIST_SLOTS cmd
removal, below are the points where more analysis is needed.

1) I could not use the exposed libpqwalreceiver's functions
walrcv_exec/libpqrcv_exec in LogicalRepLauncher to run select query
instead of LIST_SLOTS cmd. This is because libpqrcv_exec() needs
database connection but since in LogicalReplauncher, we do not have
any (MyDatabseId is not set), so the API gives an error. Thus to make
it work for the time-being, I used 'libpqrcv_PQexec' which is not
dependent upon database connection. But since it is not exposed "yet"
to other layers, I temporarily added the new code to
libpqwalreceiver.c itself. In fact I reused the existing function
wrapper libpqrcv_list_slots and changed the functionality to get info
using select query rather than list_slots.

2) While using connect API walrcv_connect/libpqrcv_connect(), we need
to tell it whether it is for logical or physical replication. In the
existing patch, where we were using LIST_SLOTS cmd, we have this
connection made with logical=false. But now since we need to run
select query to get the same info, using connection with logical=false
gives error on primary while executing select query. "ERROR:  cannot
execute SQL commands in WAL sender for physical replication".
And thus in ApplyLauncherStartSlotSync(), I have changed connect API
to use logical=true for the time being.  I noticed that in the
existing patch, it was using logical=false in
ApplyLauncherStartSlotSync() while logical=true in
synchronize_slots(). Possibly due to the same fact that logical=false
connection will not allow synchronize_slots() to run select query on
primary while it worked for ApplyLauncherStartSlotSync() as it was
running list_slots cmd instead of select query.

I am exploring further on these points to figure out which one is the
correct way to deal with these. Meanwhile posting this WIP patch for
early feedback. I will try addressing other comments as well in next
versions.

thanks
Shveta


v1-0001-Remove-list_slots-command.patch
Description: Binary data


v7-0001-Synchronize-logical-replication-slots-from-primar.patch
Description: Binary data


Re: Partial aggregates pushdown

2023-07-20 Thread Alexander Pyhalov

fujii.y...@df.mitsubishielectric.co.jp wrote on 2023-07-19 03:43:

Hi Mr.Pyhalov, hackers.



3)
I modified the patch to safely do a partial aggregate pushdown for
queries which contain having clauses.



Hi.
Sorry, but I don't see how it could work.
For example, the attached test returns wrong result:

CREATE FUNCTION f() RETURNS INT AS $$
begin
  return 10;
end
$$ LANGUAGE PLPGSQL;

SELECT b, sum(a) FROM pagg_tab GROUP BY b HAVING sum(a) < f() ORDER BY 
1;

 b  | sum 
----+-----
  0 |   0
 10 |   0
 20 |   0
 30 |   0
 40 |   0
(5 rows)

In fact the above query should have returned 0 rows, as

SELECT b, sum(a) FROM pagg_tab GROUP BY b ORDER BY 1;
 b  | sum  
----+------
  0 |  600
  1 |  660
  2 |  720
  3 |  780
  4 |  840
  5 |  900
  6 |  960
  7 | 1020
  8 | 1080
  9 | 1140
 10 |  600
 11 |  660
 12 |  720

shows no such rows.

Or, on the same data

SELECT b, sum(a) FROM pagg_tab GROUP BY b HAVING sum(a) > 660 ORDER BY 
1;


You'll get 0 rows.

But
SELECT b, sum(a) FROM pagg_tab GROUP BY b;
 b  | sum  
----+------
 42 |  720
 29 | 1140
  4 |  840
 34 |  840
 41 |  660
  0 |  600
 40 |  600
gives.

The issue is that you can't calculate a "partial" HAVING. You have to 
compare the full aggregate in the filter, but that's not possible at the 
level of one partition.

And you have this in plans

 Finalize GroupAggregate
   Output: pagg_tab.b, avg(pagg_tab.a), max(pagg_tab.a), count(*)
   Group Key: pagg_tab.b
   Filter: (sum(pagg_tab.a) < 700)
   ->  Sort
 Output: pagg_tab.b, (PARTIAL avg(pagg_tab.a)), (PARTIAL 
max(pagg_tab.a)), (PARTIAL count(*)), (PARTIAL sum(pagg_tab.a))

 Sort Key: pagg_tab.b
 ->  Append
   ->  Foreign Scan
 Output: pagg_tab.b, (PARTIAL avg(pagg_tab.a)), 
(PARTIAL max(pagg_tab.a)), (PARTIAL count(*)), (PARTIAL sum(pagg_tab.a))
 Filter: ((PARTIAL sum(pagg_tab.a)) < 700)    
<--- here we can't compare anything yet, sum is incomplete.
 Relations: Aggregate on (public.fpagg_tab_p1 
pagg_tab)
 Remote SQL: SELECT b, avg_p_int4(a), max(a), 
count(*), sum(a) FROM public.pagg_tab_p1 GROUP BY 1

   ->  Foreign Scan
 Output: pagg_tab_1.b, (PARTIAL avg(pagg_tab_1.a)), 
(PARTIAL max(pagg_tab_1.a)), (PARTIAL count(*)), (PARTIAL 
sum(pagg_tab_1.a))

 Filter: ((PARTIAL sum(pagg_tab_1.a)) < 700)
 Relations: Aggregate on (public.fpagg_tab_p2 
pagg_tab_1)
 Remote SQL: SELECT b, avg_p_int4(a), max(a), 
count(*), sum(a) FROM public.pagg_tab_p2 GROUP BY 1

   ->  Foreign Scan
 Output: pagg_tab_2.b, (PARTIAL avg(pagg_tab_2.a)), 
(PARTIAL max(pagg_tab_2.a)), (PARTIAL count(*)), (PARTIAL 
sum(pagg_tab_2.a))

 Filter: ((PARTIAL sum(pagg_tab_2.a)) < 700)
 Relations: Aggregate on (public.fpagg_tab_p3 
pagg_tab_2)
 Remote SQL: SELECT b, avg_p_int4(a), max(a), 
count(*), sum(a) FROM public.pagg_tab_p3 GROUP BY 1


In foreign_grouping_ok()
6586 if (IsA(expr, Aggref))
6587 {
6588 if (partial)
6589 {
6590 mark_partial_aggref((Aggref 
*) expr, AGGSPLIT_INITIAL_SERIAL);

6591 continue;
6592 }
6593 else if (!is_foreign_expr(root, 
grouped_rel, expr))

6594 return false;
6595
6596 tlist = add_to_flat_tlist(tlist, 
list_make1(expr));

6597 }

at least you shouldn't do anything with expr, if is_foreign_expr() 
returned false. If we restrict pushing down queries with havingQuals, 
I'm not quite sure how Aggref can appear in local_conds.
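
A minimal sketch of that restriction ("partial" is the variable from the
snippet above; whether a "havingQual" is directly in scope at this point
in foreign_grouping_ok() is an assumption):

	/*
	 * A partial aggregate cannot be compared against the final HAVING
	 * threshold per partition, so refuse partial pushdown outright.
	 */
	if (partial && havingQual != NULL)
		return false;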


As for changes in planner.c (setGroupClausePartial()) I have several 
questions.


1) Why don't we add non_group_exprs to pathtarget->exprs when 
partial_target->exprs is not set?


2) We replace extra->partial_target->exprs with partial_target->exprs 
after processing. Why are we sure that after this tleSortGroupRef is 
correct?


--
Best regards,
Alexander Pyhalov,
Postgres Professionaldiff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 1824cb67fe9..c6f613019c3 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3025,6 +3025,22 @@ EXPLAIN (VERBOSE, COSTS OFF)
 SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700 ORDER BY 1;
 SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700 ORDER BY 1;
 
+CREATE FUNCTION f() RETURNS INT AS $$
+begin
+  return 10;
+end 
+$$ LANGUAGE PLPGSQL;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT b, sum(a) FROM pagg_tab GROUP BY b HAVING sum(a) < f() ORDER BY 1;
+SELECT b, 

Re: inconsistency between the VM page visibility status and the visibility status of the page

2023-07-20 Thread Tomas Vondra
On 7/20/23 08:44, yanhui.xiong wrote:
> Hi,
> 
> I have an issue here. When executing a SELECT statement using an
> index-only scan, I got a wrong row count because there may be an
> inconsistency between the VM page visibility status and the visibility
> status of the page: the VM bit is set while the page-level bit is clear.
> 
> I read the code and noted that there is a chance for this to happen, but
> how does it happen?
> 
> The code clears the page-level visibility bit and the VM bit at the same
> time, so I do not understand how this state can arise.
> 

Well, by only looking at the code you're assuming two things:

1) the code is correct

2) the environment is perfect

Either (or both) of these assumptions may be wrong. There certainly
could be some subtle bug in the visibility map code, who knows. Or maybe
this is due to some sort of data corruption outside postgres.

It's impossible to answer without you digging much deeper and providing
much more information. What's the environment? What hardware? What PG
version? How long is it running? Any crashes? Any other cases of similar
issues? What does the page look like in pageinspect?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Support worker_spi to execute the function dynamically.

2023-07-20 Thread Bharath Rupireddy
On Thu, Jul 20, 2023 at 2:38 PM Masahiro Ikeda  wrote:
>
> Thanks for discussing about the patch. I updated the patch from your
> comments
> * v2-0001-Support-worker_spi-to-execute-the-function-dynamical.patch
>
> I found another thing that could be improved. Although the tests assumed
> "shared_preload_libraries = worker_spi", the background workers failed
> to be launched in the initialization phase because the database had not
> been created yet.
>
> ```
> # make check# in src/test/modules/worker_spi
> # cat log/postmaster.log # in src/test/modules/worker_spi/
> 2023-07-20 17:58:47.958 JST worker_spi[853620] FATAL:  database
> "contrib_regression" does not exist
> 2023-07-20 17:58:47.958 JST worker_spi[853621] FATAL:  database
> "contrib_regression" does not exist
> 2023-07-20 17:58:47.959 JST postmaster[853612] LOG:  background worker
> "worker_spi" (PID 853620) exited with exit code 1
> 2023-07-20 17:58:47.959 JST postmaster[853612] LOG:  background worker
> "worker_spi" (PID 853621) exited with exit code 1
> ```
>
> It's better to remove "shared_preload_libraries = worker_spi" from the
> test configuration. I misunderstood that two background workers would
> be launched and waiting at the start of the test.

I don't think that change is correct. The worker_spi essentially shows
how to start bg workers with RegisterBackgroundWorker and dynamic bg
workers with RegisterDynamicBackgroundWorker. If
shared_preload_libraries = worker_spi is not specified there, you will
miss starting the RegisterBackgroundWorker workers. Does giving an
initdb-time database name to worker_spi.database work there? If the
database for bg workers doesn't exist, changing bgw_restart_time from
BGW_NEVER_RESTART to, say, 1 will help to see the bg workers coming up
eventually.

I think it's worth adding test cases for the expected number of bg
workers (after creating worker_spi extension) and dynamic bg workers
(after calling worker_spi_launch()). Also, to distinguish bg workers
and dynamic bg workers, you can change
bgw_type in worker_spi_launch to "worker_spi dynamic worker".

-/* get the configuration */
+/* Get the configuration */

-/* set up common data for all our workers */
+/* Set up common data for all our workers */

These unrelated changes had better be left as-is, because the postgres
code has both commenting styles /* Get ... */ and /* get ... */, IOW,
single-line comments starting with both uppercase and lowercase.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-07-20 Thread Alvaro Herrera
On 2023-Jul-19, Andrew Dunstan wrote:

> The result you report suggests to me that somehow the old version is no
> longer a PostgreSQL::Version object.  Here's the patch I suggest:

Ahh, okay, that makes more sense; and yes, it does work.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
>From 055b0d89f7c6667b79ba0b97f54ea5aef64dfbb1 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 19 Jul 2023 17:54:09 +0200
Subject: [PATCH v2] Compare only major versions in AdjustUpgrade.pm

Because PostgreSQL::Version is very nuanced about development version
numbers, the comparison to 16beta2 makes it think that that release is
older than 16, therefore applying a database tweak that doesn't work
there.  Fix by having AdjustUpgrade create a separate
PostgreSQL::Version object that only contains the major version number.

While at it, have it ensure that the objects given are of the expected
type.

Co-authored-by: Andrew Dunstan 
Discussion: https://postgr.es/m/20230719110504.zbu74o54bqqlsufb@alvherre.pgsql
---
 src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm | 12 
 1 file changed, 12 insertions(+)

diff --git a/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm b/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
index a241d2ceff..64305fdcce 100644
--- a/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
+++ b/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
@@ -76,6 +76,10 @@ sub adjust_database_contents
 	my ($old_version, %dbnames) = @_;
 	my $result = {};
 
+	die "wrong type for \$old_version\n"
+		unless $old_version->isa("PostgreSQL::Version");
+	$old_version = PostgreSQL::Version->new($old_version->major);
+
 	# remove dbs of modules known to cause pg_upgrade to fail
 	# anything not builtin and incompatible should clean up its own db
 	foreach my $bad_module ('test_ddl_deparse', 'tsearch2')
@@ -262,6 +266,10 @@ sub adjust_old_dumpfile
 {
 	my ($old_version, $dump) = @_;
 
+	die "wrong type for \$old_version\n"
+		unless $old_version->isa("PostgreSQL::Version");
+	$old_version = PostgreSQL::Version->new($old_version->major);
+
 	# use Unix newlines
 	$dump =~ s/\r\n/\n/g;
 
@@ -579,6 +587,10 @@ sub adjust_new_dumpfile
 {
 	my ($old_version, $dump) = @_;
 
+	die "wrong type for \$old_version\n"
+		unless $old_version->isa("PostgreSQL::Version");
+	$old_version = PostgreSQL::Version->new($old_version->major);
+
 	# use Unix newlines
 	$dump =~ s/\r\n/\n/g;
 
-- 
2.39.2



Re: Support worker_spi to execute the function dynamically.

2023-07-20 Thread Bharath Rupireddy
On Thu, Jul 20, 2023 at 2:59 PM Michael Paquier  wrote:
>
> On Thu, Jul 20, 2023 at 05:54:55PM +0900, Masahiro Ikeda wrote:
> > Yes, you're right. When I tried using worker_spi to test wait event,
> > I found the behavior. And thanks a lot for your patch. I wasn't aware
> > of the way.  I'll merge your patch to the tests for wait events.
>
> Be careful when using that.  I have not spent more than a few minutes
> to show my point, but what I sent lacks a shmem_request_hook in
> _PG_init(), for example, to request an amount of shared memory equal
> to the size of the state structure.

I think the preferred way to grab a chunk of shared memory for an
external module is by using shmem_request_hook and shmem_startup_hook.
Wait events shared memory too can use them.
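
For reference, a minimal sketch of that hook pattern for a module loaded
via shared_preload_libraries (the state structure and its contents are
illustrative, not worker_spi's actual state):

#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;

typedef struct WorkerSpiState
{
	int			counter;
} WorkerSpiState;

static WorkerSpiState *wss = NULL;
static shmem_request_hook_type prev_shmem_request_hook = NULL;
static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

/* reserve space for our shared state at postmaster startup */
static void
worker_spi_shmem_request(void)
{
	if (prev_shmem_request_hook)
		prev_shmem_request_hook();
	RequestAddinShmemSpace(MAXALIGN(sizeof(WorkerSpiState)));
}

/* attach to (or initialize) the shared state once shmem exists */
static void
worker_spi_shmem_startup(void)
{
	bool		found;

	if (prev_shmem_startup_hook)
		prev_shmem_startup_hook();

	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
	wss = ShmemInitStruct("worker_spi state",
						  sizeof(WorkerSpiState), &found);
	if (!found)
		wss->counter = 0;
	LWLockRelease(AddinShmemInitLock);
}

void
_PG_init(void)
{
	if (!process_shared_preload_libraries_in_progress)
		return;

	prev_shmem_request_hook = shmem_request_hook;
	shmem_request_hook = worker_spi_shmem_request;
	prev_shmem_startup_hook = shmem_startup_hook;
	shmem_startup_hook = worker_spi_shmem_startup;
}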

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Support worker_spi to execute the function dynamically.

2023-07-20 Thread Michael Paquier
On Thu, Jul 20, 2023 at 05:54:55PM +0900, Masahiro Ikeda wrote:
> Yes, you're right. When I tried using worker_spi to test wait event,
> I found the behavior. And thanks a lot for your patch. I wasn't aware
> of the way.  I'll merge your patch to the tests for wait events.

Be careful when using that.  I have not spent more than a few minutes
to show my point, but what I sent lacks a shmem_request_hook in
_PG_init(), for example, to request an amount of shared memory equal
to the size of the state structure.
--
Michael




Re: There should be a way to use the force flag when restoring databases

2023-07-20 Thread Daniel Gustafsson
> On 19 Jul 2023, at 19:28, Gurjeet Singh  wrote:
> 
> On Tue, Jul 18, 2023 at 12:53 AM Joan  wrote:
>> 
>> Since posgres 13 there's the option to do a FORCE when dropping a database 
>> (so it disconnects current users) Documentation here: 
>> https://www.postgresql.org/docs/current/sql-dropdatabase.html
>> 
>> I am currently using dir format for the output
>>   pg_dump -d "bdname" -F d -j 4 -v -f /tmp/dir
>> 
>> And restoring the database with
>>  pg_restore -d postgres -C -c --exit-on-error -F d -j 3 -v /tmp/dir
>> 
>> Having an option to add the FORCE option to either the generated dump by 
>> pg_dump, or in the pg_restore would be very useful when restoring the 
>> databases to another servers so it would avoid having to do scripting.
>> 
>> In my specific case I am using this to refresh periodically a development 
>> environment with data from production servers for a small database (~200M).
> 
> Making force-drop a part of pg_dump output may be dangerous, and not
> provide much flexibility at restore time.
> 
> Adding a force option to pg_restore feels like providing the right tradeoff.
> 
> The docs for 'pg_restore --create` say "Create the database before
> restoring into it. If --clean is also specified, drop and recreate the
> target database before connecting to it."
> 
> If we provided a force option, it may then additionally say: "If the
> --clean and --force options are specified, DROP DATABASE ... WITH
> FORCE command will be used to drop the database."

pg_restore --clean refers to dropping any pre-existing database objects and not
just databases, but --force would only apply to databases.

I wonder if it's worth complicating pg_restore with this, given that running
dropdb --force before pg_restore is already an option for those wanting to use
WITH FORCE.
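
For comparison, the workaround available today would look something like this
(a sketch, reusing the commands from upthread; -c is dropped since the
database is force-dropped up front):

   dropdb --force --if-exists bdname
   pg_restore -d postgres -C --exit-on-error -F d -j 3 -v /tmp/dir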

--
Daniel Gustafsson





Re: WAL Insertion Lock Improvements

2023-07-20 Thread Bharath Rupireddy
On Fri, Jul 14, 2023 at 4:17 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-07-13 14:04:31 -0700, Andres Freund wrote:
> > From b74b6e953cb5a7e7ea1a89719893f6ce9e231bba Mon Sep 17 00:00:00 2001
> > From: Bharath Rupireddy 
> > Date: Fri, 19 May 2023 15:00:21 +
> > Subject: [PATCH v8] Optimize WAL insertion lock acquisition and release
> >
> > This commit optimizes WAL insertion lock acquisition and release
> > in the following way:
>
> I think this commit does too many things at once.

I've split the patch into three - 1) Make insertingAt 64-bit atomic.
2) Have better commenting on why there's no memory barrier or spinlock
in and around LWLockWaitForVar call sites. 3) Have a quick exit for
LWLockUpdateVar.

> > 1. WAL insertion lock's variable insertingAt is currently read and
> > written with the help of lwlock's wait list lock to avoid
> > torn reads/writes. This wait list lock can become a point of
> > contention under highly concurrent write workloads. Therefore, make
> > insertingAt a 64-bit atomic which inherently provides torn-free
> > reads/writes.
>
> "inherently" is a bit strong, given that we emulate 64bit atomics where not
> available...

Modified.

> > 2. LWLockUpdateVar currently acquires lwlock's wait list lock even when
> > there are no waiters at all. Add a fastpath exit to LWLockUpdateVar when
> > there are no waiters to avoid unnecessary locking.
>
> I don't think there's enough of an explanation for why this isn't racy.
>
> The reason it's, I think, safe, is that anyone using LWLockConflictsWithVar()
> will do so twice in a row, with a barrier inbetween. But that really relies on
> what I explain in the paragraph below:

The twice-in-a-row lock acquisition protocol used by LWLockWaitForVar is
what makes the quick exit in LWLockUpdateVar safe: LWLockWaitForVar
ensures waiters are added to the wait queue even if LWLockUpdateVar
momentarily sees no waiters. Is my understanding correct here?
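
For readers following along, the quick-exit shape under discussion is roughly
the following (a simplified sketch, not the exact patch; LW_FLAG_HAS_WAITERS
is the waiter flag lwlock.c keeps in the lock's state word):

void
LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
{
	/*
	 * The atomic exchange is a full memory barrier, so the new value is
	 * guaranteed to be visible to others before we inspect the waiter flag.
	 */
	pg_atomic_exchange_u64(valptr, val);

	/* quick exit: nobody is waiting, so there is no queue to wake up */
	if ((pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) == 0)
		return;

	/* ... otherwise take the wait list lock and wake up waiters ... */
}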

> > It also adds notes on why LWLockConflictsWithVar doesn't need a
> > memory barrier as far as its current usage is concerned.
>
> I don't think:
>  * NB: LWLockConflictsWithVar (which is called from
>  * LWLockWaitForVar) relies on the spinlock used above in this
>  * function and doesn't use a memory barrier.
>
> helps to understand why any of this is safe to a meaningful degree.
>
> The existing comments obviously aren't sufficient to explain this, but
> the reason it's, I think, safe today, is that we are only waiting for
> insertions that started before WaitXLogInsertionsToFinish() was called.  The
> lack of memory barriers in the loop means that we might see locks as "unused"
> that have *since* become used, which is fine, because they only can be for
> later insertions that we wouldn't want to wait on anyway.

Right.

> Not taking a lock to acquire the current insertingAt value means that we might
> see older insertingAt value. Which should also be fine, because if we read a
> too old value, we'll add ourselves to the queue, which contains atomic
> operations.

Right. Reading an older value makes us add ourselves to the queue in
LWLockWaitForVar, and LWLockUpdateVar should eventually wake us up.

This matches my understanding. I used more or less your above wording
in the 0002 patch.

> >   /*
> > -  * Read value using the lwlock's wait list lock, as we can't generally
> > -  * rely on atomic 64 bit reads/stores.  TODO: On platforms with a way to
> > -  * do atomic 64 bit reads/writes the spinlock should be optimized away.
> > +  * Reading the value atomically ensures that we don't need any explicit
> > +  * locking. Note that in general, 64 bit atomic APIs in postgres inherently
> > +  * provide explicit locking for the platforms without atomics support.
> >    */
>
> This comment seems off to me. Using atomics doesn't guarantee not needing
> locking. It just guarantees that we are reading a non-torn value.

Modified the comment.

> > @@ -1605,9 +1602,14 @@ LWLockConflictsWithVar(LWLock *lock,
> >   *
> >   * Note: this function ignores shared lock holders; if the lock is held
> >   * in shared mode, returns 'true'.
> > + *
> > + * Be careful that LWLockConflictsWithVar() does not include a memory barrier,
> > + * hence the caller of this function may want to rely on an explicit barrier or
> > + * a spinlock to avoid memory ordering issues.
> >   */
>
> s/careful/aware/?
>
> s/spinlock/implied barrier via spinlock or lwlock/?

Done.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v9-0001-Optimize-WAL-insertion-lock-acquisition-and-relea.patch
Description: Binary data


v9-0002-Improve-commets-in-and-around-LWLockWaitForVar-ca.patch
Description: Binary data


v9-0003-Have-a-quick-exit-for-LWLockUpdateVar-when-there-.patch
Description: Binary data


Re: Support worker_spi to execute the function dynamically.

2023-07-20 Thread Masahiro Ikeda

Hi,

On 2023-07-20 13:50, Bharath Rupireddy wrote:
On Thu, Jul 20, 2023 at 10:09 AM Michael Paquier wrote:


On Thu, Jul 20, 2023 at 09:43:37AM +0530, Bharath Rupireddy wrote:
> +1. However, a comment above helps one to understand why some GUCs are
> defined before if (!process_shared_preload_libraries_in_progress). As
> this is an example extension, it will help understand the reasoning
> better. I know we will have it in the commit message, but a direct comment
> helps:
>
> /*
>  * Note that this GUC is defined irrespective of worker_spi shared library
>  * presence in shared_preload_libraries. It's possible to create the
>  * worker_spi extension and use functions without it being specified in
>  * shared_preload_libraries. If we return from here without defining this
>  * GUC, the dynamic workers launched by worker_spi_launch() will keep
>  * crashing and restarting.
>  */

WFM to be more talkative here and document things, but I don't think
that's it.  How about a simple "These GUCs are defined even if this
library is not loaded with shared_preload_libraries, for
worker_spi_launch()."


LGTM.


Thanks for discussing the patch. I updated it based on your comments:

* v2-0001-Support-worker_spi-to-execute-the-function-dynamical.patch

I found another thing that should be changed. Although the tests assumed
"shared_preload_libraries = worker_spi", the background workers failed to
be launched during the initialization phase because the database had not
been created yet.


```
# make check# in src/test/modules/worker_spi
# cat log/postmaster.log # in src/test/modules/worker_spi/
2023-07-20 17:58:47.958 JST worker_spi[853620] FATAL:  database 
"contrib_regression" does not exist
2023-07-20 17:58:47.958 JST worker_spi[853621] FATAL:  database 
"contrib_regression" does not exist
2023-07-20 17:58:47.959 JST postmaster[853612] LOG:  background worker 
"worker_spi" (PID 853620) exited with exit code 1
2023-07-20 17:58:47.959 JST postmaster[853612] LOG:  background worker 
"worker_spi" (PID 853621) exited with exit code 1

```

It's better to remove "shared_preload_libraries = worker_spi" from the
test configuration. I had mistakenly assumed that two background workers
would be launched and waiting at the start of the test.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From 02ef0d8daddd43305842987a6aeac6732b2415a9 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Thu, 20 Jul 2023 10:34:50 +0900
Subject: [PATCH] Support worker_spi to execute the function dynamically.

Currently, the database name to connect to is initialized only
when process_shared_preload_libraries_in_progress is true.
This means that worker_spi_launch() fails if shared_preload_libraries
is empty, because the database to connect to is NULL.

The patch changes this so that the database name is always initialized
when _PG_init() is called. We can then call worker_spi_launch() and
launch SPI workers dynamically.

It also changes the configuration for the test: we don't need
"shared_preload_libraries = worker_spi" anymore, and background
workers launched during the initialization phase always failed
because the "contrib_regression" database was not created yet.
---
 src/test/modules/worker_spi/Makefile |  4 ++--
 src/test/modules/worker_spi/dynamic.conf |  3 +--
 src/test/modules/worker_spi/worker_spi.c | 27 ++--
 3 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/src/test/modules/worker_spi/Makefile b/src/test/modules/worker_spi/Makefile
index cbf9b2e37f..c2d32d9b72 100644
--- a/src/test/modules/worker_spi/Makefile
+++ b/src/test/modules/worker_spi/Makefile
@@ -8,10 +8,10 @@ PGFILEDESC = "worker_spi - background worker example"
 
 REGRESS = worker_spi
 
-# enable our module in shared_preload_libraries for dynamic bgworkers
 REGRESS_OPTS = --temp-config $(top_srcdir)/src/test/modules/worker_spi/dynamic.conf
 
-# Disable installcheck to ensure we cover dynamic bgworkers.
+# Disable because these tests require "worker_spi.database = contrib_regression",
+# which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
 
 ifdef USE_PGXS
diff --git a/src/test/modules/worker_spi/dynamic.conf b/src/test/modules/worker_spi/dynamic.conf
index bfe015f664..d9fd463a53 100644
--- a/src/test/modules/worker_spi/dynamic.conf
+++ b/src/test/modules/worker_spi/dynamic.conf
@@ -1,2 +1 @@
-shared_preload_libraries = worker_spi
-worker_spi.database = contrib_regression
+worker_spi.database = contrib_regression
\ No newline at end of file
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index 7227cfaa45..f691fbec9f 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -282,7 +282,12 @@ _PG_init(void)
 {
 	BackgroundWorker worker;
 
-	/* get the configuration */
+	/* Get the configuration */
+
+	/*
+	 * These GUCs are defined even if this library is not loaded with
+	 * shared_preload_libraries, for 

Re: Support worker_spi to execute the function dynamically.

2023-07-20 Thread Masahiro Ikeda

On 2023-07-20 12:55, Michael Paquier wrote:

On Thu, Jul 20, 2023 at 11:15:51AM +0900, Masahiro Ikeda wrote:

While I'm working on the thread[1], I found that the function of
worker_spi module fails if 'shared_preload_libraries' doesn't have
worker_spi.


I guess that you were patching worker_spi to register dynamically a
wait event and embed that in a TAP test or similar without loading it
in shared_preload_libraries?  FWIW, you could use a trick like what I
am attaching here to load a wait event dynamically with the custom
wait event API.  You would need to make worker_spi_init_shmem() a bit
more aggressive with an extra hook to reserve a shmem area size, but
that's enough to show the custom wait event in the same backend as the
one that launches a worker_spi dynamically, while demonstrating how
the API can be used in this case.


Yes, you're right. When I tried using worker_spi to test wait events,
I ran into that behavior. Thanks a lot for your patch; I wasn't aware
of that approach. I'll merge it into the tests for wait events.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-20 Thread Melih Mutlu
Hi Peter,

Peter Smith wrote on Thu, 20 Jul 2023 at 07:10:

> Hi, I had a look at the latest 3 patch (v20-0003).
>
> Although this patch was recently modified, the updates are mostly only
> to make it compatible with the updated v20-0002 patch. Specifically,
> the v20-0003 updates did not yet address my review comments from
> v17-0003 [1].
>

Yes, I only addressed your reviews for 0001 and 0002, and rebased 0003 in
the latest patches, as stated here [1].

I'll update the patch soon according to recent reviews, including yours for
0003.


[1]
https://www.postgresql.org/message-id/CAGPVpCTvALKEXe0%3DN-%2BiMmVxVQ-%2BP8KZ_1qQ1KsSSZ-V9wJ5hw%40mail.gmail.com

Thanks for the reminder.
-- 
Melih Mutlu
Microsoft


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-20 Thread Dean Rasheed
On Thu, 20 Jul 2023 at 00:56, David Rowley  wrote:
>
> I noticed that 6fcda9aba added quite a lot of conditions that need to
> be checked before we process a normal decimal integer string. I think
> we could likely do better and code it to assume that most strings will
> be decimal and put that case first rather than last.

That sounds sensible, but ...

> In the attached, I've changed that for the 32-bit version only.  A
> more complete patch would need to do the 16 and 64-bit versions too.
>
> Without that, we need to check if the first digit is '0' a total of 3
> times and also check if the 2nd digit is any of 'x', 'X', 'o', 'O',
> 'b' or 'B'.

That's not what I see. For me, the compiler (gcc 11, using either -O2
or -O3) is smart enough to spot that the first part of each of the 3
checks is the same, and it only tests whether the first digit is '0'
once, before immediately jumping to the decimal parsing code. I didn't
test other compilers though, so I can believe that different compilers
might not be so smart.

OTOH, this test in your patch:

+    /* process decimal digits */
+    if (isdigit((unsigned char) ptr[0]) &&
+        (isdigit((unsigned char) ptr[1]) || ptr[1] == '_' ||
+         ptr[1] == '\0' || isspace(ptr[1])))

is doing more work than it needs to, and actually makes things
noticeably worse for me. It needs to do a minimum of 2 isdigit()
checks before it will parse the input as a decimal, whereas before
(for me at least) it just did one simple comparison of ptr[0] against
'0'.

I agree with the principle though. In the attached updated patch, I
replaced that test with a simpler one:

+/*
+ * Process the number's digits. We optimize for decimal input (expected to
+ * be the most common case) first. Anything that doesn't start with a base
+ * prefix indicator must be decimal.
+ */
+
+   /* process decimal digits */
+   if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))

So hopefully any compiler should only need to do the one comparison
against '0' for any non-zero decimal input.

Doing that didn't give any measurable performance improvement for me,
but it did at least not make it noticeably worse, and seems more
likely to generate better code with not-so-smart compilers. I'd be
interested to know how that performs for you (and if your compiler
really does generate 3 "first digit is '0'" tests for the unpatched
code).

Regards,
Dean

---

Here were my test results (where P1 is the "fix_COPY_DEFAULT.patch"),
and I tested COPY loading 50M rows:

HEAD + P1
=

7137.966 ms
7193.190 ms
7094.491 ms
7123.520 ms

HEAD + P1 + pg_strtoint32_base_10_first.patch
=

7561.658 ms
7548.282 ms
7551.360 ms
7560.671 ms

HEAD + P1 + pg_strtoint32_base_10_first.v2.patch
=

7238.775 ms
7184.937 ms
7123.257 ms
7159.618 ms
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
new file mode 100644
index 471fbb7..e2320e6
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -314,113 +314,124 @@ pg_strtoint32_safe(const char *s, Node *
 	else if (*ptr == '+')
 		ptr++;
 
-	/* process digits */
-	if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X'))
+	/*
+	 * Process the number's digits. We optimize for decimal input (expected to
+	 * be the most common case) first. Anything that doesn't start with a base
+	 * prefix indicator must be decimal.
+	 */
+
+	/* process decimal digits */
+	if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))
 	{
-		firstdigit = ptr += 2;
+		firstdigit = ptr;
 
 		while (*ptr)
 		{
-			if (isxdigit((unsigned char) *ptr))
+			if (isdigit((unsigned char) *ptr))
 			{
-				if (unlikely(tmp > -(PG_INT32_MIN / 16)))
+				if (unlikely(tmp > -(PG_INT32_MIN / 10)))
 					goto out_of_range;
 
-				tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
+				tmp = tmp * 10 + (*ptr++ - '0');
 			}
 			else if (*ptr == '_')
 			{
-				/* underscore must be followed by more digits */
+				/* underscore may not be first */
+				if (unlikely(ptr == firstdigit))
+					goto invalid_syntax;
+				/* and it must be followed by more digits */
 				ptr++;
-				if (*ptr == '\0' || !isxdigit((unsigned char) *ptr))
+				if (*ptr == '\0' || !isdigit((unsigned char) *ptr))
 					goto invalid_syntax;
 			}
 			else
 				break;
 		}
 	}
-	else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O'))
+	/* process hex digits */
+	else if (ptr[1] == 'x' || ptr[1] == 'X')
 	{
 		firstdigit = ptr += 2;
 
 		while (*ptr)
 		{
-			if (*ptr >= '0' && *ptr <= '7')
+			if (isxdigit((unsigned char) *ptr))
 			{
-				if (unlikely(tmp > -(PG_INT32_MIN / 8)))
+				if (unlikely(tmp > -(PG_INT32_MIN / 16)))
 					goto out_of_range;
 
-				tmp = tmp * 8 + (*ptr++ - '0');
+				tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
 			}
 			else if (*ptr == '_')
 			{
 				/* underscore must be followed by more digits */
 				ptr++;
-				if (*ptr == '\0' || *ptr < 

Re: Assertion failure in SnapBuildInitialSnapshot()

2023-07-20 Thread Daniel Gustafsson
> On 9 Feb 2023, at 07:32, Masahiko Sawada  wrote:

> I've attached the patch of this idea for discussion.

Amit, Andres: have you had a chance to look at the updated version of this
patch?

--
Daniel Gustafsson





Re: Atomic ops for unlogged LSN

2023-07-20 Thread Bharath Rupireddy
On Thu, Jul 20, 2023 at 12:25 AM John Morris
 wrote:
>
> >> Why don't we just use a barrier around reading the value? It's not like
> >> CreateCheckPoint() is frequent?
>
> One reason is that a barrier isn’t needed, and adding unnecessary barriers 
> can also be confusing.
>
> With respect to the “debug only” comment in the original code, whichever 
> value is written to the structure during a checkpoint will be reset when 
> restarting. Nobody will ever see that value. We could just as easily write a 
> zero.

>
> Shutting down is a different story. It isn’t stated, but we require the 
> unlogged LSN be quiescent before shutting down. This patch doesn’t change 
> that requirement.
>
> We could also argue that memory ordering doesn’t matter when filling in a 
> conventional, unlocked structure.  But the fact we have only two cases 1) 
> checkpoint where the value is ignored, and 2) shutdown where it is quiescent, 
> makes the additional argument unnecessary.
>
> Would you be more comfortable if I added comments describing the two cases? 
> My intent was to be minimalistic, but the original code could use better 
> explanation.

Here, the situation for unloggedLSN is that multiple backends can
increment/write it (via GetFakeLSNForUnloggedRel), and there can be one
reader at a time in CreateCheckPoint, holding
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE). Does incrementing
unloggedLSN atomically (with the implied memory barrier of
pg_atomic_fetch_add_u64) synchronize the multiple writers and the
reader? With a spinlock, writer-reader synchronization is guaranteed.
With an atomic variable, readers are guaranteed not to see a torn
value, but no further synchronization is provided.

If the above understanding is right, what happens if the reader sees an
old unloggedLSN value? The reader stores that old value in the control
file and the server restarts (clean exit). The old value is then loaded
back into unloggedLSN upon restart, and callers of
GetFakeLSNForUnloggedRel() will see an old/repeated value too. Is that a
problem?
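
For reference, the writer side under discussion is essentially this (a
sketch, assuming unloggedLSN becomes a pg_atomic_uint64 as in the patch):

XLogRecPtr
GetFakeLSNForUnloggedRel(void)
{
	/*
	 * pg_atomic_fetch_add_u64 is a full barrier and returns the value as it
	 * was before the addition, matching the old post-increment under the
	 * spinlock, so concurrent callers still get distinct values.
	 */
	return pg_atomic_fetch_add_u64(&XLogCtl->unloggedLSN, 1);
}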

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: logical decoding and replication of sequences, take 2

2023-07-20 Thread Ashutosh Bapat
Thanks Tomas for the updated patches.

Here are my comments on 0006 patch as well as 0002 patch.

On Wed, Jul 19, 2023 at 4:23 PM Tomas Vondra
 wrote:
>
> I think this is an accurate description of what the current patch does.
> And I think it's a reasonable behavior.
>
> My point is that if this gets released in PG17, it'll be difficult to
> change, even if it does not change on-disk format.
>

Yes. I agree. And I don't see any problem even if we are not able to change it.

>
> I need to think behavior about this a bit more, and maybe check how
> difficult would be implementing it.

Ok.

In most of the comments and in the documentation, there are some phrases
which do not look accurate.

A change to a sequence is being referred to as a "sequence increment". While
ascending sequences are common, PostgreSQL supports descending sequences as
well, where the changes will be decrements. And that's not the only case: a
sequence may be restarted with an older value, in which case the change could
be an increment or a decrement. I think the correct usage is 'changes to a
sequence' or 'sequence changes'.

A sequence being assigned a new relfilenode is referred to as the sequence
being created. This is confusing: when an existing sequence is ALTERed, we
do not "create" a new sequence; we "create" a new relfilenode and "assign"
it to that sequence.

PFA such edits in 0002 and 0006 patches. Let me know if those look
correct. I think we
need similar changes to the documentation and comments in other places.

>
> I did however look at the proposed alternative to the "created" flag.
> The attached 0006 part ditches the flag with XLOG_SMGR_CREATE decoding.
> The smgr_decode code needs a review (I'm not sure the
> skipping/fast-forwarding part is correct), but it seems to be working
> fine overall, although we need to ensure the WAL record has the correct XID.
>

Briefly describing the patch: when decoding a XLOG_SMGR_CREATE WAL
record, it adds the relfilenode mentioned in it to the sequences hash.
When decoding a sequence change record, it checks whether the
relfilenode in the WAL record is in the hash table. If it is, the
sequence change is deemed transactional; otherwise, non-transactional.
The change looks good to me. It simplifies the logic that decides
whether a sequence change is transactional or not.
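
In other words, the decision boils down to something like this (a
hypothetical helper for illustration, not the patch's actual function;
the hash is assumed to be keyed by RelFileLocator):

/*
 * A sequence change is transactional iff we saw the XLOG_SMGR_CREATE
 * record for its relfilenode while decoding, i.e. the relfilenode was
 * created by a transaction we are decoding.
 */
static bool
sequence_change_is_transactional(HTAB *created_relfilenodes,
								 RelFileLocator rlocator)
{
	bool		found;

	(void) hash_search(created_relfilenodes, &rlocator, HASH_FIND, &found);
	return found;
}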

In sequence_decode() we skip sequence changes when fast-forwarding.
Given that smgr_decode() only supplements sequence_decode(), I think
it's correct to do the same in smgr_decode() as well. Similarly, it is
correct to skip when we don't have a full snapshot.

Some minor comments on 0006 patch

+/* make sure the relfilenode creation is associated with the XID */
+if (XLogLogicalInfoActive())
+GetCurrentTransactionId();

I think this change is correct and is in line with similar changes in 0002. But
I looked at other places from where DefineRelation() is called. For regular
tables it is called from ProcessUtilitySlow() which in turn does not call
GetCurrentTransactionId(). I am wondering whether we are just discovering a
class of bugs caused by not associating an xid with a newly created
relfilenode.

+/*
+ * If we don't have snapshot or we are just fast-forwarding, there is no
+ * point in decoding changes.
+ */
+if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+ctx->fast_forward)
+return;

This code block is repeated.

+void
+ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid,
+   RelFileLocator rlocator)
+{
... snip ...
+
+/* sequence changes require a transaction */
+if (xid == InvalidTransactionId)
+return;

IIUC, with your changes in DefineSequence() in this patch, this should not
happen, so this condition will never be true. But in case it does happen, this
code will not add the relfilelocator to the hash table and we will deem the
sequence change non-transactional. Isn't it better to just throw an error
and stop replication if that (ever) happens?
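
Something like this, for example (a sketch):

	/* with the DefineSequence() change, an XID must always be assigned here */
	if (xid == InvalidTransactionId)
		elog(ERROR, "relfilenode creation is not associated with a transaction");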

Also some comments on 0002 patch

@@ -405,8 +405,19 @@ fill_seq_fork_with_data(Relation rel, HeapTuple
tuple, ForkNumber forkNum)

 /* check the comment above nextval_internal()'s equivalent call. */
 if (RelationNeedsWAL(rel))
+{
 GetTopTransactionId();

+/*
+ * Make sure the subtransaction has a XID assigned, so that the sequence
+ * increment WAL record is properly associated with it. This matters for
+ * increments of sequences created/altered in the transaction, which are
+ * handled as transactional.
+ */
+if (XLogLogicalInfoActive())
+GetCurrentTransactionId();
+}
+

I think we should separately commit the changes which add a call to
GetCurrentTransactionId(). That looks like an existing bug/anomaly
that exists irrespective of this patch.

+/*
+ * To support logical decoding of sequences, we require the sequence
+ * callback. We decide it here, but only check 

Re: [feature]COPY FROM enable FORCE_NULL/FORCE_NOT_NULL on all columns

2023-07-20 Thread Zhang Mingli
Hi,

On Jul 7, 2023 at 18:00 +0800, Damir Belyalov wrote:
>
> The V2 patch still had some errors when applying doc/src/sgml/ref/copy.sgml;
> I rebased and fixed it in the V3 patch.
> Thanks a lot for the review.
I have updated https://commitfest.postgresql.org/43/3896/ to status Ready for
Committer, thanks again.


Zhang Mingli
www.hashdata.xyz


Re: remaining sql/json patches

2023-07-20 Thread Amit Langote
Hello,

On Thu, Jul 20, 2023 at 10:35 AM jian he  wrote:
> On Tue, Jul 18, 2023 at 5:11 PM Amit Langote  wrote:
> > > Op 7/17/23 om 07:00 schreef jian he:
> > > > hi.
> > > > seems there is no explanation about, json_api_common_syntax in
> > > > functions-json.html
> > > >
> > > > I can get json_query full synopsis from functions-json.html as follows:
> > > > json_query ( context_item, path_expression [ PASSING { value AS
> > > > varname } [, ...]] [ RETURNING data_type [ FORMAT JSON [ ENCODING UTF8
> > > > ] ] ] [ { WITHOUT | WITH { CONDITIONAL | [UNCONDITIONAL] } } [ ARRAY ]
> > > > WRAPPER ] [ { KEEP | OMIT } QUOTES [ ON SCALAR STRING ] ] [ { ERROR |
> > > > NULL | EMPTY { [ ARRAY ] | OBJECT } | DEFAULT expression } ON EMPTY ]
> > > > [ { ERROR | NULL | EMPTY { [ ARRAY ] | OBJECT } | DEFAULT expression }
> > > > ON ERROR ])
> > > >
> > > > it seems there isn't a full synopsis for json_table, only a partial,
> > > > one-by-one explanation.
> >
> > I looked through the history of the docs portion of the patch and it
> > looks like the synopsis for JSON_TABLE(...) used to be there but was
> > taken out during one of the doc reworks [1].
> >
> > I've added it back in the patch, as I agree that it would help to have
> > it.  Though, I am not totally sure that where I've put it is the right
> > place for it.  JSON_TABLE() is a beast that won't fit into the table
> > that JSON_QUERY() et al are in, so maybe that's how it will have to
> > be?  I have no better idea.
>
> The attached screenshot renders the json_table syntax as almost plain HTML. It looks fine.

Thanks for checking.

> Based on the syntax, I am kind of confused by the following 2 cases:
> --1
> SELECT * FROM JSON_TABLE(jsonb '1', '$'
> COLUMNS (a int PATH 'strict $.a' default 1  ON EMPTY default 2 on error)
> ERROR ON ERROR) jt;
>
> --2
> SELECT * FROM JSON_TABLE(jsonb '1', '$'
> COLUMNS (a int PATH 'strict $.a' default 1  ON EMPTY default 2 on error)) 
> jt;
>
> the first one should yield a syntax error?

No.  Actually, the synopsis missed the optional ON ERROR clause that
can appear after COLUMNS(...).  Will fix it.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: There should be a way to use the force flag when restoring databases

2023-07-20 Thread Joan
Hi Gurjeet, that would be great. All the cases where a FORCE won't apply
make total sense (either complex scenarios or permission issues).

> It doesn't terminate if prepared transactions, active logical replication
> slots or subscriptions are present in the target database.
>
> This will fail if the current user has no permissions to terminate other
> connections.
>

Regards

Message from Gurjeet Singh on Wed, 19 Jul 2023 at 19:28:

> On Tue, Jul 18, 2023 at 12:53 AM Joan  wrote:
> >
> > Since posgres 13 there's the option to do a FORCE when dropping a
> database (so it disconnects current users) Documentation here:
> https://www.postgresql.org/docs/current/sql-dropdatabase.html
> >
> > I am currently using dir format for the output
> >pg_dump -d "bdname" -F d -j 4 -v -f /tmp/dir
> >
> > And restoring the database with
> >   pg_restore -d postgres -C -c --exit-on-error -F d -j 3 -v /tmp/dir
> >
> > Having an option to add the FORCE option to either the generated dump by
> pg_dump, or in the pg_restore would be very useful when restoring the
> databases to another servers so it would avoid having to do scripting.
> >
> > In my specific case I am using this to refresh periodically a
> development environment with data from production servers for a small
> database (~200M).
>
> Making force-drop a part of pg_dump output may be dangerous, and not
> provide much flexibility at restore time.
>
> Adding a force option to pg_restore feels like providing the right
> tradeoff.
>
> The docs for 'pg_restore --create` say "Create the database before
> restoring into it. If --clean is also specified, drop and recreate the
> target database before connecting to it."
>
> If we provided a force option, it may then additionally say: "If the
> --clean and --force options are specified, DROP DATABASE ... WITH
> FORCE command will be used to drop the database."
>
> Using WITH FORCE is not a guarantee, as the DROP DATABASE docs clarify
> the conditions under which it may fail.
>
> Best regards,
> Gurjeet
> http://Gurje.et
>


Re: POC: GROUP BY optimization

2023-07-20 Thread Andrey Lepikhov

On 3/10/2022 21:56, Tom Lane wrote:
> Revert "Optimize order of GROUP BY keys".
>
> This reverts commit db0d67db2401eb6238ccc04c6407a4fd4f985832 and
> several follow-on fixes.
> ...
> Since we're hard up against the release deadline for v15, let's
> revert these changes for now.  We can always try again later.

It may be time to restart the project. As a first step, I rebased the
patch on the current master. It wasn't trivial because of some recent
optimizations (a29eab, 1349d27 and 8d83a5d).
Now, let's repeat the review and rewrite the current patch according to
the reasons given in the revert commit.


--
regards,
Andrey Lepikhov
Postgres Professional
From 913d55ee887dccfeba360f5f44ed347dd1ba9044 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Fri, 14 Jul 2023 10:29:36 +0700
Subject: [PATCH] When evaluating a query with a multi-column GROUP BY clause
 using sort, the cost may be heavily dependent on the order in which the keys
 are compared when building the groups. Grouping does not imply any ordering,
 so we're allowed to compare the keys in arbitrary order, and a Hash Agg
 leverages this. But for Group Agg, we simply compared keys in the order as
 specified in the query. This commit explores alternative ordering of the
 keys, trying to find a cheaper one.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

In principle, we might generate grouping paths for all permutations of
the keys, and leave the rest to the optimizer. But that might get very
expensive, so we try to pick only a couple interesting orderings based
on both local and global information.

When planning the grouping path, we explore statistics (number of
distinct values, cost of the comparison function) for the keys and
reorder them to minimize comparison costs. Intuitively, it may be better
to perform more expensive comparisons (for complex data types etc.)
last, because maybe the cheaper comparisons will be enough. Similarly,
the higher the cardinality of a key, the lower the probability we’ll
need to compare more keys. The patch generates and costs various
orderings, picking the cheapest ones.

The ordering of group keys may interact with other parts of the query,
some of which may not be known while planning the grouping. E.g. there
may be an explicit ORDER BY clause, or some other ordering-dependent
operation, higher up in the query, and using the same ordering may allow
using either incremental sort or even eliminate the sort entirely.

The patch generates orderings and picks those minimizing the comparison
cost (for various pathkeys), and then adds orderings that might be
useful for operations higher up in the plan (ORDER BY, etc.). Finally,
it always keeps the ordering specified in the query, on the assumption
the user might have additional insights.

This introduces a new GUC enable_group_by_reordering, so that the
optimization may be disabled if needed.
---
 .../postgres_fdw/expected/postgres_fdw.out|  15 +-
 src/backend/optimizer/path/costsize.c | 363 ++-
 src/backend/optimizer/path/equivclass.c   |  13 +-
 src/backend/optimizer/path/pathkeys.c | 602 ++
 src/backend/optimizer/plan/planner.c  | 467 --
 src/backend/optimizer/util/pathnode.c |   4 +-
 src/backend/utils/adt/selfuncs.c  |  38 +-
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/nodes/pathnodes.h |  10 +
 src/include/optimizer/cost.h  |   4 +-
 src/include/optimizer/paths.h |   7 +
 src/include/utils/selfuncs.h  |   5 +
 src/test/regress/expected/aggregates.out  | 244 ++-
 .../regress/expected/incremental_sort.out |   2 +-
 src/test/regress/expected/join.out|  51 +-
 src/test/regress/expected/merge.out   |  15 +-
 .../regress/expected/partition_aggregate.out  |  44 +-
 src/test/regress/expected/partition_join.out  |  75 +--
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/expected/union.out   |  60 +-
 src/test/regress/sql/aggregates.sql   |  99 +++
 src/test/regress/sql/incremental_sort.sql |   2 +-
 23 files changed, 1760 insertions(+), 374 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out 
b/contrib/postgres_fdw/expected/postgres_fdw.out
index 852b5b4707..3f3c181ac8 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -9593,13 +9593,16 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 
ON (t1.a = t2.b) INNER J
 -- left outer join + nullable clause
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 
10) t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a < 10 ORDER BY 1,2,3;
-   
  

Re: ObjectIdGetDatum() missing from SearchSysCache*() callers

2023-07-20 Thread Michael Paquier
On Tue, Jul 18, 2023 at 07:27:02AM +0900, Michael Paquier wrote:
> On Mon, Jul 17, 2023 at 05:33:42PM +0300, Aleksander Alekseev wrote:
> > I can't be 100% sure but it looks like that's all of them. PFA the
> > updated patch v2.
> 
> Thanks.  Yes, this stuff is easy to miss.  I was just grepping for a
> few patterns and missed these two.

Spotted a few more of these things after a second lookup.

One for subscriptions:
src/backend/commands/alter.c:
if (SearchSysCacheExists2(SUBSCRIPTIONNAME, MyDatabaseId,

And two for transforms:
src/backend/utils/cache/lsyscache.c:
tup = SearchSysCache2(TRFTYPELANG, typid, langid);
src/backend/utils/cache/lsyscache.c:
tup = SearchSysCache2(TRFTYPELANG, typid, langid);

And applied the whole.  Thanks for looking and spot more of these
inconsistencies!
--
Michael


signature.asc
Description: PGP signature