Re: Force streaming every change in logical decoding

2022-12-06 Thread Masahiko Sawada
On Wed, Dec 7, 2022 at 12:55 PM Amit Kapila  wrote:
>
> On Wed, Dec 7, 2022 at 7:31 AM Masahiko Sawada  wrote:
> >
> > On Wed, Dec 7, 2022 at 8:46 AM Peter Smith  wrote:
> > >
> > > >
> > > > Yeah, I think this can also help in reducing the time for various
> > > > tests in test_decoding/stream and
> > > > src/test/subscription/t/*_stream_*.pl file by reducing the number of
> > > > changes required to invoke streaming mode. Can we think of making this
> > > > GUC extendible to even test more options on server-side (publisher)
> > > > and client-side (subscriber) cases? For example, we can have something
> > > > like logical_replication_mode with the following valid values: (a)
> > > > server_serialize: this will serialize each change to file on
> > > > publishers and then on commit restore and send all changes; (b)
> > > > server_stream: this will stream each change as currently proposed in
> > > > your patch. Then if we want to extend it for subscriber-side testing
> > > > then we can introduce new options like client_serialize for the case
> > > > being discussed in the email [1].
> > > >
> > > > Thoughts?
> > >
> > > There is potential for lots of developer GUCs for testing/debugging in
> > > the area of logical replication but IMO it might be better to keep
> > > them all separated. Putting everything into a single
> > > 'logical_replication_mode' might cause difficulties later when/if you
> > > want combinations of the different modes.
> >
> > I think we want the developer option that forces streaming changes
> > during logical decoding to be PGC_USERSET but probably the developer
> > option for testing the parallel apply feature would be PGC_SIGHUP.
> >
>
> Ideally, that is true but if we want to combine the multiple modes in
> one parameter, is there a harm in keeping it as PGC_SIGHUP?

It's not a big harm but we will end up doing ALTER SYSTEM and
pg_reload_conf() even in regression tests (e.g. in
test_decoding/stream.sql).

>
> > Also, since streaming changes is not specific to logical replication
> > but to logical decoding, I'm not sure logical_replication_XXX is a
> > good name. IMO having force_stream_mode and a different GUC for
> > testing the parallel apply feature makes sense to me.
> >
>
> But if we want to have a separate variable for testing/debugging
> streaming like force_stream_mode, why not for serializing as well? And
> if we want for both then we can even think of combining them in one
> variable as logical_decoding_mode with values as 'stream' and
> 'serialize'.

Making it an enum makes sense to me.
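
If we go with an enum, the guc_tables.c entry could look something like
the following (just a rough sketch; the variable name, description, and
enum values are placeholders rather than a concrete proposal):

static const struct config_enum_entry logical_decoding_mode_options[] = {
    {"buffered", LOGICAL_DECODING_MODE_BUFFERED, false},
    {"stream", LOGICAL_DECODING_MODE_STREAM, false},
    {"serialize", LOGICAL_DECODING_MODE_SERIALIZE, false},
    {NULL, 0, false}
};

{
    {"logical_decoding_mode", PGC_USERSET, DEVELOPER_OPTIONS,
        gettext_noop("Forces streaming or serialization of each change in logical decoding."),
        NULL,
        GUC_NOT_IN_SAMPLE
    },
    &logical_decoding_mode,
    LOGICAL_DECODING_MODE_BUFFERED, logical_decoding_mode_options,
    NULL, NULL, NULL
},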

> The first one specified would be given preference. Also,
> the name force_stream_mode doesn't seem to convey that it is for
> logical decoding.

Agreed.

> On one side having separate GUCs for publisher and subscriber seems to
> give better flexibility but having more GUCs also sometimes makes them
> less usable. Here, my thought was to have a single or as few GUCs as
> possible which can be extendible by providing multiple values instead
> of having different GUCs. I was trying to map this with the existing
> string parameters in developer options.

I see your point. On the other hand, I'm not sure it's a good idea in
general to control different features with one GUC. Perhaps developer
options could be an exception?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-06 Thread Masahiko Sawada
On Wed, Dec 7, 2022 at 1:29 PM Amit Kapila  wrote:
>
> On Wed, Dec 7, 2022 at 9:00 AM Masahiko Sawada  wrote:
> >
> > On Thu, Dec 1, 2022 at 7:17 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > > ---
> > > > if (am_parallel_apply_worker() && on_subinfo_change) {
> > > > /*
> > > >  * If a parallel apply worker exits due to the subscription
> > > >  * information change, we notify the leader apply worker so that the
> > > >  * leader can report more meaningful message in time and restart the
> > > >  * logical replication.
> > > >  */
> > > > pq_putmessage('X', NULL, 0);
> > > > }
> > > >
> > > > and
> > > >
> > > > ereport(ERROR,
> > > > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > >  errmsg("logical replication parallel apply worker exited
> > > > because of subscription information change")));
> > > >
> > > > Do we really need an additional message in case of 'X'? When we call
> > > > apply_worker_clean_exit with on_subinfo_change = true, we have reported 
> > > > the
> > > > error message such as:
> > > >
> > > > ereport(LOG,
> > > > (errmsg("logical replication parallel apply worker for subscription
> > > > \"%s\" will stop because of a parameter change",
> > > > MySubscription->name)));
> > > >
> > > > I think that reporting a similar message from the leader might not be
> > > > meaningful for users.
> > >
> > > The intention is to let leader report more meaningful message if a worker
> > > exited due to subinfo change. Otherwise, the leader is likely to report an
> > > error like " lost connection ... to parallel apply worker" when trying to 
> > > send
> > > data via shared memory if the worker exited. What do you think ?
> >
> > Agreed. But do we need to have the leader exit with an error in spite
> > of the fact that the worker cleanly exits? If the leader exits with an
> > error, the subscription will be disabled if disable_on_error is true,
> > right?
> >
>
> Right, but the leader will anyway exit at some point either due to an
> ERROR like "lost connection ... to parallel worker" or with a LOG
> like: "... will restart because of a parameter change" but I see your
> point. So, will it be better if we have a LOG message here and then
> proc_exit()? Do you have something else in mind for this?

No, I was thinking that too. It's better to write a LOG message and do
proc_exit().
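
For example, something like this on the leader side (just a sketch):

/* Parallel apply worker exited cleanly because of a parameter change. */
ereport(LOG,
        (errmsg("logical replication apply worker for subscription \"%s\" will restart because of a parameter change",
                MySubscription->name)));
proc_exit(0);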

Regarding the error "lost connection ... to parallel worker", it could
still happen depending on the timing even if the parallel worker
cleanly exits due to parameter changes, right? If so, I'm concerned
that it could lead to the subscription being disabled unexpectedly if
disable_on_error is enabled.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-06 Thread Masahiko Sawada
On Thu, Dec 1, 2022 at 7:17 PM houzj.f...@fujitsu.com
 wrote:
>
> On Thursday, December 1, 2022 3:58 PM Masahiko Sawada  
> wrote:
> >
> > On Wed, Nov 30, 2022 at 10:51 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Wednesday, November 30, 2022 9:41 PM houzj.f...@fujitsu.com
> >  wrote:
> > > >
> > > > On Tuesday, November 29, 2022 8:34 PM Amit Kapila
> > > > > Review comments on v53-0001*
> > > >
> > > > Attach the new version patch set.
> > >
> > > Sorry, there were some mistakes in the previous patch set.
> > > Here is the correct V54 patch set. I also ran pgindent for the patch set.
> > >
> >
> > Thank you for updating the patches. Here are random review comments for
> > 0001 and 0002 patches.
>
> Thanks for the comments!
>
> >
> > ereport(ERROR,
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >  errmsg("logical replication parallel apply worker exited
> > abnormally"),
> >  errcontext("%s", edata.context))); and
> >
> > ereport(ERROR,
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >  errmsg("logical replication parallel apply worker exited
> > because of subscription information change")));
> >
> > I'm not sure ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is appropriate
> > here. Given that parallel apply worker has already reported the error 
> > message
> > with the error code, I think we don't need to set the errorcode for the logs
> > from the leader process.
> >
> > Also, I'm not sure the term "exited abnormally" is appropriate since we use 
> > it
> > when the server crashes for example. I think ERRORs reported here don't mean
> > that in general.
>
> How about reporting "xxx worker exited due to error" ?

Sounds better to me.

>
> > ---
> > if (am_parallel_apply_worker() && on_subinfo_change) {
> > /*
> >  * If a parallel apply worker exits due to the subscription
> >  * information change, we notify the leader apply worker so that the
> >  * leader can report more meaningful message in time and restart the
> >  * logical replication.
> >  */
> > pq_putmessage('X', NULL, 0);
> > }
> >
> > and
> >
> > ereport(ERROR,
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >  errmsg("logical replication parallel apply worker exited
> > because of subscription information change")));
> >
> > Do we really need an additional message in case of 'X'? When we call
> > apply_worker_clean_exit with on_subinfo_change = true, we have reported the
> > error message such as:
> >
> > ereport(LOG,
> > (errmsg("logical replication parallel apply worker for subscription
> > \"%s\" will stop because of a parameter change",
> > MySubscription->name)));
> >
> > I think that reporting a similar message from the leader might not be
> > meaningful for users.
>
> The intention is to let leader report more meaningful message if a worker
> exited due to subinfo change. Otherwise, the leader is likely to report an
> error like " lost connection ... to parallel apply worker" when trying to send
> data via shared memory if the worker exited. What do you think ?

Agreed. But do we need to have the leader exit with an error in spite
of the fact that the worker cleanly exits? If the leader exits with an
error, the subscription will be disabled if disable_on_error is true,
right?

And what do you think about the error code?

>
> > ---
> > -if (options->proto.logical.streaming &&
> > -PQserverVersion(conn->streamConn) >= 14)
> > -appendStringInfoString(&cmd, ", streaming 'on'");
> > +if (options->proto.logical.streaming_str)
> > +appendStringInfo(&cmd, ", streaming '%s'",
> > + options->proto.logical.streaming_str);
> >
> > and
> >
> > +/*
> > + * Assign the appropriate option value for streaming option
> > according to
> > + * the 'streaming' mode and the publisher's ability to
> > support that mode.
> > + */
> > +if (server_version >= 16 &&
> > +MySubscription->stream == SUBSTREAM_PARALLEL)
> > 

Re: Force streaming every change in logical decoding

2022-12-06 Thread Masahiko Sawada
On Wed, Dec 7, 2022 at 8:46 AM Peter Smith  wrote:
>
> On Tue, Dec 6, 2022 at 9:29 PM Amit Kapila  wrote:
> >
> > On Tue, Dec 6, 2022 at 11:53 AM shiy.f...@fujitsu.com
> >  wrote:
> > >
> > > Hi hackers,
> > >
> > > In logical decoding, when logical_decoding_work_mem is exceeded, the 
> > > changes are
> > > sent to output plugin in streaming mode. But there is a restriction that 
> > > the
> > > minimum value of logical_decoding_work_mem is 64kB. I tried to add a GUC 
> > > to
> > > allow sending every change to output plugin without waiting until
> > > logical_decoding_work_mem is exceeded.
> > >
> > > This helps to test streaming mode. For example, to test "Avoid streaming 
> > > the
> > > transaction which are skipped" [1], it needs many XLOG_XACT_INVALIDATIONS
> > > messages. With the new option, it can be tested with fewer changes and in 
> > > less
> > > time. Also, this new option helps to test more scenarios for "Perform 
> > > streaming
> > > logical transactions by background workers" [2].
> > >
>
> +1
>
> >
> > Yeah, I think this can also help in reducing the time for various
> > tests in test_decoding/stream and
> > src/test/subscription/t/*_stream_*.pl file by reducing the number of
> > changes required to invoke streaming mode. Can we think of making this
> > GUC extendible to even test more options on server-side (publisher)
> > and client-side (subscriber) cases? For example, we can have something
> > like logical_replication_mode with the following valid values: (a)
> > server_serialize: this will serialize each change to file on
> > publishers and then on commit restore and send all changes; (b)
> > server_stream: this will stream each change as currently proposed in
> > your patch. Then if we want to extend it for subscriber-side testing
> > then we can introduce new options like client_serialize for the case
> > being discussed in the email [1].
> >
> > Thoughts?
>
> There is potential for lots of developer GUCs for testing/debugging in
> the area of logical replication but IMO it might be better to keep
> them all separated. Putting everything into a single
> 'logical_replication_mode' might cause difficulties later when/if you
> want combinations of the different modes.

I think we want the developer option that forces streaming changes
during logical decoding to be PGC_USERSET but probably the developer
option for testing the parallel apply feature would be PGC_SIGHUP.
Also, since streaming changes is not specific to logical replication
but to logical decoding, I'm not sure logical_replication_XXX is a
good name. IMO having force_stream_mode and a different GUC for
testing the parallel apply feature makes sense to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Force streaming every change in logical decoding

2022-12-06 Thread Masahiko Sawada
On Tue, Dec 6, 2022 at 7:29 PM Amit Kapila  wrote:
>
> On Tue, Dec 6, 2022 at 11:53 AM shiy.f...@fujitsu.com
>  wrote:
> >
> > Hi hackers,
> >
> > In logical decoding, when logical_decoding_work_mem is exceeded, the 
> > changes are
> > sent to output plugin in streaming mode. But there is a restriction that the
> > minimum value of logical_decoding_work_mem is 64kB. I tried to add a GUC to
> > allow sending every change to output plugin without waiting until
> > logical_decoding_work_mem is exceeded.
> >
> > This helps to test streaming mode. For example, to test "Avoid streaming the
> > transaction which are skipped" [1], it needs many XLOG_XACT_INVALIDATIONS
> > messages. With the new option, it can be tested with fewer changes and in 
> > less
> > time. Also, this new option helps to test more scenarios for "Perform 
> > streaming
> > logical transactions by background workers" [2].
> >
>
> Yeah, I think this can also help in reducing the time for various
> tests in test_decoding/stream and
> src/test/subscription/t/*_stream_*.pl file by reducing the number of
> changes required to invoke streaming mode.

+1

> Can we think of making this
> GUC extendible to even test more options on server-side (publisher)
> and client-side (subscriber) cases? For example, we can have something
> like logical_replication_mode with the following valid values: (a)
> server_serialize: this will serialize each change to file on
> publishers and then on commit restore and send all changes; (b)
> server_stream: this will stream each change as currently proposed in
> your patch. Then if we want to extend it for subscriber-side testing
> then we can introduce new options like client_serialize for the case
> being discussed in the email [1].

Does setting logical_replication_mode = 'client_serialize' imply that
the publisher behaves as server_stream? Or do you mean we can set
something like logical_replication_mode = 'server_stream,
client_serialize'?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-12-05 Thread Masahiko Sawada
s info after updating
pvs->shared->idx_completed_progress.

---
+/*
+ * Check if we are done vacuuming indexes and report
+ * progress.

How about "Waiting for all indexes to be vacuumed while updating the
parallel index vacuum progress"?

+ *
+ * We nap using with a WaitLatch to avoid a busy loop.
+ *
+ * Note: This function should be used by the leader process only,
+ * and it's up to the caller to ensure this.
+ */

I think these comments are not necessary.

+void
+parallel_wait_for_workers_to_finish(ParallelVacuumState *pvs)

How about "parallel_vacuum_wait_to_finish"?

---
+/*
+ * Read the shared ParallelVacuumProgress and update progress.h
+ * with indexes vacuumed so far. This function is called periodically
+ * by index AMs as well as parallel_vacuum_process_one_index.
+ *
+ * To avoid unnecessarily updating progress, we check the progress
+ * values from the backend entry and only update if the value
+ * of completed indexes increases.
+ *
+ * Note: This function should be used by the leader process only,
+ * and it's up to the caller to ensure this.
+ */
+void
+parallel_vacuum_update_progress(void)
+{
+volatile PgBackendStatus *beentry = MyBEEntry;
+
+Assert(!IsParallelWorker);
+
+if (beentry && ParallelVacuumProgress)
+{
+int parallel_vacuum_current_value =
beentry->st_progress_param[PROGRESS_VACUUM_INDEX_COMPLETED];
+int parallel_vacuum_new_value =
pg_atomic_read_u32(ParallelVacuumProgress);
+
+if (parallel_vacuum_new_value > parallel_vacuum_current_value)
+
pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
parallel_vacuum_new_value);
+}
+}

parallel_vacuum_update_progress() is typically called every 1GB, so I
think we don't need to worry about unnecessary updates. Also, I think
this code doesn't work when pgstat_track_activities is false. Instead,
I think that in parallel_wait_for_workers_to_finish(), we can check
the value of pvs->nindexes_completed and update the progress if there
is an update or it's the first time.

---
+(void) WaitLatch(MyLatch, WL_TIMEOUT | WL_LATCH_SET |
WL_EXIT_ON_PM_DEATH, PARALLEL_VACUUM_PROGRESS_TIMEOUT,
+
WAIT_EVENT_PARALLEL_VACUUM_FINISH);
+ResetLatch(MyLatch);

I think we don't necessarily need to use
PARALLEL_VACUUM_PROGRESS_TIMEOUT here. Probably we can use 1000L
instead. If we want to use PARALLEL_VACUUM_PROGRESS_TIMEOUT, we need
comments for that:

+#define PARALLEL_VACUUM_PROGRESS_TIMEOUT   1000

---
-WAIT_EVENT_XACT_GROUP_UPDATE
+WAIT_EVENT_XACT_GROUP_UPDATE,
+WAIT_EVENT_PARALLEL_VACUUM_FINISH
 } WaitEventIPC;

 Enums of WaitEventIPC should be defined in alphabetical order.

---
cfbot fails.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-01 Thread Masahiko Sawada
On Thu, Dec 1, 2022 at 4:00 PM John Naylor  wrote:
>
>
> On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada  
> wrote:
> >
> > I've investigated this issue and have a question about using atomic
> > variables on palloc'ed memory. In non-parallel vacuum cases,
> > radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
> > the memory allocated by aset.c is 4-bytes aligned so these atomic
> > variables are not always 8-bytes aligned. Is there any way to enforce
> > 8-bytes aligned memory allocations in 32-bit machines?
>
> The bigger question in my mind is: Why is there an atomic variable in 
> backend-local memory?

Because I use the same radix_tree and radix_tree_control structs for
non-parallel and parallel vacuum. Therefore, radix_tree_control is
allocated in DSM for parallel-vacuum cases or in backend-local memory
for non-parallel vacuum cases.
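
Roughly, the allocation looks like this (a simplified sketch, not the
exact code in the patch):

radix_tree_control *ctl;

if (dsa != NULL)
{
    /* parallel vacuum: the control data lives in DSM */
    dsa_pointer dp = dsa_allocate0(dsa, sizeof(radix_tree_control));

    ctl = (radix_tree_control *) dsa_get_address(dsa, dp);
}
else
{
    /* non-parallel vacuum: backend-local memory allocated via aset.c */
    ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
}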

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-30 Thread Masahiko Sawada
ialized on first use). */
+static HTAB *ParallelApplyWorkersHash = NULL;
+
+/*
+ * A list to maintain the active parallel apply workers. The information for
+ * the new worker is added to the list after successfully launching it. The
+ * list entry is removed if there are already enough workers in the worker
+ * pool either at the end of the transaction or while trying to find a free
+ * worker for applying the transaction. For more information about the worker
+ * pool, see comments atop this file.
+ */
+static List *ParallelApplyWorkersList = NIL;

The names ParallelApplyWorkersHash and ParallelApplyWorkersList are very
similar but their usages are completely different. Probably we can find
better names such as ParallelApplyTxnHash and ParallelApplyWorkerPool.
We can probably also add more comments for ParallelApplyWorkersHash.

---
if (winfo->serialize_changes ||
napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
{
int slot_no;
uint16  generation;

SpinLockAcquire(&winfo->shared->mutex);
generation = winfo->shared->logicalrep_worker_generation;
slot_no = winfo->shared->logicalrep_worker_slot_no;
SpinLockRelease(&winfo->shared->mutex);

logicalrep_pa_worker_stop(slot_no, generation);

pa_free_worker_info(winfo);

return true;
}

/* Unlink any files that were needed to serialize partial changes. */
if (winfo->serialize_changes)
stream_cleanup_files(MyLogicalRepWorker->subid, winfo->shared->xid);

If winfo->serialize_changes is true, we return true in the first if
statement. So stream_cleanup_files in the second if statement is never
executed.

---
+/*
+ * First, try to get a parallel apply worker from the pool,
if available.
+ * Otherwise, try to start a new parallel apply worker.
+ */
+winfo = pa_get_available_worker();
+if (!winfo)
+{
+winfo = pa_init_and_launch_worker();
+if (!winfo)
+return;
+}

I think we don't necessarily need two separate functions for getting a
worker from the pool and launching a new worker; it seems to reduce
readability. Instead, I think we can have one function that returns a
winfo from the worker pool if a free worker is available and otherwise
launches a new worker. That way, we can simply do:

winfo = pg_launch_parallel_worker()
if (!winfo)
return;
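
The combined function could be as simple as the following sketch
(assuming the winfo struct is ParallelApplyWorkerInfo, and reusing the
two existing helpers just for illustration):

static ParallelApplyWorkerInfo *
pa_launch_parallel_worker(void)
{
    ParallelApplyWorkerInfo *winfo;

    /* Reuse a free worker from the pool if one is available. */
    winfo = pa_get_available_worker();
    if (winfo != NULL)
        return winfo;

    /* Otherwise try to start a new parallel apply worker. */
    return pa_init_and_launch_worker();
}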

---
+/* Setup replication origin tracking. */
+StartTransactionCommand();
+ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
+
 originname, sizeof(originname));
+originid = replorigin_by_name(originname, true);
+if (!OidIsValid(originid))
+originid = replorigin_create(originname);

This code looks like it allows parallel workers to use different origins
in cases where the origin doesn't exist, but is that okay? Shouldn't we
pass missing_ok = false in this case?

---
cfbot seems to fail:

https://cirrus-ci.com/task/6264595342426112

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-30 Thread Masahiko Sawada
On Wed, Nov 30, 2022 at 7:54 PM Amit Kapila  wrote:
>
> On Tue, Nov 29, 2022 at 10:18 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > Attach the new version patch which addressed all comments.
> >
>
> Some comments on v53-0002*
> 
> 1. I think testing the scenario where the shm_mq buffer is full
> between the leader and parallel apply worker would require a large
> amount of data and then also there is no guarantee. How about having a
> developer GUC [1] force_apply_serialize which allows us to serialize
> the changes and only after commit the parallel apply worker would be
> allowed to apply it?

+1

The code coverage report shows that we don't cover the partial
serialization code paths. This GUC would improve the code coverage.
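
The GUC entry itself could be as simple as the following sketch (the
name comes from the proposal above; the category, description, and flags
are just guesses):

{
    {"force_apply_serialize", PGC_SIGHUP, DEVELOPER_OPTIONS,
        gettext_noop("Forces the leader apply worker to serialize streamed changes to files."),
        NULL,
        GUC_NOT_IN_SAMPLE
    },
    &force_apply_serialize,
    false,
    NULL, NULL, NULL
},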

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-30 Thread Masahiko Sawada
On Wed, Nov 23, 2022 at 2:10 AM Andres Freund  wrote:
>
> On 2022-11-21 17:06:56 +0900, Masahiko Sawada wrote:
> > Sure. I've attached the v10 patches. 0004 is the pure refactoring
> > patch and 0005 patch introduces the pointer tagging.
>
> This failed on cfbot, with som many crashes that the VM ran out of disk for
> core dumps. During testing with 32bit, so there's probably something broken
> around that.
>
> https://cirrus-ci.com/task/4635135954386944
>
> A failure is e.g. at: 
> https://api.cirrus-ci.com/v1/artifact/task/4635135954386944/testrun/build-32/testrun/adminpack/regress/log/initdb.log
>
> performing post-bootstrap initialization ... 
> ../src/backend/lib/radixtree.c:1696:21: runtime error: member access within 
> misaligned address 0x590faf74 for type 'struct radix_tree_control', which 
> requires 8 byte alignment
> 0x590faf74: note: pointer points here
>   90 11 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> 00 00 00 00 00 00 00 00
>   ^

radix_tree_control struct has two pg_atomic_uint64 variables, and the
assertion check in pg_atomic_init_u64() failed:

static inline void
pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
{
/*
 * Can't necessarily enforce alignment - and don't need it - when using
 * the spinlock based fallback implementation. Therefore only assert when
 * not using it.
 */
#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
AssertPointerAlignment(ptr, 8);
#endif
pg_atomic_init_u64_impl(ptr, val);
}

I've investigated this issue and have a question about using atomic
variables on palloc'ed memory. In non-parallel vacuum cases,
radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
the memory allocated by aset.c is 4-bytes aligned so these atomic
variables are not always 8-bytes aligned. Is there any way to enforce
8-bytes aligned memory allocations in 32-bit machines?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-30 Thread Masahiko Sawada
On Tue, Nov 29, 2022 at 1:36 PM John Naylor
 wrote:
>
> While creating a benchmark for inserting into node128-inner, I found a bug. 
> If a caller deletes from a node128, the slot index is set to invalid, but the 
> child pointer is still valid. Do that a few times, and every child pointer is 
> valid, even if no slot index points to it. When the next inserter comes 
> along, something surprising happens. This function:
>
> /* Return an unused slot in node-128 */
> static int
> node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
> {
>   int slotpos = 0;
>
>   Assert(!NODE_IS_LEAF(node));
>   while (node_inner_128_is_slot_used(node, slotpos))
>   slotpos++;
>
>   return slotpos;
> }
>
> ...passes an integer to this function, whose parameter is a uint8:
>
> /* Is the slot in the node used? */
> static inline bool
> node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
> {
>   Assert(!NODE_IS_LEAF(node));
>   return (node->children[slot] != NULL);
> }
>
> ...so instead of growing the node unnecessarily or segfaulting, it enters an 
> infinite loop doing this:
>
> add eax, 1
> movzx   ecx, al
> cmp QWORD PTR [rbx+264+rcx*8], 0
> jne .L147
>
> The fix is easy enough -- set the child pointer to null upon deletion,

Good catch!

> but I'm somewhat astonished that the regression tests didn't hit this. I do 
> still intend to replace this code with something faster, but before I do so 
> the tests should probably exercise the deletion paths more. Since VACUUM

Indeed, there are some tests for deletion but all of them delete all
keys in the node so we end up deleting the node. I've added tests of
repeating deletion and insertion as well as additional assertions.
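
For reference, the fix in the node-128 delete path is essentially the
following (a sketch; the slot-index field and macro names here are
assumptions and may differ from the actual patch):

static inline void
node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
{
    int slotpos = node->slot_idxs[chunk];

    Assert(!NODE_IS_LEAF(node));
    node->children[slotpos] = NULL;         /* previously left dangling */
    node->slot_idxs[chunk] = INVALID_SLOT_IDX;
}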

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-29 Thread Masahiko Sawada
On Fri, Nov 25, 2022 at 6:47 PM John Naylor
 wrote:
>
>
>
> On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada  wrote:
> >
> > [v11]
>
> There is one more thing that just now occurred to me: In expanding the use of 
> size classes, that makes rebasing and reworking the shared memory piece more 
> work than it should be. That's important because there are still some open 
> questions about the design around shared memory. To keep unnecessary churn to 
> a minimum, perhaps we should limit size class expansion to just one (or 5 
> total size classes) for the near future?

Makes sense. We can add size classes once we have a good design and
implementation around shared memory.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-29 Thread Masahiko Sawada
On Fri, Nov 25, 2022 at 5:00 PM John Naylor
 wrote:
>
> On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada  wrote:
> >
> > So it seems that there are two candidates of rt_node structure: (1)
> > all nodes except for node256 are variable-size nodes and use pointer
> > tagging, and (2) node32 and node128 are variable-sized nodes and do
> > not use pointer tagging (fanout is in part of only these two nodes).
> > rt_node can be 5 bytes in both cases. But before going to this step, I
> > started to verify the idea of variable-size nodes by using 6-bytes
> > rt_node. We can adjust the node kinds and node classes later.
>
> First, I'm glad you picked up the size class concept and expanded it. (I have 
> some comments about some internal APIs below.)
>
> Let's leave the pointer tagging piece out until the main functionality is 
> committed. We have all the prerequisites in place, except for a benchmark 
> random enough to demonstrate benefit. I'm still not quite satisfied with how 
> the shared memory coding looked, and that is the only sticky problem we still 
> have, IMO. The rest is "just work".
>
> That said, (1) and (2) above are still relevant -- variable sizing any given 
> node is optional, and we can refine as needed.
>
> > Overall, the idea of variable-sized nodes is good, smaller size
> > without losing search performance.
>
> Good.
>
> > I'm going to check the load
> > performance as well.
>
> Part of that is this, which gets called a lot more now, when node1 expands:
>
> + if (inner)
> + newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
> + rt_node_kind_info[kind].inner_size);
> + else
> + newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
> + rt_node_kind_info[kind].leaf_size);
>
> Since memset for expanding size class is now handled separately, these can 
> use the non-zeroing versions. When compiling MemoryContextAllocZero, the 
> compiler has no idea how big the size is, so it assumes the worst and 
> optimizes for large sizes. On x86-64, that means using "rep stos", which 
> calls microcode found in the CPU's ROM. This is slow for small sizes. The 
> "init" function should be always inline with const parameters where possible. 
> That way, memset can compile to a single instruction for the smallest node 
> kind. (More on alloc/init below)

Right. I forgot to update it.

>
> Note, there is a wrinkle: As currently written inner_node128 searches the 
> child pointers for NULL when inserting, so when expanding from partial to 
> full size class, the new node must be zeroed (Worth fixing in the short term. 
> I thought of this while writing the proof-of-concept for size classes, but 
> didn't mention it.) Medium term, rather than special-casing this, I actually 
> want to rewrite the inner-node128 to be more similar to the leaf, with an 
> "isset" array, but accessed and tested differently. I guarantee it's *really* 
> slow now to load (maybe somewhat true even for leaves), but I'll leave the 
> details for later.

Agreed. I'll start with zeroing out the node when expanding from a
partial to a full size class.

> Regarding node128 leaf, note that it's slightly larger than a DSA size class, 
> and we can trim it to fit:
>
> node61:  6 + 256+(2) +16 +  61*8 =  768
> node125: 6 + 256+(2) +16 + 125*8 = 1280

Agreed, changed.

>
> > I've attached the patches I used for the verification. I don't include
> > patches for pointer tagging, DSA support, and vacuum integration since
> > I'm investigating the issue on cfbot that Andres reported. Also, I've
> > modified tests to improve the test coverage.
>
> Sounds good. For v12, I think size classes have proven themselves, so v11's 
> 0002/4/5 can be squashed. Plus, some additional comments:
>
> +/* Return a new and initialized node */
> +static rt_node *
> +rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, 
> bool inner)
> +{
> + rt_node *newnode;
> +
> + newnode = rt_alloc_node(tree, kind, inner);
> + rt_init_node(newnode, kind, shift, chunk, inner);
> +
> + return newnode;
> +}
>
> I don't see the point of a function that just calls two functions.

Removed.

>
> +/*
> + * Create a new node with 'new_kind' and the same shift, chunk, and
> + * count of 'node'.
> + */
> +static rt_node *
> +rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
> +{
> + rt_node*newnode;
> +
> + newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
> + node->shift > 0);
> + newnode->count = node->count;
> +
> + return newnode;
> +}
>
> This, in turn, just calls a function that does _a

Re: Fix comment in SnapBuildFindSnapshot

2022-11-28 Thread Masahiko Sawada
On Tue, Nov 29, 2022 at 8:54 AM Michael Paquier  wrote:
>
> On Mon, Nov 28, 2022 at 04:46:44PM +0900, Michael Paquier wrote:
> > Hm, yes, that seems right.  There are three "c) states" in these
> > paragraphs, they are incremental steps.  Will apply if there are no
> > objections.
>
> And done.

Thank you!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Fix comment in SnapBuildFindSnapshot

2022-11-27 Thread Masahiko Sawada
Hi,

We have the following comment in SnapBuildFindSnapshot():

* c) transition from FULL_SNAPSHOT to CONSISTENT.
*
* In FULL_SNAPSHOT state (see d) ), and this xl_running_xacts'

It mentions "(state d) )", which seems like a typo of "(state d)", but
there is no "state d" in the first place. Reading the discussion of
the commit 955a684e040 that introduced this comment, this was a
comment for an old version patch[1]. So I think we can remove this
part.

I've attached the patch.

Regards,

[1] 
https://www.postgresql.org/message-id/20170505004237.edtahvrwb3uwd5rs%40alap3.anarazel.de

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_comment_in_SnapBuildFindSnapshot.patch
Description: Binary data


Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-27 Thread Masahiko Sawada
On Fri, Nov 25, 2022 at 5:58 PM Maxim Orlov  wrote:
>
>
>
> On Fri, 25 Nov 2022 at 09:40, Amit Kapila  wrote:
>>
>> On Thu, Nov 24, 2022 at 4:43 PM Amit Kapila  wrote:
>> >
>> > On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada  
>> > wrote:
>> > >
>> > > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila  
>> > > wrote:
>> > >
>> > > Agreed not to have a test case for this.
>> > >
>> > > I've attached an updated patch. I've confirmed this patch works for
>> > > all supported branches.
>> > >
>> >
>> > I have slightly changed the checks used in the patch, otherwise looks
>> > good to me. I am planning to push (v11-v15) the attached tomorrow
>> > unless there are more comments.
>> >
>>
>> Pushed.
>
> A big thanks to you! Could you also, close the commitfest entry 
> https://commitfest.postgresql.org/41/4024/, please?

Closed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-24 Thread Masahiko Sawada
On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila  wrote:
>
> On Tue, Nov 22, 2022 at 10:33 PM Maxim Orlov  wrote:
> >>
> >>
> >> Regarding the tests, the patch includes a new scenario to
> >> reproduce this issue. However, since the issue can be reproduced also
> >> by the existing scenario (with low probability, though), I'm not sure
> >> it's worth adding the new scenario.
> >
> > AFAICS, the test added doesn't 100% reproduce this issue, so, maybe, it 
> > does not worth it. But, I do not have a strong opinion here.
> > Let's add tests in a separate commit and let the actual committer to decide 
> > what to do, should we?
> >
>
> +1 to not have a test for this as the scenario can already be tested
> by the existing set of tests.

Agreed not to have a test case for this.

I've attached an updated patch. I've confirmed this patch works for
all supported branches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


0001-Reset-InitialRunningXacts-array-when-freeing-SnapBui.patch
Description: Binary data


Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-22 Thread Masahiko Sawada
Hi,

On Tue, Nov 22, 2022 at 6:37 PM Amit Kapila  wrote:
>
> On Mon, Nov 21, 2022 at 6:17 PM Maxim Orlov  wrote:
> >
> > PROBLEM
> >
> > After some investigation, I think, the problem is in the snapbuild.c 
> > (commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts
> > array in the context of builder->context, but for the time when we call 
> > SnapBuildPurgeOlderTxn this context may be already free'd.
> >
>
> I think you are seeing it freed in SnapBuildPurgeOlderTxn when we
> finish and restart decoding in the same session. After finishing the
> first decoding, it frees the decoding context but we forgot to reset
> NInitialRunningXacts and InitialRunningXacts array. So, next time when
> we start decoding in the same session where we don't restore any
> serialized snapshot, it can lead to the problem you are seeing because
> NInitialRunningXacts (and InitialRunningXacts array) won't have sane
> values.
>
> This can happen in the catalog_change_snapshot test as we have
> multiple permutations and those use the same session across a restart
> of decoding.

I have the same analysis. In order to restart the decoding from an
LSN where we don't restore any serialized snapshot, we somehow need to
advance the slot's restart_lsn. In this case, I think it happened
because the same session drops the slot at the end of the first
scenario and creates a replication slot with the same name at the
beginning of the second scenario in catalog_change_snapshot.spec.

>
> >
> > Simple fix like:
> > @@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr 
> > lsn, xl_running_xacts *runn
> >  * changes. See SnapBuildXidSetCatalogChanges.
> >  */
> > NInitialRunningXacts = nxacts;
> > -   InitialRunningXacts = MemoryContextAlloc(builder->context, 
> > sz);
> > +   InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, 
> > sz);
> > memcpy(InitialRunningXacts, running->xids, sz);
> > qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), 
> > xidComparator);
> >
> > seems to solve the described problem, but I'm not in the context of [0] and 
> > why array is allocated in builder->context.
> >
>
> It will leak the memory for InitialRunningXacts. We need to reset
> NInitialRunningXacts (and InitialRunningXacts) as mentioned above.
>
> Thank you for the report and initial analysis. I have added Sawada-San
> to know his views as he was the primary author of this work.

Thanks!

I've attached a draft patch. To fix it, I think we can reset
InitialRunningXacts and NInitialRunningXacts at FreeSnapshotBuilder()
and add an assertion in AllocateSnapshotBuilder() to make sure both
are reset. Regarding the tests, the patch includes a new scenario to
reproduce this issue. However, since the issue can be reproduced also
by the existing scenario (with low probability, though), I'm not sure
it's worth adding the new scenario.
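
The core of the fix is just the following (sketch):

/* In FreeSnapshotBuilder(): forget state from the previous decoding run. */
NInitialRunningXacts = 0;
InitialRunningXacts = NULL;

/* In AllocateSnapshotBuilder(): make sure nothing is left over. */
Assert(NInitialRunningXacts == 0);
Assert(InitialRunningXacts == NULL);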

I've not checked if the patch works for version 14 or older yet.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


reset_initial_running_xacts.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-21 Thread Masahiko Sawada
On Mon, Nov 21, 2022 at 4:20 PM John Naylor
 wrote:
>
>
> On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> > One issue with this patch: The "fanout" member is a uint8, so it can't hold 
> > 256 for the largest node kind. That's not an issue in practice, since we 
> > never need to grow it, and we only compare that value with the count in an 
> > Assert(), so I just set it to zero. That does break an invariant, so it's 
> > not great. We could use 2 bytes to be strictly correct in all cases, but 
> > that limits what we can do with the smallest node kind.
>
> Thinking about this part, there's an easy resolution -- use a different macro 
> for fixed- and variable-sized node kinds to determine if there is a free slot.
>
> Also, I wanted to share some results of adjusting the boundary between the 
> two smallest node kinds. In the hackish attached patch, I modified the fixed 
> height search benchmark to search a small (within L1 cache) tree thousands of 
> times. For the first set I modified node4's maximum fanout and filled it up. 
> For the second, I set node4's fanout to 1, which causes 2+ to spill to node32 
> (actually the partially-filled node15 size class as demoed earlier).
>
> node4:
>
> NOTICE:  num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 
> = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   2 |16 |16520 |  0 |3
>
> NOTICE:  num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 
> = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   3 |81 |16456 |  0 |   17
>
> NOTICE:  num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, 
> n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   4 |   256 |16456 |  0 |   89
>
> NOTICE:  num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, 
> n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   5 |   625 |16488 |  0 |  327
>
>
> node32:
>
> NOTICE:  num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 
> = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   2 |16 |16488 |  0 |5
> (1 row)
>
> NOTICE:  num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 
> = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   3 |81 |16520 |  0 |   28
>
> NOTICE:  num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, 
> n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   4 |   256 |16408 |  0 |   79
>
> NOTICE:  num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, 
> n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   5 |   625 |24616 |  0 |  199
>
> In this test, node32 seems slightly faster than node4 with 4 elements, at the 
> cost of more memory.
>
> Assuming the smallest node is fixed size (i.e. fanout/capacity member not 
> part of the common set, so only part of variable-sized nodes), 3 has a nice 
> property: no wasted padding space:
>
> node4: 5 + 4+(7) + 4*8 = 48 bytes
> node3: 5 + 3 + 3*8 = 32

IIUC if we store the fanout member only in variable-sized nodes,
rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
the size of node3 (i.e., a fixed-sized node) is (4 + 3 + (1) + 3*8)?
The size doesn't change, but there is 1 byte of padding space.

Also, even if we make node3 a variable-sized node, size class 1
for node3 could be a good choice since it also doesn't need padding
space and could be a good alternative to path compression.

node3 :  5 + 3 + 3*8 = 32 bytes
size class 1 : 5 + 3 + 1*8 = 16 bytes

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-18 Thread Masahiko Sawada
On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada  wrote:
>
> On Wed, Nov 16, 2022 at 4:39 PM John Naylor
>  wrote:
> >
> >
> > On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Nov 16, 2022 at 1:46 PM John Naylor
> > >  wrote:
> > > >
> > > >
> > > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
> > > >  wrote:
> > > > > Thanks! Please let me know if there is something I can help with.
> > > >
> > > > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> > > >
> > > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> > > > "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
> > >
> > > Which tests do you use to get this assertion failure? I've confirmed
> > > there is a bug in 0005 patch but without it, "make check-world"
> > > passed.
> >
> > Hmm, I started over and rebuilt and it didn't reproduce. Not sure what 
> > happened, sorry for the noise.
>
> Good to know. No problem.
>
> > I'm attaching a test I wrote to stress test branch prediction in search, 
> > and while trying it out I found two possible issues.
>
> Thank you for testing!
>
> >
> > It's based on the random int load test, but tests search speed. Run like 
> > this:
> >
> > select * from bench_search_random_nodes(10 * 1000 * 1000)
> >
> > It also takes some care to include all the different node kinds, 
> > restricting the possible keys by AND-ing with a filter. Here's a simple 
> > demo:
> >
> > filter = ((uint64)1<<40)-1;
> > LOG:  num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 = 
> > 62663, n256 = 3130
> >
> > Just using random integers leads to >99% using the smallest node. I wanted 
> > to get close to having the same number of each, but that's difficult while 
> > still using random inputs. I ended up using
> >
> > filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
> >
> > which gives
> >
> > LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 
> > 182670, n256 = 1024
> >
> > Which seems okay for the task. One puzzling thing I found while trying 
> > various filters is that sometimes the reported tree height would change. 
> > For example:
> >
> > filter = (((uint64) 1<<32) | (0xFF<<24));
> > LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 = 
> > 62632, n256 = 3161
> >
> > 1) Any idea why the tree height would be reported as 7 here? I didn't 
> > expect that.
>
> In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
> It seems the filter should be (((uint64) 1<<32) | ((uint64)
> 0xFF<<24)).
>
> >
> > 2) It seems that 0004 actually causes a significant slowdown in this test 
> > (as in the attached, using the second filter above and with turboboost 
> > disabled):
> >
> > v9 0003: 2062 2051 2050
> > v9 0004: 2346 2316 2321
> >
> > That means my idea for the pointer struct might have some problems, at 
> > least as currently implemented. Maybe in the course of separating out and 
> > polishing that piece, an inefficiency will fall out. Or, it might be 
> > another reason to template local and shared separately. Not sure yet. I 
> > also haven't tried to adjust this test for the shared memory case.
>
> I'll also run the test on my environment and do the investigation tomorrow.
>

FYI I've not tested the patch you shared today, but here are the
benchmark results I got with the v9 patch in my environment (I used
the second filter). I split the 0004 patch into two patches: a pure
refactoring patch to introduce rt_node_ptr and a patch to do pointer
tagging.

v9 0003 patch: 1113 1114 1114
introduce rt_node_ptr: 1127 1128 1128
pointer tagging  : 1085 1087 1086 (equivalent to 0004 patch)

In my environment, introducing rt_node_ptr seemed to add some overhead,
but pointer tagging had performance benefits. I'm not sure why the
results are different from yours. The radix tree stats show the same
numbers as your tests.

=# select * from bench_search_random_nodes(10 * 1000 * 1000);
2022-11-18 22:18:21.608 JST [3913544] LOG:  num_keys = 9291812, height
= 4, n4 = 262144, n32 =79603, n128 = 182670, n256 = 1024

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-11-18 Thread Masahiko Sawada
On Sat, Nov 12, 2022 at 4:10 AM Imseih (AWS), Sami  wrote:
>
> >I don't think any of these progress callbacks should be done while 
> > pinning a
> >buffer and ...
>
> Good point.
>
> >I also don't understand why info->parallel_progress_callback exists? 
> > It's only
> >set to parallel_vacuum_progress_report(). Why make this stuff more 
> > expensive
> >than it has to already be?
>
> Agree. Modified the patch to avoid the callback .
>
> >So each of the places that call this need to make an additional external
> >function call for each page, just to find that there's nothing to do or 
> > to
> >make yet another indirect function call. This should probably a static 
> > inline
> >function.
>
> Even better to just remove a function call altogether.
>
> >This is called, for every single page, just to read the number of indexes
> >completed by workers? A number that barely ever changes?
>
> I will take the initial suggestion by Sawada-san to update the progress
> every 1GB of blocks scanned.
>
> Also, It sems to me that we don't need to track progress in brin indexes,
> Since it is rare, if ever, this type of index will go through very heavy
> block scans. In fact, I noticed the vacuum AMs for brin don't invoke the
> vacuum_delay_point at all.
>
> The attached patch addresses the feedback.
>

Thank you for updating the patch! Here are review comments on v15 patch:

+  
+   Number of indexes that wil be vacuumed. This value will be
+   0 if there are no indexes to vacuum or
+   vacuum failsafe is triggered.

I think that indexes_total should be 0 also when INDEX_CLEANUP is off.

---
+/*
+ * Reset the indexes completed at this point.
+ * If we end up in another index vacuum cycle, we will
+ * start counting from the start.
+ */
+pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED, 0);

I think we don't need to reset it at the end of index vacuuming. There
is a small window before switching to the next phase. If we reset this
value while showing "index vacuuming" phase, the user might get
confused. Instead, we can reset it at the beginning of the index
vacuuming.

---
+void
+parallel_wait_for_workers_to_finish(ParallelVacuumState *pvs)
+{
+while (pg_atomic_read_u32(&(pvs->shared->idx_completed_progress))
< pvs->nindexes)
+{
+pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
+
   pg_atomic_read_u32(&(pvs->shared->idx_completed_progress)));
+
+(void) WaitLatch(MyLatch, WL_LATCH_SET |
WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_PARALLEL_FINISH);
+ResetLatch(MyLatch);
+}
+}

We should add CHECK_FOR_INTERRUPTS() at the beginning of the loop to
make the wait interruptible.

I think it would be better to update the counter only when the value
has been increased.

I think we should set a timeout, say 1 sec, to WaitLatch so that it
can periodically check the progress.

Probably it's better to have a new wait event for this wait in order
to distinguish wait for workers to complete index vacuum from the wait
for workers to exit.
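
Putting these together, the wait loop could look like the following
sketch (the wait event name is just an example):

uint32  last_reported = 0;

while (pg_atomic_read_u32(&(pvs->shared->idx_completed_progress)) < pvs->nindexes)
{
    uint32  completed;

    CHECK_FOR_INTERRUPTS();

    completed = pg_atomic_read_u32(&(pvs->shared->idx_completed_progress));

    /* Report only when the counter has actually advanced. */
    if (completed > last_reported)
    {
        pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED, completed);
        last_reported = completed;
    }

    /* Time out after 1 sec so we can re-check the progress periodically. */
    (void) WaitLatch(MyLatch, WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
                     1000L, WAIT_EVENT_PARALLEL_VACUUM_FINISH);
    ResetLatch(MyLatch);
}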

---
@@ -838,7 +867,12 @@
parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation
indrel,
 ivinfo.estimated_count = pvs->shared->estimated_count;
 ivinfo.num_heap_tuples = pvs->shared->reltuples;
 ivinfo.strategy = pvs->bstrategy;
-
+ivinfo.idx_completed_progress = pvs->shared->idx_completed_progress;

and

@@ -998,6 +998,9 @@ btvacuumscan(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats,
 if (info->report_progress)

pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,

   scanblkno);
+if (info->report_parallel_progress &&
(scanblkno % REPORT_PARALLEL_VACUUM_EVERY_PAGES) == 0)
+
pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
+

pg_atomic_read_u32(&(info->idx_completed_progress)));
 }

I think this doesn't work, since ivinfo.idx_completed_progress is in
backend-local memory. Instead, I think we can have a function in
vacuumparallel.c that updates the progress. Then we can have the index
AM call this function.

---
+if (!IsParallelWorker())
+ivinfo.report_parallel_progress = true;
+else
+ivinfo.report_parallel_progress = false;

We can do like:

ivinfo.report_parallel_progress = !IsParallelWorker();

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-17 Thread Masahiko Sawada
On Fri, Nov 18, 2022 at 1:47 PM Amit Kapila  wrote:
>
> On Fri, Nov 18, 2022 at 8:01 AM Peter Smith  wrote:
> >
> > On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada  
> > wrote:
> > >
> > ...
> > > ---
> > > The streaming parameter has the new value "parallel" for "streaming"
> > > option to enable the parallel apply. It fits so far but I think the
> > > parallel apply feature doesn't necessarily need to be tied up with
> > > streaming replication. For example, we might want to support parallel
> > > apply also for non-streaming transactions in the future. It might be
> > > better to have another option, say "parallel", to control parallel
> > > apply behavior. The "parallel" option can be a boolean option and
> > > setting parallel = on requires streaming = on.
> > >
>
> If we do that then how will the user be able to use streaming
> serialize mode (write to file for streaming transactions) as we have
> now? Because after we introduce parallelism for non-streaming
> transactions, the user would want parallel = on irrespective of the
> streaming mode. Also, users may wish to only parallelize large
> transactions because of additional overhead for non-streaming
> transactions for transaction dependency tracking, etc. So, the user
> may wish to have a separate knob for large transactions as the patch
> has now.

One idea for that would be to make it an enum. For example, setting
parallel = "streaming" could work for that.

>
> >
> > FWIW, I tend to agree with this idea but for a different reason. In
> > this patch, the 'streaming' parameter had become a kind of hybrid
> > boolean/enum. AFAIK there are no other parameters anywhere that use a
> > hybrid pattern like this so I was thinking it may be better not to be
> > different.
> >
>
> I think we have a similar pattern for GUC parameters like
> constraint_exclusion (see constraint_exclusion_options),
> backslash_quote (see backslash_quote_options), etc.

Right. The vacuum_index_cleanup and buffering storage parameters (which
accept 'on', 'off', or 'auto') are other examples.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-17 Thread Masahiko Sawada
*
+ * Release all session level locks that could be held in parallel apply
+ * mode.
+ */
+LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+

I think we call LockReleaseAll() at the process exit (in ProcKill()),
but do we really need to do LockReleaseAll() here too?

---

+elog(ERROR, "could not find replication state slot for replication"
+ "origin with OID %u which was acquired by %d", node, acquired_by);

Let's not break the error log message in the middle so that the user
can search the message by grep easily.

---
+{
+{"max_parallel_apply_workers_per_subscription",
+PGC_SIGHUP,
+REPLICATION_SUBSCRIBERS,
+gettext_noop("Maximum number of parallel
apply workers per subscription."),
+NULL,
+},
+&max_parallel_apply_workers_per_subscription,
+2, 0, MAX_BACKENDS,
+NULL, NULL, NULL
+},
+

I think we should use MAX_PARALLEL_WORKER_LIMIT as the max value
instead. MAX_BACKENDS is too high.

---
+/*
+ * Indicates whether there are pending messages in the queue.
The parallel
+ * apply worker will check it before starting to wait.
+ */
+pg_atomic_uint32   pending_message_count;

The "pending messages" sounds like individual logical replication
messages such as LOGICAL_REP_MSG_INSERT. But IIUC what this value
actually shows is how many streamed chunks are pending to process,
right?

---
The streaming parameter has the new value "parallel" for "streaming"
option to enable the parallel apply. It fits so far but I think the
parallel apply feature doesn't necessarily need to be tied up with
streaming replication. For example, we might want to support parallel
apply also for non-streaming transactions in the future. It might be
better to have another option, say "parallel", to control parallel
apply behavior. The "parallel" option can be a boolean option and
setting parallel = on requires streaming = on.

Another variant is to have a new subscription parameter for example
"parallel_workers" parameter that specifies the number of parallel
workers. That way, users can specify the number of parallel workers
per subscription.

---
When the parallel apply worker raised an error, I got the same error
twice, from the leader worker and the parallel worker, as follows. Can
we suppress either one?

2022-11-17 17:30:23.490 JST [3814552] LOG:  logical replication
parallel apply worker for subscription "test_sub1" has started
2022-11-17 17:30:23.490 JST [3814552] ERROR:  duplicate key value
violates unique constraint "test1_c_idx"
2022-11-17 17:30:23.490 JST [3814552] DETAIL:  Key (c)=(1) already exists.
2022-11-17 17:30:23.490 JST [3814552] CONTEXT:  processing remote data
for replication origin "pg_16390" during message type "INSERT" for
replication target relatio
n "public.test1" in transaction 731
2022-11-17 17:30:23.490 JST [3814550] ERROR:  duplicate key value
violates unique constraint "test1_c_idx"
2022-11-17 17:30:23.490 JST [3814550] DETAIL:  Key (c)=(1) already exists.
2022-11-17 17:30:23.490 JST [3814550] CONTEXT:  processing remote data
for replication origin "pg_16390" during message type "INSERT" for
replication target relatio
n "public.test1" in transaction 731
parallel apply worker

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-16 Thread Masahiko Sawada
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
 wrote:
>
>
> On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada  
> wrote:
> >
> > On Wed, Nov 16, 2022 at 1:46 PM John Naylor
> >  wrote:
> > >
> > >
> > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada  
> > > wrote:
> > > > Thanks! Please let me know if there is something I can help with.
> > >
> > > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> > >
> > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> > > "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
> >
> > Which tests do you use to get this assertion failure? I've confirmed
> > there is a bug in 0005 patch but without it, "make check-world"
> > passed.
>
> Hmm, I started over and rebuilt and it didn't reproduce. Not sure what 
> happened, sorry for the noise.

Good to know. No problem.

> I'm attaching a test I wrote to stress test branch prediction in search, and 
> while trying it out I found two possible issues.

Thank you for testing!

>
> It's based on the random int load test, but tests search speed. Run like this:
>
> select * from bench_search_random_nodes(10 * 1000 * 1000)
>
> It also takes some care to include all the different node kinds, restricting 
> the possible keys by AND-ing with a filter. Here's a simple demo:
>
> filter = ((uint64)1<<40)-1;
> LOG:  num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 = 
> 62663, n256 = 3130
>
> Just using random integers leads to >99% using the smallest node. I wanted to 
> get close to having the same number of each, but that's difficult while still 
> using random inputs. I ended up using
>
> filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
>
> which gives
>
> LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 
> 182670, n256 = 1024
>
> Which seems okay for the task. One puzzling thing I found while trying 
> various filters is that sometimes the reported tree height would change. For 
> example:
>
> filter = (((uint64) 1<<32) | (0xFF<<24));
> LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 = 
> 62632, n256 = 3161
>
> 1) Any idea why the tree height would be reported as 7 here? I didn't expect 
> that.

In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000 (the int value is
sign-extended when widened to uint64), not 0xFF000000. It seems the
filter should be (((uint64) 1<<32) | ((uint64) 0xFF<<24)).
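To make the sign-extension difference concrete, here is a tiny
standalone program (not part of the patch set) comparing the two
filters; the values in the comments follow the behaviour described
above:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* 0xFF << 24 sets the sign bit of int; the negative value then sign-extends to uint64 */
    uint64_t bad = (((uint64_t) 1 << 32) | (0xFF << 24));

    /* casting the operand first keeps only bits 24..31 set, as intended */
    uint64_t good = (((uint64_t) 1 << 32) | ((uint64_t) 0xFF << 24));

    printf("bad  = 0x%016" PRIx64 "\n", bad);   /* 0xffffffffff000000 */
    printf("good = 0x%016" PRIx64 "\n", good);  /* 0x00000001ff000000 */
    return 0;
}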

>
> 2) It seems that 0004 actually causes a significant slowdown in this test (as 
> in the attached, using the second filter above and with turboboost disabled):
>
> v9 0003: 2062 2051 2050
> v9 0004: 2346 2316 2321
>
> That means my idea for the pointer struct might have some problems, at least 
> as currently implemented. Maybe in the course of separating out and polishing 
> that piece, an inefficiency will fall out. Or, it might be another reason to 
> template local and shared separately. Not sure yet. I also haven't tried to 
> adjust this test for the shared memory case.

I'll also run the test on my environment and do the investigation tomorrow.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread Masahiko Sawada
On Wed, Nov 16, 2022 at 2:17 PM John Naylor
 wrote:
>
>
>
> On Wed, Nov 16, 2022 at 11:46 AM John Naylor  
> wrote:
> >
> >
> > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada  
> > wrote:
> > > Thanks! Please let me know if there is something I can help with.
> >
> > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> >
> > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> > "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
>
> Actually I do want to offer some general advice. Upthread I recommended a 
> purely refactoring patch that added the node-pointer struct but did nothing 
> else, so that the DSA changes would be smaller. 0004 attempted pointer 
> tagging in the same commit, which makes it no longer a purely refactoring 
> patch, so that 1) makes it harder to tell what part caused the bug and 2) 
> obscures what is necessary for DSA pointers and what was additionally 
> necessary for pointer tagging. Shared memory support is a prerequisite for a 
> shippable feature, but pointer tagging is (hopefully) a performance 
> optimization. Let's keep them separate.

Totally agreed. I'll separate them in the next version patch. Thank
you for your advice.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread Masahiko Sawada
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
 wrote:
>
>
> On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada  
> wrote:
> > Thanks! Please let me know if there is something I can help with.
>
> I didn't get very far because the tests fail on 0004 in rt_verify_node:
>
> TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-14 Thread Masahiko Sawada
On Mon, Nov 14, 2022 at 10:00 PM John Naylor
 wrote:
>
> On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada  wrote:
> >
> > 0004 patch is a new patch supporting a pointer tagging of the node
> > kind. Also, it introduces rt_node_ptr we discussed so that internal
> > functions use it rather than having two arguments for encoded and
> > decoded pointers. With this intermediate patch, the DSA support patch
> > became more readable and understandable. Probably we can make it
> > smaller further if we move the change of separating the control object
> > from radix_tree to the main patch (0002). The patch still needs to be
> > polished but I'd like to check if this idea is worthwhile. If we agree
> > on this direction, this patch will be merged into the main radix tree
> > implementation patch.
>
> Thanks for the new patch set. I've taken a very brief look at 0004 and I 
> think the broad outlines are okay. As you say it needs polish, but before 
> going further, I'd like to do some experiments of my own as I mentioned 
> earlier:
>
> - See how much performance we actually gain from tagging the node kind.
> - Try additional size classes while keeping the node kinds to only four.
> - Optimize node128 insert.
> - Try templating out the differences between local and shared memory. With 
> local memory, the node-pointer struct would be a union, for example. 
> Templating would also reduce branches and re-simplify some internal APIs, but 
> it's likely that would also make the TID store and/or vacuum more complex, 
> because at least some external functions would be duplicated.

Thanks! Please let me know if there is something I can help with.

In the meanwhile, I'd like to make some progress on the vacuum
integration and improving the test coverages.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-08 Thread Masahiko Sawada
On Sat, Nov 5, 2022 at 6:23 PM John Naylor  wrote:
>
> On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada  wrote:
> >
> > For parallel heap pruning, multiple workers will insert key-value
> > pairs to the radix tree concurrently. The simplest solution would be a
> > single lock to protect writes but the performance will not be good.
> > Another solution would be that we can divide the tables into multiple
> > ranges so that keys derived from TIDs are not conflicted with each
> > other and have parallel workers process one or more ranges. That way,
> > parallel vacuum workers can build *sub-trees* and the leader process
> > can merge them. In use cases of lazy vacuum, since the write phase and
> > read phase are separated the readers don't need to worry about
> > concurrent updates.
>
> It's a good idea to use ranges for a different reason -- readahead. See 
> commit 56788d2156fc3, which aimed to improve readahead for sequential scans. 
> It might work to use that as a model: Each worker prunes a range of 64 pages, 
> keeping the dead tids in a local array. At the end of the range: lock the tid 
> store, enter the tids into the store, unlock, free the local array, and get 
> the next range from the leader. It's possible contention won't be too bad, 
> and I suspect using small local arrays as-we-go would be faster and use less 
> memory than merging multiple sub-trees at the end.

Seems a promising idea. I think it might work well even in the current
parallel vacuum (ie., single writer). I mean, I think we can have a
single lwlock for shared cases in the first version. If the overhead
of acquiring the lwlock per insertion of key-value is not negligible,
we might want to try this idea.
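A rough sketch of the per-worker flow John describes above (prune a
64-page range into a local array, then flush it into the shared TID
store under a single lock), just to make the batching idea concrete;
every name here (ParallelPruneShared, get_next_block_range,
prune_one_page, tidstore_add, the 64-page batch) is a placeholder
rather than an existing API:

#define PRUNE_BATCH_PAGES 64

static void
parallel_prune_worker(Relation rel, ParallelPruneShared *shared)
{
    BlockNumber start;
    BlockNumber end;

    while (get_next_block_range(shared, &start, &end)) /* leader hands out ranges */
    {
        ItemPointer dead_tids = palloc(sizeof(ItemPointerData) *
                                       MaxHeapTuplesPerPage * PRUNE_BATCH_PAGES);
        int         ndead = 0;

        /* collect dead TIDs for this range in a local array */
        for (BlockNumber blk = start; blk < end; blk++)
            ndead += prune_one_page(rel, blk, &dead_tids[ndead]);

        /* flush the local batch into the shared TID store under one lock */
        LWLockAcquire(&shared->tidstore_lock, LW_EXCLUSIVE);
        tidstore_add(shared->tidstore, dead_tids, ndead);
        LWLockRelease(&shared->tidstore_lock);

        pfree(dead_tids);
    }
}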

Apart from that, I'm going to incorporate the comments on 0004 patch
and try a pointer tagging.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-11-07 Thread Masahiko Sawada
On Thu, Nov 3, 2022 at 1:52 AM Imseih (AWS), Sami  wrote:
>
> Attached is v13-0001--Show-progress-for-index-vacuums.patch which addresses
> the latest comments.

Thank you for updating the patch!

> 4/ Went back to calling parallel_vacuum_progress_report inside
> WaitForParallelWorkersToFinish to cover the case when a
> leader is waiting for parallel workers to finish.

I don't think we need to modify WaitForParallelWorkersToFinish to
cover that case. Instead, I think the leader process can execute a new
function. The function will be like WaitForParallelWorkersToFinish()
but simpler; it just updates the progress information if necessary and
then checks if idx_completed_progress is equal to the number of
indexes to vacuum. If yes, return from the function and call
WaitForParallelWorkersToFinish() to wait for all workers to finish.
Otherwise, it naps by using WaitLatch() and does this loop again.
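To visualize the loop being suggested, here is a rough sketch; the
ParallelVacuumState fields (shared->idx_completed_progress, nindexes,
pcxt) and the helper parallel_vacuum_progress_report() are taken from
this discussion and are assumptions, not a committed API:

static void
parallel_vacuum_wait_for_index_vacuum(ParallelVacuumState *pvs)
{
    for (;;)
    {
        /* refresh the index-vacuum progress counters if necessary */
        parallel_vacuum_progress_report(pvs);

        /* done once every index has been vacuumed */
        if (pg_atomic_read_u32(&(pvs->shared->idx_completed_progress)) >= pvs->nindexes)
            break;

        /* otherwise nap and loop again */
        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         100L, WAIT_EVENT_PARALLEL_FINISH);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }

    /* now wait for the workers themselves to exit */
    WaitForParallelWorkersToFinish(pvs->pcxt);
}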

---
@@ -46,6 +46,8 @@ typedef struct ParallelContext
 ParallelWorkerInfo *worker;
 int nknown_attached_workers;
 bool  *known_attached_workers;
+void   (*parallel_progress_callback)(void *arg);
+void   *parallel_progress_arg;
 } ParallelContext;

With the above change I suggested, I think we won't need to have a
callback function in ParallelContext. Instead, I think we can have
index-AMs call parallel_vacuum_report() if report_parallel_progress is
true.

Regards,

--
Masahiko Sawada

Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-07 Thread Masahiko Sawada
On Thu, Nov 3, 2022 at 10:06 PM houzj.f...@fujitsu.com
 wrote:
>
> On Wednesday, November 2, 2022 10:50 AM Masahiko Sawada 
>  wrote:
> >
> > On Mon, Oct 24, 2022 at 8:42 PM Masahiko Sawada
> >  wrote:
> > >
> > > On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila 
> > wrote:
> > > >
> > > > On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada
> >  wrote:
> > > > >
> > > > > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila 
> > wrote:
> > > > > >
> > > > > > About your point that having different partition structures for
> > > > > > publisher and subscriber, I don't know how common it will be once we
> > > > > > have DDL replication. Also, the default value of
> > > > > > publish_via_partition_root is false which doesn't seem to indicate
> > > > > > that this is a quite common case.
> > > > >
> > > > > So how can we consider these concurrent issues that could happen only
> > > > > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > > > > the problem or can we have a safeguard against these conflicts?
> > > > >
> > > >
> > > > Yeah, right now the strategy is to disallow parallel apply for such
> > > > cases as you can see in *0003* patch.
> > >
> > > Tightening the restrictions could work in some cases but there might
> > > still be corner cases and it could reduce the usability. I'm not really
> > > sure that we can ensure such a deadlock won't happen with the current
> > > restrictions. I think we need some safeguard just in case. For
> > > example, if the leader apply worker is waiting for a lock acquired by
> > > its parallel worker, it cancels the parallel worker's transaction,
> > > commits its transaction, and restarts logical replication. Or the
> > > leader can log the deadlock to let the user know.
> > >
> >
> > As another direction, we could make the parallel apply feature robust
> > if we can detect deadlocks that happen among the leader worker and
> > parallel workers. I'd like to summarize the idea discussed off-list
> > (with Amit, Hou-San, and Kuroda-San) for discussion. The basic idea is
> > that when the leader worker or parallel worker needs to wait for
> > something (eg. transaction completion, messages) we use lmgr
> > functionality so that we can create wait-for edges and detect
> > deadlocks in lmgr.
> >
> > For example, a scenario where a deadlock occurs is the following:
> >
> > [Publisher]
> > create table tab1(a int);
> > create publication pub for table tab1;
> >
> > [Subscriber]
> > create table tab1(a int primary key);
> > create subscription sub connection 'port=1 dbname=postgres'
> > publication pub with (streaming = parallel);
> >
> > TX1:
> > BEGIN;
> > INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
> > Tx2:
> > BEGIN;
> > INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- 
> > streamed
> > COMMIT;
> > COMMIT;
> >
> > Suppose a parallel apply worker (PA-1) is executing TX-1 and the
> > leader apply worker (LA) is executing TX-2 concurrently on the
> > subscriber. Now, LA is waiting for PA-1 because of the unique key of
> > tab1 while PA-1 is waiting for LA to send further messages. There is a
> > deadlock between PA-1 and LA but lmgr cannot detect it.
> >
> > One idea to resolve this issue is that we have LA acquire a session
> > lock on a shared object (by LockSharedObjectForSession()) and have
> > PA-1 wait on the lock before trying to receive messages. IOW,  LA
> > acquires the lock before sending STREAM_STOP and releases it if
> > already acquired before sending STREAM_START, STREAM_PREPARE and
> > STREAM_COMMIT. For PA-1, it always needs to acquire the lock after
> > processing STREAM_STOP and then release immediately after acquiring
> > it. That way, when PA-1 is waiting for LA, we can have a wait-edge
> > from PA-1 to LA in lmgr, which will make a deadlock in lmgr like:
> >
> > LA (waiting to acquire lock) -> PA-1 (waiting to acquire the shared
> > object) -> LA
> >
> > We would need the shared objects per parallel apply worker.
> >
> > After detecting a deadlock, we can restart logical replication with
> > temporarily disabling the parallel apply, which is done by 0005 patch.
> >
> > Another scenario is similar to the previous case but T

Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-06 Thread Masahiko Sawada
On Mon, Nov 7, 2022 at 12:58 PM Amit Kapila  wrote:
>
> On Mon, Nov 7, 2022 at 8:26 AM Masahiko Sawada  wrote:
> >
> > On Sun, Nov 6, 2022 at 3:40 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Saturday, November 5, 2022 1:43 PM Amit Kapila 
> > > 
> > > >
> > > > On Fri, Nov 4, 2022 at 7:35 PM houzj.f...@fujitsu.com
> > > >  wrote:
> > > > >
> > > > > On Friday, November 4, 2022 4:07 PM Amit Kapila
> > > >  wrote:
> > > > > >
> > > > > > On Thu, Nov 3, 2022 at 6:36 PM houzj.f...@fujitsu.com
> > > > > >  wrote:
> > > > > > >
> > > > > > > Thanks for the analysis and summary !
> > > > > > >
> > > > > > > I tried to implement the above idea and here is the patch set.
> > > > > > >
> > > > > >
> > > > > > Few comments on v42-0001
> > > > > > ===
> > > > >
> > > > > Thanks for the comments.
> > > > >
> > > > > >
> > > > > > 10.
> > > > > > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > > > > > + winfo->shared->transaction_lock_id = parallel_apply_get_unique_id();
> > > > > >
> > > > > > Why can't we use xid (remote_xid) for one of these and local_xid
> > > > > > (one generated by parallel apply) for the other?
> > > > ...
> > > > ...
> > > > >
> > > > > I also considered using xid for these locks, but it seems the objsubid
> > > > > for the shared object lock is 16bit while xid is 32 bit. So, I tried
> > > > > to generate a unique 16bit id here.
> > > > >
> > > >
> > > > Okay, I see your point. Can we think of having a new lock tag for this 
> > > > with classid,
> > > > objid, objsubid for the first three fields of locktag field? We can use 
> > > > a new
> > > > macro SET_LOCKTAG_APPLY_TRANSACTION and a common function to set the
> > > > tag and acquire the lock. One more point related to this is that I am 
> > > > suggesting
> > > > classid by referring to SET_LOCKTAG_OBJECT as that is used in the 
> > > > current
> > > > patch but do you think we need it for our purpose, won't subscription 
> > > > id and
> > > > xid can uniquely identify the tag?
> > >
> > > I agree that it could be better to have a new lock tag. Another point is 
> > > that
> > > the remote xid and Local xid could be the same in some rare cases, so I 
> > > think
> > > we might need to add another identifier to make it unique.
> > >
> > > Maybe :
> > > locktag_field1 : subscription oid
> > > locktag_field2 : xid(remote or local)
> > > locktag_field3 : 0(lock for stream block)/1(lock for transaction)
> >
> > Or I think we can use locktag_field2 for remote xid and locktag_field3
> > for local xid.
> >
>
> We can do that way as well but OTOH, I think for the local
> transactions we don't need subscription oid, so field1 could be
> InvalidOid and field2 will be xid of local xact. Won't that be better?

This would work. But I'm a bit concerned that we cannot identify which
subscription the lock belongs to when checking the pg_locks view.
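For reference, a sketch of the new lock tag layout from the quoted
proposal; LOCKTAG_APPLY_TRANSACTION would be a new LockTagType, this is
illustrative only, and the alternative above would instead put the
remote xid in field2 and the local xid in field3 (with field1 set to
InvalidOid for the local-xid lock in Amit's variant):

#define SET_LOCKTAG_APPLY_TRANSACTION(locktag, suboid, xid, kind) \
    ((locktag).locktag_field1 = (suboid), \
     (locktag).locktag_field2 = (xid),  /* remote or local xid */ \
     (locktag).locktag_field3 = (kind), /* 0 = stream block, 1 = transaction */ \
     (locktag).locktag_field4 = 0, \
     (locktag).locktag_type = LOCKTAG_APPLY_TRANSACTION, \
     (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)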

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-06 Thread Masahiko Sawada
On Sun, Nov 6, 2022 at 3:40 PM houzj.f...@fujitsu.com
 wrote:
>
> On Saturday, November 5, 2022 1:43 PM Amit Kapila 
> >
> > On Fri, Nov 4, 2022 at 7:35 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Friday, November 4, 2022 4:07 PM Amit Kapila
> >  wrote:
> > > >
> > > > On Thu, Nov 3, 2022 at 6:36 PM houzj.f...@fujitsu.com
> > > >  wrote:
> > > > >
> > > > > Thanks for the analysis and summary !
> > > > >
> > > > > I tried to implement the above idea and here is the patch set.
> > > > >
> > > >
> > > > Few comments on v42-0001
> > > > ===
> > >
> > > Thanks for the comments.
> > >
> > > >
> > > > 10.
> > > > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > > > + winfo->shared->transaction_lock_id = parallel_apply_get_unique_id();
> > > >
> > > > Why can't we use xid (remote_xid) for one of these and local_xid
> > > > (one generated by parallel apply) for the other?
> > ...
> > ...
> > >
> > > I also considered using xid for these locks, but it seems the objsubid
> > > for the shared object lock is 16bit while xid is 32 bit. So, I tried
> > > to generate a unique 16bit id here.
> > >
> >
> > Okay, I see your point. Can we think of having a new lock tag for this with 
> > classid,
> > objid, objsubid for the first three fields of locktag field? We can use a 
> > new
> > macro SET_LOCKTAG_APPLY_TRANSACTION and a common function to set the
> > tag and acquire the lock. One more point related to this is that I am 
> > suggesting
> > classid by referring to SET_LOCKTAG_OBJECT as that is used in the current
> > patch but do you think we need it for our purpose, won't subscription id and
> > xid can uniquely identify the tag?
>
> I agree that it could be better to have a new lock tag. Another point is that
> the remote xid and Local xid could be the same in some rare cases, so I think
> we might need to add another identifier to make it unique.
>
> Maybe :
> locktag_field1 : subscription oid
> locktag_field2 : xid(remote or local)
> locktag_field3 : 0(lock for stream block)/1(lock for transaction)

Or I think we can use locktag_field2 for remote xid and locktag_field3
for local xid.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-04 Thread Masahiko Sawada
On Thu, Nov 3, 2022 at 1:59 PM John Naylor  wrote:
>
> On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada  
> wrote:
> >
> > I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
> > the comments I got so far. 0004 patch is a DSA support patch for PoC.
>
> Thanks for the new patchset. This is not a full review, but I have some 
> comments:
>
> 0001 and 0002 look okay on a quick scan -- I will use this as a base for 
> further work that we discussed. However, before I do so I'd like to request 
> another revision regarding the following:
>
> > In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
> > to point its children, and we use rt_node_ptr as either rt_node* or
> > dsa_pointer depending on whether the radix tree is shared or not (ie,
> > by checking radix_tree->dsa == NULL).
>

Thank you for the comments!

> 0004: Looks like a good start, but this patch has a large number of changes 
> like these, making it hard to read:
>
> - if (found && child_p)
> - *child_p = child;
> + if (found && childp_p)
> + *childp_p = childp;
> ...
>   rt_node_inner_32 *new32;
> + rt_node_ptr new32p;
>
>   /* grow node from 4 to 32 */
> - new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
> -  RT_NODE_KIND_32);
> + new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
> + new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
>
> It's difficult to keep in my head what all the variables refer to. I thought 
> a bit about how to split this patch up to make this easier to read. Here's 
> what I came up with:
>
> typedef struct rt_node_ptr
> {
>   uintptr_t encoded;
>   rt_node * decoded;
> }
>
> Note that there is nothing about "dsa or local". That's deliberate. That way, 
> we can use the "encoded" field for a tagged pointer as well, as I hope we can 
> do (at least for local pointers) in the future. So an intermediate patch 
> would have "static inline void" functions  node_ptr_encode() and  
> node_ptr_decode(), which would only copy from one member to another. I 
> suspect that: 1. The actual DSA changes will be *much* smaller and easier to 
> reason about. 2. Experimenting with tagged pointers will be easier.

Good idea. Will try in the next version patch.
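As a sketch of what that intermediate step might look like, assuming
the rt_node_ptr struct quoted above (the bodies are only the trivial
member-copying John describes; nothing here is the actual patch code):

static inline void
node_ptr_encode(rt_node_ptr *ptr)
{
    /* for local memory the encoded form is just the raw pointer value */
    ptr->encoded = (uintptr_t) ptr->decoded;
}

static inline void
node_ptr_decode(rt_node_ptr *ptr)
{
    ptr->decoded = (rt_node *) ptr->encoded;
}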

>
> Also, quick question: 0004 has a new function rt_node_update_inner() -- is 
> that necessary because of DSA?, or does this ideally belong in 0002? What's 
> the reason for it?

Oh, this was needed once when initially I'm writing DSA support but
thinking about it again now I think we can remove it and use
rt_node_insert_inner() with parent = NULL instead.

>
> Regarding the performance, I've
> > added another boolean argument to bench_seq/shuffle_search(),
> > specifying whether to use the shared radix tree or not. Here are
> > benchmark results in my environment,
>
> > [...]
>
> > In non-shared radix tree cases (the forth argument is false), I don't
> > see a visible performance degradation. On the other hand, in shared
> > radix tree cases (the forth argument is true), I see visible overheads
> > because of dsa_get_address().
>
> Thanks, this is useful.
>
> > Please note that the current shared radix tree implementation doesn't
> > support any locking, so it cannot be read while written by someone.
>
> I think at the very least we need a global lock to enforce this.
>
> > Also, only one process can iterate over the shared radix tree. When it
> > comes to parallel vacuum, these don't become restriction as the leader
> > process writes the radix tree while scanning heap and the radix tree
> > is read by multiple processes while vacuuming indexes. And only the
> > leader process can do heap vacuum by iterating the key-value pairs in
> > the radix tree. If we want to use it for other cases too, we would
> > need to support locking, RCU or something.
>
> A useful exercise here is to think about what we'd need to do parallel heap 
> pruning. We don't need to go that far for v16 of course, but what's the 
> simplest thing we can do to make that possible? Other use cases can change to 
> more sophisticated schemes if need be.

For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be that we can divide the tables into multiple
ranges so that keys derived from TIDs are not conflicted with each
other and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of l

Re: Fix comments atop ReorderBufferAddInvalidations

2022-11-03 Thread Masahiko Sawada
Hi

On Thu, Nov 3, 2022 at 7:53 PM Amit Kapila  wrote:
>
> The comments atop seem to indicate that we always accumulate
> invalidation messages in top-level transactions which is neither
> required nor match with the code. This is introduced in the commit
> c55040ccd0 and I have observed it while working on a fix for commit
> 16b1fe0037.

Thank you for the patch. It looks good to me.

I think we can backpatch it to avoid confusion in the future.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-01 Thread Masahiko Sawada
On Mon, Oct 24, 2022 at 8:42 PM Masahiko Sawada  wrote:
>
> On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila  wrote:
> >
> > On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila  
> > > wrote:
> > > >
> > > > About your point that having different partition structures for
> > > > publisher and subscriber, I don't know how common it will be once we
> > > > have DDL replication. Also, the default value of
> > > > publish_via_partition_root is false which doesn't seem to indicate
> > > > that this is a quite common case.
> > >
> > > So how can we consider these concurrent issues that could happen only
> > > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > > the problem or can we have a safeguard against these conflicts?
> > >
> >
> > Yeah, right now the strategy is to disallow parallel apply for such
> > cases as you can see in *0003* patch.
>
> Tightening the restrictions could work in some cases but there might
> still be corner cases and it could reduce the usability. I'm not really
> sure that we can ensure such a deadlock won't happen with the current
> restrictions. I think we need some safeguard just in case. For
> example, if the leader apply worker is waiting for a lock acquired by
> its parallel worker, it cancels the parallel worker's transaction,
> commits its transaction, and restarts logical replication. Or the
> leader can log the deadlock to let the user know.
>

As another direction, we could make the parallel apply feature robust
if we can detect deadlocks that happen among the leader worker and
parallel workers. I'd like to summarize the idea discussed off-list
(with Amit, Hou-San, and Kuroda-San) for discussion. The basic idea is
that when the leader worker or parallel worker needs to wait for
something (eg. transaction completion, messages) we use lmgr
functionality so that we can create wait-for edges and detect
deadlocks in lmgr.

For example, a scenario where a deadlock occurs is the following:

[Publisher]
create table tab1(a int);
create publication pub for table tab1;

[Subscriber]
create table tab1(a int primary key);
create subscription sub connection 'port=1 dbname=postgres'
publication pub with (streaming = parallel);

TX1:
BEGIN;
INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
Tx2:
BEGIN;
INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
COMMIT;
COMMIT;

Suppose a parallel apply worker (PA-1) is executing TX-1 and the
leader apply worker (LA) is executing TX-2 concurrently on the
subscriber. Now, LA is waiting for PA-1 because of the unique key of
tab1 while PA-1 is waiting for LA to send further messages. There is a
deadlock between PA-1 and LA but lmgr cannot detect it.

One idea to resolve this issue is that we have LA acquire a session
lock on a shared object (by LockSharedObjectForSession()) and have
PA-1 wait on the lock before trying to receive messages. IOW,  LA
acquires the lock before sending STREAM_STOP and releases it if
already acquired before sending STREAM_START, STREAM_PREPARE and
STREAM_COMMIT. For PA-1, it always needs to acquire the lock after
processing STREAM_STOP and then release immediately after acquiring
it. That way, when PA-1 is waiting for LA, we can have a wait-edge
from PA-1 to LA in lmgr, which will make a deadlock in lmgr like:

LA (waiting to acquire lock) -> PA-1 (waiting to acquire the shared
object) -> LA

We would need the shared objects per parallel apply worker.
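To spell out the ordering, a rough sketch of both sides; the lock
identity (SubscriptionRelationId as classid plus a per-worker
stream_lock_id as objsubid) and the helper names are assumptions about
how the patch might do it, not its actual code:

/* Leader apply worker (LA): hold the lock across the "stopped" window */
LockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
                           winfo->shared->stream_lock_id, AccessExclusiveLock);
send_stream_stop(winfo);
/* ... later, just before STREAM_START / STREAM_PREPARE / STREAM_COMMIT ... */
UnlockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
                             winfo->shared->stream_lock_id, AccessExclusiveLock);

/*
 * Parallel apply worker (PA), after processing STREAM_STOP: blocks here
 * while LA still holds the lock, giving lmgr the PA -> LA wait edge.
 */
LockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
                           MyParallelShared->stream_lock_id, AccessExclusiveLock);
UnlockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
                             MyParallelShared->stream_lock_id, AccessExclusiveLock);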

After detecting a deadlock, we can restart logical replication with
temporarily disabling the parallel apply, which is done by 0005 patch.

Another scenario is similar to the previous case but TX-1 and TX-2 are
executed by two parallel apply workers (PA-1 and PA-2 respectively).
In this scenario, PA-2 is waiting for PA-1 to complete its transaction
while PA-1 is waiting for subsequent input from LA. Also, LA is
waiting for PA-2 to complete its transaction in order to preserve the
commit order. There is a deadlock among three processes but it cannot
be detected in lmgr because the fact that LA is waiting for PA-2 to
complete its transaction doesn't appear in lmgr (see
parallel_apply_wait_for_xact_finish()). To fix it, we can use
XactLockTableWait() instead.

However, since XactLockTableWait() considers PREPARED TRANSACTION as
still in progress, probably we need a similar trick as above in case
where a transaction is prepared. For example, suppose that TX-2 was
prepared instead of committed in the above scenario, PA-2 acquires
another shared lock at START_STREAM and releases it at
STREAM_COMMIT/PREPARE. LA can wait on the lock.

Yet another scenario where LA has to wait is the case where the shm_mq
buffe

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-27 Thread Masahiko Sawada
On Thu, Oct 27, 2022 at 11:34 AM shiy.f...@fujitsu.com
 wrote:
>
> On Wed, Oct 26, 2022 7:19 PM Amit Kapila  wrote:
> >
> > On Tue, Oct 25, 2022 at 8:38 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Fri, Oct 21, 2022 at 6:32 PM houzj.f...@fujitsu.com
> > >  wrote:
> > >
> > > I've started to review this patch. I tested v40-0001 patch and have
> > > one question:
> > >
> > > IIUC even when most of the changes in the transaction are filtered out
> > > in pgoutput (eg., by relation filter or row filter), the walsender
> > > sends STREAM_START. This means that the subscriber could end up
> > > launching parallel apply workers also for almost empty (and streamed)
> > > transactions. For example, I created three subscriptions each of which
> > > subscribes to a different table. When I loaded a large amount of data
> > > into one table, all three (leader) apply workers received START_STREAM
> > > and launched their parallel apply workers.
> > >
> >
> > The apply workers will be launched just the first time then we
> > maintain a pool so that we don't need to restart them.
> >
> > > However, two of them
> > > finished without applying any data. I think this behaviour looks
> > > problematic since it wastes workers and rather decreases the apply
> > > performance if the changes are not large. Is it worth considering a
> > > way to delay launching a parallel apply worker until we find out the
> > > amount of changes is actually large?
> > >
> >
> > I think even if changes are less there may not be much difference
> > because we have observed that the performance improvement comes from
> > not writing to file.
> >
> > > For example, the leader worker
> > > writes the streamed changes to files as usual and launches a parallel
> > > worker if the amount of changes exceeds a threshold or the leader
> > > receives the second segment. After that, the leader worker switches to
> > > send the streamed changes to parallel workers via shm_mq instead of
> > > files.
> > >
> >
> > I think writing to file won't be a good idea as that can hamper the
> > performance benefit in some cases and not sure if it is worth.
> >
>
> I tried to test some cases that only a small part of the transaction or an 
> empty
> transaction is sent to subscriber, to see if using streaming parallel will 
> bring
> performance degradation.
>
> The test was performed ten times, and the average was taken.
> The results are as follows. The details and the script of the test is 
> attached.
>
> 10% of rows are sent
> --
> HEAD    24.4595
> patched 18.4545
>
> 5% of rows are sent
> --
> HEAD    21.244
> patched 17.9655
>
> 0% of rows are sent
> --
> HEAD    18.0605
> patched 17.893
>
>
> It shows that when only 5% or 10% of rows are sent to subscriber, using 
> parallel
> apply takes less time than HEAD, and even if all rows are filtered there's no
> performance degradation.

Thank you for the testing!

I think this performance improvement comes both from applying changes
in parallel with receiving them and from avoiding writing to a file.
I'm happy to know there is also a benefit for small streaming
transactions. I've also measured the overhead when processing
streaming empty transactions and confirmed the overhead is negligible.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-26 Thread Masahiko Sawada
On Wed, Oct 26, 2022 at 8:06 PM John Naylor
 wrote:
>
> On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada  
> wrote:
>
> > I've attached updated PoC patches for discussion and cfbot. From the
> > previous version, I mainly changed the following things:
> >

Thank you for the comments!

> > * Separate treatment of inner and leaf nodes
>
> Overall, this looks much better!
>
> > * Pack both the node kind and node count to an uint16 value.
>
> For this, I did mention a bitfield earlier as something we "could" do, but it 
> wasn't clear we should. After looking again at the node types, I must not 
> have thought through this at all. Storing one byte instead of four for the 
> full enum is a good step, but saving one more byte usually doesn't buy 
> anything because of padding, with a few exceptions like this example:
>
> node4:   4 +  4   +  4*8 =   40
> node4:   5 +  4+(7)   +  4*8 =   48 bytes
>
> Even there, I'd rather not spend the extra cycles to access the members. And 
> with my idea of decoupling size classes from kind, the variable-sized kinds 
> will require another byte to store "capacity". Then, even if the kind gets 
> encoded in a pointer tag, we'll still have 5 bytes in the base type. So I 
> think we should assume 5 bytes from the start. (Might be 6 temporarily if I 
> work on size decoupling first).

True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes. Encoding the kind in a pointer tag could be tricky given DSA
support, so currently I'm thinking of packing the node kind and the
node capacity class into a uint8.
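For illustration only, one way that packing could look; the bit widths
and macro names below are made up for this sketch, not taken from the
patch:

/* pack the node kind (4 kinds -> 2 bits) and the size class into one uint8 */
#define RT_NODE_KIND_BITS   2
#define RT_NODE_KIND_MASK   ((1 << RT_NODE_KIND_BITS) - 1)

#define RT_MAKE_KIND_CLASS(kind, size_class) \
    ((uint8) (((size_class) << RT_NODE_KIND_BITS) | (kind)))
#define RT_GET_KIND(kind_class)         ((kind_class) & RT_NODE_KIND_MASK)
#define RT_GET_SIZE_CLASS(kind_class)   ((kind_class) >> RT_NODE_KIND_BITS)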

>
> (Side note, if you have occasion to use bitfields again in the future, C99 
> has syntactic support for them, so no need to write your own shifting/masking 
> code).

Thanks!

>
> > I've not done SIMD part seriously yet. But overall the performance
> > seems good so far. If we agree with the current approach, I think we
> > can proceed with the verification of decoupling node sizes from node
> > kind. And I'll investigate DSA support.
>
> Sounds good. I have some additional comments about v7, and after these are 
> addressed, we can proceed independently with the above two items. Seeing the 
> DSA work will also inform me how invasive pointer tagging will be. There will 
> still be some performance tuning and cosmetic work, but it's getting closer.
>

I've made some progress on investigating DSA support. I've written
draft patch for that and regression tests passed. I'll share it as a
separate patch for discussion with v8 radix tree patch.

While implementing DSA support, I realized that we may not need to use
pointer tagging to distinguish between backend-local address or
dsa_pointer. In order to get a backend-local address from dsa_pointer,
we need to pass dsa_area like:

node = dsa_get_address(tree->dsa, node_dp);

As shown above, the dsa area used by the shared radix tree is stored
in radix_tree struct, so we can know whether the radix tree is shared
or not by checking (tree->dsa == NULL). That is, if the tree is shared
we interpret the stored value as a dsa_pointer, and if not we interpret
it as a plain backend-local pointer. We don't need to encode anything
extra in the pointer.
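A minimal sketch of that check, assuming the tree->dsa field mentioned
above and using a bare uintptr_t as the stored child-pointer
representation (rt_pointer_to_local and the exact types are
placeholders, not the patch's code):

static inline rt_node *
rt_pointer_to_local(radix_tree *tree, uintptr_t stored)
{
    if (tree->dsa == NULL)
        return (rt_node *) stored;  /* local tree: the stored value is already a pointer */

    /* shared tree: the stored value is a dsa_pointer */
    return (rt_node *) dsa_get_address(tree->dsa, (dsa_pointer) stored);
}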

> -
> 0001:
>
> +#ifndef USE_NO_SIMD
> +#include "port/pg_bitutils.h"
> +#endif
>
> Leftover from an earlier version?
>
> +static inline int vector8_find(const Vector8 v, const uint8 c);
> +static inline int vector8_find_ge(const Vector8 v, const uint8 c);
>
> Leftovers, causing compiler warnings. (Also see new variable shadow warning)

Will fix.

>
> +#else /* USE_NO_SIMD */
> + Vector8 r = 0;
> + uint8 *rp = (uint8 *) &r;
> +
> + for (Size i = 0; i < sizeof(Vector8); i++)
> + rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
> +
> + return r;
> +#endif
>
> As I mentioned a couple versions ago, this style is really awkward, and 
> potential non-SIMD callers will be better off writing their own byte-wise 
> loop rather than using this API. Especially since the "min" function exists 
> only as a workaround for lack of unsigned comparison in (at least) SSE2. 
> There is one existing function in this file with that idiom for non-assert 
> code (for completeness), but even there, inputs of current interest to us use 
> the uint64 algorithm.

Agreed. Will remove non-SIMD code.

>
> 0002:
>
> + /* XXX: should not to use vector8_highbit_mask */
> + bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << 
> sizeof(Vector8));
>
> Hmm?

It's my outdated memo, will remove.

>
> +/*
> + * Return index of the first element in chunks in the given node that is 
> greater
> + * than or equal to 'key'.  Return -1 if 

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-24 Thread Masahiko Sawada
On Fri, Oct 21, 2022 at 6:32 PM houzj.f...@fujitsu.com
 wrote:
>
> On Thursday, October 20, 2022 5:49 PM Amit Kapila  
> wrote:
> > On Thu, Oct 20, 2022 at 2:08 PM Peter Smith 
> > wrote:
> > >
> > > 7. get_transaction_apply_action
> > >
> > > > 12. get_transaction_apply_action
> > > >
> > > > I still felt like there should be some tablesync checks/comments in
> > > > this function, just for sanity, even if it works as-is now.
> > > >
> > > > For example, are you saying ([3] #22b) that there might be rare
> > > > cases where a Tablesync would call to parallel_apply_find_worker?
> > > > That seems strange, given that "for streaming transactions that are
> > > > being applied in the parallel ... we disallow applying changes on a
> > > > table that is not in the READY state".
> > > >
> > > > --
> > >
> > > Houz wrote [2] -
> > >
> > > I think because we won't try to start parallel apply worker in table
> > > sync worker(see the check in parallel_apply_can_start()), so we won't
> > > find any worker in parallel_apply_find_worker() which means
> > > get_transaction_apply_action will return TRANS_LEADER_SERIALIZE. And
> > > get_transaction_apply_action is a function which can be invoked for
> > > all kinds of workers(same is true for all apply_handle_xxx functions),
> > > so not sure if table sync check/comment is necessary.
> > >
> > > ~
> > >
> > > Sure, and I believe you when you say it all works OK - but IMO there
> > > is something still not quite right with this current code. For
> > > example,
> > >
> > > e.g.1 the functional will return TRANS_LEADER_SERIALIZE for Tablesync
> > > worker, and yet the comment for TRANS_LEADER_SERIALIZE says "means
> > > that we are in the leader apply worker" (except we are not)
> > >
> > > e.g.2 we know for a fact that Tablesync workers cannot start their own
> > > parallel apply workers, so then why do we even let the Tablesync
> > > worker make a call to parallel_apply_find_worker() looking for
> > > something we know will not be found?
> > >
> >
> > I don't see much benefit in adding an additional check for tablesync workers
> > here. It will unnecessarily make this part of the code look bit ugly.
>
> Thanks for the review, here is the new version patch set which addressed 
> Peter[1]
> and Kuroda-san[2]'s comments.

I've started to review this patch. I tested v40-0001 patch and have
one question:

IIUC even when most of the changes in the transaction are filtered out
in pgoutput (eg., by relation filter or row filter), the walsender
sends STREAM_START. This means that the subscriber could end up
launching parallel apply workers also for almost empty (and streamed)
transactions. For example, I created three subscriptions each of which
subscribes to a different table. When I loaded a large amount of data
into one table, all three (leader) apply workers received START_STREAM
and launched their parallel apply workers. However, two of them
finished without applying any data. I think this behaviour looks
problematic since it wastes workers and could even decrease the apply
performance if the changes are not large. Is it worth considering a
way to delay launching a parallel apply worker until we find out the
amount of changes is actually large? For example, the leader worker
writes the streamed changes to files as usual and launches a parallel
worker if the amount of changes exceeds a threshold or the leader
receives the second segment. After that, the leader worker switches to
send the streamed changes to parallel workers via shm_mq instead of
files.
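If we were to go down that path, the decision could be as simple as the
following sketch; the threshold value, the ApplyStreamState struct, and
its fields are purely illustrative assumptions, not anything in the
patch set:

/* illustrative only: should the leader hand this streamed xact to a parallel worker? */
#define PARALLEL_APPLY_SWITCH_THRESHOLD (8 * 1024 * 1024)   /* bytes already spilled */

static bool
should_switch_to_parallel_apply(ApplyStreamState *stream)
{
    /* a second stream segment for the same transaction has arrived */
    if (stream->nsegments > 1)
        return true;

    /* or the first segment alone is already large */
    return stream->bytes_spilled >= PARALLEL_APPLY_SWITCH_THRESHOLD;
}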

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-24 Thread Masahiko Sawada
On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila  wrote:
>
> On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada  wrote:
> >
> > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila  wrote:
> > >
> > > About your point that having different partition structures for
> > > publisher and subscriber, I don't know how common it will be once we
> > > have DDL replication. Also, the default value of
> > > publish_via_partition_root is false which doesn't seem to indicate
> > > that this is a quite common case.
> >
> > So how can we consider these concurrent issues that could happen only
> > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > the problem or can we have a safeguard against these conflicts?
> >
>
> Yeah, right now the strategy is to disallow parallel apply for such
> cases as you can see in *0003* patch.

Tightening the restrictions could work in some cases but there might
still be corner cases, and it could reduce the usability. I'm not really
sure that we can ensure such a deadlock won't happen with the current
restrictions. I think we need some safeguard just in case. For
example, if the leader apply worker is waiting for a lock acquired by
its parallel worker, it cancels the parallel worker's transaction,
commits its transaction, and restarts logical replication. Or the
leader can log the deadlock to let the user know.

>
> > We
> > could find a new problematic scenario in the future and if it happens,
> > logical replication gets stuck, it cannot be resolved only by apply
> > workers themselves.
> >
>
> I think users can change streaming option to on/off and internally the
> parallel apply worker can detect and restart to allow replication to
> proceed. Having said that, I think that would be a bug in the code and
> we should try to fix it. We may need to disable parallel apply in the
> problematic case.
>
> The other ideas that occurred to me in this regard are (a) provide a
> reloption (say parallel_apply) at table level and we can use that to
> bypass various checks like different Unique Key between
> publisher/subscriber, constraints/expressions having mutable
> functions, Foreign Key (when enabled on subscriber), operations on
> Partitioned Table. We can't detect whether those are safe or not
> (primarily because of a different structure in publisher and
> subscriber) so we prohibit parallel apply but if users use this
> option, we can allow it even in those cases.

The parallel apply worker is assigned per transaction, right? If so,
how can we know in advance which tables are modified in the
transaction? And what if two tables, one with the reloption set to
true and one set to false, are modified in the same transaction?

> (b) While enabling the
> parallel option in the subscription, we can try to match all the
> table(s) information of the publisher/subscriber. It will be tricky to
> make this work because say even if match some trigger function name,
> we won't be able to match the function body. The other thing is when
> at a later point the table definition is changed on the subscriber, we
> need to again validate the information between publisher and
> subscriber which I think would be difficult as we would be already in
> between processing some message and getting information from the
> publisher at that stage won't be possible.

Indeed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: START_REPLICATION SLOT causing a crash in an assert build

2022-10-21 Thread Masahiko Sawada
On Thu, Oct 20, 2022 at 6:54 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-13 15:57:28 +0900, Masahiko Sawada wrote:
> > I've attached an updated patch. I've added the common function to
> > start pg_recvlogical and wait for it to become active. Please review
> > it.
>
> > +# Start pg_recvlogical process and wait for it to become active.
> > +sub start_pg_recvlogical
> > +{
> > + my ($node, $slot_name, $create_slot) = @_;
> > +
> > + my @cmd = (
> > + 'pg_recvlogical', '-S', "$slot_name", '-d',
> > + $node->connstr('postgres'),
> > + '--start', '--no-loop', '-f', '-');
> > + push @cmd, '--create-slot' if $create_slot;
> > +
> > + # start pg_recvlogical process.
> > + my $pg_recvlogical = IPC::Run::start(@cmd);
> > +
> > + # Wait for the replication slot to become active.
> > + $node->poll_query_until('postgres',
> > + "SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE 
> > slot_name = '$slot_name' AND active_pid IS NOT NULL)"
> > + ) or die "slot never became active";
> > +
> > + return $pg_recvlogical;
> > +}
>
> Because you never process the output from pg_recvlogical I think this test
> will fail if the pipe buffer is small (or more changes are made). I think
> either it needs to output to a file, or we need to process the output.

Okay, but how can we test this situation? As far as I tested, if we
don't redirect pg_recvlogical's output as above, its stdout and stderr
go to the log file, so I could not reproduce the issue you're
concerned about. Which pipe are you referring to?

>
> A file seems the easier solution in this case, we don't really care about what
> changes are streamed to the client, right?
>
>
> > +$node = PostgreSQL::Test::Cluster->new('test2');
> > +$node->init(allows_streaming => 'logical');
> > +$node->start;
> > +$node->safe_psql('postgres', "CREATE TABLE test(i int)");
>
> Why are we creating a new cluster? Initdbs takes a fair bit of time on some
> platforms, so this seems unnecessary?

Agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-20 Thread Masahiko Sawada
On Thu, Oct 20, 2022 at 6:57 PM Amit Kapila  wrote:
>
> On Wed, Oct 19, 2022 at 9:40 AM Masahiko Sawada  wrote:
> >
> > I've attached patches for Change-3 with a test case. Please review them as 
> > well.
> >
>
> The patch looks mostly good to me apart from few minor comments which
> are as follows:
> 1.
> +# The last decoding restarts from the first checkpoint, and add
> invalidation messages
> +# generated by "s0_truncate" to the subtransaction. When decoding the
> commit record of
> +# the top-level transaction, we mark both top-level transaction and
> its subtransactions
> +# as containing catalog changes. However, we check if we don't create
> the association
> +# between top-level and subtransactions at this time. Otherwise, we
> miss executing
> +# invalidation messages when forgetting the transaction.
> +permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert"
> "s1_checkpoint" "s1_get_changes" "s0_truncate" "s0_commit" "s0_begin"
> "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit"
> "s1_get_changes"
>
> The second part of this comment says more than is required, which
> makes it less clear. How about something like: "The last
> decoding restarts from the first checkpoint and adds invalidation
> messages generated by "s0_truncate" to the subtransaction. While
> processing the commit record for the top-level transaction, we decide
> to skip this xact but ensure that corresponding invalidation messages
> get processed."?
>
> 2.
> + /*
> + * We will assign subtransactions to the top transaction before
> + * replaying the contents of the transaction.
> + */
>
> I don't think we need this comment.
>

Thank you for the comment! I agreed with all comments and I've updated
patches accordingly.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v12-v15_v2-0001-Fix-executing-invalidation-messages-generated-by-.patch
Description: Binary data


v11_v2-0001-Fix-executing-invalidation-messages-generated-by-.patch
Description: Binary data


Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-20 Thread Masahiko Sawada
On Thu, Oct 20, 2022 at 8:09 PM Amit Kapila  wrote:
>
> On Wed, Oct 19, 2022 at 4:47 PM Amit Kapila  wrote:
> >
> > On Wed, Oct 19, 2022 at 1:08 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > >
> > > > I've attached two patches that need to be back-patched to all branches
> > > > and includes Change-1, Change-2, and a test case for them. FYI this
> > > > patch resolves the assertion failure reported in this thread as well
> > > > as one reported in another thread[2]. So I borrowed some of the
> > > > changes from the patch[2] Osumi-san recently proposed.
> > > >
> > >
> > > Amit pointed out offlist that the changes in reorderbuffer.c is not
> > > pgindent'ed. I've run pgindent and attached updated patches.
> > >
> >
> > Thanks, I have tested these across all branches till v10 and it works
> > as expected. I am planning to push this tomorrow unless I see any
> > further comments.
> >
>
> Pushed.

Thank you!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-19 Thread Masahiko Sawada
On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada  wrote:
>
> On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada  wrote:
> >
> > On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila  wrote:
> > >
> > > On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > >
> > > > > I think because the test case proposed needs all three changes, we can
> > > > > push the change-1 without a test case and then as a second patch have
> > > > > change-2 for HEAD and change-3 for back branches with the test case.
> > > > > Do you have any other ideas to proceed here?
> > > >
> > > > I found another test case that causes the assertion failure at
> > > > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > > > attached the patch for the test case. In this test case, I modified a
> > > > user-catalog table instead of system-catalog table. That way, we don't
> > > > generate invalidation messages while generating NEW_CID records. As a
> > > > result, we mark only the subtransactions as containing catalog change
> > > > and don't make association between top-level and sub transactions. The
> > > > assertion failure happens on all supported branches. If we need to fix
> > > > this (I believe so), Change-2 needs to be backpatched to all supported
> > > > branches.
> > > >
> > > > There are three changes as Amit mentioned, and regarding the test
> > > > case, we have three test cases I've attached: truncate_testcase.patch,
> > > > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > > > between assertion failures and test cases are very complex. I could
> > > > not find any test case to cause only one assertion failure on all
> > > > branches. One idea to proceed is:
> > > >
> > > > Patch-1 includes Change-1 and is applied to all branches.
> > > >
> > > > Patch-2 includes Change-2 and the user_catalog test case, and is
> > > > applied to all branches.
> > > >
> > > > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > > > test case), and is applied to v14 and v15 (also till v11 if we
> > > > prefer).
> > > >
> > > > The patch-1 doesn't include any test case but the user_catalog test
> > > > case can test both Change-1 and Change-2 on all branches.
> > > >
> > >
> > > I was wondering if it makes sense to commit both Change-1 and Change-2
> > > together as one patch? Both assertions are caused by a single test
> > > case and are related to the general problem that the association of
> > > top and sub transaction is only guaranteed to be formed before we
> > > decode transaction changes. Also, it would be good to fix the problem
> > > with a test case that can cause it. What do you think?
> >
> > Yeah, it makes sense to me.
> >
>
> I've attached two patches that need to be back-patched to all branches
> and includes Change-1, Change-2, and a test case for them. FYI this
> patch resolves the assertion failure reported in this thread as well
> as one reported in another thread[2]. So I borrowed some of the
> changes from the patch[2] Osumi-san recently proposed.
>

Amit pointed out offlist that the changes in reorderbuffer.c is not
pgindent'ed. I've run pgindent and attached updated patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


HEAD_v4-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


v10-v15_v4-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada
On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada  wrote:
>
> On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada  wrote:
> >
> > On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila  wrote:
> > >
> > > On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > >
> > > > > I think because the test case proposed needs all three changes, we can
> > > > > push the change-1 without a test case and then as a second patch have
> > > > > change-2 for HEAD and change-3 for back branches with the test case.
> > > > > Do you have any other ideas to proceed here?
> > > >
> > > > I found another test case that causes the assertion failure at
> > > > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > > > attached the patch for the test case. In this test case, I modified a
> > > > user-catalog table instead of system-catalog table. That way, we don't
> > > > generate invalidation messages while generating NEW_CID records. As a
> > > > result, we mark only the subtransactions as containing catalog change
> > > > and don't make association between top-level and sub transactions. The
> > > > assertion failure happens on all supported branches. If we need to fix
> > > > this (I believe so), Change-2 needs to be backpatched to all supported
> > > > branches.
> > > >
> > > > There are three changes as Amit mentioned, and regarding the test
> > > > case, we have three test cases I've attached: truncate_testcase.patch,
> > > > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > > > between assertion failures and test cases are very complex. I could
> > > > not find any test case to cause only one assertion failure on all
> > > > branches. One idea to proceed is:
> > > >
> > > > Patch-1 includes Change-1 and is applied to all branches.
> > > >
> > > > Patch-2 includes Change-2 and the user_catalog test case, and is
> > > > applied to all branches.
> > > >
> > > > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > > > test case), and is applied to v14 and v15 (also till v11 if we
> > > > prefer).
> > > >
> > > > The patch-1 doesn't include any test case but the user_catalog test
> > > > case can test both Change-1 and Change-2 on all branches.
> > > >
> > >
> > > I was wondering if it makes sense to commit both Change-1 and Change-2
> > > together as one patch? Both assertions are caused by a single test
> > > case and are related to the general problem that the association of
> > > top and sub transaction is only guaranteed to be formed before we
> > > decode transaction changes. Also, it would be good to fix the problem
> > > with a test case that can cause it. What do you think?
> >
> > Yeah, it makes sense to me.
> >
>
> I've attached two patches that need to be back-patched to all branches
> and includes Change-1, Change-2, and a test case for them. FYI this
> patch resolves the assertion failure reported in this thread as well
> as one reported in another thread[2]. So I borrowed some of the
> changes from the patch[2] Osumi-san recently proposed.
>

I've attached patches for Change-3 with a test case. Please review them as well.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From 763cb604fae97131b5be2da36deb10054b6dbf3a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Wed, 19 Oct 2022 12:16:43 +0900
Subject: [PATCH v1] Don't assign subtransactions to top transaction in
 SnapBuildXidSetCatalogChanges().

Previously, when decoding the commit record of a transaction, we
marked both the top-level transaction and its subtransactions as
containing catalog changes and assigned the subtransactions to the
top-level transaction. However, if the subtransactions have
invalidation messages, we missed executing them when forgetting the
transactions. In commit 272248a0c1, where we introduced
SnapBuildXidSetCatalogChanges(), the reason we assigned them was just
to avoid the assertion failure in AssertTXNLsnOrder(), as they have
the same LSN. Now that with commit XXX we skip this assertion check
until we reach the LSN at which we start decoding the contents of the
transaction, there is no reason for that anymore.

SnapBuildXidSetCatalogChanges() was introduced in 15 and older
branches, but this bug doesn't exist in branches prior to 14 since we
don't add invalidation messages to subtransactions. We decided to backpatch
through 11 for consistency bu

Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada
On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada  wrote:
>
> On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila  wrote:
> >
> > On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada  
> > wrote:
> > >
> > > >
> > > > I think because the test case proposed needs all three changes, we can
> > > > push the change-1 without a test case and then as a second patch have
> > > > change-2 for HEAD and change-3 for back branches with the test case.
> > > > Do you have any other ideas to proceed here?
> > >
> > > I found another test case that causes the assertion failure at
> > > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > > attached the patch for the test case. In this test case, I modified a
> > > user-catalog table instead of system-catalog table. That way, we don't
> > > generate invalidation messages while generating NEW_CID records. As a
> > > result, we mark only the subtransactions as containing catalog change
> > > and don't make association between top-level and sub transactions. The
> > > assertion failure happens on all supported branches. If we need to fix
> > > this (I believe so), Change-2 needs to be backpatched to all supported
> > > branches.
> > >
> > > There are three changes as Amit mentioned, and regarding the test
> > > case, we have three test cases I've attached: truncate_testcase.patch,
> > > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > > between assertion failures and test cases are very complex. I could
> > > not find any test case to cause only one assertion failure on all
> > > branches. One idea to proceed is:
> > >
> > > Patch-1 includes Change-1 and is applied to all branches.
> > >
> > > Patch-2 includes Change-2 and the user_catalog test case, and is
> > > applied to all branches.
> > >
> > > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > > test case), and is applied to v14 and v15 (also till v11 if we
> > > prefer).
> > >
> > > The patch-1 doesn't include any test case but the user_catalog test
> > > case can test both Change-1 and Change-2 on all branches.
> > >
> >
> > I was wondering if it makes sense to commit both Change-1 and Change-2
> > together as one patch? Both assertions are caused by a single test
> > case and are related to the general problem that the association of
> > top and sub transaction is only guaranteed to be formed before we
> > decode transaction changes. Also, it would be good to fix the problem
> > with a test case that can cause it. What do you think?
>
> Yeah, it makes sense to me.
>

I've attached two patches that need to be back-patched to all branches
and include Change-1, Change-2, and a test case for them. FYI this
patch resolves the assertion failure reported in this thread as well
as one reported in another thread[2]. So I borrowed some of the
changes from the patch[2] Osumi-san recently proposed.

Regards,

[1] 
https://www.postgresql.org/message-id/TYCPR01MB83733C6CEAE47D0280814D5AED7A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com
[2] 
https://www.postgresql.org/message-id/TYAPR01MB5866B30A1439043B1FC3F21EF5229%40TYAPR01MB5866.jpnprd01.prod.outlook.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v15_v3-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


HEAD_v3-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada
On Tue, Oct 18, 2022 at 7:56 PM Amit Kapila  wrote:
>
> On Mon, Oct 17, 2022 at 7:05 AM Masahiko Sawada  wrote:
> >
> > On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila  wrote:
> > >
> > > --- a/src/backend/replication/logical/decode.c
> > > +++ b/src/backend/replication/logical/decode.c
> > > @@ -113,6 +113,15 @@
> > > LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> > > XLogReaderState *recor
> > >   buf.origptr);
> > >   }
> > >
> > > +#ifdef USE_ASSERT_CHECKING
> > > + /*
> > > + * Check the order of transaction LSNs when we reached the start decoding
> > > + * LSN. See the comments in AssertTXNLsnOrder() for details.
> > > + */
> > > + if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
> > > + AssertTXNLsnOrder(ctx->reorder);
> > > +#endif
> > > +
> > >   rmgr = GetRmgr(XLogRecGetRmid(record));
> > > >
> > >
> > > I am not able to think how/when this check will be useful. Because we
> > > skipped assert checking only for records that are prior to
> > > start_decoding_at point, I think for those records ordering should
> > > have been checked before the restart. start_decoding_at point will be
> > > either (a) confirmed_flush location, or (b) lsn sent by client, and
> > > any record prior to that must have been processed before restart.
> >
> > Good point. I was considering the case where the client sets far ahead
> > LSN but it's not worth considering this case in this context. I've
> > updated the patch accordingly.
> >
>
> One minor comment:
> Can we slightly change the comment: ". The ordering of the records
> prior to the LSN, we should have been checked before the restart." to
> ". The ordering of the records prior to the start_decoding_at LSN
> should have been checked before the restart."?

Agreed. I'll update the patch accordingly.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada
On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila  wrote:
>
> On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada  wrote:
> >
> > >
> > > I think because the test case proposed needs all three changes, we can
> > > push the change-1 without a test case and then as a second patch have
> > > change-2 for HEAD and change-3 for back branches with the test case.
> > > Do you have any other ideas to proceed here?
> >
> > I found another test case that causes the assertion failure at
> > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > attached the patch for the test case. In this test case, I modified a
> > user-catalog table instead of system-catalog table. That way, we don't
> > generate invalidation messages while generating NEW_CID records. As a
> > result, we mark only the subtransactions as containing catalog change
> > and don't make association between top-level and sub transactions. The
> > assertion failure happens on all supported branches. If we need to fix
> > this (I believe so), Change-2 needs to be backpatched to all supported
> > branches.
> >
> > There are three changes as Amit mentioned, and regarding the test
> > case, we have three test cases I've attached: truncate_testcase.patch,
> > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > between assertion failures and test cases are very complex. I could
> > not find any test case to cause only one assertion failure on all
> > branches. One idea to proceed is:
> >
> > Patch-1 includes Change-1 and is applied to all branches.
> >
> > Patch-2 includes Change-2 and the user_catalog test case, and is
> > applied to all branches.
> >
> > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > test case), and is applied to v14 and v15 (also till v11 if we
> > prefer).
> >
> > The patch-1 doesn't include any test case but the user_catalog test
> > case can test both Change-1 and Change-2 on all branches.
> >
>
> I was wondering if it makes sense to commit both Change-1 and Change-2
> together as one patch? Both assertions are caused by a single test
> case and are related to the general problem that the association of
> top and sub transaction is only guaranteed to be formed before we
> decode transaction changes. Also, it would be good to fix the problem
> with a test case that can cause it. What do you think?

Yeah, it makes sense to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada
On Tue, Oct 18, 2022 at 1:07 PM Amit Kapila  wrote:
>
> On Tue, Oct 18, 2022 at 6:29 AM Masahiko Sawada  wrote:
> >
> > On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila  wrote:
> > >
> > >
> > > IIUC, here you are speaking of three different changes. Change-1: Add
> > > a check in AssertTXNLsnOrder() to skip assert checking till we reach
> > > start_decoding_at. Change-2: Set needs_timetravel to true in one of
> > > the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
> > > call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
> > > in PG-14/15 as that won't be required after Change-1.
> >
> > Yes.
> >
> > >
> > > AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
> > > required in HEAD/v15/v14 to fix the problem.
> >
> > IIUC Change-2 is required in v16 and HEAD
> >
>
> Why are you referring v16 and HEAD separately?

Sorry, my mistake, I was confused.

>
> > but not mandatory in v15 and
> > v14. The reason why we need Change-2 is that there is a case where we
> > mark only subtransactions as containing catalog change while not doing
> > that for its top-level transaction. In v15 and v14, since we mark both
> > subtransactions and top-level transaction in
> > SnapBuildXidSetCatalogChanges() as containing catalog changes, we
> > don't get the assertion failure at "Assert(!needs_snapshot ||
> > needs_timetravel)".
> >
> > Regarding Change-3, it's required in v15 and v14 but not in HEAD and
> > v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
> > HEAD, Change-3 cannot be applied to the two branches.
> >
> > > Now, the second and third
> > > changes are not required in branches prior to v14 because we don't
> > > record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
> > > we want, we can even back-patch Change-2 and Change-3 to keep the code
> > > consistent or maybe just Change-3.
> >
> > Right. I don't think it's a good idea to back-patch Change-2 in
> > branches prior to v14 as it's not a relevant issue.
> >
>
> Fair enough but then why to even backpatch it to v15 and v14?

Oops, it's a typo. I wanted to say Change-2 should be back-patched only to HEAD.

>
> > Regarding
> > back-patching Change-3 to branches prior 14, I think it may be okay
> > til v11, but I'd be hesitant for v10 as the final release comes in a
> > month.
> >
>
> So to fix the issue in all branches, what we need to do is to
> backpatch change-1: in all branches till v10, change-2: in HEAD, and
> change-3: in V15 and V14. Additionally, we think, it is okay to
> backpatch change-3 till v11 as it is mainly done to avoid the problem
> fixed by change-1 and it makes code consistent in back branches.

Right.

>
> I think because the test case proposed needs all three changes, we can
> push the change-1 without a test case and then as a second patch have
> change-2 for HEAD and change-3 for back branches with the test case.
> Do you have any other ideas to proceed here?

I found another test case that causes the assertion failure at
"Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
attached the patch for the test case. In this test case, I modified a
user-catalog table instead of a system-catalog table. That way, we
don't generate invalidation messages while generating NEW_CID records.
As a result, we mark only the subtransactions as containing catalog
changes and don't make the association between the top-level and sub
transactions. The assertion failure happens on all supported branches.
If we need to fix this (I believe so), Change-2 needs to be
backpatched to all supported branches.
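
To illustrate the ingredients in plain SQL (only a rough sketch, not
the attached test itself: the attached test also has to arrange for
decoding to restart at the right point, and the table and slot names
below are made up):

-- A user catalog table: under wal_level = logical, DML on it generates
-- NEW_CID records but, unlike a system catalog, no invalidation messages.
CREATE TABLE user_cat (data int) WITH (user_catalog_table = true);
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', 'test_decoding');

BEGIN;
SAVEPOINT sp1;
-- Decoding the NEW_CID record marks only this subtransaction as
-- containing catalog changes; the top-level transaction is not marked.
INSERT INTO user_cat VALUES (1);
RELEASE SAVEPOINT sp1;
COMMIT;

SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);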

There are three changes as Amit mentioned, and regarding the test
case, we have the three test cases I've attached: truncate_testcase.patch,
analyze_testcase.patch, user_catalog_testcase.patch. The relationship
between assertion failures and test cases is very complex. I could
not find any test case that causes only one assertion failure on all
branches. One idea to proceed is:

Patch-1 includes Change-1 and is applied to all branches.

Patch-2 includes Change-2 and the user_catalog test case, and is
applied to all branches.

Patch-3 includes Change-3 and the truncate test case (or the analyze
test case), and is applied to v14 and v15 (also till v11 if we
prefer).

Patch-1 doesn't include any test case, but the user_catalog test
case can test both Change-1 and Change-2 on all branches. In v15 and
v14, the analyze test case causes both assertions, at
"Assert(txn->ninvalidations == 0);" and "Assert(prev_first_lsn <
cur_txn->first_lsn);", whereas the truncate test case causes only the
assertion at "Assert(txn->ninvalidations == 0);". Since
Patch-2 is applied on top of Patch-1, there is no difference in
terms of testing Change-2.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


user_catalog_testcase.patch
Description: Binary data


analyze_testcase.patch
Description: Binary data


truncate_testcase.patch
Description: Binary data


Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-17 Thread Masahiko Sawada
On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila  wrote:
>
> On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada  
> wrote:
> >
> > Please note that to pass the new regression tests, the fix proposed in
> > a related thread[1] is required. Particularly, we need:
> >
> > @@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder,
> > XLogRecPtr lsn, TransactionId xid,
> > else if (sub_needs_timetravel)
> > {
> > /* track toplevel txn as well, subxact alone isn't 
> > meaningful */
> > +   elog(DEBUG2, "forced transaction %u to do timetravel
> > due to one of its subtransaction",
> > +xid);
> > +   needs_timetravel = true;
> > SnapBuildAddCommittedTxn(builder, xid);
> > }
> > else if (needs_timetravel)
> >
> > A side benefit of this approach is that we can fix another assertion
> > failure too that happens on REL14 and REL15 and reported here[2]. In
> > the commits 68dcce247f1a(REL14) and 272248a0c1(REL15), the reason why
> > we make the association between sub-txns to top-txn in
> > SnapBuildXidSetCatalogChanges() is just to avoid the assertion failure
> > in AssertTXNLsnOrder(). However, since the invalidation messages are
> > not transported from sub-txn to top-txn during the assignment, another
> > assertion check in ReorderBufferForget() fails when forgetting the
> > subtransaction. If we apply this idea of skipping the assertion
> > checks, we no longer need to make the such association in
> > SnapBuildXidSetCatalogChanges() and resolve this issue as well.
> >
>
> IIUC, here you are speaking of three different changes. Change-1: Add
> a check in AssertTXNLsnOrder() to skip assert checking till we reach
> start_decoding_at. Change-2: Set needs_timetravel to true in one of
> the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
> call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
> in PG-14/15 as that won't be required after Change-1.

Yes.

>
> AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
> required in HEAD/v15/v14 to fix the problem.

IIUC Change-2 is required in v16 and HEAD but not mandatory in v15 and
v14. The reason why we need Change-2 is that there is a case where we
mark only the subtransactions as containing catalog changes while not
doing that for their top-level transaction. In v15 and v14, since we
mark both the subtransactions and the top-level transaction in
SnapBuildXidSetCatalogChanges() as containing catalog changes, we
don't get the assertion failure at "Assert(!needs_snapshot ||
needs_timetravel)".

Regarding Change-3, it's required in v15 and v14 but not in HEAD and
v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
HEAD, Change-3 cannot be applied to those two branches.

> Now, the second and third
> changes are not required in branches prior to v14 because we don't
> record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
> we want, we can even back-patch Change-2 and Change-3 to keep the code
> consistent or maybe just Change-3.

Right. I don't think it's a good idea to back-patch Change-2 in
branches prior to v14 as it's not a relevant issue. Regarding
back-patching Change-3 to branches prior 14, I think it may be okay
til v11, but I'd be hesitant for v10 as the final release comes in a
month.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-16 Thread Masahiko Sawada
On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila  wrote:
>
> On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada  
> wrote:
> >
> > Summarizing this issue, the assertion check in AssertTXNLsnOrder()
> > fails as reported because the current logical decoding cannot properly
> > handle the case where the decoding restarts from NEW_CID. Since we
> > don't make the association between top-level transaction and its
> > subtransaction while decoding NEW_CID (ie, in
> > SnapBuildProcessNewCid()), two transactions are created in
> > ReorderBuffer as top-txn and have the same LSN. This failure happens
> > on all supported versions.
> >
> > To fix the problem, one idea is that we make the association between
> > top-txn and sub-txn during that by calling ReorderBufferAssignChild(),
> > as Tomas proposed. On the other hand, since we don't guarantee to make
> > the association between the top-level transaction and its
> > sub-transactions until we try to decode the actual contents of the
> > transaction, it makes sense to me that instead of trying to solve by
> > making association, we need to change the code which are assuming that
> > it is associated.
> >
> > I've attached the patch for this idea. With the patch, we skip the
> > assertion checks in AssertTXNLsnOrder() until we reach the LSN at
> > which we start decoding the contents of transaction, ie.
> > start_decoding_at in SnapBuild. The minor concern is other way that
> > the assertion check could miss some faulty cases where two unrelated
> > top-transactions could have same LSN. With this patch, it will pass
> > for such a case. Therefore, for transactions that we skipped checking,
> > we do the check when we reach the LSN.
> >
>
> >
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -113,6 +113,15 @@
> LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> XLogReaderState *recor
>   buf.origptr);
>   }
>
> +#ifdef USE_ASSERT_CHECKING
> + /*
> + * Check the order of transaction LSNs when we reached the start decoding
> + * LSN. See the comments in AssertTXNLsnOrder() for details.
> + */
> + if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
> + AssertTXNLsnOrder(ctx->reorder);
> +#endif
> +
>   rmgr = GetRmgr(XLogRecGetRmid(record));
> >
>
> I am not able to think how/when this check will be useful. Because we
> skipped assert checking only for records that are prior to
> start_decoding_at point, I think for those records ordering should
> have been checked before the restart. start_decoding_at point will be
> either (a) confirmed_flush location, or (b) lsn sent by client, and
> any record prior to that must have been processed before restart.

Good point. I was considering the case where the client sets far ahead
LSN but it's not worth considering this case in this context. I've
updated the patch accordingly.

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v2-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-14 Thread Masahiko Sawada
Hi,

On Mon, Oct 10, 2022 at 2:16 PM John Naylor
 wrote:
>
> The following is not quite a full review, but has plenty to think about. 
> There is too much to cover at once, and I have to start somewhere...
>
> My main concerns are that internal APIs:
>
> 1. are difficult to follow
> 2. lead to poor branch prediction and too many function calls
>
> Some of the measurements are picking on the SIMD search code, but I go into 
> details in order to demonstrate how a regression there can go completely 
> unnoticed. Hopefully the broader themes are informative.
>
> On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada  wrote:
> > [fixed benchmarks]
>
> Thanks for that! Now I can show clear results on some aspects in a simple 
> way. The attached patches (apply on top of v6) are not intended to be 
> incorporated as-is quite yet, but do point the way to some reorganization 
> that I think is necessary. I've done some testing on loading, but will leave 
> it out for now in the interest of length.
>
>
> 0001-0003 are your performance test fix and and some small conveniences for 
> testing. Binary search is turned off, for example, because we know it 
> already. And the sleep call is so I can run perf in a different shell 
> session, on only the search portion.
>
> Note the v6 test loads all block numbers in the range. Since the test item 
> ids are all below 64 (reasonable), there are always 32 leaf chunks, so all 
> the leaves are node32 and completely full. This had the effect of never 
> taking the byte-wise loop in the proposed pg_lsearch function. These two 
> aspects make this an easy case for the branch predictor:
>
> john=# select * from bench_seq_search(0, 1*1000*1000);
> NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 
> 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  100 | 10199040 |   18000 |167 | 
> 0 |  822 |   0
>
>  1,470,141,841  branches:u
> 63,693  branch-misses:u   #0.00% of all branches
>
> john=# select * from bench_shuffle_search(0, 1*1000*1000);
> NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 
> 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  100 | 10199040 |   18000 |168 | 
> 0 | 2174 |   0
>
>  1,470,142,569  branches:u
> 15,023,983  branch-misses:u   #1.02% of all branches
>
>
> 0004 randomizes block selection in the load part of the search test so that 
> each block has a 50% chance of being loaded.  Note that now we have many 
> node16s where we had none before. Although node 16 and node32 appear to share 
> the same path in the switch statement of rt_node_search(), the chunk 
> comparison and node_get_values() calls each must go through different 
> branches. The shuffle case is most affected, but even the sequential case 
> slows down. (The leaves are less full -> there are more of them, so memory 
> use is larger, but it shouldn't matter much, in the sequential case at least)
>
> john=# select * from bench_seq_search(0, 2*1000*1000);
> NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, 
> n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms 
> | rt_search_ms | array_serach_ms
> +--+-++---+--+-
>  999654 | 14893056 |   179937720 |173 | 0 
> |  907 |   0
>
>  1,684,114,926  branches:u
>  1,989,901  branch-misses:u   #0.12% of all branches
>
> john=# select * from bench_shuffle_search(0, 2*1000*1000);
> NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, 
> n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms 
> | rt_search_ms | array_serach_ms
> +--+-++---+--+-
>  999654 | 14893056 |   17993

Re: START_REPLICATION SLOT causing a crash in an assert build

2022-10-13 Thread Masahiko Sawada
On Thu, Oct 13, 2022 at 1:21 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-11 17:10:52 +0900, Masahiko Sawada wrote:
> > +# Reset the replication slot statistics.
> > +$node->safe_psql('postgres',
> > + "SELECT pg_stat_reset_replication_slot('regression_slot');");
> > +my $result = $node->safe_psql('postgres',
> > + "SELECT * FROM pg_stat_replication_slots WHERE slot_name = 
> > 'regrssion_slot'"
> > +);
>
> Typo in the slot name "regrssion_slot" instead of "regression_slot". We can't
> use * here, because that'll include the reset timestamp.

Fixed.

>
>
> > +# Teardown the node so the statistics is removed.
> > +$pg_recvlogical->kill_kill;
> > +$node->teardown_node;
> > +$node->start;
>
> ISTM that removing the file instead of shutting down the cluster with force
> would make it a more targeted test.

Agreed.

>
>
> > +# Check if the replication slot statistics have been removed.
> > +$result = $node->safe_psql('postgres',
> > + "SELECT * FROM pg_stat_replication_slots WHERE slot_name = 
> > 'regrssion_slot'"
> > +);
> > +is($result, "", "replication slot statistics are removed");
>
> Same typo as above. We can't assert a specific result here either, because
> recvlogical will have processed a bunch of changes. Perhaps we could check at
> least that the reset time is NULL?

Agreed.

>
>
> > +# Test if the replication slot staistics continue to be accumulated even 
> > after
>
> s/staistics/statistics/

Fixed.

I've attached an updated patch. I've added the common function to
start pg_recvlogical and wait for it to become active. Please review
it.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


regression_test_for_replslot_stats_v2.patch
Description: Binary data


Re: test_decoding assertion failure for the loss of top-sub transaction relationship

2022-10-12 Thread Masahiko Sawada
On Wed, Oct 12, 2022 at 3:35 PM kuroda.hay...@fujitsu.com
 wrote:
>
> Dear Sawada-san,
>
> > FYI, as I just replied to the related thread[1], the assertion failure
> > in REL14 and REL15 can be fixed by the patch proposed there. So I'd
> > like to see how the discussion goes. Regardless of this proposed fix,
> > the patch proposed by Kuroda-san is required for HEAD, REL14, and
> > REL15, in order to fix the assertion failure in SnapBuildCommitTxn().
>
> I understood that my patches for REL14 and REL15 might be not needed.

No, sorry for confusing you. I meant that even if we agreed with the
patch I proposed there, your patch is still required to fix the issue.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-10-12 Thread Masahiko Sawada
On Tue, Oct 11, 2022 at 10:50 PM Imseih (AWS), Sami  wrote:
>
> >One idea would be to add a flag, say report_parallel_vacuum_progress,
> >to IndexVacuumInfo struct and expect index AM to check and update the
> >parallel index vacuum progress, say every 1GB blocks processed. The
> >flag is true only when the leader process is vacuuming an index.
>
> >Regards,
>
> Sorry for the long delay on this. I have taken the approach as suggested
> by Sawada-san and Robert and attached is v12.

Thank you for updating the patch!

>
> 1. The patch introduces a new counter in the same shared memory already
> used by the parallel leader and workers to keep track of the number
> of indexes completed. This way there is no reason to loop through
> the index status every time we want to get the status of indexes completed.

While it seems to be a good idea to have the atomic counter for the
number of indexes completed, I think we should not use the global
variable referencing the counter as follows:

+static pg_atomic_uint32 *index_vacuum_completed = NULL;
:
+void
+parallel_vacuum_progress_report(void)
+{
+   if (IsParallelWorker() || !report_parallel_vacuum_progress)
+   return;
+
+   pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
+pg_atomic_read_u32(index_vacuum_completed));
+}

I think we can pass ParallelVacuumState (or PVIndStats) to the
reporting function so that it can check the counter and report the
progress.
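
Something along these lines is what I have in mind (just a sketch;
PROGRESS_VACUUM_INDEX_COMPLETED and idx_completed_progress are the
names used in your patch, and the exact PVShared layout is assumed):

/* Report the number of indexes vacuumed; the leader only. */
void
parallel_vacuum_progress_report(ParallelVacuumState *pvs)
{
    if (IsParallelWorker())
        return;

    pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
                                 pg_atomic_read_u32(&pvs->shared->idx_completed_progress));
}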

> 2. A new function in vacuumparallel.c will be used to update
> the progress of indexes completed by reading from the
> counter created in point #1.
>
> 3. The function is called during the vacuum_delay_point as a
> matter of convenience, since this is called in all major vacuum
> loops. The function will only do something if the caller
> sets a boolean to report progress. Doing so will also ensure
> progress is being reported in case the parallel workers completed
> before the leader.

Robert pointed out:

---
I am not too sure that the idea of using the vacuum delay points is the best
plan. I think we should try to avoid piggybacking on such general
infrastructure if we can, and instead look for a way to tie this to
something that is specific to parallel vacuum.
---

I agree with this part.

Instead, I think we can add a boolean and a pointer to
ParallelVacuumState to IndexVacuumInfo. If the boolean is true, an
index AM can call the reporting function with the pointer to
ParallelVacuumState while scanning index blocks, for example, every
1GB. The boolean would be true only for the leader process.
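
For example (a rough sketch; report_parallel_vacuum_progress follows
the flag name suggested earlier, and the pvs pointer is an assumed new
IndexVacuumInfo field, so neither exists in the current code):

/* Somewhere in an index AM's vacuum scan loop. */
static void
index_am_vacuum_one_block(IndexVacuumInfo *info, BlockNumber blkno)
{
    /* report roughly every 1GB of index blocks */
    BlockNumber report_interval =
        (BlockNumber) (((uint64) 1024 * 1024 * 1024) / BLCKSZ);

    if (info->report_parallel_vacuum_progress &&
        blkno % report_interval == 0)
        parallel_vacuum_progress_report(info->pvs);

    /* ... the AM's per-block vacuum work goes here ... */
}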

>
> 4. Rather than adding any complexity to WaitForParallelWorkersToFinish
> and introducing a new callback, vacuumparallel.c will wait until
> the number of vacuum workers is 0 and then call
> WaitForParallelWorkersToFinish as it does currently.

Agreed, but with the following change, the leader process waits in a
busy loop, which should not be acceptable:

+   if (VacuumActiveNWorkers)
+   {
+   while (pg_atomic_read_u32(VacuumActiveNWorkers) > 0)
+   {
+   parallel_vacuum_progress_report();
+   }
+   }
+

Also, I think it's better to check whether idx_completed_progress
equals the number of indexes instead.
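
For instance, something like the following (a sketch only; the 100ms
interval is arbitrary and idx_completed_progress/nindexes are assumed
field names):

while (pg_atomic_read_u32(&pvs->shared->idx_completed_progress) <
       pvs->nindexes)
{
    /* keep the progress view up to date while we wait */
    pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
                                 pg_atomic_read_u32(&pvs->shared->idx_completed_progress));

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     100L, WAIT_EVENT_PARALLEL_FINISH);
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
}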

> 5. Went back to the idea of adding a new view called 
> pg_stat_progress_vacuum_index
> which is accomplished by adding a new type called VACUUM_PARALLEL in 
> progress.h

Probably, we can divide the patch into two patches: one for adding
the new statistic of the number of vacuumed indexes to
pg_stat_progress_vacuum, and another one for adding the new statistics
view pg_stat_progress_vacuum_index.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: test_decoding assertion failure for the loss of top-sub transaction relationship

2022-10-12 Thread Masahiko Sawada
On Wed, Sep 7, 2022 at 11:06 AM kuroda.hay...@fujitsu.com
 wrote:
>
> Dear Amit,
>
> Thanks for giving comments!
>
> > Did you get this new assertion failure after you applied the patch for
> > the first failure? Because otherwise, how can you reach it with the
> > same test case?
>
> The first failure is occurred only in the HEAD, so I did not applied the 
> first patch
> to REL14 and REL15.
> This difference is caused because the commit [Fix catalog lookup...] in 
> REL15(272248a) and older is different
> from the HEAD one.
> In older versions SnapBuildXidSetCatalogChanges() has been added. In the
> function
> a transaction will be marked as containing catalog changes if the transaction 
> is in InitialRunningXacts,
> and after that the relation between sub-top transactions is assigned based on 
> the parsed->subxact.
> The marking avoids the first failure, but the assignment triggers new failure.
>

FYI, as I just replied to the related thread[1], the assertion failure
in REL14 and REL15 can be fixed by the patch proposed there. So I'd
like to see how the discussion goes. Regardless of this proposed fix,
the patch proposed by Kuroda-san is required for HEAD, REL14, and
REL15, in order to fix the assertion failure in SnapBuildCommitTxn().

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoA1gV9pfu8hoXpTQBWH8uEMRg_F_MKM%2BU3Sr0HnyH4AUQ%40mail.gmail.com

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-11 Thread Masahiko Sawada
napshots and how we share them between the transactions :-( If we share
> > the snapshots between transactions, you're probably right we can't just
> > skip these changes.
> >
> > However, doesn't that pretty much mean we *have* to do something about
> > the assignment? I mean, suppose we miss the assignment (like now), so
> > that we end up with two TXNs that we think are top-level. And then we
> > get the commit for the actual top-level transaction. AFAICS that won't
> > clean-up the subxact, and we end up with a lingering TXN.
> >
>
> I think we will clean up such a subxact. Such a xact should be skipped
> via DecodeTXNNeedSkip() and then it will call ReorderBufferForget()
> for each of the subxacts and that will make sure that we clean up each
> of subtxn's.
>

Right.

Regards,

[1] 
https://www.postgresql.org/message-id/TYAPR01MB58666BD6BE24853269624282F5419%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[2] 
https://www.postgresql.org/message-id/TYAPR01MB58660803BCAA7849C8584AA4F57E9%40TYAPR01MB5866.jpnprd01.prod.outlook.com


--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


Re: test_decoding assertion failure for the loss of top-sub transaction relationship

2022-10-11 Thread Masahiko Sawada
On Wed, Sep 7, 2022 at 11:06 AM kuroda.hay...@fujitsu.com
 wrote:
>
> Dear Amit,
>
> Thanks for giving comments!
>
> > Did you get this new assertion failure after you applied the patch for
> > the first failure? Because otherwise, how can you reach it with the
> > same test case?
>
> The first failure is occurred only in the HEAD, so I did not applied the 
> first patch
> to REL14 and REL15.
> This difference is caused because the commit [Fix catalog lookup...] in 
> REL15(272248a) and older is different
> from the HEAD one.
> In older versions SnapBuildXidSetCatalogChanges() has been added. In the
> function
> a transaction will be marked as containing catalog changes if the transaction 
> is in InitialRunningXacts,
> and after that the relation between sub-top transactions is assigned based on 
> the parsed->subxact.
> The marking avoids the first failure, but the assignment triggers new failure.
>
>
> > About patch:
> > else if (sub_needs_timetravel)
> >   {
> > - /* track toplevel txn as well, subxact alone isn't meaningful */
> > + elog(DEBUG2, "forced transaction %u to do timetravel due to one of
> > its subtransaction",
> > + xid);
> > + needs_timetravel = true;
> >   SnapBuildAddCommittedTxn(builder, xid);
> >
> > Why did you remove the above comment? I think it still makes sense to 
> > retain it.
>
> Fixed.

Here are some review comments for v2 patch:

+# Test that we can force the top transaction to do timetravel when one of sub
+# transactions needs that. This is necessary when we restart decoding
from RUNNING_XACT
+# without the wal to associate subtransaction to its top transaction.

I don't think the second sentence is necessary.

---
The last decoding
+# starts from the first checkpoint and NEW_CID of "s0_truncate"
doesn't mark the top
+# transaction as catalog modifying transaction. In this scenario, the
enforcement sets
+# needs_timetravel to true even if the top transaction is regarded as
that it does not
+# have catalog changes and thus the decoding works without a
contradition that one
+# subtransaction needed timetravel while its top transaction didn't.

I don't understand the last sentence; it's probably too long.

How about the following description?

# Test that we can handle the case where only subtransaction is marked
as containing
# catalog changes. The last decoding starts from NEW_CID generated by
"s0_truncate" and
# marks only the subtransaction as containing catalog changes but we
don't create the
# association between top-level transaction and subtransaction yet.
When decoding the
# commit record of the top-level transaction, we must force the
top-level transaction
# to do timetravel since one of its subtransactions is marked as
containing catalog changes.

---
+ elog(DEBUG2, "forced transaction %u to do timetravel due to one of
its subtransaction",
+ xid);
+ needs_timetravel = true;

I think "one of its subtransaction" should be "one of its subtransactions".

Regards,

--
Masahiko Sawada




Re: START_REPLICATION SLOT causing a crash in an assert build

2022-10-11 Thread Masahiko Sawada
On Sun, Oct 9, 2022 at 2:42 AM Andres Freund  wrote:
>
> On 2022-10-08 09:53:50 -0700, Andres Freund wrote:
> > On 2022-10-07 19:56:33 -0700, Andres Freund wrote:
> > > I'm planning to push this either later tonight (if I feel up to it after
> > > cooking dinner) or tomorrow morning PST, due to the release wrap deadline.
> >
> > I looked this over again, tested a bit more, and pushed the adjusted 15 and
> > master versions to github to get a CI run. Once that passes, as I expect, 
> > I'll
> > push them for real.
>
> Those passed and thus pushed.
>
> Thanks for the report, debugging and review everyone!

Thanks!

>
>
> I think we need at least the following tests for replslots:
> - a reset while decoding is ongoing works correctly
> - replslot stats continue to be accumulated after stats have been removed
>
>
> I wonder how much it'd take to teach isolationtester to handle the replication
> protocol...

I think we can do these tests by using pg_recvlogical in TAP tests.
I've attached a patch to do that.
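
The rough shape of the test is something like the following (a sketch
only; the attached patch wraps this in a common helper and the details
may differ):

use IPC::Run ();

# Start pg_recvlogical on an existing slot in the background.
my @cmd = (
    'pg_recvlogical', '-d', $node->connstr('postgres'),
    '-S', 'regression_slot', '--no-loop', '-f', '-', '--start');
my ($stdout, $stderr) = ('', '');
my $pg_recvlogical = IPC::Run::start(\@cmd, '>', \$stdout, '2>', \$stderr);

# ... generate changes, reset the slot stats, check them, etc. ...
$node->safe_psql('postgres',
    "SELECT pg_stat_reset_replication_slot('regression_slot')");

$pg_recvlogical->kill_kill;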

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


regression_test_for_replslot_stats.patch
Description: Binary data


Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-10 Thread Masahiko Sawada
On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila  wrote:
>
> On Fri, Oct 7, 2022 at 8:47 AM Masahiko Sawada  wrote:
> >
> > On Thu, Oct 6, 2022 at 9:04 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > I think the root reason for this kind of deadlock problems is the table
> > > structure difference between publisher and subscriber(similar to the 
> > > unique
> > > difference reported earlier[1]). So, I think we'd better disallow this 
> > > case. For
> > > example to avoid the reported problem, we could only support parallel 
> > > apply if
> > > pubviaroot is false on publisher and replicated tables' types(relkind) 
> > > are the
> > > same between publisher and subscriber.
> > >
> > > Although it might restrict some use cases, but I think it only restrict 
> > > the
> > > cases when the partitioned table's structure is different between 
> > > publisher and
> > > subscriber. User can still use parallel apply for cases when the table
> > > structure is the same between publisher and subscriber which seems 
> > > acceptable
> > > to me. And we can also document that the feature is expected to be used 
> > > for the
> > > case when tables' structure are the same. Thoughts ?
> >
> > I'm concerned that it could be a big restriction for users. Having
> > different partitioned table's structures on the publisher and the
> > subscriber is quite common use cases.
> >
> > From the feature perspective, the root cause seems to be the fact that
> > the apply worker does both receiving and applying changes. Since it
> > cannot receive the subsequent messages while waiting for a lock on a
> > table, the parallel apply worker also cannot move forward. If we have
> > a dedicated receiver process, it can off-load the messages to the
> > worker while another process waiting for a lock. So I think that
> > separating receiver and apply worker could be a building block for
> > parallel-apply.
> >
>
> I think the disadvantage that comes to mind is the overhead of passing
> messages between receiver and applier processes even for non-parallel
> cases. Now, I don't think it is advisable to have separate handling
> for non-parallel cases. The other thing is that we need to someway
> deal with feedback messages which helps to move synchronous replicas
> and update subscriber's progress which in turn helps to keep the
> restart point updated. These messages also act as heartbeat messages
> between walsender and walapply process.
>
> To deal with this, one idea is that we can have two connections to
> walsender process, one with walreceiver and the other with walapply
> process which according to me could lead to a big increase in resource
> consumption and it will bring another set of complexities in the
> system. Now, in this, I think we have two possibilities, (a) The first
> one is that we pass all messages to the leader apply worker and then
> it decides whether to execute serially or pass it to the parallel
> apply worker. However, that can again deadlock in the truncate
> scenario we discussed because the main apply worker won't be able to
> receive new messages once it is blocked at the truncate command. (b)
> The second one is walreceiver process itself takes care of passing
> streaming transactions to parallel apply workers but if we do that
> then walreceiver needs to wait at the transaction end to maintain
> commit order which means it can also lead to deadlock in case the
> truncate happens in a streaming xact.

I imagined (b) but I had missed the point of preserving the commit
order. Separating the receiver and apply worker cannot resolve this
problem.

>
> The other alternative is that we allow walreceiver process to wait for
> apply process to finish transaction and send the feedback but that
> seems to be again an overhead if we have to do it even for small
> transactions, especially it can delay sync replication cases. Even, if
> we don't consider overhead, it can still lead to a deadlock because
> walreceiver won't be able to move in the scenario we are discussing.
>
> About your point that having different partition structures for
> publisher and subscriber, I don't know how common it will be once we
> have DDL replication. Also, the default value of
> publish_via_partition_root is false which doesn't seem to indicate
> that this is a quite common case.

So how should we deal with these concurrency issues that can happen
only when streaming = 'parallel'? Can we restrict some use cases to
avoid the problem, or can we have a safeguard against these conflicts?
We could find a new problematic scenario in the future, and if that
happens and logical replication gets stuck, it cannot be resolved by
the apply workers themselves.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-07 Thread Masahiko Sawada
On Fri, Oct 7, 2022 at 2:29 PM John Naylor  wrote:
>
> On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada  wrote:
> > In addition to two patches, I've attached the third patch. It's not
> > part of radix tree implementation but introduces a contrib module
> > bench_radix_tree, a tool for radix tree performance benchmarking. It
> > measures loading and lookup performance of both the radix tree and a
> > flat array.
>
> Hi Masahiko, I've been using these benchmarks, along with my own variations, 
> to try various things that I've mentioned. I'm long overdue for an update, 
> but the picture is not yet complete.

Thanks!

> For now, I have two questions that I can't figure out on my own:
>
> 1. There seems to be some non-obvious limit on the number of keys that are 
> loaded (or at least what the numbers report). This is independent of the 
> number of tids per block. Example below:
>
> john=# select * from bench_shuffle_search(0, 8*1000*1000);
> NOTICE:  num_keys = 800, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 
> 25, n256 = 981
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  800 |268435456 |4800 |661 |
> 29 |  276 | 389
>
> john=# select * from bench_shuffle_search(0, 9*1000*1000);
> NOTICE:  num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 
> 262144, n256 = 1028
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  8388608 |276824064 |5400 |718 |
> 33 |  311 | 446
>
> The array is the right size, but nkeys hasn't kept pace. Can you reproduce 
> this? Attached is the patch I'm using to show the stats when running the 
> test. (Side note: The numbers look unfavorable for radix tree because I'm 
> using 1 tid per block here.)

Yes, I can reproduce this. In tid_to_key_off() we need to cast to
uint64 when packing offset number and block number:

   tid_i = ItemPointerGetOffsetNumber(tid);
   tid_i |= ItemPointerGetBlockNumber(tid) << shift;
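
In other words, something like this (a sketch; the variable names
follow the snippet above and tid_i is assumed to be a uint64):

/* Widen the block number before shifting so the high bits survive. */
tid_i = ItemPointerGetOffsetNumber(tid);
tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;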

>
> 2. I found that bench_shuffle_search() is much *faster* for traditional 
> binary search on an array than bench_seq_search(). I've found this to be true 
> in every case. This seems counterintuitive to me -- any idea why this is? 
> Example:
>
> john=# select * from bench_seq_search(0, 100);
> NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 
> 1, n256 = 122
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  100 | 10199040 |   18000 |168 |   
> 106 |  827 |3348
>
> john=# select * from bench_shuffle_search(0, 100);
> NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 
> 1, n256 = 122
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  100 | 10199040 |   18000 |171 |   
> 107 |  827 |1400
>

Ugh, in shuffle_itemptrs(), we shuffled itemptrs instead of itemptr:

for (int i = 0; i < nitems - 1; i++)
{
int j = shuffle_randrange(, i, nitems - 1);
   ItemPointerData t = itemptrs[j];

   itemptrs[j] = itemptrs[i];
   itemptrs[i] = t;

With the fix, the results on my environment were:

postgres(1:4093192)=# select * from bench_seq_search(0, 1000);
2022-10-07 16:57:03.124 JST [4093192] LOG:  num_keys = 1000,
height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
  nkeys   | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--+--+-++---+--+-
 1000 |101826560 |  18 |846 |
 486 | 6096 |   21128
(1 row)

Time: 28975.566 ms (00:28.976)
postgres(1:4093192)=# select * from bench_shuffle_search(0, 1000);
2022-10-07 16:57:37.476 JST [4093192] LOG:  num_keys = 1000,
height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
  nkeys   | rt_mem_allocated | array_mem_a

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread Masahiko Sawada
On Thu, Oct 6, 2022 at 9:04 PM houzj.f...@fujitsu.com
 wrote:
>
>
>
> > -Original Message-
> > From: Masahiko Sawada 
> > Sent: Thursday, October 6, 2022 4:07 PM
> > To: Hou, Zhijie/侯 志杰 
> > Cc: Amit Kapila ; Wang, Wei/王 威
> > ; Peter Smith ; Dilip
> > Kumar ; Shi, Yu/侍 雨 ;
> > PostgreSQL Hackers 
> > Subject: Re: Perform streaming logical transactions by background workers 
> > and
> > parallel apply
> >
> > On Tue, Sep 27, 2022 at 9:26 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Saturday, September 24, 2022 7:40 PM Amit Kapila
> >  wrote:
> > > >
> > > > On Thu, Sep 22, 2022 at 3:41 PM Amit Kapila 
> > > > wrote:
> > > > >
> > > > > On Thu, Sep 22, 2022 at 8:59 AM wangw.f...@fujitsu.com
> > > > >  wrote:
> > > > > >
> > > > >
> > > > > Few comments on v33-0001
> > > > > ===
> > > > >
> > > >
> > > > Some more comments on v33-0001
> > > > =
> > > > 1.
> > > > + /* Information from the corresponding LogicalRepWorker slot. */
> > > > + uint16 logicalrep_worker_generation;
> > > > +
> > > > + int logicalrep_worker_slot_no;
> > > > +} ParallelApplyWorkerShared;
> > > >
> > > > Both these variables are read/changed by leader/parallel workers without
> > > > using any lock (mutex). It seems currently there is no problem because 
> > > > of
> > the
> > > > way the patch is using in_parallel_apply_xact but I think it won't be a 
> > > > good
> > idea
> > > > to rely on it. I suggest using mutex to operate on these variables and 
> > > > also
> > check
> > > > if the slot_no is in a valid range after reading it in 
> > > > parallel_apply_free_worker,
> > > > otherwise error out using elog.
> > >
> > > Changed.
> > >
> > > > 2.
> > > >  static void
> > > >  apply_handle_stream_stop(StringInfo s)
> > > >  {
> > > > - if (!in_streamed_transaction)
> > > > + ParallelApplyWorkerInfo *winfo = NULL; TransApplyAction apply_action;
> > > > +
> > > > + if (!am_parallel_apply_worker() &&
> > > > + (!in_streamed_transaction && !stream_apply_worker))
> > > >   ereport(ERROR,
> > > >   (errcode(ERRCODE_PROTOCOL_VIOLATION),
> > > >   errmsg_internal("STREAM STOP message without STREAM START")));
> > > >
> > > > This check won't be able to detect missing stream start messages for 
> > > > parallel
> > > > apply workers apart from the first pair of start/stop. I thought of 
> > > > adding
> > > > in_remote_transaction check along with
> > > > am_parallel_apply_worker() to detect the same but that also won't work
> > > > because the parallel worker doesn't reset it at the stop message.
> > > > Another possibility is to introduce yet another variable for this but 
> > > > that
> > doesn't
> > > > seem worth it. I would like to keep this check simple.
> > > > Can you think of any better way?
> > >
> > > I feel we can reuse the in_streamed_transaction in parallel apply worker 
> > > to
> > > simplify the check there. I tried to set this flag in parallel apply 
> > > worker
> > > when stream starts and reset it when stream stop so that we can directly 
> > > check
> > > this flag for duplicate stream start message and other related things.
> > >
> > > > 3. I think we can skip sending start/stop messages from the leader to 
> > > > the
> > > > parallel worker because unlike apply worker it will process only one
> > > > transaction-at-a-time. However, it is not clear whether that is worth 
> > > > the
> > effort
> > > > because it is sent after logical_decoding_work_mem changes. For now, I 
> > > > have
> > > > added a comment for this in the attached patch but let me if I am 
> > > > missing
> > > > something or if I am wrong.
> > >
> > > I the suggested comments look good.
> > >
> > > > 4.
> > > > postgres=# select pid, leader_pid, application_name, backend_type from
> > > > pg_stat_activity;
> > > >   pid  | leader_pi

Re: START_REPLICATION SLOT causing a crash in an assert build

2022-10-06 Thread Masahiko Sawada
On Fri, Oct 7, 2022 at 8:00 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-06 14:10:46 +0900, Kyotaro Horiguchi wrote:
> > +1. FWIW, the atttached is an example of what it looks like if we
> > avoid file format change.
>
> What about if we go the other direction - simply remove the name from the
> stats entry at all. I don't actually think we need it anymore. Unless I am
> missing something right now - entirely possible! - the danger that
> pgstat_acquire_replslot() mentions doesn't actually exist [anymore]. After a
> crash we throw away the old stats data and if a slot is dropped while shut
> down, we'll not load the slot data at startup.

+1. I think it works. Since the replication slot index doesn't change
while the server is running, we can fetch the name from
ReplicationSlotCtl->replication_slots.

If we don't need the name in the stats entry, is
pgstat_acquire_replslot() no longer necessary?
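
For example (a sketch; slot_index and slotname are illustrative
variables, and the locking detail would need checking):

ReplicationSlot *slot;
NameData    slotname;

LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
slot = &ReplicationSlotCtl->replication_slots[slot_index];
if (slot->in_use)
    namestrcpy(&slotname, NameStr(slot->data.name));
LWLockRelease(ReplicationSlotControlLock);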

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Issue with pg_stat_subscription_stats

2022-10-06 Thread Masahiko Sawada
On Fri, Oct 7, 2022 at 9:27 AM Andres Freund  wrote:
>
> On 2022-10-06 16:43:43 -0700, Andres Freund wrote:
> > On 2022-10-06 14:10:56 +0900, Michael Paquier wrote:
> > > On Tue, Jul 12, 2022 at 09:31:16AM +0530, Amit Kapila wrote:
> > > > I am not against backpatching this but OTOH it doesn't appear critical
> > > > enough to block one's work, so not backpatching should be fine.
> > >
> > > We are just talking about the reset timestamp not being set at
> > > when the object is created, right?  This does not strike me as
> > > critical, so applying it only on HEAD is fine IMO.  A few months ago,
> > > while in beta, I would have been fine with something applied to
> > > REL_15_STABLE.  Now that we are in RC, that's not worth taking a risk
> > > in my opinion.
> >
> > Agreed.
> >
> > > Amit or Andres, are you planning to double-check and perhaps merge
> > > this patch to take care of the inconsistency?
> >
> > I'll run it through CI and then to master unless somebody pipes up in the
> > meantime.
>
> And pushed. Thanks all!

Thanks!

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread Masahiko Sawada
rue');

* On subscriber
create table p (c int) partition by list (c);
create table c1 partition of p for values In (2);
create table c2 partition of p for values In (1);
create subscription test_sub connection 'port=5551 dbname=postgres'
publication test_pub with (streaming = 'parallel', copy_data =
'false');

Note that while both the publisher and the subscriber have tables
with the same names, the partition structure is different and rows go
to a different table on the subscriber (e.g., a row with c=1 will go
to table c2 on the subscriber). If two concurrent transactions are
executed as follows, the apply worker (i.e., the leader apply worker)
waits for a lock on c2 held by its parallel apply worker:

* TX-1
BEGIN;
INSERT INTO p SELECT 1 FROM generate_series(1, 1); --- changes are streamed

* TX-2
BEGIN;
TRUNCATE c2; --- wait for a lock on c2

* TX-1
INSERT INTO p SELECT 1 FROM generate_series(1, 1);
COMMIT;

This might not be a common case in practice, but it could mean that
there is a restriction on how partitioned tables should be structured
on the publisher and the subscriber when using streaming = 'parallel'.
When this happens, since logical replication cannot move forward, the
users need to disable parallel-apply mode or increase
logical_decoding_work_mem. We could describe this limitation in the
docs, but it would be hard for users to detect a problematic table
structure.

BTW, when the leader apply worker waits for a lock on c2 in the above
example, the parallel apply worker is in a busy-loop, which should be
fixed.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-06 Thread Masahiko Sawada
On Wed, Oct 5, 2022 at 6:40 PM John Naylor  wrote:
>
>
> On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada  wrote:
> >
> > On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> > >  wrote:
> > > Yeah, node31 and node256 are bloated.  We probably could use slab for
> > > node256 independently. It's worth trying a benchmark to see how it
> > > affects the performance and the tree size.
>
> This wasn't the focus of your current email, but while experimenting with v6 
> I had another thought about local allocation: If we use the default slab 
> block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right? If 
> so, since aset and DSA also waste at least a few hundred bytes, we could 
> store a useless 256-byte slot array within node256. That way, node128 and 
> node256 share the same start of pointers/values array, so there would be one 
> less branch for getting that address. In v6, rt_node_get_values and 
> rt_node_get_children are not inlined (aside: gcc uses a jump table for 5 kinds 
> but not for 4), but possibly should be, and the smaller the better.

It would be good for performance, but I'm a bit concerned that it's
highly tuned to the current design of aset and DSA. Since a node of
size 2088 is currently classed as 2616 in DSA, DSA wastes 528 bytes.
However, if a new class of 2304 (= 2048 + 256) bytes were introduced,
there would no longer be room for the useless 256-byte slot array and
the assumption would be broken.

>
> > Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
> > to point to its child nodes, instead of C pointers (i.e., backend-local
> > address). I'm thinking of a straightforward approach as the first
> > step; inner nodes have a union of rt_node* and dsa_pointer and we
> > choose either one based on whether the radix tree is shared or not. We
> > allocate and free the shared memory for individual nodes by
> > dsa_allocate() and dsa_free(), respectively. Therefore we need to get
> > a C pointer from dsa_pointer by using dsa_get_address() while
> > descending the tree. I'm a bit concerned that calling
> > dsa_get_address() for every descent could be performance overhead but
> > I'm going to measure it anyway.
>
> Are dsa pointers aligned the same as pointers to locally allocated memory? 
> Meaning, is the offset portion always a multiple of 4 (or 8)?

I think so.

> It seems that way from a glance, but I can't say for sure. If the lower 2 
> bits of a DSA pointer are never set, we can tag them the same way as a 
> regular pointer. That same technique could help hide the latency of 
> converting the pointer, by the same way it would hide the latency of loading 
> parts of a node into CPU registers.
>
> One concern is, handling both local and dsa cases in the same code requires 
> more (predictable) branches and reduces code density. That might be a reason 
> in favor of templating to handle each case in its own translation unit.

Right. We also need to support locking for shared radix tree, which
would require more branches.
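
For reference, the low-bit tagging you mention could look roughly like
this (only a sketch under the assumption that both local pointers and
dsa_pointer values leave the lower 2 bits free; the names are made up):

#define RT_PTR_TAG_MASK UINT64CONST(0x3)    /* 2 spare bits -> up to 4 kinds */

/* store the node kind in the low bits of a (local or dsa) pointer */
static inline uint64
rt_ptr_set_kind(uint64 ptr, uint8 kind)
{
    Assert((ptr & RT_PTR_TAG_MASK) == 0);
    return ptr | kind;
}

static inline uint8
rt_ptr_get_kind(uint64 ptr)
{
    return (uint8) (ptr & RT_PTR_TAG_MASK);
}

static inline uint64
rt_ptr_get_addr(uint64 ptr)
{
    return ptr & ~RT_PTR_TAG_MASK;
}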

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-05 Thread Masahiko Sawada
On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada  wrote:
>
> On Fri, Sep 23, 2022 at 12:11 AM John Naylor
>  wrote:
> >
> >
> > On Thu, Sep 22, 2022 at 11:46 AM John Naylor  
> > wrote:
> > > One thing I want to try soon is storing fewer than 16/32 etc entries, so 
> > > that the whole node fits comfortably inside a power-of-two allocation. 
> > > That would allow us to use aset without wasting space for the smaller 
> > > nodes, which would be faster and possibly would solve the fragmentation 
> > > problem Andres referred to in
> >
> > > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
> >
> > While calculating node sizes that fit within a power-of-two size, I noticed 
> > the current base node is a bit wasteful, taking up 8 bytes. The node kind 
> > only has a small number of values, so it doesn't really make sense to use 
> > an enum here in the struct (in fact, Andres' prototype used a uint8 for 
> > node_kind). We could use a bitfield for the count and kind:
> >
> > uint16 -- kind and count bitfield
> > uint8 shift;
> > uint8 chunk;
> >
> > That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, 
> > the bitfield can just go back to being count only.
>
> Good point, agreed.
>
> >
> > Here are the v6 node kinds:
> >
> > node4:   8 +   4 +(4)+   4*8 =   48 bytes
> > node16:  8 +  16 +  16*8 =  152
> > node32:  8 +  32 +  32*8 =  296
> > node128: 8 + 256 + 128/8 + 128*8 = 1304
> > node256: 8   + 256/8 + 256*8 = 2088
> >
> > And here are the possible ways we could optimize nodes for space using aset 
> > allocation. Parentheses are padding bytes. Even if my math has mistakes, 
> > the numbers shouldn't be too far off:
> >
> > node3:   4 +   3 +(1)+   3*8 =   32 bytes
> > node6:   4 +   6 +(6)+   6*8 =   64
> > node13:  4 +  13 +(7)+  13*8 =  128
> > node28:  4 +  28 +  28*8 =  256
> > node31:  4 + 256 +  32/8 +  31*8 =  512 (XXX not good)
> > node94:  4 + 256 +  96/8 +  94*8 = 1024
> > node220: 4 + 256 + 224/8 + 220*8 = 2048
> > node256: = 4096
> >
> > The main disadvantage is that node256 would balloon in size.
>
> Yeah, node31 and node256 are bloated.  We probably could use slab for
> node256 independently. It's worth trying a benchmark to see how it
> affects the performance and the tree size.
>
> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
> sizeof(dsa_area_span), 0,   /* special size classes */
> 8, 16, 24, 32, 40, 48, 56, 64,  /* 8 classes separated by 8 bytes */
> 80, 96, 112, 128,   /* 4 classes separated by 16 bytes */
> 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
> 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
> 640, 768, 896, 1024,/* 4 classes separated by 128 bytes */
> 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
> 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
> 5456, 6552, 7280, 8192  /* 4 classes separated by ~1024 bytes */
> };
>
> node256 will be classed as 2616, which is still not good.
>
> Anyway, I'll implement DSA support for radix tree.
>

Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to its child nodes, instead of C pointers (i.e., backend-local
address). I'm thinking of a straightforward approach as the first
step; inner nodes have a union of rt_node* and dsa_pointer and we
choose either one based on whether the radix tree is shared or not. We
allocate and free the shared memory for individual nodes by
dsa_allocate() and dsa_free(), respectively. Therefore we need to get
a C pointer from dsa_pointer by using dsa_get_address() while
descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could be performance overhead but
I'm going to measure it anyway.
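
Concretely, the straightforward approach could look something like
this (a rough sketch; the type and field names, including is_shared
and area, are assumptions rather than the actual patch):

typedef union rt_node_ptr
{
    rt_node    *local;      /* non-shared tree: backend-local address */
    dsa_pointer shared;     /* shared tree: dsa_pointer into the DSA area */
} rt_node_ptr;

static inline rt_node *
rt_ptr_to_local(radix_tree *tree, rt_node_ptr ptr)
{
    if (tree->is_shared)
        return (rt_node *) dsa_get_address(tree->area, ptr.shared);
    else
        return ptr.local;
}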

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve description of XLOG_RUNNING_XACTS

2022-10-03 Thread Masahiko Sawada
On Mon, Oct 3, 2022 at 5:15 PM Michael Paquier  wrote:
>
> On Fri, Sep 16, 2022 at 10:55:53AM +0900, Kyotaro Horiguchi wrote:
> > Putting an arbitrary upper-bound on the number of subxids to print
> > might work? I'm not sure how we can determine the upper-bound, though.
>
> You could hardcode it so as it does not blow up the whole view, say
> 20~30.  Anyway, I agree with the concern raised upthread about the
> amount of extra data this would add to the output, so having at least
> the number of subxids would be better than the existing state of
> things telling only if the list of overflowed.  So let's stick to
> that.

Why should only the subtransaction information in XLOG_RUNNING_XACTS
be limited? We show other information without bounds, such as lock
information in XLOG_STANDBY_LOCK and invalidation messages in
XLOG_INVALIDATIONS.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Small miscellaneous fixes

2022-10-03 Thread Masahiko Sawada
On Fri, Sep 30, 2022 at 9:08 AM Ranier Vilela  wrote:
>
> Hi.
>
> There are assorted fixes to the head branch.
>
> 1. Avoid useless reassigning var _logsegno (src/backend/access/transam/xlog.c)
> Commit 7d70809 left a little oversight.
> XLByteToPrevSeg and XLByteToSeg are macros, and both assign _logsegno.
> So, the first assignment is lost and is useless.
>
> 2. Avoid retesting log_min_duration (src/backend/commands/analyze.c)
> The log_min_duration has already been tested before and the second test
> can be safely removed.
>
> 3. Avoid useless var declaration record (src/backend/utils/misc/guc.c)
> The var record is never really used.

These three changes look good to me.

>
> 4. Fix declaration volatile signal var (src/bin/pgbench/pgbench.c)
> Like how to commit 5ac9e86, this is a similar case.

Isn't the same also true for alarm_triggered in pg_test_fsync.c?

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Log details for client certificate failures

2022-10-02 Thread Masahiko Sawada
On Sat, Oct 1, 2022 at 7:53 PM Peter Eisentraut
 wrote:
>
> On 29.09.22 06:52, Masahiko Sawada wrote:
> > While this seems a future-proof idea, I wonder if it might be overkill
> > since we don't need to worry about accumulation of leaked memory in
> > this case. Given that check_cluster_name is the only case where we
> > found a small memory leak, I think it's adequate to just fix that.
> >
> > Fixing this issue suppresses valgrind's complaint, but since the
> > boot value of cluster_name is "" the memory leak we can avoid is only
> > 1 byte.
>
> I have committed this.  I think it's better to keep the code locally
> robust and not to have to rely on complex analysis of how GUC memory
> management works.

Thanks! Agreed.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-02 Thread Masahiko Sawada
On Mon, Oct 3, 2022 at 2:04 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-09-16 15:00:31 +0900, Masahiko Sawada wrote:
> > I've updated the radix tree patch. It's now separated into two patches.
>
> cfbot notices a compiler warning:
> https://cirrus-ci.com/task/6247907681632256?logs=gcc_warning#L446
>
> [11:03:05.343] radixtree.c: In function ‘rt_iterate_next’:
> [11:03:05.343] radixtree.c:1758:15: error: ‘slot’ may be used uninitialized 
> in this function [-Werror=maybe-uninitialized]
> [11:03:05.343]  1758 |*value_p = *((uint64 *) slot);
> [11:03:05.343]   |   ^~
>

Thanks, I'll fix it in the next version of the patch.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-30 Thread Masahiko Sawada
On Wed, Sep 28, 2022 at 3:18 PM John Naylor
 wrote:
>
> On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada  
> wrote:
>
> > BTW We need to consider not only aset/slab but also DSA since we
> > allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> > the following size classes:
> >
> > static const uint16 dsa_size_classes[] = {
> > [...]
>
> Thanks for that info -- I wasn't familiar with the details of DSA. For the 
> non-parallel case, I plan to at least benchmark using aset because I gather 
> it's the most heavily optimized. I'm thinking that will allow other problem 
> areas to be more prominent. I'll also want to compare total context size 
> compared to slab to see if possibly less fragmentation makes up for other 
> wastage.

Thanks!

>
> Along those lines, one thing I've been thinking about is the number of size 
> classes. There is a tradeoff between memory efficiency and number of branches 
> when searching/inserting. My current thinking is there is too much coupling 
> between size class and data type. Each size class currently uses a different 
> data type and a different algorithm to search and set it, which in turn 
> requires another branch. We've found that a larger number of size classes 
> leads to poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers, and 
> keep the rest of the control data in the struct the same. That way, we never 
> have more than 4 actual "kinds" to code and branch on. As a bonus, when 
> migrating a node to a larger size class of the same kind, we can simply 
> repalloc() to the next size.

Interesting idea. Using flexible array members for values would also
be good for the case where, in the future, we want to support value
types other than uint64.

With this idea, we can just repalloc() to grow to the larger size
class within a pair, but I'm slightly concerned that the more size
classes we use, the more frequently a node needs to grow. If we want
to support node shrinking, deletion is also affected.

> To show what I mean, consider this new table:
>
> node2:   5 +  6   +(5)+  2*8 =   32 bytes
> node6:   5 +  6   +(5)+  6*8 =   64
>
> node12:  5 + 27   + 12*8 =  128
> node27:  5 + 27   + 27*8 =  248(->256)
>
> node91:  5 + 256 + 28 +(7)+ 91*8 = 1024
> node219: 5 + 256 + 28 +(7)+219*8 = 2048
>
> node256: 5 + 32   +(3)+256*8 = 2088(->4096)
>
> Seven size classes are grouped into the four kinds.
>
> The common base at the front is here 5 bytes because there is a new uint8 
> field for "capacity", which we can ignore for node256 since we assume we can 
> always insert/update that node. The control data is the same in each pair, 
> and so the offset to the pointer/value array is the same. Thus, migration 
> would look something like:

I think we can use a bitfield for capacity. That way, we can pack
count (9 bits), kind (2 bits), and capacity (4 bits) into a uint16.
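
For example, the base node could look something like this (just a
sketch to illustrate the packing; the field names are mine):

typedef struct rt_node_base
{
    uint16      count:9,        /* number of children (0..256) */
                kind:2,         /* node kind, 4 kinds */
                capacity:4;     /* allocated slots, as a size-class code */
    uint8       shift;
    uint8       chunk;
} rt_node_base;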

> Somewhat unrelated, we could still implement Andres' idea [1] to dispense 
> with the isset array in inner nodes of the indirect array type (now node128), 
> since we can just test if the pointer is null.

Right. I didn't do that so that inner node128 and leaf node128 could
share the common logic.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Log details for client certificate failures

2022-09-28 Thread Masahiko Sawada
On Thu, Sep 29, 2022 at 1:43 AM Jacob Champion  wrote:
>
> On Tue, Sep 27, 2022 at 6:14 PM Masahiko Sawada  wrote:
> > No. Since cluster_name is PGC_POSTMASTER, we leak a little postmaster
> > memory only once, when starting up. application_name is PGC_USERSET,
> > but since we normally allocate the memory in PortalMemoryContext, it
> > is eventually freed.
>
> Oh, I see; thank you for the correction. And even if someone put an
> application_name into their postgresql.conf, and then changed it a
> bunch of times, we'd free the leaked memory from the config_cxt that's
> created in ProcessConfigFile().

Right.

>
> Is there a reason we don't provide a similar temporary context during
> InitializeGUCOptions()? Naively it seems like that would suppress any
> future one-time leaks, and maybe cut down on some Valgrind noise. Then
> again, maybe there's just not that much demand for pallocs during GUC
> hooks.

While this seems a future-proof idea, I wonder if it might be overkill
since we don't need to worry about accumulation of leaked memory in
this case. Given that check_cluster_name is the only case where we
found a small memory leak, I think it's adequate to just fix that.

Fixing this issue suppresses valgrind's complaint, but since the
boot value of cluster_name is "" the memory leak we can avoid is only
1 byte.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-27 Thread Masahiko Sawada
On Fri, Sep 23, 2022 at 12:11 AM John Naylor
 wrote:
>
>
> On Thu, Sep 22, 2022 at 11:46 AM John Naylor  
> wrote:
> > One thing I want to try soon is storing fewer than 16/32 etc entries, so 
> > that the whole node fits comfortably inside a power-of-two allocation. That 
> > would allow us to use aset without wasting space for the smaller nodes, 
> > which would be faster and possibly would solve the fragmentation problem 
> > Andres referred to in
>
> > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
>
> While calculating node sizes that fit within a power-of-two size, I noticed 
> the current base node is a bit wasteful, taking up 8 bytes. The node kind 
> only has a small number of values, so it doesn't really make sense to use an 
> enum here in the struct (in fact, Andres' prototype used a uint8 for 
> node_kind). We could use a bitfield for the count and kind:
>
> uint16 -- kind and count bitfield
> uint8 shift;
> uint8 chunk;
>
> That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the 
> bitfield can just go back to being count only.

Good point, agreed.

>
> Here are the v6 node kinds:
>
> node4:   8 +   4 +(4)+   4*8 =   48 bytes
> node16:  8 +  16 +  16*8 =  152
> node32:  8 +  32 +  32*8 =  296
> node128: 8 + 256 + 128/8 + 128*8 = 1304
> node256: 8   + 256/8 + 256*8 = 2088
>
> And here are the possible ways we could optimize nodes for space using aset 
> allocation. Parentheses are padding bytes. Even if my math has mistakes, the 
> numbers shouldn't be too far off:
>
> node3:   4 +   3 +(1)+   3*8 =   32 bytes
> node6:   4 +   6 +(6)+   6*8 =   64
> node13:  4 +  13 +(7)+  13*8 =  128
> node28:  4 +  28 +  28*8 =  256
> node31:  4 + 256 +  32/8 +  31*8 =  512 (XXX not good)
> node94:  4 + 256 +  96/8 +  94*8 = 1024
> node220: 4 + 256 + 224/8 + 220*8 = 2048
> node256: = 4096
>
> The main disadvantage is that node256 would balloon in size.

Yeah, node31 and node256 are bloated.  We probably could use slab for
node256 independently. It's worth trying a benchmark to see how it
affects the performance and the tree size.

BTW We need to consider not only aset/slab but also DSA since we
allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
the following size classes:

static const uint16 dsa_size_classes[] = {
sizeof(dsa_area_span), 0,   /* special size classes */
8, 16, 24, 32, 40, 48, 56, 64,  /* 8 classes separated by 8 bytes */
80, 96, 112, 128,   /* 4 classes separated by 16 bytes */
160, 192, 224, 256, /* 4 classes separated by 32 bytes */
320, 384, 448, 512, /* 4 classes separated by 64 bytes */
640, 768, 896, 1024,/* 4 classes separated by 128 bytes */
1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
5456, 6552, 7280, 8192  /* 4 classes separated by ~1024 bytes */
};

node256 will be classed as 2616, which is still not good.

Anyway, I'll implement DSA support for radix tree.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Log details for client certificate failures

2022-09-27 Thread Masahiko Sawada
On Wed, Sep 28, 2022 at 4:44 AM Jacob Champion  wrote:
>
> On Tue, Sep 27, 2022 at 1:51 AM Masahiko Sawada  wrote:
> > I think we can fix it with the attached patch, but I'd like to discuss
> > whether it's worth fixing.
>
> Whoops. So every time it's changed, we leak a little postmaster memory?

No. Since cluster_name is PGC_POSTMASTER, we leak a little postmaster
memory only once, when starting up. application_name is PGC_USERSET,
but since we normally allocate the memory in PortalMemoryContext, it
is eventually freed. Since check_cluster_name and
check_application_name are similar, I changed both for consistency.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Log details for client certificate failures

2022-09-27 Thread Masahiko Sawada
Hi,

On Tue, Sep 13, 2022 at 11:11 PM Peter Eisentraut
 wrote:
>
> On 09.09.22 00:32, Jacob Champion wrote:
> > On Thu, Jul 28, 2022 at 9:19 AM Jacob Champion  
> > wrote:
> >> On Thu, Jul 21, 2022 at 4:29 PM Jacob Champion  
> >> wrote:
> >>> v4 attempts to fix this by letting the check hooks pass
> >>> MCXT_ALLOC_NO_OOM to pg_clean_ascii(). (It's ignored in the frontend,
> >>> which just mallocs.)
> >>
> >> Ping -- should I add an open item somewhere so this isn't lost?
> >
> > Trying again. Peter, is this approach acceptable? Should I try something 
> > else?
>
> This looks fine to me.  Committed.

While looking at the recent changes for check_cluster_name(), I found
this thread. Regarding the following change made by commit
45b1a67a0fc, there is possibly a small memory leak:

 static bool
 check_cluster_name(char **newval, void **extra, GucSource source)
 {
+   char   *clean;
+
/* Only allow clean ASCII chars in the cluster name */
-   pg_clean_ascii(*newval);
+   clean = pg_clean_ascii(*newval, MCXT_ALLOC_NO_OOM);
+   if (!clean)
+   return false;
+
+   clean = guc_strdup(WARNING, clean);
+   if (!clean)
+   return false;

+   *newval = clean;
return true;
 }

pg_clean_ascii() uses palloc_extended() to allocate memory in the
Postmaster context for the cleaned string, and "clean" is then
overwritten with new memory allocated by guc_strdup(). Nothing
references the memory allocated by pg_clean_ascii() anymore, and it
lasts for the postmaster's lifetime. Valgrind memcheck also shows:

 1 bytes in 1 blocks are definitely lost in loss record 4 of 70
at 0xCD2A16: palloc_extended (mcxt.c:1239)
by 0xD09437: pg_clean_ascii (string.c:99)
by 0x7A5CF3: check_cluster_name (variable.c:1061)
by 0xCAF7CD: call_string_check_hook (guc.c:6365)
by 0xCAA724: InitializeOneGUCOption (guc.c:1439)
by 0xCAA0ED: InitializeGUCOptions (guc.c:1268)
by 0x99B245: PostmasterMain (postmaster.c:691)
by 0x858896: main (main.c:197)

I think we can fix it with the attached patch, but I'd like to discuss
whether it's worth fixing.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c795cb7a29..e555fb3150 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1025,17 +1025,22 @@ bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
 	char	   *clean;
+	char	   *ret;
 
 	/* Only allow clean ASCII chars in the application name */
 	clean = pg_clean_ascii(*newval, MCXT_ALLOC_NO_OOM);
 	if (!clean)
 		return false;
 
-	clean = guc_strdup(WARNING, clean);
-	if (!clean)
+	ret = guc_strdup(WARNING, clean);
+	if (!ret)
+	{
+		pfree(clean);
 		return false;
+	}
 
-	*newval = clean;
+	pfree(clean);
+	*newval = ret;
 	return true;
 }
 
@@ -1056,17 +1061,22 @@ bool
 check_cluster_name(char **newval, void **extra, GucSource source)
 {
 	char	   *clean;
+	char	   *ret;
 
 	/* Only allow clean ASCII chars in the cluster name */
 	clean = pg_clean_ascii(*newval, MCXT_ALLOC_NO_OOM);
 	if (!clean)
 		return false;
 
-	clean = guc_strdup(WARNING, clean);
-	if (!clean)
+	ret = guc_strdup(WARNING, clean);
+	if (!ret)
+	{
+		pfree(clean);
 		return false;
+	}
 
-	*newval = clean;
+	pfree(clean);
+	*newval = ret;
 	return true;
 }
 


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-22 Thread Masahiko Sawada
On Thu, Sep 22, 2022 at 1:46 PM John Naylor
 wrote:
>
>
> On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart  
> wrote:
> >
> > On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
> >
> > > In short, this code needs to be lower level so that we still have full
> > > control while being portable. I will work on this, and also the related
> > > code for node dispatch.
> >
> > Is it possible to use approach #2 here, too?  AFAICT space is allocated for
> > all of the chunks, so there wouldn't be any danger in searching all them
> > and discarding any results >= node->count.
>
> Sure, the caller could pass the maximum node capacity, and then check if the 
> returned index is within the range of the node count.
>
> > Granted, we're depending on the
> > number of chunks always being a multiple of elements-per-vector in order to
> > avoid the tail path, but that seems like a reasonably safe assumption that
> > can be covered with comments.
>
> Actually, we don't need to depend on that at all. When I said "junk" above, 
> that can be any bytes, as long as we're not reading off the end of allocated 
> memory. We'll never do that here, since the child pointers/values follow. In 
> that case, the caller can hard-code the  size (it would even happen to work 
> now to multiply rt_node_kind by 16, to be sneaky). One thing I want to try 
> soon is storing fewer than 16/32 etc entries, so that the whole node fits 
> comfortably inside a power-of-two allocation. That would allow us to use aset 
> without wasting space for the smaller nodes, which would be faster and 
> possibly would solve the fragmentation problem Andres referred to in
>
> https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
>
> While on the subject, I wonder how important it is to keep the chunks in the 
> small nodes in sorted order. That adds branches and memmove calls, and is the 
> whole reason for the recent "pg_lfind_ge" function.

Good point. While keeping the chunks in the small nodes in sorted
order is useful for visiting all keys in sorted order, additional
branches and memmove calls could be slow.
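
For example, inserting a new chunk into a sorted node looks roughly
like this (a sketch with assumed names; rt_node_search_ge() is
hypothetical), whereas an unsorted node could simply append at
chunks[count]:

/* find the first chunk >= key, then shift both tails to make room */
int     pos = rt_node_search_ge(node, chunk);

memmove(&node->chunks[pos + 1], &node->chunks[pos],
        (node->count - pos) * sizeof(uint8));
memmove(&node->children[pos + 1], &node->children[pos],
        (node->count - pos) * sizeof(rt_node *));
node->chunks[pos] = chunk;
node->children[pos] = child;
node->count++;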

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-20 Thread Masahiko Sawada
On Fri, Sep 16, 2022 at 4:54 PM John Naylor
 wrote:
>
> On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada  wrote:
> >
> > On Mon, Aug 15, 2022 at 10:39 PM John Naylor
> >  wrote:
> > >
> > > bool, buth = and <=. Should be pretty close. Also, i believe if you
> > > left this for last as a possible refactoring, it might save some work.
>
> v6 demonstrates why this should have been put off towards the end. (more 
> below)
>
> > > In any case, I'll take a look at the latest patch next month.
>
> Since the CF entry said "Needs Review", I began looking at v5 again
> this week. Hopefully not too much has changed, but in the future I
> strongly recommend setting to "Waiting on Author" if a new version is
> forthcoming. I realize many here share updated patches at any time,
> but I'd like to discourage the practice especially for large patches.

Understood. Sorry for the inconvenience.

>
> > I've updated the radix tree patch. It's now separated into two patches.
> >
> > 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
> > better names) that are similar to the pg_lfind8() family but they
> > return the index of the key in the vector instead of true/false. The
> > patch includes regression tests.
>
> I don't want to do a full review of this just yet, but I'll just point
> out some problems from a quick glance.
>
> +/*
> + * Return the index of the first element in the vector that is greater than
> + * or eual to the given scalar. Return sizeof(Vector8) if there is no such
> + * element.
>
> That's a bizarre API to indicate non-existence.
>
> + *
> + * Note that this function assumes the elements in the vector are sorted.
> + */
>
> That is *completely* unacceptable for a general-purpose function.
>
> +#else /* USE_NO_SIMD */
> + Vector8 r = 0;
> + uint8 *rp = (uint8 *) 
> +
> + for (Size i = 0; i < sizeof(Vector8); i++)
> + rp[i] = (((const uint8 *) )[i] == ((const uint8 *) )[i]) ? 0xFF : 0;
>
> I don't think we should try to force the non-simd case to adopt the
> special semantics of vector comparisons. It's much easier to just use
> the same logic as the assert builds.
>
> +#ifdef USE_SSE2
> + return (uint32) _mm_movemask_epi8(v);
> +#elif defined(USE_NEON)
> + static const uint8 mask[16] = {
> +1 << 0, 1 << 1, 1 << 2, 1 << 3,
> +1 << 4, 1 << 5, 1 << 6, 1 << 7,
> +1 << 0, 1 << 1, 1 << 2, 1 << 3,
> +1 << 4, 1 << 5, 1 << 6, 1 << 7,
> +  };
> +
> +uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t)
> vshrq_n_s8(v, 7));
> +uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
> +
> +return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
>
> For Arm, we need to be careful here. This article goes into a lot of
> detail for this situation:
>
> https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon
>
> Here again, I'd rather put this off and focus on getting the "large
> details" in good enough shape so we can got towards integrating with
> vacuum.

Thank you for the comments! The comments above are being addressed by
Nathan in a newly spun-off thread. I'll work on the patch.

I'll consider how to integrate this with vacuum as the next step. One
concern for me is how to limit the memory usage to
maintenance_work_mem. Unlike with a flat array, the memory needed to
add one TID varies depending on the situation. If we want to strictly
disallow using more memory than maintenance_work_mem, we probably
need to estimate the memory consumption in a conservative way.
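
As a very rough sketch of what I mean (the names and the exact bound
are made up, not a concrete design):

/* XXX: rt_memory_usage() and the worst-case constant are hypothetical */
if (rt_memory_usage(tree) + RT_WORST_CASE_GROWTH_PER_PAGE >
    (Size) maintenance_work_mem * 1024)
{
    /* stop collecting dead TIDs and do a round of index vacuuming first */
}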


Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Fix incorrect variable type for origin_id

2022-09-19 Thread Masahiko Sawada
Hi,

I realized that there are some places where we use XLogRecPtr for
replication origin ID variables. The attached patch fixes them to
use RepOriginId instead.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


fix_to_use_RepOriginId.patch
Description: Binary data


Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans

2022-09-16 Thread Masahiko Sawada
On Tue, Sep 13, 2022 at 6:02 AM Peter Geoghegan  wrote:
>
> My ongoing project to make VACUUM more predictable over time by
> proactive freezing [1] will increase the overall number of tuples
> frozen by VACUUM significantly (at least in larger tables). It's
> important that we avoid any new user-visible impact from extra
> freezing, though. I recently spent a lot of time on adding high-level
> techniques that aim to avoid extra freezing (e.g. by being lazy about
> freezing) when that makes sense. Low level techniques aimed at making
> the mechanical process of freezing cheaper might also help. (In any
> case it's well worth optimizing.)
>
> I'd like to talk about one such technique on this thread. The attached
> WIP patch reduces the size of xl_heap_freeze_page records by applying
> a simple deduplication process. This can be treated as independent
> work (I think it can, at least).

+1

> The patch doesn't change anything
> about the conceptual model used by VACUUM's lazy_scan_prune function
> to build xl_heap_freeze_page records for a page, and yet still manages
> to make the WAL records for freeze records over 5x smaller in many
> important cases. They'll be ~4x-5x smaller with *most* workloads,
> even.

After a quick benchmark, I've confirmed that the amount of WAL
generated for freezing 1 million tuples was reduced to about one-fifth
(1.2GB vs. 250MB). Great.

>
> Each individual tuple entry (each xl_heap_freeze_tuple) adds a full 12
> bytes to the WAL record right now -- no matter what. So the existing
> approach is rather space inefficient by any standard (perhaps because
> it was developed under time pressure while fixing bugs in Postgres
> 9.3). More importantly, there is a lot of natural redundancy among
> each xl_heap_freeze_tuple entry -- each tuple's xl_heap_freeze_tuple
> details tend to match. We can usually get away with storing each
> unique combination of values from xl_heap_freeze_tuple once per
> xl_heap_freeze_page record, while storing associated page offset
> numbers in a separate area, grouped by their canonical freeze plan
> (which is a normalized version of the information currently stored in
> xl_heap_freeze_tuple).

True. I've not looked at the patch in depth yet but I think we need
regression tests for this.
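
If I understand the scheme correctly, each xl_heap_freeze_page record
would then carry one entry per distinct freeze plan, with the page
offsets grouped by plan stored separately -- roughly along these lines
(just my reading of the description above, not the patch's actual
layout):

typedef struct xl_heap_freeze_plan_sketch
{
    TransactionId   xmax;
    uint16          t_infomask2;
    uint16          t_infomask;
    uint8           frzflags;
    uint16          ntuples;    /* number of page offsets using this plan */
} xl_heap_freeze_plan_sketch;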

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-16 Thread Masahiko Sawada
On Mon, Aug 15, 2022 at 10:39 PM John Naylor
 wrote:
>
> On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada  
> wrote:
> >
> > On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Jul 19, 2022 at 1:30 PM John Naylor
> > >  wrote:
> > > >
> > > >
> > > >
> > > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada  
> > > > wrote:
> > > >
> > > > > I’d like to keep the first version simple. We can improve it and add
> > > > > more optimizations later. Using radix tree for vacuum TID storage
> > > > > would still be a big win comparing to using a flat array, even without
> > > > > all these optimizations. In terms of single-value leaves method, I'm
> > > > > also concerned about an extra pointer traversal and extra memory
> > > > > allocation. It's most flexible but multi-value leaves method is also
> > > > > flexible enough for many use cases. Using the single-value method
> > > > > seems to be too much as the first step for me.
> > > > >
> > > > > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > > > > choice for me as the first step . It can cover wider use cases
> > > > > including vacuum TID use cases. And possibly it can cover use cases by
> > > > > combining a hash table or using tree of tree, for example.
> > > >
> > > > These two aspects would also bring it closer to Andres' prototype, 
> > > > which 1) makes review easier and 2) easier to preserve optimization 
> > > > work already done, so +1 from me.
> > >
> > > Thanks.
> > >
> > > I've updated the patch. It now implements 64-bit keys, 64-bit values,
> > > and the multi-value leaves method. I've tried to remove duplicated
> > > codes but we might find a better way to do that.
> > >
> >
> > With the recent changes related to simd, I'm going to split the patch
> > into at least two parts: introduce other simd optimized functions used
> > by the radix tree and the radix tree implementation. Particularly we
> > need two functions for radix tree: a function like pg_lfind32 but for
> > 8 bits integers and return the index, and a function that returns the
> > index of the first element that is >= key.
>
> I recommend looking at
>
> https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com
>
> since I did the work just now for searching bytes and returning a
> bool, buth = and <=. Should be pretty close. Also, i believe if you
> left this for last as a possible refactoring, it might save some work.
> In any case, I'll take a look at the latest patch next month.

I've updated the radix tree patch. It's now separated into two patches.

0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
better names) that are similar to the pg_lfind8() family but they
return the index of the key in the vector instead of true/false. The
patch includes regression tests.

The 0002 patch is the main radix tree implementation. I've removed
some duplicated node-manipulation code. For instance, since node-4,
node-16, and node-32 have a similar structure with different fanouts,
I introduced a common function for them.

In addition to these two patches, I've attached a third patch. It's
not part of the radix tree implementation but introduces a contrib
module, bench_radix_tree, a tool for benchmarking radix tree
performance. It measures the loading and lookup performance of both
the radix tree and a flat array.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 5d0115b068ecb01d791eab5f8a78a6d25b9cf45c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Wed, 14 Sep 2022 12:38:01 +
Subject: [PATCH v6 1/3] Support pg_lsearch8_eq and pg_lsearch8_ge

---
 src/include/port/pg_lfind.h   |  71 
 src/include/port/simd.h   | 155 +-
 .../test_lfind/expected/test_lfind.out|  12 ++
 .../modules/test_lfind/sql/test_lfind.sql |   2 +
 .../modules/test_lfind/test_lfind--1.0.sql|   8 +
 src/test/modules/test_lfind/test_lfind.c  | 139 
 6 files changed, 378 insertions(+), 9 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 0625cac6b5..583f204763 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,77 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
 	return false;
 }
 
+/*
+ * pg_lsearch8
+ *
+ * Return the index of the element in 'base'

Re: Improve description of XLOG_RUNNING_XACTS

2022-09-15 Thread Masahiko Sawada
On Wed, Sep 14, 2022 at 6:33 PM Amit Kapila  wrote:
>
> On Fri, Sep 9, 2022 at 6:18 AM Masahiko Sawada  wrote:
> >
> > Updated the patch accordingly.
> >
>
> I have created two xacts each with savepoints and after your patch,
> the record will show xacts/subxacts information as below:
>
> rmgr: Standby len (rec/tot): 74/74, tx:  0, lsn:
> 0/014AC238, prev 0/014AC1F8, desc: RUNNING_XACTS nextXid 733
> latestCompletedXid 726 oldestRunningXid 727; 2 xacts: 729 727; 4
> subxacts: 730 731 728 732
>
> There is no way to associate which subxacts belong to which xact, so
> will it be useful, and if so, how? I guess we probably don't need it
> here because the describe records just display the record information.

I think it's useful for debugging purposes. For instance, when I was
working on the fix 68dcce247f1a13318613a0e27782b2ca21a4ceb7
(REL_14_STABLE), I checked if all initial running transactions
including subtransactions are properly stored and purged by checking
the debug logs and pg_waldump output. Actually, until I realized that
the description of RUNNING_XACTS doesn't show subtransaction
information, I was confused by the fact that the saved initial running
transactions didn't match the description shown by pg_waldump.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce wait_for_subscription_sync for TAP tests

2022-09-09 Thread Masahiko Sawada
On Sat, Sep 10, 2022 at 6:45 AM Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > On Fri, Sep 9, 2022 at 11:31 PM Tom Lane  wrote:
> >> Recently a number of buildfarm animals have failed at the same
> >> place in src/test/subscription/t/100_bugs.pl [1][2][3][4]:
> >>
> >> #   Failed test '2x3000 rows in t'
> >> #   at t/100_bugs.pl line 149.
> >> #  got: '9000'
> >> # expected: '6000'
> >> # Looks like you failed 1 test of 7.
> >> [09:30:56] t/100_bugs.pl ..
> >>
> >> This was the last commit to touch that test script.  I'm thinking
> >> maybe it wasn't adjusted quite correctly?  On the other hand, since
> >> I can't find any similar failures before the last 48 hours, maybe
> >> there is some other more-recent commit to blame.  Anyway, something
> >> is wrong there.
>
> > It seems that this commit is innocent as it changed only how to wait.
>
> Yeah.  I was wondering if it caused us to fail to wait somewhere,
> but I concur that's not all that likely.
>
> > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef
> > is relevant.
>
> Noting that the errors have only appeared in the past couple of
> days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f
> (Fix recovery_prefetch with low maintenance_io_concurrency).

I have probably found the cause of this failure[1]. Commit
f6c5edb8abcac04eb3eac6da356e59d399b2bcef didn't fix the problem
properly.

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoAw0Oofi4kiDpJBOwpYyBBBkJj%3DsLUOn4Gd2GjUAKG-fw%40mail.gmail.com

-- 
Masahiko Sawada




Re: Introduce wait_for_subscription_sync for TAP tests

2022-09-09 Thread Masahiko Sawada
On Fri, Sep 9, 2022 at 11:31 PM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > Pushed.
>
> Recently a number of buildfarm animals have failed at the same
> place in src/test/subscription/t/100_bugs.pl [1][2][3][4]:
>
> #   Failed test '2x3000 rows in t'
> #   at t/100_bugs.pl line 149.
> #  got: '9000'
> # expected: '6000'
> # Looks like you failed 1 test of 7.
> [09:30:56] t/100_bugs.pl ..
>
> This was the last commit to touch that test script.  I'm thinking
> maybe it wasn't adjusted quite correctly?  On the other hand, since
> I can't find any similar failures before the last 48 hours, maybe
> there is some other more-recent commit to blame.  Anyway, something
> is wrong there.

It seems that this commit is innocent as it changed only how to wait.
Rather, looking at the logs, the tablesync worker errored out at an
interesting point:

2022-09-09 09:30:19.630 EDT [631b3feb.840:13]
pg_16400_sync_16392_7141371862484106124 ERROR:  could not find record
while sending logically-decoded data: missing contrecord at 0/1D4FFF8
2022-09-09 09:30:19.630 EDT [631b3feb.840:14]
pg_16400_sync_16392_7141371862484106124 STATEMENT:  START_REPLICATION
SLOT "pg_16400_sync_16392_7141371862484106124" LOGICAL 0/0
(proto_version '3', origin 'any', publication_names '"testpub"')
ERROR:  could not find record while sending logically-decoded data:
missing contrecord at 0/1D4FFF8
2022-09-09 09:30:19.631 EDT [631b3feb.26e8:2] ERROR:  error while
shutting down streaming COPY: ERROR:  could not find record while
sending logically-decoded data: missing contrecord at 0/1D4FFF8

It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef
is relevant.

Regards,

-- 
Masahiko Sawada




Re: Improve description of XLOG_RUNNING_XACTS

2022-09-08 Thread Masahiko Sawada
On Tue, Aug 23, 2022 at 11:53 AM Kyotaro Horiguchi
 wrote:
>
> At Mon, 15 Aug 2022 11:16:56 +0900, Masahiko Sawada  
> wrote in
> > Sorry for the late reply.
>
> No worries. Anyway I was in a long (as a Japanese:) vacation.
>
> > On Thu, Jul 28, 2022 at 4:29 PM Kyotaro Horiguchi
> >  wrote:
> > > record is sound". Is it any trouble with assuming the both *can*
> > > happen at once?  If something's broken, it will be reflected in the
> > > output.
> >
> > Fair point. We may not need to interpret the contents.
>
> Yeah.
>
> > On Thu, Jul 28, 2022 at 3:24 PM Kyotaro Horiguchi
> >  wrote:
> > >
> > > Another point is if the xid/subxid lists get long, I see it annoying
> > > that the "overflowed" messages goes far away to the end of the long
> > > line. Couldn't we rearrange the item order of the line as the follows?
> > >
> > > nextXid %u latestCompletedXid %u oldestRunningXid %u;[ subxid 
> > > overflowed;][ %d xacts: %u %u ...;][ subxacts: %u %u ..]
> > >
> >
> > I'm concerned that the two pieces of subxact information would be
> > placed apart. Given that showing both individual subxacts and
> > "overflow" is a bug, I think
>
> Bug or every kind of breakage of the file. So if "overflow"ed, we
> don't need to show a subxid there but I thought that there's no need
> to change behavior in that case since it scarcely happens.
>
> > we can output like:
> >
> > if (xlrec->subxcnt > 0)
> > {
> > appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt);
> > for (i = 0; i < xlrec->subxcnt; i++)
> > appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]);
> > }
> >
> > if (xlrec->subxid_overflow)
> > appendStringInfoString(buf, "; subxid overflowed");
>
> Yea, it seems like what I proposed upthread. I'm fine with that since
> it is an abnormal situation.
>
> > Or we can output the "subxid overwlowed" first.
>
> (I prefer this, as that doesn't change the output in the normal case
> but the anormality will be easilly seen if happens.)
>

Updated the patch accordingly.

Regards,

-- 
Masahiko Sawada


v3-0001-Improve-description-of-XLOG_RUNNING_XACTS.patch
Description: Binary data


Re: pgsql: Add ALTER SUBSCRIPTION ... SKIP.

2022-09-06 Thread Masahiko Sawada
On Sun, Sep 4, 2022 at 8:39 PM Amit Kapila  wrote:
>
> On Sun, Sep 4, 2022 at 1:48 PM Alvaro Herrera  wrote:
> >
> > On 2022-Mar-22, Amit Kapila wrote:
> >
> > > Add ALTER SUBSCRIPTION ... SKIP.
> >
> > There are two messages here that seem oddly worded.
> >
> > msgid "start skipping logical replication transaction finished at %X/%X"
> > msgid "done skipping logical replication transaction finished at %X/%X"
> >
> > Two complaints here.  First, the phrases "start / finished" and "done /
> > finished" look very strange.  It took me a while to realize that
> > "finished" refers to the LSN, not to the skipping operation.  Do we ever
> > talk about a transaction "finished at XYZ" as opposed to a transaction
> > whose LSN is XYZ?  (This became particularly strange when I realized
> > that the LSN might come from a PREPARE.)
> >
>
> The reason to add "finished at ..." was to be explicit about whether
> it is a starting LSN or an end LSN of a transaction. We do have such
> differentiation in ReorderBufferTXN (first_lsn ... end_lsn).
>
> > Second, "logical replication transaction".  Is it not a regular
> > transaction that we happen to be processing via logical replication?
> >
> > I think they should say something like
> >
> > "logical replication starts skipping transaction with LSN %X/%X"
> > "logical replication completed skipping transaction with LSN %X/%X"
> >
>
> This looks better to me.

+1

> If you find the above argument to
> differentiate between the start and end LSN convincing then we can
> think of replacing "with" in the above messages with "finished at". I
> see your point related to using "finished at" for PREPARE may not be a
> good idea but I don't have better ideas for the same.

Given that the user normally doesn't need to be aware of the
difference between start LSN and end LSN in the context of using this
feature, I think we can use "with LSN %X/%X".

Regards,

-- 
Masahiko Sawada




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-08-27 Thread Masahiko Sawada
On Sat, Aug 27, 2022 at 7:24 PM Amit Kapila  wrote:
>
> On Sat, Aug 27, 2022 at 1:06 PM Masahiko Sawada  wrote:
> >
> > On Sat, Aug 27, 2022 at 3:56 PM Amit Kapila  wrote:
> > >
> > > On Fri, Jul 29, 2022 at 12:15 PM Amit Kapila  
> > > wrote:
> > > >
> > > > >
> > > > > Yeah, your description makes sense to me. I've also considered how to
> > > > > hit this path but I guess it is never hit. Thinking of it in another
> > > > > way, first of all, at least 2 catalog modifying transactions have to
> > > > > be running while writing a xl_running_xacts. The serialized snapshot
> > > > > that is written when we decode the first xl_running_xact has two
> > > > > transactions. Then, one of them is committed before the second
> > > > > xl_running_xacts. The second serialized snapshot has only one
> > > > > transaction. Then, the transaction is also committed after that. Now,
> > > > > in order to execute the path, we need to start decoding from the first
> > > > > serialized snapshot. However, if we start from there, we cannot decode
> > > > > the full contents of the transaction that was committed later.
> > > > >
> > > >
> > > > I think then we should change this code in the master branch patch
> > > > with an additional comment on the lines of: "Either all the xacts got
> > > > purged or none. It is only possible to partially remove the xids from
> > > > this array if one or more of the xids are still running but not all.
> > > > That can happen if we start decoding from a point (LSN where the
> > > > snapshot state became consistent) where all the xacts in this were
> > > > running and then at least one of those got committed and a few are
> > > > still running. We will never start from such a point because we won't
> > > > move the slot's restart_lsn past the point where the oldest running
> > > > transaction's restart_decoding_lsn is."
> > > >
> > >
> > > Unfortunately, this theory doesn't turn out to be true. While
> > > investigating the latest buildfarm failure [1], I see that it is
> > > possible that only part of the xacts in the restored catalog modifying
> > > xacts list needs to be purged. See the attached where I have
> > > demonstrated it via a reproducible test. It seems the point we were
> > > missing was that to start from a point where two or more catalog
> > > modifying were serialized, it requires another open transaction before
> > > both get committed, and then we need the checkpoint or other way to
> > > force running_xacts record in-between the commit of initial two
> > > catalog modifying xacts. There could possibly be other ways as well
> > > but the theory above wasn't correct.
> > >
> >
> > Thank you for the analysis and the patch. I have the same conclusion.
> > Since we took this approach only on the master the back branches are
> > not affected.
> >
> > The new test scenario makes sense to me and looks better than the one
> > I have. Regarding the fix, I think we should use
> > TransactionIdFollowsOrEquals() instead of
> > NormalTransactionIdPrecedes():
> >
> >  +   for (off = 0; off < builder->catchange.xcnt; off++)
> >  +   {
> >  +   if (NormalTransactionIdPrecedes(builder->catchange.xip[off],
> >  +   builder->xmin))
> >  +   break;
> >  +   }
> >
>
> Right, fixed.

Thank you for updating the patch! It looks good to me.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-08-27 Thread Masahiko Sawada
On Sat, Aug 27, 2022 at 3:56 PM Amit Kapila  wrote:
>
> On Fri, Jul 29, 2022 at 12:15 PM Amit Kapila  wrote:
> >
> > >
> > > Yeah, your description makes sense to me. I've also considered how to
> > > hit this path but I guess it is never hit. Thinking of it in another
> > > way, first of all, at least 2 catalog modifying transactions have to
> > > be running while writing a xl_running_xacts. The serialized snapshot
> > > that is written when we decode the first xl_running_xact has two
> > > transactions. Then, one of them is committed before the second
> > > xl_running_xacts. The second serialized snapshot has only one
> > > transaction. Then, the transaction is also committed after that. Now,
> > > in order to execute the path, we need to start decoding from the first
> > > serialized snapshot. However, if we start from there, we cannot decode
> > > the full contents of the transaction that was committed later.
> > >
> >
> > I think then we should change this code in the master branch patch
> > with an additional comment on the lines of: "Either all the xacts got
> > purged or none. It is only possible to partially remove the xids from
> > this array if one or more of the xids are still running but not all.
> > That can happen if we start decoding from a point (LSN where the
> > snapshot state became consistent) where all the xacts in this were
> > running and then at least one of those got committed and a few are
> > still running. We will never start from such a point because we won't
> > move the slot's restart_lsn past the point where the oldest running
> > transaction's restart_decoding_lsn is."
> >
>
> Unfortunately, this theory doesn't turn out to be true. While
> investigating the latest buildfarm failure [1], I see that it is
> possible that only part of the xacts in the restored catalog modifying
> xacts list needs to be purged. See the attached where I have
> demonstrated it via a reproducible test. It seems the point we were
> missing was that to start from a point where two or more catalog
> modifying were serialized, it requires another open transaction before
> both get committed, and then we need the checkpoint or other way to
> force running_xacts record in-between the commit of initial two
> catalog modifying xacts. There could possibly be other ways as well
> but the theory above wasn't correct.
>

Thank you for the analysis and the patch. I have the same conclusion.
Since we took this approach only on master, the back branches are
not affected.

The new test scenario makes sense to me and looks better than the one
I have. Regarding the fix, I think we should use
TransactionIdFollowsOrEquals() instead of
NormalTransactionIdPrecedes():

 +   for (off = 0; off < builder->catchange.xcnt; off++)
 +   {
 +   if (NormalTransactionIdPrecedes(builder->catchange.xip[off],
 +   builder->xmin))
 +   break;
 +   }
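
That is, something like:

for (off = 0; off < builder->catchange.xcnt; off++)
{
    if (TransactionIdFollowsOrEquals(builder->catchange.xip[off],
                                     builder->xmin))
        break;
}

so that we stop at the first xid that is still considered running and
purge only the entries before it.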

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: pg15b3: crash in paralell vacuum

2022-08-18 Thread Masahiko Sawada
On Thu, Aug 18, 2022 at 11:04 PM Justin Pryzby  wrote:
>
> On Thu, Aug 18, 2022 at 08:34:06AM -0500, Justin Pryzby wrote:
> > Unfortunately, it looks like the RPM packages are compiled with -O2, so 
> > this is
> > of limited use.  So I'll be back shortly with more...
>
> #3  0x006874f1 in parallel_vacuum_process_all_indexes (pvs=0x25bdce0, 
> num_index_scans=0, vacuum=vacuum@entry=false) at vacuumparallel.c:611
> 611 Assert(indstats->status == 
> PARALLEL_INDVAC_STATUS_INITIAL);
>
> (gdb) p *pvs
> $1 = {pcxt = 0x25bc1e0, indrels = 0x25bbf70, nindexes = 8, shared = 
> 0x7fc5184393a0, indstats = 0x7fc5184393e0, dead_items = 0x7fc5144393a0, 
> buffer_usage = 0x7fc514439280, wal_usage = 0x7fc514439240,
>   will_parallel_vacuum = 0x266d818, nindexes_parallel_bulkdel = 5, 
> nindexes_parallel_cleanup = 0, nindexes_parallel_condcleanup = 5, bstrategy = 
> 0x264f120, relnamespace = 0x0, relname = 0x0, indname = 0x0,
>   status = PARALLEL_INDVAC_STATUS_INITIAL}
>
> (gdb) p *indstats
> $2 = {status = 11, parallel_workers_can_process = false, istat_updated = 
> false, istat = {num_pages = 0, estimated_count = false, num_index_tuples = 0, 
> tuples_removed = 0, pages_newly_deleted = 0, pages_deleted = 1,
> pages_free = 0}}

The status = 11 is an invalid value, probably because indstats was not
initialized to 0, as I mentioned.
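
The idea is to zero out the shared per-index stats array when it is
set up, something like the following (a sketch from memory, not
necessarily what the attached patch does; it assumes
PARALLEL_INDVAC_STATUS_INITIAL is the zero value of the enum and that
the variable names match vacuumparallel.c):

/* in parallel_vacuum_init(), after allocating the array in the DSM segment */
indstats = (PVIndStats *) shm_toc_allocate(pcxt->toc, est_indstats_len);
MemSet(indstats, 0, est_indstats_len);
pvs->indstats = indstats;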

Justin, if it's reproducible in your environment, could you please try
it again with the attached patch?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


initialize_indstats.patch
Description: Binary data


Re: pg15b3: crash in paralell vacuum

2022-08-18 Thread Masahiko Sawada
On Thu, Aug 18, 2022 at 11:06 PM Masahiko Sawada  wrote:
>
> Hi,
>
> On Thu, Aug 18, 2022 at 10:34 PM Justin Pryzby  wrote:
> >
> > Immediately after upgrading an internal instance, a loop around "vacuum" did
> > this:
>
> Thank you for the report!
>
> >
> > TRAP: FailedAssertion("indstats->status == PARALLEL_INDVAC_STATUS_INITIAL", 
> > File: "vacuumparallel.c", Line: 611, PID: 27635)
> > postgres: postgres pryzbyj [local] 
> > VACUUM(ExceptionalCondition+0x8d)[0x99d9fd]
> > postgres: postgres pryzbyj [local] VACUUM[0x6915db]
> > postgres: postgres pryzbyj [local] VACUUM(heap_vacuum_rel+0x12b6)[0x5083e6]
> > postgres: postgres pryzbyj [local] VACUUM[0x68e97a]
> > postgres: postgres pryzbyj [local] VACUUM(vacuum+0x48e)[0x68fe9e]
> > postgres: postgres pryzbyj [local] VACUUM(ExecVacuum+0x2ae)[0x69065e]
> > postgres: postgres pryzbyj [local] 
> > VACUUM(standard_ProcessUtility+0x530)[0x8567b0]
> > /usr/pgsql-15/lib/pg_stat_statements.so(+0x5450)[0x7f52b891c450]
> > postgres: postgres pryzbyj [local] VACUUM[0x85490a]
> > postgres: postgres pryzbyj [local] VACUUM[0x854a53]
> > postgres: postgres pryzbyj [local] VACUUM(PortalRun+0x179)[0x855029]
> > postgres: postgres pryzbyj [local] VACUUM[0x85099b]
> > postgres: postgres pryzbyj [local] VACUUM(PostgresMain+0x199a)[0x85268a]
> > postgres: postgres pryzbyj [local] VACUUM[0x496a21]
> > postgres: postgres pryzbyj [local] VACUUM(PostmasterMain+0x11c0)[0x7b3980]
> > postgres: postgres pryzbyj [local] VACUUM(main+0x1c6)[0x4986a6]
> > /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f52c4b893d5]
> > postgres: postgres pryzbyj [local] VACUUM[0x498c59]
> > < 2022-08-18 07:56:51.963 CDT  >LOG:  server process (PID 27635) was 
> > terminated by signal 6: Aborted
> > < 2022-08-18 07:56:51.963 CDT  >DETAIL:  Failed process was running: VACUUM 
> > ANALYZE alarms
> >
> > Unfortunately, it looks like the RPM packages are compiled with -O2, so 
> > this is
> > of limited use.
> >
> > Core was generated by `postgres: postgres pryzbyj [local] VACUUM
> >   '.
> > Program terminated with signal 6, Aborted.
> > #0  0x7f52c4b9d207 in raise () from /lib64/libc.so.6
> > Missing separate debuginfos, use: debuginfo-install 
> > audit-libs-2.8.4-4.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 
> > cyrus-sasl-lib-2.1.26-23.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 
> > elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-260.el7_6.3.x86_64 
> > keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-51.el7_9.x86_64 
> > libattr-2.4.46-13.el7.x86_64 libcap-2.22-9.el7.x86_64 
> > libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 
> > libgcc-4.8.5-39.el7.x86_64 libgcrypt-1.5.3-14.el7.x86_64 
> > libgpg-error-1.12-3.el7.x86_64 libicu-50.1.2-17.el7.x86_64 
> > libselinux-2.5-15.el7.x86_64 libstdc++-4.8.5-39.el7.x86_64 
> > libxml2-2.9.1-6.el7_9.6.x86_64 libzstd-1.5.2-1.el7.x86_64 
> > lz4-1.7.5-2.el7.x86_64 nspr-4.19.0-1.el7_5.x86_64 
> > nss-3.36.0-7.1.el7_6.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64 
> > nss-util-3.36.0-1.1.el7_6.x86_64 openldap-2.4.44-21.el7_6.x86_64 
> > openssl-libs-1.0.2k-22.el7_9.x86_64 pam-1.1.8-22.el7.x86_64 
> > pcre-8.32-17.el7.x86_64 systemd-libs-219-62.el7_6.5.x86_64 
> > xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
> > (gdb) bt
> > #0  0x7f52c4b9d207 in raise () from /lib64/libc.so.6
> > #1  0x7f52c4b9e8f8 in abort () from /lib64/libc.so.6
> > #2  0x0099da1e in ExceptionalCondition 
> > (conditionName=conditionName@entry=0xafae40 "indstats->status == 
> > PARALLEL_INDVAC_STATUS_INITIAL", errorType=errorType@entry=0x9fb4b7 
> > "FailedAssertion",
> > fileName=fileName@entry=0xafb0c0 "vacuumparallel.c", 
> > lineNumber=lineNumber@entry=611) at assert.c:69
> > #3  0x006915db in parallel_vacuum_process_all_indexes 
> > (pvs=0x2e85f80, num_index_scans=, vacuum=) at 
> > vacuumparallel.c:611
> > #4  0x005083e6 in heap_vacuum_rel (rel=, 
> > params=, bstrategy=) at vacuumlazy.c:2679
>
> It seems that parallel_vacuum_cleanup_all_indexes() got called[1],
> which means this was the first time to perform parallel vacuum (i.e.,
> index cleanup).

Sorry, this explanation was wrong. But according to the latest
information from Justin, this was indeed the first time parallel vacuum
was performed:

#3  0x006874f1 in parallel_vacuum_process_all_indexes
(pvs=0x25bdce0, num_index_scans=0, vacuum=vacuum@entry=false) at
vacuumparallel.c:611
611 Assert(indstats->status ==
PARALLEL_INDVAC_STATUS_INITIAL);

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: pg15b3: crash in paralell vacuum

2022-08-18 Thread Masahiko Sawada
Hi,

On Thu, Aug 18, 2022 at 10:34 PM Justin Pryzby  wrote:
>
> Immediately after upgrading an internal instance, a loop around "vacuum" did
> this:

Thank you for the report!

>
> TRAP: FailedAssertion("indstats->status == PARALLEL_INDVAC_STATUS_INITIAL", 
> File: "vacuumparallel.c", Line: 611, PID: 27635)
> postgres: postgres pryzbyj [local] VACUUM(ExceptionalCondition+0x8d)[0x99d9fd]
> postgres: postgres pryzbyj [local] VACUUM[0x6915db]
> postgres: postgres pryzbyj [local] VACUUM(heap_vacuum_rel+0x12b6)[0x5083e6]
> postgres: postgres pryzbyj [local] VACUUM[0x68e97a]
> postgres: postgres pryzbyj [local] VACUUM(vacuum+0x48e)[0x68fe9e]
> postgres: postgres pryzbyj [local] VACUUM(ExecVacuum+0x2ae)[0x69065e]
> postgres: postgres pryzbyj [local] 
> VACUUM(standard_ProcessUtility+0x530)[0x8567b0]
> /usr/pgsql-15/lib/pg_stat_statements.so(+0x5450)[0x7f52b891c450]
> postgres: postgres pryzbyj [local] VACUUM[0x85490a]
> postgres: postgres pryzbyj [local] VACUUM[0x854a53]
> postgres: postgres pryzbyj [local] VACUUM(PortalRun+0x179)[0x855029]
> postgres: postgres pryzbyj [local] VACUUM[0x85099b]
> postgres: postgres pryzbyj [local] VACUUM(PostgresMain+0x199a)[0x85268a]
> postgres: postgres pryzbyj [local] VACUUM[0x496a21]
> postgres: postgres pryzbyj [local] VACUUM(PostmasterMain+0x11c0)[0x7b3980]
> postgres: postgres pryzbyj [local] VACUUM(main+0x1c6)[0x4986a6]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f52c4b893d5]
> postgres: postgres pryzbyj [local] VACUUM[0x498c59]
> < 2022-08-18 07:56:51.963 CDT  >LOG:  server process (PID 27635) was 
> terminated by signal 6: Aborted
> < 2022-08-18 07:56:51.963 CDT  >DETAIL:  Failed process was running: VACUUM 
> ANALYZE alarms
>
> Unfortunately, it looks like the RPM packages are compiled with -O2, so this 
> is
> of limited use.
>
> Core was generated by `postgres: postgres pryzbyj [local] VACUUM  
> '.
> Program terminated with signal 6, Aborted.
> #0  0x7f52c4b9d207 in raise () from /lib64/libc.so.6
> Missing separate debuginfos, use: debuginfo-install 
> audit-libs-2.8.4-4.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 
> cyrus-sasl-lib-2.1.26-23.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 
> elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-260.el7_6.3.x86_64 
> keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-51.el7_9.x86_64 
> libattr-2.4.46-13.el7.x86_64 libcap-2.22-9.el7.x86_64 
> libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 
> libgcc-4.8.5-39.el7.x86_64 libgcrypt-1.5.3-14.el7.x86_64 
> libgpg-error-1.12-3.el7.x86_64 libicu-50.1.2-17.el7.x86_64 
> libselinux-2.5-15.el7.x86_64 libstdc++-4.8.5-39.el7.x86_64 
> libxml2-2.9.1-6.el7_9.6.x86_64 libzstd-1.5.2-1.el7.x86_64 
> lz4-1.7.5-2.el7.x86_64 nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-7.1.el7_6.x86_64 
> nss-softokn-freebl-3.36.0-5.el7_5.x86_64 nss-util-3.36.0-1.1.el7_6.x86_64 
> openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-22.el7_9.x86_64 
> pam-1.1.8-22.el7.x86_64 pcre-8.32-17.el7.x86_64 
> systemd-libs-219-62.el7_6.5.x86_64 xz-libs-5.2.2-1.el7.x86_64 
> zlib-1.2.7-18.el7.x86_64
> (gdb) bt
> #0  0x7f52c4b9d207 in raise () from /lib64/libc.so.6
> #1  0x7f52c4b9e8f8 in abort () from /lib64/libc.so.6
> #2  0x0099da1e in ExceptionalCondition 
> (conditionName=conditionName@entry=0xafae40 "indstats->status == 
> PARALLEL_INDVAC_STATUS_INITIAL", errorType=errorType@entry=0x9fb4b7 
> "FailedAssertion",
> fileName=fileName@entry=0xafb0c0 "vacuumparallel.c", 
> lineNumber=lineNumber@entry=611) at assert.c:69
> #3  0x006915db in parallel_vacuum_process_all_indexes (pvs=0x2e85f80, 
> num_index_scans=, vacuum=) at 
> vacuumparallel.c:611
> #4  0x005083e6 in heap_vacuum_rel (rel=, 
> params=, bstrategy=) at vacuumlazy.c:2679

It seems that parallel_vacuum_cleanup_all_indexes() got called[1],
which means this was the first time parallel vacuum (i.e., index
cleanup) was performed.

I'm not convinced yet, but the culprit could be that we missed zeroing
(memset to 0) the shared array of PVIndStats in
parallel_vacuum_init(). This shared array was introduced in PG15.
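If that's the cause, the fix would be to zero the array right after it
is allocated in parallel_vacuum_init(), roughly like this (just a
sketch; the variable names indstats/nindexes are assumed to match
vacuumparallel.c):

    /* hypothetical placement: right after allocating the shared array */
    memset(indstats, 0, sizeof(PVIndStats) * nindexes);

so that, assuming PARALLEL_INDVAC_STATUS_INITIAL is the zero-valued
member of the status enum, every entry starts out in the expected
state.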

[1] 
https://github.com/postgres/postgres/blob/REL_15_STABLE/src/backend/access/heap/vacuumlazy.c#L2679

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Logical WAL sender unresponsive during decoding commit

2022-08-16 Thread Masahiko Sawada
On Tue, Aug 16, 2022 at 2:32 PM Masahiko Sawada  wrote:
>
> On Tue, Aug 16, 2022 at 2:31 PM Amit Kapila  wrote:
> >
> > On Tue, Aug 16, 2022 at 10:56 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Aug 16, 2022 at 2:08 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin  
> > > > wrote:
> > > > >
> > > > > Hi hackers!
> > > > >
> > > > > Some time ago I've seen a hanging logical replication that was trying 
> > > > > to send transaction commit after doing table pg_repack.
> > > > > I understand that those things do not mix well. Yet walsender was 
> > > > > ignoring pg_terminate_backend() and I think this worth fixing.
> > > > > Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace?
> > > > >
> > > >
> > > > I think if we want to do this in this code path then it may be it is
> > > > better to add it in ReorderBufferProcessTXN where we are looping to
> > > > process each change.
> > >
> > > +1
> > >
> > > The same issue is recently reported[1] on -bugs and I proposed the
> > > patch that adds CHECK_FOR_INTERRUPTS() to the loop in
> > > ReorderBufferProcessTXN(). I think it should be backpatched.
> > >
> >
> > I agree that it is better to backpatch this as well. Would you like to
> > verify if your patch works for all branches or if it need some tweaks?
> >
>
> Yes, I've confirmed v10 and master but will do that for other branches
> and send patches for all supported branches.
>

I've attached patches for all supported branches.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


REL11_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
Description: Binary data


REL15_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
Description: Binary data


REL14_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
Description: Binary data


REL13_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
Description: Binary data


REL12_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
Description: Binary data


REL10_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
Description: Binary data


master_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
Description: Binary data


Re: Logical WAL sender unresponsive during decoding commit

2022-08-15 Thread Masahiko Sawada
On Tue, Aug 16, 2022 at 2:31 PM Amit Kapila  wrote:
>
> On Tue, Aug 16, 2022 at 10:56 AM Masahiko Sawada  
> wrote:
> >
> > On Tue, Aug 16, 2022 at 2:08 PM Amit Kapila  wrote:
> > >
> > > On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin  
> > > wrote:
> > > >
> > > > Hi hackers!
> > > >
> > > > Some time ago I've seen a hanging logical replication that was trying 
> > > > to send transaction commit after doing table pg_repack.
> > > > I understand that those things do not mix well. Yet walsender was 
> > > > ignoring pg_terminate_backend() and I think this worth fixing.
> > > > Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace?
> > > >
> > >
> > > I think if we want to do this in this code path then it may be it is
> > > better to add it in ReorderBufferProcessTXN where we are looping to
> > > process each change.
> >
> > +1
> >
> > The same issue is recently reported[1] on -bugs and I proposed the
> > patch that adds CHECK_FOR_INTERRUPTS() to the loop in
> > ReorderBufferProcessTXN(). I think it should be backpatched.
> >
>
> I agree that it is better to backpatch this as well. Would you like to
> verify if your patch works for all branches or if it need some tweaks?
>

Yes, I've confirmed it on v10 and master; I'll do the same for the
other branches and send patches for all supported branches.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Logical WAL sender unresponsive during decoding commit

2022-08-15 Thread Masahiko Sawada
On Tue, Aug 16, 2022 at 2:08 PM Amit Kapila  wrote:
>
> On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin  wrote:
> >
> > Hi hackers!
> >
> > Some time ago I've seen a hanging logical replication that was trying to 
> > send transaction commit after doing table pg_repack.
> > I understand that those things do not mix well. Yet walsender was ignoring 
> > pg_terminate_backend() and I think this worth fixing.
> > Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace?
> >
>
> I think if we want to do this in this code path then it may be it is
> better to add it in ReorderBufferProcessTXN where we are looping to
> process each change.

+1

The same issue was recently reported[1] on -bugs, and I proposed a
patch there that adds CHECK_FOR_INTERRUPTS() to the loop in
ReorderBufferProcessTXN(). I think it should be backpatched.
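To be concrete, what I have in mind is just an interrupt check at the
top of the per-change loop, roughly like this (a sketch only, assuming
the usual shape of the loop in ReorderBufferProcessTXN(); variable
names are as in reorderbuffer.c):

    while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
    {
        /* let pg_terminate_backend() and shutdown requests get through */
        CHECK_FOR_INTERRUPTS();

        /* ... existing per-change processing (switch on change->action) ... */
    }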

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoD%2BaNfLje%2B9JOqWbTiq1GL4BOp9_f7FxLADm8rS8cDhCQ%40mail.gmail.com

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-08-14 Thread Masahiko Sawada
On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada  wrote:
>
> On Tue, Jul 19, 2022 at 1:30 PM John Naylor
>  wrote:
> >
> >
> >
> > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada  
> > wrote:
> >
> > > I’d like to keep the first version simple. We can improve it and add
> > > more optimizations later. Using radix tree for vacuum TID storage
> > > would still be a big win comparing to using a flat array, even without
> > > all these optimizations. In terms of single-value leaves method, I'm
> > > also concerned about an extra pointer traversal and extra memory
> > > allocation. It's most flexible but multi-value leaves method is also
> > > flexible enough for many use cases. Using the single-value method
> > > seems to be too much as the first step for me.
> > >
> > > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > > choice for me as the first step . It can cover wider use cases
> > > including vacuum TID use cases. And possibly it can cover use cases by
> > > combining a hash table or using tree of tree, for example.
> >
> > These two aspects would also bring it closer to Andres' prototype, which 1) 
> > makes review easier and 2) easier to preserve optimization work already 
> > done, so +1 from me.
>
> Thanks.
>
> I've updated the patch. It now implements 64-bit keys, 64-bit values,
> and the multi-value leaves method. I've tried to remove duplicated
> codes but we might find a better way to do that.
>

With the recent SIMD-related changes, I'm going to split the patch into
at least two parts: one introducing the additional SIMD-optimized
functions used by the radix tree, and one with the radix tree
implementation itself. In particular, we need two functions for the
radix tree: a function like pg_lfind32 but for 8-bit integers that
returns the matching index, and a function that returns the index of
the first element that is >= the key.
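Expressed in plain C (ignoring the SIMD fast paths, and with
hypothetical names -- the final names are still open), the two helpers
would behave like this:

    #include <stdint.h>

    /* Return the index of the first element equal to key, or nelem if
     * there is no match.  A SIMD version would scan a vector's worth of
     * bytes at a time, like pg_lfind32 does for 32-bit integers. */
    static inline uint32_t
    pg_lfind8_idx(uint8_t key, const uint8_t *base, uint32_t nelem)
    {
        for (uint32_t i = 0; i < nelem; i++)
        {
            if (base[i] == key)
                return i;
        }
        return nelem;
    }

    /* Return the index of the first element that is >= key, or nelem if
     * there is none.  For the radix tree use case the array is expected
     * to be sorted, so this is the insertion point for key. */
    static inline uint32_t
    pg_lsearch8_ge_idx(uint8_t key, const uint8_t *base, uint32_t nelem)
    {
        for (uint32_t i = 0; i < nelem; i++)
        {
            if (base[i] >= key)
                return i;
        }
        return nelem;
    }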

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Improve description of XLOG_RUNNING_XACTS

2022-08-14 Thread Masahiko Sawada
Sorry for the late reply.

On Thu, Jul 28, 2022 at 4:29 PM Kyotaro Horiguchi
 wrote:
>
> At Thu, 28 Jul 2022 15:53:33 +0900, Masahiko Sawada  
> wrote in
> > >
> > Do you mean that both could be true at the same time? If I read
> > GetRunningTransactionData() correctly, that doesn't happen.
>
> So, I wrote "since it is debugging output", and "fine if we asuume the
> record is sound". Is it any trouble with assuming the both *can*
> happen at once?  If something's broken, it will be reflected in the
> output.

Fair point. We may not need to interpret the contents.

On Thu, Jul 28, 2022 at 3:24 PM Kyotaro Horiguchi
 wrote:
>
> Another point is if the xid/subxid lists get long, I see it annoying
> that the "overflowed" messages goes far away to the end of the long
> line. Couldn't we rearrange the item order of the line as the follows?
>
> nextXid %u latestCompletedXid %u oldestRunningXid %u;[ subxid overflowed;][ 
> %d xacts: %u %u ...;][ subxacts: %u %u ..]
>

I'm concerned that this would keep the two pieces of subxact
information apart. Given that showing both the individual subxacts and
"overflowed" at the same time would indicate a bug, I think we can
output something like:

if (xlrec->subxcnt > 0)
{
appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt);
for (i = 0; i < xlrec->subxcnt; i++)
appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]);
}

if (xlrec->subxid_overflow)
appendStringInfoString(buf, "; subxid overflowed");

Or we can output the "subxid overflowed" part first.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: bogus assert in logicalmsg_desc

2022-08-14 Thread Masahiko Sawada
On Mon, Aug 15, 2022 at 1:17 AM Tomas Vondra
 wrote:
>
> Hi,
>
> while experimenting with logical messages, I ran into this assert in
> logicalmsg_desc:
>
> Assert(prefix[xlrec->prefix_size] != '\0');
>
> This seems to be incorrect, because LogLogicalMessage does this:
>
> xlrec.prefix_size = strlen(prefix) + 1;
>
> So prefix_size includes the null byte, so the assert points out at the
> first payload byte. And of course, the check should be "==" because we
> expect the byte to be \0, not the other way around.
>
> It's pretty simple to make this crash by writing a logical message where
> the first payload byte is \0, e.g. like this:
>
> select pg_logical_emit_message(true, 'a'::text, '\x00'::bytea);
>
> and then running pg_waldump on the WAL segment.
>
> Attached is a patch addressing this. This was added in 14, so we should
> backpatch to that version.

+1

The patch looks good to me.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-08-11 Thread Masahiko Sawada
On Thu, Aug 11, 2022 at 3:10 PM Amit Kapila  wrote:
>
> On Mon, Aug 8, 2022 at 9:34 AM Amit Kapila  wrote:
> >
> > On Wed, Aug 3, 2022 at 1:20 PM Masahiko Sawada  
> > wrote:
> > >
> > >
> > > Oops, thanks for pointing it out. I've fixed it and attached updated
> > > patches for all branches so as not to confuse the patch version. There
> > > is no update from v12 patch on REL12 - master patches.
> > >
> >
> > Thanks for the updated patches, the changes look good to me.
> > Horiguchi-San, and others, do you have any further comments on this or
> > do you want to spend time in review of it? If not, I would like to
> > push this after the current minor version release.
> >
>
> Pushed.

Thank you!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-09 Thread Masahiko Sawada
On Wed, Aug 10, 2022 at 5:00 AM Nathan Bossart  wrote:
>
> On Tue, Aug 09, 2022 at 01:21:41PM +0700, John Naylor wrote:
> > I decided I wasn't quite comfortable changing snapshot handling
> > without further guarantees.  To this end, 0002 in the attached v11 is
> > an addendum that adds assert checking (also pgindent and some
> > comment-smithing). As I suspected, make check-world passes even with
> > purposefully screwed-up coding. 0003 uses pg_lfind32 in syscache.c and
> > I verified that sticking in the wrong answer will lead to a crash in
> > assert-enabled builds in short order. I'd kind of like to throw this
> > (or something else suitable) at the build farm first for that reason.
> > It's simpler than the qsort/qunique/binary search that was there
> > before, so that's nice, but I've not tried to test performance.
>
> Your adjustments in 0002 seem reasonable to me.  I think it makes sense to
> ensure there is test coverage for pg_lfind32(), but I don't know if that
> syscache code is the right choice.  For non-USE_SSE2 builds, it might make
> these lookups more expensive.

I think that for non-USE_SSE2 builds, there is no additional overhead
as all assertion-related code in pg_lfind32 depends on USE_SSE2.
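To illustrate, the structure is roughly as follows (heavily simplified,
not the actual contents of pg_lfind.h):

    static inline bool
    pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
    {
        uint32  i = 0;

    #ifdef USE_SSE2
        /* SSE2 fast path; the assert-only cross-checking code lives in
         * here, so it is compiled out entirely on non-SSE2 builds */
    #endif

        /* scalar path, also used for any remaining tail elements */
        for (; i < nelem; i++)
        {
            if (key == base[i])
                return true;
        }
        return false;
    }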

> I'll look around to see if there are any
> other suitable candidates.

As you proposed, having a test module for that seems like a good idea.
We can add tests there for future optimizations that utilize SIMD
operations.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Fix unmatched file identifications

2022-08-08 Thread Masahiko Sawada
On Tue, Aug 9, 2022 at 11:24 AM John Naylor
 wrote:
>
> On Tue, Aug 9, 2022 at 8:58 AM Masahiko Sawada  wrote:
> > I found that there are two .c and .h files whose identification in the
> > header comment doesn't match its actual path.
>
> > The attached small patch fixes them.
>
> Pushed, thanks!

Thank you!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Masahiko Sawada
On Tue, Aug 9, 2022 at 7:33 AM Nathan Bossart  wrote:
>
> On Mon, Aug 08, 2022 at 01:46:48PM +0700, John Naylor wrote:
> > Okay, I think it's basically in good shape. Since it should be a bit
> > faster than a couple versions ago, would you be up for retesting with
> > the original test having 8 to 512 writers?
>
> Sure thing.  The results are similar.  As before, the improvements are most
> visible when the arrays are large.
>
> writers  head  patch
> 8672   680
> 16   639   664
> 32   701   689
> 64   705   703
> 128  628   653
> 256  576   627
> 512  530   584
> 768  450   536
> 1024 350   494
>
> > And also add the const
> > markers we discussed upthread?
>
> Oops, sorry about that.  This is done in v9.
>
> > Aside from that, I plan to commit this
> > week unless there is further bikeshedding.
>
> Great, thanks.

The patch looks good to me. One minor point is:

+ * IDENTIFICATION
+ *   src/port/pg_lfind.h

The path doesn't match the actual file path, src/include/port/pg_lfind.h.

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



