RE: Column Filtering in Logical Replication

2022-03-09 Thread houzj.f...@fujitsu.com
On Wednesday, March 9, 2022 6:04 PM Amit Kapila 
> On Mon, Mar 7, 2022 at 8:48 PM Tomas Vondra
>  wrote:
> >
> > On 3/4/22 11:42, Amit Kapila wrote:
> >
> > > *
> > > Fetching column filter info in tablesync.c is quite expensive. It
> > > seems to be using four round-trips to get the complete info whereas
> > > for row-filter we use just one round trip. I think we should try to
> > > get both row filter and column filter info in just one round trip.
> > >
> >
> > Maybe, but I really don't think this is an issue.
> >
> 
> I am not sure but it might matter for small tables. Leaving aside the
> performance issue, I think the current way will get the wrong column list in
> many cases: (a) The ALL TABLES IN SCHEMA case handling won't work for
> partitioned tables when the partitioned table is part of one schema and
> partition table is part of another schema. (b) The handling of partition
> tables in other cases will fetch incorrect lists as it tries to fetch the
> column list of all the partitions in the hierarchy.
> 
> One of my colleagues has even tested these cases both for column filters and
> row filters and we find the behavior of row filter is okay whereas for column
> filter it uses the wrong column list. We will share the tests and results
> with you in a later email. We are trying to unify the column filter queries
> with row filter to make their behavior the same and will share the findings
> once it is done. I hope if we are able to achieve this that we will reduce
> the chances of bugs in this area.
> 
> Note: I think the first two patches for tests are not required after commit
> ceb57afd3c.

Hi,

Here are some tests and results for the table sync query of the
column filter patch and the row filter.

1) multiple publications which publish the schema of the parent table and the partition.
pub
create schema s1;
create table s1.t (a int, b int, c int) partition by range (a);
create table t_1 partition of s1.t for values from (1) to (10);
create publication pub1 for all tables in schema s1;
create publication pub2 for table t_1(b);

sub
- prepare tables
CREATE SUBSCRIPTION sub CONNECTION 'port=1 dbname=postgres' PUBLICATION pub1, pub2;

When doing table sync for 't_1', the column list will be (b). I think there should
be no filter, because table t_1 is also published via the ALL TABLES IN SCHEMA
publication.

For the row filter, no filter is used in this case.
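
To make the difference visible, one can insert a row on the publisher before the
sync and inspect the subscriber's copy afterwards (a sketch based on the tables
above; the row values are illustrative):

-- publisher
insert into s1.t values (1, 2, 3);   -- routed to partition t_1

-- subscriber, after the table sync of t_1 finishes
select a, b, c from t_1;
-- expected: (1, 2, 3); with the wrong column list (b), columns a and c would be NULL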


2) one publication publishes both parent and child
pub
create table t (a int, b int, c int) partition by range (a);
create table t_1 partition of t for values from (1) to (10)
   partition by range (a);
create table t_2 partition of t_1 for values from (1) to (10);

create publication pub2 for table t_1(a), t_2
  with (PUBLISH_VIA_PARTITION_ROOT);

sub
- prepare tables
CREATE SUBSCRIPTION sub CONNECTION 'port=1 dbname=postgres' PUBLICATION pub2;

When doing table sync for table 't_1', there is no column list. I think the
expected column list is (a).

For the row filter, the row filter of the topmost parent table (t_1) is used in
this case.


3) one publication publishes both parent and child
pub
create table t (a int, b int, c int) partition by range (a);
create table t_1 partition of t for values from (1) to (10)
   partition by range (a);
create table t_2 partition of t_1 for values from (1) to (10);

create publication pub2 for table t_1(a), t_2(b)
  with (PUBLISH_VIA_PARTITION_ROOT);

sub
- prepare tables
CREATE SUBSCRIPTION sub CONNECTION 'port=1 dbname=postgres' PUBLICATION pub2;

When doing table sync for table 't_1', the column list would be (a, b). I think
the expected column list is (a).

For the row filter, the row filter of the topmost parent table (t_1) is used in
this case.

Best regards,
Hou zj


RE: Data is copied twice when specifying both child and parent table in publication

2022-03-09 Thread houzj.f...@fujitsu.com
Hi,

While reviewing some logical replication related features, I noticed another
possible problem when the subscriber subscribes to multiple publications which
publish the parent and child tables.

For example:

pub
create table t (a int, b int, c int) partition by range (a);
create table t_1 partition of t for values from (1) to (10);

create publication pub1 for table t
  with (PUBLISH_VIA_PARTITION_ROOT);
create publication pub2 for table t_1
  with (PUBLISH_VIA_PARTITION_ROOT);

sub
- prepare tables t and t_1
CREATE SUBSCRIPTION sub CONNECTION 'port=1 dbname=postgres' PUBLICATION pub1, pub2;

select * from pg_subscription_rel;
 srsubid |   srrelid    | srsubstate | srsublsn
---------+--------------+------------+-----------
   16391 |  16385 (t)   | r          | 0/150D100
   16391 |  16388 (t_1) | r          | 0/150D138

If we subscribe to two publications, one of which publishes the parent table with
(pubviaroot=true) and the other publishes the child table, then both the parent
table and the child table will exist in pg_subscription_rel, which also means we
will do the initial copy for both tables.

But after the initial copy, we only publish changes with the schema of the parent
table (t). It looks a bit inconsistent.

Based on the documentation of the PUBLISH_VIA_PARTITION_ROOT option, I think the
expected behavior could be that we only store the topmost parent (table t) in
pg_subscription_rel and do the initial copy for it if pubviaroot is on. I haven't
thought about how to fix this yet and will investigate it later.
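
A quick way to see the symptom (a sketch using the tables above; the row value is
illustrative):

-- publisher, before the subscription is created
insert into t values (1);

-- subscriber, after the initial copy
select count(*) from t;
-- the row arrives twice: once via the sync of t (pub1) and once via the sync of t_1 (pub2)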

Best regards,
Hou zj


RE: Column Filtering in Logical Replication

2022-03-14 Thread houzj.f...@fujitsu.com
On Monday, March 14, 2022 5:08 AM Tomas Vondra  
wrote:
> 
> On 3/12/22 05:30, Amit Kapila wrote:
> >> ...
> >
> > Okay, please find attached. I have done basic testing of this, if we
> > agree with this approach then this will require some more testing.
> >
> 
> Thanks, the proposed changes seem like a clear improvement, so I've
> added them, with some minor tweaks (mostly to comments).

Hi,

Thanks for updating the patches!
And sorry for the row filter bug caused by my mistake.

I looked at the two fixup patches. I am wondering whether it would be better if we
added a testcase for these two bugs? Maybe like the attachment.

(Attach the fixup patch to make the cfbot happy)

Best regards,
Hou zj




0003-fixup-row-filter-publications-20220313.patch
Description: 0003-fixup-row-filter-publications-20220313.patch


0001-fixup-publish_as_relid-20220313.patch
Description: 0001-fixup-publish_as_relid-20220313.patch


0002-testcase-for-publish-as-relid.patch
Description: 0002-testcase-for-publish-as-relid.patch


0004-testcase-for-row-filter-publication.patch
Description: 0004-testcase-for-row-filter-publication.patch


RE: logical replication empty transactions

2022-03-21 Thread houzj.f...@fujitsu.com
On Monday, March 21, 2022 6:01 PM Amit Kapila  wrote:
> 
> On Sat, Mar 19, 2022 at 9:10 AM Ajin Cherian  wrote:
> >
> > On Thu, Mar 17, 2022 at 10:43 PM Amit Kapila 
> wrote:
> >
> > > 3. Can we add a simple test for it in one of the existing test
> > > files(say in 001_rep_changes.pl)?
> >
> > added a simple test.
> >
> 
> This doesn't verify if the transaction is skipped. I think we should
> extend this test to check for a DEBUG message in the Logs (you need to
> probably set log_min_messages to DEBUG1 for this test). As an example,
> you can check the patch [1]. Also, it seems by mistake you have added
> wait_for_catchup() twice.

I added a testcase to check the DEBUG message.

> Few other comments:
> =
> 1. Let's keep the parameter name as skipped_empty_xact in
> OutputPluginUpdateProgress so as to not confuse with the other patch's
> [2] keep_alive parameter. I think in this case we must send the
> keep_alive message so as to not make the syncrep wait whereas in the
> other patch we only need to send it periodically based on
> wal_sender_timeout parameter.
> 2. The new function SyncRepEnabled() seems confusing to me as the
> comments in SyncRepWaitForLSN() clearly state why we need to first
> read the parameter 'sync_standbys_defined' without any lock then read
> it again with a lock if the parameter is true. So, I just put that
> check back and also added a similar check in WalSndUpdateProgress.
> 3.
> @@ -1392,11 +1481,21 @@ pgoutput_truncate(LogicalDecodingContext *ctx,
> ReorderBufferTXN *txn,
>   continue;
> 
>   relids[nrelids++] = relid;
> +
> + /* Send BEGIN if we haven't yet */
> + if (txndata && !txndata->sent_begin_txn)
> + pgoutput_send_begin(ctx, txn);
>   maybe_send_schema(ctx, change, relation, relentry);
>   }
> 
>   if (nrelids > 0)
>   {
> + txndata = (PGOutputTxnData *) txn->output_plugin_private;
> +
> + /* Send BEGIN if we haven't yet */
> + if (txndata && !txndata->sent_begin_txn)
> + pgoutput_send_begin(ctx, txn);
> +
> 
> Why do we need to try sending the begin in the second check? I think
> it should be sufficient to do it in the above loop.
> 
> I have made these and a number of other changes in the attached patch.
> Do let me know what you think of the attached?

The changes look good to me.
I did some basic tests on the patch and didn't find any other problems.

Attached is the new version of the patch.
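
For reference, a minimal scenario that produces an empty transaction for a
walsender is something like this (a sketch; table and publication names are
illustrative):

-- publisher: only pub_t is published
create table pub_t (a int);
create table other_t (a int);
create publication pub for table pub_t;

-- this transaction decodes to a BEGIN/COMMIT pair with no changes for pub,
-- i.e. an empty transaction that the patch skips
begin;
insert into other_t values (1);
commit;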

Best regards,
Hou zj


v28-0001-Skip-empty-transactions-for-logical-replication.patch
Description:  v28-0001-Skip-empty-transactions-for-logical-replication.patch


RE: logical replication empty transactions

2022-03-21 Thread houzj.f...@fujitsu.com
> On Monday, March 21, 2022 6:01 PM Amit Kapila 
> wrote:
> >
> > On Sat, Mar 19, 2022 at 9:10 AM Ajin Cherian  wrote:
> > >
> > > On Thu, Mar 17, 2022 at 10:43 PM Amit Kapila
> > > 
> > wrote:
> > >
> > > > 3. Can we add a simple test for it in one of the existing test
> > > > files(say in 001_rep_changes.pl)?
> > >
> > > added a simple test.
> > >
> >
> > This doesn't verify if the transaction is skipped. I think we should
> > extend this test to check for a DEBUG message in the Logs (you need to
> > probably set log_min_messages to DEBUG1 for this test). As an example,
> > you can check the patch [1]. Also, it seems by mistake you have added
> > wait_for_catchup() twice.
> 
> I added a testcase to check the DEBUG message.
> 
> > Few other comments:
> > =
> > 1. Let's keep the parameter name as skipped_empty_xact in
> > OutputPluginUpdateProgress so as to not confuse with the other patch's
> > [2] keep_alive parameter. I think in this case we must send the
> > keep_alive message so as to not make the syncrep wait whereas in the
> > other patch we only need to send it periodically based on
> > wal_sender_timeout parameter.
> > 2. The new function SyncRepEnabled() seems confusing to me as the
> > comments in SyncRepWaitForLSN() clearly state why we need to first
> > read the parameter 'sync_standbys_defined' without any lock then read
> > it again with a lock if the parameter is true. So, I just put that
> > check back and also added a similar check in WalSndUpdateProgress.
> > 3.
> > @@ -1392,11 +1481,21 @@ pgoutput_truncate(LogicalDecodingContext *ctx,
> > ReorderBufferTXN *txn,
> >   continue;
> >
> >   relids[nrelids++] = relid;
> > +
> > + /* Send BEGIN if we haven't yet */
> > + if (txndata && !txndata->sent_begin_txn)
> > + pgoutput_send_begin(ctx, txn);
> >   maybe_send_schema(ctx, change, relation, relentry);
> >   }
> >
> >   if (nrelids > 0)
> >   {
> > + txndata = (PGOutputTxnData *) txn->output_plugin_private;
> > +
> > + /* Send BEGIN if we haven't yet */
> > + if (txndata && !txndata->sent_begin_txn)
> > + pgoutput_send_begin(ctx, txn);
> > +
> >
> > Why do we need to try sending the begin in the second check? I think
> > it should be sufficient to do it in the above loop.
> >
> > I have made these and a number of other changes in the attached patch.
> > Do let me know what you think of the attached?
> 
> The changes look good to me.
> I did some basic tests on the patch and didn't find any other problems.
> 
> Attached is the new version of the patch.

Oh, sorry, I posted the wrong patch, here is the correct one.

Best regards,
Hou zj


v28-0001-Skip-empty-transactions-for-logical-replication.patch
Description:  v28-0001-Skip-empty-transactions-for-logical-replication.patch


RE: logical replication empty transactions

2022-03-23 Thread houzj.f...@fujitsu.com
On Tuesday, March 22, 2022 7:50 PM Amit Kapila  wrote:
> On Tue, Mar 22, 2022 at 7:25 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > > On Monday, March 21, 2022 6:01 PM Amit Kapila
> > > 
> > > wrote:
> >
> > Oh, sorry, I posted the wrong patch, here is the correct one.
> >
> 
> The test change looks good to me. I think additionally we can verify that the
> record is not reflected in the subscriber table. Apart from that, I had made
> minor changes mostly in the comments in the attached patch. If those look
> okay to you, please include those in the next version.

Thanks, the changes look good to me; I merged the diff patch.

Attached is the new version of the patch, which includes the following changes:

- Fix a typo.
- Change the requestReply flag of the newly added WalSndKeepalive to false,
  because the subscriber can judge whether it's necessary to post a reply based
  on the received LSN.
- Add a testcase to make sure there is no data on the subscriber side when the
  transaction is skipped.
- Rename the flag skipped_empty_xact to skipped_xact, which seems more
  understandable.
- Merge Amit's suggested changes.


Best regards,
Hou zj


v29-0001-Skip-empty-transactions-for-logical-replication.patch
Description:  v29-0001-Skip-empty-transactions-for-logical-replication.patch


RE: Determine parallel-safety of partition relations for Inserts

2021-02-21 Thread houzj.f...@fujitsu.com
Hi,
 
Attaching v7 patches with the following change:
   * rebase the code on Greg's latest parallel insert patch.

Please consider it for further review.

Best regards,
houzj



v7_0004-reloption-parallel_dml-test-and-doc.patch
Description: v7_0004-reloption-parallel_dml-test-and-doc.patch


v7_0001-guc-option-enable_parallel_dml-src.patch
Description: v7_0001-guc-option-enable_parallel_dml-src.patch


v7_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Description: v7_0002-guc-option-enable_parallel_dml-doc-and-test.patch


v7_0003-reloption-parallel_dml-src.patch.patch
Description: v7_0003-reloption-parallel_dml-src.patch.patch


RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-21 Thread houzj.f...@fujitsu.com
> Posting a new version of the patches, with the following updates:
> - Moved the update of glob->relationOIDs (i.e. addition of partition OIDs that
> plan depends on, resulting from parallel-safety checks) from within
> max_parallel_hazard() to set_plan_references().
> - Added an extra test for partition plan-cache invalidation.
> - Simplified query_has_modifying_cte() temporary bug-fix.
> - Added a comment explaining why parallel-safety of partition column defaults
> is not checked.
> - Minor simplification: hasSubQuery return to
> is_parallel_possible_for_modify().

Hi

(I may be wrong here.)
I noticed that the patch does not have a check for partial indexes (index predicates).
It seems parallel index build checks this, like the following:
--
/*
 * Determine if it's safe to proceed.
 *
 * Currently, parallel workers can't access the leader's temporary tables.
 * Furthermore, any index predicate or index expressions must be parallel
 * safe.
 */
if (heap->rd_rel->relpersistence == RELPERSISTENCE_TEMP ||
    !is_parallel_safe(root, (Node *) RelationGetIndexExpressions(index)) ||
    !is_parallel_safe(root, (Node *) RelationGetIndexPredicate(index)))
--

Should we do a parallel-safety check for it?
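
For example, an index like the following has a parallel-unsafe predicate, which I
think such a check would need to catch (a sketch; the function, table, and index
names are illustrative):
--
create function f_unsafe(i int) returns bool
    immutable parallel unsafe
    language sql as 'select i > 0';

create table t (a int);
create index t_part_idx on t (a) where f_unsafe(a);   -- partial index
--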

Best regards,
houzj




RE: A reloption for partitioned tables - parallel_workers

2021-02-22 Thread houzj.f...@fujitsu.com
> > 4. Maybe it also doesn't make sense to consider the parent relation's
> > parallel_workers if Parallel Append is disabled
> > (enable_parallel_append = off).  That's because with a simple
> > (non-parallel) Append running under Gather, all launched parallel
> > workers process the same partition before moving to the next one.
> > OTOH, one's intention of setting parallel_workers on the parent
> > partitioned table would most likely be to distribute workers across
> > partitions, which is only possible with parallel Append
> > (enable_parallel_append = on).  So, the new code should look like
> > this:
> >
> > if (IS_PARTITIONED_REL(rel) && enable_parallel_append)
> > parallel_workers = compute_parallel_worker(rel, -1, -1,
> > max_parallel_workers_per_gather);
> 
> Here is an updated version of the Seamus' patch that takes into account these
> and other comments received on this thread so far.
> Maybe warrants adding some tests too but I haven't.
> 
> Seamus, please register this patch in the next commit-fest:
> https://commitfest.postgresql.org/32/
> 
> If you haven't already, you will need to create a community account to use 
> that
> site.

It seems the patch does not include the code that gets parallel_workers from
the new struct "PartitionedTableRdOptions".
Did I miss something?

Best regards,
houzj



RE: A reloption for partitioned tables - parallel_workers

2021-02-23 Thread houzj.f...@fujitsu.com
> > It seems the patch does not include the code that get the
> > parallel_workers from new struct " PartitionedTableRdOptions ", Did I miss
> something ?
> 
> Aren't the following hunks in the v2 patch what you meant?
> 
> diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
> index c687d3ee9e..f8443d2361 100644
> --- a/src/backend/access/common/reloptions.c
> +++ b/src/backend/access/common/reloptions.c
> @@ -377,7 +377,7 @@ static relopt_int intRelOpts[] =
>  	{
>  		"parallel_workers",
>  		"Number of parallel processes that can be used per executor node for this relation.",
> -		RELOPT_KIND_HEAP,
> +		RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
>  		ShareUpdateExclusiveLock
>  	},
>  	-1, 0, 1024
> @@ -1962,12 +1962,18 @@ bytea *
>  partitioned_table_reloptions(Datum reloptions, bool validate)
>  {
>  	/*
> -	 * There are no options for partitioned tables yet, but this is able to do
> -	 * some validation.
> +	 * Currently the only setting known to be useful for partitioned tables
> +	 * is parallel_workers.
>  	 */
> +	static const relopt_parse_elt tab[] = {
> +		{"parallel_workers", RELOPT_TYPE_INT,
> +		 offsetof(PartitionedTableRdOptions, parallel_workers)},
> +	};
> +
>  	return (bytea *) build_reloptions(reloptions, validate,
>  									  RELOPT_KIND_PARTITIONED,
> -									  0, NULL, 0);
> +									  sizeof(PartitionedTableRdOptions),
> +									  tab, lengthof(tab));
>  }
> 
> diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
> index 10b63982c0..fe114e0856 100644
> --- a/src/include/utils/rel.h
> +++ b/src/include/utils/rel.h
> @@ -308,6 +308,16 @@ typedef struct StdRdOptions
>  	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
>  } StdRdOptions;
> 
> +/*
> + * PartitionedTableRdOptions
> + *		Contents of rd_options for partitioned tables
> + */
> +typedef struct PartitionedTableRdOptions
> +{
> +	int32		vl_len_;		/* varlena header (do not touch directly!) */
> +	int			parallel_workers;	/* max number of parallel workers */
> +} PartitionedTableRdOptions;
> +
>  #define HEAP_MIN_FILLFACTOR			10
>  #define HEAP_DEFAULT_FILLFACTOR		100
Hi,

I am not sure.
IMO, for a normal table, we use the following macro to get parallel_workers:
--
/*
 * RelationGetParallelWorkers
 *		Returns the relation's parallel_workers reloption setting.
 *		Note multiple eval of argument!
 */
#define RelationGetParallelWorkers(relation, defaultpw) \
	((relation)->rd_options ? \
	 ((StdRdOptions *) (relation)->rd_options)->parallel_workers : (defaultpw))
--

Since we added the new struct "PartitionedTableRdOptions", it seems we need to get
parallel_workers in a different way.
Do we need a similar macro to get a partitioned table's parallel_workers?

Like:
#define PartitionedTableGetParallelWorkers(relation, defaultpw) \
	((relation)->rd_options ? \
	 ((PartitionedTableRdOptions *) (relation)->rd_options)->parallel_workers : \
	 (defaultpw))
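
As a usage sketch, once the reloption is wired up this way, it could be set like
(hypothetical until the patch lands; the table name is illustrative):
--
alter table measurement set (parallel_workers = 8);   -- measurement is a partitioned table
--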

Best regards,
houzj



RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-24 Thread houzj.f...@fujitsu.com
> > It is quite possible what you are saying is correct but I feel that is
> > not this patch's fault. So, won't it better to discuss this in a
> > separate thread?
> >
> > Good use case but again, I think this can be done as a separate patch.
> 
> Agreed.
> I think even the current patch offers great benefits and can be committed in 
> PG
> 14, even if all my four feedback comments are left unaddressed.  I just 
> touched
> on them for completeness in terms of typically expected use cases.  They will
> probably be able to be implemented along the current design.
> 
> 
> 
> > I think here you are talking about the third patch (Parallel Inserts).
> > I guess if one has run inserts parallelly from psql then also similar
> > behavior would have been observed. For tables, it might lead to better
> > results once we have the patch discussed at [1]. Actually, this needs
> > more investigation.
> >
> > [1] -
> >
> https://www.postgresql.org/message-id/20200508072545.GA9701%40telsas
> o
> > ft.com
> 
> That looks interesting and worth a try.

Hi,

I tested the bitmapscan with both the multi-insert patch and the parallel insert
patch applied.
But the performance degradation and the table size increase still happened on my
machine.

To better analyze this issue, I did some more research on it (with only the
parallel insert patch applied).

I added some code to track the time spent in index operations.
From the results [1], we can see more workers bring more cost in
_bt_search_insert() in each worker.
After debugging, the most costly part is the following:
-
	/* drop the read lock on the page, then acquire one on its child */
	*bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-

It seems the order of the parallel bitmap scan's results leads to more lock time
in parallel insert.
[1]---parallel bitmap scan--
worker 0:
psql:test.sql:10: INFO:  insert index _bt_search_insert time:834735
psql:test.sql:10: INFO:  insert index total time:1895330
psql:test.sql:10: INFO:  insert tuple time:628064

worker 2:
psql:test.sql:10: INFO:  insert index _bt_search_insert time:1552242
psql:test.sql:10: INFO:  insert index total time:2374741
psql:test.sql:10: INFO:  insert tuple time:314571

worker 4:
psql:test.sql:10: INFO:  insert index _bt_search_insert time:2496424
psql:test.sql:10: INFO:  insert index total time:3016150
psql:test.sql:10: INFO:  insert tuple time:211741


Based on the above, I tried to change the order of the results that the bitmapscan returns.
In the original test, we prepare the data in order (like: generate_series(1,1,1)).
If we change the order in which we insert the data into the source table, the
performance degradation does not always happen [2], and the table size difference
will be small.

---out of order source table-
insert into x(a,b,c) select i,i+1,i+2 from generate_series(1,6) as t(i) 
order by random();
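
For comparison, the in-order preparation looks like this (a sketch; the row count
is illustrative, not the original test's):
insert into x(a,b,c) select i,i+1,i+2 from generate_series(1, 1000000) as t(i);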


Test results when the source table is out of order (using bitmap heap scan):
[2]
Worker 0:
Execution Time: 37028.006 ms
Worker 2:
Execution Time: 11355.153 ms
Worker 4:
Execution Time: 9273.398 ms


So, this performance degradation issue seems related to the order of the data in
the source table.
It does not always happen. Do we need a specific fix for it?

For multi-insert, I guess the reason why it does not solve the performance
problem is that we do not actually have an API for multi-index insert
like the tableam API rd_tableam->multi_insert(), so we still execute
ExecInsertIndexTuples in a loop for the multi-index insert.

I plan to do some more tests for multi-insert and parallel insert with an
out-of-order source table.

Best regards,
houzj


RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-25 Thread houzj.f...@fujitsu.com
> I added some code to track the time spent in index operations.
> From the results [1], we can see more workers bring more cost in
> _bt_search_insert() in each worker.
> After debugging, the most costly part is the following:
> -
>   /* drop the read lock on the page, then acquire one on its child */
>   *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
> -
> 
> It seems the order of the parallel bitmap scan's results leads to more lock
> time in parallel insert.
> [1]---parallel bitmap scan--
> worker 0:
> psql:test.sql:10: INFO:  insert index _bt_search_insert time:834735
> psql:test.sql:10: INFO:  insert index total time:1895330
> psql:test.sql:10: INFO:  insert tuple time:628064
> 
> worker 2:
> psql:test.sql:10: INFO:  insert index _bt_search_insert time:1552242
> psql:test.sql:10: INFO:  insert index total time:2374741
> psql:test.sql:10: INFO:  insert tuple time:314571
> 
> worker 4:
> psql:test.sql:10: INFO:  insert index _bt_search_insert time:2496424
> psql:test.sql:10: INFO:  insert index total time:3016150
> psql:test.sql:10: INFO:  insert tuple time:211741
> 
> 
> Based on the above, I tried to change the order of the results that the bitmapscan returns.
> In the original test, we prepare the data in order (like: generate_series(1,1,1)).
> If we change the order in which we insert the data into the source table, the
> performance degradation does not always happen [2], and the table size
> difference will be small.
> 
> ---out of order source table-
> insert into x(a,b,c) select i,i+1,i+2 from generate_series(1,6) as t(i)
> order by random();
> 
> 
> Test results when source table out of order(using bitmap heap scan):
> [2]
> Worker 0:
> Execution Time: 37028.006 ms
> Worker 2:
> Execution Time: 11355.153 ms
> Worker 4:
> Execution Time: 9273.398 ms
> 
> 
> So, this performance degradation issue seems related to the order of the data
> in the source table.
> It does not always happen. Do we need a specific fix for it?

After doing some more tests on it (the performance degradation does not happen
when the source table is out of order),
I think we can say the performance degradation is related to the order of the
data in the source table.
Also, on the master branch, I found the order of data in the source table does
not affect the planner when generating the plan (for the SELECT part).

As we can see from [1][2], if the source table's data is in order and
max_parallel_workers_per_gather is set to 4,
the planner will choose a parallel seqscan here, but it is actually slower than
a serial bitmapscan.
If the data in the source table is out of order, the performance degradation
does not happen [3][4].

So, the influence of the data order seems to be a normal phenomenon, and I think
we do not need to do anything about it (currently).
It seems better to mark it as a TODO which we can improve in the future.

(Since the performance degradation in parallel bitmap scan is because of the
lock in _bt_search, it will not always happen when the target table already has
data; fewer race conditions will happen when inserting in parallel into an
evenly distributed btree.)



Test result with sql: "postgres=# explain analyze verbose select a from x where a<8 or (a%2=0 and a>19900);"

[1]---Source table data in order and max_parallel_workers_per_gather = 0---
Bitmap Heap Scan on public.x  (cost=22999.40..4991850.30 rows=91002 width=4) (actual time=45.445..201.322 rows=57 loops=1)
   Output: a
   Recheck Cond: ((x.a < 8) OR (x.a > 19900))
   Filter: ((x.a < 8) OR (((x.a % 2) = 0) AND (x.a > 19900)))
   Rows Removed by Filter: 50
   Heap Blocks: exact=5840
   ->  BitmapOr  (cost=22999.40..22999.40 rows=1242768 width=0) (actual time=44.622..44.637 rows=0 loops=1)
         ->  Bitmap Index Scan on x_a_idx  (cost=0.00..1575.70 rows=85217 width=0) (actual time=3.622..3.634 rows=7 loops=1)
               Index Cond: (x.a < 8)
         ->  Bitmap Index Scan on x_a_idx  (cost=0.00..21378.20 rows=1157551 width=0) (actual time=40.998..40.998 rows=100 loops=1)
               Index Cond: (x.a > 19900)
 Planning Time: 0.199 ms
 Execution Time: 232.327 ms

[2]---Source table data in order and max_parallel_workers_per_gather = 4---
Gather  (cost=1000.00..2091183.08 rows=91002 width=4) (actual time=0.594..4216.197 rows=57 loops=1)
   Output: a
   Workers Planned: 4
   Workers Launched: 4
   ->  Parallel Seq Scan on public.x  (cost=0.00..2081082.88 rows=22750 width=4) (actual time=0.087..4197.570 rows=116000 loops=5)
         Output: a
         Filter: ((x.a < 8) OR (((x.a % 2) = 0) AND (x.a > 19900)))
         Rows Removed by Filter: 39884000

RE: simplifying foreign key/RI checks

2021-03-01 Thread houzj.f...@fujitsu.com
Hi Amit,

(Sorry about not CCing the hackers list.)
I have a question about command IDs here.
It's probably not directly related to your patch, so I am sorry if it bothers you.

+   /*
+* Start the scan. To make the changes of the current command visible to
+* the scan and for subsequent locking of the tuple (if any) found,
+* increment the command counter.
+*/
+   CommandCounterIncrement();

For an insert on an FK relation, is it necessary to create a new command ID every time?
I think it is only necessary when the command modifies the referenced table,
for example: 1) it has a modifying CTE
             2) it has a modifying function (trigger/domain...)

Neither of the above seems to be supported in parallel mode (parallel unsafe),
so I was wondering if we can avoid the CommandCounterIncrement in parallel mode.

Best regards,
houzj


Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table

2021-03-03 Thread houzj.f...@fujitsu.com
Hi hackers,

Currently, postgres increments the command ID in the RI trigger every time when
inserting into a referencing table (FK relation):
RI_FKey_check -> ri_PerformCheck -> SPI_execute_snapshot -> _SPI_execute_plan -> CommandCounterIncrement

It can be a blocker for supporting "parallel insert into a referencing table",
because we do not allow incrementing the command ID in parallel mode.

So, I was wondering if we can avoid incrementing the command ID in some cases
when executing INSERT.

As far as I can see, it's only necessary to increment the command ID when the
INSERT command modifies the referenced table.
And an INSERT command only has one target table; modifications of other tables
can happen in the following cases:

1) it has a modifying CTE which modifies the referenced table
2) it has a modifying function which modifies the referenced table
(If I missed something please let me know.)

Since the above two cases are not supported in parallel mode (parallel unsafe),
IMO it's not necessary to increment the command ID in parallel mode;
we can just skip CommandCounterIncrement when in parallel mode.

With this change, we can smoothly support "parallel insert into a referencing
table", which is desirable.

Part of what I plan to change is as follows:
---
 RI_FKey_check_ins(PG_FUNCTION_ARGS)
 {
+   bool needCCI = true;
+
	/* Check that this is a valid trigger call on the right time and event. */
	ri_CheckTrigger(fcinfo, "RI_FKey_check_ins", RI_TRIGTYPE_INSERT);

+   /*
+* We do not need to increment the command counter
+* in parallel mode, because any other modifications
+* other than the insert event itself are parallel unsafe.
+* So, there is no chance to modify the pk relation.
+*/
+   if (IsInParallelMode())
+   needCCI = false;
+
/* Share code with UPDATE case. */
-   return RI_FKey_check((TriggerData *) fcinfo->context);
+   return RI_FKey_check((TriggerData *) fcinfo->context, needCCI);
...
---

Thoughts?

Best regards,
houzj




RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table

2021-03-03 Thread houzj.f...@fujitsu.com
Hi,

> Why do we have to increment the command ID when the INSERT's target table
> is a referenced table?

Please refer to the explanation below.

> > And INSERT command only have one target table, the modification on
> > other tables can happen in the following cases.
> >
> > 1) has modifyingcte which modifies the referenced table
> > 2) has modifying function which modifies the referenced table.
> > (If I missed something please let me know)
> 
> Also, why do we need CCI in these cases?  What kind of problem would
> happen if we don't do CCI?
 
From the wiki[1], CCI is to let statements can not see the rows they modify.

Here is an example of case 1):
(Note that tables referenced and referencing are both empty.)
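For completeness, the assumed setup is something like (a sketch; the column name
is illustrative):
-
create table referenced (a int primary key);
create table referencing (a int references referenced(a));
-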
-
postgres=# with cte as (insert into referenced values(1)) insert into referencing values(1);
-
The INSERT here will first modify the referenced table (PK table) and then modify
the referencing table.
When modifying the referencing table, it has to check whether the tuple to be
inserted exists in the referenced table.
At this moment, CCI makes the modification of referenced in the WITH clause
visible when checking the existence of the tuple.
If we do not CCI here, postgres will report a constraint error because it cannot
find the tuple in the referenced table.

What do you think?


> > Since the above two cases are not supported in parallel mode (parallel
> > unsafe), IMO it's not necessary to increment the command ID in parallel
> > mode; we can just skip CommandCounterIncrement when in parallel mode.
> >
> > +   /*
> > +* We do not need to increment the command counter
> > +* in parallel mode, because any other modifications
> > +* other than the insert event itself are parallel unsafe.
> > +* So, there is no chance to modify the pk relation.
> > +*/
> > +   if (IsInParallelMode())
> > +   needCCI = false;

>  I'm worried about having this dependency in RI check, because the planner 
> may allow parallel INSERT in these cases in the future.

If we support parallel inserts that can have other modifications in the future,
I think we will also be able to share or increment the command ID in parallel
workers at that time, and it seems we can remove this dependency then.
How about adding some more comments about this to remind future developers?
/*
 * If extra parallel modifications are supported in the future, this dependency
 * should be removed.
 */

[1] 
https://wiki.postgresql.org/wiki/Developer_FAQ#What_is_CommandCounterIncrement.28.29.3F

Best regards,
houzj





RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table

2021-03-03 Thread houzj.f...@fujitsu.com
> From the wiki[1], CCI is to let statements can not see the rows they modify.
Sorry, there is a typo here: "not".
I meant that CCI is to let statements see the rows they modify.

Best regards,
houzj





RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table

2021-03-04 Thread houzj.f...@fujitsu.com
> > >  I'm worried about having this dependency in RI check, because the
> > > planner
> > may allow parallel INSERT in these cases in the future.
> >
> > If we support parallel inserts that can have other modifications in the
> > future, I think we will also be able to share or increment the command ID
> > in parallel workers at that time, and it seems we can remove this
> > dependency then.
> > How about adding some more comments about this to remind future developers?
> > /*
> >  * If extra parallel modifications are supported in the future, this
> >  * dependency should be removed.
> >  */
> 
> Agreed.
> 
> I'm excited to see PostgreSQL's parallel DML work in wider use cases and
> satisfy users' expectations.

Thanks!
Attaching the first version of the patch, which avoids CCI in the RI trigger
when inserting into a referencing table.

Best regards,
houzj


0001-Avoid-CCI-in-RI-trigger-when-INSERT-fk-relation.patch
Description:  0001-Avoid-CCI-in-RI-trigger-when-INSERT-fk-relation.patch


RE: should INSERT SELECT use a BulkInsertState?

2021-03-08 Thread houzj.f...@fujitsu.com
> > I am very interested in this patch, and I plan to do some experiments with
> > the patch.
> > Can you please rebase the patch because it seems it cannot be applied to
> > the master now.
> 
> Thanks for your interest.
> 
> I was sitting on a rebased version since the bulk FDW patch will cause 
> conflicts,
> and since this should maybe be built on top of the table-am patch (2871).
> Have fun :)
Hi,

When testing with the patch, I found I cannot use "\d tablename".
It reports the following error; is this related to the patch?

--
ERROR:  did not find '}' at end of input node at character 141
STATEMENT:  SELECT pol.polname, pol.polpermissive,
  CASE WHEN pol.polroles = '{0}' THEN NULL ELSE pg_catalog.array_to_string(array(select rolname from pg_catalog.pg_roles where oid = any (pol.polroles) order by 1),',') END,
  pg_catalog.pg_get_expr(pol.polqual, pol.polrelid),
  pg_catalog.pg_get_expr(pol.polwithcheck, pol.polrelid),
  CASE pol.polcmd
    WHEN 'r' THEN 'SELECT'
    WHEN 'a' THEN 'INSERT'
    WHEN 'w' THEN 'UPDATE'
    WHEN 'd' THEN 'DELETE'
  END AS cmd
FROM pg_catalog.pg_policy pol
WHERE pol.polrelid = '58112' ORDER BY 1;
ERROR:  did not find '}' at end of input node
LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
--

Best regards,
houzj




RE: should INSERT SELECT use a BulkInsertState?

2021-03-08 Thread houzj.f...@fujitsu.com
> > > I am very interested in this patch, and I plan to do some
> > > experiments with the patch.
> > > Can you please rebase the patch because it seems it cannot be applied to
> > > the master now.
> >
> > Thanks for your interest.
> >
> > I was sitting on a rebased version since the bulk FDW patch will cause
> > conflicts, and since this should maybe be built on top of the table-am patch
> (2871).
> > Have fun :)
> Hi,
> 
> When testing with the patch, I found I cannot use "\d tablename".
> It reports the following error; is this related to the patch?

Sorry, solved by re-initdb.

Best regards,
houzj




RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-08 Thread houzj.f...@fujitsu.com
> >
> > I've attached an updated set of patches with the suggested locking changes.
> >
> 
> Amit L, others, do let me know if you have still more comments on
> 0001* patch or if you want to review it further?

I took a look into the latest 0001 patch, and it looks good to me.

Best regards,
houzj


Questions about CommandIsReadOnly

2021-03-09 Thread houzj.f...@fujitsu.com
Hi hackers,

When reading the code, I found that in the function CommandIsReadOnly [1],
"select for update/share" is defined as "not read only".
[1]-
	if (pstmt->rowMarks != NIL)
		return false;			/* SELECT FOR [KEY] UPDATE/SHARE */
-

And from the comment [2], I think it means we need to do CCI for "select for
update/share".
I am not very familiar with this; is there some reason that we have to do CCI
for "select for update/share"?
Or did I misunderstand?

[2]-
* the query must be *in truth* read-only, because the caller wishes
* not to do CommandCounterIncrement for it.
-
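
For example, a statement like the following sets rowMarks and is therefore
treated as not read-only (a sketch; the table name is illustrative):
-
select * from accounts where id = 1 for update;
-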

Best regards,
houzj






RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table

2021-03-09 Thread houzj.f...@fujitsu.com
> Attaching the first version patch which avoid CCI in RI trigger when insert 
> into
> referencing table.

After some more thought on how to support parallel insert into an FK relation,
it seems we do not have a cheap way to implement this feature.
Please see the explanation below:

In RI_FKey_check, postgres currently executes a "select xx for key share" query
to check that the foreign key exists in the PK table.
However, "select for update/share" is considered parallel unsafe. It may be
dangerous to do this in parallel mode, so we may want to change this.
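
For reference, the check query built by RI_FKey_check has roughly this shape
(a sketch; the table and column names are placeholders):
-
SELECT 1 FROM ONLY <pktable> x WHERE pkatt1 = $1 [AND ...] FOR KEY SHARE OF x;
-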

And also, "select for update/share" is considered "not read only", which will
force read_only = false in _SPI_execute_plan.
So, it seems we would have to completely change the implementation of RI_FKey_check.

At the same time, the "simplifying foreign key/RI checks" thread is trying to
replace "select xx for key share" with index_beginscan() + table_tuple_lock()
(I think it's parallel safe).
Maybe we can try to support parallel insert into an FK relation after the
"simplifying foreign key/RI checks" patch is applied?

Best regards,
houzj





RE: Parallel Inserts in CREATE TABLE AS

2021-03-10 Thread houzj.f...@fujitsu.com
> Seems like v22 patch was failing in cfbot for one of the unstable test cases.
> Attaching v23 patch set with modification in 0003 and 0004 patches. No
> changes to 0001 and 0002 patches. Hopefully cfbot will be happy with v23.
> 
> Please consider v23 for further review.
Hi,

I was looking into the latest patch, here are some comments:

1)
 * (Note that we do allow CREATE TABLE AS, SELECT INTO, and CREATE
 * MATERIALIZED VIEW to use parallel plans, but as of now, only the 
leader
 * backend writes into a completely new table.  In the future, we can

Since we will support parallel insert with CTAS, these existing comments need to
be changed.

2)
In parallel.sgml

The query writes any data or locks any database rows. If a query
contains a data-modifying operation either at the top level or within
a CTE, no parallel plans for that query will be generated.  As an
exception, the commands CREATE TABLE ... AS,

As with 1), we should mention that parallel insert is now supported with CTAS.

3)
ExecInitParallelPlan(PlanState *planstate, EState *estate,
 Bitmapset *sendParams, int nworkers,
-int64 tuples_needed)
+int64 tuples_needed,
+ParallelInsertCmdKind parallel_ins_cmd,
+void *parallel_ins_info)

Would it be better to place ParallelInsertCmdKind in the struct ParallelInsertCTASInfo?
We could pass fewer parameters in some places.
Like:
typedef struct ParallelInsertCTASInfo
{
+   ParallelInsertCmdKind parallel_ins_cmd;
IntoClause *intoclause;
Oid objectid;

} ParallelInsertCTASInfo;

Best regards,
houzj


RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-10 Thread houzj.f...@fujitsu.com
> > 2. Should we keep the default value of GUC to on or off? It is
> > currently off. I am fine keeping it off for this release and we can
> > always turn it on in the later releases if required. Having said that,
> > I see the value in keeping it on because in many cases Insert ...
> > Select will be used for large data and there we will see a benefit of
> > parallelism and users facing trouble (who have a very large number of
> > partitions with less data to query) can still disable the parallel
> > insert for that particular table. Also, the other benefit of keeping
> > it on till at least the beta period is that this functionality will
> > get tested and if we found reports of regression then we can turn it
> > off for this release as well.
> >
> > Thoughts?
> 
> IMHO, we should keep it on because in most of the cases it is going the give
> benefit to the user, and if for some specific scenario where a table has a 
> lot of
> partition then it can be turned off using reloption.  And, if for some users 
> most
> of the tables are like that where they always have a lot of partition then 
> the user
> can turn it off using guc.
> 
I think your suggestion makes sense.
If no one has other opinions, I will soon post a new version with the default
set to enable_parallel_insert=on.

Best regards,
houzj



RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table

2021-03-11 Thread houzj.f...@fujitsu.com

> > After some more on how to support parallel insert into fk relation.
> > It seems we do not have a cheap way to implement this feature.
> >
> > In RI_FKey_check, Currently, postgres execute "select xx for key
> > share" to check that foreign key exists in PK table.
> > However "select for update/share" is considered as parallel unsafe. It
> > may be dangerous to do this in parallel mode, we may want to change this.
> 
> Hmm, I guess the parallel leader and workers can execute SELECT FOR KEY
> SHARE, if the parallelism infrastructure allows execution of SPI calls.  The 
> lock
> manager supports tuple locking in parallel leader and workers by the group
> locking.  Also, the tuple locking doesn't require combo Cid, which is 
> necessary
> for parallel UPDATE and DELETE.
> 
> Perhaps the reason why SELECT FOR is treated as parallel-unsafe is that tuple
> locking modifies data pages to store lock information in the tuple header.  
> But
> now, page modification is possible in parallel processes, so I think we can
> consider SELECT FOR as parallel-safe.  (I may be too optimistic.)

I think you are right.
After reading the original parallel-safety check's commit message and
README.tuplock, and having some discussions with the author,
I think the reason why [SELECT FOR UPDATE/SHARE] is parallel unsafe is that
[SELECT FOR] calls GetCurrentCommandId(true).
GetCurrentCommandId(true) was not supported in parallel workers, but [SELECT FOR]
needs a command ID to mark the change (lock info).

Fortunately, with Greg's parallel insert patch, we can use command IDs in
parallel workers.
So, IMO, in the parallel insert case, the RI check is parallel safe.


> > And also, "select for update/share" is considered as "not read only"
> > which will force readonly = false in _SPI_execute_plan.
> 
> read_only is used to do CCI.  Can we arrange to skip CCI?

Yes, we can.
Currently, I am trying to add one parameter (need CCI) to SPI_execute_snapshot
and _SPI_execute_plan to control CCI.
I am still researching whether there is a more elegant way.

> > At the same time, " simplifying foreign key/RI checks " thread is
> > trying to replace "select xx for key share" with
> > index_beginscan()+table_tuple_lock() (I think it’s parallel safe).
> > May be we can try to support parallel insert fk relation after "
> > simplifying foreign key/RI checks " patch applied ?
> 
> Why do you think it's parallel safe?
> 
> Can you try running parallel INSERT SELECT on the target table with FK and see
> if any problem happens?

I have tested this in various cases, and all the test results look good:
* different workers lock the same tuple
* different workers lock different tuples
* no lock
* lock with a concurrent update
* a constraint error

 
> 
> Surprisingly, Oracle doesn't support parallel INSERT SELECT on a table with FK
> as follows.  SQL Server doesn't mention anything, so I guess it's supported.
> This is a good chance for PostgreSQL to exceed Oracle.
> 
> https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/types-parallelism.html#GUID-D4CFC1F2-44D3-4BE3-B5ED-6A309EB8BF06
> 
> Table 8-1 Referential Integrity Restrictions
> DML Statement   Issued on Parent   Issued on Child    Self-Referential
> INSERT          (Not applicable)   Not parallelized   Not parallelized

Ah, that's a really good chance.

To support parallel insert into an FK relation, there are two scenarios that
need attention (see the sketch after this list):
1) the foreign key and primary key are on the same table (INSERT's target table),
   i.e. referenced and referencing are the same table
2) the referenced and referencing tables are both partitions of INSERT's target table
(These cases seem really rare to me.)
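
A minimal sketch of case 1), with an illustrative table name:
-
create table node (id int primary key, parent_id int references node(id));
-- an INSERT ... SELECT into node inserts into the referencing side and checks
-- against the referenced side of the same table
-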

In these two cases, the referenced table could be modified while INSERTing, and
CCI is necessary, so I think we should treat these cases as parallel restricted
while doing the safety check.

Attaching a V1 patch that implements the above feature and passes the regression tests.

Best regards,
houzj


v1_0002-WIP-safety-check.patch
Description: v1_0002-WIP-safety-check.patch


v1_0001-WIP-avoid-cci.patch
Description: v1_0001-WIP-avoid-cci.patch


RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-11 Thread houzj.f...@fujitsu.com
> I guess to have the finer granularity we'd have to go with 
> enable_parallel_insert,
> which then would mean possibly having to later add enable_parallel_update,
> should parallel update have similar potential overhead in the parallel-safety
> checks (which to me, looks like it could, and parallel delete may not ...)
> 
> It's a shame there is no "set" type for GUC options.
> e.g.
> enable_parallel_dml='insert,update'
> Maybe that's going too far.
> 
> > 2. Should we keep the default value of GUC to on or off? It is
> > currently off. I am fine keeping it off for this release and we can
> > always turn it on in the later releases if required. Having said that,
> > I see the value in keeping it on because in many cases Insert ...
> > Select will be used for large data and there we will see a benefit of
> > parallelism and users facing trouble (who have a very large number of
> > partitions with less data to query) can still disable the parallel
> > insert for that particular table. Also, the other benefit of keeping
> > it on till at least the beta period is that this functionality will
> > get tested and if we found reports of regression then we can turn it
> > off for this release as well.
> >
> 
> I'd agree to keeping it on by default (and it can be selectively turned off 
> for a
> particular table using the reloption, if needed).

Thanks, attaching a new version of the patch that keeps the default value of the
GUC option ON.

Best regards,
houzj


v26-0003-Parallel-SELECT-for-INSERT-INTO-.-SELECT-advanced-tests.patch
Description:  v26-0003-Parallel-SELECT-for-INSERT-INTO-.-SELECT-advanced-tests.patch


v26-0002-Add-new-GUC-option-enable_parallel_insert-boolean.patch
Description:  v26-0002-Add-new-GUC-option-enable_parallel_insert-boolean.patch


RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-11 Thread houzj.f...@fujitsu.com
> On Thu, Mar 11, 2021 at 01:01:42PM +0000, houzj.f...@fujitsu.com wrote:
> > > I guess to have the finer granularity we'd have to go with
> > > enable_parallel_insert, which then would mean possibly having to
> > > later add enable_parallel_update, should parallel update have
> > > similar potential overhead in the parallel-safety checks (which to
> > > me, looks like it could, and parallel delete may not ...)
> > >
> > > It's a shame there is no "set" type for GUC options.
> > > e.g.
> > > enable_parallel_dml='insert,update'
> > > Maybe that's going too far.
> 
> Isn't that just GUC_LIST_INPUT ?
> I'm not sure why it'd be going to far ?
> 
> The GUC-setting assign hook can parse the enable_parallel_dml_list value set
> by the user, and set an internal int/bits enable_parallel_dml variable with 
> some
> define/enum values like:
> 
> GUC_PARALLEL_DML_INSERT 0x01
> GUC_PARALLEL_DML_DELETE 0x02
> GUC_PARALLEL_DML_UPDATE 0x04
> 
> The namespace.c assign hook is a good prototype for this.  The parsed,
> integer GUC can probably be a static variable in clauses.c.
> 
> Then, the planner can check if:
> |commandType == CMD_INSERT &&
> |   (enable_parallel_dml & GUC_PARALLEL_DML_INSERT) != 0
> [...]
> 
> +  this table. When enabled (and provided that
> +   is also
> + true),

I think this idea works, but we still need to consider the reloption.
After looking into reloptions, I think postgres does not have a list-like type
for reloptions, and I think it's better for the GUC and the reloption to be
consistent.

Besides, a list-type GUC option that only supports one valid value, 'insert',
seems a little weird to me (we only support parallel insert for now).

So, I tend to keep the current style of GUC option.
If it turns out that we do need the same option to restrict update/delete, we
can improve this in the future.
What do you think?
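
As a usage sketch of the current design (the reloption name
parallel_insert_enabled appears in the patch; the table name is illustrative):
-
set enable_parallel_insert = on;                          -- the GUC
alter table big_tbl set (parallel_insert_enabled = off);  -- per-table override
-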

Best regards,
houzj




RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-12 Thread houzj.f...@fujitsu.com
> > The problem is that target_rel_trigger_max_parallel_hazard and its
> > caller think they can use a relcache TriggerDesc field across other
> > cache accesses, which they can't because the relcache doesn't
> > guarantee that that won't move.
> >
> > One approach would be to add logic to RelationClearRelation similar to
> > what it does for tupdescs, rules, etc, to avoid moving them when their
> > contents haven't changed.  But given that we've not needed that for
> > the past several decades, I'm disinclined to add the overhead.  I
> > think this code ought to be adjusted to not make its own copy of the
> > trigdesc pointer, but instead fetch it out of the relcache struct each
> > time it is accessed.  There's no real reason why
> > target_rel_trigger_max_parallel_hazard shouldn't be passed the
> > (stable) Relation pointer instead of just the trigdesc pointer.
> >
> 
> I have attached a patch to fix the issue, based on your suggestion (tested 
> with
> CLOBBER_CACHE_ALWAYS defined).
> 
> > BTW, having special logic for FK triggers in
> > target_rel_trigger_max_parallel_hazard seems quite loony to me.
> > Why isn't that handled by setting appropriate proparallel values for
> > those trigger functions?
> >
> 
> ... and also attached a patch to update the code for this issue.
> 
> (2nd patch relies on application of the 1st patch)
> 
> Thanks again for pointing out these problems.

I have tested the TriggerDesc bugfix patch with the CLOBBER_CACHE_ALWAYS flag.
It passed the test sets that failed in the buildfarm (foreign_key, foreign_data).

The patch LGTM.

Best regards,
houzj



RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-12 Thread houzj.f...@fujitsu.com
> On Fri, Mar 12, 2021 at 6:10 AM Justin Pryzby  
> wrote:
> > Note also this CF entry
> > https://commitfest.postgresql.org/32/2987/
> > | Allow setting parallel_workers on partitioned tables
> 
> +/*
> + * PartitionedOptions
> + * Contents of rd_options for partitioned tables
> + */
> +typedef struct PartitionedOptions
> +{
> +   int32   vl_len_;/* varlena header (do not touch directly!) */
> +   boolparallel_insert_enabled;/* enables planner's use
> of parallel insert */
> +} PartitionedOptions;
> 
> houzj, could you please consider naming the struct PartitionedTableRdOptions
> as the patch for adding parallel_workers option to partitioned tables does?

Thanks for the reminder.
I agree that "PartitionedTableRdOptions" is better.

Attaching a new version of the patch with this change.

Best regards,
houzj


v27-0003-Parallel-SELECT-for-INSERT-INTO-.-SELECT-advanced-tests.patch
Description:  v27-0003-Parallel-SELECT-for-INSERT-INTO-.-SELECT-advanced-tests.patch


v27-0002-Add-new-GUC-option-enable_parallel_insert-boolean.patch
Description:  v27-0002-Add-new-GUC-option-enable_parallel_insert-boolean.patch


RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-15 Thread houzj.f...@fujitsu.com
> > Attaching new version patch with this change.
> >
> 
> Thanks, the patch looks good to me. I have made some minor cosmetic
> changes in the attached. I am planning to push this by tomorrow unless you or
> others have any more comments or suggestions for this patch.

Thanks for the review.
I have no more comments.
BTW, I have done the final test on the patch and all of the tests passed.

Best regards,
houzj


RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table

2021-03-16 Thread houzj.f...@fujitsu.com
> To support parallel insert into an FK relation, there are two scenarios that
> need attention:
> 1) the foreign key and primary key are on the same table (INSERT's target table),
>    i.e. referenced and referencing are the same table
> 2) the referenced and referencing tables are both partitions of INSERT's target
> table.
> (These cases seem really rare to me.)
> 
> In these two cases, the referenced table could be modified while INSERTing, and
> CCI is necessary, so I think we should treat these cases as parallel restricted
> while doing the safety check.
> 
> Attaching a V1 patch that implements the above feature and passes the
> regression tests.

Attaching a rebased patch based on HEAD.

Best regards,
houzj



v2-0002-extend-safery-check-for-fk-relation.patch
Description: v2-0002-extend-safery-check-for-fk-relation.patch


v2-0001-skip-cci.patch
Description: v2-0001-skip-cci.patch


RE: Data is copied twice when specifying both child and parent table in publication

2021-12-02 Thread houzj.f...@fujitsu.com
On Thursday, December 2, 2021 12:50 PM Amit Kapila  
wrote:
> On Thu, Dec 2, 2021 at 9:41 AM Greg Nancarrow 
> wrote:
> >
> > On Thu, Dec 2, 2021 at 1:48 PM Greg Nancarrow 
> wrote:
> > >
> > > If you updated my original description to say "(instead of just the
> > > individual partitions)", it would imply the same I think.
> > > But I don't mind if you want to explicitly state both cases to make it 
> > > clear.
> > >
> >
> > For example, something like:
> >
> > For publications of partitioned tables with publish_via_partition_root
> > set to true, only the partitioned table (and not its partitions) is
> > included in the view, whereas if publish_via_partition_root is set to
> > false, only the individual partitions are included in the view.
> >
> 
> Yeah, that sounds good to me.

It looks good to me as well.
Attached are the patches for (HEAD~13), which merge the suggested doc change. I
prepared the code patch and the test patch separately to make it easier for the
committer to confirm.

Best regards,
Hou zj




HEAD-0001-Fix-double-publish-of-child-table-s-data.patch
Description: HEAD-0001-Fix-double-publish-of-child-table-s-data.patch


PG14-0001-Fix-double-publish-of-child-table-s-data.patch
Description: PG14-0001-Fix-double-publish-of-child-table-s-data.patch


PG13-0001-Fix-double-publish-of-child-table-s-data.patch
Description: PG13-0001-Fix-double-publish-of-child-table-s-data.patch


HEAD-0002-testcases.patch
Description: HEAD-0002-testcases.patch


PG13-0002-testcases.patch
Description: PG13-0002-testcases.patch


PG14-0002-testcases.patch
Description: PG14-0002-testcases.patch


RE: Data is copied twice when specifying both child and parent table in publication

2021-12-02 Thread houzj.f...@fujitsu.com
On Thursday, December 2, 2021 4:54 PM houzj.f...@fujitsu.com 
 wrote:
> On Thursday, December 2, 2021 12:50 PM Amit Kapila
>  wrote:
> > On Thu, Dec 2, 2021 at 9:41 AM Greg Nancarrow 
> > wrote:
> > >
> > > On Thu, Dec 2, 2021 at 1:48 PM Greg Nancarrow 
> > wrote:
> > > >
> > > > If you updated my original description to say "(instead of just
> > > > the individual partitions)", it would imply the same I think.
> > > > But I don't mind if you want to explicitly state both cases to make it 
> > > > clear.
> > > >
> > >
> > > For example, something like:
> > >
> > > For publications of partitioned tables with
> > > publish_via_partition_root set to true, only the partitioned table
> > > (and not its partitions) is included in the view, whereas if
> > > publish_via_partition_root is set to false, only the individual 
> > > partitions are
> included in the view.
> > >
> >
> > Yeah, that sounds good to me.
> 
> It looks good to me as well.
> Attach the patches for (HEAD~13) which merge the suggested doc change. I
> prepared the code patch and test patch separately to make it easier for
> committer to confirm.

It seems we might not need to backpatch the doc change, so I've attached
another version which removes the doc changes from the back-branch patches.

Best regards,
Hou zj



HEAD-0001-Fix-double-publish-of-child-table-s-data.patch
Description: HEAD-0001-Fix-double-publish-of-child-table-s-data.patch


PG14-0001-Fix-double-publish-of-child-table-s-data.patch
Description: PG14-0001-Fix-double-publish-of-child-table-s-data.patch


PG13-0001-Fix-double-publish-of-child-table-s-data.patch
Description: PG13-0001-Fix-double-publish-of-child-table-s-data.patch


HEAD-0002-testcases.patch
Description: HEAD-0002-testcases.patch


PG13-0002-testcases.patch
Description: PG13-0002-testcases.patch


PG14-0002-testcases.patch
Description: PG14-0002-testcases.patch


HEAD-0003-improve-the-doc-for-pg_publication_tables.patch
Description: HEAD-0003-improve-the-doc-for-pg_publication_tables.patch


RE: parallel vacuum comments

2021-12-03 Thread houzj.f...@fujitsu.com
On Thu, Dec 2, 2021 8:31 PM Masahiko Sawada  wrote:
> I've attached updated patches.
> 
> The first patch is the main patch for refactoring parallel vacuum code; 
> removes
> bitmap-related code and renames function names for consistency. The second
> patch moves these parallel-related codes to vacuumparallel.c as well as
> common functions that are used by both lazyvacuum.c and vacuumparallel.c to
> vacuum.c. The third patch adds regression tests for parallel vacuum on
> different kinds of indexes with multiple index scans. Please review them.

Thanks for updating the patch.
I reviewed the 0001 patch and didn't find any big issues in it.

I only have a personal suggestion about the following function names:

parallel_vacuum_process_unsafe_indexes
parallel_vacuum_index_is_parallel_safe

It seems that not only "unsafe" indexes are processed by these functions, but
also indexes which are unsuitable (based on parallel_vacuum_should_skip_index).
So, it might be clearer to avoid "unsafe" in the names. Maybe we can use
"xx_in_leader" or "can_participate".

Best regards,
Hou zj
 






RE: pg_get_publication_tables() output duplicate relid

2021-12-05 Thread houzj.f...@fujitsu.com
On Thursday, December 2, 2021 9:48 PM Amit Langote  
wrote:
> On Thu, Dec 2, 2021 at 2:27 PM Amit Kapila  wrote:
> > On Thu, Dec 2, 2021 at 8:42 AM Amit Langote 
> wrote:
> > > Reading Alvaro's email at the top again gave me a pause to
> > > reconsider what I had said in reply.  It might indeed have been nice
> > > if the publication DDL itself had prevented the cases where a
> > > partition and its ancestor are added to a publication, though as
> > > Hou-san mentioned, partition ATTACHes make that a bit tricky to
> > > enforce at all times, though of course not impossible.  Maybe it's
> > > okay to just de-duplicate pg_publication_tables output as the patch
> > > does, even though it may appear to be a band-aid solution if we one
> > > considers Alvaro's point about consistency.
> >
> > True, I think even if we consider that idea it will be a much bigger
> > change, and also as it will be a behavioral change so we might want to
> > keep it just for HEAD and some of these fixes need to be backpatched.
> > Having said that, if you or someone want to pursue that idea and come
> > up with a better solution than what we have currently it is certainly
> > welcome.
> 
> Okay, I did write a PoC patch this morning after sending out my earlier 
> email.  I
> polished it a bit, which is attached.

After thinking more on this, I find there might be some use cases for adding
both the child and the parent to a publication.

For the future row filter feature (and maybe column filter), it could be
useful for users to add the child and parent with different filter
expressions. If pubviaroot=true, the user can expect the parent's filter to
take effect; if pubviaroot=false, they can expect the child's filter to take
effect.
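
For example (a hypothetical sketch, assuming the proposed row filter syntax;
the table and publication names here are made up):

create table parent (a int) partition by range (a);
create table child partition of parent for values from (1) to (100);

-- pubviaroot=true: the parent's filter (a > 50) is expected to take effect
create publication pub_root for table parent where (a > 50)
  with (publish_via_partition_root = true);

-- pubviaroot=false: the child's filter (a > 10) is expected to take effect
create publication pub_leaf for table child where (a > 10);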

If we disallow adding both child and parent to a publication, it could be
harder to implement these features.

So, personally, I am not sure disallowing adding both the child and parent to
a publication is a good idea.

Best regards,
Hou zj


RE: parallel vacuum comments

2021-12-07 Thread houzj.f...@fujitsu.com
On Tuesday, December 7, 2021 1:42 PM Masahiko Sawada  
wrote:
> I've attached an updated patch. I've removed 0003 patch that added
> regression tests as per discussion. Regarding the terminology like "bulkdel"
> and "cleanup" you pointed out, I've done that in 0002 patch while moving the
> code to vacuumparallel.c. In that file, we can consistently use the terms
> "bulkdel" and "cleanup" instead of "vacuum"
> and "cleanup".
Hi,

Thanks for updating the patch.
I noticed a few minor things.

0001
1)

 * Skip processing indexes that are unsafe for workers (these are
-* processed in do_serial_processing_for_unsafe_indexes() by leader)
+* processed in parallel_vacuum_process_unsafe_indexes() by leader)

It might be clearer to mention that the indexes to be skipped are unsafe OR
not worthwhile.

2)
+   /* Set index vacuum status and mark as parallel safe or not */
+   for (int i = 0; i < pvc->nindexes; i++)
+   {
...
+   pindstats->parallel_workers_can_process =
+       parallel_vacuum_index_is_parallel_safe(vacrel,
+                                              vacrel->indrels[i],
+                                              vacuum);

For the comment above the loop, it might be better to mention that we are
marking whether a worker can process the index (not only whether it is
safe/unsafe).

0002
3)

+   /*
+    * Skip indexes that are unsuitable target for parallel index vacuum
+    */
+   if (parallel_vacuum_should_skip_index(indrel))
+       continue;
+
+

It seems we can use will_parallel_vacuum[] here instead of invoking the function
again.

Best regards,
Hou zj


RE: row filtering for logical replication

2021-12-08 Thread houzj.f...@fujitsu.com
On Wednesday, December 8, 2021 7:52 PM Ajin Cherian 
> On Tue, Dec 7, 2021 at 5:36 PM Peter Smith  wrote:
> >
> > We were mid-way putting together the next v45* when your latest
> > attachment was posted over the weekend. So we will proceed with our
> > original plan to post our v45* (tomorrow).
> >
> > After v45* is posted we will pause to find what are all the
> > differences between your unified patch and our v45* patch set. Our
> > intention is to integrate as many improvements as possible from your
> > changes into the v46* etc that will follow tomorrow’s v45*. On some
> > points, we will most likely need further discussion.
> 
> 
> Posting an update for review comments, using contributions majorly from
> Peter Smith.
> I've also included changes based on Euler's combined patch, specially changes
> to documentation and test cases.
> I have left out Hou-san's 0005, in this patch-set. Hou-san will provide a 
> rebased
> update based on this.
> 
> This patch addresses the following review comments:

Hi,

Thanks for updating the patch.
I noticed a possible issue.

+   /* Check row filter. */
+   if (!pgoutput_row_filter(data, relation, oldtuple, NULL, relentry))
+       break;
+
+   maybe_send_schema(ctx, change, relation, relentry);
+
    /* Switch relation if publishing via root. */
    if (relentry->publish_as_relid != RelationGetRelid(relation))
    {
...
        /* Convert tuple if needed. */
        if (relentry->map)
            tuple = execute_attr_map_tuple(tuple, relentry->map);

Currently, we execute the row filter before converting the tuple. I think it
could produce a wrong result if we are executing the parent table's row filter
and the column order of the parent table is different from the child table's.
For example:


create table parent(a int primary key, b int) partition by range (a);
create table child (b int, a int primary key);
alter table parent attach partition child default;
create publication pub for table parent where (a>10)
  with (PUBLISH_VIA_PARTITION_ROOT);

The attribute number of 'a' is 1 in the filter expression, while column 'a' is
the second attribute in the original (child) tuple. I think we need to execute
the filter expression after converting the tuple.
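
To illustrate (the inserted values here are made up):

-- the child tuple is (b=1, a=11); the parent's filter (a > 10) should accept
-- it, but evaluated before conversion the filter reads the child's first
-- attribute (b = 1) and wrongly discards the row
insert into child (b, a) values (1, 11);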

Best regards,
Hou zj


RE: parallel vacuum comments

2021-12-13 Thread houzj.f...@fujitsu.com
On Tuesday, December 14, 2021 10:11 AM Tang, Haiying wrote:
> On Monday, December 13, 2021 2:12 PM Masahiko Sawada
>  wrote:
> > I've attached the patch. I've just moved some functions back but not
> > done other changes.
> >
> 
> Thanks for your patch.
> 
> I tested your patch and tried some cases, like large indexes and different
> types of indexes, and it worked well.

+1, the patch looks good to me.

Best regards,
Hou zj


RE: pg_get_publication_tables() output duplicate relid

2021-12-13 Thread houzj.f...@fujitsu.com
On Sat, Nov 20, 2021 7:31 PM Amit Kapila  wrote:
> On Fri, Nov 19, 2021 at 10:58 AM Amit Kapila 
> wrote:
> >
> > On Fri, Nov 19, 2021 at 7:19 AM Amit Langote 
> wrote:
> > >
> > > The problematic case is attaching the partition *after* the
> > > subscriber has already marked the root parent as synced and/or ready
> > > for replication.  Refreshing the subscription doesn't help it
> > > discover the newly attached partition, because a
> > > publish_via_partition_root only ever tells about the root parent,
> > > which would be already synced, so the subscriber would think there's
> > > nothing to copy.
> > >
> >
> > Okay, I see this could be a problem but I haven't tried to reproduce it.
> 
> One more thing you mentioned is that the initial sync won't work after refresh
> but later changes will be replicated but I noticed that later changes also 
> don't
> get streamed till we restart the subscriber server.
> I am not sure but we might not be invalidating apply workers cache due to
> which it didn't notice the same.

I investigated this bug recently, and I think the reason is that when
receiving a relcache invalidation message, the callback function [1] in the
walsender only resets the schema-sent status but doesn't reset the
replicate_valid flag. So, it won't rebuild the relation's publication actions.

[1]
static void
rel_sync_cache_relation_cb(Datum arg, Oid relid)
...
/*
 * Reset schema sent status as the relation definition may have changed.
 * Also free any objects that depended on the earlier definition.
 */
if (entry != NULL)
{
entry->schema_sent = false;
list_free(entry->streamed_txns);
...

Also, when you DETACH a partition, the publication info won't be rebuilt
either, for the same reason, which could cause unexpected behavior if we
modify the detached table's data. The bug happens regardless of whether
pubviaroot is set or not.
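
A minimal sketch of the ATTACH scenario described above (the connection string
and object names are placeholders):

-- publisher
create table t (a int) partition by range (a);
create table t_1 partition of t for values from (1) to (10);
create publication pub for table t with (publish_via_partition_root = true);

-- subscriber (matching tables assumed to exist)
create subscription sub connection 'dbname=postgres' publication pub;

-- publisher, later
create table t_2 (a int);
alter table t attach partition t_2 for values from (10) to (20);
insert into t values (15);

-- subscriber: the new partition's rows are not copied, and later changes are
-- not streamed until the walsender's cache entry is rebuilt
alter subscription sub refresh publication;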

For the fix:

I think if we also reset the replicate_valid flag in
rel_sync_cache_relation_cb, the bug can be fixed. I have some hesitation about
this approach, because it could increase the frequency of invalidating and
rebuilding the publication actions, but I haven't come up with a better
approach.

Another possibility could be to add a syscache callback function for the
pg_inherits table, but we currently don't have a syscache for pg_inherits. We
would need to add that cache first, which doesn't seem better than the above
approach.

What do you think?

Attached is an initial patch which resets the replicate_valid flag in
rel_sync_cache_relation_cb and adds some reproduction tests in
013_partition.pl.

Best regards,
Hou zj





0001-fix-replication-after-ATTACH-or-DETACH-partition.patch
Description:  0001-fix-replication-after-ATTACH-or-DETACH-partition.patch


RE: Failed transaction statistics to measure the logical replication progress

2021-12-14 Thread houzj.f...@fujitsu.com
On Tue, Dec 14, 2021 10:28 AM Amit Kapila  wrote:
> On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Monday, December 13, 2021 6:19 PM Amit Kapila
>  wrote:
> > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > >
> > > Few questions and comments:
> > Thank you for your comments !
> >
> > > 
> > > 1.
> > > one row per subscription worker on which errors have occurred, for 
> > > workers
> > > applying logical replication changes and workers handling the initial 
> > > data
> > > -   copy of the subscribed tables.  The statistics entry is removed when 
> > > the
> > > -   corresponding subscription is dropped.
> > > +   copy of the subscribed tables. Also, the row corresponding to the 
> > > apply
> > > +   worker shows all transaction statistics of both types of workers on 
> > > the
> > > +   subscription. The statistics entry is removed when the corresponding
> > > +   subscription is dropped.
> > >
> > > Why did you choose to show stats for both types of workers in one row?
> > This is because if we have hundreds or thousands of tables for table sync,
> > we need to create many entries to cover them and store the entries for all
> > tables.
> >
> 
> If we fear a large number of entries for such workers then won't it be
> better to show the value of these stats for apply workers. I
> think normally the table sync workers perform only copy operation or
> maybe a fixed number of xacts, so, one might not be interested in the
> transaction stats of these workers. I find merging only specific stats
> of two different types of workers confusing.
> 
> What do others think about this?

Personally, I agree that merging the two types of stats into one row might not
be a good idea. And the xact stats of table sync workers are usually less
interesting than the apply worker's, so it seems acceptable to me if we show
stats only for apply workers and document this.

Best regards,
Hou zj


RE: parallel vacuum comments

2021-12-15 Thread houzj.f...@fujitsu.com
On Wed, Dec 15, 2021 4:03 PM Masahiko Sawada  wrote:
> On Tue, Dec 14, 2021 at 12:03 PM Amit Kapila  wrote:
> > There is still pending
> > work related to moving parallel vacuum code to a separate file and a
> > few other pending comments that are still under discussion. We can
> > take care of those in subsequent patches. Do, let me know if you or
> > others think differently?
> 
> I'm on the same page.
> 
> I've attached an updated patch. The patch incorporated several changes from
> the last version:
> 
> * Rename parallel_vacuum_begin() to parallel_vacuum_init()
> * Unify the terminology; use "index bulk-deletion" and "index cleanup"
> instead of "index vacuum" and "index cleanup".
> * Fix the comment of parallel_vacuum_init() pointed out by Andres
> * Fix a typo that is left in commit 22bd3cbe0c (pointed out by Hou)
> 
> Please review it.

Thanks for updating the patch.
Here are a few comments:

1)
+   case PARALLEL_INDVAC_STATUS_NEED_CLEANUP:
+       errcontext("while cleanup index \"%s\" of relation \"%s.%s\"",

I noticed the current code uses the error message "while cleaning up index
xxx", which is slightly different from the patch's; maybe we can keep the
previous wording?

2)
static inline Size max_items_to_alloc_size(int max_items);

This old function declaration can be deleted.

3)
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list

I think we need to remove LVShared, LVSharedIndStats, LVDeadItems and
LVParallelState from typedefs.list and add PVShared and PVIndStats to the file.

Best regards,
Hou zj


RE: row filtering for logical replication

2021-12-16 Thread houzj.f...@fujitsu.com
On Mon, Dec 13, 2021 5:49 PM Peter Smith  wrote:
> PSA the v46* patch set.
> 
> Here are the main differences from v45:
> 0. Rebased to HEAD
> 1. Integrated many comments, docs, messages, code etc from Euler's patch
> [Euler 6/12] 2. Several bugfixes 3. Patches are merged/added
> 
> ~~
> 
> Bugfix and Patch Merge details:
> 
> v46-0001 (main)
> - Merged from v45-0001 (main) + v45-0005 (exprstate)
> - Fix for mem leak reported by Greg (off-list)
> 
> v46-0002 (validation)
> - Merged from v45-0002 (node validation) + v45-0006 (replica identity
> validation)
> 
> v46-0003
> - Rebased from v45-0003
> - Fix for partition column order [Houz 9/12]
> - Fix for core dump reported by Tang (off-list)
> 
> v46-0004 (tab-complete and dump)
> - Rebased from v45-0004
> 
> v46-0005 (for all tables)
> - New patch
> - Fix for FOR ALL TABLES [Tang 7/12]
> 

Thanks for updating the patch.

When reviewing the patch, I found that it allows using system columns in row
filter expressions.
---
create publication pub for table test WHERE ('(0,1)'::tid=ctid);
---

Since we can't create an index on a system column, and most existing
expression features (index expressions, partition expressions, table
constraints) don't allow using system columns, I think it might be better to
disallow system columns when creating or altering a publication. We could
check it like this:

rowfilter_walker(Node *node, Relation relation)
...
    if (var->varattno < 0)
        ereport(ERROR,
                errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
                errmsg("cannot use system column \"%s\" in row filter expression",
...

Best regards,
Hou zj


RE: Column Filtering in Logical Replication

2021-12-16 Thread houzj.f...@fujitsu.com
On Tue, Dec 14, 2021 1:48 AM Alvaro Herrera  wrote:
> Hmm, I messed up the patch file I sent.  Here's the complete patch.
> 

Hi,

I have a minor question about the replica identity check of this patch.

+check_publication_add_relation(Relation targetrel, Bitmapset *columns)
...
+   idattrs = RelationGetIndexAttrBitmap(targetrel,
+                                        INDEX_ATTR_BITMAP_IDENTITY_KEY);
+   if (!bms_is_subset(idattrs, columns))
+       ereport(ERROR,
+               errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
+               errmsg("invalid column list for publishing relation \"%s\"",
+                      RelationGetRelationName(targetrel)),
+               errdetail("All columns in REPLICA IDENTITY must be present in the column list."));
+
+

The patch ensures that all replica identity columns are in the column list at
CREATE/ALTER PUBLICATION time, but it seems it doesn't prevent the user from
changing the replica identity afterwards, or from dropping the index used as
the replica identity. Do we also need to check those cases?
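
For example (a hypothetical sketch, assuming the proposed column list syntax;
the index and publication names are made up):

create table t (a int not null, b int not null, c int);
create unique index t_a_idx on t (a);
create unique index t_ab_idx on t (a, b);
alter table t replica identity using index t_a_idx;
create publication pub for table t (a, c);  -- passes: RI column (a) is in the list

-- but afterwards, nothing stops either of these:
alter table t replica identity using index t_ab_idx;  -- RI (a, b) not covered by (a, c)
drop index t_ab_idx;                                  -- or dropping the RI index itself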

Best regards,
Hou zj


RE: Column Filtering in Logical Replication

2021-12-17 Thread houzj.f...@fujitsu.com
On Friday, December 17, 2021 1:55 AM Alvaro Herrera  
wrote:
> On 2021-Dec-16, houzj.f...@fujitsu.com wrote:
> 
> > The patch ensures all columns of RT are in column list when
> > CREATE/ALTER publication, but it seems doesn't prevent user from
> > changing the replica identity or dropping the index used in replica
> > identity. Do we also need to check those cases ?
> 
> Yes, we do.  As it happens, I spent a couple of hours yesterday writing code 
> for
> that, at least partially.  I haven't yet checked what happens with cases like
> REPLICA NOTHING, or REPLICA INDEX  and then dropping that index.
> 
> My initial ideas were a bit wrong BTW: I thought we should check the
> combination of column lists in all publications (a bitwise-OR of column 
> bitmaps,
> so to speak).  But conceptually that's wrong: we need to check the column list
> of each publication individually instead.  Otherwise, if you wanted to hide a
> column from some publication but that column was part of the replica identity,
> there'd be no way to identify the tuple in the replica.  (Or, if the pgouput 
> code
> disobeys the column list and sends the replica identity even if it's not in 
> the
> column list, then you'd be potentially publishing data that you wanted to 
> hide.)

Thanks for the explanation.

Apart from ALTER REPLICA IDENTITY and DROP INDEX, I think there could be
some other cases we need to handle for the replica identity check:

1)
When adding a partitioned table with a column list to a publication, I think
we need to check the RI of all its leaf partitions, because the RI on the
partition is the one that actually takes effect.

2)
ALTER TABLE ADD PRIMARY KEY;
ALTER TABLE DROP CONSTRAINT "primary key";

If the replica identity is default, it will use the primary key. We might also
need to prevent the user from adding or removing a primary key in this case
(see the sketch below).
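
A hypothetical sketch of case 2), assuming the proposed column list syntax:

create table t (a int not null, b int not null);
alter table t add primary key (a);       -- default RI now follows the PK (a)
create publication pub for table t (a);  -- passes the check
alter table t drop constraint t_pkey;    -- RI columns change under us
alter table t add primary key (a, b);    -- new PK (a, b) is not covered by (a)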


Based on the above cases, the RI check could bring a considerable amount of
code. So, how about we follow what we already did in CheckCmdReplicaIdentity():
we can put the RI check in that function, so that we cover all the cases and
reduce the code change. And if we are worried about the cost of doing the
check for every UPDATE and DELETE, we can also save the result in the
relcache. That's safe because every operation that changes the RI invalidates
the relcache. We are using this approach in the row filter patch to make sure
all columns in the row filter expression are part of the RI.

Best regards,
Hou zj


RE: row filtering for logical replication

2021-12-19 Thread houzj.f...@fujitsu.com


> -Original Message-
> From: Amit Kapila 
On Saturday, December 18, 2021 10:33 AM
> On Fri, Dec 17, 2021 at 5:29 PM Greg Nancarrow 
> wrote:
> >
> > On Fri, Dec 17, 2021 at 7:20 PM Ajin Cherian  wrote:
> > >
> > > On Fri, Dec 17, 2021 at 5:46 PM Greg Nancarrow 
> wrote:
> > >
> > > > So using the v47 patch-set, I still find that the UPDATE above results 
> > > > in
> publication of an INSERT of (2,1), rather than an UPDATE of (1,1) to (2,1).
> > > > This is according to the 2nd UPDATE rule below, from patch 0003.
> > > >
> > > > + * old-row (no match)new-row (no match)  -> (drop change)
> > > > + * old-row (no match)new row (match) -> INSERT
> > > > + * old-row (match)   new-row (no match)  -> DELETE
> > > > + * old-row (match)   new row (match) -> UPDATE
> > > >
> > > > This is because the old row (1,1) doesn't match the UPDATE filter 
> > > > "(a>1)",
> but the new row (2,1) does.
> > > > This functionality doesn't seem right to me. I don't think it can be
> assumed that (1,1) was never published (and thus requires an INSERT rather 
> than
> UPDATE) based on these checks, because in this example, (1,1) was previously
> published via a different operation - INSERT (and using a different filter 
> too).
> > > > I think the fundamental problem here is that these UPDATE rules assume
> that the old (current) row was previously UPDATEd (and published, or not
> published, according to the filter applicable to UPDATE), but this is not
> necessarily the case.
> > > > Or am I missing something?
> > >
> > > But it need not be correct in assuming that the old-row was part of
> > > a previous INSERT either (and published, or not published according
> > > to the filter applicable to an INSERT).
> > > For example, change the sequence of inserts and updates prior to the
> > > last update:
> > >
> > > truncate tbl1 ;
> > > insert into tbl1 values (1,5); ==> not replicated since insert and !
> > > (b < 2); update tbl1 set b = 1; ==> not replicated since update and
> > > ! (a > 1) update tbl1 set a = 2; ==> replicated and update converted
> > > to insert since (a > 1)
> > >
> > > In this case, the last update "update tbl1 set a = 2; " is updating
> > > a row that was previously updated and not inserted and not
> > > replicated to the subscriber.
> > > How does the replication logic differentiate between these two
> > > cases, and decide if the update was previously published or not?
> > > I think it's futile for the publisher side to try and figure out the
> > > history of published rows. In fact, if this level of logic is
> > > required then it is best implemented on the subscriber side, which
> > > then defeats the purpose of a publication filter.
> > >
> >
> > I think it's a concern, for such a basic example with only one row,
> > getting unpredictable (and even wrong) replication results, depending
> > upon the order of operations.
> >
> 
> I am not sure how we can deduce that. The results are based on current and
> new values of row which is what I think we are expecting here.
> 
> > Doesn't this problem result from allowing different WHERE clauses for
> > different pubactions for the same table?
> > My current thoughts are that this shouldn't be allowed, and also WHERE
> > clauses for INSERTs should, like UPDATE and DELETE, be restricted to
> > using only columns covered by the replica identity or primary key.
> >
> 
> Hmm, even if we do that one could have removed the insert row filter by the
> time we are evaluating the update. So, we will get the same result. I think 
> the
> behavior in your example is as we expect as per the specs defined by the patch
> and I don't see any problem, in this case, w.r.t replication results. Let us 
> see
> what others think on this?

I think it might not be hard to predict the current behavior. Users only need
to be aware that:
1) pubactions and row filters on different publications are combined with 'OR'.
2) For UPDATE, we execute the filter for both the OLD and NEW tuples and
   change the operation type accordingly.

For the example mentioned:
create table tbl1 (a int primary key, b int);
create publication A for table tbl1 where (b<2) with(publish='insert');
create publication B for table tbl1 where (a>1) with(publish='update');

If we follow rules 1) and 2), I feel we are able to predict which changes will
be replicated, with the following condition:
--
WHERE (action = 'insert' AND b < 2) OR (action = 'update' AND a > 1)
--

So, it seems acceptable to me.

Personally, I think the current design gives users more flexibility to handle
complex scenarios. If users want a simple publication setup, they can also set
the same row filter for the same table in different publications. To avoid
confusion, I think we should document these rules clearly.

BTW, judging from IBM's documentation, IBM also supports this kind of complex
condition [1].
[1] https://www.ibm.com/docs/en/idr/11.4.0?topic=rows-log-record-variables

Best regards,
Hou zj


relcache not invalidated when ADD PRIMARY KEY USING INDEX

2021-12-19 Thread houzj.f...@fujitsu.com
Hi,

When reviewing some replica identity related patches, I found that when adding
a primary key using an existing unique index on NOT NULL columns, the target
table's relcache isn't invalidated.

This causes an error when REPLICA IDENTITY is default and we UPDATE/DELETE a
published table, because we refer to RelationData::rd_pkindex to check whether
the UPDATE or DELETE can be safely executed in this case.

---reproduction steps
CREATE TABLE test(a int not null, b int);
CREATE UNIQUE INDEX a ON test(a);
CREATE PUBLICATION PUB for TABLE test;
UPDATE test SET a = 2;
ERROR:  cannot update table "test" because it does not have a replica 
identity and publishes updates
HINT:  To enable updating the table, set REPLICA IDENTITY using ALTER 
TABLE.

alter table test add primary key using index a;
UPDATE test SET a = 2;
ERROR:  cannot update table "test" because it does not have a replica 
identity and publishes updates
HINT:  To enable updating the table, set REPLICA IDENTITY using ALTER 
TABLE.
---

I think the bug exists in HEAD ~ PG11, after commit f66e8bf which removed
relhaspkey from pg_class. In PG10, when adding a primary key, we always
updated relhaspkey in pg_class, which invalidated the relcache, so it was OK.

I tried to write a patch that fixes this by invalidating the relcache after
marking the primary key in index_constraint_create().

Best regards,
Hou zj


0001-Invalid-relcache-when-ADD-PRIMARY-KEY-USING-INDEX.patch
Description:  0001-Invalid-relcache-when-ADD-PRIMARY-KEY-USING-INDEX.patch


RE: parallel vacuum comments

2021-12-21 Thread houzj.f...@fujitsu.com
On Wed, Dec 22, 2021 8:38 AM Masahiko Sawada  wrote:
> On Tue, Dec 21, 2021 at 10:24 PM Amit Kapila 
> wrote:
> >
> > On Tue, Dec 21, 2021 at 11:24 AM Masahiko Sawada
>  wrote:
> > >
> > > On Tue, Dec 21, 2021 at 2:04 PM Amit Kapila 
> wrote:
> > > >
> > >
> > > Thank you for the comment. Agreed.
> > >
> > > I've attached updated version patches. Please review them.
> > >
> >
> > These look mostly good to me. Please find attached the slightly edited
> > version of the 0001 patch. I have modified comments, ran pgindent, and
> > modify the commit message. I'll commit this tomorrow if you are fine
> > with it.
> 
> It looks good to me.
+1, the patch passes check-world and looks good to me.

Best regards,
Hou zj



RE: parallel vacuum comments

2021-12-22 Thread houzj.f...@fujitsu.com
On Wed, Dec 22, 2021 11:36 AM Masahiko Sawada  wrote:
> On Tue, Dec 21, 2021 at 10:24 PM Amit Kapila 
> wrote:
> >
> > On Tue, Dec 21, 2021 at 11:24 AM Masahiko Sawada
>  wrote:
> > >
> > > On Tue, Dec 21, 2021 at 2:04 PM Amit Kapila 
> wrote:
> > > >
> > >
> > > Thank you for the comment. Agreed.
> > >
> > > I've attached updated version patches. Please review them.
> > >
> >
> > These look mostly good to me. Please find attached the slightly edited
> > version of the 0001 patch. I have modified comments, ran pgindent, and
> > modify the commit message. I'll commit this tomorrow if you are fine
> > with it.
> 
> Thank you for committing the first patch.
> 
> I've attached an updated version second patch. Please review it.
> 
> Regards,

Hi,

The patch looks mostly good to me.
I only have a few comments.

1)
+/*
+ * Do parallel index bulk-deletion with parallel workers.
+ */
+void
+parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs, long num_table_tuples)
+{
+   Assert(!IsParallelWorker());
+

Would it be better to also put Assert(pvs != NULL) here? Because we removed
the Assert(ParallelVacuumIsActive(vacrel)) check that was in the old function.


2)
+#include "utils/rel.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"

It might be better to keep the header files in alphabetical order:
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"

Best regards,
Hou zj


Delay the variable initialization in get_rel_sync_entry

2021-12-22 Thread houzj.f...@fujitsu.com
Hi,

When reviewing some logical replication patches, I noticed that the function
get_rel_sync_entry() always invokes get_rel_relispartition() and
get_rel_relkind() at the beginning, which can cause unnecessary cache access.

---
get_rel_sync_entry(PGOutputData *data, Oid relid)
{
RelationSyncEntry *entry;
boolam_partition = get_rel_relispartition(relid);
charrelkind = get_rel_relkind(relid);
---

The extra cost can sometimes be noticeable, because get_rel_sync_entry() is a
hot function which is executed for each change, and 'am_partition' and
'relkind' are only needed when we have to rebuild the RelationSyncEntry.

Here is the perf result for the case of inserting large amounts of data into
an un-published table, where the cost is noticeable.

--12.83%--pgoutput_change
|--11.84%--get_rel_sync_entry
|--4.76%--get_rel_relispartition
|--4.70%--get_rel_relkind
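
For reference, a sketch of that kind of workload (the table names and row
count here are assumptions, not the exact test):

create table published (a int);
create table unpublished (a int);
create publication pub for table published;
-- with a subscriber attached, insert into the table that is NOT published;
-- the walsender still calls get_rel_sync_entry() for every change it decodes
insert into unpublished select generate_series(1, 10000000);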

So, I think it would be better to do the initialization only when the
RelationSyncEntry is invalid.

Attached is a small patch which delays the initialization.

Thoughts?

Best regards,
Hou zj


0001-Delay-the-variable-initialization-in-get_rel_sync_en.patch
Description:  0001-Delay-the-variable-initialization-in-get_rel_sync_en.patch


RE: parallel vacuum comments

2021-12-22 Thread houzj.f...@fujitsu.com
On Wed, Dec 22, 2021 9:55 PM Amit Kapila  wrote:
> On Wed, Dec 22, 2021 at 6:22 PM Amit Kapila 
> wrote:
> >
> > On Wed, Dec 22, 2021 at 5:39 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > >
> > > 2)
> > > +#include "utils/rel.h"
> > > +#include "utils/lsyscache.h"
> > > +#include "utils/memutils.h"
> > >
> > > It might be better to keep the header file in alphabetical order.
> > > :
> > > +#include "utils/lsyscache.h"
> > > +#include "utils/memutils.h"
> > > +#include "utils/rel.h"
> > >
> >
> > Right, I'll take care of this as I am already making some other edits
> > in the patch.
> >
> 
> Fixed this and made a few other changes in the patch that includes (a) passed
> down the num_index_scans information in parallel APIs based on which it can
> make the decision whether to reinitialize DSM or consider conditional parallel
> vacuum clean up; (b) get rid of first-time variable in ParallelVacuumState as 
> that
> is not required if we have num_index_scans information; (c) there seems to be
> quite a few unnecessary includes in vacuumparallel.c which I have removed; (d)
> unnecessary error callback info was being set in ParallelVacuumState in leader
> backend; (e) changed/added comments at quite a few places.
> 
> Can you please once verify the changes in the attached?

The changes look OK to me.
I tested the patch for multi-pass parallel vacuum cases and ran 'make
check-world'; all the tests passed.

Best regards,
Hou zj





RE: Delay the variable initialization in get_rel_sync_entry

2021-12-24 Thread houzj.f...@fujitsu.com
On Friday, December 24, 2021 8:13 AM Michael Paquier  
wrote:
> 
> On Thu, Dec 23, 2021 at 12:54:41PM -0300, Euler Taveira wrote:
> > On Wed, Dec 22, 2021, at 10:11 AM, houzj.f...@fujitsu.com wrote:
> >> The extra cost could sometimes be noticeable because get_rel_sync_entry is
> a
> >> hot function which is executed for each change. And the 'am_partition' and
> >> 'relkind' are necessary only when we need to rebuild the RelationSyncEntry.
> >>
> >> Here is the perf result for the case when inserted large amounts of data 
> >> into
> a
> >> un-published table in which case the cost is noticeable.
> >>
> >> --12.83%--pgoutput_change
> >> |--11.84%--get_rel_sync_entry
> >> |--4.76%--get_rel_relispartition
> >> |--4.70%--get_rel_relkind
> 
> How does the perf balance change once you apply the patch?  Do we have
> anything else that stands out?  Getting rid of this bottleneck is fine
> by itself, but I am wondering if there are more things to worry about
> or not.

Thanks for the response.
Here is the perf result of pgoutput_change after applying the patch.
I didn't notice anything else that stands out.

|--2.99%--pgoutput_change
|--1.80%--get_rel_sync_entry
|--1.56%--hash_search

The complete profiles are also attached.

> > Good catch. WFM. Deferring variable initialization close to its first use is
> > good practice.
> 
> Yeah, it is usually a good practice to have the declaration within
> the code block that uses it rather than piling everything at the
> beginning of the function.  Being able to see that in profiles is
> annoying, and the change is simple, so I'd like to backpatch it.

+1

> This is a period of vacations for a lot of people, so I'll wait until
> the beginning-ish of January before doing anything.

Thanks, added it to CF.
https://commitfest.postgresql.org/36/3471/

Best regards,
Hou zj


RE: row filtering for logical replication

2021-12-27 Thread houzj.f...@fujitsu.com
On Mon, Dec 27, 2021 9:16 PM houzj.f...@fujitsu.com  
wrote:
> On Thur, Dec 23, 2021 4:28 PM Peter Smith  wrote:
> > Here is the v54* patch set:
> 
> Attach the v55 patch set which add the following testcases in 0003 patch.
Sorry for the typo here; I meant the tests were added to the 0002 patch.

Best regards,
Hou zj


RE: Confused comment about drop replica identity index

2021-12-29 Thread houzj.f...@fujitsu.com
On Tue, Dec 21, 2021 8:47 AM Michael Paquier  wrote:
> On Mon, Dec 20, 2021 at 11:57:32AM -0300, Euler Taveira wrote:
> > What do you think about the attached patch? It forbids the DROP INDEX.
> > We might add a detail message but I didn't in this patch.
> 
> Yeah.  I'd agree about doing something like that on HEAD, and that would help
> with some of the logirep-related patch currently being worked on, as far as I
> understood.

Hi,

I think forbidding DROP INDEX might not completely solve this problem, because
the user could still use other commands to delete the index, for example ALTER
TABLE DROP COLUMN: after dropping the column, the index on it will also be
dropped.

Besides, the user can also run ALTER TABLE ... REPLICA IDENTITY USING INDEX on
the primary key index, and in that case, dropping the primary key constraint
with ALTER TABLE ... DROP CONSTRAINT will also drop the replica identity
index.
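
For example (a minimal sketch):

create table t (a int not null, b int not null);
create unique index t_b_idx on t (b);
alter table t replica identity using index t_b_idx;
alter table t drop column b;             -- silently drops the RI index too

create table t2 (a int primary key);
alter table t2 replica identity using index t2_pkey;
alter table t2 drop constraint t2_pkey;  -- also drops the index used as RI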

Best regards,
Hou zj





RE: Failed transaction statistics to measure the logical replication progress

2022-01-02 Thread houzj.f...@fujitsu.com
On Wednesday, December 22, 2021 6:14 PM Osumi, Takamichi 
 wrote:
> Attached the new patch v19.
Hi,

Thanks for updating the patch.

--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -15,6 +15,7 @@
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
 #include "replication/logicalproto.h"
+#include "replication/worker_internal.h"

I noticed that the patch includes "worker_internal.h" in pgstat.h.
I think it might be better to include this file only in pgstat.c.

And it seems we can access MyLogicalRepWorker directly in the
following functions instead of passing it as a parameter:

+extern void pgstat_report_subworker_xact_end(LogicalRepWorker *repWorker,
+                                             LogicalRepMsgType command,
+                                             bool bforce);
+extern void pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker,
+                                             bool force);

Best regards,
Hou zj


RE: Delay the variable initialization in get_rel_sync_entry

2022-01-04 Thread houzj.f...@fujitsu.com
On Wednesday, January 5, 2022 9:31 AM Michael Paquier  
wrote:
> On Fri, Dec 24, 2021 at 01:27:26PM +0000, houzj.f...@fujitsu.com wrote:
> > Here is the perf result of pgoutput_change after applying the patch.
> > I didn't notice something else that stand out.
> >
> > |--2.99%--pgoutput_change
> > |--1.80%--get_rel_sync_entry
> > |--1.56%--hash_search
> >
> > Also attach complete profiles.
> 
> Thanks.  I have also done my own set of measurements, and the difference is
> noticeable in the profiles I looked at.  So, applied down to 13.
Thanks for pushing!

Best regards,
Hou zj




RE: row filtering for logical replication

2022-01-04 Thread houzj.f...@fujitsu.com
On Wednesday, January 5, 2022 2:45 PM Amit Kapila  
wrote:
> 
> On Wed, Jan 5, 2022 at 11:04 AM Peter Smith 
> wrote:
> >
> >
> > 11. src/backend/utils/cache/relcache.c - GetRelationPublicationActions
> >
> > Something seemed slightly fishy with the code doing the memcpy,
> > because IIUC is possible for the GetRelationPublicationInfo function
> > to return without setting the relation->rd_pubactions. Is it just
> > missing an Assert or maybe a comment to say such a scenario is not
> > possible in this case because the is_publishable_relation was already
> > tested?
> >
> 
> I think it would be good to have an Assert for a valid value of
> relation->rd_pubactions before doing memcpy. Alternatively, in
> function, GetRelationPublicationInfo(), we can have an Assert when
> rd_rfcol_valid is true. I think we can add comments atop
> GetRelationPublicationInfo about pubactions.
> 
> >
> > 13. src/include/utils/rel.h - comment typos
> >
> > @@ -164,6 +164,13 @@ typedef struct RelationData
> >   PublicationActions *rd_pubactions; /* publication actions */
> >
> >   /*
> > + * true if the columns referenced in row filters from all the
> > + publications
> > + * the relation is in are part of replica identity, or the
> > + publication
> > + * actions do not include UPDATE and DELETE.
> > + */
> >
> > Some minor rewording of the comment:
> >
> ...
> > "UPDATE and DELETE" --> "UPDATE or DELETE"
> >
> 
> The existing comment seems correct to me. Hou-San can confirm it once as I
> think this is written by him.

I think the code comment is trying to say "the publication does not include
UPDATE and also does not include DELETE". I am not too sure about the grammar;
I noticed some other places in the code use "no updates or deletes", so maybe
it's fine to change it to "UPDATE or DELETE".

Best regards,
Hou zj


RE: row filtering for logical replication

2022-01-09 Thread houzj.f...@fujitsu.com
On Friday, January 7, 2022 3:40 PM Peter Smith  wrote:
> Below are some review comments for the v60 patches.
> 
> 1e.
> "Subsequent UPDATE or DELETE statements have no effect."
> 
> Why won't they have an effect? The first impression is the newly updated tuple
> now matches the filter, I think this part seems to need some more detailed
> explanation. I saw there are some slightly different details in the header
> comment of the pgoutput_row_filter_update_check function - does it help?

Thanks for the comments! I have addressed all of them except 1e, which I will
think over and address in the next version.

Best regards,
Hou zj


RE: row filtering for logical replication

2022-01-09 Thread houzj.f...@fujitsu.com
On Wednesday, January 5, 2022 2:01 PM vignesh C  wrote:
> On Tue, Jan 4, 2022 at 9:58 AM Peter Smith  wrote:
> >
> > Here is the v58* patch set:
> >
> > Main changes from v57* are
> > 1. Couple of review comments fixed
> >
> > ~~
> >
> > Review comments (details)
> > =
> >
> > v58-0001 (main)
> > - PG docs updated as suggested [Alvaro, Euler 26/12]
> >
> > v58-0002 (new/old tuple)
> > - pgputput_row_filter_init refactored as suggested [Wangw 30/12] #3
> > - re-ran pgindent
> >
> > v58-0003 (tab, dump)
> > - no change
> >
> > v58-0004 (refactor transformations)
> > - minor changes to commit message
> 
> Few comments:

Thanks for the comments!

> 1) We could include namespace names along with the relation to make it more
> clear to the user if the user had specified tables having same table names 
> from
> different schemas:

Since most of the error messages in publicationcmds.c and pg_publication.c
don't include the namespace name along with the relation name, I am not sure
it is necessary to add this, so I didn't change it in the patch.

> 5) This log will be logged for each tuple, if there are millions of records 
> it will
> get logged millions of times, we could remove it:
> +   /* update requires a new tuple */
> +   Assert(newtuple);
> +
> +   elog(DEBUG3, "table \"%s.%s\" has row filter",
> +
> get_namespace_name(get_rel_namespace(RelationGetRelid(relation))),
> +get_rel_name(relation->rd_id));

Since the message is logged only at DEBUG3 and could be useful for debugging,
I didn't remove it in the new version of the patch.

Best regards,
Hou zj



RE: row filtering for logical replication

2022-01-13 Thread houzj.f...@fujitsu.com
On Thu, Jan 13, 2022 5:22 PM Alvaro Herrera  wrote:
> >  /*
> > + * Only 3 publication actions are used for row filtering ("insert",
> > +"update",
> > + * "delete"). See RelationSyncEntry.exprstate[].
> > + */
> > +typedef enum RowFilterPubAction
> > +{
> > +   PUBACTION_INSERT,
> > +   PUBACTION_UPDATE,
> > +   PUBACTION_DELETE,
> > +   NUM_ROWFILTER_PUBACTIONS  /* must be last */ }
> RowFilterPubAction;
> 
> Please do not add NUM_ROWFILTER_PUBACTIONS as an enum value.  It's a bit
> of a lie and confuses things, because your enum now has 4 possible values, not
> 3.  I suggest to #define NUM_ROWFILTER_PUBACTIONS
> (PUBACTION_DELETE+1) instead.
> 
> > +   int idx_ins = PUBACTION_INSERT;
> > +   int idx_upd = PUBACTION_UPDATE;
> > +   int idx_del = PUBACTION_DELETE;
> 
> I don't understand the purpose of these variables; can't you just use the
> constants?

Thanks for the comments!
Changed the code as suggested.

Best regards,
Hou zj


RE: row filtering for logical replication

2022-01-13 Thread houzj.f...@fujitsu.com
On Thursday, January 13, 2022 2:49 PM Peter Smith 
> Thanks for posting the merged v63.
> 
> Here are my review comments for the v63-0001 changes.
> 
> ~~~

Thanks for the comments!

> 1. src/backend/replication/logical/proto.c - logicalrep_write_tuple
> 
>   TupleDesc desc;
> - Datum values[MaxTupleAttributeNumber];
> - bool isnull[MaxTupleAttributeNumber];
> + Datum*values;
> + bool*isnull;
>   int i;
>   uint16 nliveatts = 0;
> 
> Those separate declarations for values / isnull are not strictly
> needed anymore, so those vars could be deleted. IIRC those were only
> added before when there were both slots and tuples. OTOH, maybe you
> prefer to keep it this way just for code readability?

Yes, I prefer the current style for code readability.

> 
> 2. src/backend/replication/pgoutput/pgoutput.c - typedef
> 
> +typedef enum RowFilterPubAction
> +{
> + PUBACTION_INSERT,
> + PUBACTION_UPDATE,
> + PUBACTION_DELETE,
> + NUM_ROWFILTER_PUBACTIONS  /* must be last */
> +} RowFilterPubAction;
> 
> This typedef is not currently used by any of the code.
> 
> So I think choices are:
> 
> - Option 1: remove the typedef, because nobody is using it.
> 
> - Option 2: keep the typedef, but use it! e.g. everywhere there is an
> exprstate array index variable probably it should be declared as a
> 'RowFilterPubAction idx' instead of just 'int idx'.

Thanks, I went with option 1.

> 
> 3. src/backend/replication/pgoutput/pgoutput.c - map_changetype_pubaction
> 
> After this recent v63 refactoring and merging of some APIs it seems
> that the map_changetype_pubaction is now ONLY used by
> pgoutput_row_filter function. So this can now be a static member of
> pgoutput_row_filter function instead of being declared at file scope.
> 

Changed.

> 
> 4. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter
> comments
> The function header comment says the same thing 2x about the return values.
> 

Changed.

> 
> ~~~
> 
> 5. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter
> 
> + ExprContext*ecxt;
> + int filter_index = map_changetype_pubaction[*action];
> +
> + /* *action is already assigned default by caller */
> + Assert(*action == REORDER_BUFFER_CHANGE_INSERT ||
> +*action == REORDER_BUFFER_CHANGE_UPDATE ||
> +*action == REORDER_BUFFER_CHANGE_DELETE);
> +
> 
> Accessing the map_changetype_pubaction array should be done *after* the
> Assert.
> 
> ~~~

Changed.

> 
> 6. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter
> 
> ExprState *filter_exprstate;
> ...
> filter_exprstate = entry->exprstate[map_changetype_pubaction[*action]];
> 
> Please consider what way you think is best.

Changed as suggested.

> 
> 7. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter
> There are several things not quite right with that comment:
> a. I thought now it should refer to "slots" instead of "tuples"
> 
> Suggested replacement comment:

Changed, but I prefer "tuple", which is easy to understand.

> ~~~
> 
> 8. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter
> 
> + if (!new_slot || !old_slot)
> + {
> + ecxt->ecxt_scantuple = new_slot ? new_slot : old_slot;
> + result = pgoutput_row_filter_exec_expr(entry->exprstate[filter_index],
> +ecxt);
> +
> + FreeExecutorState(estate);
> + PopActiveSnapshot();
> +
> + return result;
> + }
> +
> + tmp_new_slot = new_slot;
> + slot_getallattrs(new_slot);
> + slot_getallattrs(old_slot);
> 
> I think after this "if" condition then the INSERT, DELETE and simple
> UPDATE are already handled. So, the remainder of the code is for
> deciding what update transformation is needed etc.
> 
> I think there should be some block comment somewhere here to make that
> more obvious.

Changed.
> ~~
> 
> 9. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter
> 
> Many of the comments in this function are still referring to old/new
> "tuple". Now that all the params are slots instead of tuples maybe now
> all the comments should also refer to "slots" instead of "tuples".
> Please search all the comments - e.g. including all the "Case 1:" ...
> "Case 4:" comments.

I also think "tuple" still makes sense, so I didn't change this.

> 
> 10. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter_init
> 
> + int idx_ins = PUBACTION_INSERT;
> + int idx_upd = PUBACTION_UPDATE;
> + int idx_del = PUBACTION_DELETE;
> 
> These variables are unnecessary now... They previously were added only
> as short synonyms because the other enum names were too verbose (e.g.
> REORDER_BUFFER_CHANGE_INSERT) but now that we have shorter enum
> names
> like PUBACTION_INSERT we can just use those names directly
> 
Changed.

> 
> 11. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter_init
> 
> I felt that the code would seem more natural if the
> pgoutput_row_filter_init function came *before* the
> pgoutput_row_filter function in the source code.
> 
Changed.

> 
> 12. src/backend/replication/pgoutput/pgoutput.c - pgoutput_ch

RE: Parallel Inserts in CREATE TABLE AS

2021-03-17 Thread houzj.f...@fujitsu.com

> > Seems like v22 patch was failing in cfbot for one of the unstable test 
> > cases.
> > Attaching v23 patch set with modification in 0003 and 0004 patches. No
> > changes to 0001 and 0002 patches. Hopefully cfbot will be happy with v23.
> >
> > Please consider v23 for further review.
> Hi,
> 
> I was looking into the latest patch, here are some comments:

A few comments:

1)
Executing the following SQL will cause assertion failure.
---sql---
create table data(a int);
insert into data select 1 from generate_series(1,100,1) t;
explain (verbose) create table tt as select a,2 from data;
--

The stack message:
---stack---
TRAP: FailedAssertion("!allow && gather_exists && tuple_cost_opts && !(*tuple_cost_opts & PARALLEL_INSERT_TUP_COST_IGNORED)", File: "execParallel.c", Line: 1872, PID: 1618247)
postgres: houzj postgres [local] EXPLAIN(ExceptionalCondition+0x8b)[0x940f0b]
postgres: houzj postgres [local] EXPLAIN[0x67ba1c]
postgres: houzj postgres [local] EXPLAIN(ExplainOnePlan+0x1c2)[0x605997]
postgres: houzj postgres [local] EXPLAIN[0x605d11]
postgres: houzj postgres [local] EXPLAIN(ExplainOneUtility+0x162)[0x605eb0]
--
In this case, the Gather node has a projection, in which case parallel CTAS is
not supported, but we still ignore the tuple cost in the planner.
If we plan to detect the projection, we may need to check tlist_same_exprs.

+   if (tlist_same_exprs)
+   {
        ignore_parallel_tuple_cost(root);
+   }
    generate_useful_gather_paths(root, rel, false);

2)
+    * Parallelize inserts only when the upper Gather node has no projections.
     */
-   gstate->dest = dest;
+   if (!gstate->ps.ps_ProjInfo)

IMO, it's better to add a comment about why we do not support projections for
now. Since not all projections are parallel unsafe (such as the case in 1)),
it would be desirable to support them later.

3)

+   if (IsParallelInsertionAllowed(PARALLEL_INSERT_CMD_CREATE_TABLE_AS,
+                                  &parallel_ins_info))
...
    /* plan the query */
    plan = pg_plan_query(query, pstate->p_sourcetext,
                         CURSOR_OPT_PARALLEL_OK, params);
...
+   if (IsParallelInsertionAllowed(PARALLEL_INSERT_CMD_CREATE_TABLE_AS,
+                                  &parallel_ins_info))
Currently, the patch calls IsParallelInsertionAllowed() both before and after
pg_plan_query(). This might lead to a misunderstanding that parallel_ins_info
can change during pg_plan_query(). Since parallel_ins_info does not change in
pg_plan_query(), is it possible to add a bool flag (allowed) to
parallel_ins_info to avoid the second call of IsParallelInsertionAllowed()?

Best regards,
houzj


RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-17 Thread houzj.f...@fujitsu.com
> >  If a table parameter value is set and the
> >  equivalent toast. parameter is not, the TOAST table
> >  will use the table's parameter value.
> > -Specifying these parameters for partitioned tables is not supported,
> > -but you may specify them for individual leaf partitions.
> > +These parameters, with the exception of
> > +parallel_insert_enabled,
> > +are not supported on partitioned tables, but may be specified for
> > +individual leaf partitions.
> > 
> >
> 
> Your patch looks good to me. While checking this, I notice a typo in the
> previous patch:
> -  planner parameter parallel_workers.
> +  planner parameter parallel_workers and
> +  parallel_insert_enabled.
> 
> Here, it should be /planner parameter/planner parameters.

Thanks Amit and Justin for pointing this out!
The changes look good to me.

Best regards,
houzj



RE: Parallel Inserts in CREATE TABLE AS

2021-03-19 Thread houzj.f...@fujitsu.com
> They are pretty simple though. I think someone can also check if the same
> regression exists for parallel inserts in "INSERT INTO SELECT"
> patch set as well for larger tuple sizes.

Thanks for the reminder.
I did some performance tests for parallel inserts in "INSERT INTO SELECT" with
the testcase you provided; the regression does not seem to exist in
"INSERT INTO SELECT".

I will try to test with larger tuple sizes later.

Best regards,
houzj


RE: Parallel Inserts in CREATE TABLE AS

2021-03-19 Thread houzj.f...@fujitsu.com
> > They are pretty simple though. I think someone can also check if the
> > same regression exists for parallel inserts in "INSERT INTO SELECT"
> > patch set as well for larger tuple sizes.
> 
> Thanks for the reminder.
> I did some performance tests for parallel inserts in "INSERT INTO SELECT"
> with the testcase you provided; the regression does not seem to exist in
> "INSERT INTO SELECT".

I forgot to share the test results for parallel CTAS.

I tested with the SQL: explain analyze verbose create table test as select *
from tenk1;

> CREATE UNLOGGED TABLE tenk1(c1 int, c2 int);
> CREATE UNLOGGED TABLE tenk1(c1 int, c2 int, c3 varchar(8), c4 varchar(8), c5 
> varchar(8));
> CREATE UNLOGGED TABLE tenk1(c1 bigint, c2 bigint, c3 name, c4 name, c5 name, 
> c6 varchar(8));
I did not see a regression in these cases (small tuple sizes).

> CREATE UNLOGGED TABLE tenk1(c1 bigint, c2 bigint, c3 name, c4 name, c5 name, 
> c6 name, c7 name, c8 name, c9 name, c10 name, c11 name, c12 name, c13 name, 
> c14 name, 
> c15 name, c16 name, c17 name, c18 name);

I can see the degradation in this case.
The average test results of CTAS are:
Serial   CTAS - Execution Time: 80892.240 ms
Parallel CTAS - Execution Time: 85725.591 ms
That is about a 6% degradation.

I also tested with the Parallel INSERT patch in this case.
(Note: to keep it consistent, I create a new target table (test) before
inserting.)
The average test results of Parallel INSERT are:
Serial Parallel INSERT   - Execution Time: 90075.501 ms
Parallel Parallel INSERT - Execution Time: 85812.202 ms
No degradation.

Best regards,
houzj



RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-21 Thread houzj.f...@fujitsu.com
> Since the "Parallel SELECT for INSERT" patch and related GUC/reloption patch 
> have been committed, I have now rebased and attached the rest of the original 
> patchset, 
> which includes:
>- Additional tests for "Parallel SELECT for INSERT"
>- Parallel INSERT functionality
>- Test and documentation updates for Parallel INSERT
Hi,

I noticed that some comments may need to be updated since we introduce
parallel insert in this patch.

1) src/backend/executor/execMain.c
 * Don't allow writes in parallel mode.  Supporting UPDATE and DELETE
 * would require (a) storing the combocid hash in shared memory, rather
 * than synchronizing it just once at the start of parallelism, and (b) 
an
 * alternative to heap_update()'s reliance on xmax for mutual exclusion.
 * INSERT may have no such troubles, but we forbid it to simplify the
 * checks.

As we will allow INSERT in parallel mode, we'd better change the comment here.

2) src/backend/storage/lmgr/README
dangers are modest.  The leader and worker share the same transaction,
snapshot, and combo CID hash, and neither can perform any DDL or, indeed,
write any data at all.  Thus, for either to read a table locked exclusively by

Same as in 1), parallel insert is now the exception.

3) src/backend/storage/lmgr/README
mutual exclusion method for such cases.  Currently, the parallel mode is
strictly read-only, but now we have the infrastructure to allow parallel
inserts and parallel copy.

Maybe we can say:
+mutual exclusion method for such cases.  Currently, we only allow parallel
+inserts, but we already have the infrastructure to allow parallel copy.


Best regards,
houzj




RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table

2021-03-21 Thread houzj.f...@fujitsu.com
Hi,

Thanks for the review.

> I noticed some things on the first scan through:
> 
> Patch 0001:
> 1) Tidy up the comments a bit:
> 
> Suggest the following update to part of the comments:

Changed.

> Patch 0002:
> 1) The new max_parallel_hazard_context member "pk_rels" is not being 
> set (to
> NIL) in the is_parallel_safe() function, so it will have a junk value 
> in that case - though it does look like nothing could reference it 
> then (but the issue may be detected by a Valgrind build, as performed by the 
> buildfarm).

Changed.

> 2) Few things to tidy up the patch comments:

Changed.

> 3) In target_rel_trigger_max_parallel_hazard(), you have added a 
> variable declaration "int trigtype;" after code, instead of before:

Changed.

Attaching new version patch with these changes.

Also attaching simple performance test results (insert into a table with a
foreign key):
postgres=# explain (analyze, verbose) insert into fk select a, func_xxx() from data where a%2=0 or a%3=0;

  workers  | serial insert + parallel select | parallel insert | performance gain
-----------+---------------------------------+-----------------+------------------
 2 workers | 85512.153ms                     | 61384.957ms     | 29%
 4 workers | 85436.957ms                     | 39335.797ms     | 54%

-data prepare-
create table pk(a int primary key,b int);
create table fk(a int references pk,b int);
create table data(a int, b int);
insert into data select * from generate_series(1,1000,1) t;
insert into pk select generate_series(1,1000,1);


Best regards,
houzj


v3-0001-skip-cci.patch
Description: v3-0001-skip-cci.patch


v3-0002-extend-safery-check-for-fk-relation.patch
Description: v3-0002-extend-safery-check-for-fk-relation.patch


RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-22 Thread houzj.f...@fujitsu.com
> > I noticed that some comments may need updated since we introduced
> parallel insert in this patch.
> >
> > 1) src/backend/executor/execMain.c
> >  * Don't allow writes in parallel mode.  Supporting UPDATE and
> DELETE
> >  * would require (a) storing the combocid hash in shared memory,
> rather
> >  * than synchronizing it just once at the start of parallelism, and 
> > (b) an
> >  * alternative to heap_update()'s reliance on xmax for mutual
> exclusion.
> >  * INSERT may have no such troubles, but we forbid it to simplify 
> > the
> >  * checks.
> >
> > As we will allow INSERT in parallel mode, we'd better change the comment
> here.
> >
> 
> Thanks, it does need to be updated for parallel INSERT.
> I was thinking of the following change:
> 
> - * Don't allow writes in parallel mode.  Supporting UPDATE and DELETE
> - * would require (a) storing the combocid hash in shared memory, rather
> - * than synchronizing it just once at the start of parallelism, and (b) 
> an
> - * alternative to heap_update()'s reliance on xmax for mutual exclusion.
> - * INSERT may have no such troubles, but we forbid it to simplify the
> - * checks.
> + * Except for INSERT, don't allow writes in parallel mode.  Supporting
> + * UPDATE and DELETE would require (a) storing the combocid hash in
> shared
> + * memory, rather than synchronizing it just once at the start of
> + * parallelism, and (b) an alternative to heap_update()'s reliance on 
> xmax
> + * for mutual exclusion.
> 
> 
> 
> > 2) src/backend/storage/lmgr/README
> > dangers are modest.  The leader and worker share the same transaction,
> > snapshot, and combo CID hash, and neither can perform any DDL or,
> > indeed, write any data at all.  Thus, for either to read a table
> > locked exclusively by
> >
> > The same as 1), parallel insert is the exception.
> >
> 
> I agree, it needs to be updated too, to account for parallel INSERT now being
> supported.
> 
> -write any data at all.  ...
> +write any data at all (with the exception of parallel insert).  ...
> 
> 
> > 3) src/backend/storage/lmgr/README
> > mutual exclusion method for such cases.  Currently, the parallel mode
> > is strictly read-only, but now we have the infrastructure to allow
> > parallel inserts and parallel copy.
> >
> > May be we can say:
> > +mutual exclusion method for such cases.  Currently, we only allowed
> > +parallel inserts, but we already have the infrastructure to allow parallel 
> > copy.
> >
> 
> Yes, agree, something like:
> 
> -mutual exclusion method for such cases.  Currently, the parallel mode is
> -strictly read-only, but now we have the infrastructure to allow parallel 
> -inserts
> and parallel copy.
> +mutual exclusion method for such cases.  Currently, only parallel
> +insert is allowed, but we have the infrastructure to allow parallel copy.
> 
> 
> Let me know if these changes seem OK to you.

Yes, these changes look good to me.

Best regards,
houzj


RE: should INSERT SELECT use a BulkInsertState?

2021-03-22 Thread houzj.f...@fujitsu.com
Hi,

About the 0002-patch [Check for volatile defaults].

I wonder if we can check for volatile default values by traversing
"query->targetList" in the planner.

IMO, the column default expressions are written into the targetList, and the
current parallel-safety check traverses "query->targetList" to determine
whether it contains unsafe column default expressions.
Like: standard_planner -> query_tree_walker
	if (walker((Node *) query->targetList, context))
		return true;
Maybe we can do a similar thing to check the volatile defaults; if so, we do
not need to add a field to TargetEntry.
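To illustrate, a rough sketch of that idea (hypothetical; it simply reuses the
existing contain_volatile_functions() helper rather than the patch's actual
structure):

	/*
	 * Hypothetical sketch: rewritten column defaults already appear in the
	 * targetlist, so a volatile default can be detected by walking the
	 * whole targetlist, without adding a field to TargetEntry.
	 */
	if (contain_volatile_functions((Node *) query->targetList))
		return true;		/* treat the statement as parallel unsafe */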

Best regards,
houzj





RE: [HACKERS] logical decoding of two-phase transactions

2021-03-24 Thread houzj.f...@fujitsu.com
> I have incorporated all your changes and additionally made few more changes
> (a) got rid of LogicalRepBeginPrepareData and instead used
> LogicalRepPreparedTxnData, (b) made a number of changes in comments and
> docs, (c) ran pgindent, (d) modified tests to use standard wait_for_catch
> function and removed few tests to reduce the time and to keep regression
> tests reliable.

Hi,

When reading the code, I found some comments related to the patch here.

				 * XXX Now, this can even lead to a deadlock if the prepare
				 * transaction is waiting to get it logically replicated for
				 * distributed 2PC. Currently, we don't have an in-core
				 * implementation of prepares for distributed 2PC but some
				 * out-of-core logical replication solution can have such an
				 * implementation. They need to inform users to not have locks
				 * on catalog tables in such transactions.
				 */

Since we will now have an in-core implementation of prepares, should we update
the comments here?

Best regards,
houzj


RE: Add Nullif case for eval_const_expressions_mutator

2021-03-24 Thread houzj.f...@fujitsu.com
> + if (!has_nonconst_input)
> + return ece_evaluate_expr(expr);
> 
> That's not okay without a further check to see if the comparison function used
> by the node is immutable.  Compare ScalarArrayOpExpr, for instance.

Thanks for pointing it out.
Attaching new patch with this change.

Best regards,
houzj


v4-0001-add-nullif-case-for-eval_const_expressions.patch
Description: v4-0001-add-nullif-case-for-eval_const_expressions.patch


RE: fix typo in reorderbuffer.c

2021-03-24 Thread houzj.f...@fujitsu.com
> 
> What about "Combo CID(s)", for Combo command ID?  README.parallel uses
> this term for example.

Sorry for the late reply, and yes, "Combo CID(s)" looks better.
Attaching a patch which replaces all variants of "Combocid(s)" with "Combo CID(s)".

Best regards,
houzj


v2-0001-make-the-comments-about-combo-command-id-consistent.patch
Description:  v2-0001-make-the-comments-about-combo-command-id-consistent.patch


RE: Hybrid Hash/Nested Loop joins and caching results from subplans

2021-04-01 Thread houzj.f...@fujitsu.com
> I've attached the updated patch.  I'll let the CFbot grab this to ensure it's
> happy with it before I go looking to push it again.

Hi,

I took a look at the patch and noticed some minor things.

1.
+			case T_ResultCache:
+				ptype = "ResultCache";
+				subpath = ((ResultCachePath *) path)->subpath;
+				break;
			case T_UniquePath:
				ptype = "Unique";
				subpath = ((UniquePath *) path)->subpath;
should we use "case T_ResultCachePath" here?
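i.e., presumably it should be:

+			case T_ResultCachePath:
+				ptype = "ResultCache";
+				subpath = ((ResultCachePath *) path)->subpath;
+				break;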

2.
Is it better to add ResultCache's info to "src/backend/optimizer/README"?
Something like:
  NestPath  - nested-loop joins
  MergePath - merge joins
  HashPath  - hash joins
+ ResultCachePath - Result cache

Best regards,
Hou zhijie


RE: Table refer leak in logical replication

2021-04-05 Thread houzj.f...@fujitsu.com
> WARNING: relcache reference leak: relation "xxx" not closed.
> 
> Example of the procedure:
> --publisher--
> create table test (a int primary key);
> create publication pub for table test;
> 
> --subscriber--
> create table test (a int primary key);
> create subscription sub connection 'dbname=postgres' publication pub;
> create function funcA() returns trigger as $$ begin return null; end; $$
> language plpgsql;
> create trigger my_trig after insert or update or delete on test for each
> row execute procedure funcA();
> alter table test enable replica trigger my_trig;
> 
> --publisher--
> insert into test values (6);
> 
> It seems an issue about reference leak. Anyone can fix this?

It seems ExecGetTriggerResultRel will reopen the target table because it cannot
find an existing one.
Storing the opened table in estate->es_opened_result_relations seems to solve
the problem.
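For illustration, a minimal sketch of that idea (the attached patch contains
the real change; 'resultRelInfo' stands for the ResultRelInfo the apply worker
has just set up):

	/*
	 * Remember the ResultRelInfo we just initialized so that
	 * ExecGetTriggerResultRel() finds it instead of re-opening (and later
	 * leaking) the relation.
	 */
	estate->es_opened_result_relations =
		lappend(estate->es_opened_result_relations, resultRelInfo);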

Attaching a patch that fixes this.
BTW, would it be better to add a testcase for this?

Best regards,
houzj


fix_table_refer_leak.diff
Description: fix_table_refer_leak.diff


RE: Table refer leak in logical replication

2021-04-05 Thread houzj.f...@fujitsu.com
> > > insert into test values (6);
> > >
> > > It seems an issue about reference leak. Anyone can fix this?
> >
> > It seems ExecGetTriggerResultRel will reopen the target table because it
> cannot find an existing one.
> > Storing the opened table in estate->es_opened_result_relations seems
> solves the problem.
> 
> It seems like commit 1375422c is related to this bug. The commit introduced a
> new function ExecInitResultRelation() that sets both
> estate->es_result_relations and estate->es_opened_result_relations. I
> think it's better to use ExecInitResultRelation() rather than directly setting
> estate->es_opened_result_relations. It might be better to do that in
> create_estate_for_relation() though. Please find an attached patch.
> 
> Since this issue happens on only HEAD and it seems an oversight of commit
> 1375422c, I don't think regression tests for this are essential.

It seems we cannot simply use ExecInitResultRelation, though.
ExecInitResultRelation uses ExecGetRangeTableRelation, which will also open the
target table and store the rel in "EState->es_relations".
We should call ExecCloseRangeTableRelations at the end of apply_handle_xxx to
close the rel in "EState->es_relations".

Attaching the patch with this change.

Best regards,
houzj



fix_relcache_leak_in_lrworker.diff
Description: fix_relcache_leak_in_lrworker.diff


RE: Added schema level support for publication.

2021-10-11 Thread houzj.f...@fujitsu.com
On Monday, October 11, 2021 2:39 PM vignesh C  wrote:
> 
> These comments are fixed in the v38 patch attached.

Thanks for updating the patches.
Here are a few comments on the v38-0004-Doc patch.

1.
+  
+   Adding/Setting a table that is part of schema specified in
+   ALL TABLES IN SCHEMA, adding/setting a schema to a
+   publication along with same schema's table specified with
+   TABLE, adding/setting a schema to a publication that
+   already has a table that is part of specified schema or adding/setting a
+   table to a publication that already has a table's schema as part of
+   specified schema is not supported.

ISTM we can remove the description "adding/setting a schema to a publication
along with the same schema's table specified with TABLE", because it seems the
same as the first mentioned case, "Adding/Setting a table that is part of
schema specified in ALL TABLES IN SCHEMA".

2.

+
+
+  
+   Add some schemas to the publication:
+
+ALTER PUBLICATION sales_publication ADD ALL TABLES IN SCHEMA marketing_june, sales_june;
+
+  
+
+  
+   Add some tables and schemas to the publication:
...
+
+  
+   Drop some schemas from the publication:
...
+  
+   Set some schemas to the publication:
+
+ALTER PUBLICATION production_publication SET ALL TABLES IN SCHEMA production_september, production_october;

Personally, I think we don't need the example about DROP and SET here.
The example of ADD seems sufficient.

3.
+
+  
+
+  
+   Create a publication that publishes all changes for all the tables present in
+   the schema "production":
+
+CREATE PUBLICATION production_publication FOR ALL TABLES IN SCHEMA production;
+
+  
...
+  
+   Create a publication that publishes all changes for all the tables present in
+   the schemas "marketing" and "sales":
+
+CREATE PUBLICATION sales_publication FOR ALL TABLES IN SCHEMA marketing, sales;

I think the example for publishing all the tables in schemas "marketing" and
"sales" is sufficient; the example for publishing a single schema can be
removed.

Best regards,
Hou zj



Drop replslot after pgstat_shutdown cause assert coredump

2021-10-11 Thread houzj.f...@fujitsu.com
Hi,

When testing logical replication, I found a case which causes an assertion
failure (core dump) on the latest HEAD. The reproduction steps are as follows:

1)
publisher
create table test(i int);
create publication pub for table test;
begin;

insert into test values(1);

2)
subscriber
create table test(i int);
create subscription sub connection 'dbname=postgres port=1' publication pub;
- wait for a second and Ctrl-C

3)
publisher
commit;

I can see the walsender tried to release a not-quite-ready replication slot
that was created when creating a subscription. But pgstat has already been shut
down before ReplicationSlotRelease() is invoked.

The stack is as follows:

#2  in ExceptionalCondition (pgstat_is_initialized && !pgstat_is_shutdown)
#3  in pgstat_assert_is_up () at pgstat.c:4852
#4  in pgstat_send (msg=msg@entry=0x7ffe716f7470, len=len@entry=144) at pgstat.c:3075
#5  in pgstat_report_replslot_drop (slotname=slotname@entry=0x7fbcf57a3c98 "sub") at pgstat.c:1869
#6  in ReplicationSlotDropPtr (slot=0x7fbcf57a3c80) at slot.c:696
#7  in ReplicationSlotDropAcquired () at slot.c:585
#8  in ReplicationSlotRelease () at slot.c:482
#9  in ProcKill (code=, arg=) at proc.c:852
#10 in shmem_exit (code=code@entry=0) at ipc.c:272
#11 in proc_exit_prepare (code=code@entry=0) at ipc.c:194
#12 in proc_exit (code=code@entry=0) at ipc.c:107
#13 in ProcessRepliesIfAny () at walsender.c:1807
#14 in WalSndWaitForWal (loc=loc@entry=22087632) at walsender.c:1417
#15 in logical_read_xlog_page (state=0x2f8c600, targetPagePtr=22085632, reqLen=, targetRecPtr=, cur_page=0x2f6c1e0 "\016\321\005") at walsender.c:821
#16 in ReadPageInternal (state=state@entry=0x2f8c600, pageptr=pageptr@entry=22085632, reqLen=reqLen@entry=2000) at xlogreader.c:667
#17 in XLogReadRecord (state=0x2f8c600, errormsg=errormsg@entry=0x7ffe716f7f98) at xlogreader.c:337
#18 in DecodingContextFindStartpoint (ctx=ctx@entry=0x2f8c240) at logical.c:606
#19 in CreateReplicationSlot (cmd=cmd@entry=0x2f1aef0)

Is this behavior expected?

Best regards,
Hou zhijie





RE: Added schema level support for publication.

2021-10-11 Thread houzj.f...@fujitsu.com
On Monday, October 11, 2021 11:02 PM vignesh C  wrote:
> The attached v39 patch has the fixes for the above issues.

Thanks for the updates.
I have a few minor suggestions about the testcases in the v39-0003-Test patch.

1)
+-- alter publication drop CURRENT_SCHEMA
+ALTER PUBLICATION testpub1_forschema DROP ALL TABLES IN SCHEMA CURRENT_SCHEMA;
+\dRp+ testpub1_forschema

Since we already tested CURRENT_SCHEMA in various CREATE PUBLICATION cases, 
maybe
we don't need to test it again in SET/DROP/ADD cases.

2)
+-- alter publication set schema
+ALTER PUBLICATION testpub1_forschema SET ALL TABLES IN SCHEMA pub_test1;
+\dRp+ testpub1_forschema
+
+-- alter publication set multiple schema
+ALTER PUBLICATION testpub1_forschema SET ALL TABLES IN SCHEMA pub_test1, 
pub_test2;
+\dRp+ testpub1_forschema
+

I think the multiple schemas testcase is sufficient, maybe we can remove the
single schema case.


3)
+
+-- alter publication set it with the same schema
+ALTER PUBLICATION testpub1_forschema SET ALL TABLES IN SCHEMA pub_test1, 
pub_test2;
+\dRp+ testpub1_forschema

ISTM we don't have any special code path for this case, so maybe we can remove
this testcase.


Best regards,
Hou zj



RE: Added schema level support for publication.

2021-10-12 Thread houzj.f...@fujitsu.com
On Tuesday, October 12, 2021 9:15 PM vignesh C 
> Attached v40 patch has the fix for the above comments.

Thanks for the update. I have some minor concerns about partition-related
behavior.

1)

Tang tested and discussed this issue with me.
The testcase is:
We publish a schema and there is a partition in the published schema. If
publish_via_partition_root is on and the partition's parent table is not in the
published schema, neither the change on the partition nor on the parent table
will be published.

But if we publish the partition via FOR TABLE and set publish_via_partition_root
to on, the change on the partition will be published. So, I think it'd be better
to also publish the change on the partition in the FOR ALL TABLES IN SCHEMA case
if its parent table is not published in the same publication.

It seems we should pass the publication oid to GetSchemaPublicationRelations()
and add some check like the following:

		if (is_publishable_class(relid, relForm) &&
			!(relForm->relispartition && pub_partopt == PUBLICATION_PART_ROOT))
			result = lappend_oid(result, relid);
+		if (relForm->relispartition && pub_partopt == PUBLICATION_PART_ROOT)
+		{
+			bool		skip = false;
+			List	   *ancestors = get_partition_ancestors(relid);
+			List	   *schemas = GetPublicationSchemas(pubid);
+			ListCell   *lc;
+
+			foreach(lc, ancestors)
+			{
+				if (list_member_oid(schemas, get_rel_namespace(lfirst_oid(lc))))
+				{
+					skip = true;
+					break;
+				}
+			}
+			if (!skip)
+				result = lappend_oid(result, relid);
+		}



2)
+	/*
+	 * It is quite possible that some of the partitions are in a different
+	 * schema than the parent table, so we need to get such partitions
+	 * separately.
+	 */
+	scan = table_beginscan_catalog(classRel, keycount, key);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+
+		if (is_publishable_class(relForm->oid, relForm))
+		{
+			List	   *partitionrels = NIL;
+
+			partitionrels = GetPubPartitionOptionRelations(partitionrels,
+														   pub_partopt,
+														   relForm->oid);

I think a partitioned table could also itself be a partition, in which case it
should not be appended to the list. I think we should also filter these cases
here with the same check as in 1).
Thoughts?

Best regards,
Hou zj


RE: Failed transaction statistics to measure the logical replication progress

2021-10-13 Thread houzj.f...@fujitsu.com
On Thursday, September 30, 2021 12:15 PM Amit Kapila 
> On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰  
> wrote:
> >
> > On Tues, Sep 28, 2021 6:05 PM Amit Kapila  wrote:
> > >
> > > Can't we keep the current and new stats both in-memory and persist on
> > > disk? So, the persistent stats data will be used to fill the in-memory
> > > counters after restarting of workers, otherwise, we will always refer
> > > to in-memory values.
> >
> > I think this approach works, but I have another concern about it.
> >
> > The current pg_stat_subscription view is listed as "Dynamic Statistics 
> > Views"
> in
> > the document, the data in it seems about the worker process, and the view
> data
> > shows what the current worker did. But if we keep the new xact stat persist,
> > then it's not what the current worker did, it looks more related to the
> > subscription historic data.
> >
> 
> I see your point.
> 
> > Adding a new view seems resonalble, but it will bring another subscription
> > related view which might be too much. OTOH, I can see there are already
> some
> > different views[1] including xact stat, maybe adding another one is
> accepatble ?
> >
> 
> These all views are related to untransmitted to the collector but what
> we really need is a view similar to pg_stat_archiver or
> pg_stat_bgwriter which gives information about background workers.
> Now, the problem as I see is if we go that route then
> pg_stat_subscription will no longer remain dynamic view and one might
> consider that as a compatibility break. The other idea I have shared
> is that we display these stats under the new view introduced by
> Sawada-San's patch [1] and probably rename that view as
> pg_stat_subscription_worker where all the stats (xact info and last
> failure information) about each worker will be displayed. Do you have
> any opinion on that idea or do you see any problem with it?

Personally, I think it seems reasonable to merge the xact stats into the view
from Sawada-san's patch.

One problem I noticed is that pg_stat_subscription_error currently has a
'count' column which shows how many times the last error happened. The xact
stats here also have a similar value, 'xact_error'. I think we might need to
rename it or merge them into one in some way.

Besides, if we decide to merge the xact stats into pg_stat_subscription_error,
some columns seem to need renaming. Maybe like:
error_message => last_error_message, command => last_error_command ...

Best regards,
Hou zj


Data is copied twice when specifying both child and parent table in publication

2021-10-15 Thread houzj.f...@fujitsu.com
Hi,

In another logical-replication-related thread[1], my colleague Greg found that
if publish_via_partition_root is true, then the child table's data will be
copied twice when both the child and the parent table are added to the
publication.

Example:

-
Pub:
create table tbl1 (a int) partition by range (a);
create table tbl1_part1 partition of tbl1 for values from (1) to (10);
create table tbl1_part2 partition of tbl1 for values from (10) to (20);
create publication pub for table tbl1, tbl1_part1 with 
(publish_via_partition_root=on);

insert into tbl1_part1 values(1);

Sub:
create table tbl1 (a int) partition by range (a);
create table tbl1_part1 partition of tbl1 for values from (1) to (10);
create table tbl1_part2 partition of tbl1 for values from (10) to (20);
create subscription sub CONNECTION 'dbname=postgres port=1' publication pub;

-- data is copied twice
select * from tbl1_part1;
 a
---
 1
 1
-

The reason is that the subscriber fetches the table list from the publisher
using the following SQL[2], and executes table synchronization for each table
in the query results. But tbl1_part1 is a partition of tbl1, so the data of
tbl1_part1 is copied twice.

[2]
select * from pg_publication_tables;
 pubname | schemaname | tablename
---------+------------+------------
 pub     | public     | tbl1
 pub     | public     | tbl1_part1

IMO, it looks like a bug, and it's more natural to only execute the table
synchronization for the parent table in the above case, because as the document
says: if publish_via_partition_root is true, "changes in a partitioned table
(or on its partitions) contained in the publication will be published using the
identity and schema of the partitioned table rather than that of the individual
partitions that are actually changed;"

To fix it, I think we should fix the function GetPublicationRelations, which
generates the data for the view pg_publication_tables, and make it only show
the parent table if publish_via_partition_root is true. And for other future
features like schema-level publication, we can also follow this to exclude
partitions if their parent is specified by FOR TABLE in the same publication.

Attach a patch to fix it.
Thoughts ?

[1] 
https://www.postgresql.org/message-id/CAJcOf-eBhDUT2J5zs8Z0qEMiZUdhinX%2BbuGX3GN4V83fPnZV3Q%40mail.gmail.com

Best regards,
Hou zhijie



0001-fix-double-publish.patch
Description: 0001-fix-double-publish.patch


RE: Data is copied twice when specifying both child and parent table in publication

2021-10-15 Thread houzj.f...@fujitsu.com
On Friday, October 15, 2021 7:23 PM houzj.f...@fujitsu.com wrote:
> Attach a patch to fix it.
Attaching a new version of the patch which refactors the fix in a cleaner way.

Best regards,
Hou zj


v2-0001-fix-double-publish.patch
Description: v2-0001-fix-double-publish.patch


RE: Failed transaction statistics to measure the logical replication progress

2021-10-17 Thread houzj.f...@fujitsu.com
On Thursday, October 14, 2021 2:13 PM Osumi, Takamichi wrote:
> On Thursday, October 14, 2021 12:54 PM Hou, Zhijie 
> wrote:
> > On Thursday, September 30, 2021 12:15 PM Amit Kapila
> > > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰wrote:
> > > >
> > > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila
> > > > 
> > wrote:
> > > > >
> > > > > Can't we keep the current and new stats both in-memory and
> > > > > persist on disk? So, the persistent stats data will be used to
> > > > > fill the in-memory counters after restarting of workers,
> > > > > otherwise, we will always refer to in-memory values.
> > > >
> > > > I think this approach works, but I have another concern about it.
> > > >
> > > > The current pg_stat_subscription view is listed as "Dynamic
> > > > Statistics Views" in
> > > > the document, the data in it seems about the worker process, and
> > > > the view data
> > > > shows what the current worker did. But if we keep the new xact
> > > > stat persist, then it's not what the current worker did, it looks
> > > > more related to the subscription historic data.
> > > >
> > >
> > > I see your point.
> > >
> > > > Adding a new view seems resonalble, but it will bring another
> > > > subscription related view which might be too much. OTOH, I can see
> > > > there are already some
> > > > different views[1] including xact stat, maybe adding another one
> > > > is accepatble ?
> > > >
> > >
> > > These all views are related to untransmitted to the collector but
> > > what we really need is a view similar to pg_stat_archiver or
> > > pg_stat_bgwriter which gives information about background workers.
> > > Now, the problem as I see is if we go that route then
> > > pg_stat_subscription will no longer remain dynamic view and one
> > > might consider that as a compatibility break. The other idea I have
> > > shared is that we display these stats under the new view introduced
> > > by Sawada-San's patch [1] and probably rename that view as
> > > pg_stat_subscription_worker where all the stats (xact info and last
> > > failure information) about each worker will be displayed. Do you
> > > have any opinion on that idea or do you see any problem with it?
> >
> > Personally, I think it seems reasonable to merge the xact stat into
> > the view from sawada-san's patch.
> >
> > One problem I noticed is that pg_stat_subscription_error currently
> > have a 'count' column which show how many times the last error
> > happened. The xact stat here also have a similar value 'xact_error'. I
> > think we might need to rename it or merge them into one in some way.
> >
> > Besides, if we decide to merge xact stat into
> > pg_stat_subscription_error, some column seems need to be renamed.
> Maybe like:
> > error_message => Last_error_message, command=> last_error_command..
> Yeah, we must make them distinguished clearly.
> 
> I guessed that you are concerned about
> amount of renaming codes that could be a bit large or you come up with a
> necessity to consider the all column names of the pg_stat_subscription_worker
> together all at once in advance.
> 
> It's because my instant impression is,
> when we go with the current xact stats column definitions (xact_commit,
> xact_commit_bytes, xact_error, xact_error_bytes, xact_abort,
> xact_abort_bytes), the renaming problem can be solved if I write one
> additional patch or extend the main patch of xact stats to handle renaming.
> (This can work to keep both threads independent from each other).
> 
> Did you have some concern that cannot be handled by this way ?
Hi,

Currently, I don't see any unsolvable issues with this approach.

Best regards,
Hou zj



RE: Added schema level support for publication.

2021-10-17 Thread houzj.f...@fujitsu.com
On Monday, October 18, 2021 12:04 PM Greg Nancarrow wrote:
> On Sat, Oct 16, 2021 at 4:57 PM houzj.f...@fujitsu.com wrote:
> > Based on the V40 patchset, attaching the Top-up patch which try to fix
> > the partition issue in a cleaner way.
> >
> 
> A minor thing, in your "top-up patch", the test code added to publication.sql,
> you need to remove the last "DROP TABLE sch2.tbl1_part1;". It causes an error
> because the table doesn't exist and it seems to have been erroneously copied
> from the previous test case.

Thanks for the comment.
I have removed the last "DROP TABLE sch2.tbl1_part1;" in V41 patch set.

Best regards,
Hou zj


RE: Data is copied twice when specifying both child and parent table in publication

2021-10-18 Thread houzj.f...@fujitsu.com
On Saturday, October 16, 2021 2:31 PM houzj.f...@fujitsu.com wrote:
> On Friday, October 15, 2021 7:23 PM houzj.f...@fujitsu.com wrote:
> > Attach a patch to fix it.
> Attach a new version patch which refactor the fix code in a cleaner way.

Although the discussion about the partition behavior[1] is still going on,
attaching a refactored fix patch here which makes the pg_publication_tables
view consistent for FOR TABLE and FOR ALL TABLES, in case someone wants to have
a look.

[1] 
https://www.postgresql.org/message-id/CAA4eK1JC5sy5M_UVoGdgubHN2--peYqApOJkT%3DFLCq%2BVUxqerQ%40mail.gmail.com

Best regards,
Hou zj


v3-0001-fix-double-publish.patch
Description: v3-0001-fix-double-publish.patch


RE: Failed transaction statistics to measure the logical replication progress

2021-10-18 Thread houzj.f...@fujitsu.com
On Monday, October 18, 2021 6:04 PM Amit Kapila  wrote:
> On Thu, Oct 14, 2021 at 9:23 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Thursday, September 30, 2021 12:15 PM Amit Kapila
> > 
> > >
> > > These all views are related to untransmitted to the collector but
> > > what we really need is a view similar to pg_stat_archiver or
> > > pg_stat_bgwriter which gives information about background workers.
> > > Now, the problem as I see is if we go that route then
> > > pg_stat_subscription will no longer remain dynamic view and one
> > > might consider that as a compatibility break. The other idea I have
> > > shared is that we display these stats under the new view introduced
> > > by Sawada-San's patch [1] and probably rename that view as
> > > pg_stat_subscription_worker where all the stats (xact info and last
> > > failure information) about each worker will be displayed. Do you
> > > have any opinion on that idea or do you see any problem with it?
> >
> > Personally, I think it seems reasonable to merge the xact stat into
> > the view from sawada-san's patch.
> >
> > One problem I noticed is that pg_stat_subscription_error currently
> > have a 'count' column which show how many times the last error
> > happened. The xact stat here also have a similar value 'xact_error'. I
> > think we might need to rename it or merge them into one in some way.
> >
> > Besides, if we decide to merge xact stat into
> > pg_stat_subscription_error, some column seems need to be renamed.
> Maybe like:
> > error_message => Last_error_message, command=> last_error_command..
> >
> 
> Don't you think that keeping the view name as pg_stat_subscription_error
> would be a bit confusing if it has to display xact_info? Isn't it better to 
> change it
> to pg_stat_subscription_worker or some other worker-specific generic name?

Yes, I agree that renaming the view to pg_stat_subscription_worker or some
other worker-specific generic name is better if we decide to move forward with
this approach.

Best regards,
Hou zj


RE: Data is copied twice when specifying both child and parent table in publication

2021-10-18 Thread houzj.f...@fujitsu.com
On Monday, October 18, 2021 5:03 PM Amit Langote  
wrote:
> I can imagine that the behavior seen here may look surprising, but not
> sure if I would call it a bug as such.  I do remember thinking about
> this case and the current behavior is how I may have coded it to be.
> 
> Looking at this command in Hou-san's email:
> 
>   create publication pub for table tbl1, tbl1_part1 with
> (publish_via_partition_root=on);
> 
> It's adding both the root partitioned table and the leaf partition
> *explicitly*, and it's not clear to me if the latter's inclusion in
> the publication should be assumed because the former is found to have
> been added to the publication, that is, as far as the latter's
> visibility to the subscriber is concerned.  It's not a stretch to
> imagine that a user may write the command this way to account for a
> subscriber node on which tbl1 and tbl1_part1 are unrelated tables.
> 
> I don't think we assume anything on the publisher side regarding the
> state/configuration of tables on the subscriber side, at least with
> publication commands where tables are added to a publication
> explicitly, so it is up to the user to make sure that the tables are
> not added duplicatively.  One may however argue that the way we've
> decided to handle FOR ALL TABLES does assume something about
> partitions where it skips advertising them to subscribers when
> publish_via_partition_root flag is set to true, but that is exactly to
> avoid the duplication of data that goes to a subscriber.

Hi,

Thanks for the explanation.

I think one reason I consider this behavior a bug is that, if we add both the
root partitioned table and the leaf partition explicitly to the publication
(and set publish_via_partition_root = on), the behavior of the apply worker is
inconsistent with the behavior of the table sync worker.

In this case, all changes in the leaf partition will be applied using the
identity and schema of the partitioned (root) table. But the table sync will be
executed for both the leaf and the root table, which causes duplication of
data.

Wouldn't it be better to make the behavior consistent here?

Best regards,
Hou zj




RE: Skipping logical replication transactions on subscriber side

2021-10-19 Thread houzj.f...@fujitsu.com
On Mon, Oct 18, 2021 9:34 AM Masahiko Sawada  wrote:
> I've attached updated patches that incorporate all comments I got so far.

Hi,

Here are some minor comments for the patches.

v17-0001-Add-a-subscription-errors-statistics-view-pg_sta.patch

1)

+	/* Clean up */
+	if (not_ready_rels != NIL)
+		list_free_deep(not_ready_rels);

Maybe we don't need the 'if (not_ready_rels != NIL)' check, as list_free_deep
will do this check internally.

2)

+	for (int i = 0; i < msg->m_nentries; i++)
+	{
+		HASH_SEQ_STATUS sstat;
+		PgStat_StatSubWorkerEntry *wentry;
+
+		/* Remove all worker statistics of the subscription */
+		hash_seq_init(&sstat, subWorkerStatHash);
+		while ((wentry = (PgStat_StatSubWorkerEntry *) hash_seq_search(&sstat)) != NULL)
+		{
+			if (wentry->key.subid == msg->m_subids[i])
+				(void) hash_search(subWorkerStatHash, (void *) &(wentry->key),
+								   HASH_REMOVE, NULL);

Would it be a little faster if we scanned the hash table in the outer loop and
the message entries in the inner loop? That way subWorkerStatHash is walked
only once. Like:

	hash_seq_init(&sstat, subWorkerStatHash);
	while ((wentry = (PgStat_StatSubWorkerEntry *) hash_seq_search(&sstat)) != NULL)
	{
		for (int i = 0; i < msg->m_nentries; i++)
			if (wentry->key.subid == msg->m_subids[i])
				(void) hash_search(subWorkerStatHash, (void *) &(wentry->key),
								   HASH_REMOVE, NULL);
	}


v17-0002-Add-RESET-command-to-ALTER-SUBSCRIPTION-command

1)
I noticed that we cannot RESET slot_name while we can SET it.
And slot_name has a documented default behavior: "use the name of the
subscription for the slot name". So, is it possible to support RESETting it?

Best regards,
Hou zj



RE: Added schema level support for publication.

2021-10-20 Thread houzj.f...@fujitsu.com
On Thurs, Oct 21, 2021 12:25 AM vignesh C  wrote:
> This version of patch retains the changes related to PublicationRelInfo, I 
> will
> handle the merging of the patches in the next version so that this version of
> patch change related to PublicationRelInfo can be reviewed easily.

Thanks for the patches,
I think the change related to PublicationRelInfo looks good.

Best regards,
Hou zj


RE: Skipping logical replication transactions on subscriber side

2021-10-26 Thread houzj.f...@fujitsu.com
On Thurs, Oct 21, 2021 12:59 PM Masahiko Sawada  wrote:
> I've attached updated patches. In this version, in addition to the review
> comments I go so far, I've changed the view name from
> pg_stat_subscription_errors to pg_stat_subscription_workers as per the
> discussion on including xact info to the view on another thread[1].
> I’ve also changed related codes accordingly.

When reviewing the v18-0002 patch, I noticed that "RESET SYNCHRONOUS_COMMIT"
does not take effect (RESET doesn't change the value back to 'off').


+	if (!is_reset)
+	{
+		opts->synchronous_commit = defGetString(defel);
 
-		...
+	}

I think we need to add an else branch here to set synchronous_commit to 'off'.
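Something like the following (a sketch only; 'off' is the documented default
for this subscription option):

	if (!is_reset)
		opts->synchronous_commit = defGetString(defel);
	else
		opts->synchronous_commit = "off";	/* restore the default on RESET */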

Best regards,
Hou zj


RE: Data is copied twice when specifying both child and parent table in publication

2021-10-28 Thread houzj.f...@fujitsu.com
Hi,

As there are basically two separate issues mentioned in this thread, I tried to
summarize the discussion so far, which might be helpful to others.

* The first issue[1]:

If we include both the partitioned table and (explicitly) its child partitions
in the publication with publish_via_partition_root=true set, like:
---
CREATE PUBLICATION pub FOR TABLE parent_table, child_table with 
(publish_via_partition_root=on);
---
It could execute the initial sync for both the partitioned table (parent_table)
and (explicitly) its child partition (child_table), which causes duplication of
data in the partition (child_table) on the subscriber side.

The reasons I consider this behavior a bug are:

a) In this case, the behavior of initial sync is inconsistent with the behavior
of transaction streaming. All changes in the leaf partition will be applied
using the identity and schema of the partitioned (root) table. But the initial
sync will be executed for both the partitioned (root) table and (explicitly)
its child partitions, which causes duplication of data.

b) The behavior of FOR TABLE is inconsistent with the behavior of FOR ALL
TABLES. If a user creates a FOR ALL TABLES publication and sets
publish_via_partition_root=true, then only the topmost partitioned (root) table
will execute the initial sync.

IIRC, most people in this thread agreed that the current behavior is not
expected. So, maybe it's time to try to fix it.

Attaching my fix patch here. The patch tries to fix this issue by making the
pg_publication_tables view show only the partitioned table when
publish_via_partition_root is true.
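To sketch the idea in C (hypothetical code; 'pub', 'relid', and
'published_rels' are assumed to be in scope inside the relation-collecting loop
of GetPublicationRelations(), with 'published_rels' standing for the OIDs
already collected for this publication):

	/*
	 * With publish_via_partition_root=on, skip a partition whose ancestor
	 * is also part of the same publication.
	 */
	if (pub->pubviaroot && get_rel_relispartition(relid))
	{
		List	   *ancestors = get_partition_ancestors(relid);
		ListCell   *lc;
		bool		skip = false;

		foreach(lc, ancestors)
		{
			if (list_member_oid(published_rels, lfirst_oid(lc)))
			{
				skip = true;
				break;
			}
		}
		if (skip)
			continue;
	}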


* The second issue[2]:
-
CREATE TABLE sale (sale_date date not null, country_code text, product_sku text,
  units integer) PARTITION BY RANGE (sale_date);
CREATE TABLE sale_201901 PARTITION OF sale FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');
CREATE TABLE sale_201902 PARTITION OF sale FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');

(1) PUB:  CREATE PUBLICATION pub FOR TABLE sale_201901, sale_201902 WITH (publish_via_partition_root=true);
(2) SUB:  CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres host=localhost port=5432' PUBLICATION pub;
(3) PUB:  INSERT INTO sale VALUES('2019-01-01', 'AU', 'cpu', 5), ('2019-01-02', 'AU', 'disk', 8);
(4) SUB:  SELECT * FROM sale;
(5) PUB:  ALTER PUBLICATION pub ADD TABLE sale;
(6) SUB:  ALTER SUBSCRIPTION sub REFRESH PUBLICATION;
(7) SUB:  SELECT * FROM sale;
-

In step (7), we can see duplication of data.

The reason is that the INSERTed data is first published through the partitions,
since initially there is no partitioned table in the publication (so
publish_via_partition_root=true doesn't have any effect). But then, after
adding the partitioned table to the publication and refreshing the publication
on the subscriber, the data is published again "using the identity and schema
of the partitioned table" due to publish_via_partition_root=true.
(Copied from Greg's analysis.)

Whether this behavior is correct is still under debate.


Overall, I think the second issue still needs further discussion, while for the
first issue it seems clear that most people think the behavior is unexpected.
So, I think it might be better to fix the first issue first.

[1] 
https://www.postgresql.org/message-id/OS0PR01MB57167F45D481F78CDC5986F794B99%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[2] 
https://www.postgresql.org/message-id/flat/CAJcOf-d8SWk3z3fJaLW9yuVux%3D2ESTsXOSdKzCq1O3AWBpgnMQ%40mail.gmail.com#fc96a42158b5e98ace26d077a6f7eac5

Best regards,
Hou zj


v4-0001-fix-data-double-published.patch
Description: v4-0001-fix-data-double-published.patch


RE: Added schema level support for publication.

2021-11-02 Thread houzj.f...@fujitsu.com
On Wed, Nov 3, 2021 12:25 PM vignesh C  wrote:
> On Wed, Nov 3, 2021 at 8:30 AM Amit Kapila 
> wrote:
> >
> > On Tue, Nov 2, 2021 at 8:13 PM Tomas Vondra
> >  wrote:
> > >
> > >
> > > >
> > > > Yeah, that is also true. So maybe at this, we can just rename the
> > > > few types as suggested by you and we can look at it later if we
> > > > anytime have more number of objects to add.
> > > >
> > >
> > > +1
> > >
> >
> > Apart from what you have pointed above, we are using
> > "DO_PUBLICATION_REL_IN_SCHEMA" in pg_dump. I think we should replace
> > that as well with "DO_PUBLICATION_TABLE_IN_SCHEMA".
> 
> Thanks for the comments, the attached patch has the changes for the same.
> 

Thanks for the patch.
I have only one minor comment:

+   PUBLICATIONOBJ_TABLE_IN_SCHEMA, /* Relations in schema type */

I think 'Relations' in the comment also needs to be changed to 'tables'.

The other parts of the patch look good to me.

Best regards,
Hou zj




RE: Added schema level support for publication.

2021-11-03 Thread houzj.f...@fujitsu.com
On Thurs, Nov 4, 2021 8:12 AM Peter Smith  wrote:
> FYI - I found a small problem with one of the new PublicationObjSpec parser
> error messages that was introduced by the recent schema publication commit
> [1].
> 
> The error message text is assuming that the error originates from CREATE
> PUBLICATION, but actually that same error can also come from ALTER
> PUBLICATION.
> e.g.2) Here the error came from ALTER PUBLICATION, so the message text is
> not OK because the ALTER syntax [2] does not even have a FOR keyword.
> 
> test_pub=# ALTER PUBLICATION p1 SET t1;
> 2021-11-04 10:51:53.912 AEDT [738] ERROR:  FOR TABLE/FOR ALL TABLES IN
> SCHEMA should be specified before the table/schema name(s) at character 26
> 2021-11-04 10:51:53.912 AEDT [738] STATEMENT:  ALTER PUBLICATION p1 SET
> t1;
> ERROR:  FOR TABLE/FOR ALL TABLES IN SCHEMA should be specified before
> the table/schema name(s) LINE 1: ALTER PUBLICATION p1 SET t1;

I think it might be better to report "TABLE/ALL TABLES IN SCHEMA should be
specified before ...".

Best regards,
Hou zj


RE: row filtering for logical replication

2021-11-03 Thread houzj.f...@fujitsu.com
On Wednesday, November 3, 2021 8:51 PM Ajin Cherian  wrote:
> On Tue, Nov 2, 2021 at 10:44 PM Ajin Cherian  wrote:
> .
> >
> > The patch 0005 and 0006 has not yet been rebased but will be updated
> > in a few days.
> >
> 
> Here's a rebase of all the 6 patches. Issue remaining:

Thanks for the patches.
I started to review the patches and here are a few comments.

1)
		/*
		 * ALTER PUBLICATION ... ADD TABLE provides a PublicationTable List
		 * (Relation, Where clause). ALTER PUBLICATION ... DROP TABLE provides
		 * a Relation List. Check the List element to be used.
		 */
		if (IsA(lfirst(lc), PublicationTable))
			whereclause = true;
		else
			whereclause = false;

I am not sure about the comment here; wouldn't it be better to always provide a
PublicationTable List, which would be more consistent?

2)
+	if ($3)
+	{
+		$$->pubtable->whereClause = $3;
+	}

It seems we can remove the if ($3) check here.


3)

+			oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+			rfnode = stringToNode(TextDatumGetCString(rfdatum));
+			exprstate = pgoutput_row_filter_init_expr(rfnode);
+			entry->exprstates = lappend(entry->exprstates, exprstate);
+			MemoryContextSwitchTo(oldctx);
+		}

Currently, the patch saves and executes each expression separately. I was
thinking it might be better if we could use "AND" to combine all the
expressions into one expression; then we can initialize and optimize the final
expression and execute it only once.
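A rough sketch of that suggestion (hypothetical; 'rfnodes' stands for the list
of row-filter expression trees collected for the relation, 'entry' is assumed
to gain a single exprstate field, and makeBoolExpr() builds the AND node):

	Node	   *rfnode;

	/* AND all the filters together and compile a single ExprState */
	if (list_length(rfnodes) > 1)
		rfnode = (Node *) makeBoolExpr(AND_EXPR, rfnodes, -1);
	else
		rfnode = (Node *) linitial(rfnodes);

	entry->exprstate = pgoutput_row_filter_init_expr(rfnode);

That way the combined expression would go through expression initialization
once, and only one ExprState would be evaluated per row.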

Best regards,
Hou zj


RE: parallel vacuum comments

2021-11-04 Thread houzj.f...@fujitsu.com
On Thur, Nov 4, 2021 1:25 PM Masahiko Sawada  wrote:
> On Wed, Nov 3, 2021 at 1:08 PM Amit Kapila  wrote:
> >
> > On Tue, Nov 2, 2021 at 11:17 AM Masahiko Sawada 
> > wrote:
> > >
> > > On Tue, Nov 2, 2021 at 5:57 AM Peter Geoghegan  wrote:
> > > >
> > >
> > > > Rather than inventing PARALLEL_VACUUM_KEY_INDVAC_CHECK (just for
> > > > assert-enabled builds), we should invent PARALLEL_VACUUM_STATS --
> > > > a dedicated shmem area for the array of LVSharedIndStats (no more
> > > > storing LVSharedIndStats entries at the end of the LVShared space
> > > > in an ad-hoc, type unsafe way). There should be one array element
> > > > for each and every index -- even those indexes where parallel
> > > > index vacuuming is unsafe or not worthwhile (unsure if avoiding
> > > > parallel processing for "not worthwhile" indexes actually makes
> > > > sense, BTW). We can then get rid of the bitmap/IndStatsIsNull()
> > > > stuff entirely. We'd also add new per-index state fields to
> > > > LVSharedIndStats itself. We could directly record the status of
> > > > each index (e.g., parallel unsafe, amvacuumcleanup processing
> > > > done, ambulkdelete processing done) explicitly. All code could
> > > > safely subscript the LVSharedIndStats array directly, using idx
> > > > style integers. That seems far more robust and consistent.
> > >
> > > Sounds good.
> > >
> > > During the development, I wrote the patch while considering using
> > > fewer shared memory but it seems that it brought complexity (and
> > > therefore the bug). It would not be harmful even if we allocate
> > > index statistics on DSM for unsafe indexes and “not worthwhile"
> > > indexes in practice.
> > >
> >
> > If we want to allocate index stats for all indexes in DSM then why not
> > consider it on the lines of buf/wal_usage means tack those via
> > LVParallelState? And probably replace bitmap with an array of bools
> > that indicates which indexes can be skipped by the parallel worker.
> >
> 
> I've attached a draft patch. The patch incorporated all comments from Andres
> except for the last comment that moves parallel related code to another file.
> I'd like to discuss how we split vacuumlazy.c.

Hi,

I was recently reading the parallel vacuum code, and I think the patch brings a
definite improvement.

Here are a few minor comments about it.

1)

+	 * Reset all index status back to invalid (while checking that we have
+	 * processed all indexes).
+	 */
+	for (int i = 0; i < vacrel->nindexes; i++)
+	{
+		LVSharedIndStats *stats = &(lps->lvsharedindstats[i]);
+
+		Assert(stats->status == INDVAC_STATUS_COMPLETED);
+		stats->status = INDVAC_STATUS_INITIAL;
+	}

Do you think it might be clearer to report an error here?
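i.e., something along these lines (a sketch only):

+	for (int i = 0; i < vacrel->nindexes; i++)
+	{
+		LVSharedIndStats *stats = &(lps->lvsharedindstats[i]);
+
+		if (stats->status != INDVAC_STATUS_COMPLETED)
+			elog(ERROR, "parallel vacuum failed to process index %d", i);
+
+		stats->status = INDVAC_STATUS_INITIAL;
+	}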

2)

+prepare_parallel_index_processing(LVRelState *vacrel, bool vacuum)

For the second parameter, 'vacuum': would it be clearer if we passed a
LVIndVacStatus type instead of a boolean value?

Best regards,
Hou zj


RE: Added schema level support for publication.

2021-11-05 Thread houzj.f...@fujitsu.com
> On Thu, Nov 4, 2021 at 3:24 PM vignesh C  wrote:
> >
> > On Thu, Nov 4, 2021 at 5:41 AM Peter Smith 
> wrote:
> > >
> > > FYI - I found a small problem with one of the new PublicationObjSpec
> > > parser error messages that was introduced by the recent schema
> > > publication commit [1].
> > >
> > > The error message text is assuming that the error originates from
> > > CREATE PUBLICATION, but actually that same error can also come from
> > > ALTER PUBLICATION.
> > >
> > > For example,
> > >
> > > e.g.1) Here the error came from CREATE PUBLICATION, so the message
> > > text looks OK.
> > >
> > > test_pub=# CREATE PUBLICATION p1 FOR t1;
> > > 2021-11-04 10:50:17.208 AEDT [738] ERROR:  FOR TABLE/FOR ALL TABLES
> > > IN SCHEMA should be specified before the table/schema name(s) at
> > > character 27
> > > 2021-11-04 10:50:17.208 AEDT [738] STATEMENT:  CREATE PUBLICATION p1
> > > FOR t1;
> > > ERROR:  FOR TABLE/FOR ALL TABLES IN SCHEMA should be specified
> > > before the table/schema name(s) LINE 1: CREATE PUBLICATION p1 FOR
> > > t1;
> > >   ^
> > > ~~
> > >
> > > e.g.2) Here the error came from ALTER PUBLICATION, so the message
> > > text is not OK because the ALTER syntax [2] does not even have a FOR
> > > keyword.
> > >
> > > test_pub=# ALTER PUBLICATION p1 SET t1;
> > > 2021-11-04 10:51:53.912 AEDT [738] ERROR:  FOR TABLE/FOR ALL TABLES
> > > IN SCHEMA should be specified before the table/schema name(s) at
> > > character 26
> > > 2021-11-04 10:51:53.912 AEDT [738] STATEMENT:  ALTER PUBLICATION p1
> > > SET t1;
> > > ERROR:  FOR TABLE/FOR ALL TABLES IN SCHEMA should be specified
> > > before the table/schema name(s) LINE 1: ALTER PUBLICATION p1 SET t1;
> > >  ^
> >
> > Thanks for the comment, I changed the error message to remove the FOR
> > keyword. The attached patch has the changes for the same.
> >
> 
> LGTM.

+1

Best regards,
Hou zj


RE: row filtering for logical replication

2021-11-07 Thread houzj.f...@fujitsu.com
On Fri, Nov 5, 2021 1:14 PM Peter Smith  wrote:
> PSA new set of v37* patches.

Thanks for updating the patches.
Few comments:

1) v37-0001

I think it might be better to also show the filter expression in the '\d+
tablename' output, after the publication description.

2) v37-0004

+   /* Scan the expression tree for referenceable objects */
+   find_expr_references_walker(expr, &context);
+
+   /* Remove any duplicates */
+   eliminate_duplicate_dependencies(context.addrs);
+

The 0004 patch currently uses find_expr_references_walker to get all the
referenced objects. I am wondering whether we only need to get the columns in
the expression? I think maybe we can check the replica identity, like [1].

3) v37-0005

- no parse nodes of any kind other than Var, OpExpr, Const, BoolExpr, FuncExpr

I think there could be other node types which can also be considered simple
expressions, for example T_NullIfExpr.

Personally, I think it's natural to only check IMMUTABLE and whether the
function is user-defined in the new function rowfilter_walker. We can keep the
other row-filter errors which were thrown for EXPR_KIND_PUBLICATION_WHERE in
the 0001 patch.

[1]
rowfilter_expr_checker
...
	if (replica_identity == REPLICA_IDENTITY_DEFAULT)
		context.bms_replident = RelationGetIndexAttrBitmap(rel, INDEX_ATTR_BITMAP_PRIMARY_KEY);
	else
		context.bms_replident = RelationGetIndexAttrBitmap(rel, INDEX_ATTR_BITMAP_IDENTITY_KEY);

	(void) rowfilter_expr_replident_walker(rfnode, &context);

...
static bool
rowfilter_expr_replident_walker(Node *node, rf_context *context)
{
	if (node == NULL)
		return false;

	if (IsA(node, Var))
	{
		Oid			relid = RelationGetRelid(context->rel);
		Var		   *var = (Var *) node;
		AttrNumber	attnum = var->varattno - FirstLowInvalidHeapAttributeNumber;

		if (!bms_is_member(attnum, context->bms_replident))
		{
			const char *colname = get_attname(relid, attnum, false);

			ereport(ERROR,
					(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
					 errmsg("cannot add relation \"%s\" to publication",
							RelationGetRelationName(context->rel)),
					 errdetail("Row filter column \"%s\" is not part of the REPLICA IDENTITY",
							   colname)));

			return false;
		}

		return true;
	}

	return expression_tree_walker(node, rowfilter_expr_replident_walker,
								  (void *) context);
}

Best regards,
Hou zj



RE: row filtering for logical replication

2021-11-09 Thread houzj.f...@fujitsu.com
On Fri, Nov 5, 2021 4:49 PM Amit Kapila  wrote:
> On Fri, Nov 5, 2021 at 10:44 AM Peter Smith  wrote:
> >
> > PSA new set of v37* patches.
> 3.
> - | ColId
> + | ColId OptWhereClause
>   {
>   $$ = makeNode(PublicationObjSpec);
>   $$->pubobjtype = PUBLICATIONOBJ_CONTINUATION;
> - $$->name = $1;
> + if ($2)
> + {
> + $$->pubtable = makeNode(PublicationTable); $$->pubtable->relation =
> + makeRangeVar(NULL, $1, @1); $$->pubtable->whereClause = $2; } else {
> + $$->name = $1; }
> 
> Again this doesn't appear to be the right way. I think this should be handled 
> at
> a later point.

I think the difficulty in handling this at a later point is that we need to
make sure we don't lose the whereclause. Currently, we can only save the
whereclause in the PublicationTable structure, and PublicationTable is only
used for TABLE, but '| ColId' can be used for either a SCHEMA or a TABLE. We
cannot distinguish the actual type at this stage, so we always need to save the
whereclause if it's not NULL.

I think the possible approaches to delay this check are:

(1) We can delete the PublicationTable structure and put all the vars
(relation, whereclause) in PublicationObjSpec. In this approach, we don't need
to check whether the whereclause is NULL in '| ColId'; we can check this at a
later point.

Or

(2) Add a new pattern for whereclause in PublicationObjSpec:

The change could be:

PublicationObjSpec:
...
| ColId
... 
+ | ColId WHERE '(' a_expr ')'
+ {
+ $$ = makeNode(PublicationObjSpec);
+ $$->pubobjtype = PUBLICATIONOBJ_CONTINUATION;
+ $$->pubtable = makeNode(PublicationTable);
+ $$->pubtable->relation = makeRangeVar(NULL, $1, @1);
+ $$->pubtable->whereClause = $4;
+ }

In this approach, we also don't need the "if ($2)" check.

What do you think?

Best regards,
Hou zj


RE: row filtering for logical replication

2021-11-09 Thread houzj.f...@fujitsu.com
On Thur, Nov 4, 2021 10:47 AM Peter Smith  wrote:
> PROPOSAL
> 
> I propose that we change the way duplicate tables are processed to make it so
> that it is always the *last* one that takes effect (instead of the *first* 
> one). AFAIK
> doing this won't affect any current PG behaviour, but doing this will let the 
> new
> row-filter feature work in a consistent/predictable/sane way.
> 
> Thoughts?

Having the last one take effect sounds reasonable to me.

OTOH, I think we should make the behavior here consistent with the Column
Filter patch in the other thread. IIRC, in the current column filter patch,
only the first one's filter takes effect. So, it may be better to get Rahila's
and Alvaro's thoughts on this.

Best regards,
Hou zj



RE: row filtering for logical replication

2021-11-09 Thread houzj.f...@fujitsu.com
On Wed, Nov 10, 2021 10:48 AM Amit Kapila  wrote:
> On Tue, Nov 9, 2021 at 2:22 PM houzj.f...@fujitsu.com wrote:
> >
> > On Fri, Nov 5, 2021 4:49 PM Amit Kapila  wrote:
> > > On Fri, Nov 5, 2021 at 10:44 AM Peter Smith  wrote:
> > > >
> > > > PSA new set of v37* patches.
> > > 3.
> > > - | ColId
> > > + | ColId OptWhereClause
> > >   {
> > >   $$ = makeNode(PublicationObjSpec);
> > >   $$->pubobjtype = PUBLICATIONOBJ_CONTINUATION;
> > > - $$->name = $1;
> > > + if ($2)
> > > + {
> > > + $$->pubtable = makeNode(PublicationTable); $$->pubtable->relation
> > > + = makeRangeVar(NULL, $1, @1); $$->pubtable->whereClause = $2; }
> > > + else { $$->name = $1; }
> > >
> > > Again this doesn't appear to be the right way. I think this should
> > > be handled at a later point.
> >
> > I think the difficulty to handle this at a later point is that we need
> > to make sure we don't lose the whereclause. Currently, we can only
> > save the whereclause in PublicationTable structure and the
> > PublicationTable is only used for TABLE, but '| ColId' can be used for
> > either a SCHEMA or TABLE. We cannot distinguish the actual type at
> > this stage, so we always need to save the whereclause if it's NOT NULL.
> >
> 
> I see your point. But, I think we can add some comments here indicating that
> the user might have mistakenly given where clause with some schema which we
> will identify later and give an appropriate error. Then, in
> preprocess_pubobj_list(), identify if the user has given the where clause with
> schema name and give an appropriate error.
> 

OK. IIRC, in this approach, we need to set both $$->name and $$->pubtable in
'| ColId OptWhereClause'. And in preprocess_pubobj_list, we can add a check for
the case where both name and pubtable are not NULL.

the grammar code could be:

| ColId OptWhereClause
{
$$ = makeNode(PublicationObjSpec);
$$->pubobjtype = PUBLICATIONOBJ_CONTINUATION;

$$->name = $1;
+   /* xxx */
+   $$->pubtable = makeNode(PublicationTable);
+   $$->pubtable->relation = makeRangeVar(NULL, $1, @1);
+   $$->pubtable->whereClause = $2;
$$->location = @1;
}

preprocess_pubobj_list
...
	else if (pubobj->pubobjtype == PUBLICATIONOBJ_REL_IN_SCHEMA ||
			 pubobj->pubobjtype == PUBLICATIONOBJ_CURRSCHEMA)
	{
	...
+		if (pubobj->name &&
+			(!pubobj->pubtable || !pubobj->pubtable->whereClause))
			pubobj->pubobjtype = PUBLICATIONOBJ_REL_IN_SCHEMA;
		else if (!pubobj->name && !pubobj->pubtable)
			pubobj->pubobjtype = PUBLICATIONOBJ_CURRSCHEMA;
		else
			ereport(ERROR,
					errcode(ERRCODE_SYNTAX_ERROR),
					errmsg("invalid schema name at or near"),
					parser_errposition(pubobj->location));
	}


Best regards,
Hou zj


RE: [BUG]Invalidate relcache when setting REPLICA IDENTITY

2021-11-10 Thread houzj.f...@fujitsu.com
On Wed, Nov 10, 2021 7:29 PM Amit Kapila  wrote:
> On Wed, Nov 10, 2021 at 12:42 PM tanghy.f...@fujitsu.com
>  wrote:
> >
> > Hi
> >
> > I think I found a bug related to logical replication(REPLICA IDENTITY in
> specific).
> > If I change REPLICA IDENTITY after creating publication,  the
> DELETE/UPDATE operations won't be replicated as expected.
> >
> > For example:
> > -- publisher
> > CREATE TABLE tbl(a int, b int);
> > ALTER TABLE tbl ALTER COLUMN a SET NOT NULL; CREATE UNIQUE INDEX
> idx_a
> > ON tbl(a); ALTER TABLE tbl ALTER COLUMN b SET NOT NULL; CREATE UNIQUE
> > INDEX idx_b ON tbl(b); ALTER TABLE tbl REPLICA IDENTITY USING INDEX
> > idx_a; CREATE PUBLICATION pub FOR TABLE tbl; ALTER TABLE tbl REPLICA
> > IDENTITY USING INDEX idx_b; INSERT INTO tbl VALUES (1,1);
> >
> > -- subscriber
> > CREATE TABLE tbl(a int, b int);
> > ALTER TABLE tbl ALTER COLUMN b SET NOT NULL; CREATE UNIQUE INDEX
> idx_b
> > ON tbl(b); ALTER TABLE tbl REPLICA IDENTITY USING INDEX idx_b; CREATE
> > SUBSCRIPTION sub CONNECTION 'dbname=postgres port=5432'
> PUBLICATION
> > pub; SELECT * FROM tbl;
> >
> > -- publisher
> > postgres=# UPDATE tbl SET a=-a;
> > UPDATE 1
> > postgres=# SELECT * FROM tbl;
> > a  | b
> > +---
> >  -1 | 1
> > (1 row)
> >
> > -- subscriber
> > postgres=# SELECT * FROM tbl;
> >  a | b
> > ---+---
> >  1 | 1
> > (1 row)
> >
> > (a in subscriber should be -1)
> >
> 
> I don't understand the purpose of idx_b in the above test case, why is it
> required to reproduce the problem?
> @@ -15488,6 +15488,7 @@ relation_mark_replica_identity(Relation rel, char
> ri_type, Oid indexOid,
>   CatalogTupleUpdate(pg_index, &pg_index_tuple->t_self, pg_index_tuple);
>   InvokeObjectPostAlterHookArg(IndexRelationId, thisIndexOid, 0,
>   InvalidOid, is_internal);
> + CacheInvalidateRelcache(rel);
> 
> CatalogTupleUpdate internally calls heap_update which calls
> CacheInvalidateHeapTuple(), why is that not sufficient for invalidation?

I think it's because the bug happens only when changing the REPLICA IDENTITY
index from one (idx_a) to another (idx_b).

When the REPLICA IDENTITY index is specified for the first time, the code works
well because it invokes 'CatalogTupleUpdate(pg_class, ...)', which invalidates
the target table's relcache.

	if (pg_class_form->relreplident != ri_type)
	{
		pg_class_form->relreplident = ri_type;
		CatalogTupleUpdate(pg_class, &pg_class_tuple->t_self, pg_class_tuple);
	}

But when changing the REPLICA IDENTITY index, the code only invokes
'CatalogTupleUpdate(pg_index, ...)', which seems to invalidate only the index's
relcache, not the target table's.

	if (dirty)
	{
		CatalogTupleUpdate(pg_index, &pg_index_tuple->t_self, pg_index_tuple);
		InvokeObjectPostAlterHookArg(IndexRelationId, thisIndexOid, 0,
									 InvalidOid, is_internal);
	}
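To make the suggestion concrete, a sketch of where the extra invalidation could
go (matching the CacheInvalidateRelcache() call in the patch quoted above):

	if (dirty)
	{
		CatalogTupleUpdate(pg_index, &pg_index_tuple->t_self, pg_index_tuple);
		InvokeObjectPostAlterHookArg(IndexRelationId, thisIndexOid, 0,
									 InvalidOid, is_internal);
		/* also force the table's relcache entry to be rebuilt */
		CacheInvalidateRelcache(rel);
	}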

Best regards,
Hou zj

