Re: UniqueKey v2
Antonin Houska wrote:
> Andy Fan wrote:
> > > * Combining the UKs
> > >
> > > IMO this is the most problematic part of the patch. You call
> > > populate_joinrel_uniquekeys() for the same join multiple times,
> >
> > Why do you think so? The below code is called in "make_join_rel"
>
> Consider join of tables "a", "b" and "c". My understanding is that
> make_join_rel() is called once with rel1={a} and rel2={b join c}, then with
> rel1={a join b} and rel2={c}, etc. I wanted to say that each call should
> produce the same set of unique keys.
>
> I need to check this part more in detail.

I think I understand now. By calling populate_joinrel_uniquekeys() for various
orderings, you can find out that various input relation unique keys can
represent the whole join. For example, if the ordering is A JOIN (B JOIN C),
you can prove that the unique keys of A can be used for the whole join, while
for the ordering B JOIN (A JOIN C) you can prove the same for the unique keys
of B, and so on.

> > Is your original question about populate_joinrel_uniquekey_for_rel
> > rather than populate_joinrel_uniquekeys? We have the below code:
> >
> >     outeruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, outerrel,
> >                                                              innerrel, restrictlist);
> >     inneruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, innerrel,
> >                                                              outerrel, restrictlist);
> >
> > This is mainly because of the following theory, quoted from
> > README.uniquekey. Let's call this "rule 1".
> >
> > """
> > How to maintain the uniquekey?
> > ---
> > .. From the join relation, it is maintained with two rules:
> >
> > - the uniquekey in one side is still unique if it can't be duplicated
> >   after the join. For example:
> >
> >     SELECT t1.pk FROM t1 JOIN t2 ON t1.a = t2.pk;
> >     UniqueKey on t1: t1.pk
> >     UniqueKey on t1 Join t2: t1.pk
> > """
> >
> > And the below code:
> >
> >     if (outeruk_still_valid || inneruk_still_valid)
> >
> >         /*
> >          * The uniquekeys on the outer or inner side have been added into
> >          * joinrel, so the combined uniquekey from both sides is not needed.
> >          */
> >         return;
> >
> > We don't create the component uniquekey if either of the two sides is
> > unique already. For example: if "(t1.id) in joinrel(t1, t2) is unique" OR
> > "(t2.id) in joinrel is unique", there is no need to create the component
> > UniqueKey (t1.id, t2.id).

ok, I need to check more in detail how this part works. This optimization
makes sense to me.

> > > Of course one problem is that the number of combinations can grow
> > > exponentially as new relations are joined.
> >
> > Yes, that's why "rule 1" is needed and "How to reduce the overhead" in
> > README.uniquekey is introduced.

I think there should yet be some guarantee that the number of unique keys does
not grow exponentially. Perhaps a constant that allows a relation (base or
join) to have at most N unique keys. (I imagine N to be rather small, e.g. 3
or 4.) And when picking the "best N keys", one criterion could be the number
of expressions in the key (the shorter the key, the better).

> > > 2) Check if the join relation is single-row
> > >
> > > In order to get rid of the dependency on 'restrictlist', I think you
> > > can use ECs. Consider a query from your regression tests:
> > >
> > > CREATE TABLE uk_t (id int primary key, a int not null, b int not null,
> > > c int, d int, e int);
> > >
> > > SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and
> > > t1.id = 1;
> > >
> > > The idea here seems to be that no more than one row of t1 matches the
> > > query clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id
> > > ensures that no more than one row of t2 matches the query (because t1
> > > cannot provide the clause with more than one input value of 'e'). And
> > > therefore, the join also produces at most one row.
> >
> > You are correct and IMO my current code is able to tell it is a single
> > row as well.
> >
> > 1. Since t1.id = 1, t1 is a single row, so t1.d is a uniquekey as a
> > consequence.
> > 2. Given t2.id is unique and t1.e = t2.id, t1's uniquekey can be kept
> > after the join because of rule 1 on the joinrel; and t1 is a single row,
> > so the joinrel is a single row as well.

ok, I think I understand now.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
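The "at most N unique keys" suggestion above could be sketched roughly as
follows. This is only an illustration of the idea, not part of the patch:
UNIQUEKEY_MAX and uniquekey_length_cmp() are hypothetical names, while
list_sort() and list_truncate() are the existing List API.

```c
/*
 * Hypothetical pruning: keep at most UNIQUEKEY_MAX unique keys per
 * relation, preferring keys with fewer expressions.
 */
#define UNIQUEKEY_MAX	4		/* made-up constant */

static List *
prune_uniquekeys(List *uniquekeys)
{
	if (list_length(uniquekeys) <= UNIQUEKEY_MAX)
		return uniquekeys;

	/* sort by number of expressions in the key, ascending (hypothetical) */
	list_sort(uniquekeys, uniquekey_length_cmp);

	/* discard everything beyond the first UNIQUEKEY_MAX entries */
	return list_truncate(uniquekeys, UNIQUEKEY_MAX);
}
```

The comparator could of course use other criteria too (e.g. prefer keys made
of plain Vars over expressions); the point is only that the list is bounded.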
Re: UniqueKey v2
> t', I think you can use ECs.", let's see what we can improve.
>
> > My theory is that a relation is single-row if it has a UK such that each
> > of its ECs meets at least one of the following conditions:
> >
> > a) contains a constant
>
> True.
>
> > b) contains a column of a relation which has already been proven
> > single-row.
>
> True, not sure if it is easy to tell.
>
> > c) is referenced by a UK of a relation which has already been proven
> > single-row.
>
> I can't follow here...

This is similar to an EC containing a constant: if an EC is used by a
single-row UK, all its members can only have a single value.

> > I think that in the example above, an EC {t1.e, t2.id} should exist. So
> > when checking whether 't2' is single-row, the condition b) can be used:
> > the UK of 't2' should reference the EC {t1.e, t2.id}, which in turn
> > contains the column t1.e. And 't1' is unique because its EC meets the
> > condition a). (Of course you need to check em_jdomain before you use a
> > particular EM.)
>
> I think the existing rule 1 for joinrel works well with the single-row
> case naturally; what can be improved if we add the theory you suggested
> here?

This is still the explanation of the idea how to mark a join unique key as
single-row separately from the other logic. As noted above, I need to learn
more about the unique keys of a join.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
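The proposed single-row test (conditions a, b, c above) could be sketched as
pseudo-C. All helper names and the iteration macros here are hypothetical,
invented only to show the control flow; nothing like them exists in the
patch:

```c
/*
 * Sketch only: does 'rel' provably produce at most one row?
 * Every helper below is hypothetical.
 */
static bool
rel_is_single_row(PlannerInfo *root, RelOptInfo *rel)
{
	foreach_uk(uk, rel->uniquekeys)		/* hypothetical iteration macro */
	{
		bool		all_ecs_ok = true;

		foreach_ec(ec, uk)				/* iterate the UK's ECs */
		{
			if (ec_contains_const(ec))
				continue;		/* condition a) */
			if (ec_contains_column_of_single_row_rel(root, ec))
				continue;		/* condition b) */
			if (ec_referenced_by_single_row_uk(root, ec))
				continue;		/* condition c) */

			all_ecs_ok = false;
			break;
		}

		if (all_ecs_ok)
			return true;		/* this UK proves single-row-ness */
	}
	return false;
}
```

Since condition b) and c) refer to relations "already proven single-row",
the real implementation would presumably have to iterate to a fixed point,
or process relations in join order.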
Join removal and attr_needed cleanup
The attached patch tries to fix a corner case where attr_needed of an inner
relation of an OJ contains the join relid only because another,
already-removed OJ needed some of its attributes. The unnecessary presence of
the join relid in attr_needed can prevent the planner from further join
removals. Do cases like this seem worth the effort, and is the logic I use
correct?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

>From 279856bf97ce08c0c2e0c736a00831bf6324713b Mon Sep 17 00:00:00 2001
From: Antonin Houska
Date: Mon, 29 Apr 2024 11:34:30 +0200
Subject: [PATCH] Cleanup attr_needed properly after join removal.

If an outer join (J1) was removed and if the remaining outer relation is also
an outer join (J2), the inner relation of J2 may still have the J2's relid in
the attr_needed set, because its attribute was referenced by J1. However,
after removal of J1, it's possible that no other join (nor the query
targetlist) actually needs any attribute of J2, and so the J2's relid should
not be in attr_needed anymore. This patch tries to spend more effort on
checking the contents of attr_needed after removal of J1 so that J2 can
potentially be removed as well.
---
 src/backend/optimizer/plan/analyzejoins.c | 117 +-
 src/test/regress/expected/join.out        |   9 ++
 src/test/regress/sql/join.sql             |   5 +
 3 files changed, 130 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 506fccd20c..1b9623efaa 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -46,6 +46,10 @@ bool enable_self_join_removal;

 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool innerrel_attrs_needed_above_outer_join(PlannerInfo *root,
+												   RelOptInfo *innerrel,
+												   SpecialJoinInfo *sjinfo,
+												   Relids joinrelids);
 static void remove_leftjoinrel_from_query(PlannerInfo *root, int relid,
 										  SpecialJoinInfo *sjinfo);
 static void remove_rel_from_restrictinfo(RestrictInfo *rinfo,
@@ -183,6 +187,8 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	List	   *clause_list = NIL;
 	ListCell   *l;
 	int			attroff;
+	Oid			remove_ojrelid;
+	bool		remove_ojrelid_valid;

 	/*
 	 * Must be a left join to a single baserel, else we aren't going to be
@@ -218,6 +224,13 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	joinrelids = bms_copy(inputrelids);
 	joinrelids = bms_add_member(joinrelids, sjinfo->ojrelid);

+	/*
+	 * By default there's no reason to remove sjinfo->ojrelid from
+	 * attr_needed, see below.
+	 */
+	remove_ojrelid = InvalidOid;
+	remove_ojrelid_valid = false;
+
 	/*
 	 * We can't remove the join if any inner-rel attributes are used above the
 	 * join. Here, "above" the join includes pushed-down conditions, so we
@@ -233,7 +246,38 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 		 attroff >= 0;
 		 attroff--)
 	{
-		if (!bms_is_subset(innerrel->attr_needed[attroff], inputrelids))
+		Relids		attr_needed = innerrel->attr_needed[attroff];
+
+		/*
+		 * If this join was among outer relations of an already-removed join,
+		 * attr_needed may still contain the relid of the current join because
+		 * join clauses of the removed join referenced it. Re-check if another
+		 * join can still reference this join and if not, remove it from
+		 * attr_needed.
+		 */
+		if (bms_is_member(sjinfo->ojrelid, attr_needed))
+		{
+			/* Do the following check only once per inner relation. */
+			if (!remove_ojrelid_valid)
+			{
+				/*
+				 * If no clause requires the join relid anymore, remember that
+				 * we should remove it from attr_needed.
+				 */
+				if (!innerrel_attrs_needed_above_outer_join(root, innerrel,
+															sjinfo,
+															joinrelids))
+					remove_ojrelid = sjinfo->ojrelid;
+
+				remove_ojrelid_valid = true;
+			}
+
+			if (OidIsValid(remove_ojrelid))
+				innerrel->attr_needed[attroff] = bms_del_member(attr_needed,
+																remove_ojrelid);
+		}
+
+		if (!bms_is_subset(attr_needed, inputrelids))
 			return false;
 	}
@@ -333,6 +377,77 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	return false;
 }

+/*
+ * innerrel_attrs_needed_above_outer_join
+ *		Re-check whether attributes of inner relation of an OJ are still
+ *		needed above that OJ.
+ */
+static bool
+innerrel_attrs_needed_above_outer_join(PlannerInfo *root, RelOptInfo *innerrel,
+									   SpecialJoinInfo *sjinfo,
+									   Relids joinrelids)
+{
+	bool		found = false;
+	ListCell   *l;
+
+	/* Check the join clauses */
+	foreach(l, innerrel->joininfo)
+	{
+		RestrictInfo *ri = lfirst_node(RestrictInfo, l);
+
+		if (bms_is_member(sjinfo->ojrelid, ri->required_relids))
+			return true;
+	}
+
+	/*
+	 * Also check if any clause generated from EC
Use "unique keys" to enhance outer join removal
While the "unique keys" feature [1] is still under development, I'm thinking
how it could be used to enhance the removal of useless outer joins. Is
something really bad about the 0002 patch attached? I recognize it may be
weird that a join relation possibly produces non-join paths (e.g. SeqScan),
but right now I don't have a better idea where the new code should appear.

I considered planning the subqueries before the existing call of
remove_useless_joins(), to make the unique keys available earlier. However,
it seems that more items can be added to 'baserestrictinfo' of the subquery
relation after that. Thus, by planning the subquery too early, we could miss
some opportunities to push clauses down to the subquery.

Please note that this patch depends on [1], which enhances
rel_is_distinct_for() and thus makes join_is_removable() a bit smarter. Also
note that 0001 is actually a minor fix to [1].

[1] https://www.postgresql.org/message-id/7971.1713526758%40antos

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

>From a0be4ee7698ff03d6c22398f20fd2c7efadbff45 Mon Sep 17 00:00:00 2001
From: Antonin Houska
Date: Mon, 29 Apr 2024 07:53:00 +0200
Subject: [PATCH 1/2] Undo comment changes.

These belong to the following patch.
---
 src/test/regress/expected/join.out | 11 +--
 src/test/regress/sql/join.sql      | 13 ++---
 2 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 571198a86a..fa407d37f5 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5648,12 +5648,11 @@ select d.* from d left join (select distinct * from b) s
  Seq Scan on d
 (1 row)

--- when the GROUP BY contains a column that is not in the join condition, join
--- removal takes place too, due to "unique keys" (Note: as of 9.6, we notice
--- that b.id is a primary key and so drop b.c_id from the GROUP BY of the
--- resulting plan; but this happens too late for join removal in the outer
--- plan level.)
-explain (costs off, verbose on)
+-- join removal is not possible when the GROUP BY contains a column that is
+-- not in the join condition. (Note: as of 9.6, we notice that b.id is a
+-- primary key and so drop b.c_id from the GROUP BY of the resulting plan;
+-- but this happens too late for join removal in the outer plan level.)
+explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
   on d.a = s.id;
                         QUERY PLAN
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index bf8fb27072..c4c6c7b8ba 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -2047,17 +2047,16 @@ explain (costs off)
 select d.* from d left join (select distinct * from b) s
   on d.a = s.id and d.b = s.c_id;

--- when the GROUP BY contains a column that is not in the join condition, join
--- removal takes place too, due to "unique keys" (Note: as of 9.6, we notice
--- that b.id is a primary key and so drop b.c_id from the GROUP BY of the
--- resulting plan; but this happens too late for join removal in the outer
--- plan level.)
-explain (costs off, verbose on)
+-- join removal is not possible when the GROUP BY contains a column that is
+-- not in the join condition. (Note: as of 9.6, we notice that b.id is a
+-- primary key and so drop b.c_id from the GROUP BY of the resulting plan;
+-- but this happens too late for join removal in the outer plan level.)
+explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
   on d.a = s.id;

 -- similarly, but keying off a DISTINCT clause
-explain (costs off, verbose on)
+explain (costs off)
 select d.* from d left join (select distinct * from b) s
   on d.a = s.id;
--
2.44.0

>From cb0166a3e1f1f5ff88634c38c2474de16825625a Mon Sep 17 00:00:00 2001
From: Antonin Houska
Date: Mon, 29 Apr 2024 08:24:42 +0200
Subject: [PATCH 2/2] Use Unique Keys to remove a useless left join.

To consider an outer join useless, we need to prove that the inner relation
is unique for given join clause(s). We may be unable to do that at an early
stage of planning, especially if the inner relation is a subquery. The new
"unique keys" feature (still under development) can resolve this problem by
re-trying the removal at a later stage, when the unique keys have (possibly)
been initialized.
---
 src/backend/optimizer/path/joinpath.c     | 35
 src/backend/optimizer/path/joinrels.c     | 10 +
 src/backend/optimizer/plan/analyzejoins.c |  4 +-
 src/include/optimizer/paths.h             |  5 +++
 src/include/optimizer/planmain.h          |  1 +
 src/test/regress/expected/join.out        | 50 +++
 src/test/regress/sql/join.sql             | 13 +++---
 7 files changed, 73 insertions(+), 45 deletions(-)

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimize
Re: UniqueKey v2
niqueness.

2) Check if the join relation is single-row

In order to get rid of the dependency on 'restrictlist', I think you can use
ECs. Consider a query from your regression tests:

CREATE TABLE uk_t (id int primary key, a int not null, b int not null, c int,
d int, e int);

SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and t1.id = 1;

The idea here seems to be that no more than one row of t1 matches the query
clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id ensures that
no more than one row of t2 matches the query (because t1 cannot provide the
clause with more than one input value of 'e'). And therefore, the join also
produces at most one row.

My theory is that a relation is single-row if it has a UK such that each of
its ECs meets at least one of the following conditions:

a) contains a constant

b) contains a column of a relation which has already been proven single-row.

c) is referenced by a UK of a relation which has already been proven
single-row.

I think that in the example above, an EC {t1.e, t2.id} should exist. So when
checking whether 't2' is single-row, the condition b) can be used: the UK of
't2' should reference the EC {t1.e, t2.id}, which in turn contains the column
t1.e. And 't1' is unique because its EC meets the condition a). (Of course
you need to check em_jdomain before you use a particular EM.)

Are you going to submit the patch to the first CF of PG 18? Please let me
know if I can contribute to the effort by reviewing or writing some code.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

>From 9fe85dae62ae580f31935ee399e7f46dc99b Mon Sep 17 00:00:00 2001
From: Antonin Houska
Date: Thu, 18 Apr 2024 14:43:02 +0200
Subject: [PATCH 1/5] Unique keys rebased.
The original patch:
https://www.postgresql.org/message-id/871qczm9oc.fsf%40163.com
---
 src/backend/nodes/list.c                    |  17 +
 src/backend/optimizer/path/Makefile         |   3 +-
 src/backend/optimizer/path/README.uniquekey | 220 +++
 src/backend/optimizer/path/allpaths.c       |   2 +
 src/backend/optimizer/path/equivclass.c     |  48 ++
 src/backend/optimizer/path/joinrels.c       |   2 +
 src/backend/optimizer/path/uniquekey.c      | 654
 src/backend/optimizer/plan/initsplan.c      |  10 +-
 src/backend/optimizer/plan/planner.c        |  33 +
 src/backend/optimizer/util/plancat.c        |  10 +
 src/backend/optimizer/util/relnode.c        |   2 +
 src/include/nodes/pathnodes.h               |  19 +
 src/include/nodes/pg_list.h                 |   2 +
 src/include/optimizer/paths.h               |  13 +
 src/test/regress/expected/join.out          |  11 +-
 src/test/regress/expected/uniquekey.out     | 268
 src/test/regress/parallel_schedule          |   2 +-
 src/test/regress/sql/uniquekey.sql          | 104
 18 files changed, 1410 insertions(+), 10 deletions(-)
 create mode 100644 src/backend/optimizer/path/README.uniquekey
 create mode 100644 src/backend/optimizer/path/uniquekey.c
 create mode 100644 src/test/regress/expected/uniquekey.out
 create mode 100644 src/test/regress/sql/uniquekey.sql

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index e2615ab105..20eb1267eb 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -695,6 +695,23 @@ list_member_ptr(const List *list, const void *datum)
 	return false;
 }

+/*
+ * Return the index of the datum in the list if found, otherwise return -1.
+ */
+int
+list_member_ptr_pos(const List *list, const void *datum)
+{
+	ListCell   *lc;
+
+	foreach(lc, list)
+	{
+		if (lfirst(lc) == datum)
+			return foreach_current_index(lc);
+	}
+
+	return -1;
+}
+
 /*
  * Return true iff the integer 'datum' is a member of the list.
  */
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..63cc1505d9 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekey.o

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/README.uniquekey b/src/backend/optimizer/path/README.uniquekey
new file mode 100644
index 00..be13b113b9
--- /dev/null
+++ b/src/backend/optimizer/path/README.uniquekey
@@ -0,0 +1,220 @@
+Here is a design document and a part of implementation.
+
+What is UniqueKey?
+-
+
+UniqueKey represents uniqueness information for a RelOptInfo. For
+example:
+
+SELECT id FROM t;
+
+where the id is the UniqueKey for the RelOptInfo (t). In the real world,
+it has the following attributes:
+
+1). It should be EquivalenceClass based. For example:
+
+SELECT a FROM t WHERE a = id;
+
+In this case, the UniqueKey should be 1 EC with two members
+- EC(EM(a), EM(id)).
+
+
+2). Each UniqueKey may be made up with 1+ EquivalenceClass. For example:
+
+CREATE TABLE t(a int not null, b
Re: why there is not VACUUM FULL CONCURRENTLY?
Alvaro Herrera wrote:
> This is great to hear.
>
> On 2024-Jan-31, Antonin Houska wrote:
>
> > Is your plan to work on it soon or should I try to write a draft patch?
> > (I assume this is for PG >= 18.)
>
> I don't have plans for it, so if you have resources, please go for it.

ok, I'm thinking how the feature can be integrated into the core.

BTW, I'm failing to understand why cluster_rel() has no argument of the
BufferAccessStrategy type. According to buffer/README, the criterion for
using a specific strategy is that a page "is unlikely to be needed again
soon". Specifically for cluster_rel(), the page will *definitely* not be used
again (unless the VACUUM FULL/CLUSTER command fails): BufferTag contains the
relation file number, and the old relation file is eventually dropped. Am I
missing anything?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
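A minimal sketch of what passing a strategy down might look like.
GetAccessStrategy(), BAS_BULKWRITE and FreeAccessStrategy() are the existing
buffer-strategy API (bufmgr.h); the extra cluster_rel() parameter is the
hypothetical change being asked about, not an existing signature:

```c
/*
 * Sketch: use a bulk-write ring buffer for the rewrite so the old heap's
 * pages don't evict the rest of shared buffers.  The cluster_rel()
 * signature shown here is hypothetical.
 */
BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKWRITE);

cluster_rel(tableOid, indexOid, &params, bstrategy);	/* hypothetical */

FreeAccessStrategy(bstrategy);
```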
pg_language(lanispl) column apparently not needed
I couldn't find a reference to the 'lanispl' attribute, so I removed it (see
the diff attached) and the master branch compiled cleanly. Is there yet a
reason to keep it?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/commands/proclang.c b/src/backend/commands/proclang.c
index c849d65e62..d353b4e3f7 100644
--- a/src/backend/commands/proclang.c
+++ b/src/backend/commands/proclang.c
@@ -112,7 +112,6 @@ CreateProceduralLanguage(CreatePLangStmt *stmt)
 	namestrcpy(&langname, languageName);
 	values[Anum_pg_language_lanname - 1] = NameGetDatum(&langname);
 	values[Anum_pg_language_lanowner - 1] = ObjectIdGetDatum(languageOwner);
-	values[Anum_pg_language_lanispl - 1] = BoolGetDatum(true);
 	values[Anum_pg_language_lanpltrusted - 1] = BoolGetDatum(stmt->pltrusted);
 	values[Anum_pg_language_lanplcallfoid - 1] = ObjectIdGetDatum(handlerOid);
 	values[Anum_pg_language_laninline - 1] = ObjectIdGetDatum(inlineOid);
diff --git a/src/include/catalog/pg_language.h b/src/include/catalog/pg_language.h
index 581372addd..7ff3fb0d08 100644
--- a/src/include/catalog/pg_language.h
+++ b/src/include/catalog/pg_language.h
@@ -36,9 +36,6 @@ CATALOG(pg_language,2612,LanguageRelationId)
 	/* Language's owner */
 	Oid			lanowner BKI_DEFAULT(POSTGRES) BKI_LOOKUP(pg_authid);

-	/* Is a procedural language */
-	bool		lanispl BKI_DEFAULT(f);
-
 	/* PL is trusted */
 	bool		lanpltrusted BKI_DEFAULT(f);
Re: why there is not VACUUM FULL CONCURRENTLY?
Alvaro Herrera wrote:
> On 2024-Jan-30, Pavel Stehule wrote:
>
> > One of my customers today is reducing one table from 140GB to 20GB. Now
> > he is able to run archiving. He has to use pg_repack, and it is working
> > well today, but I ask myself whether what pg_repack does would be hard
> > to do internally, because it should be done for REINDEX CONCURRENTLY
> > anyway. This is not a common task, and will not be, but on the other
> > hand it can be a nice-to-have feature, and maybe not too hard to
> > implement today. But I didn't try it.
>
> FWIW a newer, more modern and more trustworthy alternative to pg_repack
> is pg_squeeze, which I discovered almost by random chance, and soon
> discovered I liked it much more.
>
> So thinking about your question, I think it might be possible to
> integrate a tool that works like pg_squeeze, such that it runs when
> VACUUM is invoked -- either under some new option, or just replace the
> code under FULL, not sure. If the Cybertec people allow it, we could
> just grab the pg_squeeze code and add it to the things that VACUUM can
> run.

There are no objections from Cybertec. Nevertheless, I don't expect much code
to be just copied & pasted. If I started to implement the extension today,
I'd do some things in a different way. (Some things might actually be simpler
in the core, i.e. a few small changes in PG core are easier than the related
workarounds in the extension.)

The core idea is that:

1) a "historic snapshot" is used to get the current contents of the table,

2) logical decoding is used to capture the changes done while the data is
being copied to the new storage,

3) the exclusive lock on the table is only taken for a very short time, to
swap the storage (relfilenode) of the table.

I think it should be coded in a way that allows use by VACUUM FULL, CLUSTER,
and possibly some subcommands of ALTER TABLE. For example, some users of
pg_squeeze requested an enhancement that allows the user to change a column
data type w/o service disruption (typically when it appears that an integer
type is going to overflow and a change to bigint is needed). Online
(re)partitioning could be another use case, although I admit that commands
which change the system catalog are a bit harder to implement than VACUUM
FULL / CLUSTER.

One thing that pg_squeeze does not handle is visibility: it uses
heap_insert() to insert the tuples into the new storage, so the problems
described in [1] can appear. The in-core implementation should rather do
something like tuple rewriting (rewriteheap.c).

Is your plan to work on it soon or should I try to write a draft patch? (I
assume this is for PG >= 18.)

[1] https://www.postgresql.org/docs/current/mvcc-caveats.html

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
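The three phases described above can be sketched as pseudo-C. Apart from
LockRelationOid(), every function name here is hypothetical; this only
illustrates the intended control flow, not pg_squeeze's actual code:

```c
/*
 * Pseudocode sketch of a pg_squeeze-style concurrent table rewrite.
 * All helper names are hypothetical.
 */
static void
rewrite_table_concurrently(Relation rel)
{
	/* 1) Copy the existing rows using a historic snapshot. */
	snapshot = build_historic_snapshot(decoding_ctx);
	copy_table_contents(rel, new_storage, snapshot);

	/* 2) Apply the changes captured by logical decoding meanwhile. */
	while (!caught_up(decoding_ctx))
		apply_decoded_changes(decoding_ctx, new_storage);

	/* 3) Swap the relfilenode under a short AccessExclusiveLock. */
	LockRelationOid(RelationGetRelid(rel), AccessExclusiveLock);
	apply_decoded_changes(decoding_ctx, new_storage);	/* final catch-up */
	swap_relation_files(rel, new_storage);
	/* lock is released at end of transaction; old storage dropped later */
}
```

The point of the design is that only step 3 blocks concurrent DML, and it is
nearly instantaneous regardless of table size.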
cost_incremental_sort() and limit_tuples
I think that cost_incremental_sort() does not account for the limit_tuples
argument properly. Attached is my proposal to fix the problem.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d6ceafd51c..829af80ea7 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2036,6 +2036,36 @@ cost_incremental_sort(Path *path,
 										  NULL, NULL);

 	group_tuples = input_tuples / input_groups;
+
+	/*
+	 * Do we have a useful LIMIT?
+	 *
+	 * Unlike the full sort, which must read all the input tuples regardless
+	 * of the limit, the incremental sort only needs to read the groups
+	 * containing at least limit_tuples tuples in total. In other words, the
+	 * number of input tuples is almost identical to the number of output
+	 * tuples. Therefore we apply the limit to the *input* set.
+	 */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+	{
+		double		input_tuples_orig = input_tuples;
+
+		/*
+		 * We may need fewer groups, but there must be enough to accommodate
+		 * limit_tuples.
+		 */
+		input_groups = limit_tuples / group_tuples;
+		input_groups = ceil(input_groups);
+
+		/* Fewer input groups implies fewer input tuples. */
+		input_tuples = input_groups * group_tuples;
+		/* XXX Should we instead apply ceil() to group_tuples above? */
+		input_tuples = ceil(input_tuples);
+
+		/* Also adjust input_run_cost. */
+		input_run_cost /= input_tuples_orig / input_tuples;
+	}
+
 	group_input_run_cost = input_run_cost / input_groups;

 	/*
@@ -2044,7 +2074,7 @@ cost_incremental_sort(Path *path,
 	 */
 	cost_tuplesort(&group_startup_cost, &group_run_cost,
 				   group_tuples, width, comparison_cost, sort_mem,
-				   limit_tuples);
+				   -1);

 	/*
 	 * Startup cost of incremental sort is the startup cost of its first group
Stop the search once replication origin is found
Although it's not performance-critical, I think it just makes sense to break
the loop in replorigin_session_setup() as soon as we've found the origin.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b0255ffd25..460e3dcc38 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -1144,6 +1144,7 @@ replorigin_session_setup(RepOriginId node, int acquired_by)

 		/* ok, found slot */
 		session_replication_state = curstate;
+		break;
 	}
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
Thomas Munro wrote:
> On Wed, Aug 16, 2023 at 11:18 PM Antonin Houska wrote:
> > I try to understand this patch (commit 5ffb7c7750) because I use a
> > condition variable in an extension. One particular problem occurred to
> > me, please consider:
> >
> > ConditionVariableSleep() gets interrupted, so AbortTransaction() calls
> > ConditionVariableCancelSleep(), but the signal was sent in between.
> > Shouldn't at least AbortTransaction() and AbortSubTransaction() check
> > the return value of ConditionVariableCancelSleep(), and re-send the
> > signal if needed?
>
> I wondered about that in the context of our only in-tree user of
> ConditionVariableSignal(), in parallel btree index creation, but since
> it's using the parallel executor infrastructure, any error would be
> propagated everywhere so all waits would be aborted.

I see, ConditionVariableSignal() is currently used only to signal other
workers running in the same transaction. The other parts use
ConditionVariableBroadcast(), so no consumer should miss its signal.

> > Note that I'm just thinking about such a problem, did not try to
> > reproduce it.
>
> Hmm. I looked for users of ConditionVariableSignal() in the usual web
> tools and didn't find anything, so I guess your extension is not
> released yet or not open source. I'm curious: what does it actually
> do if there is an error in a CV-wakeup-consuming backend? I guess it
> might be some kind of work-queue processing system... it seems
> inevitable that if backends are failing with errors, and you don't
> respond by retrying/respawning, you'll lose or significantly delay
> jobs/events/something anyway (imagine only slightly different timing:
> you consume the signal and start working on a job and then ereport,
> which amounts to the same thing in the end now that your transaction
> is rolled back?), and when you retry you'll see whatever condition was
> waited for anyway. But that's just me imagining what some
> hypothetical strawman system might look like... what does it really
> do?

If you're interested, the extension is pg_squeeze [1]. I think the use case
is rather special. All the work is done by a background worker, but a user
function can be called to submit a "task" for the worker and wait for its
completion. So the function sleeps on a CV and the worker uses the CV to wake
it up. If this function ends due to ERROR, the user is supposed to find a log
message in the worker output sooner or later. It may sound weird, but that
function exists primarily for regression tests, so ERROR is a problem anyway.

> (FWIW when I worked on a couple of different work queue-like systems
> and tried to use ConditionVariableSignal() I eventually concluded that
> it was the wrong tool for the job because its wakeup order is
> undefined. It's actually FIFO, but I wanted LIFO so that workers have
> a chance to become idle and reduce the pool size, but I started to
> think that once you want that level of control you really want to
> build a bespoke wait list system, so I never got around to proposing
> that we consider changing that.)
>
> Our condition variables are weird. They're not associated with a
> lock, so we made start-of-wait non-atomic: prepare first, then return
> control and let the caller check its condition, then sleep. Typical
> user space condition variable APIs force you to acquire some kind of
> lock that protects the condition first, then check the condition, then
> atomically release-associated-lock-and-start-sleeping, so there is no
> data race but also no time where control is returned to the caller but
> the thread is on the wait list consuming signals. That choice has
> some pros (you can choose whatever type of lock you want to protect
> your condition, or none at all if you can get away with memory
> barriers and magic) and cons. However, as I think Andres was getting
> at, having a non-atomic start-of-wait doesn't seem to require us to
> have a non-atomic end-of-wait and associated extra contention. So
> maybe we should figure out how to fix that in 17.

Thanks for sharing your point of view. I'm fine with this low-level approach:
it's well documented and there are examples in the PG code showing how it
should be used :-)

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
https://github.com/cybertec-postgresql/pg_squeeze/
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
Thomas Munro wrote: > On Tue, Aug 15, 2023 at 2:23 AM Tomas Vondra > wrote: > > I'm not familiar with the condition variable code enough to have an > > opinion, but the patch seems to resolve the issue for me - I can no > > longer reproduce the high CPU usage. > > Thanks, pushed. I am trying to understand this patch (commit 5ffb7c7750) because I use a condition variable in an extension. One particular problem occurred to me, please consider: ConditionVariableSleep() gets interrupted, so AbortTransaction() calls ConditionVariableCancelSleep(), but the signal was sent in between. Shouldn't at least AbortTransaction() and AbortSubTransaction() check the return value of ConditionVariableCancelSleep(), and re-send the signal if needed? Note that I'm just thinking about such a problem; I did not try to reproduce it. -- Antonin Houska Web: https://www.cybertec-postgresql.com
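The race described above can be sketched with a toy model in plain Python (the names are illustrative, not PostgreSQL's C API): waiter "a" is interrupted, the signal arrives before its cleanup runs, and the cancel call reports the absorbed signal so the cleanup path can forward it, which is what ConditionVariableCancelSleep()'s return value allows after commit 5ffb7c7750.

```python
from collections import deque

# Toy model of a condition variable with a FIFO wait list.
class ToyCV:
    def __init__(self):
        self.wait_list = deque()

    def prepare_to_sleep(self, proc):
        self.wait_list.append(proc)

    def signal(self):
        # Wake the longest-waiting process, if any.
        if self.wait_list:
            self.wait_list.popleft()["signaled"] = True

    def cancel_sleep(self, proc):
        """Return True iff this waiter had already absorbed a signal."""
        if proc in self.wait_list:
            self.wait_list.remove(proc)   # still waiting: nothing absorbed
            return False
        return proc["signaled"]

cv = ToyCV()
a = {"signaled": False}
b = {"signaled": False}
cv.prepare_to_sleep(a)
cv.prepare_to_sleep(b)

cv.signal()                    # wakes "a", which is about to ERROR out
absorbed = cv.cancel_sleep(a)  # cleanup, as in AbortTransaction()
if absorbed:
    cv.signal()                # the suggested re-send, so "b" still wakes
```

Without the re-send in the error path, "b" would sleep forever even though a signal was delivered.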
Shouldn't cost_append() also scale the partial path's cost?
Like in cost_seqscan(), I'd expect the subpath cost to be divided among the parallel workers. The patch below shows what I mean. Am I right?

-- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ef475d95a1..5427822e0e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2313,19 +2313,29 @@ cost_append(AppendPath *apath)
 			 * Apply parallel divisor to subpaths.  Scale the number of rows
 			 * for each partial subpath based on the ratio of the parallel
 			 * divisor originally used for the subpath to the one we adopted.
-			 * Also add the cost of partial paths to the total cost, but
-			 * ignore non-partial paths for now.
+			 * Also add the scaled cost of partial paths to the total cost,
+			 * but ignore non-partial paths for now.
 			 */
 			if (i < apath->first_partial_path)
 				apath->path.rows += subpath->rows / parallel_divisor;
 			else
 			{
 				double		subpath_parallel_divisor;
+				double		scale_factor;
+				Cost		run_cost;
 
 				subpath_parallel_divisor = get_parallel_divisor(subpath);
-				apath->path.rows += subpath->rows * (subpath_parallel_divisor /
-													 parallel_divisor);
-				apath->path.total_cost += subpath->total_cost;
+				scale_factor = subpath_parallel_divisor / parallel_divisor;
+				apath->path.rows += subpath->rows * scale_factor;
+
+				/*
+				 * XXX run_cost includes both CPU cost, which is divided among
+				 * workers, and disk cost, which is not.  Unfortunately we
+				 * don't have enough information to separate the two, so scale
+				 * the whole run_cost.
+				 */
+				run_cost = subpath->total_cost - subpath->startup_cost;
+				apath->path.total_cost += subpath->startup_cost +
+					run_cost * scale_factor;
 			}
 
 	apath->path.rows = clamp_row_est(apath->path.rows);
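In numbers, the proposed change amounts to the following (a Python sketch with made-up cost values; note that get_parallel_divisor() returns values such as 2.4 rather than the worker count, because the leader contributes a fraction of a worker):

```python
# Sketch of the proposed scaling: the partial subpath's run cost was
# estimated for its own parallel divisor, so when the Append adopts a
# different divisor, the run cost is rescaled by the ratio while the
# startup cost is charged unchanged.
def scaled_subpath_cost(startup_cost, total_cost,
                        subpath_parallel_divisor, parallel_divisor):
    run_cost = total_cost - startup_cost
    scale_factor = subpath_parallel_divisor / parallel_divisor
    return startup_cost + run_cost * scale_factor

# e.g. a subpath costed under a divisor of 2.4, re-used under 4.4:
cost = scaled_subpath_cost(10.0, 110.0, 2.4, 4.4)
```

With more workers adopted (larger divisor), the charged run cost shrinks; the unpatched code would instead add the full 110.0.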
Re: Privileges on PUBLICATION
Gregory Stark (as CFM) wrote: > FYI this looks like it needs a rebase due to a conflict in copy.c and > an offset in pgoutput.c. > > Is there anything specific that still needs review or do you think > you've handled all Peter's concerns? In particular, is there "a > comprehensive description of what it is trying to do"? :) I tried to improve the documentation and commit messages in v05. v06 (just rebased) is attached. -- Antonin Houska Web: https://www.cybertec-postgresql.com >From d4490664ec80f52d23c4345eec5771764bcdbb17 Mon Sep 17 00:00:00 2001 From: Antonin Houska Date: Wed, 15 Mar 2023 04:21:01 +0100 Subject: [PATCH 1/2] Move some code into functions. This is only a preparation for a patch that introduces USAGE privilege on publications. It should make the following diff a little bit easier to read. --- src/backend/catalog/pg_publication.c| 236 +--- src/backend/commands/copy.c | 81 +-- src/backend/commands/copyto.c | 89 src/backend/replication/pgoutput/pgoutput.c | 139 +--- src/include/catalog/pg_publication.h| 6 + src/include/commands/copy.h | 2 + 6 files changed, 308 insertions(+), 245 deletions(-) diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c index a98fcad421..7f6024b7a5 100644 --- a/src/backend/catalog/pg_publication.c +++ b/src/backend/catalog/pg_publication.c @@ -1025,6 +1025,208 @@ GetPublicationByName(const char *pubname, bool missing_ok) return OidIsValid(oid) ? GetPublication(oid) : NULL; } +/* + * Get the mapping for given publication and relation. + */ +void +GetPublicationRelationMapping(Oid pubid, Oid relid, + Datum *attrs, bool *attrs_isnull, + Datum *qual, bool *qual_isnull) +{ + Publication *publication; + HeapTuple pubtuple = NULL; + Oid schemaid = get_rel_namespace(relid); + + publication = GetPublication(pubid); + + /* + * We don't consider row filters or column lists for FOR ALL TABLES or + * FOR TABLES IN SCHEMA publications. 
+ */ + if (!publication->alltables && + !SearchSysCacheExists2(PUBLICATIONNAMESPACEMAP, + ObjectIdGetDatum(schemaid), + ObjectIdGetDatum(publication->oid))) + pubtuple = SearchSysCacheCopy2(PUBLICATIONRELMAP, + ObjectIdGetDatum(relid), + ObjectIdGetDatum(publication->oid)); + + if (HeapTupleIsValid(pubtuple)) + { + /* Lookup the column list attribute. */ + *attrs = SysCacheGetAttr(PUBLICATIONRELMAP, pubtuple, + Anum_pg_publication_rel_prattrs, + attrs_isnull); + + /* Null indicates no filter. */ + *qual = SysCacheGetAttr(PUBLICATIONRELMAP, pubtuple, +Anum_pg_publication_rel_prqual, +qual_isnull); + } + else + { + *attrs_isnull = true; + *qual_isnull = true; + } +} +/* + * Pick those publications from a list which should actually be used to + * publish given relation and return them. + * + * If publish_as_relid_p is passed, the relation whose tuple descriptor should + * be used to publish the data is stored in *publish_as_relid_p. + * + * If pubactions is passed, update the structure according to the matching + * publications. + */ +List * +GetEffectiveRelationPublications(Oid relid, List *publications, + Oid *publish_as_relid_p, + PublicationActions *pubactions) +{ + Oid schemaId = get_rel_namespace(relid); + List *pubids = GetRelationPublications(relid); + /* + * We don't acquire a lock on the namespace system table as we build the + * cache entry using a historic snapshot and all the later changes are + * absorbed while decoding WAL. + */ + List *schemaPubids = GetSchemaPublications(schemaId); + ListCell *lc; + Oid publish_as_relid = relid; + int publish_ancestor_level = 0; + bool am_partition = get_rel_relispartition(relid); + char relkind = get_rel_relkind(relid); + List *rel_publications = NIL; + + foreach(lc, publications) + { + Publication *pub = lfirst(lc); + bool publish = false; + + /* + * Under what relid should we publish changes in this publication? + * We'll use the top-most relid across all publications. 
Also track + * the ancestor level for this publication. + */ + Oid pub_relid = relid; + int ancestor_level = 0; + + /* + * If this is a FOR ALL TABLES publication, pick the partition root + * and set the ancestor level accordingly. + */ + if (pub->alltables) + { + publish = true; + if (pub->pubviaroot && am_partition) + { +List *ancestors = get_partition_ancestors(relid); + +pub_relid = llast_oid(ancestors); +ancestor_level = list_length(ancestors); + } + } + + if (!publish) + { + bool ancestor_published = false; + + /* + * For a partition, check if any of the ancestors are published. + * If so, note down the topmost ancestor that is published via + * this publication, which will be used as th
Re: Parallelize correlated subqueries that execute within each worker
James Coleman wrote: > On Mon, Feb 6, 2023 at 11:39 AM Antonin Houska wrote: > Attached is v9. ok, I've changed the status to RfC -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Parallelize correlated subqueries that execute within each worker
exactly NULL if empty */
	if (bms_is_empty(outer_relids))
		outer_relids = NULL;

Another question is whether in this call

	simple_gather_path = (Path *)
		create_gather_path(root, rel, cheapest_partial_path, rel->reltarget,
						   required_outer, rowsp);

required_outer should be passed to create_gather_path(). Shouldn't it rather be PATH_REQ_OUTER(cheapest_partial_path) that you test just above?

Again, build_index_paths() initializes outer_relids this way

	outer_relids = bms_copy(rel->lateral_relids);

but then it may add some more relations:

	/* OK to include this clause */
	index_clauses = lappend(index_clauses, iclause);
	outer_relids = bms_add_members(outer_relids, rinfo->clause_relids);

So I think that PATH_REQ_OUTER(cheapest_partial_path) in generate_gather_paths() can eventually contain more relations than required_outer, and therefore it's safer to check the former. Similar comments might apply to generate_useful_gather_paths(). Here I also suggest moving this test

	/* We can't pass params to workers. */
	if (!bms_is_subset(PATH_REQ_OUTER(subpath), rel->relids))
		continue;

to the top of the loop because it's relatively cheap.

-- Antonin Houska Web: https://www.cybertec-postgresql.com
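The point about the two checks can be illustrated with Python sets standing in for Relids bitmapsets (a toy model; the relid numbers are made up):

```python
# rel->lateral_relids, with which build_index_paths() starts:
lateral_relids = {2}
# relids that an accepted index clause may add on top of that:
clause_relids = {2, 3}

# PATH_REQ_OUTER(subpath) can therefore be a strict superset of
# the required_outer derived from lateral_relids alone:
path_req_outer = lateral_relids | clause_relids
required_outer = set(lateral_relids)
assert required_outer < path_req_outer

# The "can't pass params to workers" test from the loop must use the
# larger set; testing only required_outer would be too permissive here:
rel_relids = {1, 2}
can_gather = path_req_outer <= rel_relids   # bms_is_subset(...)
```

With these sample sets, required_outer alone would pass the subset test while path_req_outer would not, which is why checking PATH_REQ_OUTER is the safer option.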
Re: RLS makes COPY TO process child tables
Yugo NAGATA wrote: > On Wed, 01 Feb 2023 12:45:57 +0100 > Antonin Houska wrote: > > > While working on [1] I noticed that if RLS gets enabled, the COPY TO command > > includes the contents of child table into the result, although the > > documentation says it should not: > > > > "COPY TO can be used only with plain tables, not views, and does not > > copy rows from child tables or child partitions. For example, COPY > > table TO copies the same rows as SELECT * FROM ONLY table. The syntax > > COPY (SELECT * FROM table) TO ... can be used to dump all of the rows > > in an inheritance hierarchy, partitioned table, or view." > > > > A test case is attached (rls.sql) as well as fix proposal > > (copy_rls_no_inh.diff). > > I think this is a bug because the current behaviour is different from > the documentation. > > When RLS is enabled on a table in `COPY ... TO ...`, the query is converted > to `COPY (SELECT * FROM ...) TO ...` to allow the rewriter to add in RLS > clauses. This causes to dump the rows of child tables. > > The patch fixes this by setting "inh" of the table in the converted query > to false. This seems reasonable and actually fixes the problem. > > However, I think we would want a comment on the added line. A short comment added, see the new patch version. > Also, the attached test should be placed in the regression test. Hm, I'm not sure it's necessary. It would effectively test whether the 'inh' field works, but if it didn't, many other tests would fail. I discovered the bug by reading the code, so I wanted to demonstrate (also to myself) that it causes incorrect behavior from user perspective. That was the purpose of the test. -- Antonin Houska Web: https://www.cybertec-postgresql.com >From 3a4acb29901a7b53eb32767e30bdfce74b80df8b Mon Sep 17 00:00:00 2001 From: Antonin Houska Date: Thu, 2 Feb 2023 07:31:32 +0100 Subject: [PATCH] Make sure that COPY TO does not process child tables. 
It's expected (and documented) that the child tables are not copied; however, the query generated for RLS did not meet this expectation so far.
---
 src/backend/commands/copy.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index e34f583ea7..539fb682c2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -250,6 +250,9 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 							pstrdup(RelationGetRelationName(rel)),
 							-1);
 
+		/* COPY TO should not process child tables. */
+		from->inh = false;
+
 		/* Build query */
 		select = makeNode(SelectStmt);
 		select->targetList = targetList;
-- 
2.31.1
RLS makes COPY TO process child tables
While working on [1] I noticed that if RLS gets enabled, the COPY TO command includes the contents of child tables in the result, although the documentation says it should not: "COPY TO can be used only with plain tables, not views, and does not copy rows from child tables or child partitions. For example, COPY table TO copies the same rows as SELECT * FROM ONLY table. The syntax COPY (SELECT * FROM table) TO ... can be used to dump all of the rows in an inheritance hierarchy, partitioned table, or view." A test case is attached (rls.sql) as well as a fix proposal (copy_rls_no_inh.diff). [1] https://commitfest.postgresql.org/41/3641/ -- Antonin Houska Web: https://www.cybertec-postgresql.com

create table a(i int);
insert into a values (1);
create table a1() inherits(a);
insert into a1 values (1);

-- Only the parent table is copied, as the documentation claims.
copy a to stdout;

alter table a enable row level security;
create role u;
create policy pol_perm on a as permissive for select to u using (true);
grant select on table a to u;
set role u;

-- Both "a" and "a1" appear in the output.
copy a to stdout;

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index e34f583ea7..3b8c25dadd 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -249,6 +249,7 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 		from = makeRangeVar(get_namespace_name(RelationGetNamespace(rel)),
 							pstrdup(RelationGetRelationName(rel)),
 							-1);
+		from->inh = false;
 
 		/* Build query */
 		select = makeNode(SelectStmt);
Re: Cross-partition UPDATE and foreign table partitions
Antonin Houska wrote: > I was wondering why ExecCrossPartitionUpdateForeignKey() has an unused > argument "oldslot" and wanted to suggest its removal. However, before I did, > it occurred to me that callers may want to pass the whole slot when the > partition is a foreign table, i.e. when the "tupleid" argument cannot be > used. (In that case the problem would be that the function implementation is > incomplete.) > > However, when checking how cross-partition UPDATE works internally for foreign > tables, I saw surprising behavior. The attached script creates partitioned > table "a" with foreign table partitions "a1" and "a2". If you then run the > following commands > > INSERT INTO a VALUES (1), (10); > UPDATE a SET i=11 WHERE i=1; > TABLE a1; > > you'll see that the tuples are correctly routed into the partitions, but the > UPDATE is simply executed on the "a1" partition. Instead, I'd expect it to > delete the tuple from "a1" and insert it into "a2". That looks like a bug. Well, as usually happens, I found the related information as soon as I had sent my report. The documentation of CREATE FOREIGN TABLE says: "However it is not currently possible to move a row from a foreign-table partition to another partition. An UPDATE that would require doing that will fail due to the partitioning constraint, assuming that that is properly enforced by the remote server." So the remaining question is whether the "oldslot" argument of ExecCrossPartitionUpdateForeignKey() will be used in the future or should be removed. Note that ExecUpdateAct() passes its "slot" variable for it, which seems to contain the *new* version of the tuple rather than the old one. Some cleanup may be needed. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Cross-partition UPDATE and foreign table partitions
I was wondering why ExecCrossPartitionUpdateForeignKey() has an unused argument "oldslot" and wanted to suggest its removal. However, before I did, it occurred to me that callers may want to pass the whole slot when the partition is a foreign table, i.e. when the "tupleid" argument cannot be used. (In that case the problem would be that the function implementation is incomplete.) However, when checking how cross-partition UPDATE works internally for foreign tables, I saw surprising behavior. The attached script creates partitioned table "a" with foreign table partitions "a1" and "a2". If you then run the following commands INSERT INTO a VALUES (1), (10); UPDATE a SET i=11 WHERE i=1; TABLE a1; you'll see that the tuples are correctly routed into the partitions, but the UPDATE is simply executed on the "a1" partition. Instead, I'd expect it to delete the tuple from "a1" and insert it into "a2". That looks like a bug. -- Antonin Houska Web: https://www.cybertec-postgresql.com CREATE EXTENSION IF NOT EXISTS postgres_fdw WITH SCHEMA public; CREATE SERVER s1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS ( dbname 'postgres', host 'localhost', port '5432' ); CREATE USER MAPPING FOR CURRENT_ROLE SERVER s1; CREATE TABLE public.a ( i integer NOT NULL ) PARTITION BY RANGE (i); CREATE TABLE public.a1 ( i integer NOT NULL ); CREATE FOREIGN TABLE public.a1_loc ( i integer NOT NULL ) SERVER s1 OPTIONS ( table_name 'a1' ); CREATE TABLE public.a2 ( i integer NOT NULL ); CREATE FOREIGN TABLE public.a2_loc ( i integer NOT NULL ) SERVER s1 OPTIONS ( table_name 'a2' ); ALTER TABLE ONLY public.a ATTACH PARTITION public.a1_loc FOR VALUES FROM (0) TO (10); ALTER TABLE ONLY public.a ATTACH PARTITION public.a2_loc FOR VALUES FROM (10) TO (20); ALTER TABLE ONLY public.a1 ADD CONSTRAINT a1_pkey PRIMARY KEY (i); ALTER TABLE ONLY public.a2 ADD CONSTRAINT a2_pkey PRIMARY KEY (i);
Re: grouping pushdown
David Rowley wrote: > On Wed, 4 Jan 2023 at 23:21, Spring Zhong wrote: > > The plan is apparently inefficient, since the hash aggregate goes after the > > Cartesian product. We could expect the query's performance to get much > > improved if the HashAggregate node can be pushed down to the SCAN node. > > > Does anyone have suggestions on this? > > I think this is being worked on. See [1]. Well, the current version of that patch requires the query to contain at least one aggregate. It shouldn't be a big deal to modify it. However, note that this feature pushes the aggregate/grouping only to one side of the join (a "fake" aggregate count(*) added to the query):

SET enable_agg_pushdown TO on;
EXPLAIN select i1, i2, count(*) from t1, t2 group by i1, i2;

                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=440.02..440.04 rows=1 width=16)
   Group Key: t1.i1, t2.i2
   ->  Sort  (cost=440.02..440.02 rows=1 width=16)
         Sort Key: t1.i1, t2.i2
         ->  Nested Loop  (cost=195.00..440.01 rows=1 width=16)
               ->  Partial HashAggregate  (cost=195.00..195.01 rows=1 width=12)
                     Group Key: t1.i1
                     ->  Seq Scan on t1  (cost=0.00..145.00 rows=1 width=4)
               ->  Seq Scan on t2  (cost=0.00..145.00 rows=1 width=4)

If both sides should be grouped, finalization of the partial aggregates would be more difficult, and I'm not sure it'd be worth the effort.

> [1] https://commitfest.postgresql.org/41/3764/

-- Antonin Houska Web: https://www.cybertec-postgresql.com
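The shape of the plan above can be modeled in a few lines of Python (a toy model with made-up data, not the patch's code): t1 is partially aggregated by i1 below the join, the cross join with t2 replicates each partial count, and the Finalize step groups by (i1, i2).

```python
from collections import Counter
from itertools import product

# Toy model of the one-sided aggregate pushdown shown in the plan.
t1 = [1, 1, 2]          # t1.i1 values
t2 = [10, 20]           # t2.i2 values

partial = Counter(t1)   # Partial HashAggregate on t1: i1 -> partial count
finalize = Counter()
for (i1, cnt), i2 in product(partial.items(), t2):
    # Nested Loop feeding Finalize GroupAggregate: each partial count
    # is carried across the join and summed per (i1, i2) group.
    finalize[(i1, i2)] += cnt

# The same result as grouping the plain Cartesian product:
expected = Counter(product(t1, t2))
```

The model also shows why grouping both sides is harder: the partial counts of the two inputs would have to be multiplied, not just summed, during finalization.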
Re: Privileges on PUBLICATION
Antonin Houska wrote: > Antonin Houska wrote: > > > Peter Eisentraut wrote: > > > > > On 04.11.22 08:28, Antonin Houska wrote: > > > > I thought about the whole concept a bit more and I doubt if the > > > > PUBLICATION > > > > privilege is the best approach. In particular, the user specified in > > > > CREATE > > > > SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have > > > > SELECT > > > > privilege on the tables replicated. So if the DBA excludes some columns > > > > from > > > > the publication's column list and sets the (publication) privileges in > > > > such a > > > > way that the user cannot get the column values via other publications, > > > > the > > > > user still can connect to the database directly and get values of the > > > > excluded > > > > columns. > > > > > > Why are the SELECT privileges needed? Maybe that's something to think > > > about > > > and maybe change. > > > > I haven't noticed an explanation in comments nor did I search in the mailing > > list archives, but the question makes sense: the REPLICATION attribute of a > > role is sufficient for streaming replication, so why should the logical > > replication require additional privileges? > > > > Technically the SELECT privilege is needed because the sync worker does > > actually execute SELECT query on the published tables. However, I realize > > now > > that it's not checked by the output plugin. Thus if SELECT is revoked from > > the > > "subscription user" after the table has been synchronized, the replication > > continues to work. So the necessity for the SELECT privilege might be an > > omission rather than a design choice. (Even the documentation says that the > > SELECT privilege is needed only for the initial synchronization [1], however > > it does not tell why.) 
> > > > > > As an alternative to the publication privileges, I think that the CREATE > > > > SUBSCRIPTION command could grant ACL_SELECT automatically to the > > > > subscription > > > > user on the individual columns contained in the publication column > > > > list, and > > > > DROP SUBSCRIPTION would revoke that privilege. > > > > > > I think that approach is weird and unusual. Privileges and object > > > creation > > > should be separate operations. > > > > ok. Another approach would be to skip the check for the SELECT privilege (as > > well as the check for the USAGE privilege on the corresponding schema) if > > given column is being accessed via a publication which has it on its column > > list and if the subscription user has the USAGE privilege on that > > publication. > > > > So far I wasn't sure if we can do that because, if pg_upgrade grants the > > USAGE > > privilege on all publications to the "public" role, the DBAs who relied on > > the > > SELECT privileges might not notice that any role having the REPLICATION > > attribute can access all the published tables after the upgrade. (pg_upgrade > > can hardly do anything else because it has no information on the > > "subscription > > users", so it cannot convert the SELECT privilege on tables to the USAGE > > privileges on publications.) > > > > But now that I see that the logical replication doesn't check the SELECT > > privilege properly anyway, I think we can get rid of it. > > The attached version tries to do that - as you can see in 0001, the SELECT > privilege is not required for the walsender process. > > I also added a PUBLICATION_NAMES option to the COPY TO command so that the > publisher knows which publications are subject to the ACL check. Only data of > those publications are returned to the subscriber. (In the previous patch > version the ACL checks were performed on the subscriber side, but I think > that's not ideal in terms of security.)
> > I also added the regression tests for publications, enhanced psql (the \dRp+ > command) so that it displays the publication ACL and added a few missing > pieces of documentation. > This is v4. The patch had to be rebased due to the commit 369f09e420. -- Antonin Houska Web: https://www.cybertec-postgresql.com publication_privileges_v04.tgz Description: application/gzip
Re: refactor ExecGrant_*() functions
Peter Eisentraut wrote: > On 12.12.22 10:44, Antonin Houska wrote: > > Peter Eisentraut wrote: > > > >> On 06.12.22 09:41, Antonin Houska wrote: > >>> Attached are my proposals for improvements. One is to avoid memory leak, > >>> the > >>> other tries to improve readability a little bit. > >> > >> I added the readability improvement to my v2 patch. The pfree() calls > >> aren't > >> necessary AFAICT. > > It's something to consider, but since this is a refactoring patch and the old > code didn't do it either, I think it's out of scope. Well, the reason I brought this topic up is that the old code didn't even palloc() those arrays. (Because they were located on the stack.) > > I see that memory contexts exist and that the amount of memory freed is not > > huge, but my style is to free the memory explicitly if it's allocated in a > > loop. > > v2 looks good to me. > > Committed, thanks. ok, I'll post the rebased "USAGE privilege on PUBLICATION" patch [1] soon. -- Antonin Houska Web: https://www.cybertec-postgresql.com [1] https://commitfest.postgresql.org/41/3641/
Re: refactor ExecGrant_*() functions
Peter Eisentraut wrote: > On 06.12.22 09:41, Antonin Houska wrote: > > Attached are my proposals for improvements. One is to avoid memory leak, the > > other tries to improve readability a little bit. > > I added the readability improvement to my v2 patch. The pfree() calls aren't > necessary AFAICT. I see that memory contexts exist and that the amount of memory freed is not huge, but my style is to free the memory explicitly if it's allocated in a loop. v2 looks good to me. -- Antonin Houska Web: https://www.cybertec-postgresql.com
sendFileWithContent() does not advance the source pointer
When checking something else in the base backup code, I've noticed that sendFileWithContent() does not advance the 'content' pointer. The sink buffer is large enough (32kB) so that the first iteration usually processes the whole file (only special files are processed by this function), and thus the problem is hidden. However it's possible to hit the issue: if there are too many tablespaces, pg_basebackup generates a corrupted tablespace_map. Instead of writing all the tablespace paths, it writes only some and then starts to write the contents from the beginning again. The attached script generates scripts to create many tablespaces as well as the underlying directories. The fix is attached here as well. -- Antonin Houska Web: https://www.cybertec-postgresql.com

#!/bin/bash

TBSPDIR=/mnt/ramdisk/tbspcs
TBSPCOUNT=2048
SCRIPT_SH=create.sh
SCRIPT_SQL=create.sql

echo "#!/bin/bash" > ${SCRIPT_SH}
echo "mkdir -p $TBSPDIR" >> ${SCRIPT_SH}
echo "" > ${SCRIPT_SQL}

i=0
while [ $i -lt $TBSPCOUNT ]; do
    printf "mkdir %s/%.4d\n" $TBSPDIR $i >> ${SCRIPT_SH}
    printf "CREATE TABLESPACE tbsp_%.4d LOCATION '%s/%.4d';\n" $i $TBSPDIR $i >> ${SCRIPT_SQL}
    i=$((i+1))
done

chmod u+x ${SCRIPT_SH}

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 74fb529380..7aa5f6e44d 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1073,6 +1073,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
 		memcpy(sink->bbs_buffer, content, nbytes);
 		bbsink_archive_contents(sink, nbytes);
 		bytes_done += nbytes;
+		content += nbytes;
 	}
 
 	_tarWritePadding(sink, len);
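The failure mode, and the one-line fix, can be demonstrated with a Python model of the chunked copy loop (a sketch, not the C code):

```python
# Model of the bug: copying a buffer in sink-sized chunks but never
# advancing the source pointer repeats the first chunk over and over,
# which is how tablespace_map gets corrupted once it exceeds the sink.
def send_with_content_buggy(content, sink_size):
    out = bytearray()
    remaining = len(content)
    while remaining > 0:
        nbytes = min(remaining, sink_size)
        out += content[:nbytes]        # always reads from the start
        remaining -= nbytes
    return bytes(out)

def send_with_content_fixed(content, sink_size):
    out = bytearray()
    pos = 0
    while pos < len(content):
        nbytes = min(len(content) - pos, sink_size)
        out += content[pos:pos + nbytes]
        pos += nbytes                  # the fix: advance the source
    return bytes(out)

data = b"tablespace_map line 1\ntablespace_map line 2\n"
corrupted = send_with_content_buggy(data, 8)
restored = send_with_content_fixed(data, 8)
```

With a content shorter than sink_size the buggy version happens to produce correct output in its single iteration, which is why the bug stayed hidden.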
Re: refactor ExecGrant_*() functions
Peter Eisentraut wrote: > Continuing the ideas in [0], this patch refactors the ExecGrant_Foo() > functions and replaces many of them by a common function that is driven by the > ObjectProperty table. > > It would be nice to do more here, for example ExecGrant_Language(), which has > additional non-generalizable checks, but I think this is already a good start. > For example, the work being discussed on privileges on publications [1] would > be able to take good advantage of this. Right, I mostly copied and pasted the code when writing ExecGrant_Publication(). I agree that your refactoring is very useful. Attached are my proposals for improvements. One is to avoid a memory leak, the other tries to improve readability a little bit. > [0]: > https://www.postgresql.org/message-id/95c30f96-4060-2f48-98b5-a4392d3b6...@enterprisedb.com > [1]: https://www.postgresql.org/message-id/flat/20330.1652105397@antos -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index d3121a469f..ac4490c0b8 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -2228,6 +2234,9 @@ ExecGrant_generic(InternalGrant *istmt, Oid classid, AclMode default_privs)
 		newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
 									 nulls, replaces);
+		pfree(values);
+		pfree(nulls);
+		pfree(replaces);
 
 		CatalogTupleUpdate(relation, &newtuple->t_self, newtuple);

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index d3121a469f..ac4490c0b8 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -2144,6 +2144,7 @@ ExecGrant_generic(InternalGrant *istmt, Oid classid, AclMode default_privs)
 	{
 		Oid			objectid = lfirst_oid(cell);
 		Datum		aclDatum;
+		Datum		nameDatum;
 		bool		isNull;
 		AclMode		avail_goptions;
 		AclMode		this_privileges;
@@ -2197,6 +2198,11 @@ ExecGrant_generic(InternalGrant *istmt, Oid classid, AclMode default_privs)
 								old_acl, ownerId,
 								&grantorId, &avail_goptions);
 
+		nameDatum = SysCacheGetAttr(cacheid, tuple,
+									get_object_attnum_name(classid),
+									&isNull);
+		Assert(!isNull);
+
 		/*
 		 * Restrict the privileges to what we can actually grant, and emit the
 		 * standards-mandated warning and error messages.
 		 */
@@ -2205,7 +2211,7 @@ ExecGrant_generic(InternalGrant *istmt, Oid classid, AclMode default_privs)
 		restrict_and_check_grant(istmt->is_grant, avail_goptions,
 								 istmt->all_privs, istmt->privileges,
 								 objectid, grantorId,
 								 get_object_type(classid, objectid),
-								 NameStr(*DatumGetName(SysCacheGetAttr(cacheid, tuple, get_object_attnum_name(classid), &isNull))),
+								 NameStr(*DatumGetName(nameDatum)),
 								 0, NULL);

 		/*
Re: Privileges on PUBLICATION
Antonin Houska wrote: > Peter Eisentraut wrote: > > > On 04.11.22 08:28, Antonin Houska wrote: > > > I thought about the whole concept a bit more and I doubt if the > > > PUBLICATION > > > privilege is the best approach. In particular, the user specified in > > > CREATE > > > SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have > > > SELECT > > > privilege on the tables replicated. So if the DBA excludes some columns > > > from > > > the publication's column list and sets the (publication) privileges in > > > such a > > > way that the user cannot get the column values via other publications, the > > > user still can connect to the database directly and get values of the > > > excluded > > > columns. > > > > Why are the SELECT privileges needed? Maybe that's something to think about > > and maybe change. > > I haven't noticed an explanation in comments nor did I search in the mailing > list archives, but the question makes sense: the REPLICATION attribute of a > role is sufficient for streaming replication, so why should the logical > replication require additional privileges? > > Technically the SELECT privilege is needed because the sync worker does > actually execute SELECT query on the published tables. However, I realize now > that it's not checked by the output plugin. Thus if SELECT is revoked from the > "subscription user" after the table has been synchronized, the replication > continues to work. So the necessity for the SELECT privilege might be an > omission rather than a design choice. (Even the documentation says that the > SELECT privilege is needed only for the initial synchronization [1], however > it does not tell why.) > > > > As an alternative to the publication privileges, I think that the CREATE > > > SUBSCRIPTION command could grant ACL_SELECT automatically to the > > > subscription > > > user on the individual columns contained in the publication column list, > > > and > > > DROP SUBSCRIPTION would revoke that privilege. 
> > > > I think that approach is weird and unusual. Privileges and object creation > > should be separate operations. > > ok. Another approach would be to skip the check for the SELECT privilege (as > well as the check for the USAGE privilege on the corresponding schema) if > given column is being accessed via a publication which has it on its column > list and if the subscription user has the USAGE privilege on that publication. > > So far I wasn't sure if we can do that because, if pg_upgrade grants the USAGE > privilege on all publications to the "public" role, the DBAs who relied on the > SELECT privileges might not notice that any role having the REPLICATION > attribute can access all the published tables after the upgrade. (pg_upgrade > can hardly do anything else because it has no information on the "subscription > users", so it cannot convert the SELECT privilege on tables to the USAGE > privileges on publications.) > > But now that I see that the logical replication doesn't check the SELECT > privilege properly anyway, I think we can get rid of it. The attached version tries to do that - as you can see in 0001, the SELECT privilege is not required for the walsender process. I also added a PUBLICATION_NAMES option to the COPY TO command so that the publisher knows which publications are subject to the ACL check. Only data of those publications are returned to the subscriber. (In the previous patch version the ACL checks were performed on the subscriber side, but I think that's not ideal in terms of security.) I also added the regression tests for publications, enhanced psql (the \dRp+ command) so that it displays the publication ACL, and added a few missing pieces of documentation. -- Antonin Houska Web: https://www.cybertec-postgresql.com publication_privileges_v03.tgz Description: application/gzip
Re: Privileges on PUBLICATION
Peter Eisentraut wrote: > On 04.11.22 08:28, Antonin Houska wrote: > > I thought about the whole concept a bit more and I doubt if the PUBLICATION > > privilege is the best approach. In particular, the user specified in CREATE > > SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have > > SELECT > > privilege on the tables replicated. So if the DBA excludes some columns from > > the publication's column list and sets the (publication) privileges in such > > a > > way that the user cannot get the column values via other publications, the > > user still can connect to the database directly and get values of the > > excluded > > columns. > > Why are the SELECT privileges needed? Maybe that's something to think about > and maybe change. I haven't noticed an explanation in comments nor did I search in the mailing list archives, but the question makes sense: the REPLICATION attribute of a role is sufficient for streaming replication, so why should the logical replication require additional privileges? Technically the SELECT privilege is needed because the sync worker does actually execute SELECT query on the published tables. However, I realize now that it's not checked by the output plugin. Thus if SELECT is revoked from the "subscription user" after the table has been synchronized, the replication continues to work. So the necessity for the SELECT privilege might be an omission rather than a design choice. (Even the documentation says that the SELECT privilege is needed only for the initial synchronization [1], however it does not tell why.) > > As an alternative to the publication privileges, I think that the CREATE > > SUBSCRIPTION command could grant ACL_SELECT automatically to the > > subscription > > user on the individual columns contained in the publication column list, and > > DROP SUBSCRIPTION would revoke that privilege. > > I think that approach is weird and unusual. Privileges and object creation > should be separate operations. ok. 
Another approach would be to skip the check for the SELECT privilege (as well as the check for the USAGE privilege on the corresponding schema) if a given column is being accessed via a publication which has it on its column list and if the subscription user has the USAGE privilege on that publication. So far I wasn't sure if we can do that because, if pg_upgrade grants the USAGE privilege on all publications to the "public" role, the DBAs who relied on the SELECT privileges might not notice that any role having the REPLICATION attribute can access all the published tables after the upgrade. (pg_upgrade can hardly do anything else because it has no information on the "subscription users", so it cannot convert the SELECT privilege on tables to the USAGE privileges on publications.) But now that I see that the logical replication doesn't check the SELECT privilege properly anyway, I think we can get rid of it. [1] https://www.postgresql.org/docs/current/logical-replication-security.html -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: refactor ownercheck and aclcheck functions
Peter Eisentraut wrote: > These patches take the dozens of mostly-duplicate pg_foo_ownercheck() and > pg_foo_aclcheck() functions and replace (most of) them by common functions > that are driven by the ObjectProperty table. All the required information is > already in that table. > > This is similar to the consolidation of the drop-by-OID functions that we did > a while ago (b1d32d3e3230f00b5baba08f75b4f665c7d6dac6). I've reviewed this patch, as it's related to my patch [1] (in particular, it reduces the size of my patch a little bit). I like the idea of reducing the amount of (almost) copy-and-pasted code. I haven't found any problem in your patch that would be worth mentioning, except that the 0001 part does not apply to the current master branch. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Mark Dilger wrote: > > On Nov 4, 2022, at 12:28 AM, Antonin Houska wrote: > > > > I thought about the whole concept a bit more and I doubt if the PUBLICATION > > privilege is the best approach. In particular, the user specified in CREATE > > SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have > > SELECT > > privilege on the tables replicated. So if the DBA excludes some columns from > > the publication's column list and sets the (publication) privileges in such > > a > > way that the user cannot get the column values via other publications, the > > user still can connect to the database directly and get values of the > > excluded > > columns. > > > > As an alternative to the publication privileges, I think that the CREATE > > SUBSCRIPTION command could grant ACL_SELECT automatically to the > > subscription > > user on the individual columns contained in the publication column list, and > > DROP SUBSCRIPTION would revoke that privilege. > > > > Of course a question is what to do if the replication user already has that > > privilege on some columns: either the CREATE SUBSCRIPTION command should > > raise > > ERROR, or we should introduce a new privilege (e.g. ACL_SELECT_PUB) for this > > purpose, which would effectivelly be ACL_SELECT, but hidden from the users > > of > > GRANT / REVOKE. > > > > If this approach was taken, the USAGE privilege on schema would need to be > > granted / revoked in a similar way. > > When you talk about a user needing to have privileges, it sounds like you > mean privileges on the publishing database. But then you talk about CREATE > SUBSCRIPTION granting privileges, which would necessarily be on the > subscriber database. Can you clarify what you have in mind? Right, the privileges need to be added on the publishing side, but the user that needs those privileges is specified on the subscription side. I didn't think much in detail how it would work. 
The "subscription user" certainly cannot connect to the publisher database and grant the privileges to itself. Perhaps some of the workers on the publisher side could do it on startup. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Amit Kapila wrote: > On Thu, Nov 3, 2022 at 11:12 AM Antonin Houska wrote: > > > > Peter Eisentraut wrote: > > > > > The CF entry is about privileges on publications. Please rebase that > > > patch > > > and repost it so that the CF app and the CF bot are up to date. > > > > The rebased patch (with regression tests added) is attached here. > > > > There's still one design issue that I haven't mentioned yet: if the USAGE > > privilege on a publication is revoked after the synchronization phase > > completed, the missing privilege on a publication causes ERROR in the output > > plugin. If the privilege is then granted, the error does not disappear > > because > > the same (historical) snapshot we use to decode the failed data change again > > is also used to check the privileges in the catalog, so the output plugin > > does > > not see that the privilege has already been granted. > > > > We have a similar problem even when publication is dropped/created. > The replication won't be able to proceed. > > > The only solution seems to be to drop the publication from the subscription > > and add it again, or to drop and re-create the whole subscription. I haven't > > added a note about this problem to the documentation yet, in case someone > > has > > better idea how to approach the problem. > > > > I think one possibility is that the user advances the slot used in > replication by using pg_replication_slot_advance() at or after the > location where the privilege is granted. Some other ideas have been > discussed in the thread [1], in particular, see email [2] and > discussion after that but we didn't reach any conclusion. Thanks for feedback. Regarding the publications, I'm not sure what the user should expect if he revokes the permissions in the middle of processing. In such a situation, he should be aware that some privileged data could already have been replicated. 
My preference in such a situation would be to truncate the table(s) on the subscriber side and to restart the replication, rather than fix the replication and keep the subscriber's access to the leaked data. > [1] - > https://www.postgresql.org/message-id/CAHut%2BPvMbCsL8PAz1Qc6LNoL0Ag0y3YJtPVJ8V0xVXJOPb%2B0xw%40mail.gmail.com > [2] - > https://www.postgresql.org/message-id/CAA4eK1JTwOAniPua04o2EcOXfzRa8ANax%3D3bpx4H-8dH7M2p%3DA%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Peter Eisentraut wrote: > On 03.11.22 01:43, Antonin Houska wrote: > > Peter Eisentraut wrote: > > > >> The CF entry is about privileges on publications. Please rebase that patch > >> and repost it so that the CF app and the CF bot are up to date. > > The rebased patch (with regression tests added) is attached here. > > Some preliminary discussion: > > What is the upgrade strategy? I suppose the options are either that > publications have a default acl that makes them publicly accessible, > thus preserving the existing behavior by default, or pg_dump would need to > create additional GRANT statements when upgrading from pre-PG16. I don't see > anything like either of these mentioned in the patch. What is your plan? So far I considered the first option. > You might be interested in this patch, which relates to yours: > https://commitfest.postgresql.org/40/3955/ ok, I'll check. > Looking at your patch, I would also like to find a way to refactor away the > ExecGrant_Publication() function. I'll think about that. > > I think you should add some tests under src/test/regress/ for the new GRANT > and REVOKE statements, just to have some basic coverage that it works. > sql/publication.sql would be appropriate, I think. I thought about the whole concept a bit more and I doubt if the PUBLICATION privilege is the best approach. In particular, the user specified in CREATE SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have SELECT privilege on the tables replicated. So if the DBA excludes some columns from the publication's column list and sets the (publication) privileges in such a way that the user cannot get the column values via other publications, the user still can connect to the database directly and get values of the excluded columns.
As an alternative to the publication privileges, I think that the CREATE SUBSCRIPTION command could grant ACL_SELECT automatically to the subscription user on the individual columns contained in the publication column list, and DROP SUBSCRIPTION would revoke that privilege. Of course a question is what to do if the replication user already has that privilege on some columns: either the CREATE SUBSCRIPTION command should raise ERROR, or we should introduce a new privilege (e.g. ACL_SELECT_PUB) for this purpose, which would effectively be ACL_SELECT, but hidden from the users of GRANT / REVOKE. If this approach were taken, the USAGE privilege on schema would need to be granted / revoked in a similar way. What do you think about that? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Peter Eisentraut wrote: > The CF entry is about privileges on publications. Please rebase that patch > and repost it so that the CF app and the CF bot are up to date. The rebased patch (with regression tests added) is attached here. There's still one design issue that I haven't mentioned yet: if the USAGE privilege on a publication is revoked after the synchronization phase completed, the missing privilege on a publication causes ERROR in the output plugin. If the privilege is then granted, the error does not disappear because the same (historical) snapshot we use to decode the failed data change again is also used to check the privileges in the catalog, so the output plugin does not see that the privilege has already been granted. The only solution seems to be to drop the publication from the subscription and add it again, or to drop and re-create the whole subscription. I haven't added a note about this problem to the documentation yet, in case someone has better idea how to approach the problem. -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 03c0193709..d510220a07 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -1935,6 +1935,13 @@ REVOKE ALL ON accounts FROM PUBLIC; statements that have previously performed this lookup, so this is not a completely secure way to prevent object access. + + For publications, allows logical replication via particular + publication. The user specified in + the CREATE + SUBSCRIPTION command must have this privilege on all + publications listed in that command. + For sequences, allows use of the currval and nextval functions. 
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml index f8756389a3..4286d709d9 100644 --- a/doc/src/sgml/logical-replication.sgml +++ b/doc/src/sgml/logical-replication.sgml @@ -1712,7 +1712,9 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER In order to be able to copy the initial table data, the role used for the replication connection must have the SELECT privilege on - a published table (or be a superuser). + a published table (or be a superuser). In addition, the role must have + the USAGE privilege on all the publications referenced + by particular subscription. @@ -1728,14 +1730,12 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER - There are currently no privileges on publications. Any subscription (that - is able to connect) can access any publication. Thus, if you intend to - hide some information from particular subscribers, such as by using row - filters or column lists, or by not adding the whole table to the - publication, be aware that other publications in the same database could - expose the same information. Publication privileges might be added to - PostgreSQL in the future to allow for - finer-grained access control. + If you intend to hide some information from particular subscribers, such as + by using row filters or column lists, or by not adding the whole table to + the publication, be aware that other publications in the same database + could expose the same + information. Privileges on publication can + be used to implement finer-grained access control. diff --git a/doc/src/sgml/ref/grant.sgml b/doc/src/sgml/ref/grant.sgml index dea19cd348..e62b7f643c 100644 --- a/doc/src/sgml/ref/grant.sgml +++ b/doc/src/sgml/ref/grant.sgml @@ -82,6 +82,11 @@ GRANT { { SET | ALTER SYSTEM } [, ... ] | ALL [ PRIVILEGES ] } TO role_specification [, ...] 
[ WITH GRANT OPTION ] [ GRANTED BY role_specification ] +GRANT { USAGE | ALL [ PRIVILEGES ] } +ON PUBLICATION pub_name [, ...] +TO role_specification [, ...] [ WITH GRANT OPTION ] +[ GRANTED BY role_specification ] + GRANT { { CREATE | USAGE } [, ...] | ALL [ PRIVILEGES ] } ON SCHEMA schema_name [, ...] TO role_specification [, ...] [ WITH GRANT OPTION ] @@ -488,7 +493,7 @@ GRANT admins TO joe; -Privileges on databases, tablespaces, schemas, languages, and +Privileges on databases, tablespaces, schemas, languages, publications and configuration parameters are PostgreSQL extensions. diff --git a/doc/src/sgml/ref/revoke.sgml b/doc/src/sgml/ref/revoke.sgml index 4fd4bfb3d7..7e8d018743 100644 --- a/doc/src/sgml/ref/revoke.sgml +++ b/doc/src/sgml/ref/revoke.sgml @@ -104,6 +104,13 @@ REVOKE [ GRANT OPTION FOR ] [ GRANTED BY role_specification ] [ CASCADE | RESTRICT ] +REVOKE [ GRANT OPTION FOR ] +{ USAGE | ALL [ PRIVILEGES ] } +ON PUBLICATION pub_name [, ...] +FROM role_specification [, ...] +[ GRANTED BY role_specification ] +[ CASCADE | RESTRICT ] + REVOKE [ GRANT OPTION FOR ]
Re: Temporary file access API
Hi, John Morris wrote: > I’m a latecomer to the discussion, but as a word of introduction, I’m working > with Stephen, and I have started looking over the temp file proposal with the > idea of helping it move along. > > I’ll start by summarizing the temp file proposal and its goals. > > From a high level, the proposed code: > > * Creates an fread/fwrite replacement (BufFileStream) for buffering data to a > single file. > > * Updates BufFile, which reads/writes a set of files, to use BufFileStream > internally. > > * Does not impact the normal (page cached) database I/O. > > * Updates all the other places where fread/fwrite and read/write are used. Not really all, just those where the change seemed reasonable (i.e. where it does not make the code more complex). > * Creates and removes transient files. The "stream API" is rather an additional layer on top of files that the user needs to create / remove at a lower level. > I see the following goals: > > * Unify all the “other” file accesses into a single, consistent API. > > * Integrate with VFDs. > > * Integrate transient files with transactions and tablespaces. If you mean automatic closing/deletion of files on transaction end, this is also a lower-level thing that I didn't try to change. > * Create a consolidated framework where features like encryption and > compression can be more easily added. > > * Maintain good streaming performance. > > Is this a fair description of the proposal? Basically that's it. > For myself, I’d like to map out how features like compression and encryption > would fit into the framework, more as a sanity check than anything else, and > I’d like to look closer at some of the implementation details. But at the > moment, I want to make sure I have the > proper high-level view of the temp file proposal. I think the high level design (i.e. how the API should be used) still needs discussion. In particular, I don't know whether it should aim at encryption adoption or not.
If it does, then it makes sense to base it on buffile.c, because encryption essentially takes place in memory. But if buffering itself (w/o encryption) is not really useful at other places (see Robert's comments in [1]), then we can design something simpler, w/o touching buffile.c (which, in turn, won't be usable for encryption, compression and so on). So I think that code simplification and easy adoption of in-memory data changes (such as encryption or compression) are two rather distinct goals. I admit that I'm running out of ideas about how to develop a framework that'd be useful for both. [1] https://www.postgresql.org/message-id/CA%2BTgmoZWP8UtkNVLd75Qqoh9VGOVy_0xBK%2BSHZAdNvnfaikKsQ%40mail.gmail.com > From: Robert Haas > Date: Wednesday, September 21, 2022 at 11:54 AM > To: Antonin Houska > Cc: PostgreSQL Hackers , Peter Eisentraut > , Stephen Frost > Subject: Re: Temporary file access API > > On Mon, Aug 8, 2022 at 2:26 PM Antonin Houska wrote: > > > I don't think that (3) is a separate advantage from (1) and (2), so I > > > don't have anything separate to say about it. > > > > I thought that the uncontrollable buffer size is one of the things you > > complained about in > > > > https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com > > Well, I think that email is mostly complaining about there being no > buffering at all in a situation where it would be advantageous to do > some buffering. But that particular code I believe is gone now because > of the shared-memory stats collector, and when looking through the > patch set, I didn't see that any of the cases that it touched were > similar to that one. > > -- > Robert Haas > EDB: http://www.enterprisedb.com > -- Antonin Houska Web: https://www.cybertec-postgresql.com
Return value of PathNameOpenFile()
I've noticed that some callers of PathNameOpenFile() (e.g. bbsink_server_begin_archive()) consider the call failed even if the function returned zero, while other ones do check whether the file descriptor is strictly negative. Since the file descriptor is actually returned by the open() system call, I assume that zero is a valid result, isn't it? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Temporary file access API
Robert Haas wrote: > On Fri, Jul 29, 2022 at 11:36 AM Antonin Houska wrote: > > Attached is a new version. It allows the user to set "elevel" (i.e. ERROR is > > not necessarily thrown on I/O failure, if the user prefers to check the > > number > > of bytes read/written) and to specify the buffer size. Also, 0015 adds a > > function to copy data from one file descriptor to another. > > I basically agree with Peter Eisentraut's earlier comment: "I don't > understand what this patch set is supposed to do." The intended > advantages, as documented in buffile.c, are: > > + * 1. Less code is needed to access the file. > + * > + * 2.. Buffering reduces the number of I/O system calls. > + * > + * 3. The buffer size can be controlled by the user. (The larger the buffer, > + * the fewer I/O system calls are needed, but the more data needs to be > + * written to the buffer before the user recognizes that the file access > + * failed.) > + * > + * 4. It should make features like Transparent Data Encryption less invasive. > > It does look to me like there is a modest reduction in the total > amount of code from these patches, so I do think they might be > achieving (1). > > I'm not so sure about (2), though. A number of these places are cases > where we only do a single write to the file - like in patches 0010, > 0011, 0012, and 0014, for example. And even in 0013, where we do > multiple I/Os, it makes no sense to introduce a second buffering > layer. We're reading from a file into a buffer several kilobytes in > size and then shipping the data out. The places where you want > buffering are places where we might be doing system calls for a few > bytes of I/O at a time, not places where we're already doing > relatively large I/Os. An extra layer of buffering probably makes > things slower, not faster. The buffering seemed to me like a good opportunity to plug-in the encryption, because the data needs to be encrypted in memory before it's written to disk. 
Of course it'd be better to use the existing buffering for that than to add another layer. > I don't think that (3) is a separate advantage from (1) and (2), so I > don't have anything separate to say about it. I thought that the uncontrollable buffer size is one of the things you complaint about in https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com > I find it really unclear whether, or how, it achieves (4). In general, > I don't think that the large number of raw calls to read() and write() > spread across the backend is a particularly good thing. TDE is an > example of a feature that might affect every byte we read or write, > but you can also imagine instrumentation and tracing facilities that > might want to be able to do something every time we do an I/O without > having to modify a zillion separate call sites. Such features can > certainly benefit from consolidation, but I suspect it isn't helpful > to treat every I/O in exactly the same way. For instance, let's say we > settle on a design where everything in the cluster is encrypted but, > just to make things simple for the sake of discussion, let's say there > are no stored nonces anywhere. Can we just route every I/O in the > backend through a wrapper layer that does encryption and decryption? > Probably not. I imagine postgresql.conf and pg_hba.conf aren't going > to be encrypted, for example, because they're intended to be edited by > users. And Bruce has previously proposed designs where the WAL would > be encrypted with a different key than the data files. Another idea, > which I kind of like, is to encrypt WAL on a record level rather than > a block level. Either way, you can't just encrypt every single byte > the same way. There are probably other relevant distinctions: some > data is temporary and need not survive a server restart or be shared > with other backends (except maybe in the case of parallel query) and > some data is permanent. 
Some data is already block-structured using > blocks of size BLCKSZ; some data is block-structured using some other > block size; and some data is not block-structured. Some data is in > standard-format pages that are part of the buffer pool and some data > isn't. I feel like some of those distinctions probably matter to TDE, > and maybe to other things that we might want to do. > > For instance, to go back to my earlier comments, if the data is > already block-structured, we do not need to insert a second layer of > buffering; if it isn't, maybe we should, not just for TDE but for > other reasons as well. If the data is temporary, TDE might want to > encrypt it using a temporary key which is generated on the fly and > only stored in server memory,
Re: Temporary file access API
Antonin Houska wrote: > Peter Eisentraut wrote: > > > On 04.07.22 18:35, Antonin Houska wrote: > > >> Attached is a new version of the patch, to evaluate what the API use in > > >> the > > >> backend could look like. I haven't touched places where the file is > > >> accessed > > >> in a non-trivial way, e.g. lseek() / fseek() or pg_pwrite() / pg_pread() > > >> is > > >> called. > > > Rebased patch set is attached here, which applies to the current master. > > > (A few more opportunities for the new API implemented here.) > > > > I don't understand what this patch set is supposed to do. AFAICT, the > > thread > > originally forked off from a TDE discussion and, considering the thread > > subject, was possibly once an attempt to refactor temporary file access to > > make integrating encryption easier? The discussion then talked about things > > like saving on system calls and excising all OS-level file access API use, > > which I found confusing, and the thread then just became a general > > TDE-related > > mini-discussion. > > Yes, it's an attempt to make the encryption less invasive, but there are a few > other objectives, at least: 1) to make the read / write operations "less > low-level" (there are common things like error handling which are often just > copy & pasted from other places), 2) to have buffered I/O with configurable > buffer size (the current patch version still has a fixed buffer size though) > > It's true that the discussion ends up being encryption-specific, however the > scope of the patch is broader. The first message of the thread references a > related discussion > > https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com > > which is more important for this patch than the suggestions about encryption. > > > The patches at hand extend some internal file access APIs and then sprinkle > > them around various places in the backend code, but there is no explanation > > why or how this is better.
I don't see any real structural savings one might > > expect from a refactoring patch. No information has been presented how this > > might help encryption. > > At this stage I expected feedback from the developers who have already > contributed to the discussion, because I'm not sure myself if this version > fits their ideas - that's why I didn't elaborate the documentation yet. I'll > try to summarize my understanding in the next version, but I'd appreciate it if I > got some feedback for the current version first. > > > I also suspect that changing around the use of various file access APIs > > needs > > to be carefully evaluated in each individual case. Various code has subtle > > expectations about atomic write behavior, syncing, flushing, error recovery, > > etc. I don't know if this has been considered here. > > I considered that, but could have messed up in some places. Right now I'm > aware of one problem: pgstat.c does not expect the file access API to raise > ERROR - this needs to be fixed. Attached is a new version. It allows the user to set "elevel" (i.e. ERROR is not necessarily thrown on I/O failure, if the user prefers to check the number of bytes read/written) and to specify the buffer size. Also, 0015 adds a function to copy data from one file descriptor to another. -- Antonin Houska Web: https://www.cybertec-postgresql.com Attachment: temp_file_api_v4.tgz (application/gzip)
Re: Proposal: allow database-specific role memberships
Kenaniah Cerny wrote: > Rebased yet again... > > On Mon, Jul 4, 2022 at 1:17 PM Kenaniah Cerny wrote: > The "unsafe_tests" directory is where the pre-existing role tests were > located. According to the readme of the "unsafe_tests" directory, the tests > contained within are not run during "make installcheck" because they could > have side-effects that seem undesirable for a production installation. This > seemed like a reasonable location as the new tests that this patch > introduces also modifies the "state" of the database cluster by adding, > modifying, and removing roles & databases (including template1). ok, I missed the purpose of "unsafe_tests" so far, thanks. > Regarding roles_is_member_of(), the nuance is that role "A" in your example > would only be considered a member of role "B" (and by extension role "C") > when connected to the database in which "A" was granted database-specific > membership to "B". > Conversely, when connected to any other database, "A" would not be > considered to be a member of "B". > > This patch is designed to solve the scenarios in which one may want to > grant constrained access to a broader set of privileges. For example, > membership in "pg_read_all_data" effectively grants SELECT and USAGE rights > on everything (implicitly cluster-wide in today's implementation). By > granting a role membership to "pg_read_all_data" within the context of a > specific database, the grantee's read-everything privilege is effectively > constrained to just that specific database (as membership within > "pg_read_all_data" would not otherwise be held). ok, I tried to view the problem rather from a general perspective. However, permissions like "pg_read_all_data" are unusual in that they are rather strong and at the same time they are usually located at the top of the groups hierarchy. I've got no better idea of how to solve the problem. A few more comments on the patch: * It's not clear from the explanation of the GRANT ... IN DATABASE ...
/ GRANT ... IN CURRENT DATABASE ... that, even if "membership in ... will be effective only when the recipient is connected to the database ...", the ADMIN option might not be "fully effective". I refer to the part of the regression tests starting with -- Ensure database-specific admin option can only grant within that database For example, "role_read_34" does have the ADMIN option for the "pg_read_all_data" role and for the "db_4" database: GRANT pg_read_all_data TO role_read_34 IN DATABASE db_4 WITH ADMIN OPTION; (in other words, "role_read_34" does have the database-specific membership in "pg_read_all_data"), but it cannot use the option (in other words, cannot use some ability resulting from that membership) unless the session to that database is active: \connect db_3 SET SESSION AUTHORIZATION role_read_34; ... GRANT pg_read_all_data TO role_granted IN CURRENT DATABASE; -- success GRANT pg_read_all_data TO role_granted IN DATABASE db_3; -- notice NOTICE: role "role_granted" is already a member of role "pg_read_all_data" in database "db_3" GRANT pg_read_all_data TO role_granted IN DATABASE db_4; -- error ERROR: must have admin option on role "pg_read_all_data" Specifically on the regression tests: * The function check_memberships() has no parameters - is there a reason not to use a view? * I'm not sure if the pg_auth_members catalog can contain InvalidOid in other columns than dbid. Thus I think that the query in check_memberships() only needs an outer JOIN for the pg_database table, while the other joins can be inner. * In this part SET SESSION AUTHORIZATION role_read_12_noinherit; SELECT * FROM data; -- error SET ROLE role_read_12; -- error SELECT * FROM data; -- error I think you don't need to query the table again if the SET ROLE statement failed and the same query had been executed before the SET ROLE. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Problem about postponing gathering partial paths for topmost scan/join rel
Richard Guo wrote: > On Fri, Jul 15, 2022 at 5:00 PM Richard Guo wrote: > > On Fri, Jul 15, 2022 at 4:03 PM Richard Guo wrote: > On Thu, Jul 14, 2022 at 10:02 PM Antonin Houska wrote: > > I'd prefer a test that demonstrates that the Gather node at the top of the > "subproblem plan" is useful purely from the *cost* perspective, rather than > due to executor limitation. > > This patch provides an additional path (Gather atop of subproblem) which > was not available before. But your concern makes sense that we need to > show this new path is valuable from competing on cost with other paths. > > How about we change to Nested Loop at the topmost? Something like: > > Maybe a better example is that we use a small table 'c' to avoid the > Gather node above scanning 'c', so that the path of parallel nestloop is > possible to be generated. > > Update the patch with the new test case. ok, this makes sense to me. Just one minor suggestion: the command alter table d_star reset (parallel_workers); is not necessary because it's immediately followed by rollback; I'm going to set the CF entry to "ready for committer". -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Problem about postponing gathering partial paths for topmost scan/join rel
Richard Guo wrote: > On Wed, Jul 28, 2021 at 3:42 PM Richard Guo wrote: > > To fix this problem, I'm thinking we can leverage 'root->all_baserels' > to tell if we are at the topmost scan/join rel, something like:
>
> --- a/src/backend/optimizer/path/allpaths.c
> +++ b/src/backend/optimizer/path/allpaths.c
> @@ -3041,7 +3041,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
>            * partial paths.  We'll do the same for the topmost scan/join rel
>            * once we know the final targetlist (see grouping_planner).
>            */
> -         if (lev < levels_needed)
> +         if (!bms_equal(rel->relids, root->all_baserels))
>               generate_useful_gather_paths(root, rel, false);
>
> Any thoughts? > > Attach a patch to include the fix described upthread. Would appreciate > any comments on this topic. I think I understand the idea but I'm not sure about the regression test. I suspect that in the plan

EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 a JOIN tenk1 b ON a.two = b.two FULL JOIN tenk1 c ON b.two = c.two;

                         QUERY PLAN
 Aggregate
   ->  Hash Full Join
         Hash Cond: (b.two = c.two)
         ->  Gather
               Workers Planned: 4
               ->  Parallel Hash Join
                     Hash Cond: (a.two = b.two)
                     ->  Parallel Seq Scan on tenk1 a
                     ->  Parallel Hash
                           ->  Parallel Seq Scan on tenk1 b
         ->  Hash
               ->  Gather
                     Workers Planned: 4
                     ->  Parallel Seq Scan on tenk1 c

the Gather node is located below the "Hash Full Join" node only because that kind of join currently cannot be executed by parallel workers. If the parallel "Hash Full Join" gets implemented (I've noticed but not checked in detail [1]), it might break this test. I'd prefer a test that demonstrates that the Gather node at the top of the "subproblem plan" is useful purely from the *cost* perspective, rather than due to executor limitation. [1] https://commitfest.postgresql.org/38/2903/ -- Antonin Houska Web: https://www.cybertec-postgresql.com
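The effect of the quoted one-line change can be modeled with sets (a hypothetical sketch; the frozensets stand in for the RelOptInfo.relids bitmapsets). When the join search is split into subproblems — here, because the FULL JOIN forces the {a, b} join to be planned as its own search — the top rel of a subproblem sits at the highest level of its own search even though it does not cover all base rels, so a level-based check would wrongly skip generating Gather paths for it:

```python
def should_generate_gather_paths(relids, all_baserels):
    # relids-based test: skip only the overall topmost scan/join rel,
    # whose gather paths are added later (once the final targetlist is known)
    return relids != all_baserels

all_baserels = frozenset({"a", "b", "c"})
subproblem_top = frozenset({"a", "b"})   # top of the a JOIN b subproblem
overall_top = all_baserels               # the topmost scan/join rel
```

With this test, the {a, b} subproblem top gets Gather paths (enabling the Gather under the Hash Full Join in the plan above), while the overall topmost rel is still deferred.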
Re: [PROPOSAL] Detecting plan changes with plan_id in pg_stat_activity
Shouldn't the patch status be set to "Waiting on Author"? (I was curious if this is a patch that I can review.) Julien Rouhaud wrote: > On Wed, Jun 22, 2022 at 11:05:54PM +, Imseih (AWS), Sami wrote: > > >Can you describe how it's kept in sync, and how it makes sure that the > > > property > > >is maintained over restart / gc? I don't see any change in the code > > > for the > > >2nd part so I don't see how it could work (and after a quick test it > > > indeed > > >doesn't). > > > > There is a routine called qtext_hash_sync which removes all entries from > > the qtext_hash and reloads it with all the query ids from pgss_hash. > > [...] > > At all the points where it's called, an exclusive lock is held. This allows for syncing all > > the present queryid's in pgss_hash with qtext_hash. > > So your approach is to let the current gc / file loading behavior happen as > before and construct your mapping hash using the resulting query text / offset > info. That can't work. > > > >2nd part so I don't see how it could work (and after a quick test it > > > indeed > > >doesn't). > > > > Can you tell me what test you used to determine it is not in sync? > > What test did you use to determine it is in sync? Have you checked how the gc / file loading actually works? > > In my case I just checked the size of the query text file after running some > script that makes sure that there are the same few queries run by multiple > different roles, then: > > Size of $PGDATA/pg_stat_tmp/pgss_query_texts.stat: 559B > pg_ctl restart > Size of $PGDATA/pg_stat_tmp/pgss_query_texts.stat: 2383B > > > >Can you share more details on the benchmarks you did? Did you also run > > >benchmark on workloads that induce entry eviction, with and without > > > need for > > >gc? Eviction in pgss is already very expensive, and making things > > > worse just > > >to save a bit of disk space doesn't seem like a good compromise. > > > > Sorry this was poorly explained by me. 
I went back and did some benchmarks. > > Attached is > > The script and results. But here is a summary: > > On an EC2 r5.2xlarge. The benchmark I performed is: > > 1. create 10k tables > > 2. create 5 users > > 3. run a pgbench script that performs per transaction a select on > > A randomly chosen table for each of the 5 users. > > 4. 2 variants of the test executed. 1 variant is with the default > > pg_stat_statements.max = 5000 > > and one test with a larger pg_stat_statements.max = 1. > > But you wrote: > > > ## > > ## pg_stat_statements.max = 15000 > > ## > > So which one is it? > > > > > So 10-15% is not accurate. I originally tested on a less powered machine. > > For this > > Benchmark I see a 6% increase in TPS (732k vs 683k) when we have a larger > > sized > > pg_stat_statements.max is used and less gc/deallocations. > > Both tests show a drop in gc/deallocations and a net increase > > In tps. Less GC makes sense since the external file has less duplicate SQLs. > > On the other hand you're rebuilding the new query_offset hashtable every time > there's an entry eviction, which seems quite expensive. > > Also, as I mentioned you didn't change any of the heuristic for > need_gc_qtexts(). So if the query texts are indeed deduplicated, doesn't it > mean that gc will artificially > be called less often? The wanted target of "50% bloat" will become "50% > bloat assuming no deduplication is done" and the average query text file size > will stay the same whether the query texts are deduplicated or not. > > I'm wondering whether the improvements you see are due to the patch or simply due to > artificially calling gc less often. What are the results if instead of using > vanilla pg_stat_statements you patch it to perform roughly the same number of > gc as your version does? > > Also your benchmark workload is very friendly with your feature; what are the > results with other workloads? 
Having the results with query texts that aren't > artificially long would be interesting for instance, after fixing the problems > mentioned previously. > > Also, you said that if you run that benchmark with a single user you don't see > any regression. I don't see how rebuilding an extra hashtable in > entry_dealloc(), i.e. while holding an exclusive lwlock, without saving any > other work elsewhere, could be free. > > Looking at the script, you have: > echo "log_min_messages=debug1" >> $PGDATA/postgresql.conf; \ > > Is that really necessary? Couldn't you upgrade the gc message to a higher > level for your benchmark need, or expose some new counter in > pg_stat_statements_info maybe? Have you done the benchmark using a debug > build > or a normal build? > > -- Antonin Houska Web: https://www.cybertec-postgresql.com
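The deduplication being discussed can be modeled as follows. This is a toy sketch, not the pg_stat_statements code: entries are keyed by (userid, dbid, queryid), and the hypothetical qtext_hash maps a queryid to the offset of the first copy of its text in the external file:

```python
def build_qtext_file(entries, texts, dedup):
    """Return (file contents, per-entry text offsets) for a set of pgss entries."""
    parts, offsets, qtext_hash, pos = [], {}, {}, 0
    for key in entries:                         # key = (userid, dbid, queryid)
        queryid = key[2]
        if dedup and queryid in qtext_hash:
            offsets[key] = qtext_hash[queryid]  # point at the stored copy
            continue
        text = texts[queryid] + "\0"            # NUL-terminated text
        offsets[key] = qtext_hash[queryid] = pos
        parts.append(text)
        pos += len(text)
    return "".join(parts), offsets
```

The same query run under five roles is stored five times without deduplication and once with it — which also illustrates Julien's point: if the file shrinks but the need_gc_qtexts() heuristic is unchanged, gc is simply triggered less often, and that alone could account for some of the measured speedup.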
Re: Temporary file access API
Peter Eisentraut wrote: > On 04.07.22 18:35, Antonin Houska wrote: > >> Attached is a new version of the patch, to evaluate what the API use in the > >> backend could look like. I haven't touched places where the file is > >> accessed > >> in a non-trivial way, e.g. lseek() / fseek() or pg_pwrite() / pg_pread() is > >> called. > > Rebased patch set is attached here, which applies to the current master. > > (A few more opportunities for the new API implemented here.) > > I don't understand what this patch set is supposed to do. AFAICT, the thread > originally forked off from a TDE discussion and, considering the thread > subject, was possibly once an attempt to refactor temporary file access to > make integrating encryption easier? The discussion then talked about things > like saving on system calls and excising all OS-level file access API use, > which I found confusing, and the thread then just became a general TDE-related > mini-discussion. Yes, it's an attempt to make the encryption less invasive, but there are a few other objectives, at least: 1) to make the read / write operations "less low-level" (there are common things like error handling which are often just copy & pasted from other places), 2) to have buffered I/O with configurable buffer size (the current patch version still has fixed buffer size though) It's true that the discussion ended up being encryption-specific, however the scope of the patch is broader. The first message of the thread references a related discussion https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com which is more important for this patch than the suggestions about encryption. > The patches at hand extend some internal file access APIs and then sprinkle > them around various places in the backend code, but there is no explanation > why or how this is better. I don't see any real structural savings one might > expect from a refactoring patch. 
No information has been presented how this > might help encryption. At this stage I expected feedback from the developers who have already contributed to the discussion, because I'm not sure myself if this version fits their ideas - that's why I didn't elaborate on the documentation yet. I'll try to summarize my understanding in the next version, but I'd appreciate it if I got some feedback for the current version first. > I also suspect that changing around the use of various file access APIs needs > to be carefully evaluated in each individual case. Various code has subtle > expectations about atomic write behavior, syncing, flushing, error recovery, > etc. I don't know if this has been considered here. I considered that, but could have messed up in some places. Right now I'm aware of one problem: pgstat.c does not expect the file access API to raise ERROR - this needs to be fixed. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Temporary file access API
Antonin Houska wrote: > Robert Haas wrote: > > > On Tue, Apr 12, 2022 at 5:30 AM Antonin Houska wrote: > > > Robert Haas wrote: > > > > On Mon, Apr 11, 2022 at 4:05 AM Antonin Houska wrote: > > > > > There are't really that many kinds of files to encrypt: > > > > > > > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#List_of_the_files_that_contain_user_data > > > > > > > > > > (And pg_stat/* files should be removed from the list.) > > > > > > > > This kind of gets into some theoretical questions. Like, do we think > > > > that it's an information leak if people can look at how many > > > > transactions are committing and aborting in pg_xact_status? In theory > > > > it could be, but I know it's been argued that that's too much of a > > > > side channel. I'm not sure I believe that, but it's arguable. > > > > > > I was referring to the fact that the statistics are no longer stored in > > > files: > > > > > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd > > > > Oh, yeah, I agree with that. > > I see now that the statistics are yet saved to a file on server shutdown. I've > updated the wiki page. > > Attached is a new version of the patch, to evaluate what the API use in the > backend could look like. I haven't touched places where the file is accessed > in a non-trivial way, e.g. lseek() / fseek() or pg_pwrite() / pg_pread() is > called. Rebased patch set is attached here, which applies to the current master. (A few more opportunities for the new API implemented here.) -- Antonin Houska Web: https://www.cybertec-postgresql.com temp_file_api_v3.tgz Description: application/gzip
Re: Proposal: allow database-specific role memberships
Kenaniah Cerny wrote: > Attached is a newly-rebased patch -- would love to get a review from someone > whenever possible. I've picked this patch for a review. The patch currently does not apply to the master branch, so I could only read the diff. Following are my comments: * I think that roles_is_member_of() deserves a comment explaining why the code that you moved into append_role_memberships() needs to be called twice, i.e. once for global memberships and once for the database-specific ones. I think the reason is that if, for example, role "A" is a database-specific member of role "B" and "B" is a "global" member of role "C", then "A" should not be considered a member of "C", unless "A" is granted "C" explicitly. Is this behavior intended? Note that in this example, the "C" members are a superset of "B" members, and thus "C" should have weaker permissions on database objects than "B". What's then the reason to not consider "A" a member of "C"? If "C" gives its members some permissions of "B" (e.g. "pg_write_all_data"), then I think the roles hierarchy is poorly designed. A counter-example might help me to understand. * Why do you think that "unsafe_tests" is the appropriate name for the directory that contains regression tests? I can spend more time on the review if the patch gets rebased. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Euler Taveira wrote: > --eeab359ad6094efd84562cddd7fb9e89 > Content-Type: text/plain > > On Wed, May 18, 2022, at 6:44 AM, Antonin Houska wrote: > > ok, please see the next version. > The new paragraph looks good to me. I'm not sure if the CREATE PUBLICATION is > the right place to provide such information. As I suggested in a previous > email > [1], you could add it to "Logical Replication > Security". ok, I missed that. The next version moves the text there. > [1] https://postgr.es/m/d96103fe-99e2-4119-bd76-952d326b7...@www.fastmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml index 145ea71d61b..2fcaa9d261a 100644 --- a/doc/src/sgml/logical-replication.sgml +++ b/doc/src/sgml/logical-replication.sgml @@ -1171,6 +1171,17 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER schema automatically, the user must be a superuser. + + Note that there are currently no privileges on publication, and that any + subscriber can access any publication. Thus if you're trying to hide some + information from particular subscribers (by using the + WHERE clause or the column list, or by not adding the + whole table to the publication), please be aware that other publications + can expose the same information. Publication privileges might be added + to PostgreSQL in the future to allow for + fine-grained access control. + + To create a subscription, the user must be a superuser.
Re: Privileges on PUBLICATION
Euler Taveira wrote: > On Wed, May 18, 2022, at 6:16 AM, Antonin Houska wrote: > > The patch is attached to this message. > > Great. Add it to the next CF. I'll review it when I have some spare time. https://commitfest.postgresql.org/38/3641/ Thanks! -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Euler Taveira wrote: > On Fri, May 13, 2022, at 3:36 AM, Antonin Houska wrote: > > Attached is my proposal. It tries to be more specific and does not mention > the > absence of the privileges explicitly. > > You explained the current issue but say nothing about the limitation. This > information will trigger a question possibly in one of the MLs. IMO if you say > something like the sentence above at the end, it will make it clear why that > setup expose all data (there is no access control to publications) and > explicitly say there is a TODO here. > > Additional privileges might be added to control access to table data in a > future version of PostgreSQL. I thought it sounded too negative if the absence of some feature was mentioned explicitly. However, it makes sense to be clear from a technical point of view. > I also wouldn't use the warning tag because it fits in the same category as > the > other restrictions listed in the page. ok, please see the next version. -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/doc/src/sgml/ref/create_publication.sgml b/doc/src/sgml/ref/create_publication.sgml index 1a828e8d2ff..259fe20a148 100644 --- a/doc/src/sgml/ref/create_publication.sgml +++ b/doc/src/sgml/ref/create_publication.sgml @@ -112,6 +112,17 @@ CREATE PUBLICATION name Specifying a table that is part of a schema specified by FOR ALL TABLES IN SCHEMA is not supported. + + + Note that there are currently no privileges on publication, and that any + subscriber can access any publication. Thus if you're trying to hide + some information from particular subscribers (by using the + WHERE clause or the column list, or by not adding the + whole table to the publication), please be aware that other publications + can expose the same information. Publication privileges might be added + to PostgreSQL in the future to allow for + fine-grained access control. +
Re: Privileges on PUBLICATION
Antonin Houska wrote: > Euler Taveira wrote: > > > On Tue, May 10, 2022, at 5:37 AM, Antonin Houska wrote: > > > > My understanding is that the rows/columns filtering is a way for the > > *publisher* to control which data is available to particular replica. From > > this point of view, the publication privileges would just make the control > > complete. > > > > I agree. IMO it is a new feature. We already require high privilege for > > logical > > replication. Hence, we expect the replication user to have access to all > > data. > > Unfortunately, nobody mentioned about this requirement during the row > > filter / > > column list development; someone could have written a patch for GRANT ... ON > > PUBLICATION. > > I can try that for PG 16, unless someone is already working on it. The patch is attached to this message. -- Antonin Houska Web: https://www.cybertec-postgresql.com >From 4004260cf1f3443455b7db89a70a50e38d1663b0 Mon Sep 17 00:00:00 2001 From: Antonin Houska Date: Wed, 18 May 2022 10:58:42 +0200 Subject: [PATCH] Draft implementation of USAGE privilege on PUBLICATION. This privilege seems to be useful for databases from which multiple subscribers replicate the data. It should not be assumed that any publication can be exposed to any subscription. 
--- doc/src/sgml/ddl.sgml | 7 + doc/src/sgml/logical-replication.sgml | 4 +- doc/src/sgml/ref/grant.sgml | 7 +- doc/src/sgml/ref/revoke.sgml| 7 + src/backend/catalog/aclchk.c| 206 src/backend/commands/copyto.c | 114 +++ src/backend/commands/publicationcmds.c | 2 + src/backend/parser/gram.y | 8 + src/backend/replication/pgoutput/pgoutput.c | 13 +- src/backend/utils/adt/acl.c | 7 + src/bin/pg_dump/dumputils.c | 2 + src/bin/pg_dump/pg_dump.c | 28 ++- src/bin/pg_dump/pg_dump.h | 1 + src/bin/psql/tab-complete.c | 3 + src/include/catalog/pg_publication.h| 6 + src/include/nodes/parsenodes.h | 4 +- src/include/utils/acl.h | 4 + 17 files changed, 413 insertions(+), 10 deletions(-) diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index f2ac1ba0034..dc499c50756 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -1935,6 +1935,13 @@ REVOKE ALL ON accounts FROM PUBLIC; statements that have previously performed this lookup, so this is not a completely secure way to prevent object access. + + For publications, allows logical replication via particular + publication. The user specified in + the CREATE + SUBSCRIPTION command must have this privilege on all + publications listed in that command. + For sequences, allows use of the currval and nextval functions. diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml index 145ea71d61b..63194b34541 100644 --- a/doc/src/sgml/logical-replication.sgml +++ b/doc/src/sgml/logical-replication.sgml @@ -1156,7 +1156,9 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER In order to be able to copy the initial table data, the role used for the replication connection must have the SELECT privilege on - a published table (or be a superuser). + a published table (or be a superuser). In addition, the role must have + the USAGE privilege on all the publications referenced + by particular subscription. 
diff --git a/doc/src/sgml/ref/grant.sgml b/doc/src/sgml/ref/grant.sgml index f744b05b55d..95b9c46137f 100644 --- a/doc/src/sgml/ref/grant.sgml +++ b/doc/src/sgml/ref/grant.sgml @@ -82,6 +82,11 @@ GRANT { { SET | ALTER SYSTEM } [, ... ] | ALL [ PRIVILEGES ] } TO role_specification [, ...] [ WITH GRANT OPTION ] [ GRANTED BY role_specification ] +GRANT { USAGE | ALL [ PRIVILEGES ] } +ON PUBLICATION pub_name [, ...] +TO role_specification [, ...] [ WITH GRANT OPTION ] +[ GRANTED BY role_specification ] + GRANT { { CREATE | USAGE } [, ...] | ALL [ PRIVILEGES ] } ON SCHEMA schema_name [, ...] TO role_specification [, ...] [ WITH GRANT OPTION ] @@ -460,7 +465,7 @@ GRANT admins TO joe; -Privileges on databases, tablespaces, schemas, languages, and +Privileges on databases, tablespaces, schemas, languages, publications and configuration parameters are PostgreSQL extensions. diff --git a/doc/src/sgml/ref/revoke.sgml b/doc/src/sgml/ref/revoke.sgml index 62f19710369..c21ea4c9c69 100644 --- a/doc/src/sgml/ref/revoke.sgml +++ b/doc/src/sgml/ref/revoke.sgml @@ -104,6 +104,13 @@ REVOKE [ GRANT OPTION FOR ] [ GR
Re: Privileges on PUBLICATION
Peter Eisentraut wrote: > On 10.05.22 10:37, Antonin Houska wrote: > > My understanding is that the rows/columns filtering is a way for the > > *publisher* to control which data is available to particular replica. From > > this point of view, the publication privileges would just make the control > > complete. > > I think privileges on publications would eventually be useful. But are you > arguing that we need them for PG15 to make the new features usable safely? I didn't think that far, but user should be aware of the problem. My proposal of documentation is in https://www.postgresql.org/message-id/5859.1652423797%40antos -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Euler Taveira wrote: > On Tue, May 10, 2022, at 5:37 AM, Antonin Houska wrote: > > My understanding is that the rows/columns filtering is a way for the > *publisher* to control which data is available to particular replica. From > this point of view, the publication privileges would just make the control > complete. > > I agree. IMO it is a new feature. We already require high privilege for > logical > replication. Hence, we expect the replication user to have access to all data. > Unfortunately, nobody mentioned about this requirement during the row filter / > column list development; someone could have written a patch for GRANT ... ON > PUBLICATION. I can try that for PG 16, unless someone is already working on it. > I understand your concern. Like I said in my last sentence in the previous > email: it is a fine-grained access control on the publisher. Keep in mind that > it will *only* work for non-superusers (REPLICATION attribute). It is not > exposing something that we didn't expose before. In this particular case, > there > is no mechanism to prevent the subscriber to obtain data provided by the > various row filters if they know the publication names. We could probably add > a > sentence to "Logical Replication > Security" section: > > There is no privileges for publications. If you have multiple publications in > a > database, a subscription can use all publications available. Attached is my proposal. It tries to be more specific and does not mention the absence of the privileges explicitly. -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/doc/src/sgml/ref/create_publication.sgml b/doc/src/sgml/ref/create_publication.sgml index 1a828e8d2ff..b74ba625649 100644 --- a/doc/src/sgml/ref/create_publication.sgml +++ b/doc/src/sgml/ref/create_publication.sgml @@ -94,6 +94,16 @@ CREATE PUBLICATION name list is specified, it must include the replica identity columns. 
+ + + If you are using the WHERE clause or the column list + to omit some table data from the replication for security reasons, + please make sure that the same data is not exposed via other + publications which contain the same table and have different (or + none) WHERE clause or column list. + + + Only persistent base tables and partitioned tables can be part of a publication. Temporary tables, unlogged tables, foreign tables,
Re: Privileges on PUBLICATION
Amit Kapila wrote: > On Tue, May 10, 2022 at 12:16 AM Euler Taveira wrote: > > > > On Mon, May 9, 2022, at 11:09 AM, Antonin Houska wrote: > > > > Now that the user can specify rows and columns to be omitted from the > > logical > > replication [1], I suppose hiding rows and columns from the subscriber is an > > important use case. However, since the subscription connection user (i.e. > > the > > user specified in the CREATE SUBSCRIPTION ... CONNECTION ... command) needs > > SELECT permission on the replicated table (on the publication side), he can > > just use another publication (which has different filters or no filters at > > all) to get the supposedly-hidden data replicated. > > > > The required privileges were not relaxed on publisher after the row filter > > and > > column list features. It is not just to "create another publication". Create > > publications require CREATE privilege on databases (that is *not* granted to > > PUBLIC).If you have an untrusted user that could bypass your rules about > > hidden > > data, it is better to review your user privileges. > > > > Also, to create a subscription (which combines multiple publications > to bypass rules), a user must be a superuser. So, isn't that a > sufficient guarantee that users shouldn't be able to bypass such > rules? My understanding is that the rows/columns filtering is a way for the *publisher* to control which data is available to particular replica. From this point of view, the publication privileges would just make the control complete. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Euler Taveira wrote: > On Mon, May 9, 2022, at 11:09 AM, Antonin Houska wrote: > > Now that the user can specify rows and columns to be omitted from the logical > replication [1], I suppose hiding rows and columns from the subscriber is an > important use case. However, since the subscription connection user (i.e. the > user specified in the CREATE SUBSCRIPTION ... CONNECTION ... command) needs > SELECT permission on the replicated table (on the publication side), he can > just use another publication (which has different filters or no filters at > all) to get the supposedly-hidden data replicated. > > The required privileges were not relaxed on publisher after the row filter > and > column list features. It is not just to "create another publication". Create > publications require CREATE privilege on databases (that is *not* granted to > PUBLIC).If you have an untrusted user that could bypass your rules about > hidden > data, it is better to review your user privileges. > > postgres=# CREATE ROLE foo REPLICATION LOGIN; > CREATE ROLE > postgres=# \c - foo > You are now connected to database "postgres" as user "foo". > postgres=> CREATE PUBLICATION pub1; > ERROR: permission denied for database postgres > > The documentation [1] says > > "The role used for the replication connection must have the REPLICATION > attribute (or be a superuser)." > > You can use role foo for the replication connection but role foo couldn't be a > superuser. In this case, even if role foo open a connection to database > postgres, a publication cannot be created due to lack of privileges. > > Don't we need privileges on publication (e.g GRANT USAGE ON PUBLICATION ...) > now? > > Maybe. We rely on CREATE privilege on databases right now. If you say that > GRANT USAGE ON PUBLICATION is just a command that will have the same effect as > REPLICATION property [1] has right now, I would say it won't. Are you aiming a > fine-grained access control on publisher? 
The configuration I'm thinking of is multiple replicas reading data from the same master. For example, consider "foo" and "bar" roles, used by "subscr_foo" and "subscr_bar" subscriptions respectively. (Therefore, both roles need the REPLICATION option.) The subscriptions "subscr_foo" and "subscr_bar" are located in "db_foo" and "db_bar" databases respectively. On the master side, there are two publications: "pub_foo" and "pub_bar", to be used by "subscr_foo" and "subscr_bar" subscriptions respectively. The publications replicate the same table, but each with a different row filter. The problem is that the admin of "db_foo" can add the "pub_bar" publication to the "subscr_foo" subscription, and thus get the data that his "pub_foo" would filter out. Likewise, the admin of "db_bar" can "steal" the data from "pub_foo" by adding that publication to "subscr_bar". In this case, the existing publications are misused, so the CREATE PUBLICATION privileges do not help. Since the REPLICATION option of a role is cluster-wide, but I need specific roles to be restricted to specific publications, it can actually be called fine-grained access control as you say. > [1] https://www.postgresql.org/docs/devel/logical-replication-security.html -- Antonin Houska Web: https://www.cybertec-postgresql.com
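The leak scenario above can be sketched like this (purely illustrative data; each publication is reduced to its row filter): a subscription that combines several publications over the same table receives the union of the filtered row sets, so adding "pub_bar" reveals exactly the rows "pub_foo" was supposed to hide:

```python
rows = [
    {"id": 1, "region": "EU"},
    {"id": 2, "region": "US"},
]

def publish(rows, row_filter):
    # a publication only replicates rows matching its WHERE clause
    return [r for r in rows if row_filter(r)]

pub_foo = lambda r: r["region"] == "EU"   # row filter of pub_foo
pub_bar = lambda r: r["region"] == "US"   # row filter of pub_bar

def replicate(rows, publications):
    # a subscription listing several publications gets the union of the rows
    seen = []
    for pub in publications:
        for r in publish(rows, pub):
            if r not in seen:
                seen.append(r)
    return seen
```

With only pub_foo, the subscriber sees the EU rows; once pub_bar is added to the subscription, the whole table is replicated — which is why per-publication privileges (rather than the cluster-wide REPLICATION attribute) would be needed to enforce the intended filtering.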
Privileges on PUBLICATION
Now that the user can specify rows and columns to be omitted from the logical replication [1], I suppose hiding rows and columns from the subscriber is an important use case. However, since the subscription connection user (i.e. the user specified in the CREATE SUBSCRIPTION ... CONNECTION ... command) needs SELECT permission on the replicated table (on the publication side), he can just use another publication (which has different filters or no filters at all) to get the supposedly-hidden data replicated. Don't we need privileges on publications (e.g. GRANT USAGE ON PUBLICATION ...) now? [1] https://www.postgresql.org/docs/devel/sql-createpublication.html -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Temporary file access API
Robert Haas wrote: > On Tue, Apr 12, 2022 at 5:30 AM Antonin Houska wrote: > > Robert Haas wrote: > > > On Mon, Apr 11, 2022 at 4:05 AM Antonin Houska wrote: > > > > There aren't really that many kinds of files to encrypt: > > > > > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#List_of_the_files_that_contain_user_data > > > > > > > > (And pg_stat/* files should be removed from the list.) > > > > > > This kind of gets into some theoretical questions. Like, do we think > > > that it's an information leak if people can look at how many > > > transactions are committing and aborting in pg_xact_status? In theory > > > it could be, but I know it's been argued that that's too much of a > > > side channel. I'm not sure I believe that, but it's arguable. > > > > I was referring to the fact that the statistics are no longer stored in > > files: > > > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd > > Oh, yeah, I agree with that. I see now that the statistics are still saved to a file on server shutdown. I've updated the wiki page. Attached is a new version of the patch, to evaluate what the API use in the backend could look like. I haven't touched places where the file is accessed in a non-trivial way, e.g. lseek() / fseek() or pg_pwrite() / pg_pread() is called. Another use case might be copying one file to another via a buffer. Something like BufFileCopy(int dstfd, int srcfd, int bufsize) The obvious call site would be in copydir.c:copy_file(), but I think there are a few more in the server code. -- Antonin Houska Web: https://www.cybertec-postgresql.com temp_file_api_v2.tgz Description: application/gzip
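The BufFileCopy() idea from the last paragraph could look roughly like this — a sketch in Python rather than C, using POSIX-style descriptor I/O; only the signature comes from the email, the rest is illustrative:

```python
import os

def buf_file_copy(dstfd, srcfd, bufsize=8192):
    """Copy srcfd to dstfd through a fixed-size buffer; return bytes copied."""
    total = 0
    while True:
        chunk = os.read(srcfd, bufsize)
        if not chunk:                      # EOF
            break
        while chunk:                       # os.write() may write less than asked
            written = os.write(dstfd, chunk)
            total += written
            chunk = chunk[written:]
    return total
```

A C version in the proposed API would additionally centralize the error handling (ereport on short read/write) that copydir.c:copy_file() and similar call sites currently duplicate.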
Re: Temporary file access API
Matthias van de Meent wrote: > On Mon, 11 Apr 2022 at 10:05, Antonin Houska wrote: > > > > There aren't really that many kinds of files to encrypt: > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#List_of_the_files_that_contain_user_data > > > > (And pg_stat/* files should be removed from the list.) > > I was looking at that list of files that contain user data, and > noticed that all relation forks except the main fork were marked as > 'does not contain user data'. To me this seems not necessarily true: > AMs do have access to forks for user data storage as well (without any > real issues or breaking the abstraction), and the init-fork is > expected to store user data (specifically in the form of unlogged > sequences). Shouldn't those forks thus also be encrypted-by-default, > or should we provide some other method to ensure that non-main forks > with user data are encrypted? Thanks. I've updated the wiki page (and also included Robert's comments). -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Temporary file access API
Robert Haas wrote: > On Mon, Apr 11, 2022 at 4:05 AM Antonin Houska wrote: > > There aren't really that many kinds of files to encrypt: > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#List_of_the_files_that_contain_user_data > > > > (And pg_stat/* files should be removed from the list.) > > This kind of gets into some theoretical questions. Like, do we think > that it's an information leak if people can look at how many > transactions are committing and aborting in pg_xact_status? In theory > it could be, but I know it's been argued that that's too much of a > side channel. I'm not sure I believe that, but it's arguable. I was referring to the fact that the statistics are no longer stored in files: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd > Similarly, the argument that global/pg_internal.init doesn't contain > user data relies on the theory that the only table data that will make > its way into the file is for system catalogs. I guess that's not user > data *exactly* but ... are we sure that's how we want to roll here? Yes, this is worth attention. > I really don't know how you can argue that pg_dynshmem/mmap.NNN > doesn't contain user data - surely it can. Good point. Since postgres does not control writing into this file, it's a special case though. (Maybe TDE will have to refuse to start if dynamic_shared_memory_type is set to mmap and the instance is encrypted.) Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
孔凡深(云梳) wrote: > Hi, Antonin. I am quite interested in zheap and have recently been reviewing > the patch you submitted. When I used the pg_undodump tool to dump the undo > page chunks, I found that some chunk headers are abnormal. After reading the > relevant code in 0006-The-initial-implementation-of-the-pg_undodump-tool.patch, > I believe there is a bug in the function parse_undo_page. Thanks, I'll take a look if the project happens to continue. Currently it seems that another approach is more likely to be taken: https://www.postgresql.org/message-id/CA%2BTgmoa_VNzG4ZouZyQQ9h%3DoRiy%3DZQV5%2BxHQXxMWmep4Ygg8Dg%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Temporary file access API
Robert Haas wrote: > On Fri, Apr 8, 2022 at 4:54 AM Antonin Houska wrote: > > Do you think that the use of a system call is a problem itself (e.g. because > > the code looks less simple if read/write is used somewhere and fread/fwrite > > elsewhere; of course read/write is mandatory in special cases like WAL, > > heap pages, etc.) or is the problem that the system calls are used too > > frequently? I suppose only the latter. > > I'm not really super-interested in reducing the number of system > calls. It's not a dumb thing in which to be interested and I know that > for example Thomas Munro is very interested in it and has done a bunch > of work in that direction just to improve performance. But for me the > attraction of this is mostly whether it gets us closer to TDE. > > And that's why I'm asking these questions about adopting it in > different places. I kind of thought that your previous patches needed > to encrypt, I don't know, 10 or 20 different kinds of files. So I was > surprised to see this patch touching the handling of only 2 kinds of > files. If we consolidate the handling of let's say 15 of 20 cases into > a single mechanism, we've really moved the needle in the right > direction -- but consolidating the handling of 2 of 20 cases, or > whatever the real numbers are, isn't very exciting. There aren't really that many kinds of files to encrypt: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#List_of_the_files_that_contain_user_data (And pg_stat/* files should be removed from the list.) -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Temporary file access API
Robert Haas wrote: > On Tue, Mar 8, 2022 at 6:12 AM Antonin Houska wrote: > > Thanks for your comments, the initial version is attached here. > > I've been meaning to look at this thread for some time but have not > found enough time to do that until just now. And now I have > questions... > > 1. Supposing we accepted this, how widely do you think that we could > adopt it? I see that this patch set adapts a few places to use it and > that's nice, but I have a feeling there's a lot more places that are > making use of system calls directly, or through some other > abstraction, than just this. I'm not sure that we actually want to use > something like this everywhere, but what would be the rule for > deciding where to use it and where not to use > it? If the plan for this facility is just to adapt these two > particular places to use it, that doesn't feel like enough to be > worthwhile. Admittedly I viewed the problem from the perspective of the TDE, so I haven't spent much time looking for other opportunities. Now, with the stats collector using shared memory, even one of the use cases implemented here no longer exists. I need to do more research. Do you think that the use of a system call is a problem itself (e.g. because the code looks less simple if read/write is used somewhere and fread/fwrite elsewhere; of course read/write is mandatory in special cases like WAL, heap pages, etc.) or is the problem that the system calls are used too frequently? I suppose only the latter. Anyway, I'm not sure there are *many* places where system calls are used too frequently. Instead, the code is usually written such that the information is first assembled in memory and then written to the file at once. So the value of the (buffered) stream is that it makes the code simpler (it eliminates the need to prepare the data in memory). That's what I tried to do for reorderbuffer.c and pgstat.c in my patch.
A related question is whether we should try to replace the libc stream (FILE *) in some places. You seem to suggest that in [1]. One example is snapmgr.c:ExportSnapshot(), if we also implement output formatting. Of course there are places where (FILE *) cannot be replaced because, besides regular files, the code needs to work with stdin/stdout in general. (Parsing of configuration files falls into this category, but that doesn't matter because the bison-generated parser seems to implement buffering anyway.) > 2. What about frontend code? Most frontend code won't examine data > files directly, but at least pg_controldata, pg_checksums, and > pg_resetwal are exceptions. If the frequency of using system calls is the problem, then I wouldn't change these because the ControlFileData structure needs to be initialized in memory anyway and then written at once. And pg_checksums reads whole blocks anyway. I'll take a closer look. > 3. How would we extend this to support encryption? Add an extra > argument to BufFileStreamInit(V)FD passing down the encryption > details? Yes. > There are some smaller things about the patch with which I'm not 100% > comfortable, but I'd like to start by understanding the big picture. Thanks! [1] https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Logical replication row filtering and TOAST
Amit Kapila wrote: > > Antonin Houska wrote: > > > > Nevertheless, a comment in pgoutput_row_filter(), saying that TOASTed values > > are not expected if old_slot is NULL, might be useful. > > > > How about something like the attached? Yes, that'd be sufficient. Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Logical replication row filtering and TOAST
Antonin Houska wrote: > I spent some time thinking about a special case of evaluation of the row > filter and wrote a comment that might be useful (see the attachment). However > now I think that it's not perfect if the code really relies on the fact that > the value of an indexed column cannot be TOASTed due to size restrictions. > > I could hit two different error messages when trying to activate TOAST on an > index column (in this case PG was built with 16kB pages), but still I think > the code is unnecessarily fragile if it relies on such errors: > > > ERROR: index row requires 8224 bytes, maximum size is 8191 > > ERROR: index row size 8048 exceeds btree version 4 maximum 5432 for index > "b_pkey" > DETAIL: Index row references tuple (0,3) in relation "b". > HINT: Values larger than 1/3 of a buffer page cannot be indexed. > > > Note that at least in ExtractReplicaIdentity() we do expect that an indexed > column value can be TOASTed. > > /* >* If the tuple, which by here only contains indexed columns, still has >* toasted columns, force them to be inlined. This is somewhat unlikely >* since there's limits on the size of indexed columns, so we don't >* duplicate toast_flatten_tuple()s functionality in the above loop over >* the indexed columns, even if it would be more efficient. >*/ > if (HeapTupleHasExternal(key_tuple)) > { > HeapTuple oldtup = key_tuple; > > key_tuple = toast_flatten_tuple(oldtup, desc); > heap_freetuple(oldtup); > } > > Am I missing anything? Well, I see now that the point might be that, in heap_update(), "id_has_external" would be true if the indexed value could be TOASTed, so that the (flattened) old tuple would be WAL logged: old_key_tuple = ExtractReplicaIdentity(relation, &oldtup, bms_overlap(modified_attrs, id_attrs) || id_has_external, &old_key_copied); Nevertheless, a comment in pgoutput_row_filter(), saying that TOASTed values are not expected if old_slot is NULL, might be useful. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Logical replication row filtering and TOAST
I spent some time thinking about a special case of evaluation of the row filter and wrote a comment that might be useful (see the attachment). However now I think that it's not perfect if the code really relies on the fact that the value of an indexed column cannot be TOASTed due to size restrictions. I could hit two different error messages when trying to activate TOAST on an index column (in this case PG was built with 16kB pages), but still I think the code is unnecessarily fragile if it relies on such errors:

ERROR: index row requires 8224 bytes, maximum size is 8191

ERROR: index row size 8048 exceeds btree version 4 maximum 5432 for index "b_pkey"
DETAIL: Index row references tuple (0,3) in relation "b".
HINT: Values larger than 1/3 of a buffer page cannot be indexed.

Note that at least in ExtractReplicaIdentity() we do expect that an indexed column value can be TOASTed.

	/*
	 * If the tuple, which by here only contains indexed columns, still has
	 * toasted columns, force them to be inlined. This is somewhat unlikely
	 * since there's limits on the size of indexed columns, so we don't
	 * duplicate toast_flatten_tuple()s functionality in the above loop over
	 * the indexed columns, even if it would be more efficient.
	 */
	if (HeapTupleHasExternal(key_tuple))
	{
		HeapTuple	oldtup = key_tuple;

		key_tuple = toast_flatten_tuple(oldtup, desc);
		heap_freetuple(oldtup);
	}

Am I missing anything? -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 20d0b1e1253..2be5aaa18c4 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1246,6 +1246,15 @@ pgoutput_row_filter(Relation relation, TupleTableSlot *old_slot,
 	 */
 	if (!new_slot || !old_slot)
 	{
+		/*
+		 * For UPDATE, we should only get here if the replica identity is !=
+		 * FULL and thus all the identity columns are index columns, i.e.
+		 * never TOASTed. Therefore, we don't need to care about the unchanged
+		 * toasted replica identity columns like we do below.
+		 */
+		Assert(relation->rd_rel->relreplident != REPLICA_IDENTITY_FULL ||
+			   map_changetype_pubaction[*action] != PUBACTION_UPDATE);
+
 		ecxt->ecxt_scantuple = new_slot ? new_slot : old_slot;
 
 		result = pgoutput_row_filter_exec_expr(filter_exprstate, ecxt);
Re: Temporary file access API
Stephen Frost wrote: > * Antonin Houska (a...@cybertec.at) wrote: > > Here I'm starting a new thread to discuss a topic that's related to the > > Transparent Data Encryption (TDE), but could be useful even without that. > > The > > problem has been addressed somehow in the Cybertec TDE fork, and I can post > > the code here if it helps. However, after reading [1] (and the posts > > upthread), I've got another idea, so let's try to discuss it first. > > Yeah, I tend to agree that it makes sense to discuss the general idea of > doing buffered I/O for the temporary files that are being created today > with things like fwrite and have that be independent of encryption. As > mentioned on that thread, doing our own buffering and having file > accesses done the same way throughout the system is just generally a > good thing and focusing on that will certainly help with any TDE effort > that may or may not happen later. Thanks for your comments, the initial version is attached here. The buffer size is not configurable yet. Besides the new API itself, the patch series shows how it can be used to store logical replication data changes as well as statistics. Further comments are welcome, thanks in advance. -- Antonin Houska Web: https://www.cybertec-postgresql.com temp_file_api_v1.tgz Description: application/gzip
Re: storing an explicit nonce
Stephen Frost wrote: > Perhaps this is all too meta and we need to work through some specific > ideas around just what this would look like. In particular, thinking > about what this API would look like and how it would be used by > reorderbuffer.c, which builds up changes in memory and then does a bare > write() call, seems like a main use-case to consider. The gist there > being "can we come up with an API to do all these things that doesn't > require entirely rewriting ReorderBufferSerializeChange()?" > > Seems like it'd be easier to achieve that by having something that looks > very close to how write() looks, but just happens to have the option to > run the data through a stream cipher and maybe does better error > handling for us. Making that layer also do block-based access to the > files underneath seems like a much larger effort that, sure, may make > some things better too but if we could do that with the same API then it > could also be done later if someone's interested in that. My initial proposal is in this new thread: https://www.postgresql.org/message-id/4987.1644323098%40antos -- Antonin Houska Web: https://www.cybertec-postgresql.com
Temporary file access API
Here I'm starting a new thread to discuss a topic that's related to Transparent Data Encryption (TDE), but could be useful even without it. The problem has been addressed somehow in the Cybertec TDE fork, and I can post the code here if it helps. However, after reading [1] (and the posts upthread), I've got another idea, so let's try to discuss it first. It makes sense to me if we first implement the buffering (i.e. writing/reading a certain amount of data at a time) and make the related functions aware of encryption later: as long as we use a block cipher, we also need to read/write (suitably sized) chunks rather than individual bytes (or arbitrary amounts of data). (In theory, someone might need encryption but reject buffering, but I'm not sure if this is a realistic use case.) For the buffering, I imagine a "file stream" object that the user creates on top of a file descriptor, such as

FileStream *FileStreamCreate(File file, int buffer_size)

or

FileStream *FileStreamCreateFD(int fd, int buffer_size)

and uses functions like

int FileStreamWrite(FileStream *stream, char *buffer, int amount)

and

int FileStreamRead(FileStream *stream, char *buffer, int amount)

to write and read data respectively. Besides functions to close the streams explicitly (e.g. FileStreamClose() / FileStreamFDClose()), we'd need to ensure automatic closing wherever the underlying file is closed automatically. For example, if OpenTemporaryFile() was used to obtain the file descriptor, the user expects that the file will be closed and deleted at the transaction boundary, so the corresponding stream should be freed automatically as well. To avoid code duplication, buffile.c should use these streams internally as well, as it also performs buffering. (Here we'd also need functions to change the reading/writing position.) Once we implement the encryption, we might need to add an argument to the FileStreamCreate...() functions that helps to generate a unique IV, but the ...Read() / ...Write() functions would stay intact.
And possibly one more argument to specify the kind of cipher, in case we support more than one. I think that's enough to start the discussion. Thanks for feedback in advance. [1] https://www.postgresql.org/message-id/CA%2BTgmoYGjN_f%3DFCErX49bzjhNG%2BGoctY%2Ba%2BXhNRWCVvDY8U74w%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
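As a rough illustration of the proposed stream object (a minimal user-space sketch, not the actual implementation: only the FileStreamCreateFD() variant is shown, the buffer is used for writes only, and error handling plus the automatic-close machinery are omitted):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct FileStream
{
	int			fd;				/* underlying file descriptor */
	char	   *buffer;			/* write buffer */
	int			buffer_size;	/* capacity of the buffer */
	int			nused;			/* bytes currently buffered */
} FileStream;

static FileStream *
FileStreamCreateFD(int fd, int buffer_size)
{
	FileStream *stream = malloc(sizeof(FileStream));

	stream->fd = fd;
	stream->buffer = malloc(buffer_size);
	stream->buffer_size = buffer_size;
	stream->nused = 0;
	return stream;
}

/* Flush buffered data; this is where encryption could later hook in. */
static void
FileStreamFlush(FileStream *stream)
{
	if (stream->nused > 0)
	{
		write(stream->fd, stream->buffer, stream->nused);
		stream->nused = 0;
	}
}

static int
FileStreamWrite(FileStream *stream, char *buffer, int amount)
{
	int			remaining = amount;

	while (remaining > 0)
	{
		int			space = stream->buffer_size - stream->nused;
		int			chunk = remaining < space ? remaining : space;

		memcpy(stream->buffer + stream->nused, buffer, chunk);
		stream->nused += chunk;
		buffer += chunk;
		remaining -= chunk;

		if (stream->nused == stream->buffer_size)
			FileStreamFlush(stream);	/* buffer full, write it out */
	}
	return amount;
}

static void
FileStreamClose(FileStream *stream)
{
	FileStreamFlush(stream);
	free(stream->buffer);
	free(stream);
}
```

The point of the design is that callers only ever see FileStreamWrite(); whether the flush path writes plain or encrypted chunks (and how large those chunks are) stays internal to the stream.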
Re: Doc: CREATE_REPLICATION_SLOT command requires the plugin name
Amit Kapila wrote: > On Tue, Feb 1, 2022 at 5:43 PM Antonin Houska wrote: > > > > Amit Kapila wrote: > > > > > On Tue, Feb 1, 2022 at 3:44 PM Antonin Houska wrote: > > > > > > > > I got a syntax error when using the command according to the existing > > > > documentation. The output_plugin parameter needs to be passed too. > > > > > > > > > > Why do we need it for physical slots? > > > > Sure we don't, the missing curly brackets seem to be the problem. I used the > > other form of the command for reference, which therefore might need a minor > > fix too. > > > > Instead of adding additional '{}', can't we simply use: > { PHYSICAL [ RESERVE_WAL ] | LOGICAL output_plugin } [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT | TWO_PHASE ] Do you mean changing

CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL [ RESERVE_WAL ] | LOGICAL output_plugin [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT | TWO_PHASE ] }

to

CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL [ RESERVE_WAL ] | LOGICAL output_plugin } [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT | TWO_PHASE ]

? I'm not sure; IMHO that would still allow for commands like

CREATE_REPLICATION_SLOT slot_name PHYSICAL output_plugin

-- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Doc: CREATE_REPLICATION_SLOT command requires the plugin name
Amit Kapila wrote: > On Tue, Feb 1, 2022 at 3:44 PM Antonin Houska wrote: > > > > I got a syntax error when using the command according to the existing > > documentation. The output_plugin parameter needs to be passed too. > > > > Why do we need it for physical slots? Sure we don't, the missing curly brackets seem to be the problem. I used the other form of the command for reference, which therefore might need a minor fix too. -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 24e93f9b28..49fae6c669 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1942,7 +1942,7 @@ The commands accepted in replication mode are:
- CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL | LOGICAL } [ ( option [, ...] ) ]
+ CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL | { LOGICAL output_plugin } } [ ( option [, ...] ) ]
 CREATE_REPLICATION_SLOT
@@ -2085,7 +2085,8 @@ The commands accepted in replication mode are:
-CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL [ RESERVE_WAL ] | LOGICAL output_plugin [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT | TWO_PHASE ] }
+CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL [ RESERVE_WAL ] | { LOGICAL output_plugin
+} [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT | TWO_PHASE ] }
Doc: CREATE_REPLICATION_SLOT command requires the plugin name
I got a syntax error when using the command according to the existing documentation. The output_plugin parameter needs to be passed too. -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 24e93f9b28..6065c18fdf 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1942,7 +1942,7 @@ The commands accepted in replication mode are:
- CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL | LOGICAL } [ ( option [, ...] ) ]
+ CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL | LOGICAL output_plugin } [ ( option [, ...] ) ]
 CREATE_REPLICATION_SLOT
Re: XTS cipher mode for cluster file encryption
Bruce Momjian wrote: > On Mon, Nov 29, 2021 at 08:37:31AM +0100, Antonin Houska wrote: > > The changes to buffile.c are not trivial, but we haven't really changed the > > API, assuming you mean BufFileCreateTemp(), BufFileWrite(), BufFileRead(). > > > > What our patch changes on the caller side is that BufFileOpenTransient(), > > BufFileCloseTransient(), BufFileWriteTransient() and BufFileReadTransient() > > replace OpenTransientFile(), CloseTransientFile(), write()/fwrite() and > > read()/fread() respectively in reorderbuffer.c and in pgstat.c. These > > changes become a little bit less invasive in TDE 1.1 than they were in 1.0, > > see [1], see the diffs attached. > > With pg_upgrade modified to preserve the relfilenode, tablespace oid, and > database oid, we are now closer to implementing cluster file encryption > using XTS. I think we have a few steps left: > > 1. modify temporary file I/O to use a more centralized API > 2. modify the existing cluster file encryption patch to use XTS with an > IV that uses more than the LSN > 3. add XTS regression test code like CTR > 4. create WAL encryption code using CTR > > If we can do #1 in PG 15 I think I can have #2 ready for PG 16 in July. > The feature wiki page is: > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption > > Do people want to advance this feature forward? I confirm that we (Cybertec) do and that we're ready to spend more time on the community implementation. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: [PATCH] buffile: ensure start offset is aligned with BLCKSZ
Sasasu wrote: > Hi hackers, > > there has been a very long discussion about TDE, and we agreed that if the > temporary file I/O can be aligned to some fixed size, it will be easier > to use some kind of encryption algorithm. > > discussion: > https://www.postgresql.org/message-id/20211025155814.GD20998%40tamriel.snowman.net > > This patch adjusts file->curOffset and file->pos before the real I/O to > ensure the start offset is aligned. Does this patch really pass the regression tests? In BufFileRead(), I would understand if you did

+ file->pos = offsetInBlock;
+ file->curOffset -= offsetInBlock;

rather than

+ file->pos += offsetInBlock;
+ file->curOffset -= offsetInBlock;

Anyway, BufFileDumpBuffer() does not seem to enforce curOffset to end up at a block boundary, not to mention BufFileSeek(). When I was implementing this for our fork [1], I concluded that the encryption code path is too specific, so I left the existing code for the unencrypted data and added separate functions for the encrypted data. One specific thing is that if you encrypt and write n bytes, but only need part of it later, you need to read and decrypt exactly those n bytes anyway, otherwise the decryption won't work. So I decided not only to keep curOffset at a BLCKSZ boundary, but also to read / write BLCKSZ bytes at a time. This also makes sense if the scope of the initialization vector (IV) is BLCKSZ bytes. Another problem is that you might want to store the IV somewhere in between the data. In short, the encryption makes the buffered I/O rather different and the specific code should be kept aside, although the same API is used to invoke it. -- Antonin Houska Web: https://www.cybertec-postgresql.com [1] https://github.com/cybertec-postgresql/postgres/tree/PG_14_TDE_1_1
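To make the alignment arithmetic under discussion concrete, here is a tiny standalone sketch (the helper names block_start, offset_in_block and aligned_range are made up for illustration; buffile.c itself tracks the position as curOffset plus pos rather than through helpers like these):

```c
#include <assert.h>

#define BLCKSZ 8192				/* the usual PostgreSQL block size */

/* Start of the block that contains the given file offset. */
static long
block_start(long offset)
{
	return offset - (offset % BLCKSZ);
}

/* Position of the offset within its block. */
static int
offset_in_block(long offset)
{
	return (int) (offset % BLCKSZ);
}

/*
 * Block-aligned range [*start, *end) that covers [offset, offset + len).
 * If the data was encrypted block-wise, this whole range has to be read
 * and decrypted even when the caller only needs part of it.
 */
static void
aligned_range(long offset, long len, long *start, long *end)
{
	*start = block_start(offset);
	*end = block_start(offset + len - 1) + BLCKSZ;
}
```

This is why reading and writing whole BLCKSZ-sized chunks at a time simplifies the encrypted code path: the aligned range then always coincides with what was actually written.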
Re: XTS cipher mode for cluster file encryption
Stephen Frost wrote: > Greetings, > > * Bruce Momjian (br...@momjian.us) wrote: > > On Tue, Oct 19, 2021 at 02:54:56PM -0400, Stephen Frost wrote: > > > * Sasasu (i...@sasa.su) wrote: > > > > A unified block-based I/O API sounds great. Has anyone tried to do this > > > > before? It would be nice if the front-end tools could also use these > > > > APIs. > > > > > > The TDE patch from Cybertec did go down this route, but the API ended up > > > being rather different which meant a lot of changes in other parts of > > > the system. If we can get a block-based temporary file method that > > > maintains more-or-less the same API, that'd be great, but I'm not sure > > > that we can really do so and I am not entirely convinced that we should > > > make the TDE effort depend on an otherwise quite independent effort of > > > making all temp files usage be block based. > > > > Uh, I thought people felt the Cybertec patch was too large and that a > > unified API for temporary file I/O-encryption was a requirement. Would > > a CTR-streaming-encryption API for temporary tables be easier to > > implement? > > Having a unified API for temporary file I/O (that could then be extended > to provide encryption) would definitely help with reducing the size of a > TDE patch. The approach used in the Cybertec patch was to make > temporary file access block based, but the way that was implemented was > with an API different from pread/pwrite and that meant changing pretty > much all of the call sites for temporary file access, which naturally > resulted in changes in a lot of otherwise unrelated code. The changes to buffile.c are not trivial, but we haven't really changed the API, assuming you mean BufFileCreateTemp(), BufFileWrite(), BufFileRead().
What our patch affects on the caller side is that BufFileOpenTransient(), BufFileCloseTransient(), BufFileWriteTransient() and BufFileReadTransient() replace OpenTransientFile(), CloseTransientFile(), write()/fwrite() and read()/fread() respectively in reorderbuffer.c and in pgstat.c. These changes become a little bit less invasive in TDE 1.1 than they were in 1.0, see [1], see the diffs attached. (I expect that [2] will get committed someday so that the TDE feature won't affect pgstat.c in the future at all.) -- Antonin Houska Web: https://www.cybertec-postgresql.com [1] https://github.com/cybertec-postgresql/postgres/tree/PG_14_TDE_1_1 [2] https://commitfest.postgresql.org/34/1708/ diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index d474ea1d0a..9cb0708e41 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -99,6 +99,7 @@ #include "replication/slot.h" #include "replication/snapbuild.h" /* just for SnapBuildSnapDecRefcount */ #include "storage/bufmgr.h" +#include "storage/buffile.h" #include "storage/fd.h" #include "storage/sinval.h" #include "utils/builtins.h" @@ -131,21 +132,13 @@ typedef struct ReorderBufferTupleCidEnt CommandId combocid; /* just for debugging */ } ReorderBufferTupleCidEnt; -/* Virtual file descriptor with file offset tracking */ -typedef struct TXNEntryFile -{ - File vfd; /* -1 when the file is closed */ - off_t curOffset; /* offset for next write or read. Reset to 0 - * when vfd is opened. 
*/ -} TXNEntryFile; - /* k-way in-order change iteration support structures */ typedef struct ReorderBufferIterTXNEntry { XLogRecPtr lsn; ReorderBufferChange *change; ReorderBufferTXN *txn; - TXNEntryFile file; + TransientBufFile *file; XLogSegNo segno; } ReorderBufferIterTXNEntry; @@ -243,14 +236,22 @@ static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMe * Disk serialization support functions * --- */ +static void ReorderBufferTweakBase(ReorderBufferTXN *txn, + char tweak_base[TWEAK_BASE_SIZE]); static void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb); static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn); static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, - int fd, ReorderBufferChange *change); + TransientBufFile *file, ReorderBufferChange *change); +static void ReorderBufferWriteData(TransientBufFile *file, void *ptr, size_t size, + ReorderBufferTXN *txn); static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, - TXNEntryFile *file, XLogSegNo *segno); + TransientBufFile **file, XLogSegNo *segno); static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, - char *change); + T
Re: Reuse of State value in Aggregates
gg pw wrote: > Do aggregate functions/queries reuse the state value they generate for > subsequent queries? If not, then an explanation would be greatly appreciated. Subsequent queries are not supposed to feed the same input into the aggregates, so their aggregate states will most likely be different. Perhaps you could post an example where you think such reuse would be useful. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Michail Nikolaev wrote: > > I understand that the RR snapshot is used to check the MVCC behaviour, > > however > > this comment seems to indicate that the RR snapshot should also prevent the > > standby from setting the hint bits. > > # Make sure previous queries not set the hints on standby because > > # of RR snapshot > > I can imagine that on the primary, but I don't think that the backend that > > checks visibility on standby does check other snapshots/backends. And it > > didn't work when I ran the test manually, although I could have missed > > something. > > Yes, it checks - you could see ComputeXidHorizons for details. It is > the main part of the correctness of the whole feature. I added some > details about it to the test. Ah, ok. I thought that only KnownAssignedXids is used on standby, but that would ignore the RR snapshot. It wasn't clear to me when the xmin of the hot-standby backends is set; now I think it's done by GetSnapshotData(). > > * I can see no test for the INDEX_LP_DEAD_OK_MIN_LSN value of the > > IndexLpDeadAllowedResult enumeration. Shouldn't there be only two values, > > e.g. INDEX_LP_DEAD_OK and INDEX_LP_DEAD_MAYBE_OK ? Or a boolean variable > > (in > > index_fetch_heap()) of the appropriate name, e.g. kill_maybe_allowed, and > > rename the function is_index_lp_dead_allowed() to > > is_index_lp_dead_maybe_allowed()? > > Yes, this way it looks better. Done. Also, I have added some checks > for “maybe” LSN-related logic to the test. Attached is a proposal for a minor addition that would make sense to me; add it if you think it's appropriate.
I think I've said enough, changing the status to "ready for committer" :-) -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/src/test/recovery/t/027_standby_index_lp_dead.pl b/src/test/recovery/t/027_standby_index_lp_dead.pl
index 5a23eea052..e65000b6bd 100644
--- a/src/test/recovery/t/027_standby_index_lp_dead.pl
+++ b/src/test/recovery/t/027_standby_index_lp_dead.pl
@@ -7,7 +7,7 @@ use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
 
-plan tests => 29;
+plan tests => 30;
 
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
@@ -246,6 +246,7 @@ is(btp_safe_on_stanby($node_new_standby), qq(0), 'hint from FPI');
 
 # Make sure bits are set only if minRecoveryPoint > than index page LSN
 try_to_set_hint_bits($node_new_standby);
+is(hints_num($node_new_standby), qq(11), 'no new index hint bits are set on new standby');
 is(btp_safe_on_stanby($node_new_standby), qq(0), 'hint not marked as standby-safe');
 
 # Issue checkpoint on primary to update minRecoveryPoint on standby
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Michail Nikolaev wrote: > > * Is the purpose of the repeatable read (RR) snapshot to test that > > heap_hot_search_buffer() does not set deadness->all_dead if some > > transaction > > can still see a tuple of the chain? > > The main purpose is to test xactStartedInRecovery logic after the promotion. > For example - > > if (scan->xactStartedInRecovery && !RecoveryInProgress()) I understand that the RR snapshot is used to check the MVCC behaviour, however this comment seems to indicate that the RR snapshot should also prevent the standby from setting the hint bits. # Make sure previous queries not set the hints on standby because # of RR snapshot I can imagine that on the primary, but I don't think that the backend that checks visibility on standby checks other snapshots/backends. And it didn't work when I ran the test manually, although I could have missed something. A few more notes regarding the tests: * 026_standby_index_lp_dead.pl should probably be renamed to 027_standby_index_lp_dead.pl (026_* was created in the master branch recently) * The test fails, although I do have configured the build with --enable-tap-tests. BEGIN failed--compilation aborted at t/026_standby_index_lp_dead.pl line 5. t/026_standby_index_lp_dead.pl .. Dubious, test returned 2 (wstat 512, 0x200) I suspect the testing infrastructure changed recently. * Messages like this is(hints_num($node_standby_1), qq(10), 'hints are set on standby1 because FPI but marked as non-safe'); say that the hints are "marked as non-safe", but the hints_num() function does not seem to check that. 
* wording: is(hints_num($node_standby_2), qq(10), 'index hint bits already set on second standby 2'); -> is(hints_num($node_standby_2), qq(10), 'index hint bits already set on standby 2'); And a few more notes on the code: * There's an extra semicolon in mask_lp_dead(): bufmask.c:148:38: warning: for loop has empty body [-Wempty-body] offnum = OffsetNumberNext(offnum)); ^ bufmask.c:148:38: note: put the semicolon on a separate line to silence this warning * the header comment of heap_hot_search_buffer() still says "*all_dead" whereas I'd expect "->all_dead". The same for "*page_lsn". * I can see no test for the INDEX_LP_DEAD_OK_MIN_LSN value of the IndexLpDeadAllowedResult enumeration. Shouldn't there be only two values, e.g. INDEX_LP_DEAD_OK and INDEX_LP_DEAD_MAYBE_OK? Or a boolean variable (in index_fetch_heap()) of the appropriate name, e.g. kill_maybe_allowed, and rename the function is_index_lp_dead_allowed() to is_index_lp_dead_maybe_allowed()? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: storing an explicit nonce
Sasasu wrote: > On 2021/10/6 23:01, Robert Haas wrote: > > This seems wrong to me. CTR requires that you not reuse the IV. If you > > re-encrypt the page with a different IV, torn pages are a problem. If > > you re-encrypt it with the same IV, then it's not secure any more. > for CBC if the IV is predictable it will cause a "dictionary attack". The following sounds like IV *uniqueness* is needed to defend against "known plaintext attack" ... > and for CBC and GCM reusing the IV will cause a "known plaintext attack". ... but here you seem to say that *randomness* is also necessary: > XTS works like CBC but adds a tweak step. the tweak step does not add > randomness. It means XTS still has "known plaintext attack", (I suppose you mean "XTS with incorrect (e.g. non-random) IV", rather than XTS as such.) > due to the same reason from CBC. According to Appendix C of https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf CBC requires *unpredictability* of the IV, but that does not necessarily mean randomness: the unpredictable IV can be obtained by applying the forward cipher function to a unique value. Can you please try to explain once again what you consider a requirement (uniqueness, randomness, etc.) on the IV for the XTS mode? Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: storing an explicit nonce
This reminds me of Joe Conway's response to my email earlier: https://www.postgresql.org/message-id/50335f56-041b-1a1f-59ea-b5f7bf917352%40joeconway.com In the document he recommended https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf specifically, in Appendix C I read: """ For the CBC and CFB modes, the IVs must be unpredictable. In particular, for any given plaintext, it must not be possible to predict the IV that will be associated to the plaintext in advance of the generation of the IV. There are two recommended methods for generating unpredictable IVs. The first method is to apply the forward cipher function, under the same key that is used for the encryption of the plaintext, to a nonce. The nonce must be a data block that is unique to each execution of the encryption operation. For example, the nonce may be a counter, as described in Appendix B, or a message number. The second method is to generate a random data block using a FIPS-approved random number generator. """ This is about modes that include CBC, while the document you refer to seems to deal with some other modes. So if we want to be confident that we use the XTS mode correctly, more research is probably needed. > Now, as I mentioned before, that particular case isn't something that > XTS is particularly good at and that's generally accepted, yet lots of > folks use XTS anyway because the concern isn't "someone has root access > on the box and is watching all block writes" but rather "laptop was > stolen" where the attacker doesn't get to see multiple writes where the > same key+tweak has been used, and the latter is really the attack vector > we're looking to address with XTS too. I've heard a few times that a database running in a cloud is also a valid use case for the TDE. In that case I think it should be expected that "someone has root access on the box and is watching all block writes". -- Antonin Houska Web: https://www.cybertec-postgresql.com
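The first of the two recommended methods quoted above - applying the forward cipher function to a unique nonce such as a counter - can be sketched as follows. This is a toy illustration only: hashlib stands in for the real block cipher E_K so the snippet stays self-contained, and the key and counter nonce are made up for the example; a real implementation must use the actual block cipher under the encryption key.

```python
import hashlib
from itertools import count

BLOCK = 16  # AES block size in bytes

def forward_cipher(key: bytes, block: bytes) -> bytes:
    # Toy stand-in for one application of the forward cipher function E_K
    # (e.g. a single AES block encryption); NOT a real cipher.
    return hashlib.sha256(key + block).digest()[:BLOCK]

def iv_generator(key: bytes):
    # Appendix C, method 1: the nonce is unique per encryption (a counter
    # here); passing it through E_K makes the resulting IV unpredictable
    # to anyone who does not know the key.
    for n in count():
        yield forward_cipher(key, n.to_bytes(BLOCK, "big"))

gen = iv_generator(b"k" * 16)
ivs = [next(gen) for _ in range(1000)]
# Every IV is block-sized and, because the nonces are unique, distinct.
```

The counter itself is predictable; the unpredictability comes entirely from the keyed forward cipher applied on top of it.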
Re: storing an explicit nonce
Bruce Momjian wrote: > On Tue, Oct 5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote: > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian wrote: > > > We are still working on our TDE patch. Right now the focus is on > > > refactoring > > > temporary file access to make the TDE patch itself smaller. Reconsidering > > > encryption mode choices given concerns expressed is next. Currently a > > > viable > > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an > > > issue with predictable IV and isn't totally broken in case of IV reuse. > > > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous > > 16-byte blocks affect later blocks, meaning that hint bit changes would > > also affect later blocks. I think this means we would need to write WAL > > full page images for hint bit changes to avoid torn pages. Right now > > hint bit (single bit) changes can be lost without causing torn pages. > > This was another of the advantages of using a stream cipher like CTR. > > The above text isn't very clear. What I am saying is that currently > torn pages can be tolerated by hint bit writes because only a single > byte is changing. If we use a block cipher like AES-XTS, later 16-byte > encrypted blocks would be changed by hint bit changes, meaning torn > pages could not be tolerated. This means we would have to use full page > writes for hint bit changes, perhaps making this feature have > unacceptable performance overhead. IIRC, in the XTS scheme, a change of a single byte in the 16-byte block causes the whole encrypted block to be different after the next encryption, however the following blocks are not affected. CBC (cipher-block chaining) is the mode where the change in one block does affect the encryption of the following block. I'm not sure if this fact is important from the hint bit perspective though. 
It would be an important difference if there was a guarantee that the 16-byte blocks are consistent even on a torn page - does e.g. proper alignment of pages guarantee that? Nevertheless, the absence of chaining may be a reason to prefer XTS to CBC anyway. -- Antonin Houska Web: https://www.cybertec-postgresql.com
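To make the chaining difference discussed above concrete, here is a toy sketch. Nothing in it is real AES: a hash-based stand-in plays the block cipher, and the key, IV and tweak derivation are invented for the example. It shows that flipping one bit (a "hint bit") in the second 16-byte block changes every subsequent CBC ciphertext block, while a per-block-tweak (XTS-like) mode changes only the block that was touched.

```python
import hashlib

BLOCK = 16

def prf(key: bytes, data: bytes) -> bytes:
    # Toy stand-in for one block-cipher application (NOT a real cipher).
    return hashlib.sha256(key + data).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(key: bytes, iv: bytes, pt: bytes) -> list:
    out, prev = [], iv
    for i in range(0, len(pt), BLOCK):
        c = prf(key, xor(pt[i:i + BLOCK], prev))  # chaining step
        out.append(c)
        prev = c
    return out

def tweaked_encrypt(key: bytes, pt: bytes) -> list:
    # XTS-like sketch: each block mixed with its own tweak, no chaining.
    out = []
    for n, i in enumerate(range(0, len(pt), BLOCK)):
        tweak = prf(key, n.to_bytes(8, "big"))
        out.append(xor(prf(key, xor(pt[i:i + BLOCK], tweak)), tweak))
    return out

key, iv = b"k" * BLOCK, b"\x00" * BLOCK
pt1 = bytes(4 * BLOCK)          # four zero blocks
pt2 = bytearray(pt1)
pt2[BLOCK + 1] ^= 0x01          # flip one bit inside block 1
pt2 = bytes(pt2)

cbc_diff = [a != b for a, b in zip(cbc_encrypt(key, iv, pt1), cbc_encrypt(key, iv, pt2))]
xts_diff = [a != b for a, b in zip(tweaked_encrypt(key, pt1), tweaked_encrypt(key, pt2))]
print(cbc_diff)   # chaining propagates: [False, True, True, True]
print(xts_diff)   # only the touched block: [False, True, False, False]
```

In other words: with chaining, a torn page could leave any block after the modified one inconsistent; without chaining, only the modified block is at risk.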
Re: storing an explicit nonce
Robert Haas wrote: > On Tue, Oct 5, 2021 at 1:24 PM Robert Haas wrote: > > On Mon, Oct 4, 2021 at 10:00 PM Stephen Frost wrote: > > > I do want to point out, as I think I did when we discussed this but want > > > to be sure it's also captured here- I don't think that temporary file > > > access should be forced to be block-oriented when it's naturally (in > > > very many cases) sequential. To that point, I'm thinking that we need a > > > temp file access API through which various systems work that's > > > sequential and therefore relatively similar to the existing glibc, et > > > al, APIs, but by going through our own internal API (which more > > > consistently works with the glibc APIs and provides better error > > > reporting in the event of issues, etc) we can then extend it to work as > > > an encrypted stream instead. > > > > Regarding this, would it use block-oriented access on the backend? > > > > I agree that we need a better API layer through which all filesystem > > access is routed. One of the notable weaknesses of the Cybertec patch > > is that it has too large a code footprint, > > (sent too soon) > > ...precisely because PostgreSQL doesn't have such a layer. I'm just trying to make our changes to buffile.c less invasive. Or do you mean that this module should be reworked regardless of the encryption? > But I think ultimately we do want to encrypt and decrypt in blocks, so > if we create such a layer, it should expose byte-oriented APIs but > combine the actual I/Os somehow. That's also good for cutting down the > number of system calls, which is a benefit unto itself. -- Antonin Houska Web: https://www.cybertec-postgresql.com
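A minimal sketch of the layer Robert describes - byte-oriented writes on top, block-oriented physical I/O underneath - might look like this. All names and the 8 kB block size are hypothetical; this is not the buffile.c API, just an illustration of how small sequential writes can be combined into block-sized I/Os (where per-block encryption could be applied).

```python
import io

BLOCK_SIZE = 8192  # hypothetical, mirroring PostgreSQL's default BLCKSZ

class BlockBufferedWriter:
    """Accepts writes of any size, but touches the file only in full blocks."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.buf = bytearray()
        self.physical_writes = 0   # stands in for the system-call count

    def write(self, data: bytes):
        self.buf.extend(data)
        while len(self.buf) >= BLOCK_SIZE:
            self._write_block(bytes(self.buf[:BLOCK_SIZE]))
            del self.buf[:BLOCK_SIZE]

    def _write_block(self, block: bytes):
        # A per-block cipher call would go here, before the physical write.
        self.fileobj.write(block)
        self.physical_writes += 1

    def close(self):
        if self.buf:                       # flush the final partial block
            self._write_block(bytes(self.buf))
            self.buf.clear()

sink = io.BytesIO()
w = BlockBufferedWriter(sink)
for _ in range(1000):
    w.write(b"x" * 100)    # 1000 small byte-oriented writes
w.close()
print(w.physical_writes)   # 13: twelve full 8 kB blocks plus one partial block
```

The caller keeps the convenient sequential interface, while the number of physical writes drops from 1000 to 13, which is the "combine the actual I/Os" benefit.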
Re: POC: Cleaning up orphaned files using undo logs
Thomas Munro wrote: > On Wed, Sep 29, 2021 at 8:18 AM Antonin Houska wrote: > > I'm just trying to use the existing infrastructure: the effect of DROP TABLE > > also appears to be performed by the checkpointer. However I don't know why > > the > > unlinks need to be performed by the checkpointer. > > For DROP TABLE, we leave an empty file (I've been calling it a > "tombstone file") so that GetNewRelFileNode() won't let you reuse the > same relfilenode in the same checkpoint cycle. One reason is that > wal_level=minimal has a data-eating crash recovery failure mode if you > reuse a relfilenode in a checkpoint cycle. Interesting. Is the problem that REDO of the DROP TABLE command deletes the relfilenode which already contains the new data, but the new data cannot be recovered because (due to wal_level=minimal) it's not present in WAL? In this case I suppose that the checkpoint just ensures that the DROP TABLE won't be replayed during the next crash recovery. BTW, does the attached comment fix make sense to you? The corresponding code in InitSync() is /* * Create pending-operations hashtable if we need it. Currently, we need * it if we are standalone (not under a postmaster) or if we are a * checkpointer auxiliary process. */ if (!IsUnderPostmaster || AmCheckpointerProcess()) I suspect this is also related to the commit 7ff23c6d27. Thanks for your answer, I was considering adding you to CC :-) -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index 1c78581354..ae6c5ff8e4 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -563,7 +563,7 @@ RegisterSyncRequest(const FileTag *ftag, SyncRequestType type, if (pendingOps != NULL) { - /* standalone backend or startup process: fsync state is local */ + /* standalone backend or checkpointer process: fsync state is local */ RememberSyncRequest(ftag, type); return true; }
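The tombstone mechanism Thomas describes can be sketched like this. This is purely illustrative Python; the directory layout, numbers and function name are made up (the real logic lives in GetNewRelFileNode()): a dropped relation leaves a zero-length file behind, so the allocator refuses to hand out a relfilenode whose file still exists.

```python
import os
import tempfile

def allocate_relfilenode(directory: str, start: int) -> int:
    # Refuse any number whose file still exists on disk -- even an empty
    # tombstone left behind by DROP TABLE earlier in the checkpoint cycle.
    n = start
    while os.path.exists(os.path.join(directory, str(n))):
        n += 1
    return n

d = tempfile.mkdtemp()
open(os.path.join(d, "16384"), "w").close()  # tombstone of a dropped table
n = allocate_relfilenode(d, 16384)
print(n)  # 16385: the zero-length tombstone blocks reuse of 16384
```

The next checkpoint can then remove the tombstones, because after the checkpoint a crash recovery can no longer replay the DROP against a reused relfilenode.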
Re: POC: Cleaning up orphaned files using undo logs
Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Tue, Sep 21, 2021 at 10:07:55AM +0200, Dmitry Dolgov wrote: > > On Tue, 21 Sep 2021 09:00 Antonin Houska, wrote: > > > > > Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > > > > > Yep, makes sense, thanks. I have few more questions: > > > > > > > > * The use case with orphaned files is working somewhat differently after > > > > the rebase on the latest master, do you observe it as well? The > > > > difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up > > > > an orphaned relation file immediately (only later on checkpoint) > > > > because of empty pendingUnlinks. I haven't investigated more yet, but > > > > seems like after this commit: > > > > > > > > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 > > > > Author: Thomas Munro > > > > Date: Mon Aug 2 17:32:20 2021 +1200 > > > > > > > > Run checkpointer and bgwriter in crash recovery. > > > > > > > > Start up the checkpointer and bgwriter during crash recovery > > > (except in > > > > --single mode), as we do for replication. This wasn't done back > > > in > > > > commit cdd46c76 out of caution. Now it seems like a better idea > > > to make > > > > the environment as similar as possible in both cases. There may > > > also be > > > > some performance advantages. > > > > > > > > something has to be updated (pendingOps are empty right now, so no > > > > unlink request is remembered). > > > > > > I haven't been debugging that part recently, but yes, this commit is > > > relevant, > > > thanks for pointing that out! Attached is a patch that should fix it. I'll > > > include it in the next version of the patch series, unless you tell me > > > that > > > something is still wrong. > > > > > > > Sure, but I can take a look only in a couple of days. > > Thanks for the patch. > > Hm, maybe there is some misunderstanding. My question above was about > the changed behaviour, when orphaned files (e.g. 
created relation files > after the backend was killed) are removed only by checkpointer when it > kicks in. As far as I understand, the original intention was to do this > job right away, that's why SyncPre/PostCheckpoint was invoked. But the > recent changes around checkpointer make the current implementation > insufficient. > The patch you've proposed removes invocation of SyncPre/PostCheckpoint, > do I see it correctly? In this sense it doesn't change anything, except > removing non-functioning code of course. Yes, it sounds like a misunderstanding. I thought you were complaining about code which is no longer needed. The original intention was to make sure that the files are eventually unlinked. IIRC before the commit 7ff23c6d27 the calls SyncPre/PostCheckpoint were necessary because the checkpointer wasn't running that early during the startup. Without these calls the startup process would exit without doing anything. Sorry, I see now that the comment incorrectly says "... it seems simpler ...", but in fact it was necessary. > But the question, probably > reformulated from the more design point of view, stays the same — when > and by which process such orphaned files have to be removed? I've > assumed by removing right away the previous version was trying to avoid > any kind of thunder effects of removing too many at once, but maybe I'm > mistaken here. I'm just trying to use the existing infrastructure: the effect of DROP TABLE also appears to be performed by the checkpointer. However I don't know why the unlinks need to be performed by the checkpointer. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
Amit Kapila wrote: > On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska wrote: > > > > Amit Kapila wrote: > > > > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska wrote: > > > > > > The problem of the temporary undo log is that it's loaded into local > > > > buffers > > > > and that backend can exit w/o flushing local buffers to disk, and thus > > > > we are > > > > not guaranteed to find enough information when trying to discard the > > > > undo log > > > > the backend wrote. I'm thinking about the following solutions: > > > > > > > > 1. Let the backend manage temporary undo log on its own (even the slot > > > >metadata would stay outside the shared memory, and in particular the > > > >insertion pointer could start from 1 for each session) and remove the > > > >segment files at the same moment the temporary relations are removed. > > > > > > > >However, by moving the temporary undo slots away from the shared > > > > memory, > > > >computation of oldestFullXidHavingUndo (see the PROC_HDR structure) > > > > would > > > >be affected. It might seem that a transaction which only writes undo > > > > log > > > >for temporary relations does not need to affect > > > > oldestFullXidHavingUndo, > > > >but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > > > >prevents transactions to be truncated from the CLOG too early, I > > > > wonder if > > > >the following is possible (This scenario is only applicable to the > > > > zheap > > > >storage engine [1], which is not included in this patch, but should > > > > already > > > >be considered.): > > > > > > > >A transaction creates a temporary table, does some (many) changes > > > > and then > > > >gets rolled back. The undo records are being applied and it takes > > > > some > > > >time. Since XID of the transaction did not affect > > > > oldestFullXidHavingUndo, > > > >the XID can disappear from the CLOG due to truncation. 
> > > > > > > > > By above do you mean to say that in zheap code, we don't consider XIDs > > > that operate on temp table/undo for oldestFullXidHavingUndo? > > > > I was referring to the code > > > > /* We can't process temporary undo logs. */ > > if (log->meta.persistence == UNDO_TEMP) > > continue; > > > > in undodiscard.c:UndoDiscard(). > > > > Here, I think it will just skip undo of temporary undo logs and > oldestFullXidHavingUndo should be advanced after skipping it. Right, it'll be advanced, but the transaction XID (if the transaction wrote only to temporary relations) might still be needed. > > > > > > > However zundo.c in > > > >[1] indicates that the transaction status *is* checked during undo > > > >execution, so we might have a problem. > > > > > > > > > > It would be easier to follow if you can tell which exact code are you > > > referring here? > > > > I meant the call of TransactionIdDidCommit() in > > zundo.c:zheap_exec_pending_rollback(). > > > > IIRC, this should be called for temp tables after they have exited as > this is only to apply the pending undo actions if any, and in case of > temporary undo after session exit, we shouldn't need it. I see (had to play with the debugger a bit). Currently this works because the temporary relations are dropped by AbortTransaction() -> smgrDoPendingDeletes(), before the undo execution starts. The situation will change as soon as the file removal is also handled by the undo subsystem; however, I'm still not sure how to hit the TransactionIdDidCommit() call for an XID already truncated from the CLOG. I'm starting to admit that there's no issue here: temporary undo is always applied immediately in the foreground, and thus the zheap_exec_pending_rollback() function never needs to examine an XID which no longer exists in the CLOG. > I am not able to understand what exact problem you are facing for temp > tables after the session exit. Can you please explain it a bit more? 
The problem is that the temporary undo log is loaded into backend-local buffers. Thus there's no guarantee that we'll find consistent information in the undo file even if the backend exited cleanly (local buffers are not flushed at backend exit and there's n
Re: POC: Cleaning up orphaned files using undo logs
Amit Kapila wrote: > On Fri, Sep 24, 2021 at 4:44 PM Antonin Houska wrote: > > > > Amit Kapila wrote: > > > > > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska wrote: > > > > > > > > Amit Kapila wrote: > > > > > > > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthali...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > > > > > > > > > * What happened with the idea of abandoning discard worker for the > > > > > > sake > > > > > > of simplicity? From what I see limiting everything to foreground > > > > > > undo > > > > > > could reduce the core of the patch series to the first four > > > > > > patches > > > > > > (forgetting about test and docs, but I guess it would be enough at > > > > > > least for the design review), which is already less overwhelming. > > > > > > What we can miss, at least for the cleanup of the orphaned files, is > > > > the *undo > > > > worker*. In this patch series the cleanup is handled by the startup > > > > process. > > > > > > > > > > Okay, I think various people at different point of times has suggested > > > that idea. I think one thing we might need to consider is what to do > > > in case of a FATAL error? In case of FATAL error, it won't be > > > advisable to execute undo immediately, so would we upgrade the error > > > to PANIC in such cases. I remember vaguely that for clean up of > > > orphaned files that can happen rarely and someone has suggested > > > upgrading the error to PANIC in such a case but I don't remember the > > > exact details. > > > > Do you mean FATAL error during normal operation? > > > > Yes. > > > As far as I understand, even > > zheap does not rely on immediate UNDO execution (otherwise it'd never > > introduce the undo worker), so FATAL only means that the undo needs to be > > applied later so it can be discarded. 
> > > > Yeah, zheap either applies undo later via background worker or next > time before dml operation if there is a need. > > > As for the orphaned files cleanup feature with no undo worker, we might need > > PANIC to ensure that the undo is applied during restart and that it can be > > discarded, otherwise the unapplied undo log would stay there until the next > > (regular) restart and it would block discarding. However upgrading FATAL to > > PANIC just because the current transaction created a table seems quite > > rude. > > > > True, I guess but we can once see in what all scenarios it can > generate FATAL during that operation. By "that operation" you mean "CREATE TABLE"? It's not about FATAL during CREATE TABLE; rather, it's about FATAL anytime during a transaction. Whichever operation caused the FATAL error, we'd need to upgrade it to PANIC as long as the transaction has some undo. Although the postgres core probably does not raise FATAL errors too often (OOM conditions seem to be the typical cause), I'm still not enthusiastic about the idea that the undo feature turns such errors into PANIC. I wonder what the reason to avoid undoing transactions on FATAL is. If it's about possibly long duration of the undo execution, deletion of orphaned files (relations or whole databases) via undo shouldn't make things worse because currently FATAL also triggers this sort of cleanup immediately, it's just implemented in different ways. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
Amit Kapila wrote: > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska wrote: > > > > Amit Kapila wrote: > > > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthali...@gmail.com> > > > wrote: > > > > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > > > > > * What happened with the idea of abandoning discard worker for the sake > > > > of simplicity? From what I see limiting everything to foreground undo > > > > could reduce the core of the patch series to the first four patches > > > > (forgetting about test and docs, but I guess it would be enough at > > > > least for the design review), which is already less overwhelming. > > What we can miss, at least for the cleanup of the orphaned files, is the > > *undo > > worker*. In this patch series the cleanup is handled by the startup process. > > > > Okay, I think various people at different point of times has suggested > that idea. I think one thing we might need to consider is what to do > in case of a FATAL error? In case of FATAL error, it won't be > advisable to execute undo immediately, so would we upgrade the error > to PANIC in such cases. I remember vaguely that for clean up of > orphaned files that can happen rarely and someone has suggested > upgrading the error to PANIC in such a case but I don't remember the > exact details. Do you mean FATAL error during normal operation? As far as I understand, even zheap does not rely on immediate UNDO execution (otherwise it'd never introduce the undo worker), so FATAL only means that the undo needs to be applied later so it can be discarded. As for the orphaned files cleanup feature with no undo worker, we might need PANIC to ensure that the undo is applied during restart and that it can be discarded, otherwise the unapplied undo log would stay there until the next (regular) restart and it would block discarding. However upgrading FATAL to PANIC just because the current transaction created a table seems quite rude. 
So the undo worker might be needed even for this patch? Or do you mean FATAL error when executing the UNDO? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > Antonin Houska wrote: > > > > > Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > > > > * By throwing at the patchset `make installcheck` I'm getting from time > > > > to time > > > > and error on the restart: > > > > > > > > TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)", > > > > File: "undorecordset.c", Line: 1098, PID: 6055) > > > > > > > > From what I see XLogReadBufferForRedoExtended finds an invalid buffer > > > > and > > > > returns BLK_NOTFOUND. The commentary says: > > > > > > > > If the block was not found, then it must be discarded later in > > > > the WAL. > > > > > > > > and continues with skip = false, but fails to get a page from an > > > > invalid > > > > buffer few lines later. It seems that the skip flag is supposed to be > > > > used > > > > this situation, should it also guard the BufferGetPage part? > > > > > > I could see this sometime too, but can't reproduce it now. It's also not > > > clear > > > to me how XLogReadBufferForRedoExtended() can return BLK_NOTFOUND, as the > > > whole undo log segment is created at once, even if only part of it is > > > needed - > > > see allocate_empty_undo_segment(). > > > > I could eventually reproduce the problem. The root cause was that WAL > > records > > were created even for temporary / unlogged undo, and thus only empty pages > > could be found during replay. I've fixed that and also setup regular test > > for > > the BLK_NOTFOUND value. That required a few more fixes to UndoReplay(). > > > > Attached is a new version. > > Yep, makes sense, thanks. I have few more questions: > > * The use case with orphaned files is working somewhat differently after > the rebase on the latest master, do you observe it as well? 
The > difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up > an orphaned relation file immediately (only later on checkpoint) > because of empty pendingUnlinks. I haven't investigated more yet, but > seems like after this commit: > > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 > Author: Thomas Munro > Date: Mon Aug 2 17:32:20 2021 +1200 > > Run checkpointer and bgwriter in crash recovery. > > Start up the checkpointer and bgwriter during crash recovery (except > in > --single mode), as we do for replication. This wasn't done back in > commit cdd46c76 out of caution. Now it seems like a better idea to > make > the environment as similar as possible in both cases. There may also > be > some performance advantages. > > something has to be updated (pendingOps are empty right now, so no > unlink request is remembered). I haven't been debugging that part recently, but yes, this commit is relevant, thanks for pointing that out! Attached is a patch that should fix it. I'll include it in the next version of the patch series, unless you tell me that something is still wrong. -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/src/backend/access/undo/undorecordset.c b/src/backend/access/undo/undorecordset.c index 59eba7dfb6..9d05824141 100644 --- a/src/backend/access/undo/undorecordset.c +++ b/src/backend/access/undo/undorecordset.c @@ -2622,14 +2622,6 @@ ApplyPendingUndo(void) } } - /* - * Some undo actions may unlink files. Since the checkpointer is not - * guaranteed to be up, it seems simpler to process the undo request - * ourselves in the way the checkpointer would do. - */ - SyncPreCheckpoint(); - SyncPostCheckpoint(); - /* Cleanup. */ chunktable_destroy(sets); }
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Michail Nikolaev wrote: > Hello. > > Added a check for standby promotion with the long transaction to the > test (code and docs are unchanged). I'm trying to continue the review, sorry for the delay. Following are a few questions about the code: * Does the masking need to happen in the AM code, e.g. _bt_killitems()? I'd expect that the RmgrData.rm_fpi_mask can do all the work. Maybe you're concerned about clearing the "LP-safe-on-standby" bits after promotion, but I wouldn't consider this a problem: once the standby is allowed to set the hint bits (i.e. minRecoveryPoint is high enough, see IsIndexLpDeadAllowed() -> XLogNeedsFlush()), promotion shouldn't break anything because it should not allow minRecoveryPoint to go backwards. * How about modifying rm_mask() instead of introducing rm_fpi_mask()? Perhaps a boolean argument can be added to distinguish the purpose of the masking. * Are you sure it's o.k. to use mask_lp_flags() here? It sets the item flags to LP_UNUSED unconditionally, which IMO should only be done by VACUUM. I think you only need to revert the effect of prior ItemIdMarkDead(), so you only need to change the status LP_DEAD to LP_NORMAL if the tuple still has storage. (And maybe add an assertion to ItemIdMarkDead() confirming that it's only used for LP_NORMAL items?) As far as I understand, the current code only uses mask_lp_flags() during WAL consistency check on copies of pages which don't eventually get written to disk. * IsIndexLpDeadAllowed() ** is bufmgr.c the best location for this function? ** the header comment should explain the minLsn argument. ** comment /* It is always allowed on primary if *all_dead. */ should probably be /* It is always allowed on primary if ->all_dead. */ * comment: XLOG_HEAP2_CLEAN has been renamed to XLOG_HEAP2_PRUNE in PG14. 
On regression tests: * Is the purpose of the repeatable read (RR) snapshot to test that heap_hot_search_buffer() does not set deadness->all_dead if some transaction can still see a tuple of the chain? If so, I think the RR snapshot does not have to be used in the tests because this patch does not really affect the logic: heap_hot_search_buffer() only sets deadness->all_dead to false, just like it sets *all_dead in the current code. Besides that, IsIndexLpDeadAllowed() too can avoid setting of the LP_DEAD flag on an index tuple (at least until the commit record of the deleting/updating transaction gets flushed to disk), so it can hide the behaviour of heap_hot_search_buffer(). * Unless I miss something, the tests check that the hint bits are not propagated from primary (or they are propagated but marked non-safe), however there's no test to check that standby does set the hint bits itself. * I'm also not sure if promotion needs to be tested. What's specific about the promoted cluster from the point of view of this feature? The only thing I can think of is clearing of the "LP-safe-on-standby" bits, but, as I said above, I'm not sure if the tests ever let standby to set those bits before the promotion. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
Amit Kapila wrote: > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska wrote: > > The problem of the temporary undo log is that it's loaded into local buffers > > and that backend can exit w/o flushing local buffers to disk, and thus we > > are > > not guaranteed to find enough information when trying to discard the undo > > log > > the backend wrote. I'm thinking about the following solutions: > > > > 1. Let the backend manage temporary undo log on its own (even the slot > >metadata would stay outside the shared memory, and in particular the > >insertion pointer could start from 1 for each session) and remove the > >segment files at the same moment the temporary relations are removed. > > > >However, by moving the temporary undo slots away from the shared memory, > >computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > >be affected. It might seem that a transaction which only writes undo log > >for temporary relations does not need to affect oldestFullXidHavingUndo, > >but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > >prevents transactions to be truncated from the CLOG too early, I wonder > > if > >the following is possible (This scenario is only applicable to the zheap > >storage engine [1], which is not included in this patch, but should > > already > >be considered.): > > > >A transaction creates a temporary table, does some (many) changes and > > then > >gets rolled back. The undo records are being applied and it takes some > >time. Since XID of the transaction did not affect > > oldestFullXidHavingUndo, > >the XID can disappear from the CLOG due to truncation. > > > > By above do you mean to say that in zheap code, we don't consider XIDs > that operate on temp table/undo for oldestFullXidHavingUndo? I was referring to the code /* We can't process temporary undo logs. */ if (log->meta.persistence == UNDO_TEMP) continue; in undodiscard.c:UndoDiscard(). 
> > > However zundo.c in > >[1] indicates that the transaction status *is* checked during undo > >execution, so we might have a problem. > > > > It would be easier to follow if you can tell which exact code are you > referring here? I meant the call of TransactionIdDidCommit() in zundo.c:zheap_exec_pending_rollback(). -- Antonin Houska Web: https://www.cybertec-postgresql.com
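The race discussed in this exchange can be sketched as a toy model (hypothetical names and values, not PostgreSQL code): if an XID that still has pending temporary undo does not hold back oldestFullXidHavingUndo, CLOG truncation may discard the transaction's status before the pending rollback looks it up via TransactionIdDidCommit().

```python
# Toy model of the CLOG-truncation race (illustrative only, not PostgreSQL code).

clog = {100: "aborted"}        # XID 100 rolled back; status still in CLOG
oldest_xid_having_undo = 200   # XID 100 (temp undo only) was NOT counted here

def truncate_clog(cutoff):
    """CLOG entries below the cutoff are assumed to be no longer needed."""
    for xid in list(clog):
        if xid < cutoff:
            del clog[xid]

def did_commit(xid):
    """Stand-in for TransactionIdDidCommit() during undo execution."""
    if xid not in clog:
        raise LookupError("XID %d already truncated from CLOG" % xid)
    return clog[xid] == "committed"

truncate_clog(oldest_xid_having_undo)

# The pending rollback of XID 100 now needs the status that is gone:
try:
    did_commit(100)
except LookupError as e:
    print(e)   # XID 100 already truncated from CLOG
```

The point is only that whatever mechanism tracks the oldest XID having undo must also account for temporary undo, or the status lookup during undo execution can fail.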
Re: POC: Cleaning up orphaned files using undo logs
Amit Kapila wrote: > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > * What happened with the idea of abandoning discard worker for the sake > > of simplicity? From what I see limiting everything to foreground undo > > could reduce the core of the patch series to the first four patches > > (forgetting about test and docs, but I guess it would be enough at > > least for the design review), which is already less overwhelming. > > > > I think the discard worker would be required even if we decide to > apply all the undo in the foreground. We need to forget/remove the > undo of committed transactions as well which we can't remove > immediately after the commit. I think I proposed foreground discarding at some point, but you reminded me that the undo may still be needed for some time even after transaction commit. Thus the discard worker is indispensable. What we can do without, at least for the cleanup of the orphaned files, is the *undo worker*. In this patch series the cleanup is handled by the startup process. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
Antonin Houska wrote: > tsunakawa.ta...@fujitsu.com wrote: > > > I'm crawling like a snail to read the patch set. Below are my first set of > > review comments, which are all minor. > > Thanks. I've added the patch to the upcoming CF [1], so it possibly gets more review and makes some progress. I've marked myself as the author so it's clear who will try to respond to the reviews. It's clear that other people did much more work on the feature than I did so far - they are welcome to add themselves to the author list. [1] https://commitfest.postgresql.org/33/3228/ -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: decoupling table and index vacuum
Andres Freund wrote: > On 2021-04-21 11:21:31 -0400, Robert Haas wrote: > > This scheme adds a lot of complexity, which is a concern, but it seems > > to me that it might have several benefits. One is concurrency. You > > could have one process gathering dead TIDs and adding them to the > > dead-TID fork while another process is vacuuming previously-gathered > > TIDs from some index. > > I think it might even open the door to using multiple processes > gathering dead TIDs for the same relation. I think the possible concurrency improvements are themselves a valid reason to do the decoupling. Or rather it's hard to imagine how the current implementation of VACUUM can get parallel workers involved in gathering the dead heap TIDs efficiently. Currently, a single backend gathers the heap TIDs, and it can then launch several parallel workers to remove the TIDs from indexes. If parallel workers gathered the heap TIDs, then (w/o the decoupling) the parallel index processing would be a problem because a parallel worker cannot launch other parallel workers. > > In fact, every index could be getting vacuumed at the same time, and > > different indexes could be removing different TID ranges. > > We kind of have this feature right now, due to parallel vacuum... -- Antonin Houska Web: https://www.cybertec-postgresql.com
Possible optimization of heap_lock_tuple()
It seems that a concurrent UPDATE can restart heap_lock_tuple() even if it's not necessary. Is the attached proposal correct and worth applying? -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 6ac07f2fda..c3310ab4f9 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -4553,7 +4553,7 @@ l3: */ if (!HeapTupleHeaderIsOnlyLocked(tuple->t_data) && ((tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED) || - !updated)) + (!updated && follow_updates))) goto l3; /* Things look okay, so we can skip sleeping */
Re: storing an explicit nonce
Bruce Momjian wrote: > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote: > > Hi, > > > > On 2021-05-25 17:29:03 -0400, Bruce Momjian wrote: > > > So, let me ask --- I thought CTR basically took an encrypted stream of > > > bits and XOR'ed them with the data. If that is true, then why are > > > changing hint bits a problem? We already can see some of the bit stream > > > by knowing some bytes of the page. > > > > A *single* reuse of the nonce in CTR reveals nearly all of the > > plaintext. As you say, the data is XORed with the key stream. Reusing > > the nonce means that you reuse the key stream. Which in turn allows you > > to do: > > (data ^ stream) ^ (data' ^ stream) > > which can be simplified to > > (data ^ data') > > thereby leaking all of data except the difference between data and > > data'. That's why it's so crucial to ensure that stream *always* differs > > between two rounds of encrypting "related" data. > > > > We can't just "hope" that data doesn't change and use CTR. > > My point was about whether we need to change the nonce, and hence > WAL-log full page images if we change hint bits. If we don't and > reencrypt the page with the same nonce, don't we only expose the hint > bits? I was not suggesting we avoid changing the nonce in non-hint-bit > cases. > > I don't understand your computation above. You decrypt the page into > shared buffers, you change a hint bit, and rewrite the page. You are > re-XOR'ing the buffer copy with the same key and nonce. Doesn't that > only change the hint bits in the new write? The way I view things is that CTR mode encrypts each individual bit independently of any other bit on the page. For non-hint bits data = data', so (data ^ data') is always zero, regardless of the actual values of the data. So I agree with you that by reusing the nonce we only expose the hint bits. -- Antonin Houska Web: https://www.cybertec-postgresql.com
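The XOR argument quoted above is easy to demonstrate. The sketch below (hypothetical page contents, not PostgreSQL code) shows that encrypting two versions of a page with the same CTR keystream leaks exactly data ^ data', which is zero everywhere except the bits that actually changed - i.e. only the flipped hint bits are exposed:

```python
import os

def ctr_xor(data: bytes, keystream: bytes) -> bytes:
    # CTR mode boils down to XORing the plaintext with a key/nonce stream.
    return bytes(d ^ s for d, s in zip(data, keystream))

keystream = os.urandom(8)        # same key + same nonce => same stream

page = b"\x00\x01\x02\x03\x04\x05\x06\x07"
page2 = bytes([page[0] ^ 0x10]) + page[1:]   # flip one "hint bit"

c1 = ctr_xor(page, keystream)
c2 = ctr_xor(page2, keystream)

# XORing the two ciphertexts cancels the keystream entirely:
leak = bytes(a ^ b for a, b in zip(c1, c2))
assert leak == bytes(a ^ b for a, b in zip(page, page2))
print(leak.hex())   # 1000000000000000 -- only the changed bit is revealed
```

This also illustrates Andres's stronger point: if the two page versions differ in more than hint bits, every differing bit becomes visible to an attacker who captures both ciphertexts.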
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Michail Nikolaev wrote: > After some correspondence with Peter Geoghegan (1) and his ideas, I > have reworked the patch a lot and now it is much more simple with even > better performance (no new WAL or conflict resolution, hot standby > feedback is unrelated). My review that started in [1] continues here. (Please note that code.patch does not apply to the current master branch.) I think I understand your approach now and couldn't find a problem by reading the code. What I consider worth improving is documentation, both code comments and nbtree/README. Especially for the problem discussed in [1] it should be explained what would happen if kill_prior_tuple_min_lsn was not checked. Also, in IsIndexLpDeadAllowed() you say that invalid deadness->latest_removed_xid means the following: /* * Looks like it is tuple cleared by heap_page_prune_execute, * we must be sure if LSN of XLOG_HEAP2_CLEAN (or any subsequent * updates) less than minRecoveryPoint to avoid MVCC failure * after crash recovery. */ However I think there's one more case: if heap_hot_search_buffer() considers all tuples in the chain to be "surely dead", but HeapTupleHeaderAdvanceLatestRemovedXid() skips them all for this reason: /* * Ignore tuples inserted by an aborted transaction or if the tuple was * updated/deleted by the inserting transaction. * * Look for a committed hint bit, or if no xmin bit is set, check clog. */ I think that the dead tuples produced this way should never be visible on the standby (and even if they were, they would change the page LSN so your algorithm would treat them correctly) so I see no correctness problem. But it might be worth explaining better the meaning of invalid "latest_removed_xid" in comments. In the nbtree/README, you say "... if the commit record of latestRemovedXid is more ..." but it's not clear to me what "latestRemovedXid" is. If you mean the scan->kill_prior_tuple_min_lsn field, you probably need more words to explain it. 
* IsIndexLpDeadAllowed()

/* It all always allowed on primary if *all_dead. */

should probably be

/* It is always allowed on primary if *all_dead. */

* gistkillitems()

As the function is only called if (so->numKilled > 0), I think both the "killedsomething" and "dirty" variables should always have the same value, so one variable should be enough. Assert(so->numKilled) would be appropriate in that case. The situation is similar for btree and hash indexes.

doc.patch:

"+applying the fill page write."

[1] https://www.postgresql.org/message-id/61470.1620647290%40antos

-- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Michail Nikolaev wrote: > > Sorry, I missed the fact that your example can be executed inside BEGIN - > > END > > block, in which case minRecoveryPoint won't advance after each command. > > No, the block is not executed as a single transaction, all commands > are separated transactions (see below) > > > Actually I think that a commit record should be replayed > > more often than XLOG_RUNNING_XACTS, shouldn't it? > > Yes, but replaying commit records DOES NOT affect minRecoveryPoint in > almost all cases. > > UpdateMinRecoveryPoint is called by XLogFlush, but xact_redo_commit > calls XLogFlush only in two cases: > * DropRelationFiles is called (some relation are dropped) > * If ForceSyncCommit was used on primary - few “heavy” commands, like > DropTableSpace, CreateTableSpace, movedb, etc. > > But “regular” commit record is replayed without XLogFlush and, as > result, without UpdateMinRecoveryPoint. OK, I missed this. Thanks for the explanation. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Michail Nikolaev wrote: > Hello, Antonin. > > > I'm trying to review the patch, but not sure if I understand this problem, > > please see my comment below. > > Thanks a lot for your attention. It is strongly recommended to look at > version N3 (1) because it is a much more elegant, easy, and reliable > solution :) But the minRecoveryPoint-related issue affects it anyway. Indeed I'm reviewing (1), but I wanted to discuss this particular question in context, so I replied here. > > Why doesn't minRecoveryPoint get updated to 20? IMO that should happen by > > replaying the commit record. And if the standby happens to crash before the > > commit record could be replayed, no query should see the deletion and thus > > no > > hint bit should be set in the index. > > minRecoveryPoint is not affected by replaying the commit record in > most cases. It is updated in a lazy way, something like this: > minRecoveryPoint = max LSN of flushed page. Version 3 of a patch > contains a code_optional.patch to move minRecoveryPoint more > aggressively to get additional performance on standby (based on > Peter’s answer in (2). > So, “minRecoveryPoint will go here” is not because of “STANDBY INSERTS > NEW ROW IN INDEX” it is just a random event. > Michail. Sorry, I missed the fact that your example can be executed inside BEGIN - END block, in which case minRecoveryPoint won't advance after each command. 
I'll continue my review by replying to (1)

> [1]:
> https://www.postgresql.org/message-id/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH%3DcMXZSf%2Bnzvg%40mail.gmail.com
> [2]:
> https://www.postgresql.org/message-id/CAH2-WzkSUcuFukhJdSxHFgtL6zEQgNhgOzNBiTbP_4u%3Dk6igAg%40mail.gmail.com
> (“Also, btw, do you know any reason to keep minRecoveryPoint at a low
> value?”)

I'm not an expert in this area (I'm reviewing this patch also to learn more about recovery and replication), but after brief research I think that postgres tries not to update the control file too frequently, see the comments in UpdateMinRecoveryPoint(). I don't know if what you do in code_optional.patch would be a problem. Actually, I think that a commit record should be replayed more often than XLOG_RUNNING_XACTS, shouldn't it?

-- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
I'm trying to review the patch, but not sure if I understand this problem, please see my comment below. Michail Nikolaev wrote: > Oh, I just realized that it seems like I was too naive to allow > standby to set LP_DEAD bits this way. > There is a possible consistency problem in the case of low > minRecoveryPoint value (because hint bits do not move PageLSN > forward). > > Something like this: > > LSN=10 STANDBY INSERTS NEW ROW IN INDEX (index_lsn=10) > <---minRecoveryPoint will go here > LSN=20 STANDBY DELETES ROW FROM HEAP, INDEX UNTACHED (index_lsn=10) Why doesn't minRecoveryPoint get updated to 20? IMO that should happen by replaying the commit record. And if the standby happens to crash before the commit record could be replayed, no query should see the deletion and thus no hint bit should be set in the index. >REPLICA SCANS INDEX AND SET hint bits (index_lsn=10) >INDEX IS FLUSHED (minRecoveryPoint=index_lsn=10) >CRASH > > On crash recovery, a standby will be able to handle queries after > LSN=10. But the index page contains hints bits from the future > (LSN=20). > So, need to think here. -- Antonin Houska Web: https://www.cybertec-postgresql.com
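The scenario in the message above can be sketched as a toy model (illustrative Python, not PostgreSQL code; the LSN values follow the example): because hint-bit-only changes do not advance the page LSN, flushing such a page does not push minRecoveryPoint past the heap deletion that justified the hint bit.

```python
# Toy model: minRecoveryPoint only advances when a page flush observes a
# higher page LSN, and hint-bit-only changes leave the page LSN untouched.

min_recovery_point = 0

def flush(page):
    global min_recovery_point
    min_recovery_point = max(min_recovery_point, page["lsn"])

index_page = {"lsn": 10, "lp_dead": False}  # LSN=10: row inserted into index
heap_delete_lsn = 20                        # LSN=20: heap row deleted, index untouched

# A scan sees the dead heap tuple and sets the index hint bit;
# the page LSN is deliberately not advanced for hint-bit-only changes.
index_page["lp_dead"] = True

flush(index_page)      # minRecoveryPoint becomes 10, not 20
# ... crash here ...

# Crash recovery declares the standby consistent at minRecoveryPoint=10,
# yet the flushed hint bit depends on a deletion at LSN=20:
assert min_recovery_point == 10 and heap_delete_lsn > min_recovery_point
print("hint bit from the future:", heap_delete_lsn > min_recovery_point)
```

In this model the persisted hint bit reflects a deletion that recovery may not have replayed yet, which is exactly the consistency concern raised above.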
Re: Built-in connection pooler
Konstantin Knizhnik wrote:

> People asked me to resubmit built-in connection pooler patch to commitfest.
> Rebased version of connection pooler is attached.

I've reviewed the patch but haven't read the entire thread thoroughly. I hope that I don't duplicate many comments posted earlier by others. (Please note that the patch does not apply to the current master, I had to reset the head of my repository to the appropriate commit.)

Documentation / user interface
--

* session_pool_size (config.sgml)

I wonder if "The default value is 10, so up to 10 backends will serve each database," should rather be "The default value is 10, so up to 10 backends will serve each database/user combination." However, when I read the code, I think that each proxy counts the size of the pool separately, so the total number of backends used for a particular database/user combination seems to be

session_pool_size * connection_proxies

Since the feature uses shared memory statistics anyway, shouldn't it only count the total number of backends per database/user? It would need some locking, but the actual pools (hash tables) could still be local to the proxy processes.

* connection_proxies

(I've noticed that Ryan Lambert questioned this variable upthread.) I think this variable makes the configuration less straightforward from the user perspective. Cannot the server launch additional proxies dynamically, as needed, e.g. based on the shared memory statistics that the patch introduces? I see that postmaster would have to send the sockets in a different way. How about adding a "proxy launcher" process that would take care of the scheduling and launching of new proxies?

* multitenant_proxy

I thought the purpose of this setting is to reduce the number of backends needed, but could not find evidence in the code. In particular, client_attach() always retrieves the backend from the appropriate pool, and backend_reschedule() does so as well. Thus the role of both client and backend should always match.
What piece of information am I missing?

* typo (2 occurrences in config.sgml)

"stanalone" -> "standalone"

Design / coding
---

* proxy.c:backend_start() does not change the value of the "host" parameter to the socket directory, so I assume the proxy connects to the backend via the TCP protocol. I think the unix socket should be preferred for this connection if the platform has it, however:

* is libpq necessary for the proxy to connect to the backend at all? Maybe postgres.c:ReadCommand() can be adjusted so that the backend can communicate with the proxy just via the plain socket. I don't like the idea of server components communicating via libpq (do we need anything else of the libpq connection than the socket?) as such, but especially these includes in proxy.c look weird:

#include "../interfaces/libpq/libpq-fe.h"
#include "../interfaces/libpq/libpq-int.h"

* How does the proxy recognize connections to the walsender? I haven't tested that, but it's obvious that these connections should not be proxied.

* ConnectionProxyState is in shared memory, so access to its fields should be synchronized.

* StartConnectionProxies() is only called from PostmasterMain(), so I'm not sure the proxies get restarted after a crash. Perhaps PostmasterStateMachine() needs to call it too, after calling StartupDataBase().

* Why do you need the Channel.magic integer field? Wouldn't a boolean field "active" be sufficient?

** In proxy_loop(), I've noticed (chan->magic == ACTIVE_CHANNEL_MAGIC) tests inside the branch

else if (chan->magic == ACTIVE_CHANNEL_MAGIC)

Since neither channel_write() nor channel_read() seems to change the value, I think those tests are not necessary.

* Comment lines are often too long.

* pgindent should be applied to the patch at some point.

I can spend more time reviewing the patch during the next CF.

-- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: WIP: Aggregation push-down
David Steele wrote: > On 7/3/20 6:07 AM, Laurenz Albe wrote: > > On Thu, 2020-07-02 at 14:39 +0200, Daniel Gustafsson wrote: > >> This version now fails to apply to HEAD, with what looks like like a > >> trivial > >> error in the expected test output. Can you please submit a rebased > >> version so > >> we can see it run in the patch tester CI? I'm marking the entry as > >> Waiting on > >> Author in the meantime. > > > > I have rebased the patch against current HEAD, it passes "make > > installcheck". > > This needs another rebase so marked as Waiting on Author: > http://cfbot.cputube.org/patch_32_1247.log Besides the fact that I don't have time to work on it now, I'm not able to convince myself to put more effort into it. If there has been almost no progress for several years, I don't believe further rebases are anything but wasted effort. So I've withdrawn the patch, also to save CFMs from examining the history again and again uselessly. The code is free so anyone can continue if he thinks it makes sense. If it finally finds its way into the PG core and if a meaningful part of my code remains, I'd appreciate it if I was mentioned in the release notes. Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
tsunakawa.ta...@fujitsu.com wrote: > I'm crawling like a snail to read the patch set. Below are my first set of > review comments, which are all minor. Thanks. I will check your comments when I'll be preparing the next version of the patch. -- Antonin Houska Web: https://www.cybertec-postgresql.com
pg_config_h.in not up-to-date
When I run "autoreconf" on the master branch, git generates the diff below. Shouldn't it just be applied? I suppose someone changed configure.ac and forgot to update the generated file. -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 55cab4d2bf..806eabac81 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -899,7 +899,7 @@ /* Define to select named POSIX semaphores. */ #undef USE_NAMED_POSIX_SEMAPHORES -/* Define to build with OpenSSL support. (--with-ssl=openssl) */ +/* Define to 1 if you have OpenSSL support. */ #undef USE_OPENSSL /* Define to 1 to build with PAM support. (--with-pam) */
Re: POC: Cleaning up orphaned files using undo logs
Bruce Momjian wrote: > On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote: > > Antonin Houska wrote: > > > Well, on repeated run of the test I could also hit the first one. I could > > > fix > > > it and will post a new version of the patch (along with some other small > > > changes) this week. > > > > Attached is the next version. Changes done: > > Yikes, this patch is 23k lines, and most of it looks like added lines of > code. Is this size expected? Yes, earlier versions of this patch, e.g. [1], were of comparable size. It's not really an "ordinary patch". [1] https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com