Re: UniqueKey v2
Antonin Houska wrote:
> Andy Fan wrote:
> > > * Combining the UKs
> > >
> > > IMO this is the most problematic part of the patch. You call
> > > populate_joinrel_uniquekeys() for the same join multiple times,
> >
> > Why do you think so? The below code is called in "make_join_rel"
>
> Consider join of tables "a", "b" and "c". My understanding is that
> make_join_rel() is called once with rel1={a} and rel2={b join c}, then with
> rel1={a join b} and rel2={c}, etc. I wanted to say that each call should
> produce the same set of unique keys.
>
> I need to check this part more in detail.

I think I understand now. By calling populate_joinrel_uniquekeys() for various
orderings, you can find out that various input relation unique keys can
represent the whole join. For example, if the ordering is A JOIN (B JOIN C),
you can prove that the unique keys of A can be used for the whole join, while
for the ordering B JOIN (A JOIN C) you can prove the same for the unique keys
of B, and so on.

> > Is your original question about populate_joinrel_uniquekey_for_rel
> > rather than populate_joinrel_uniquekeys? We have the below code:
> >
> >     outeruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, outerrel,
> >                                                              innerrel, restrictlist);
> >     inneruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, innerrel,
> >                                                              outerrel, restrictlist);
> >
> > This is mainly because of the following theory, quoted from
> > README.uniquekey. Let's call this "rule 1".
> >
> > """
> > How to maintain the uniquekey?
> > ---
> > .. From the join relation, it is maintained with two rules:
> >
> > - the uniquekey in one side is still unique if it can't be duplicated
> >   after the join. For example:
> >
> >     SELECT t1.pk FROM t1 JOIN t2 ON t1.a = t2.pk;
> >     UniqueKey on t1: t1.pk
> >     UniqueKey on t1 Join t2: t1.pk
> > """
> >
> > And the below code:
> >
> >     if (outeruk_still_valid || inneruk_still_valid)
> >
> >         /*
> >          * The uniquekeys on the outer or inner side have been added into
> >          * joinrel, so the combined uniquekey from both sides is not needed.
> >          */
> >         return;
> >
> > We don't create the component uniquekey if either of the two sides is
> > unique already. For example: if "(t1.id) in joinrel(t1, t2) is unique" OR
> > "(t2.id) in joinrel is unique", there is no need to create the component
> > UniqueKey (t1.id, t2.id).

ok, I need to check more in detail how this part works. This optimization
makes sense to me.

> > > Of course one problem is that the number of combinations can grow
> > > exponentially as new relations are joined.
> >
> > Yes, that's why "rule 1" is needed and "How to reduce the overhead" in
> > README.uniquekey is introduced.

I think there should yet be some guarantee that the number of unique keys does
not grow exponentially. Perhaps a constant that allows a relation (base or
join) to have at most N unique keys. (I imagine N to be rather small, e.g. 3
or 4.) And when picking the "best N keys", one criterion could be the number
of expressions in the key (the shorter the key, the better).

> > > 2) Check if the join relation is single-row
> > >
> > > In order to get rid of the dependency on 'restrictlist', I think you
> > > can use ECs. Consider a query from your regression tests:
> > >
> > > CREATE TABLE uk_t (id int primary key, a int not null, b int not null,
> > > c int, d int, e int);
> > >
> > > SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and
> > > t1.id = 1;
> > >
> > > The idea here seems to be that no more than one row of t1 matches the
> > > query clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id
> > > ensures that no more than one row of t2 matches the query (because t1
> > > cannot provide the clause with more than one input value of 'e'). And
> > > therefore, the join also produces at most one row.
> >
> > You are correct and IMO my current code is able to tell it is a single
> > row as well.
> >
> > 1. Since t1.id = 1, t1 is a single row, so t1.d is a uniquekey as a
> > consequence.
> > 2. Given t2.id is unique and t1.e = t2.id, t1's uniquekey can be kept
> > after the join because of rule 1 on the joinrel; and t1 is a single row,
> > so the joinrel is a single row as well.

ok, I think I understand now.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
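The "at most N unique keys" suggestion above could be sketched roughly as
follows. This is only an illustration of the idea, not part of the patch:
UNIQUEKEY_MAX and uniquekey_length_cmp() are hypothetical names, while
list_sort() and list_truncate() are the existing List API.

```c
/*
 * Hypothetical pruning: keep at most UNIQUEKEY_MAX unique keys per
 * relation, preferring keys with fewer expressions.
 */
#define UNIQUEKEY_MAX	4		/* made-up constant */

static List *
prune_uniquekeys(List *uniquekeys)
{
	if (list_length(uniquekeys) <= UNIQUEKEY_MAX)
		return uniquekeys;

	/* sort by number of expressions in the key, ascending (hypothetical) */
	list_sort(uniquekeys, uniquekey_length_cmp);

	/* discard everything beyond the first UNIQUEKEY_MAX entries */
	return list_truncate(uniquekeys, UNIQUEKEY_MAX);
}
```

The comparator could of course use other criteria too (e.g. prefer keys made
of plain Vars over expressions); the point is only that the list is bounded.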
Re: UniqueKey v2
> t', I think you can use ECs.", let's see what we can improve.
>
> > My theory is that a relation is single-row if it has a UK such that each
> > of its ECs meets at least one of the following conditions:
> >
> > a) contains a constant
>
> True.
>
> > b) contains a column of a relation which has already been proven
> > single-row.
>
> True, not sure if it is easy to tell.
>
> > c) is referenced by a UK of a relation which has already been proven
> > single-row.
>
> I can't follow here...

This is similar to an EC containing a constant: if an EC is used by a
single-row UK, all its members can only have a single value.

> > I think that in the example above, an EC {t1.e, t2.id} should exist. So
> > when checking whether 't2' is single-row, the condition b) can be used:
> > the UK of 't2' should reference the EC {t1.e, t2.id}, which in turn
> > contains the column t1.e. And 't1' is unique because its EC meets the
> > condition a). (Of course you need to check em_jdomain before you use a
> > particular EM.)
>
> I think the existing rule 1 for joinrel works well with the single-row
> case naturally; what can be improved if we add the theory you suggested
> here?

This is still the explanation of the idea how to mark a join unique key as
single-row separately from the other logic. As noted above, I need to learn
more about the unique keys of a join.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
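The proposed single-row test (conditions a, b, c above) could be sketched as
pseudo-C. All helper names and the iteration macros here are hypothetical,
invented only to show the control flow; nothing like them exists in the
patch:

```c
/*
 * Sketch only: does 'rel' provably produce at most one row?
 * Every helper below is hypothetical.
 */
static bool
rel_is_single_row(PlannerInfo *root, RelOptInfo *rel)
{
	foreach_uk(uk, rel->uniquekeys)		/* hypothetical iteration macro */
	{
		bool		all_ecs_ok = true;

		foreach_ec(ec, uk)				/* iterate the UK's ECs */
		{
			if (ec_contains_const(ec))
				continue;		/* condition a) */
			if (ec_contains_column_of_single_row_rel(root, ec))
				continue;		/* condition b) */
			if (ec_referenced_by_single_row_uk(root, ec))
				continue;		/* condition c) */

			all_ecs_ok = false;
			break;
		}

		if (all_ecs_ok)
			return true;		/* this UK proves single-row-ness */
	}
	return false;
}
```

Since condition b) and c) refer to relations "already proven single-row",
the real implementation would presumably have to iterate to a fixed point,
or process relations in join order.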
Join removal and attr_needed cleanup
The attached patch tries to fix a corner case where attr_needed of an inner
relation of an OJ contains the join relid only because another,
already-removed OJ needed some of its attributes. The unnecessary presence of
the join relid in attr_needed can prevent the planner from further join
removals. Do cases like this seem worth the effort, and is the logic I use
correct?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

>From 279856bf97ce08c0c2e0c736a00831bf6324713b Mon Sep 17 00:00:00 2001
From: Antonin Houska
Date: Mon, 29 Apr 2024 11:34:30 +0200
Subject: [PATCH] Cleanup attr_needed properly after join removal.

If an outer join (J1) was removed and if the remaining outer relation is also
an outer join (J2), the inner relation of J2 may still have the J2's relid in
the attr_needed set, because its attribute was referenced by J1. However,
after removal of J1, it's possible that no other join (nor the query
targetlist) actually needs any attribute of J2, and so the J2's relid should
not be in attr_needed anymore. This patch tries to spend more effort on
checking the contents of attr_needed after removal of J1 so that J2 can
potentially be removed as well.
---
 src/backend/optimizer/plan/analyzejoins.c | 117 +-
 src/test/regress/expected/join.out        |   9 ++
 src/test/regress/sql/join.sql             |   5 +
 3 files changed, 130 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 506fccd20c..1b9623efaa 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -46,6 +46,10 @@ bool enable_self_join_removal;

 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool innerrel_attrs_needed_above_outer_join(PlannerInfo *root,
+												   RelOptInfo *innerrel,
+												   SpecialJoinInfo *sjinfo,
+												   Relids joinrelids);
 static void remove_leftjoinrel_from_query(PlannerInfo *root, int relid,
 										  SpecialJoinInfo *sjinfo);
 static void remove_rel_from_restrictinfo(RestrictInfo *rinfo,
@@ -183,6 +187,8 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	List	   *clause_list = NIL;
 	ListCell   *l;
 	int			attroff;
+	Oid			remove_ojrelid;
+	bool		remove_ojrelid_valid;

 	/*
 	 * Must be a left join to a single baserel, else we aren't going to be
@@ -218,6 +224,13 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	joinrelids = bms_copy(inputrelids);
 	joinrelids = bms_add_member(joinrelids, sjinfo->ojrelid);

+	/*
+	 * By default there's no reason to remove sjinfo->ojrelid from
+	 * attr_needed, see below.
+	 */
+	remove_ojrelid = InvalidOid;
+	remove_ojrelid_valid = false;
+
 	/*
 	 * We can't remove the join if any inner-rel attributes are used above the
 	 * join. Here, "above" the join includes pushed-down conditions, so we
@@ -233,7 +246,38 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 		 attroff >= 0;
 		 attroff--)
 	{
-		if (!bms_is_subset(innerrel->attr_needed[attroff], inputrelids))
+		Relids		attr_needed = innerrel->attr_needed[attroff];
+
+		/*
+		 * If this join was among outer relations of an already-removed join,
+		 * attr_needed may still contain the relid of the current join because
+		 * join clauses of the removed join referenced it. Re-check if another
+		 * join can still reference this join and if not, remove it from
+		 * attr_needed.
+		 */
+		if (bms_is_member(sjinfo->ojrelid, attr_needed))
+		{
+			/* Do the following check only once per inner relation. */
+			if (!remove_ojrelid_valid)
+			{
+				/*
+				 * If no clause requires the join relid anymore, remember that
+				 * we should remove it from attr_needed.
+				 */
+				if (!innerrel_attrs_needed_above_outer_join(root, innerrel,
+															sjinfo,
+															joinrelids))
+					remove_ojrelid = sjinfo->ojrelid;
+
+				remove_ojrelid_valid = true;
+			}
+
+			if (OidIsValid(remove_ojrelid))
+				innerrel->attr_needed[attroff] = bms_del_member(attr_needed,
+																remove_ojrelid);
+		}
+
+		if (!bms_is_subset(attr_needed, inputrelids))
 			return false;
 	}
@@ -333,6 +377,77 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	return false;
 }

+/*
+ * innerrel_attrs_needed_above_outer_join
+ *		Re-check whether attributes of inner relation of an OJ are still
+ *		needed above that OJ.
+ */
+static bool
+innerrel_attrs_needed_above_outer_join(PlannerInfo *root, RelOptInfo *innerrel,
+									   SpecialJoinInfo *sjinfo,
+									   Relids joinrelids)
+{
+	bool		found = false;
+	ListCell   *l;
+
+	/* Check the join clauses */
+	foreach(l, innerrel->joininfo)
+	{
+		RestrictInfo *ri = lfirst_node(RestrictInfo, l);
+
+		if (bms_is_member(sjinfo->ojrelid, ri->required_relids))
+			return true;
+	}
+
+	/*
+	 * Also check if any clause generated from EC
Use "unique keys" to enhance outer join removal
While the "unique keys" feature [1] is still under development, I'm thinking
how it could be used to enhance the removal of useless outer joins. Is
something really bad about the 0002 patch attached? I recognize it may be
weird that a join relation possibly produces non-join paths (e.g. SeqScan),
but right now I don't have a better idea where the new code should appear.

I considered planning the subqueries before the existing call of
remove_useless_joins(), to make the unique keys available earlier. However,
it seems that more items can be added to 'baserestrictinfo' of the subquery
relation after that. Thus, by planning the subquery too early, we could miss
some opportunities to push clauses down to the subquery.

Please note that this patch depends on [1], which enhances
rel_is_distinct_for() and thus makes join_is_removable() a bit smarter. Also
note that 0001 is actually a minor fix to [1].

[1] https://www.postgresql.org/message-id/7971.1713526758%40antos

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

>From a0be4ee7698ff03d6c22398f20fd2c7efadbff45 Mon Sep 17 00:00:00 2001
From: Antonin Houska
Date: Mon, 29 Apr 2024 07:53:00 +0200
Subject: [PATCH 1/2] Undo comment changes.

These belong to the following patch.
---
 src/test/regress/expected/join.out | 11 +--
 src/test/regress/sql/join.sql      | 13 ++---
 2 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 571198a86a..fa407d37f5 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5648,12 +5648,11 @@ select d.* from d left join (select distinct * from b) s
  Seq Scan on d
 (1 row)

--- when the GROUP BY contains a column that is not in the join condition, join
--- removal takes place too, due to "unique keys" (Note: as of 9.6, we notice
--- that b.id is a primary key and so drop b.c_id from the GROUP BY of the
--- resulting plan; but this happens too late for join removal in the outer
--- plan level.)
-explain (costs off, verbose on)
+-- join removal is not possible when the GROUP BY contains a column that is
+-- not in the join condition. (Note: as of 9.6, we notice that b.id is a
+-- primary key and so drop b.c_id from the GROUP BY of the resulting plan;
+-- but this happens too late for join removal in the outer plan level.)
+explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
   on d.a = s.id;
                         QUERY PLAN
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index bf8fb27072..c4c6c7b8ba 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -2047,17 +2047,16 @@ explain (costs off)
 select d.* from d left join (select distinct * from b) s
   on d.a = s.id and d.b = s.c_id;

--- when the GROUP BY contains a column that is not in the join condition, join
--- removal takes place too, due to "unique keys" (Note: as of 9.6, we notice
--- that b.id is a primary key and so drop b.c_id from the GROUP BY of the
--- resulting plan; but this happens too late for join removal in the outer
--- plan level.)
-explain (costs off, verbose on)
+-- join removal is not possible when the GROUP BY contains a column that is
+-- not in the join condition. (Note: as of 9.6, we notice that b.id is a
+-- primary key and so drop b.c_id from the GROUP BY of the resulting plan;
+-- but this happens too late for join removal in the outer plan level.)
+explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
   on d.a = s.id;

 -- similarly, but keying off a DISTINCT clause
-explain (costs off, verbose on)
+explain (costs off)
 select d.* from d left join (select distinct * from b) s
   on d.a = s.id;
--
2.44.0

>From cb0166a3e1f1f5ff88634c38c2474de16825625a Mon Sep 17 00:00:00 2001
From: Antonin Houska
Date: Mon, 29 Apr 2024 08:24:42 +0200
Subject: [PATCH 2/2] Use Unique Keys to remove a useless left join.

To consider an outer join useless, we need to prove that the inner relation
is unique for given join clause(s). We may be unable to do that at an early
stage of planning, especially if the inner relation is a subquery. The new
"unique keys" feature (still under development) can resolve this problem by
re-trying the removal at a later stage, when the unique keys have (possibly)
been initialized.
---
 src/backend/optimizer/path/joinpath.c     | 35
 src/backend/optimizer/path/joinrels.c     | 10 +
 src/backend/optimizer/plan/analyzejoins.c |  4 +-
 src/include/optimizer/paths.h             |  5 +++
 src/include/optimizer/planmain.h          |  1 +
 src/test/regress/expected/join.out        | 50 +++
 src/test/regress/sql/join.sql             | 13 +++---
 7 files changed, 73 insertions(+), 45 deletions(-)

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimize
Re: UniqueKey v2
niqueness.

2) Check if the join relation is single-row

In order to get rid of the dependency on 'restrictlist', I think you can use
ECs. Consider a query from your regression tests:

CREATE TABLE uk_t (id int primary key, a int not null, b int not null, c int,
d int, e int);

SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and t1.id = 1;

The idea here seems to be that no more than one row of t1 matches the query
clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id ensures that
no more than one row of t2 matches the query (because t1 cannot provide the
clause with more than one input value of 'e'). And therefore, the join also
produces at most one row.

My theory is that a relation is single-row if it has a UK such that each of
its ECs meets at least one of the following conditions:

a) contains a constant

b) contains a column of a relation which has already been proven single-row.

c) is referenced by a UK of a relation which has already been proven
single-row.

I think that in the example above, an EC {t1.e, t2.id} should exist. So when
checking whether 't2' is single-row, the condition b) can be used: the UK of
't2' should reference the EC {t1.e, t2.id}, which in turn contains the column
t1.e. And 't1' is unique because its EC meets the condition a). (Of course
you need to check em_jdomain before you use a particular EM.)

Are you going to submit the patch to the first CF of PG 18? Please let me
know if I can contribute to the effort by reviewing or writing some code.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

>From 9fe85dae62ae580f31935ee399e7f46dc99b Mon Sep 17 00:00:00 2001
From: Antonin Houska
Date: Thu, 18 Apr 2024 14:43:02 +0200
Subject: [PATCH 1/5] Unique keys rebased.
The original patch:
https://www.postgresql.org/message-id/871qczm9oc.fsf%40163.com
---
 src/backend/nodes/list.c                    |  17 +
 src/backend/optimizer/path/Makefile         |   3 +-
 src/backend/optimizer/path/README.uniquekey | 220 +++
 src/backend/optimizer/path/allpaths.c       |   2 +
 src/backend/optimizer/path/equivclass.c     |  48 ++
 src/backend/optimizer/path/joinrels.c       |   2 +
 src/backend/optimizer/path/uniquekey.c      | 654
 src/backend/optimizer/plan/initsplan.c      |  10 +-
 src/backend/optimizer/plan/planner.c        |  33 +
 src/backend/optimizer/util/plancat.c        |  10 +
 src/backend/optimizer/util/relnode.c        |   2 +
 src/include/nodes/pathnodes.h               |  19 +
 src/include/nodes/pg_list.h                 |   2 +
 src/include/optimizer/paths.h               |  13 +
 src/test/regress/expected/join.out          |  11 +-
 src/test/regress/expected/uniquekey.out     | 268
 src/test/regress/parallel_schedule          |   2 +-
 src/test/regress/sql/uniquekey.sql          | 104
 18 files changed, 1410 insertions(+), 10 deletions(-)
 create mode 100644 src/backend/optimizer/path/README.uniquekey
 create mode 100644 src/backend/optimizer/path/uniquekey.c
 create mode 100644 src/test/regress/expected/uniquekey.out
 create mode 100644 src/test/regress/sql/uniquekey.sql

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index e2615ab105..20eb1267eb 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -695,6 +695,23 @@ list_member_ptr(const List *list, const void *datum)
 	return false;
 }

+/*
+ * Return the index of the datum in the list if found, otherwise return -1.
+ */
+int
+list_member_ptr_pos(const List *list, const void *datum)
+{
+	ListCell   *lc;
+
+	foreach(lc, list)
+	{
+		if (lfirst(lc) == datum)
+			return foreach_current_index(lc);
+	}
+
+	return -1;
+}
+
 /*
  * Return true iff the integer 'datum' is a member of the list.
  */
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..63cc1505d9 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekey.o

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/README.uniquekey b/src/backend/optimizer/path/README.uniquekey
new file mode 100644
index 00..be13b113b9
--- /dev/null
+++ b/src/backend/optimizer/path/README.uniquekey
@@ -0,0 +1,220 @@
+Here is a design document and a part of implementation.
+
+What is UniqueKey?
+-
+
+UniqueKey represents uniqueness information for a RelOptInfo. For
+example:
+
+SELECT id FROM t;
+
+where the id is the UniqueKey for the RelOptInfo (t). In the real world,
+it has the following attributes:
+
+1). It should be EquivalenceClass based. For example:
+
+SELECT a FROM t WHERE a = id;
+
+In this case, the UniqueKey should be 1 EC with two members
+- EC(EM(a), EM(id)).
+
+
+2). Each UniqueKey may be made up with 1+ EquivalenceClass. For example:
+
+CREATE TABLE t(a int not null, b
Re: why there is not VACUUM FULL CONCURRENTLY?
Alvaro Herrera wrote:
> This is great to hear.
>
> On 2024-Jan-31, Antonin Houska wrote:
>
> > Is your plan to work on it soon or should I try to write a draft patch?
> > (I assume this is for PG >= 18.)
>
> I don't have plans for it, so if you have resources, please go for it.

ok, I'm thinking how the feature can be integrated into the core.

BTW, I'm failing to understand why cluster_rel() has no argument of the
BufferAccessStrategy type. According to buffer/README, the criterion for
using a specific strategy is that a page "is unlikely to be needed again
soon". Specifically for cluster_rel(), the page will *definitely* not be used
again (unless the VACUUM FULL/CLUSTER command fails): BufferTag contains the
relation file number, and the old relation file is eventually dropped. Am I
missing anything?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
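A minimal sketch of what passing a strategy down might look like.
GetAccessStrategy(), BAS_BULKWRITE and FreeAccessStrategy() are the existing
buffer-strategy API (bufmgr.h); the extra cluster_rel() parameter is the
hypothetical change being asked about, not an existing signature:

```c
/*
 * Sketch: use a bulk-write ring buffer for the rewrite so the old heap's
 * pages don't evict the rest of shared buffers.  The cluster_rel()
 * signature shown here is hypothetical.
 */
BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKWRITE);

cluster_rel(tableOid, indexOid, &params, bstrategy);	/* hypothetical */

FreeAccessStrategy(bstrategy);
```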
pg_language(lanispl) column apparently not needed
I couldn't find a reference to the 'lanispl' attribute, so I removed it (see
the diff attached) and the master branch compiled cleanly. Is there yet a
reason to keep it?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/commands/proclang.c b/src/backend/commands/proclang.c
index c849d65e62..d353b4e3f7 100644
--- a/src/backend/commands/proclang.c
+++ b/src/backend/commands/proclang.c
@@ -112,7 +112,6 @@ CreateProceduralLanguage(CreatePLangStmt *stmt)
 	namestrcpy(&langname, languageName);
 	values[Anum_pg_language_lanname - 1] = NameGetDatum(&langname);
 	values[Anum_pg_language_lanowner - 1] = ObjectIdGetDatum(languageOwner);
-	values[Anum_pg_language_lanispl - 1] = BoolGetDatum(true);
 	values[Anum_pg_language_lanpltrusted - 1] = BoolGetDatum(stmt->pltrusted);
 	values[Anum_pg_language_lanplcallfoid - 1] = ObjectIdGetDatum(handlerOid);
 	values[Anum_pg_language_laninline - 1] = ObjectIdGetDatum(inlineOid);
diff --git a/src/include/catalog/pg_language.h b/src/include/catalog/pg_language.h
index 581372addd..7ff3fb0d08 100644
--- a/src/include/catalog/pg_language.h
+++ b/src/include/catalog/pg_language.h
@@ -36,9 +36,6 @@ CATALOG(pg_language,2612,LanguageRelationId)
 	/* Language's owner */
 	Oid			lanowner BKI_DEFAULT(POSTGRES) BKI_LOOKUP(pg_authid);

-	/* Is a procedural language */
-	bool		lanispl BKI_DEFAULT(f);
-
 	/* PL is trusted */
 	bool		lanpltrusted BKI_DEFAULT(f);
Re: why there is not VACUUM FULL CONCURRENTLY?
Alvaro Herrera wrote:
> On 2024-Jan-30, Pavel Stehule wrote:
>
> > One of my customers today is reducing one table from 140GB to 20GB. Now
> > he is able to run archiving. He has to use pg_repack, and it is working
> > well today, but I ask myself whether what pg_repack does would be hard
> > to do internally, because it should be done for REINDEX CONCURRENTLY
> > anyway. This is not a common task, and will not be, but on the other
> > hand it can be a nice-to-have feature, and maybe not too hard to
> > implement today. But I didn't try it.
>
> FWIW a newer, more modern and more trustworthy alternative to pg_repack
> is pg_squeeze, which I discovered almost by random chance, and soon
> discovered I liked it much more.
>
> So thinking about your question, I think it might be possible to
> integrate a tool that works like pg_squeeze, such that it runs when
> VACUUM is invoked -- either under some new option, or just replace the
> code under FULL, not sure. If the Cybertec people allow it, we could
> just grab the pg_squeeze code and add it to the things that VACUUM can
> run.

There are no objections from Cybertec. Nevertheless, I don't expect much code
to be just copied & pasted. If I started to implement the extension today,
I'd do some things in a different way. (Some things might actually be simpler
in the core, i.e. a few small changes in PG core are easier than the related
workarounds in the extension.)

The core idea is that:

1) a "historic snapshot" is used to get the current contents of the table,

2) logical decoding is used to capture the changes done while the data is
being copied to the new storage,

3) the exclusive lock on the table is only taken for a very short time, to
swap the storage (relfilenode) of the table.

I think it should be coded in a way that allows use by VACUUM FULL, CLUSTER,
and possibly some subcommands of ALTER TABLE. For example, some users of
pg_squeeze requested an enhancement that allows the user to change a column
data type w/o service disruption (typically when it appears that an integer
type is going to overflow and a change to bigint is needed). Online
(re)partitioning could be another use case, although I admit that commands
which change the system catalog are a bit harder to implement than VACUUM
FULL / CLUSTER.

One thing that pg_squeeze does not handle is visibility: it uses
heap_insert() to insert the tuples into the new storage, so the problems
described in [1] can appear. The in-core implementation should rather do
something like tuple rewriting (rewriteheap.c).

Is your plan to work on it soon or should I try to write a draft patch? (I
assume this is for PG >= 18.)

[1] https://www.postgresql.org/docs/current/mvcc-caveats.html

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
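The three phases described above can be sketched as pseudo-C. Apart from
LockRelationOid(), every function name here is hypothetical; this only
illustrates the intended control flow, not pg_squeeze's actual code:

```c
/*
 * Pseudocode sketch of a pg_squeeze-style concurrent table rewrite.
 * All helper names are hypothetical.
 */
static void
rewrite_table_concurrently(Relation rel)
{
	/* 1) Copy the existing rows using a historic snapshot. */
	snapshot = build_historic_snapshot(decoding_ctx);
	copy_table_contents(rel, new_storage, snapshot);

	/* 2) Apply the changes captured by logical decoding meanwhile. */
	while (!caught_up(decoding_ctx))
		apply_decoded_changes(decoding_ctx, new_storage);

	/* 3) Swap the relfilenode under a short AccessExclusiveLock. */
	LockRelationOid(RelationGetRelid(rel), AccessExclusiveLock);
	apply_decoded_changes(decoding_ctx, new_storage);	/* final catch-up */
	swap_relation_files(rel, new_storage);
	/* lock is released at end of transaction; old storage dropped later */
}
```

The point of the design is that only step 3 blocks concurrent DML, and it is
nearly instantaneous regardless of table size.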
cost_incremental_sort() and limit_tuples
I think that cost_incremental_sort() does not account for the limit_tuples
argument properly. Attached is my proposal to fix the problem.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d6ceafd51c..829af80ea7 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2036,6 +2036,36 @@ cost_incremental_sort(Path *path,
 										  NULL, NULL);

 	group_tuples = input_tuples / input_groups;
+
+	/*
+	 * Do we have a useful LIMIT?
+	 *
+	 * Unlike the full sort, which must read all the input tuples regardless
+	 * of the limit, the incremental sort only needs to read the groups
+	 * containing at least limit_tuples tuples in total. In other words, the
+	 * number of input tuples is almost identical to the number of output
+	 * tuples. Therefore we apply the limit to the *input* set.
+	 */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+	{
+		double		input_tuples_orig = input_tuples;
+
+		/*
+		 * We may need fewer groups, but there must be enough to accommodate
+		 * limit_tuples.
+		 */
+		input_groups = limit_tuples / group_tuples;
+		input_groups = ceil(input_groups);
+
+		/* Fewer input groups implies fewer input tuples. */
+		input_tuples = input_groups * group_tuples;
+		/* XXX Should we instead apply ceil() to group_tuples above? */
+		input_tuples = ceil(input_tuples);
+
+		/* Also adjust input_run_cost. */
+		input_run_cost /= input_tuples_orig / input_tuples;
+	}
+
 	group_input_run_cost = input_run_cost / input_groups;

 	/*
@@ -2044,7 +2074,7 @@ cost_incremental_sort(Path *path,
 	 */
 	cost_tuplesort(&group_startup_cost, &group_run_cost,
 				   group_tuples, width, comparison_cost, sort_mem,
-				   limit_tuples);
+				   -1);

 	/*
 	 * Startup cost of incremental sort is the startup cost of its first group
Stop the search once replication origin is found
Although it's not performance-critical, I think it just makes sense to break
the loop in replorigin_session_setup() as soon as we've found the origin.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b0255ffd25..460e3dcc38 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -1144,6 +1144,7 @@ replorigin_session_setup(RepOriginId node, int acquired_by)

 		/* ok, found slot */
 		session_replication_state = curstate;
+		break;
 	}
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
Thomas Munro wrote:
> On Wed, Aug 16, 2023 at 11:18 PM Antonin Houska wrote:
> > I try to understand this patch (commit 5ffb7c7750) because I use a
> > condition variable in an extension. One particular problem occurred to
> > me, please consider:
> >
> > ConditionVariableSleep() gets interrupted, so AbortTransaction() calls
> > ConditionVariableCancelSleep(), but the signal was sent in between.
> > Shouldn't at least AbortTransaction() and AbortSubTransaction() check
> > the return value of ConditionVariableCancelSleep(), and re-send the
> > signal if needed?
>
> I wondered about that in the context of our only in-tree user of
> ConditionVariableSignal(), in parallel btree index creation, but since
> it's using the parallel executor infrastructure, any error would be
> propagated everywhere so all waits would be aborted.

I see, ConditionVariableSignal() is currently used only to signal other
workers running in the same transaction. The other parts use
ConditionVariableBroadcast(), so no consumer should miss its signal.

> > Note that I'm just thinking about such a problem, did not try to
> > reproduce it.
>
> Hmm. I looked for users of ConditionVariableSignal() in the usual web
> tools and didn't find anything, so I guess your extension is not
> released yet or not open source. I'm curious: what does it actually
> do if there is an error in a CV-wakeup-consuming backend? I guess it
> might be some kind of work-queue processing system... it seems
> inevitable that if backends are failing with errors, and you don't
> respond by retrying/respawning, you'll lose or significantly delay
> jobs/events/something anyway (imagine only slightly different timing:
> you consume the signal and start working on a job and then ereport,
> which amounts to the same thing in the end now that your transaction
> is rolled back?), and when you retry you'll see whatever condition was
> waited for anyway. But that's just me imagining what some
> hypothetical strawman system might look like... what does it really
> do?

If you're interested, the extension is pg_squeeze [1]. I think the use case
is rather special. All the work is done by a background worker, but a user
function can be called to submit a "task" for the worker and wait for its
completion. So the function sleeps on a CV and the worker uses the CV to wake
it up. If this function ends due to ERROR, the user is supposed to find a log
message in the worker output sooner or later. It may sound weird, but that
function exists primarily for regression tests, so ERROR is a problem anyway.

> (FWIW when I worked on a couple of different work queue-like systems
> and tried to use ConditionVariableSignal() I eventually concluded that
> it was the wrong tool for the job because its wakeup order is
> undefined. It's actually FIFO, but I wanted LIFO so that workers have
> a chance to become idle and reduce the pool size, but I started to
> think that once you want that level of control you really want to
> build a bespoke wait list system, so I never got around to proposing
> that we consider changing that.)
>
> Our condition variables are weird. They're not associated with a
> lock, so we made start-of-wait non-atomic: prepare first, then return
> control and let the caller check its condition, then sleep. Typical
> user space condition variable APIs force you to acquire some kind of
> lock that protects the condition first, then check the condition, then
> atomically release-associated-lock-and-start-sleeping, so there is no
> data race but also no time where control is returned to the caller but
> the thread is on the wait list consuming signals. That choice has
> some pros (you can choose whatever type of lock you want to protect
> your condition, or none at all if you can get away with memory
> barriers and magic) and cons. However, as I think Andres was getting
> at, having a non-atomic start-of-wait doesn't seem to require us to
> have a non-atomic end-of-wait and associated extra contention. So
> maybe we should figure out how to fix that in 17.

Thanks for sharing your point of view. I'm fine with this low-level approach:
it's well documented and there are examples in the PG code showing how it
should be used :-)

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
https://github.com/cybertec-postgresql/pg_squeeze/
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
Thomas Munro wrote: > On Tue, Aug 15, 2023 at 2:23 AM Tomas Vondra > wrote: > > I'm not familiar with the condition variable code enough to have an > > opinion, but the patch seems to resolve the issue for me - I can no > > longer reproduce the high CPU usage. > > Thanks, pushed. I am trying to understand this patch (commit 5ffb7c7750) because I use a condition variable in an extension. One particular problem occurred to me, please consider: ConditionVariableSleep() gets interrupted, so AbortTransaction() calls ConditionVariableCancelSleep(), but the signal was sent in between. Shouldn't at least AbortTransaction() and AbortSubTransaction() check the return value of ConditionVariableCancelSleep(), and re-send the signal if needed? Note that I'm just thinking about such a problem; I did not try to reproduce it. -- Antonin Houska Web: https://www.cybertec-postgresql.com
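The race described above can be sketched with a toy model in plain Python (the names are illustrative, not PostgreSQL's C API): waiter "a" is interrupted, the signal arrives before its cleanup runs, and the cancel call reports the absorbed signal so the cleanup path can forward it, which is what ConditionVariableCancelSleep()'s return value allows after commit 5ffb7c7750.

```python
from collections import deque

# Toy model of a condition variable with a FIFO wait list.
class ToyCV:
    def __init__(self):
        self.wait_list = deque()

    def prepare_to_sleep(self, proc):
        self.wait_list.append(proc)

    def signal(self):
        # Wake the longest-waiting process, if any.
        if self.wait_list:
            self.wait_list.popleft()["signaled"] = True

    def cancel_sleep(self, proc):
        """Return True iff this waiter had already absorbed a signal."""
        if proc in self.wait_list:
            self.wait_list.remove(proc)   # still waiting: nothing absorbed
            return False
        return proc["signaled"]

cv = ToyCV()
a = {"signaled": False}
b = {"signaled": False}
cv.prepare_to_sleep(a)
cv.prepare_to_sleep(b)

cv.signal()                    # wakes "a", which is about to ERROR out
absorbed = cv.cancel_sleep(a)  # cleanup, as in AbortTransaction()
if absorbed:
    cv.signal()                # the suggested re-send, so "b" still wakes
```

Without the re-send in the error path, "b" would sleep forever even though a signal was delivered.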
Shouldn't cost_append() also scale the partial path's cost?
Like in cost_seqscan(), I'd expect the subpath cost to be divided among the parallel workers. The patch below shows what I mean. Am I right?

-- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ef475d95a1..5427822e0e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2313,19 +2313,29 @@ cost_append(AppendPath *apath)
 			 * Apply parallel divisor to subpaths.  Scale the number of rows
 			 * for each partial subpath based on the ratio of the parallel
 			 * divisor originally used for the subpath to the one we adopted.
-			 * Also add the cost of partial paths to the total cost, but
-			 * ignore non-partial paths for now.
+			 * Also add the scaled cost of partial paths to the total cost,
+			 * but ignore non-partial paths for now.
 			 */
 			if (i < apath->first_partial_path)
 				apath->path.rows += subpath->rows / parallel_divisor;
 			else
 			{
 				double		subpath_parallel_divisor;
+				double		scale_factor;
+				Cost		run_cost;
 
 				subpath_parallel_divisor = get_parallel_divisor(subpath);
-				apath->path.rows += subpath->rows * (subpath_parallel_divisor /
-													 parallel_divisor);
-				apath->path.total_cost += subpath->total_cost;
+				scale_factor = subpath_parallel_divisor / parallel_divisor;
+				apath->path.rows += subpath->rows * scale_factor;
+
+				/*
+				 * XXX run_cost includes both CPU cost, which is divided among
+				 * workers, and disk cost, which is not.  Unfortunately we
+				 * don't have enough information to separate the two, so scale
+				 * the whole run_cost.
+				 */
+				run_cost = subpath->total_cost - subpath->startup_cost;
+				apath->path.total_cost += subpath->startup_cost +
+					run_cost * scale_factor;
 			}
 
 	apath->path.rows = clamp_row_est(apath->path.rows);
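In numbers, the proposed change amounts to the following (a Python sketch with made-up cost values; note that get_parallel_divisor() returns values such as 2.4 rather than the worker count, because the leader contributes a fraction of a worker):

```python
# Sketch of the proposed scaling: the partial subpath's run cost was
# estimated for its own parallel divisor, so when the Append adopts a
# different divisor, the run cost is rescaled by the ratio while the
# startup cost is charged unchanged.
def scaled_subpath_cost(startup_cost, total_cost,
                        subpath_parallel_divisor, parallel_divisor):
    run_cost = total_cost - startup_cost
    scale_factor = subpath_parallel_divisor / parallel_divisor
    return startup_cost + run_cost * scale_factor

# e.g. a subpath costed under a divisor of 2.4, re-used under 4.4:
cost = scaled_subpath_cost(10.0, 110.0, 2.4, 4.4)
```

With more workers adopted (larger divisor), the charged run cost shrinks; the unpatched code would instead add the full 110.0.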
Re: Privileges on PUBLICATION
Gregory Stark (as CFM) wrote: > FYI this looks like it needs a rebase due to a conflict in copy.c and > an offset in pgoutput.c. > > Is there anything specific that still needs review or do you think > you've handled all Peter's concerns? In particular, is there "a > comprehensive description of what it is trying to do"? :) I tried to improve the documentation and commit messages in v05. v06 (just rebased) is attached. -- Antonin Houska Web: https://www.cybertec-postgresql.com >From d4490664ec80f52d23c4345eec5771764bcdbb17 Mon Sep 17 00:00:00 2001 From: Antonin Houska Date: Wed, 15 Mar 2023 04:21:01 +0100 Subject: [PATCH 1/2] Move some code into functions. This is only a preparation for a patch that introduces USAGE privilege on publications. It should make the following diff a little bit easier to read. --- src/backend/catalog/pg_publication.c| 236 +--- src/backend/commands/copy.c | 81 +-- src/backend/commands/copyto.c | 89 src/backend/replication/pgoutput/pgoutput.c | 139 +--- src/include/catalog/pg_publication.h| 6 + src/include/commands/copy.h | 2 + 6 files changed, 308 insertions(+), 245 deletions(-) diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c index a98fcad421..7f6024b7a5 100644 --- a/src/backend/catalog/pg_publication.c +++ b/src/backend/catalog/pg_publication.c @@ -1025,6 +1025,208 @@ GetPublicationByName(const char *pubname, bool missing_ok) return OidIsValid(oid) ? GetPublication(oid) : NULL; } +/* + * Get the mapping for given publication and relation. + */ +void +GetPublicationRelationMapping(Oid pubid, Oid relid, + Datum *attrs, bool *attrs_isnull, + Datum *qual, bool *qual_isnull) +{ + Publication *publication; + HeapTuple pubtuple = NULL; + Oid schemaid = get_rel_namespace(relid); + + publication = GetPublication(pubid); + + /* + * We don't consider row filters or column lists for FOR ALL TABLES or + * FOR TABLES IN SCHEMA publications. 
+ */ + if (!publication->alltables && + !SearchSysCacheExists2(PUBLICATIONNAMESPACEMAP, + ObjectIdGetDatum(schemaid), + ObjectIdGetDatum(publication->oid))) + pubtuple = SearchSysCacheCopy2(PUBLICATIONRELMAP, + ObjectIdGetDatum(relid), + ObjectIdGetDatum(publication->oid)); + + if (HeapTupleIsValid(pubtuple)) + { + /* Lookup the column list attribute. */ + *attrs = SysCacheGetAttr(PUBLICATIONRELMAP, pubtuple, + Anum_pg_publication_rel_prattrs, + attrs_isnull); + + /* Null indicates no filter. */ + *qual = SysCacheGetAttr(PUBLICATIONRELMAP, pubtuple, +Anum_pg_publication_rel_prqual, +qual_isnull); + } + else + { + *attrs_isnull = true; + *qual_isnull = true; + } +} +/* + * Pick those publications from a list which should actually be used to + * publish given relation and return them. + * + * If publish_as_relid_p is passed, the relation whose tuple descriptor should + * be used to publish the data is stored in *publish_as_relid_p. + * + * If pubactions is passed, update the structure according to the matching + * publications. + */ +List * +GetEffectiveRelationPublications(Oid relid, List *publications, + Oid *publish_as_relid_p, + PublicationActions *pubactions) +{ + Oid schemaId = get_rel_namespace(relid); + List *pubids = GetRelationPublications(relid); + /* + * We don't acquire a lock on the namespace system table as we build the + * cache entry using a historic snapshot and all the later changes are + * absorbed while decoding WAL. + */ + List *schemaPubids = GetSchemaPublications(schemaId); + ListCell *lc; + Oid publish_as_relid = relid; + int publish_ancestor_level = 0; + bool am_partition = get_rel_relispartition(relid); + char relkind = get_rel_relkind(relid); + List *rel_publications = NIL; + + foreach(lc, publications) + { + Publication *pub = lfirst(lc); + bool publish = false; + + /* + * Under what relid should we publish changes in this publication? + * We'll use the top-most relid across all publications. 
Also track + * the ancestor level for this publication. + */ + Oid pub_relid = relid; + int ancestor_level = 0; + + /* + * If this is a FOR ALL TABLES publication, pick the partition root + * and set the ancestor level accordingly. + */ + if (pub->alltables) + { + publish = true; + if (pub->pubviaroot && am_partition) + { +List *ancestors = get_partition_ancestors(relid); + +pub_relid = llast_oid(ancestors); +ancestor_level = list_length(ancestors); + } + } + + if (!publish) + { + bool ancestor_published = false; + + /* + * For a partition, check if any of the ancestors are published. + * If so, note down the topmost ancestor that is published via + * this publication, which will be used as th
Re: Parallelize correlated subqueries that execute within each worker
James Coleman wrote: > On Mon, Feb 6, 2023 at 11:39 AM Antonin Houska wrote: > Attached is v9. ok, I've changed the status to RfC -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Parallelize correlated subqueries that execute within each worker
exactly NULL if empty */
	if (bms_is_empty(outer_relids))
		outer_relids = NULL;

Another question is whether in this call

	simple_gather_path = (Path *)
		create_gather_path(root, rel, cheapest_partial_path, rel->reltarget,
						   required_outer, rowsp);

required_outer should be passed to create_gather_path(). Shouldn't it rather be PATH_REQ_OUTER(cheapest_partial_path) that you test just above?

Again, build_index_paths() initializes outer_relids this way

	outer_relids = bms_copy(rel->lateral_relids);

but then it may add some more relations:

	/* OK to include this clause */
	index_clauses = lappend(index_clauses, iclause);
	outer_relids = bms_add_members(outer_relids, rinfo->clause_relids);

So I think that PATH_REQ_OUTER(cheapest_partial_path) in generate_gather_paths() can eventually contain more relations than required_outer, and therefore it's safer to check the former. Similar comments might apply to generate_useful_gather_paths(). Here I also suggest moving this test

	/* We can't pass params to workers. */
	if (!bms_is_subset(PATH_REQ_OUTER(subpath), rel->relids))
		continue;

to the top of the loop because it's relatively cheap.

-- Antonin Houska Web: https://www.cybertec-postgresql.com
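The point about the two checks can be illustrated with Python sets standing in for Relids bitmapsets (a toy model; the relid numbers are made up):

```python
# rel->lateral_relids, with which build_index_paths() starts:
lateral_relids = {2}
# relids that an accepted index clause may add on top of that:
clause_relids = {2, 3}

# PATH_REQ_OUTER(subpath) can therefore be a strict superset of
# the required_outer derived from lateral_relids alone:
path_req_outer = lateral_relids | clause_relids
required_outer = set(lateral_relids)
assert required_outer < path_req_outer

# The "can't pass params to workers" test from the loop must use the
# larger set; testing only required_outer would be too permissive here:
rel_relids = {1, 2}
can_gather = path_req_outer <= rel_relids   # bms_is_subset(...)
```

With these sample sets, required_outer alone would pass the subset test while path_req_outer would not, which is why checking PATH_REQ_OUTER is the safer option.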
Re: RLS makes COPY TO process child tables
Yugo NAGATA wrote: > On Wed, 01 Feb 2023 12:45:57 +0100 > Antonin Houska wrote: > > > While working on [1] I noticed that if RLS gets enabled, the COPY TO command > > includes the contents of child table into the result, although the > > documentation says it should not: > > > > "COPY TO can be used only with plain tables, not views, and does not > > copy rows from child tables or child partitions. For example, COPY > > table TO copies the same rows as SELECT * FROM ONLY table. The syntax > > COPY (SELECT * FROM table) TO ... can be used to dump all of the rows > > in an inheritance hierarchy, partitioned table, or view." > > > > A test case is attached (rls.sql) as well as fix proposal > > (copy_rls_no_inh.diff). > > I think this is a bug because the current behaviour is different from > the documentation. > > When RLS is enabled on a table in `COPY ... TO ...`, the query is converted > to `COPY (SELECT * FROM ...) TO ...` to allow the rewriter to add in RLS > clauses. This causes to dump the rows of child tables. > > The patch fixes this by setting "inh" of the table in the converted query > to false. This seems reasonable and actually fixes the problem. > > However, I think we would want a comment on the added line. A short comment added, see the new patch version. > Also, the attached test should be placed in the regression test. Hm, I'm not sure it's necessary. It would effectively test whether the 'inh' field works, but if it didn't, many other tests would fail. I discovered the bug by reading the code, so I wanted to demonstrate (also to myself) that it causes incorrect behavior from user perspective. That was the purpose of the test. -- Antonin Houska Web: https://www.cybertec-postgresql.com >From 3a4acb29901a7b53eb32767e30bdfce74b80df8b Mon Sep 17 00:00:00 2001 From: Antonin Houska Date: Thu, 2 Feb 2023 07:31:32 +0100 Subject: [PATCH] Make sure that COPY TO does not process child tables. 
It's expected (and documented) that the child tables are not copied; however, the query generated for RLS did not meet this expectation so far.
---
 src/backend/commands/copy.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index e34f583ea7..539fb682c2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -250,6 +250,9 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 							pstrdup(RelationGetRelationName(rel)),
 							-1);
 
+		/* COPY TO should not process child tables. */
+		from->inh = false;
+
 		/* Build query */
 		select = makeNode(SelectStmt);
 		select->targetList = targetList;
-- 
2.31.1
RLS makes COPY TO process child tables
While working on [1] I noticed that if RLS gets enabled, the COPY TO command includes the contents of child tables in the result, although the documentation says it should not: "COPY TO can be used only with plain tables, not views, and does not copy rows from child tables or child partitions. For example, COPY table TO copies the same rows as SELECT * FROM ONLY table. The syntax COPY (SELECT * FROM table) TO ... can be used to dump all of the rows in an inheritance hierarchy, partitioned table, or view." A test case is attached (rls.sql) as well as a fix proposal (copy_rls_no_inh.diff). [1] https://commitfest.postgresql.org/41/3641/ -- Antonin Houska Web: https://www.cybertec-postgresql.com

create table a(i int);
insert into a values (1);
create table a1() inherits(a);
insert into a1 values (1);

-- Only the parent table is copied, as the documentation claims.
copy a to stdout;

alter table a enable row level security;
create role u;
create policy pol_perm on a as permissive for select to u using (true);
grant select on table a to u;
set role u;

-- Both "a" and "a1" appear in the output.
copy a to stdout;

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index e34f583ea7..3b8c25dadd 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -249,6 +249,7 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 		from = makeRangeVar(get_namespace_name(RelationGetNamespace(rel)),
 							pstrdup(RelationGetRelationName(rel)),
 							-1);
+		from->inh = false;
 
 		/* Build query */
 		select = makeNode(SelectStmt);
Re: Cross-partition UPDATE and foreign table partitions
Antonin Houska wrote: > I was wondering why ExecCrossPartitionUpdateForeignKey() has an unused > argument "oldslot" and wanted to suggest its removal. However, before I did, > it occurred to me that callers may want to pass the whole slot when the > partition is a foreign table, i.e. when the "tupleid" argument cannot be > used. (In that case the problem would be that the function implementation is > incomplete.) > > However, when checking how cross-partition UPDATE works internally for foreign > tables, I saw surprising behavior. The attached script creates partitioned > table "a" with foreign table partitions "a1" and "a2". If you then run the > following commands > > INSERT INTO a VALUES (1), (10); > UPDATE a SET i=11 WHERE i=1; > TABLE a1; > > you'll see that the tuples are correctly routed into the partitions, but the > UPDATE is simply executed on the "a1" partition. Instead, I'd expect it to > delete the tuple from "a1" and insert it into "a2". That looks like a bug. Well, as usually happens, I found the related information as soon as I had sent my report. The documentation of CREATE FOREIGN TABLE says: "However it is not currently possible to move a row from a foreign-table partition to another partition. An UPDATE that would require doing that will fail due to the partitioning constraint, assuming that that is properly enforced by the remote server." So the remaining question is whether the "oldslot" argument of ExecCrossPartitionUpdateForeignKey() will be used in the future or should be removed. Note that ExecUpdateAct() passes its "slot" variable for it, which seems to contain the *new* version of the tuple rather than the old one. Some cleanup may be needed. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Cross-partition UPDATE and foreign table partitions
I was wondering why ExecCrossPartitionUpdateForeignKey() has an unused argument "oldslot" and wanted to suggest its removal. However, before I did, it occurred to me that callers may want to pass the whole slot when the partition is a foreign table, i.e. when the "tupleid" argument cannot be used. (In that case the problem would be that the function implementation is incomplete.) However, when checking how cross-partition UPDATE works internally for foreign tables, I saw surprising behavior. The attached script creates partitioned table "a" with foreign table partitions "a1" and "a2". If you then run the following commands INSERT INTO a VALUES (1), (10); UPDATE a SET i=11 WHERE i=1; TABLE a1; you'll see that the tuples are correctly routed into the partitions, but the UPDATE is simply executed on the "a1" partition. Instead, I'd expect it to delete the tuple from "a1" and insert it into "a2". That looks like a bug. -- Antonin Houska Web: https://www.cybertec-postgresql.com CREATE EXTENSION IF NOT EXISTS postgres_fdw WITH SCHEMA public; CREATE SERVER s1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS ( dbname 'postgres', host 'localhost', port '5432' ); CREATE USER MAPPING FOR CURRENT_ROLE SERVER s1; CREATE TABLE public.a ( i integer NOT NULL ) PARTITION BY RANGE (i); CREATE TABLE public.a1 ( i integer NOT NULL ); CREATE FOREIGN TABLE public.a1_loc ( i integer NOT NULL ) SERVER s1 OPTIONS ( table_name 'a1' ); CREATE TABLE public.a2 ( i integer NOT NULL ); CREATE FOREIGN TABLE public.a2_loc ( i integer NOT NULL ) SERVER s1 OPTIONS ( table_name 'a2' ); ALTER TABLE ONLY public.a ATTACH PARTITION public.a1_loc FOR VALUES FROM (0) TO (10); ALTER TABLE ONLY public.a ATTACH PARTITION public.a2_loc FOR VALUES FROM (10) TO (20); ALTER TABLE ONLY public.a1 ADD CONSTRAINT a1_pkey PRIMARY KEY (i); ALTER TABLE ONLY public.a2 ADD CONSTRAINT a2_pkey PRIMARY KEY (i);
Re: grouping pushdown
David Rowley wrote: > On Wed, 4 Jan 2023 at 23:21, Spring Zhong wrote: > > The plan is apparently inefficient, since the hash aggregate goes after the > > Cartesian product. We could expect the query's performance to get much > > improved if the HashAggregate node can be pushed down to the SCAN node. > > > Does anyone have suggestions on this? > > I think this is being worked on. See [1]. Well, the current version of that patch requires the query to contain at least one aggregate. It shouldn't be a big deal to modify it. However, note that this feature pushes the aggregate/grouping only to one side of the join (a "fake" aggregate count(*) added to the query):

SET enable_agg_pushdown TO on;
EXPLAIN select i1, i2, count(*) from t1, t2 group by i1, i2;

                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=440.02..440.04 rows=1 width=16)
   Group Key: t1.i1, t2.i2
   ->  Sort  (cost=440.02..440.02 rows=1 width=16)
         Sort Key: t1.i1, t2.i2
         ->  Nested Loop  (cost=195.00..440.01 rows=1 width=16)
               ->  Partial HashAggregate  (cost=195.00..195.01 rows=1 width=12)
                     Group Key: t1.i1
                     ->  Seq Scan on t1  (cost=0.00..145.00 rows=1 width=4)
               ->  Seq Scan on t2  (cost=0.00..145.00 rows=1 width=4)

If both sides should be grouped, finalization of the partial aggregates would be more difficult, and I'm not sure it'd be worth the effort.

> [1] https://commitfest.postgresql.org/41/3764/

-- Antonin Houska Web: https://www.cybertec-postgresql.com
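The shape of the plan above can be modeled in a few lines of Python (a toy model with made-up data, not the patch's code): t1 is partially aggregated by i1 below the join, the cross join with t2 replicates each partial count, and the Finalize step groups by (i1, i2).

```python
from collections import Counter
from itertools import product

# Toy model of the one-sided aggregate pushdown shown in the plan.
t1 = [1, 1, 2]          # t1.i1 values
t2 = [10, 20]           # t2.i2 values

partial = Counter(t1)   # Partial HashAggregate on t1: i1 -> partial count
finalize = Counter()
for (i1, cnt), i2 in product(partial.items(), t2):
    # Nested Loop feeding Finalize GroupAggregate: each partial count
    # is carried across the join and summed per (i1, i2) group.
    finalize[(i1, i2)] += cnt

# The same result as grouping the plain Cartesian product:
expected = Counter(product(t1, t2))
```

The model also shows why grouping both sides is harder: the partial counts of the two inputs would have to be multiplied, not just summed, during finalization.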
Re: Privileges on PUBLICATION
Antonin Houska wrote: > Antonin Houska wrote: > > > Peter Eisentraut wrote: > > > > > On 04.11.22 08:28, Antonin Houska wrote: > > > > I thought about the whole concept a bit more and I doubt if the > > > > PUBLICATION > > > > privilege is the best approach. In particular, the user specified in > > > > CREATE > > > > SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have > > > > SELECT > > > > privilege on the tables replicated. So if the DBA excludes some columns > > > > from > > > > the publication's column list and sets the (publication) privileges in > > > > such a > > > > way that the user cannot get the column values via other publications, > > > > the > > > > user still can connect to the database directly and get values of the > > > > excluded > > > > columns. > > > > > > Why are the SELECT privileges needed? Maybe that's something to think > > > about > > > and maybe change. > > > > I haven't noticed an explanation in comments nor did I search in the mailing > > list archives, but the question makes sense: the REPLICATION attribute of a > > role is sufficient for streaming replication, so why should the logical > > replication require additional privileges? > > > > Technically the SELECT privilege is needed because the sync worker does > > actually execute SELECT query on the published tables. However, I realize > > now > > that it's not checked by the output plugin. Thus if SELECT is revoked from > > the > > "subscription user" after the table has been synchronized, the replication > > continues to work. So the necessity for the SELECT privilege might be an > > omission rather than a design choice. (Even the documentation says that the > > SELECT privilege is needed only for the initial synchronization [1], however > > it does not tell why.) 
> > > > > > As an alternative to the publication privileges, I think that the CREATE > > > > SUBSCRIPTION command could grant ACL_SELECT automatically to the > > > > subscription > > > > user on the individual columns contained in the publication column > > > > list, and > > > > DROP SUBSCRIPTION would revoke that privilege. > > > > > > I think that approach is weird and unusual. Privileges and object > > > creation > > > should be separate operations. > > > > ok. Another approach would be to skip the check for the SELECT privilege (as > > well as the check for the USAGE privilege on the corresponding schema) if > > given column is being accessed via a publication which has it on its column > > list and if the subscription user has the USAGE privilege on that > > publication. > > > > So far I wasn't sure if we can do that because, if pg_upgrade grants the > > USAGE > > privilege on all publications to the "public" role, the DBAs who relied on > > the > > SELECT privileges might not notice that any role having the REPLICATION > > attribute can access all the published tables after the upgrade. (pg_upgrade > > can hardly do anything else because it has no information on the > > "subscription > > users", so it cannot convert the SELECT privilege on tables to the USAGE > > privileges on publications.) > > > > But now that I see that the logical replication doesn't check the SELECT > > privilege properly anyway, I think we can get rid of it. > > The attached version tries to do that - as you can see in 0001, the SELECT > privilege is not required for the walsender process. > > I also added a PUBLICATION_NAMES option to the COPY TO command so that the > publisher knows which publications are subject to the ACL check. Only data of > those publications are returned to the subscriber. (In the previous patch > version the ACL checks were performed on the subscriber side, but I think > that's not ideal in terms of security.)
> > I also added the regression tests for publications, enhanced psql (the \dRp+ > command) so that it displays the publication ACL and added a few missing > pieces of documentation. > This is v4. The patch had to be rebased due to the commit 369f09e420. -- Antonin Houska Web: https://www.cybertec-postgresql.com publication_privileges_v04.tgz Description: application/gzip
Re: refactor ExecGrant_*() functions
Peter Eisentraut wrote: > On 12.12.22 10:44, Antonin Houska wrote: > > Peter Eisentraut wrote: > > > >> On 06.12.22 09:41, Antonin Houska wrote: > >>> Attached are my proposals for improvements. One is to avoid memory leak, > >>> the > >>> other tries to improve readability a little bit. > >> > >> I added the readability improvement to my v2 patch. The pfree() calls > >> aren't > >> necessary AFAICT. > > It's something to consider, but since this is a refactoring patch and the old > code didn't do it either, I think it's out of scope. Well, the reason I brought this topic up is that the old code didn't even palloc() those arrays. (Because they were located on the stack.) > > I see that memory contexts exist and that the amount of memory freed is not > > huge, but my style is to free the memory explicitly if it's allocated in a > > loop. > > v2 looks good to me. > > Committed, thanks. ok, I'll post the rebased "USAGE privilege on PUBLICATION" patch [1] soon. -- Antonin Houska Web: https://www.cybertec-postgresql.com [1] https://commitfest.postgresql.org/41/3641/
Re: refactor ExecGrant_*() functions
Peter Eisentraut wrote: > On 06.12.22 09:41, Antonin Houska wrote: > > Attached are my proposals for improvements. One is to avoid memory leak, the > > other tries to improve readability a little bit. > > I added the readability improvement to my v2 patch. The pfree() calls aren't > necessary AFAICT. I see that memory contexts exist and that the amount of memory freed is not huge, but my style is to free the memory explicitly if it's allocated in a loop. v2 looks good to me. -- Antonin Houska Web: https://www.cybertec-postgresql.com
sendFileWithContent() does not advance the source pointer
When checking something else in the base backup code, I've noticed that sendFileWithContent() does not advance the 'content' pointer. The sink buffer is large enough (32kB) so that the first iteration usually processes the whole file (only special files are processed by this function), and thus the problem is hidden. However it's possible to hit the issue: if there are too many tablespaces, pg_basebackup generates a corrupted tablespace_map. Instead of writing all the tablespace paths, it writes only some and then starts to write the contents from the beginning again. The attached script generates scripts to create many tablespaces as well as the underlying directories. The fix is attached here as well. -- Antonin Houska Web: https://www.cybertec-postgresql.com

#!/bin/bash

TBSPDIR=/mnt/ramdisk/tbspcs
TBSPCOUNT=2048
SCRIPT_SH=create.sh
SCRIPT_SQL=create.sql

echo "#!/bin/bash" > ${SCRIPT_SH}
echo "mkdir -p $TBSPDIR" >> ${SCRIPT_SH}
echo "" > ${SCRIPT_SQL}

i=0
while [ $i -lt $TBSPCOUNT ]; do
    printf "mkdir %s/%.4d\n" $TBSPDIR $i >> ${SCRIPT_SH}
    printf "CREATE TABLESPACE tbsp_%.4d LOCATION '%s/%.4d';\n" $i $TBSPDIR $i >> ${SCRIPT_SQL}
    i=$((i+1))
done

chmod u+x ${SCRIPT_SH}

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 74fb529380..7aa5f6e44d 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1073,6 +1073,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
 		memcpy(sink->bbs_buffer, content, nbytes);
 		bbsink_archive_contents(sink, nbytes);
 		bytes_done += nbytes;
+		content += nbytes;
 	}
 
 	_tarWritePadding(sink, len);
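The failure mode, and the one-line fix, can be demonstrated with a Python model of the chunked copy loop (a sketch, not the C code):

```python
# Model of the bug: copying a buffer in sink-sized chunks but never
# advancing the source pointer repeats the first chunk over and over,
# which is how tablespace_map gets corrupted once it exceeds the sink.
def send_with_content_buggy(content, sink_size):
    out = bytearray()
    remaining = len(content)
    while remaining > 0:
        nbytes = min(remaining, sink_size)
        out += content[:nbytes]        # always reads from the start
        remaining -= nbytes
    return bytes(out)

def send_with_content_fixed(content, sink_size):
    out = bytearray()
    pos = 0
    while pos < len(content):
        nbytes = min(len(content) - pos, sink_size)
        out += content[pos:pos + nbytes]
        pos += nbytes                  # the fix: advance the source
    return bytes(out)

data = b"tablespace_map line 1\ntablespace_map line 2\n"
corrupted = send_with_content_buggy(data, 8)
restored = send_with_content_fixed(data, 8)
```

With a content shorter than sink_size the buggy version happens to produce correct output in its single iteration, which is why the bug stayed hidden.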
Re: refactor ExecGrant_*() functions
Peter Eisentraut wrote: > Continuing the ideas in [0], this patch refactors the ExecGrant_Foo() > functions and replaces many of them by a common function that is driven by the > ObjectProperty table. > > It would be nice to do more here, for example ExecGrant_Language(), which has > additional non-generalizable checks, but I think this is already a good start. > For example, the work being discussed on privileges on publications [1] would > be able to take good advantage of this. Right, I mostly copied and pasted the code when writing ExecGrant_Publication(). I agree that your refactoring is very useful. Attached are my proposals for improvements. One is to avoid a memory leak, the other tries to improve readability a little bit. > [0]: > https://www.postgresql.org/message-id/95c30f96-4060-2f48-98b5-a4392d3b6...@enterprisedb.com > [1]: https://www.postgresql.org/message-id/flat/20330.1652105397@antos -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index d3121a469f..ac4490c0b8 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -2228,6 +2234,9 @@ ExecGrant_generic(InternalGrant *istmt, Oid classid, AclMode default_privs)
 		newtuple = heap_modify_tuple(tuple, RelationGetDescr(relation), values,
 									 nulls, replaces);
+		pfree(values);
+		pfree(nulls);
+		pfree(replaces);
 
 		CatalogTupleUpdate(relation, &newtuple->t_self, newtuple);

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index d3121a469f..ac4490c0b8 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -2144,6 +2144,7 @@ ExecGrant_generic(InternalGrant *istmt, Oid classid, AclMode default_privs)
 	{
 		Oid			objectid = lfirst_oid(cell);
 		Datum		aclDatum;
+		Datum		nameDatum;
 		bool		isNull;
 		AclMode		avail_goptions;
 		AclMode		this_privileges;
@@ -2197,6 +2198,11 @@ ExecGrant_generic(InternalGrant *istmt, Oid classid, AclMode default_privs)
 								old_acl, ownerId,
 								&grantorId, &avail_goptions);
 
+		nameDatum = SysCacheGetAttr(cacheid, tuple,
+									get_object_attnum_name(classid),
+									&isNull);
+		Assert(!isNull);
+
 		/*
 		 * Restrict the privileges to what we can actually grant, and emit the
 		 * standards-mandated warning and error messages.
 		 */
@@ -2205,7 +2211,7 @@ ExecGrant_generic(InternalGrant *istmt, Oid classid, AclMode default_privs)
 		restrict_and_check_grant(istmt->is_grant, avail_goptions,
 								 istmt->all_privs, istmt->privileges,
 								 objectid, grantorId,
 								 get_object_type(classid, objectid),
-								 NameStr(*DatumGetName(SysCacheGetAttr(cacheid, tuple, get_object_attnum_name(classid), &isNull))),
+								 NameStr(*DatumGetName(nameDatum)),
 								 0, NULL);

 		/*
Re: Privileges on PUBLICATION
Antonin Houska wrote: > Peter Eisentraut wrote: > > > On 04.11.22 08:28, Antonin Houska wrote: > > > I thought about the whole concept a bit more and I doubt if the > > > PUBLICATION > > > privilege is the best approach. In particular, the user specified in > > > CREATE > > > SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have > > > SELECT > > > privilege on the tables replicated. So if the DBA excludes some columns > > > from > > > the publication's column list and sets the (publication) privileges in > > > such a > > > way that the user cannot get the column values via other publications, the > > > user still can connect to the database directly and get values of the > > > excluded > > > columns. > > > > Why are the SELECT privileges needed? Maybe that's something to think about > > and maybe change. > > I haven't noticed an explanation in comments nor did I search in the mailing > list archives, but the question makes sense: the REPLICATION attribute of a > role is sufficient for streaming replication, so why should the logical > replication require additional privileges? > > Technically the SELECT privilege is needed because the sync worker does > actually execute SELECT query on the published tables. However, I realize now > that it's not checked by the output plugin. Thus if SELECT is revoked from the > "subscription user" after the table has been synchronized, the replication > continues to work. So the necessity for the SELECT privilege might be an > omission rather than a design choice. (Even the documentation says that the > SELECT privilege is needed only for the initial synchronization [1], however > it does not tell why.) > > > > As an alternative to the publication privileges, I think that the CREATE > > > SUBSCRIPTION command could grant ACL_SELECT automatically to the > > > subscription > > > user on the individual columns contained in the publication column list, > > > and > > > DROP SUBSCRIPTION would revoke that privilege. 
> > > > I think that approach is weird and unusual. Privileges and object creation > > should be separate operations. > > ok. Another approach would be to skip the check for the SELECT privilege (as > well as the check for the USAGE privilege on the corresponding schema) if > given column is being accessed via a publication which has it on its column > list and if the subscription user has the USAGE privilege on that publication. > > So far I wasn't sure if we can do that because, if pg_upgrade grants the USAGE > privilege on all publications to the "public" role, the DBAs who relied on the > SELECT privileges might not notice that any role having the REPLICATION > attribute can access all the published tables after the upgrade. (pg_upgrade > can hardly do anything else because it has no information on the "subscription > users", so it cannot convert the SELECT privilege on tables to the USAGE > privileges on publications.) > > But now that I see that the logical replication doesn't check the SELECT > privilege properly anyway, I think we can get rid of it. The attached version tries to do that - as you can see in 0001, the SELECT privilege is not required for the walsender process. I also added a PUBLICATION_NAMES option to the COPY TO command so that the publisher knows which publications are subject to the ACL check. Only data of those publications are returned to the subscriber. (In the previous patch version the ACL checks were performed on the subscriber side, but I think that's not ideal in terms of security.) I also added the regression tests for publications, enhanced psql (the \dRp+ command) so that it displays the publication ACL, and added a few missing pieces of documentation. -- Antonin Houska Web: https://www.cybertec-postgresql.com publication_privileges_v03.tgz Description: application/gzip
Re: Privileges on PUBLICATION
Peter Eisentraut wrote: > On 04.11.22 08:28, Antonin Houska wrote: > > I thought about the whole concept a bit more and I doubt if the PUBLICATION > > privilege is the best approach. In particular, the user specified in CREATE > > SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have > > SELECT > > privilege on the tables replicated. So if the DBA excludes some columns from > > the publication's column list and sets the (publication) privileges in such > > a > > way that the user cannot get the column values via other publications, the > > user still can connect to the database directly and get values of the > > excluded > > columns. > > Why are the SELECT privileges needed? Maybe that's something to think about > and maybe change. I haven't noticed an explanation in comments nor did I search in the mailing list archives, but the question makes sense: the REPLICATION attribute of a role is sufficient for streaming replication, so why should the logical replication require additional privileges? Technically the SELECT privilege is needed because the sync worker does actually execute SELECT query on the published tables. However, I realize now that it's not checked by the output plugin. Thus if SELECT is revoked from the "subscription user" after the table has been synchronized, the replication continues to work. So the necessity for the SELECT privilege might be an omission rather than a design choice. (Even the documentation says that the SELECT privilege is needed only for the initial synchronization [1], however it does not tell why.) > > As an alternative to the publication privileges, I think that the CREATE > > SUBSCRIPTION command could grant ACL_SELECT automatically to the > > subscription > > user on the individual columns contained in the publication column list, and > > DROP SUBSCRIPTION would revoke that privilege. > > I think that approach is weird and unusual. Privileges and object creation > should be separate operations. ok. 
Another approach would be to skip the check for the SELECT privilege (as well as the check for the USAGE privilege on the corresponding schema) if a given column is being accessed via a publication which has it on its column list and if the subscription user has the USAGE privilege on that publication. So far I wasn't sure if we can do that because, if pg_upgrade grants the USAGE privilege on all publications to the "public" role, the DBAs who relied on the SELECT privileges might not notice that any role having the REPLICATION attribute can access all the published tables after the upgrade. (pg_upgrade can hardly do anything else because it has no information on the "subscription users", so it cannot convert the SELECT privilege on tables to the USAGE privileges on publications.) But now that I see that the logical replication doesn't check the SELECT privilege properly anyway, I think we can get rid of it. [1] https://www.postgresql.org/docs/current/logical-replication-security.html -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: refactor ownercheck and aclcheck functions
Peter Eisentraut wrote: > These patches take the dozens of mostly-duplicate pg_foo_ownercheck() and > pg_foo_aclcheck() functions and replace (most of) them by common functions > that are driven by the ObjectProperty table. All the required information is > already in that table. > > This is similar to the consolidation of the drop-by-OID functions that we did > a while ago (b1d32d3e3230f00b5baba08f75b4f665c7d6dac6). I've reviewed this patch, as it's related to my patch [1] (in particular, it reduces the size of my patch a little bit). I like the idea of reducing the amount of (almost) copy-and-pasted code. I haven't found any problem in your patch that would be worth mentioning, except that the 0001 part does not apply to the current master branch. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Mark Dilger wrote: > > On Nov 4, 2022, at 12:28 AM, Antonin Houska wrote: > > > > I thought about the whole concept a bit more and I doubt if the PUBLICATION > > privilege is the best approach. In particular, the user specified in CREATE > > SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have > > SELECT > > privilege on the tables replicated. So if the DBA excludes some columns from > > the publication's column list and sets the (publication) privileges in such > > a > > way that the user cannot get the column values via other publications, the > > user still can connect to the database directly and get values of the > > excluded > > columns. > > > > As an alternative to the publication privileges, I think that the CREATE > > SUBSCRIPTION command could grant ACL_SELECT automatically to the > > subscription > > user on the individual columns contained in the publication column list, and > > DROP SUBSCRIPTION would revoke that privilege. > > > > Of course a question is what to do if the replication user already has that > > privilege on some columns: either the CREATE SUBSCRIPTION command should > > raise > > ERROR, or we should introduce a new privilege (e.g. ACL_SELECT_PUB) for this > > purpose, which would effectivelly be ACL_SELECT, but hidden from the users > > of > > GRANT / REVOKE. > > > > If this approach was taken, the USAGE privilege on schema would need to be > > granted / revoked in a similar way. > > When you talk about a user needing to have privileges, it sounds like you > mean privileges on the publishing database. But then you talk about CREATE > SUBSCRIPTION granting privileges, which would necessarily be on the > subscriber database. Can you clarify what you have in mind? Right, the privileges need to be added on the publishing side, but the user that needs those privileges is specified on the subscription side. I didn't think much in detail how it would work. 
The "subscription user" certainly cannot connect to the publisher database and grant the privileges to itself. Perhaps some of the workers on the publisher side could do it on startup. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Amit Kapila wrote: > On Thu, Nov 3, 2022 at 11:12 AM Antonin Houska wrote: > > > > Peter Eisentraut wrote: > > > > > The CF entry is about privileges on publications. Please rebase that > > > patch > > > and repost it so that the CF app and the CF bot are up to date. > > > > The rebased patch (with regression tests added) is attached here. > > > > There's still one design issue that I haven't mentioned yet: if the USAGE > > privilege on a publication is revoked after the synchronization phase > > completed, the missing privilege on a publication causes ERROR in the output > > plugin. If the privilege is then granted, the error does not disappear > > because > > the same (historical) snapshot we use to decode the failed data change again > > is also used to check the privileges in the catalog, so the output plugin > > does > > not see that the privilege has already been granted. > > > > We have a similar problem even when publication is dropped/created. > The replication won't be able to proceed. > > > The only solution seems to be to drop the publication from the subscription > > and add it again, or to drop and re-create the whole subscription. I haven't > > added a note about this problem to the documentation yet, in case someone > > has > > better idea how to approach the problem. > > > > I think one possibility is that the user advances the slot used in > replication by using pg_replication_slot_advance() at or after the > location where the privilege is granted. Some other ideas have been > discussed in the thread [1], in particular, see email [2] and > discussion after that but we didn't reach any conclusion. Thanks for feedback. Regarding the publications, I'm not sure what the user should expect if he revokes the permissions in the middle of processing. In such a situation, he should be aware that some privileged data could already have been replicated. 
My preference in such a situation would be to truncate the table(s) on the subscriber side and to restart the replication, rather than fix the replication and keep the subscriber's access to the leaked data. > [1] - > https://www.postgresql.org/message-id/CAHut%2BPvMbCsL8PAz1Qc6LNoL0Ag0y3YJtPVJ8V0xVXJOPb%2B0xw%40mail.gmail.com > [2] - > https://www.postgresql.org/message-id/CAA4eK1JTwOAniPua04o2EcOXfzRa8ANax%3D3bpx4H-8dH7M2p%3DA%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Peter Eisentraut wrote: > On 03.11.22 01:43, Antonin Houska wrote: > > Peter Eisentraut wrote: > > > >> The CF entry is about privileges on publications. Please rebase that patch > >> and repost it so that the CF app and the CF bot are up to date. > > The rebased patch (with regression tests added) is attached here. > > Some preliminary discussion: > > What is the upgrade strategy? I suppose the options are either that > publications have a default acl that makes them publicly accessible, > thus preserving the existing behavior by default, or pg_dump would need to > create additional GRANT statements when upgrading from pre-PG16. I don't see > anything like either of these mentioned in the patch. What is your plan? So far I considered the first option. > You might be interested in this patch, which relates to yours: > https://commitfest.postgresql.org/40/3955/ ok, I'll check. > Looking at your patch, I would also like to find a way to refactor away the > ExecGrant_Publication() function. I'll think about that. > > I think you should add some tests under src/test/regress/ for the new GRANT > and REVOKE statements, just to have some basic coverage that it works. > sql/publication.sql would be appropriate, I think. I thought about the whole concept a bit more and I doubt if the PUBLICATION privilege is the best approach. In particular, the user specified in CREATE SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have SELECT privilege on the tables replicated. So if the DBA excludes some columns from the publication's column list and sets the (publication) privileges in such a way that the user cannot get the column values via other publications, the user still can connect to the database directly and get values of the excluded columns.
As an alternative to the publication privileges, I think that the CREATE SUBSCRIPTION command could grant ACL_SELECT automatically to the subscription user on the individual columns contained in the publication column list, and DROP SUBSCRIPTION would revoke that privilege. Of course a question is what to do if the replication user already has that privilege on some columns: either the CREATE SUBSCRIPTION command should raise ERROR, or we should introduce a new privilege (e.g. ACL_SELECT_PUB) for this purpose, which would effectively be ACL_SELECT, but hidden from the users of GRANT / REVOKE. If this approach were taken, the USAGE privilege on schema would need to be granted / revoked in a similar way. What do you think about that? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Peter Eisentraut wrote: > The CF entry is about privileges on publications. Please rebase that patch > and repost it so that the CF app and the CF bot are up to date. The rebased patch (with regression tests added) is attached here. There's still one design issue that I haven't mentioned yet: if the USAGE privilege on a publication is revoked after the synchronization phase completed, the missing privilege on a publication causes ERROR in the output plugin. If the privilege is then granted, the error does not disappear because the same (historical) snapshot we use to decode the failed data change again is also used to check the privileges in the catalog, so the output plugin does not see that the privilege has already been granted. The only solution seems to be to drop the publication from the subscription and add it again, or to drop and re-create the whole subscription. I haven't added a note about this problem to the documentation yet, in case someone has better idea how to approach the problem. -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 03c0193709..d510220a07 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -1935,6 +1935,13 @@ REVOKE ALL ON accounts FROM PUBLIC; statements that have previously performed this lookup, so this is not a completely secure way to prevent object access. + + For publications, allows logical replication via particular + publication. The user specified in + the CREATE + SUBSCRIPTION command must have this privilege on all + publications listed in that command. + For sequences, allows use of the currval and nextval functions. 
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml index f8756389a3..4286d709d9 100644 --- a/doc/src/sgml/logical-replication.sgml +++ b/doc/src/sgml/logical-replication.sgml @@ -1712,7 +1712,9 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER In order to be able to copy the initial table data, the role used for the replication connection must have the SELECT privilege on - a published table (or be a superuser). + a published table (or be a superuser). In addition, the role must have + the USAGE privilege on all the publications referenced + by particular subscription. @@ -1728,14 +1730,12 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER - There are currently no privileges on publications. Any subscription (that - is able to connect) can access any publication. Thus, if you intend to - hide some information from particular subscribers, such as by using row - filters or column lists, or by not adding the whole table to the - publication, be aware that other publications in the same database could - expose the same information. Publication privileges might be added to - PostgreSQL in the future to allow for - finer-grained access control. + If you intend to hide some information from particular subscribers, such as + by using row filters or column lists, or by not adding the whole table to + the publication, be aware that other publications in the same database + could expose the same + information. Privileges on publication can + be used to implement finer-grained access control. diff --git a/doc/src/sgml/ref/grant.sgml b/doc/src/sgml/ref/grant.sgml index dea19cd348..e62b7f643c 100644 --- a/doc/src/sgml/ref/grant.sgml +++ b/doc/src/sgml/ref/grant.sgml @@ -82,6 +82,11 @@ GRANT { { SET | ALTER SYSTEM } [, ... ] | ALL [ PRIVILEGES ] } TO role_specification [, ...] 
[ WITH GRANT OPTION ] [ GRANTED BY role_specification ] +GRANT { USAGE | ALL [ PRIVILEGES ] } +ON PUBLICATION pub_name [, ...] +TO role_specification [, ...] [ WITH GRANT OPTION ] +[ GRANTED BY role_specification ] + GRANT { { CREATE | USAGE } [, ...] | ALL [ PRIVILEGES ] } ON SCHEMA schema_name [, ...] TO role_specification [, ...] [ WITH GRANT OPTION ] @@ -488,7 +493,7 @@ GRANT admins TO joe; -Privileges on databases, tablespaces, schemas, languages, and +Privileges on databases, tablespaces, schemas, languages, publications and configuration parameters are PostgreSQL extensions. diff --git a/doc/src/sgml/ref/revoke.sgml b/doc/src/sgml/ref/revoke.sgml index 4fd4bfb3d7..7e8d018743 100644 --- a/doc/src/sgml/ref/revoke.sgml +++ b/doc/src/sgml/ref/revoke.sgml @@ -104,6 +104,13 @@ REVOKE [ GRANT OPTION FOR ] [ GRANTED BY role_specification ] [ CASCADE | RESTRICT ] +REVOKE [ GRANT OPTION FOR ] +{ USAGE | ALL [ PRIVILEGES ] } +ON PUBLICATION pub_name [, ...] +FROM role_specification [, ...] +[ GRANTED BY role_specification ] +[ CASCADE | RESTRICT ] + REVOKE [ GRANT OPTION FOR ]
Re: Temporary file access API
Hi, John Morris wrote: > I’m a latecomer to the discussion, but as a word of introduction, I’m working > with Stephen, and I have started looking over the temp file proposal with the > idea of helping it move along. > > I’ll start by summarizing the temp file proposal and its goals. > > From a high level, the proposed code: > > * Creates an fread/fwrite replacement (BufFileStream) for buffering data to a > single file. > > * Updates BufFile, which reads/writes a set of files, to use BufFileStream > internally. > > * Does not impact the normal (page cached) database I/O. > > * Updates all the other places where fread/fwrite and read/write are used. Not really all, just those where the change seemed reasonable (i.e. where it does not make the code more complex). > * Creates and removes transient files. The "stream API" is rather an additional layer on top of files that the user needs to create / remove at a lower level. > I see the following goals: > > * Unify all the “other” file accesses into a single, consistent API. > > * Integrate with VFDs. > > * Integrate transient files with transactions and tablespaces. If you mean automatic closing/deletion of files on transaction end, this is also a lower-level thing that I didn't try to change. > * Create a consolidated framework where features like encryption and > compression can be more easily added. > > * Maintain good streaming performance. > > Is this a fair description of the proposal? Basically that's it. > For myself, I’d like to map out how features like compression and encryption > would fit into the framework, more as a sanity check than anything else, and > I’d like to look closer at some of the implementation details. But at the > moment, I want to make sure I have the > proper high-level view of the temp file proposal. I think the high level design (i.e. how the API should be used) still needs discussion. In particular, I don't know whether it should aim at encryption adoption or not.
If it does, then it makes sense to base it on buffile.c, because encryption essentially takes place in memory. But if buffering itself (w/o encryption) is not really useful at other places (see Robert's comments in [1]), then we can design something simpler, w/o touching buffile.c (which, in turn, won't be usable for encryption, compression and so on). So I think that code simplification and easy adoption of in-memory data changes (such as encryption or compression) are two rather distinct goals. I admit that I'm running out of ideas about how to develop a framework that'd be useful for both. [1] https://www.postgresql.org/message-id/CA%2BTgmoZWP8UtkNVLd75Qqoh9VGOVy_0xBK%2BSHZAdNvnfaikKsQ%40mail.gmail.com > From: Robert Haas > Date: Wednesday, September 21, 2022 at 11:54 AM > To: Antonin Houska > Cc: PostgreSQL Hackers , Peter Eisentraut > , Stephen Frost > Subject: Re: Temporary file access API > > On Mon, Aug 8, 2022 at 2:26 PM Antonin Houska wrote: > > > I don't think that (3) is a separate advantage from (1) and (2), so I > > > don't have anything separate to say about it. > > > > I thought that the uncontrollable buffer size is one of the things you > > complained about in > > > > https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com > > Well, I think that email is mostly complaining about there being no > buffering at all in a situation where it would be advantageous to do > some buffering. But that particular code I believe is gone now because > of the shared-memory stats collector, and when looking through the > patch set, I didn't see that any of the cases that it touched were > similar to that one. > > -- > Robert Haas > EDB: http://www.enterprisedb.com > -- Antonin Houska Web: https://www.cybertec-postgresql.com
Return value of PathNameOpenFile()
I've noticed that some callers of PathNameOpenFile() (e.g. bbsink_server_begin_archive()) consider the call failed even if the function returned zero, while other ones do check whether the file descriptor is strictly negative. Since the file descriptor is actually returned by the open() system call, I assume that zero is a valid result, isn't it? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Temporary file access API
Robert Haas wrote: > On Fri, Jul 29, 2022 at 11:36 AM Antonin Houska wrote: > > Attached is a new version. It allows the user to set "elevel" (i.e. ERROR is > > not necessarily thrown on I/O failure, if the user prefers to check the > > number > > of bytes read/written) and to specify the buffer size. Also, 0015 adds a > > function to copy data from one file descriptor to another. > > I basically agree with Peter Eisentraut's earlier comment: "I don't > understand what this patch set is supposed to do." The intended > advantages, as documented in buffile.c, are: > > + * 1. Less code is needed to access the file. > + * > + * 2.. Buffering reduces the number of I/O system calls. > + * > + * 3. The buffer size can be controlled by the user. (The larger the buffer, > + * the fewer I/O system calls are needed, but the more data needs to be > + * written to the buffer before the user recognizes that the file access > + * failed.) > + * > + * 4. It should make features like Transparent Data Encryption less invasive. > > It does look to me like there is a modest reduction in the total > amount of code from these patches, so I do think they might be > achieving (1). > > I'm not so sure about (2), though. A number of these places are cases > where we only do a single write to the file - like in patches 0010, > 0011, 0012, and 0014, for example. And even in 0013, where we do > multiple I/Os, it makes no sense to introduce a second buffering > layer. We're reading from a file into a buffer several kilobytes in > size and then shipping the data out. The places where you want > buffering are places where we might be doing system calls for a few > bytes of I/O at a time, not places where we're already doing > relatively large I/Os. An extra layer of buffering probably makes > things slower, not faster. The buffering seemed to me like a good opportunity to plug-in the encryption, because the data needs to be encrypted in memory before it's written to disk. 
Of course it'd be better to use the existing buffering for that than to add another layer. > I don't think that (3) is a separate advantage from (1) and (2), so I > don't have anything separate to say about it. I thought that the uncontrollable buffer size is one of the things you complaint about in https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com > I find it really unclear whether, or how, it achieves (4). In general, > I don't think that the large number of raw calls to read() and write() > spread across the backend is a particularly good thing. TDE is an > example of a feature that might affect every byte we read or write, > but you can also imagine instrumentation and tracing facilities that > might want to be able to do something every time we do an I/O without > having to modify a zillion separate call sites. Such features can > certainly benefit from consolidation, but I suspect it isn't helpful > to treat every I/O in exactly the same way. For instance, let's say we > settle on a design where everything in the cluster is encrypted but, > just to make things simple for the sake of discussion, let's say there > are no stored nonces anywhere. Can we just route every I/O in the > backend through a wrapper layer that does encryption and decryption? > Probably not. I imagine postgresql.conf and pg_hba.conf aren't going > to be encrypted, for example, because they're intended to be edited by > users. And Bruce has previously proposed designs where the WAL would > be encrypted with a different key than the data files. Another idea, > which I kind of like, is to encrypt WAL on a record level rather than > a block level. Either way, you can't just encrypt every single byte > the same way. There are probably other relevant distinctions: some > data is temporary and need not survive a server restart or be shared > with other backends (except maybe in the case of parallel query) and > some data is permanent. 
Some data is already block-structured using > blocks of size BLCKSZ; some data is block-structured using some other > block size; and some data is not block-structured. Some data is in > standard-format pages that are part of the buffer pool and some data > isn't. I feel like some of those distinctions probably matter to TDE, > and maybe to other things that we might want to do. > > For instance, to go back to my earlier comments, if the data is > already block-structured, we do not need to insert a second layer of > buffering; if it isn't, maybe we should, not just for TDE but for > other reasons as well. If the data is temporary, TDE might want to > encrypt it using a temporary key which is generated on the fly and > only stored in server memory,
Re: Temporary file access API
Antonin Houska wrote: > Peter Eisentraut wrote: > > > On 04.07.22 18:35, Antonin Houska wrote: > > >> Attached is a new version of the patch, to evaluate what the API use in > > >> the > > >> backend could look like. I haven't touched places where the file is > > >> accessed > > >> in a non-trivial way, e.g. lseek() / fseek() or pg_pwrite() / pg_pread() > > >> is > > >> called. > > > Rebased patch set is attached here, which applies to the current master. > > > (A few more opportunities for the new API implemented here.) > > > > I don't understand what this patch set is supposed to do. AFAICT, the > > thread > > originally forked off from a TDE discussion and, considering the thread > > subject, was possibly once an attempt to refactor temporary file access to > > make integrating encryption easier? The discussion then talked about things > > like saving on system calls and excising all OS-level file access API use, > > which I found confusing, and the thread then just became a general > > TDE-related > > mini-discussion. > > Yes, it's an attempt to make the encryption less invasive, but there are a few > other objectives, at least: 1) to make the read / write operations "less > low-level" (there are common things like error handling which are often just > copy & pasted from other places), 2) to have buffered I/O with configurable > buffer size (the current patch version still has a fixed buffer size though) > > It's true that the discussion ends up being encryption-specific, however the > scope of the patch is broader. The first message of the thread references a > related discussion > > https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com > > which is more important for this patch than the suggestions about encryption. > > > The patches at hand extend some internal file access APIs and then sprinkle > > them around various places in the backend code, but there is no explanation > > why or how this is better.
I don't see any real structural savings one might > > expect from a refactoring patch. No information has been presented how this > > might help encryption. > > At this stage I expected feedback from the developers who have already > contributed to the discussion, because I'm not sure myself if this version > fits their ideas - that's why I didn't elaborate the documentation yet. I'll > try to summarize my understanding in the next version, but I'd appreciate it if I > got some feedback for the current version first. > > > I also suspect that changing around the use of various file access APIs > > needs > > to be carefully evaluated in each individual case. Various code has subtle > > expectations about atomic write behavior, syncing, flushing, error recovery, > > etc. I don't know if this has been considered here. > > I considered that, but could have messed up in some places. Right now I'm > aware of one problem: pgstat.c does not expect the file access API to raise > ERROR - this needs to be fixed. Attached is a new version. It allows the user to set "elevel" (i.e. ERROR is not necessarily thrown on I/O failure, if the user prefers to check the number of bytes read/written) and to specify the buffer size. Also, 0015 adds a function to copy data from one file descriptor to another. -- Antonin Houska Web: https://www.cybertec-postgresql.com Attachment: temp_file_api_v4.tgz (application/gzip)
Re: Proposal: allow database-specific role memberships
Kenaniah Cerny wrote: > Rebased yet again... > > On Mon, Jul 4, 2022 at 1:17 PM Kenaniah Cerny wrote: > The "unsafe_tests" directory is where the pre-existing role tests were > located. According to the readme of the "unsafe_tests" directory, the tests > contained within are not run during "make installcheck" because they could > have side-effects that seem undesirable for a production installation. This > seemed like a reasonable location as the new tests that this patch > introduces also modifies the "state" of the database cluster by adding, > modifying, and removing roles & databases (including template1). ok, I missed the purpose of "unsafe_tests" so far, thanks. > Regarding roles_is_member_of(), the nuance is that role "A" in your example > would only be considered a member of role "B" (and by extension role "C") > when connected to the database in which "A" was granted database-specific > membership to "B". > Conversely, when connected to any other database, "A" would not be > considered to be a member of "B". > > This patch is designed to solve the scenarios in which one may want to > grant constrained access to a broader set of privileges. For example, > membership in "pg_read_all_data" effectively grants SELECT and USAGE rights > on everything (implicitly cluster-wide in today's implementation). By > granting a role membership to "pg_read_all_data" within the context of a > specific database, the grantee's read-everything privilege is effectively > constrained to just that specific database (as membership within > "pg_read_all_data" would not otherwise be held). ok, I tried to view the problem rather from a general perspective. However, permissions like "pg_read_all_data" are unusual in that they are rather strong and at the same time they are usually located at the top of the groups hierarchy. I've got no better idea of how to solve the problem. A few more comments on the patch: * It's not clear from the explanation of the GRANT ... IN DATABASE ...
/ GRANT ... IN CURRENT DATABASE ... that, even if "membership in ... will be effective only when the recipient is connected to the database ...", the ADMIN option might not be "fully effective". I refer to the part of the regression tests starting with -- Ensure database-specific admin option can only grant within that database For example, "role_read_34" does have the ADMIN option for the "pg_read_all_data" role and for the "db_4" database: GRANT pg_read_all_data TO role_read_34 IN DATABASE db_4 WITH ADMIN OPTION; (in other words, "role_read_34" does have the database-specific membership in "pg_read_all_data"), but it cannot use the option (in other words, cannot use some ability resulting from that membership) unless the session to that database is active: \connect db_3 SET SESSION AUTHORIZATION role_read_34; ... GRANT pg_read_all_data TO role_granted IN CURRENT DATABASE; -- success GRANT pg_read_all_data TO role_granted IN DATABASE db_3; -- notice NOTICE: role "role_granted" is already a member of role "pg_read_all_data" in database "db_3" GRANT pg_read_all_data TO role_granted IN DATABASE db_4; -- error ERROR: must have admin option on role "pg_read_all_data" Specifically on the regression tests: * The function check_memberships() has no parameters - is there a reason not to use a view? * I'm not sure if the pg_auth_members catalog can contain InvalidOid in other columns than dbid. Thus I think that the query in check_memberships() only needs an outer JOIN for the pg_database table, while the other joins can be inner. * In this part SET SESSION AUTHORIZATION role_read_12_noinherit; SELECT * FROM data; -- error SET ROLE role_read_12; -- error SELECT * FROM data; -- error I think you don't need to query the table again if the SET ROLE statement failed and the same query had been executed before the SET ROLE. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Problem about postponing gathering partial paths for topmost scan/join rel
Richard Guo wrote: > On Fri, Jul 15, 2022 at 5:00 PM Richard Guo wrote: > > On Fri, Jul 15, 2022 at 4:03 PM Richard Guo wrote: > On Thu, Jul 14, 2022 at 10:02 PM Antonin Houska wrote: > > I'd prefer a test that demonstrates that the Gather node at the top of the > "subproblem plan" is useful purely from the *cost* perspective, rather than > due to executor limitation. > > This patch provides an additional path (Gather atop of subproblem) which > was not available before. But your concern makes sense that we need to > show this new path is valuable from competing on cost with other paths. > > How about we change to Nested Loop at the topmost? Something like: > > Maybe a better example is that we use a small table 'c' to avoid the > Gather node above scanning 'c', so that the path of parallel nestloop is > possible to be generated. > > Update the patch with the new test case. ok, this makes sense to me. Just one minor suggestion: the command alter table d_star reset (parallel_workers); is not necessary because it's immediately followed by rollback; I'm going to set the CF entry to "ready for committer". -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Problem about postponing gathering partial paths for topmost scan/join rel
Richard Guo wrote: > On Wed, Jul 28, 2021 at 3:42 PM Richard Guo wrote: > > To fix this problem, I'm thinking we can leverage 'root->all_baserels' > to tell if we are at the topmost scan/join rel, something like:
>
> --- a/src/backend/optimizer/path/allpaths.c
> +++ b/src/backend/optimizer/path/allpaths.c
> @@ -3041,7 +3041,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
>            * partial paths.  We'll do the same for the topmost scan/join rel
>            * once we know the final targetlist (see grouping_planner).
>            */
> -         if (lev < levels_needed)
> +         if (!bms_equal(rel->relids, root->all_baserels))
>               generate_useful_gather_paths(root, rel, false);
>
> Any thoughts? > > Attach a patch to include the fix described upthread. Would appreciate > any comments on this topic. I think I understand the idea but I'm not sure about the regression test. I suspect that in the plan

EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 a JOIN tenk1 b ON a.two = b.two FULL JOIN tenk1 c ON b.two = c.two;

                         QUERY PLAN
 Aggregate
   ->  Hash Full Join
         Hash Cond: (b.two = c.two)
         ->  Gather
               Workers Planned: 4
               ->  Parallel Hash Join
                     Hash Cond: (a.two = b.two)
                     ->  Parallel Seq Scan on tenk1 a
                     ->  Parallel Hash
                           ->  Parallel Seq Scan on tenk1 b
         ->  Hash
               ->  Gather
                     Workers Planned: 4
                     ->  Parallel Seq Scan on tenk1 c

the Gather node is located below the "Hash Full Join" node only because that kind of join currently cannot be executed by parallel workers. If the parallel "Hash Full Join" gets implemented (I've noticed but not checked in detail [1]), it might break this test. I'd prefer a test that demonstrates that the Gather node at the top of the "subproblem plan" is useful purely from the *cost* perspective, rather than due to executor limitation. [1] https://commitfest.postgresql.org/38/2903/ -- Antonin Houska Web: https://www.cybertec-postgresql.com
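The effect of the quoted one-line change can be modeled with sets (a hypothetical sketch; the frozensets stand in for the RelOptInfo.relids bitmapsets). When the join search is split into subproblems — here, because the FULL JOIN forces the {a, b} join to be planned as its own search — the top rel of a subproblem sits at the highest level of its own search even though it does not cover all base rels, so a level-based check would wrongly skip generating Gather paths for it:

```python
def should_generate_gather_paths(relids, all_baserels):
    # relids-based test: skip only the overall topmost scan/join rel,
    # whose gather paths are added later (once the final targetlist is known)
    return relids != all_baserels

all_baserels = frozenset({"a", "b", "c"})
subproblem_top = frozenset({"a", "b"})   # top of the a JOIN b subproblem
overall_top = all_baserels               # the topmost scan/join rel
```

With this test, the {a, b} subproblem top gets Gather paths (enabling the Gather under the Hash Full Join in the plan above), while the overall topmost rel is still deferred.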
Re: [PROPOSAL] Detecting plan changes with plan_id in pg_stat_activity
Shouldn't the patch status be set to "Waiting on Author"? (I was curious if this is a patch that I can review.) Julien Rouhaud wrote: > On Wed, Jun 22, 2022 at 11:05:54PM +, Imseih (AWS), Sami wrote: > > >Can you describe how it's kept in sync, and how it makes sure that the > > > property > > >is maintained over restart / gc? I don't see any change in the code > > > for the > > >2nd part so I don't see how it could work (and after a quick test it > > > indeed > > >doesn't). > > > > There is a routine called qtext_hash_sync which removes all entries from > > the qtext_hash and reloads it with all the query ids from pgss_hash. > > [...] > > At all the points where it's called, an exclusive lock is held. This allows for syncing all > > the present queryid's in pgss_hash with qtext_hash. > > So your approach is to let the current gc / file loading behavior happen as > before and construct your mapping hash using the resulting query text / offset > info. That can't work. > > > >2nd part so I don't see how it could work (and after a quick test it > > > indeed > > >doesn't). > > > > Can you tell me what test you used to determine it is not in sync? > > What test did you use to determine it is in sync? Have you checked how the gc / file loading actually works? > > In my case I just checked the size of the query text file after running some > script that makes sure that there are the same few queries run by multiple > different roles, then: > > Size of $PGDATA/pg_stat_tmp/pgss_query_texts.stat: 559B > pg_ctl restart > Size of $PGDATA/pg_stat_tmp/pgss_query_texts.stat: 2383B > > > >Can you share more details on the benchmarks you did? Did you also run > > >benchmark on workloads that induce entry eviction, with and without > > > need for > > >gc? Eviction in pgss is already very expensive, and making things > > > worse just > > >to save a bit of disk space doesn't seem like a good compromise. > > > > Sorry this was poorly explained by me. 
I went back and did some benchmarks. > > Attached is > > The script and results. But here is a summary: > > On an EC2 r5.2xlarge. The benchmark I performed is: > > 1. create 10k tables > > 2. create 5 users > > 3. run a pgbench script that performs per transaction a select on > > A randomly chosen table for each of the 5 users. > > 4. 2 variants of the test executed. 1 variant is with the default > > pg_stat_statements.max = 5000 > > and one test with a larger pg_stat_statements.max = 1. > > But you wrote: > > > ## > > ## pg_stat_statements.max = 15000 > > ## > > So which one is it? > > > > > So 10-15% is not accurate. I originally tested on a less powered machine. > > For this > > Benchmark I see a 6% increase in TPS (732k vs 683k) when we have a larger > > sized > > pg_stat_statements.max is used and less gc/deallocations. > > Both tests show a drop in gc/deallocations and a net increase > > In tps. Less GC makes sense since the external file has less duplicate SQLs. > > On the other hand you're rebuilding the new query_offset hashtable every time > there's an entry eviction, which seems quite expensive. > > Also, as I mentioned you didn't change any of the heuristic for > need_gc_qtexts(). So if the query texts are indeed deduplicated, doesn't it > mean that gc will artificially > be called less often? The wanted target of "50% bloat" will become "50% > bloat assuming no deduplication is done" and the average query text file size > will stay the same whether the query texts are deduplicated or not. > > I'm wondering whether the improvements you see are due to the patch or simply due to > artificially calling gc less often. What are the results if instead of using > vanilla pg_stat_statements you patch it to perform roughly the same number of > gc as your version does? > > Also your benchmark workload is very friendly with your feature; what are the > results with other workloads? 
Having the results with query texts that aren't > artificially long would be interesting for instance, after fixing the problems > mentioned previously. > > Also, you said that if you run that benchmark with a single user you don't see > any regression. I don't see how rebuilding an extra hashtable in > entry_dealloc(), i.e. while holding an exclusive lwlock, without saving any > other work elsewhere, could be free. > > Looking at the script, you have: > echo "log_min_messages=debug1" >> $PGDATA/postgresql.conf; \ > > Is that really necessary? Couldn't you upgrade the gc message to a higher > level for your benchmark need, or expose some new counter in > pg_stat_statements_info maybe? Have you done the benchmark using a debug > build > or a normal build? > > -- Antonin Houska Web: https://www.cybertec-postgresql.com
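The deduplication being discussed can be modeled as follows. This is a toy sketch, not the pg_stat_statements code: entries are keyed by (userid, dbid, queryid), and the hypothetical qtext_hash maps a queryid to the offset of the first copy of its text in the external file:

```python
def build_qtext_file(entries, texts, dedup):
    """Return (file contents, per-entry text offsets) for a set of pgss entries."""
    parts, offsets, qtext_hash, pos = [], {}, {}, 0
    for key in entries:                         # key = (userid, dbid, queryid)
        queryid = key[2]
        if dedup and queryid in qtext_hash:
            offsets[key] = qtext_hash[queryid]  # point at the stored copy
            continue
        text = texts[queryid] + "\0"            # NUL-terminated text
        offsets[key] = qtext_hash[queryid] = pos
        parts.append(text)
        pos += len(text)
    return "".join(parts), offsets
```

The same query run under five roles is stored five times without deduplication and once with it — which also illustrates Julien's point: if the file shrinks but the need_gc_qtexts() heuristic is unchanged, gc is simply triggered less often, and that alone could account for some of the measured speedup.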
Re: Temporary file access API
Peter Eisentraut wrote: > On 04.07.22 18:35, Antonin Houska wrote: > >> Attached is a new version of the patch, to evaluate what the API use in the > >> backend could look like. I haven't touched places where the file is > >> accessed > >> in a non-trivial way, e.g. lseek() / fseek() or pg_pwrite() / pg_pread() is > >> called. > > Rebased patch set is attached here, which applies to the current master. > > (A few more opportunities for the new API implemented here.) > > I don't understand what this patch set is supposed to do. AFAICT, the thread > originally forked off from a TDE discussion and, considering the thread > subject, was possibly once an attempt to refactor temporary file access to > make integrating encryption easier? The discussion then talked about things > like saving on system calls and excising all OS-level file access API use, > which I found confusing, and the thread then just became a general TDE-related > mini-discussion. Yes, it's an attempt to make the encryption less invasive, but there are a few other objectives, at least: 1) to make the read / write operations "less low-level" (there are common things like error handling which are often just copy & pasted from other places), 2) to have buffered I/O with configurable buffer size (the current patch version still has fixed buffer size though) It's true that the discussion ended up being encryption-specific, however the scope of the patch is broader. The first message of the thread references a related discussion https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com which is more important for this patch than the suggestions about encryption. > The patches at hand extend some internal file access APIs and then sprinkle > them around various places in the backend code, but there is no explanation > why or how this is better. I don't see any real structural savings one might > expect from a refactoring patch. 
No information has been presented how this > might help encryption. At this stage I expected feedback from the developers who have already contributed to the discussion, because I'm not sure myself if this version fits their ideas - that's why I didn't elaborate on the documentation yet. I'll try to summarize my understanding in the next version, but I'd appreciate it if I got some feedback for the current version first. > I also suspect that changing around the use of various file access APIs needs > to be carefully evaluated in each individual case. Various code has subtle > expectations about atomic write behavior, syncing, flushing, error recovery, > etc. I don't know if this has been considered here. I considered that, but could have messed up in some places. Right now I'm aware of one problem: pgstat.c does not expect the file access API to raise ERROR - this needs to be fixed. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Temporary file access API
Antonin Houska wrote: > Robert Haas wrote: > > > On Tue, Apr 12, 2022 at 5:30 AM Antonin Houska wrote: > > > Robert Haas wrote: > > > > On Mon, Apr 11, 2022 at 4:05 AM Antonin Houska wrote: > > > > > There are't really that many kinds of files to encrypt: > > > > > > > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#List_of_the_files_that_contain_user_data > > > > > > > > > > (And pg_stat/* files should be removed from the list.) > > > > > > > > This kind of gets into some theoretical questions. Like, do we think > > > > that it's an information leak if people can look at how many > > > > transactions are committing and aborting in pg_xact_status? In theory > > > > it could be, but I know it's been argued that that's too much of a > > > > side channel. I'm not sure I believe that, but it's arguable. > > > > > > I was referring to the fact that the statistics are no longer stored in > > > files: > > > > > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd > > > > Oh, yeah, I agree with that. > > I see now that the statistics are yet saved to a file on server shutdown. I've > updated the wiki page. > > Attached is a new version of the patch, to evaluate what the API use in the > backend could look like. I haven't touched places where the file is accessed > in a non-trivial way, e.g. lseek() / fseek() or pg_pwrite() / pg_pread() is > called. Rebased patch set is attached here, which applies to the current master. (A few more opportunities for the new API implemented here.) -- Antonin Houska Web: https://www.cybertec-postgresql.com temp_file_api_v3.tgz Description: application/gzip
Re: Proposal: allow database-specific role memberships
Kenaniah Cerny wrote: > Attached is a newly-rebased patch -- would love to get a review from someone > whenever possible. I've picked this patch for a review. The patch currently does not apply to the master branch, so I could only read the diff. Following are my comments: * I think that roles_is_member_of() deserves a comment explaining why the code that you moved into append_role_memberships() needs to be called twice, i.e. once for global memberships and once for the database-specific ones. I think the reason is that if, for example, role "A" is a database-specific member of role "B" and "B" is a "global" member of role "C", then "A" should not be considered a member of "C", unless "A" is granted "C" explicitly. Is this behavior intended? Note that in this example, the "C" members are a superset of "B" members, and thus "C" should have weaker permissions on database objects than "B". What's then the reason to not consider "A" a member of "C"? If "C" gives its members some permissions of "B" (e.g. "pg_write_all_data"), then I think the roles hierarchy is poorly designed. A counter-example might help me to understand. * Why do you think that "unsafe_tests" is the appropriate name for the directory that contains regression tests? I can spend more time on the review if the patch gets rebased. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Euler Taveira wrote: > --eeab359ad6094efd84562cddd7fb9e89 > Content-Type: text/plain > > On Wed, May 18, 2022, at 6:44 AM, Antonin Houska wrote: > > ok, please see the next version. > The new paragraph looks good to me. I'm not sure if the CREATE PUBLICATION is > the right place to provide such information. As I suggested in a previous > email > [1], you could add it to "Logical Replication > Security". ok, I missed that. The next version moves the text there. > [1] https://postgr.es/m/d96103fe-99e2-4119-bd76-952d326b7...@www.fastmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml index 145ea71d61b..2fcaa9d261a 100644 --- a/doc/src/sgml/logical-replication.sgml +++ b/doc/src/sgml/logical-replication.sgml @@ -1171,6 +1171,17 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER schema automatically, the user must be a superuser. + + Note that there are currently no privileges on publication, and that any + subscriber can access any publication. Thus if you're trying to hide some + information from particular subscribers (by using the + WHERE clause or the column list, or by not adding the + whole table to the publication), please be aware that other publications + can expose the same information. Publication privileges might be added + to PostgreSQL in the future to allow for + fine-grained access control. + + To create a subscription, the user must be a superuser.
Re: Privileges on PUBLICATION
Euler Taveira wrote: > On Wed, May 18, 2022, at 6:16 AM, Antonin Houska wrote: > > The patch is attached to this message. > > Great. Add it to the next CF. I'll review it when I have some spare time. https://commitfest.postgresql.org/38/3641/ Thanks! -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Euler Taveira wrote: > On Fri, May 13, 2022, at 3:36 AM, Antonin Houska wrote: > > Attached is my proposal. It tries to be more specific and does not mention > the > absence of the privileges explicitly. > > You explained the current issue but say nothing about the limitation. This > information will trigger a question possibly in one of the MLs. IMO if you say > something like the sentence above at the end, it will make it clear why that > setup expose all data (there is no access control to publications) and > explicitly say there is a TODO here. > > Additional privileges might be added to control access to table data in a > future version of PostgreSQL. I thought it sounded too negative if the absence of some feature was mentioned explicitly. However, it makes sense to be clear from a technical point of view. > I also wouldn't use the warning tag because it fits in the same category as > the > other restrictions listed in the page. ok, please see the next version. -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/doc/src/sgml/ref/create_publication.sgml b/doc/src/sgml/ref/create_publication.sgml index 1a828e8d2ff..259fe20a148 100644 --- a/doc/src/sgml/ref/create_publication.sgml +++ b/doc/src/sgml/ref/create_publication.sgml @@ -112,6 +112,17 @@ CREATE PUBLICATION name Specifying a table that is part of a schema specified by FOR ALL TABLES IN SCHEMA is not supported. + + + Note that there are currently no privileges on publication, and that any + subscriber can access any publication. Thus if you're trying to hide + some information from particular subscribers (by using the + WHERE clause or the column list, or by not adding the + whole table to the publication), please be aware that other publications + can expose the same information. Publication privileges might be added + to PostgreSQL in the future to allow for + fine-grained access control. +
Re: Privileges on PUBLICATION
Antonin Houska wrote: > Euler Taveira wrote: > > > On Tue, May 10, 2022, at 5:37 AM, Antonin Houska wrote: > > > > My understanding is that the rows/columns filtering is a way for the > > *publisher* to control which data is available to particular replica. From > > this point of view, the publication privileges would just make the control > > complete. > > > > I agree. IMO it is a new feature. We already require high privilege for > > logical > > replication. Hence, we expect the replication user to have access to all > > data. > > Unfortunately, nobody mentioned about this requirement during the row > > filter / > > column list development; someone could have written a patch for GRANT ... ON > > PUBLICATION. > > I can try that for PG 16, unless someone is already working on it. The patch is attached to this message. -- Antonin Houska Web: https://www.cybertec-postgresql.com >From 4004260cf1f3443455b7db89a70a50e38d1663b0 Mon Sep 17 00:00:00 2001 From: Antonin Houska Date: Wed, 18 May 2022 10:58:42 +0200 Subject: [PATCH] Draft implementation of USAGE privilege on PUBLICATION. This privilege seems to be useful for databases from which multiple subscribers replicate the data. It should not be assumed that any publication can be exposed to any subscription. 
--- doc/src/sgml/ddl.sgml | 7 + doc/src/sgml/logical-replication.sgml | 4 +- doc/src/sgml/ref/grant.sgml | 7 +- doc/src/sgml/ref/revoke.sgml| 7 + src/backend/catalog/aclchk.c| 206 src/backend/commands/copyto.c | 114 +++ src/backend/commands/publicationcmds.c | 2 + src/backend/parser/gram.y | 8 + src/backend/replication/pgoutput/pgoutput.c | 13 +- src/backend/utils/adt/acl.c | 7 + src/bin/pg_dump/dumputils.c | 2 + src/bin/pg_dump/pg_dump.c | 28 ++- src/bin/pg_dump/pg_dump.h | 1 + src/bin/psql/tab-complete.c | 3 + src/include/catalog/pg_publication.h| 6 + src/include/nodes/parsenodes.h | 4 +- src/include/utils/acl.h | 4 + 17 files changed, 413 insertions(+), 10 deletions(-) diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index f2ac1ba0034..dc499c50756 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -1935,6 +1935,13 @@ REVOKE ALL ON accounts FROM PUBLIC; statements that have previously performed this lookup, so this is not a completely secure way to prevent object access. + + For publications, allows logical replication via particular + publication. The user specified in + the CREATE + SUBSCRIPTION command must have this privilege on all + publications listed in that command. + For sequences, allows use of the currval and nextval functions. diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml index 145ea71d61b..63194b34541 100644 --- a/doc/src/sgml/logical-replication.sgml +++ b/doc/src/sgml/logical-replication.sgml @@ -1156,7 +1156,9 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER In order to be able to copy the initial table data, the role used for the replication connection must have the SELECT privilege on - a published table (or be a superuser). + a published table (or be a superuser). In addition, the role must have + the USAGE privilege on all the publications referenced + by particular subscription. 
diff --git a/doc/src/sgml/ref/grant.sgml b/doc/src/sgml/ref/grant.sgml index f744b05b55d..95b9c46137f 100644 --- a/doc/src/sgml/ref/grant.sgml +++ b/doc/src/sgml/ref/grant.sgml @@ -82,6 +82,11 @@ GRANT { { SET | ALTER SYSTEM } [, ... ] | ALL [ PRIVILEGES ] } TO role_specification [, ...] [ WITH GRANT OPTION ] [ GRANTED BY role_specification ] +GRANT { USAGE | ALL [ PRIVILEGES ] } +ON PUBLICATION pub_name [, ...] +TO role_specification [, ...] [ WITH GRANT OPTION ] +[ GRANTED BY role_specification ] + GRANT { { CREATE | USAGE } [, ...] | ALL [ PRIVILEGES ] } ON SCHEMA schema_name [, ...] TO role_specification [, ...] [ WITH GRANT OPTION ] @@ -460,7 +465,7 @@ GRANT admins TO joe; -Privileges on databases, tablespaces, schemas, languages, and +Privileges on databases, tablespaces, schemas, languages, publications and configuration parameters are PostgreSQL extensions. diff --git a/doc/src/sgml/ref/revoke.sgml b/doc/src/sgml/ref/revoke.sgml index 62f19710369..c21ea4c9c69 100644 --- a/doc/src/sgml/ref/revoke.sgml +++ b/doc/src/sgml/ref/revoke.sgml @@ -104,6 +104,13 @@ REVOKE [ GRANT OPTION FOR ] [ GR
Re: Privileges on PUBLICATION
Peter Eisentraut wrote: > On 10.05.22 10:37, Antonin Houska wrote: > > My understanding is that the rows/columns filtering is a way for the > > *publisher* to control which data is available to particular replica. From > > this point of view, the publication privileges would just make the control > > complete. > > I think privileges on publications would eventually be useful. But are you > arguing that we need them for PG15 to make the new features usable safely? I didn't think that far, but user should be aware of the problem. My proposal of documentation is in https://www.postgresql.org/message-id/5859.1652423797%40antos -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Euler Taveira wrote: > On Tue, May 10, 2022, at 5:37 AM, Antonin Houska wrote: > > My understanding is that the rows/columns filtering is a way for the > *publisher* to control which data is available to particular replica. From > this point of view, the publication privileges would just make the control > complete. > > I agree. IMO it is a new feature. We already require high privilege for > logical > replication. Hence, we expect the replication user to have access to all data. > Unfortunately, nobody mentioned about this requirement during the row filter / > column list development; someone could have written a patch for GRANT ... ON > PUBLICATION. I can try that for PG 16, unless someone is already working on it. > I understand your concern. Like I said in my last sentence in the previous > email: it is a fine-grained access control on the publisher. Keep in mind that > it will *only* work for non-superusers (REPLICATION attribute). It is not > exposing something that we didn't expose before. In this particular case, > there > is no mechanism to prevent the subscriber to obtain data provided by the > various row filters if they know the publication names. We could probably add > a > sentence to "Logical Replication > Security" section: > > There is no privileges for publications. If you have multiple publications in > a > database, a subscription can use all publications available. Attached is my proposal. It tries to be more specific and does not mention the absence of the privileges explicitly. -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/doc/src/sgml/ref/create_publication.sgml b/doc/src/sgml/ref/create_publication.sgml index 1a828e8d2ff..b74ba625649 100644 --- a/doc/src/sgml/ref/create_publication.sgml +++ b/doc/src/sgml/ref/create_publication.sgml @@ -94,6 +94,16 @@ CREATE PUBLICATION name list is specified, it must include the replica identity columns. 
+ + + If you are using the WHERE clause or the column list + to omit some table data from the replication for security reasons, + please make sure that the same data is not exposed via other + publications which contain the same table and have different (or + none) WHERE clause or column list. + + + Only persistent base tables and partitioned tables can be part of a publication. Temporary tables, unlogged tables, foreign tables,
Re: Privileges on PUBLICATION
Amit Kapila wrote: > On Tue, May 10, 2022 at 12:16 AM Euler Taveira wrote: > > > > On Mon, May 9, 2022, at 11:09 AM, Antonin Houska wrote: > > > > Now that the user can specify rows and columns to be omitted from the > > logical > > replication [1], I suppose hiding rows and columns from the subscriber is an > > important use case. However, since the subscription connection user (i.e. > > the > > user specified in the CREATE SUBSCRIPTION ... CONNECTION ... command) needs > > SELECT permission on the replicated table (on the publication side), he can > > just use another publication (which has different filters or no filters at > > all) to get the supposedly-hidden data replicated. > > > > The required privileges were not relaxed on publisher after the row filter > > and > > column list features. It is not just to "create another publication". Create > > publications require CREATE privilege on databases (that is *not* granted to > > PUBLIC).If you have an untrusted user that could bypass your rules about > > hidden > > data, it is better to review your user privileges. > > > > Also, to create a subscription (which combines multiple publications > to bypass rules), a user must be a superuser. So, isn't that a > sufficient guarantee that users shouldn't be able to bypass such > rules? My understanding is that the rows/columns filtering is a way for the *publisher* to control which data is available to particular replica. From this point of view, the publication privileges would just make the control complete. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Privileges on PUBLICATION
Euler Taveira wrote: > On Mon, May 9, 2022, at 11:09 AM, Antonin Houska wrote: > > Now that the user can specify rows and columns to be omitted from the logical > replication [1], I suppose hiding rows and columns from the subscriber is an > important use case. However, since the subscription connection user (i.e. the > user specified in the CREATE SUBSCRIPTION ... CONNECTION ... command) needs > SELECT permission on the replicated table (on the publication side), he can > just use another publication (which has different filters or no filters at > all) to get the supposedly-hidden data replicated. > > The required privileges were not relaxed on publisher after the row filter > and > column list features. It is not just to "create another publication". Create > publications require CREATE privilege on databases (that is *not* granted to > PUBLIC).If you have an untrusted user that could bypass your rules about > hidden > data, it is better to review your user privileges. > > postgres=# CREATE ROLE foo REPLICATION LOGIN; > CREATE ROLE > postgres=# \c - foo > You are now connected to database "postgres" as user "foo". > postgres=> CREATE PUBLICATION pub1; > ERROR: permission denied for database postgres > > The documentation [1] says > > "The role used for the replication connection must have the REPLICATION > attribute (or be a superuser)." > > You can use role foo for the replication connection but role foo couldn't be a > superuser. In this case, even if role foo open a connection to database > postgres, a publication cannot be created due to lack of privileges. > > Don't we need privileges on publication (e.g GRANT USAGE ON PUBLICATION ...) > now? > > Maybe. We rely on CREATE privilege on databases right now. If you say that > GRANT USAGE ON PUBLICATION is just a command that will have the same effect as > REPLICATION property [1] has right now, I would say it won't. Are you aiming a > fine-grained access control on publisher? 
The configuration I'm thinking of is multiple replicas reading data from the same master. For example, consider "foo" and "bar" roles, used by "subscr_foo" and "subscr_bar" subscriptions respectively. (Therefore, both roles need the REPLICATION option.) The subscriptions "subscr_foo" and "subscr_bar" are located in "db_foo" and "db_bar" databases respectively. On the master side, there are two publications: "pub_foo" and "pub_bar", to be used by "subscr_foo" and "subscr_bar" subscriptions respectively. The publications replicate the same table, but each with a different row filter. The problem is that the admin of "db_foo" can add the "pub_bar" publication to the "subscr_foo" subscription, and thus get the data that his "pub_foo" would filter out. Likewise, the admin of "db_bar" can "steal" the data from "pub_foo" by adding that publication to "subscr_bar". In this case, the existing publications are misused, so the CREATE PUBLICATION privileges do not help. Since the REPLICATION option of a role is cluster-wide, but I need specific roles to be restricted to specific publications, it can actually be called fine-grained access control as you say. > [1] https://www.postgresql.org/docs/devel/logical-replication-security.html -- Antonin Houska Web: https://www.cybertec-postgresql.com
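The leak scenario above can be sketched like this (purely illustrative data; each publication is reduced to its row filter): a subscription that combines several publications over the same table receives the union of the filtered row sets, so adding "pub_bar" reveals exactly the rows "pub_foo" was supposed to hide:

```python
rows = [
    {"id": 1, "region": "EU"},
    {"id": 2, "region": "US"},
]

def publish(rows, row_filter):
    # a publication only replicates rows matching its WHERE clause
    return [r for r in rows if row_filter(r)]

pub_foo = lambda r: r["region"] == "EU"   # row filter of pub_foo
pub_bar = lambda r: r["region"] == "US"   # row filter of pub_bar

def replicate(rows, publications):
    # a subscription listing several publications gets the union of the rows
    seen = []
    for pub in publications:
        for r in publish(rows, pub):
            if r not in seen:
                seen.append(r)
    return seen
```

With only pub_foo, the subscriber sees the EU rows; once pub_bar is added to the subscription, the whole table is replicated — which is why per-publication privileges (rather than the cluster-wide REPLICATION attribute) would be needed to enforce the intended filtering.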
Privileges on PUBLICATION
Now that the user can specify rows and columns to be omitted from the logical replication [1], I suppose hiding rows and columns from the subscriber is an important use case. However, since the subscription connection user (i.e. the user specified in the CREATE SUBSCRIPTION ... CONNECTION ... command) needs SELECT permission on the replicated table (on the publication side), he can just use another publication (which has different filters or no filters at all) to get the supposedly-hidden data replicated. Don't we need privileges on publications (e.g. GRANT USAGE ON PUBLICATION ...) now? [1] https://www.postgresql.org/docs/devel/sql-createpublication.html -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Temporary file access API
Robert Haas wrote: > On Tue, Apr 12, 2022 at 5:30 AM Antonin Houska wrote: > > Robert Haas wrote: > > > On Mon, Apr 11, 2022 at 4:05 AM Antonin Houska wrote: > > > > There aren't really that many kinds of files to encrypt: > > > > > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#List_of_the_files_that_contain_user_data > > > > > > > > (And pg_stat/* files should be removed from the list.) > > > > > > This kind of gets into some theoretical questions. Like, do we think > > > that it's an information leak if people can look at how many > > > transactions are committing and aborting in pg_xact_status? In theory > > > it could be, but I know it's been argued that that's too much of a > > > side channel. I'm not sure I believe that, but it's arguable. > > > > I was referring to the fact that the statistics are no longer stored in > > files: > > > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd > > Oh, yeah, I agree with that. I see now that the statistics are still saved to a file on server shutdown. I've updated the wiki page. Attached is a new version of the patch, to evaluate what the API use in the backend could look like. I haven't touched places where the file is accessed in a non-trivial way, e.g. lseek() / fseek() or pg_pwrite() / pg_pread() is called. Another use case might be copying one file to another via a buffer. Something like BufFileCopy(int dstfd, int srcfd, int bufsize) The obvious call site would be in copydir.c:copy_file(), but I think there are a few more in the server code. -- Antonin Houska Web: https://www.cybertec-postgresql.com temp_file_api_v2.tgz Description: application/gzip
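The BufFileCopy() idea from the last paragraph could look roughly like this — a sketch in Python rather than C, using POSIX-style descriptor I/O; only the signature comes from the email, the rest is illustrative:

```python
import os

def buf_file_copy(dstfd, srcfd, bufsize=8192):
    """Copy srcfd to dstfd through a fixed-size buffer; return bytes copied."""
    total = 0
    while True:
        chunk = os.read(srcfd, bufsize)
        if not chunk:                      # EOF
            break
        while chunk:                       # os.write() may write less than asked
            written = os.write(dstfd, chunk)
            total += written
            chunk = chunk[written:]
    return total
```

A C version in the proposed API would additionally centralize the error handling (ereport on short read/write) that copydir.c:copy_file() and similar call sites currently duplicate.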
Re: Temporary file access API
Matthias van de Meent wrote: > On Mon, 11 Apr 2022 at 10:05, Antonin Houska wrote: > > > > There aren't really that many kinds of files to encrypt: > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#List_of_the_files_that_contain_user_data > > > > (And pg_stat/* files should be removed from the list.) > > I was looking at that list of files that contain user data, and > noticed that all relation forks except the main fork were marked as > 'does not contain user data'. To me this seems not necessarily true: > AMs do have access to forks for user data storage as well (without any > real issues or breaking the abstraction), and the init-fork is > expected to store user data (specifically in the form of unlogged > sequences). Shouldn't those forks thus also be encrypted-by-default, > or should we provide some other method to ensure that non-main forks > with user data are encrypted? Thanks. I've updated the wiki page (and also included Robert's comments). -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Temporary file access API
Robert Haas wrote: > On Mon, Apr 11, 2022 at 4:05 AM Antonin Houska wrote: > > There aren't really that many kinds of files to encrypt: > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#List_of_the_files_that_contain_user_data > > > > (And pg_stat/* files should be removed from the list.) > > This kind of gets into some theoretical questions. Like, do we think > that it's an information leak if people can look at how many > transactions are committing and aborting in pg_xact_status? In theory > it could be, but I know it's been argued that that's too much of a > side channel. I'm not sure I believe that, but it's arguable. I was referring to the fact that the statistics are no longer stored in files: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd > Similarly, the argument that global/pg_internal.init doesn't contain > user data relies on the theory that the only table data that will make > its way into the file is for system catalogs. I guess that's not user > data *exactly* but ... are we sure that's how we want to roll here? Yes, this is worth attention. > I really don't know how you can argue that pg_dynshmem/mmap.NNN > doesn't contain user data - surely it can. Good point. Since postgres does not control writing into this file, it's a special case though. (Maybe TDE will have to refuse to start if dynamic_shared_memory_type is set to mmap and the instance is encrypted.) Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
孔凡深(云梳) wrote: > Hi, Antonin. I am quite interested in zheap and have recently been reviewing > the patch you submitted. When I used the pg_undodump tool to dump the undo > page chunks, I found that some chunk headers are abnormal. After reading the > relevant code in 0006-The-initial-implementation-of-the-pg_undodump-tool.patch, > I believe there is a bug in the function parse_undo_page. Thanks, I'll take a look if the project happens to continue. Currently it seems that another approach is more likely to be taken: https://www.postgresql.org/message-id/CA%2BTgmoa_VNzG4ZouZyQQ9h%3DoRiy%3DZQV5%2BxHQXxMWmep4Ygg8Dg%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Temporary file access API
Robert Haas wrote: > On Fri, Apr 8, 2022 at 4:54 AM Antonin Houska wrote: > > Do you think that the use of a system call is a problem itself (e.g. because > > the code looks less simple if read/write is used somewhere and fread/fwrite > > elsewhere; of course read/write is mandatory in special cases like WAL, > > heap pages, etc.) or is the problem that the system calls are used too > > frequently? I suppose only the latter. > > I'm not really super-interested in reducing the number of system > calls. It's not a dumb thing in which to be interested and I know that > for example Thomas Munro is very interested in it and has done a bunch > of work in that direction just to improve performance. But for me the > attraction of this is mostly whether it gets us closer to TDE. > > And that's why I'm asking these questions about adopting it in > different places. I kind of thought that your previous patches needed > to encrypt, I don't know, 10 or 20 different kinds of files. So I was > surprised to see this patch touching the handling of only 2 kinds of > files. If we consolidate the handling of let's say 15 of 20 cases into > a single mechanism, we've really moved the needle in the right > direction -- but consolidating the handling of 2 of 20 cases, or > whatever the real numbers are, isn't very exciting. There aren't really that many kinds of files to encrypt: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#List_of_the_files_that_contain_user_data (And pg_stat/* files should be removed from the list.) -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Temporary file access API
Robert Haas wrote: > On Tue, Mar 8, 2022 at 6:12 AM Antonin Houska wrote: > > Thanks for your comments, the initial version is attached here. > > I've been meaning to look at this thread for some time but have not > found enough time to do that until just now. And now I have > questions... > > 1. Supposing we accepted this, how widely do you think that we could > adopt it? I see that this patch set adapts a few places to use it and > that's nice, but I have a feeling there's a lot more places that are > making use of system calls directly, or through some other > abstraction, than just this. I'm not sure that we actually want to use > something like this everywhere, but what would be the rule for > deciding where to use it and where not to use > it? If the plan for this facility is just to adapt these two > particular places to use it, that doesn't feel like enough to be > worthwhile. Admittedly I viewed the problem from the perspective of the TDE, so I haven't spent much time looking for other opportunities. Now, with the stats collector using shared memory, even one of the use cases implemented here no longer exists. I need to do more research. Do you think that the use of a system call is a problem itself (e.g. because the code looks less simple if read/write is used somewhere and fread/fwrite elsewhere; of course read/write is mandatory in special cases like WAL, heap pages, etc.) or is the problem that the system calls are used too frequently? I suppose only the latter. Anyway, I'm not sure there are *many* places where system calls are used too frequently. Instead, the code is usually written such that the information is first assembled in memory and then written to the file at once. So the value of the (buffered) stream is that it makes the code simpler (it eliminates the need to prepare the data in memory). That's what I tried to do for reorderbuffer.c and pgstat.c in my patch.
A related question is whether we should try to replace the libc stream (FILE *) in some places. You seem to suggest that in [1]. One example is snapmgr.c:ExportSnapshot(), if we also implement output formatting. Of course there are places where (FILE *) cannot be replaced because, besides regular files, the code needs to work with stdin/stdout in general. (Parsing of configuration files falls into this category, but that doesn't matter because the bison-generated parser seems to implement buffering anyway.) > 2. What about frontend code? Most frontend code won't examine data > files directly, but at least pg_controldata, pg_checksums, and > pg_resetwal are exceptions. If the frequency of using system calls is the problem, then I wouldn't change these because the ControlFileData structure needs to be initialized in memory anyway and then written at once. And pg_checksums reads whole blocks anyway. I'll take a closer look. > 3. How would we extend this to support encryption? Add an extra > argument to BufFileStreamInit(V)FD passing down the encryption > details? Yes. > There are some smaller things about the patch with which I'm not 100% > comfortable, but I'd like to start by understanding the big picture. Thanks! [1] https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Logical replication row filtering and TOAST
Amit Kapila wrote: > > Antonin Houska wrote: > > > > Nevertheless, a comment in pgoutput_row_filter(), saying that TOASTed values > > are not expected if old_slot is NULL, might be useful. > > > > How about something like the attached? Yes, that'd be sufficient. Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Logical replication row filtering and TOAST
Antonin Houska wrote: > I spent some time thinking about a special case of evaluation of the row > filter and wrote a comment that might be useful (see the attachment). However > now I think that it's not perfect if the code really relies on the fact that > the value of an indexed column cannot be TOASTed due to size restrictions. > > I could hit two different error messages when trying to activate TOAST on an > index column (in this case PG was built with 16kB pages), but still I think > the code is unnecessarily fragile if it relies on such errors: > > > ERROR: index row requires 8224 bytes, maximum size is 8191 > > ERROR: index row size 8048 exceeds btree version 4 maximum 5432 for index > "b_pkey" > DETAIL: Index row references tuple (0,3) in relation "b". > HINT: Values larger than 1/3 of a buffer page cannot be indexed. > > > Note that at least in ExtractReplicaIdentity() we do expect that an indexed > column value can be TOASTed. > > /* >* If the tuple, which by here only contains indexed columns, still has >* toasted columns, force them to be inlined. This is somewhat unlikely >* since there's limits on the size of indexed columns, so we don't >* duplicate toast_flatten_tuple()s functionality in the above loop over >* the indexed columns, even if it would be more efficient. >*/ > if (HeapTupleHasExternal(key_tuple)) > { > HeapTuple oldtup = key_tuple; > > key_tuple = toast_flatten_tuple(oldtup, desc); > heap_freetuple(oldtup); > } > > Am I missing anything? Well, I see now that the point might be that, in heap_update(), "id_has_external" would be true if the indexed value could be TOASTed, so that the (flattened) old tuple would be WAL logged: old_key_tuple = ExtractReplicaIdentity(relation, &oldtup, bms_overlap(modified_attrs, id_attrs) || id_has_external, &old_key_copied); Nevertheless, a comment in pgoutput_row_filter(), saying that TOASTed values are not expected if old_slot is NULL, might be useful. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Logical replication row filtering and TOAST
I spent some time thinking about a special case of evaluation of the row filter and wrote a comment that might be useful (see the attachment). However now I think that it's not perfect if the code really relies on the fact that the value of an indexed column cannot be TOASTed due to size restrictions. I could hit two different error messages when trying to activate TOAST on an index column (in this case PG was built with 16kB pages), but still I think the code is unnecessarily fragile if it relies on such errors:

ERROR: index row requires 8224 bytes, maximum size is 8191

ERROR: index row size 8048 exceeds btree version 4 maximum 5432 for index "b_pkey"
DETAIL: Index row references tuple (0,3) in relation "b".
HINT: Values larger than 1/3 of a buffer page cannot be indexed.

Note that at least in ExtractReplicaIdentity() we do expect that an indexed column value can be TOASTed.

	/*
	 * If the tuple, which by here only contains indexed columns, still has
	 * toasted columns, force them to be inlined. This is somewhat unlikely
	 * since there's limits on the size of indexed columns, so we don't
	 * duplicate toast_flatten_tuple()s functionality in the above loop over
	 * the indexed columns, even if it would be more efficient.
	 */
	if (HeapTupleHasExternal(key_tuple))
	{
		HeapTuple	oldtup = key_tuple;

		key_tuple = toast_flatten_tuple(oldtup, desc);
		heap_freetuple(oldtup);
	}

Am I missing anything? -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 20d0b1e1253..2be5aaa18c4 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1246,6 +1246,15 @@ pgoutput_row_filter(Relation relation, TupleTableSlot *old_slot,
 	 */
 	if (!new_slot || !old_slot)
 	{
+		/*
+		 * For UPDATE, we should only get here if the replica identity is !=
+		 * FULL and thus all the identity columns are index columns, i.e.
+		 * never TOASTed. Therefore, we don't need to care about the unchanged
+		 * toasted replica identity columns like we do below.
+		 */
+		Assert(relation->rd_rel->relreplident != REPLICA_IDENTITY_FULL ||
+			   map_changetype_pubaction[*action] != PUBACTION_UPDATE);
+
 		ecxt->ecxt_scantuple = new_slot ? new_slot : old_slot;
 
 		result = pgoutput_row_filter_exec_expr(filter_exprstate, ecxt);
Re: Temporary file access API
Stephen Frost wrote: > * Antonin Houska (a...@cybertec.at) wrote: > > Here I'm starting a new thread to discuss a topic that's related to the > > Transparent Data Encryption (TDE), but could be useful even without that. > > The > > problem has been addressed somehow in the Cybertec TDE fork, and I can post > > the code here if it helps. However, after reading [1] (and the posts > > upthread), I've got another idea, so let's try to discuss it first. > > Yeah, I tend to agree that it makes sense to discuss the general idea of > doing buffered I/O for the temporary files that are being created today > with things like fwrite and have that be independent of encryption. As > mentioned on that thread, doing our own buffering and having file > accesses done the same way throughout the system is just generally a > good thing and focusing on that will certainly help with any TDE effort > that may or may not happen later. Thanks for your comments, the initial version is attached here. The buffer size is not configurable yet. Besides the new API itself, the patch series shows how it can be used to store logical replication data changes as well as statistics. Further comments are welcome, thanks in advance. -- Antonin Houska Web: https://www.cybertec-postgresql.com temp_file_api_v1.tgz Description: application/gzip
Re: storing an explicit nonce
Stephen Frost wrote: > Perhaps this is all too meta and we need to work through some specific > ideas around just what this would look like. In particular, thinking > about what this API would look like and how it would be used by > reorderbuffer.c, which builds up changes in memory and then does a bare > write() call, seems like a main use-case to consider. The gist there > being "can we come up with an API to do all these things that doesn't > require entirely rewriting ReorderBufferSerializeChange()?" > > Seems like it'd be easier to achieve that by having something that looks > very close to how write() looks, but just happens to have the option to > run the data through a stream cipher and maybe does better error > handling for us. Making that layer also do block-based access to the > files underneath seems like a much larger effort that, sure, may make > some things better too but if we could do that with the same API then it > could also be done later if someone's interested in that. My initial proposal is in this new thread: https://www.postgresql.org/message-id/4987.1644323098%40antos -- Antonin Houska Web: https://www.cybertec-postgresql.com
Temporary file access API
Here I'm starting a new thread to discuss a topic that's related to Transparent Data Encryption (TDE), but could be useful even without it. The problem has been addressed somehow in the Cybertec TDE fork, and I can post the code here if it helps. However, after reading [1] (and the posts upthread), I've got another idea, so let's try to discuss it first. It makes sense to me if we first implement the buffering (i.e. writing/reading a certain amount of data at a time) and make the related functions aware of encryption later: as long as we use a block cipher, we also need to read/write (suitably sized) chunks rather than individual bytes (or arbitrary amounts of data). (In theory, someone might need encryption but reject buffering, but I'm not sure if this is a realistic use case.) For the buffering, I imagine a "file stream" object that the user creates on top of a file descriptor, such as

FileStream *FileStreamCreate(File file, int buffer_size)

or

FileStream *FileStreamCreateFD(int fd, int buffer_size)

and uses functions like

int FileStreamWrite(FileStream *stream, char *buffer, int amount)

and

int FileStreamRead(FileStream *stream, char *buffer, int amount)

to write and read data respectively. Besides functions to close the streams explicitly (e.g. FileStreamClose() / FileStreamFDClose()), we'd need to ensure automatic closing wherever the underlying file is closed automatically. For example, if OpenTemporaryFile() was used to obtain the file descriptor, the user expects that the file will be closed and deleted at the transaction boundary, so the corresponding stream should be freed automatically as well. To avoid code duplication, buffile.c should use these streams internally as well, as it also performs buffering. (Here we'd also need functions to change the reading/writing position.) Once we implement the encryption, we might need to add an argument to the FileStreamCreate...() functions that helps to generate a unique IV, but the ...Read() / ...Write() functions would stay intact.
And possibly one more argument to specify the kind of cipher, in case we support more than one. I think that's enough to start the discussion. Thanks for feedback in advance. [1] https://www.postgresql.org/message-id/CA%2BTgmoYGjN_f%3DFCErX49bzjhNG%2BGoctY%2Ba%2BXhNRWCVvDY8U74w%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
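As a rough illustration of the proposed stream object (a minimal user-space sketch, not the actual implementation: only the FileStreamCreateFD() variant is shown, the buffer is used for writes only, and error handling plus the automatic-close machinery are omitted):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct FileStream
{
	int			fd;				/* underlying file descriptor */
	char	   *buffer;			/* write buffer */
	int			buffer_size;	/* capacity of the buffer */
	int			nused;			/* bytes currently buffered */
} FileStream;

static FileStream *
FileStreamCreateFD(int fd, int buffer_size)
{
	FileStream *stream = malloc(sizeof(FileStream));

	stream->fd = fd;
	stream->buffer = malloc(buffer_size);
	stream->buffer_size = buffer_size;
	stream->nused = 0;
	return stream;
}

/* Flush buffered data; this is where encryption could later hook in. */
static void
FileStreamFlush(FileStream *stream)
{
	if (stream->nused > 0)
	{
		write(stream->fd, stream->buffer, stream->nused);
		stream->nused = 0;
	}
}

static int
FileStreamWrite(FileStream *stream, char *buffer, int amount)
{
	int			remaining = amount;

	while (remaining > 0)
	{
		int			space = stream->buffer_size - stream->nused;
		int			chunk = remaining < space ? remaining : space;

		memcpy(stream->buffer + stream->nused, buffer, chunk);
		stream->nused += chunk;
		buffer += chunk;
		remaining -= chunk;

		if (stream->nused == stream->buffer_size)
			FileStreamFlush(stream);	/* buffer full, write it out */
	}
	return amount;
}

static void
FileStreamClose(FileStream *stream)
{
	FileStreamFlush(stream);
	free(stream->buffer);
	free(stream);
}
```

The point of the design is that callers only ever see FileStreamWrite(); whether the flush path writes plain or encrypted chunks (and how large those chunks are) stays internal to the stream.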
Re: Doc: CREATE_REPLICATION_SLOT command requires the plugin name
Amit Kapila wrote: > On Tue, Feb 1, 2022 at 5:43 PM Antonin Houska wrote: > > > > Amit Kapila wrote: > > > > > On Tue, Feb 1, 2022 at 3:44 PM Antonin Houska wrote: > > > > > > > > I got a syntax error when using the command according to the existing > > > > documentation. The output_plugin parameter needs to be passed too. > > > > > > > > > > Why do we need it for physical slots? > > > > Sure we don't, the missing curly brackets seem to be the problem. I used the > > other form of the command for reference, which therefore might need a minor > > fix too. > > > > Instead of adding additional '{}', can't we simply use: > { PHYSICAL [ RESERVE_WAL ] | LOGICAL output_plugin } [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT | TWO_PHASE ] Do you mean changing

CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL [ RESERVE_WAL ] | LOGICAL output_plugin [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT | TWO_PHASE ] }

to

CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL [ RESERVE_WAL ] | LOGICAL output_plugin } [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT | TWO_PHASE ]

? I'm not sure; IMHO that would still allow for commands like

CREATE_REPLICATION_SLOT slot_name PHYSICAL output_plugin

-- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Doc: CREATE_REPLICATION_SLOT command requires the plugin name
Amit Kapila wrote: > On Tue, Feb 1, 2022 at 3:44 PM Antonin Houska wrote: > > > > I got a syntax error when using the command according to the existing > > documentation. The output_plugin parameter needs to be passed too. > > > > Why do we need it for physical slots? Sure we don't, the missing curly brackets seem to be the problem. I used the other form of the command for reference, which therefore might need a minor fix too. -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 24e93f9b28..49fae6c669 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1942,7 +1942,7 @@ The commands accepted in replication mode are:
- CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL | LOGICAL } [ ( option [, ...] ) ]
+ CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL | { LOGICAL output_plugin } } [ ( option [, ...] ) ]
 CREATE_REPLICATION_SLOT
@@ -2085,7 +2085,8 @@ The commands accepted in replication mode are:
-CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL [ RESERVE_WAL ] | LOGICAL output_plugin [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT | TWO_PHASE ] }
+CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL [ RESERVE_WAL ] | { LOGICAL output_plugin
+} [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT | TWO_PHASE ] }
Doc: CREATE_REPLICATION_SLOT command requires the plugin name
I got a syntax error when using the command according to the existing documentation. The output_plugin parameter needs to be passed too. -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 24e93f9b28..6065c18fdf 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1942,7 +1942,7 @@ The commands accepted in replication mode are:
- CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL | LOGICAL } [ ( option [, ...] ) ]
+ CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL | LOGICAL output_plugin } [ ( option [, ...] ) ]
 CREATE_REPLICATION_SLOT
Re: XTS cipher mode for cluster file encryption
Bruce Momjian wrote: > On Mon, Nov 29, 2021 at 08:37:31AM +0100, Antonin Houska wrote: > > The changes to buffile.c are not trivial, but we haven't really changed the > > API, assuming you mean BufFileCreateTemp(), BufFileWrite(), BufFileRead(). > > > > What our patch changes on the caller side is that BufFileOpenTransient(), > > BufFileCloseTransient(), BufFileWriteTransient() and BufFileReadTransient() > > replace OpenTransientFile(), CloseTransientFile(), write()/fwrite() and > > read()/fread() respectively in reorderbuffer.c and in pgstat.c. These > > changes become a little bit less invasive in TDE 1.1 than they were in 1.0, > > see [1], see the diffs attached. > > With pg_upgrade modified to preserve the relfilenode, tablespace oid, and > database oid, we are now closer to implementing cluster file encryption > using XTS. I think we have a few steps left: > > 1. modify temporary file I/O to use a more centralized API > 2. modify the existing cluster file encryption patch to use XTS with an > IV that uses more than the LSN > 3. add XTS regression test code like CTR > 4. create WAL encryption code using CTR > > If we can do #1 in PG 15 I think I can have #2 ready for PG 16 in July. > The feature wiki page is: > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption > > Do people want to advance this feature forward? I confirm that we (Cybertec) do and that we're ready to spend more time on the community implementation. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: [PATCH] buffile: ensure start offset is aligned with BLCKSZ
Sasasu wrote: > Hi hackers, > > there has been a very long discussion about TDE, and we agreed that if the > temporary file I/O can be aligned to some fixed size, it will be easier > to use some kind of encryption algorithm. > > discussion: > https://www.postgresql.org/message-id/20211025155814.GD20998%40tamriel.snowman.net > > This patch adjusts file->curOffset and file->pos before the real I/O to > ensure the start offset is aligned. Does this patch really pass the regression tests? In BufFileRead(), I would understand if you did

+ file->pos = offsetInBlock;
+ file->curOffset -= offsetInBlock;

rather than

+ file->pos += offsetInBlock;
+ file->curOffset -= offsetInBlock;

Anyway, BufFileDumpBuffer() does not seem to enforce curOffset to end up at a block boundary, not to mention BufFileSeek(). When I was implementing this for our fork [1], I concluded that the encryption code path is too specific, so I left the existing code for the unencrypted data and added separate functions for the encrypted data. One specific thing is that if you encrypt and write n bytes, but only need part of it later, you need to read and decrypt exactly those n bytes anyway, otherwise the decryption won't work. So I decided not only to keep curOffset at a BLCKSZ boundary, but also to read / write BLCKSZ bytes at a time. This also makes sense if the scope of the initialization vector (IV) is BLCKSZ bytes. Another problem is that you might want to store the IV somewhere in between the data. In short, the encryption makes the buffered I/O rather different and the specific code should be kept aside, although the same API is used to invoke it. -- Antonin Houska Web: https://www.cybertec-postgresql.com [1] https://github.com/cybertec-postgresql/postgres/tree/PG_14_TDE_1_1
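To make the alignment arithmetic under discussion concrete, here is a tiny standalone sketch (the helper names block_start, offset_in_block and aligned_range are made up for illustration; buffile.c itself tracks the position as curOffset plus pos rather than through helpers like these):

```c
#include <assert.h>

#define BLCKSZ 8192				/* the usual PostgreSQL block size */

/* Start of the block that contains the given file offset. */
static long
block_start(long offset)
{
	return offset - (offset % BLCKSZ);
}

/* Position of the offset within its block. */
static int
offset_in_block(long offset)
{
	return (int) (offset % BLCKSZ);
}

/*
 * Block-aligned range [*start, *end) that covers [offset, offset + len).
 * If the data was encrypted block-wise, this whole range has to be read
 * and decrypted even when the caller only needs part of it.
 */
static void
aligned_range(long offset, long len, long *start, long *end)
{
	*start = block_start(offset);
	*end = block_start(offset + len - 1) + BLCKSZ;
}
```

This is why reading and writing whole BLCKSZ-sized chunks at a time simplifies the encrypted code path: the aligned range then always coincides with what was actually written.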
Re: XTS cipher mode for cluster file encryption
Stephen Frost wrote: > Greetings, > > * Bruce Momjian (br...@momjian.us) wrote: > > On Tue, Oct 19, 2021 at 02:54:56PM -0400, Stephen Frost wrote: > > > * Sasasu (i...@sasa.su) wrote: > > > > A unified block-based I/O API sounds great. Has anyone tried to do this > > > > before? It would be nice if the front-end tools could also use these > > > > APIs. > > > > > > The TDE patch from Cybertec did go down this route, but the API ended up > > > being rather different which meant a lot of changes in other parts of > > > the system. If we can get a block-based temporary file method that > > > maintains more-or-less the same API, that'd be great, but I'm not sure > > > that we can really do so and I am not entirely convinced that we should > > > make the TDE effort depend on an otherwise quite independent effort of > > > making all temp files usage be block based. > > > > Uh, I thought people felt the Cybertec patch was too large and that a > > unified API for temporary file I/O-encryption was a requirement. Would > > a CTR-streaming-encryption API for temporary tables be easier to > > implement? > > Having a unified API for temporary file I/O (that could then be extended > to provide encryption) would definitely help with reducing the size of a > TDE patch. The approach used in the Cybertec patch was to make > temporary file access block based, but the way that was implemented was > with an API different from pread/pwrite and that meant changing pretty > much all of the call sites for temporary file access, which naturally > resulted in changes in a lot of otherwise unrelated code. The changes to buffile.c are not trivial, but we haven't really changed the API, assuming you mean BufFileCreateTemp(), BufFileWrite(), BufFileRead().
What our patch affects on the caller side is that BufFileOpenTransient(), BufFileCloseTransient(), BufFileWriteTransient() and BufFileReadTransient() replace OpenTransientFile(), CloseTransientFile(), write()/fwrite() and read()/fread() respectively in reorderbuffer.c and in pgstat.c. These changes become a little bit less invasive in TDE 1.1 than they were in 1.0, see [1], see the diffs attached. (I expect that [2] will get committed someday so that the TDE feature won't affect pgstat.c in the future at all.) -- Antonin Houska Web: https://www.cybertec-postgresql.com [1] https://github.com/cybertec-postgresql/postgres/tree/PG_14_TDE_1_1 [2] https://commitfest.postgresql.org/34/1708/ diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index d474ea1d0a..9cb0708e41 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -99,6 +99,7 @@ #include "replication/slot.h" #include "replication/snapbuild.h" /* just for SnapBuildSnapDecRefcount */ #include "storage/bufmgr.h" +#include "storage/buffile.h" #include "storage/fd.h" #include "storage/sinval.h" #include "utils/builtins.h" @@ -131,21 +132,13 @@ typedef struct ReorderBufferTupleCidEnt CommandId combocid; /* just for debugging */ } ReorderBufferTupleCidEnt; -/* Virtual file descriptor with file offset tracking */ -typedef struct TXNEntryFile -{ - File vfd; /* -1 when the file is closed */ - off_t curOffset; /* offset for next write or read. Reset to 0 - * when vfd is opened. 
*/ -} TXNEntryFile; - /* k-way in-order change iteration support structures */ typedef struct ReorderBufferIterTXNEntry { XLogRecPtr lsn; ReorderBufferChange *change; ReorderBufferTXN *txn; - TXNEntryFile file; + TransientBufFile *file; XLogSegNo segno; } ReorderBufferIterTXNEntry; @@ -243,14 +236,22 @@ static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMe * Disk serialization support functions * --- */ +static void ReorderBufferTweakBase(ReorderBufferTXN *txn, + char tweak_base[TWEAK_BASE_SIZE]); static void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb); static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn); static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, - int fd, ReorderBufferChange *change); + TransientBufFile *file, ReorderBufferChange *change); +static void ReorderBufferWriteData(TransientBufFile *file, void *ptr, size_t size, + ReorderBufferTXN *txn); static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, - TXNEntryFile *file, XLogSegNo *segno); + TransientBufFile **file, XLogSegNo *segno); static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, - char *change); + T
Re: Reuse of State value in Aggregates
gg pw wrote: > Do aggregate functions/queries reuse the state value they generate for > subsequent queries? If not, then an explanation would be greatly appreciated. Subsequent queries are not supposed to feed the same input into the aggregates, so their aggregate states will most likely be different. Perhaps you could post an example where you think such reuse would be useful. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Michail Nikolaev wrote: > > I understand that the RR snapshot is used to check the MVCC behaviour, > > however > > this comment seems to indicate that the RR snapshot should also prevent the > > standby from setting the hint bits. > > # Make sure previous queries not set the hints on standby because > > # of RR snapshot > > I can imagine that on the primary, but I don't think that the backend that > > checks visibility on standby does check other snapshots/backends. And it > > didn't work when I ran the test manually, although I could have missed > > something. > > Yes, it checks - you could see ComputeXidHorizons for details. It is > the main part of the correctness of the whole feature. I added some > details about it to the test. Ah, ok. I thought that only KnownAssignedXids is used on standby, but that would ignore the RR snapshot. It wasn't clear to me when the xmin of the hot-standby backends is set; now I think it's done by GetSnapshotData(). > > * I can see no test for the INDEX_LP_DEAD_OK_MIN_LSN value of the > > IndexLpDeadAllowedResult enumeration. Shouldn't there be only two values, > > e.g. INDEX_LP_DEAD_OK and INDEX_LP_DEAD_MAYBE_OK ? Or a boolean variable > > (in > > index_fetch_heap()) of the appropriate name, e.g. kill_maybe_allowed, and > > rename the function is_index_lp_dead_allowed() to > > is_index_lp_dead_maybe_allowed()? > > Yes, this way it looks better. Done. Also, I have added some checks > for “maybe” LSN-related logic to the test. Attached is a proposal for a minor addition that would make sense to me; add it if you think it's appropriate.
I think I've said enough, changing the status to "ready for committer" :-) -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/src/test/recovery/t/027_standby_index_lp_dead.pl b/src/test/recovery/t/027_standby_index_lp_dead.pl
index 5a23eea052..e65000b6bd 100644
--- a/src/test/recovery/t/027_standby_index_lp_dead.pl
+++ b/src/test/recovery/t/027_standby_index_lp_dead.pl
@@ -7,7 +7,7 @@ use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
 
-plan tests => 29;
+plan tests => 30;
 
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
@@ -246,6 +246,7 @@ is(btp_safe_on_stanby($node_new_standby), qq(0), 'hint from FPI');
 
 # Make sure bits are set only if minRecoveryPoint > than index page LSN
 try_to_set_hint_bits($node_new_standby);
+is(hints_num($node_new_standby), qq(11), 'no new index hint bits are set on new standby');
 is(btp_safe_on_stanby($node_new_standby), qq(0), 'hint not marked as standby-safe');
 
 # Issue checkpoint on primary to update minRecoveryPoint on standby
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Michail Nikolaev wrote: > > * Is the purpose of the repeatable read (RR) snapshot to test that > > heap_hot_search_buffer() does not set deadness->all_dead if some > > transaction > > can still see a tuple of the chain? > > The main purpose is to test xactStartedInRecovery logic after the promotion. > For example - > > if (scan->xactStartedInRecovery && !RecoveryInProgress()) I understand that the RR snapshot is used to check the MVCC behaviour, however this comment seems to indicate that the RR snapshot should also prevent the standby from setting the hint bits. # Make sure previous queries not set the hints on standby because # of RR snapshot I can imagine that on the primary, but I don't think that the backend that checks visibility on standby checks other snapshots/backends. And it didn't work when I ran the test manually, although I could have missed something. A few more notes regarding the tests: * 026_standby_index_lp_dead.pl should probably be renamed to 027_standby_index_lp_dead.pl (026_* was created in the master branch recently) * The test fails, although I do have configured the build with --enable-tap-tests. BEGIN failed--compilation aborted at t/026_standby_index_lp_dead.pl line 5. t/026_standby_index_lp_dead.pl .. Dubious, test returned 2 (wstat 512, 0x200) I suspect the testing infrastructure changed recently. * Messages like this is(hints_num($node_standby_1), qq(10), 'hints are set on standby1 because FPI but marked as non-safe'); say that the hints are "marked as non-safe", but the hints_num() function does not seem to check that. 
* wording: is(hints_num($node_standby_2), qq(10), 'index hint bits already set on second standby 2'); -> is(hints_num($node_standby_2), qq(10), 'index hint bits already set on standby 2'); And a few more notes on the code: * There's an extra semicolon in mask_lp_dead(): bufmask.c:148:38: warning: for loop has empty body [-Wempty-body] offnum = OffsetNumberNext(offnum)); ^ bufmask.c:148:38: note: put the semicolon on a separate line to silence this warning * the header comment of heap_hot_search_buffer() still says "*all_dead" whereas I'd expect "->all_dead". The same for "*page_lsn". * I can see no test for the INDEX_LP_DEAD_OK_MIN_LSN value of the IndexLpDeadAllowedResult enumeration. Shouldn't there be only two values, e.g. INDEX_LP_DEAD_OK and INDEX_LP_DEAD_MAYBE_OK? Or a boolean variable (in index_fetch_heap()) of the appropriate name, e.g. kill_maybe_allowed, and rename the function is_index_lp_dead_allowed() to is_index_lp_dead_maybe_allowed()? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: storing an explicit nonce
Sasasu wrote: > On 2021/10/6 23:01, Robert Haas wrote: > > This seems wrong to me. CTR requires that you not reuse the IV. If you > > re-encrypt the page with a different IV, torn pages are a problem. If > > you re-encrypt it with the same IV, then it's not secure any more. > for CBC if the IV is predictable it will cause a "dictionary attack". The following sounds like IV *uniqueness* is needed to defend against "known plaintext attack" ... > and for CBC and GCM reusing the IV will cause a "known plaintext attack". ... but here you seem to say that *randomness* is also necessary: > XTS works like CBC but adds a tweak step. the tweak step does not add > randomness. It means XTS still has "known plaintext attack", (I suppose you mean "XTS with incorrect (e.g. non-random) IV", rather than XTS as such.) > due to the same reason from CBC. According to Appendix C of https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf CBC requires *unpredictability* of the IV, but that does not necessarily mean randomness: the unpredictable IV can be obtained by applying the forward cipher function to a unique value. Can you please try to explain once again what you consider a requirement (uniqueness, randomness, etc.) on the IV for the XTS mode? Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: storing an explicit nonce
This reminds me of Joe Conway's response to my email earlier: https://www.postgresql.org/message-id/50335f56-041b-1a1f-59ea-b5f7bf917352%40joeconway.com In the document he recommended https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf specifically, in Appendix C I read: """ For the CBC and CFB modes, the IVs must be unpredictable. In particular, for any given plaintext, it must not be possible to predict the IV that will be associated to the plaintext in advance of the generation of the IV. There are two recommended methods for generating unpredictable IVs. The first method is to apply the forward cipher function, under the same key that is used for the encryption of the plaintext, to a nonce. The nonce must be a data block that is unique to each execution of the encryption operation. For example, the nonce may be a counter, as described in Appendix B, or a message number. The second method is to generate a random data block using a FIPS-approved random number generator. """ This is about modes that include CBC, while the document you refer to seems to deal with some other modes. So if we want to be confident that we use the XTS mode correctly, more research is probably needed. > Now, as I mentioned before, that particular case isn't something that > XTS is particularly good at and that's generally accepted, yet lots of > folks use XTS anyway because the concern isn't "someone has root access > on the box and is watching all block writes" but rather "laptop was > stolen" where the attacker doesn't get to see multiple writes where the > same key+tweak has been used, and the latter is really the attack vector > we're looking to address with XTS too. I've heard a few times that a database running in a cloud is also a valid use case for the TDE. In that case I think it should be expected that "someone has root access on the box and is watching all block writes". -- Antonin Houska Web: https://www.cybertec-postgresql.com
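The first of the two recommended methods quoted above - applying the forward cipher function to a unique nonce such as a counter - can be sketched as follows. This is a toy illustration only: hashlib stands in for the real block cipher E_K so the snippet stays self-contained, and the key and counter nonce are made up for the example; a real implementation must use the actual block cipher under the encryption key.

```python
import hashlib
from itertools import count

BLOCK = 16  # AES block size in bytes

def forward_cipher(key: bytes, block: bytes) -> bytes:
    # Toy stand-in for one application of the forward cipher function E_K
    # (e.g. a single AES block encryption); NOT a real cipher.
    return hashlib.sha256(key + block).digest()[:BLOCK]

def iv_generator(key: bytes):
    # Appendix C, method 1: the nonce is unique per encryption (a counter
    # here); passing it through E_K makes the resulting IV unpredictable
    # to anyone who does not know the key.
    for n in count():
        yield forward_cipher(key, n.to_bytes(BLOCK, "big"))

gen = iv_generator(b"k" * 16)
ivs = [next(gen) for _ in range(1000)]
# Every IV is block-sized and, because the nonces are unique, distinct.
```

The counter itself is predictable; the unpredictability comes entirely from the keyed forward cipher applied on top of it.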
Re: storing an explicit nonce
Bruce Momjian wrote: > On Tue, Oct 5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote: > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote: > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian wrote: > > > We are still working on our TDE patch. Right now the focus is on > > > refactoring > > > temporary file access to make the TDE patch itself smaller. Reconsidering > > > encryption mode choices given concerns expressed is next. Currently a > > > viable > > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an > > > issue with predictable IV and isn't totally broken in case of IV reuse. > > > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous > > 16-byte blocks affect later blocks, meaning that hint bit changes would > > also affect later blocks. I think this means we would need to write WAL > > full page images for hint bit changes to avoid torn pages. Right now > > hint bit (single bit) changes can be lost without causing torn pages. > > This was another of the advantages of using a stream cipher like CTR. > > The above text isn't very clear. What I am saying is that currently > torn pages can be tolerated by hint bit writes because only a single > byte is changing. If we use a block cipher like AES-XTS, later 16-byte > encrypted blocks would be changed by hint bit changes, meaning torn > pages could not be tolerated. This means we would have to use full page > writes for hint bit changes, perhaps making this feature have > unacceptable performance overhead. IIRC, in the XTS scheme, a change of a single byte in the 16-byte block causes the whole encrypted block to be different after the next encryption, however the following blocks are not affected. CBC (cipher-block chaining) is the mode where the change in one block does affect the encryption of the following block. I'm not sure if this fact is important from the hint bit perspective though. 
It would be an important difference if there was a guarantee that the 16-byte blocks are consistent even on a torn page - does e.g. proper alignment of pages guarantee that? Nevertheless, the absence of chaining may be a reason to prefer XTS to CBC anyway. -- Antonin Houska Web: https://www.cybertec-postgresql.com
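To make the chaining difference discussed above concrete, here is a toy sketch. Nothing in it is real AES: a hash-based stand-in plays the block cipher, and the key, IV and tweak derivation are invented for the example. It shows that flipping one bit (a "hint bit") in the second 16-byte block changes every subsequent CBC ciphertext block, while a per-block-tweak (XTS-like) mode changes only the block that was touched.

```python
import hashlib

BLOCK = 16

def prf(key: bytes, data: bytes) -> bytes:
    # Toy stand-in for one block-cipher application (NOT a real cipher).
    return hashlib.sha256(key + data).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(key: bytes, iv: bytes, pt: bytes) -> list:
    out, prev = [], iv
    for i in range(0, len(pt), BLOCK):
        c = prf(key, xor(pt[i:i + BLOCK], prev))  # chaining step
        out.append(c)
        prev = c
    return out

def tweaked_encrypt(key: bytes, pt: bytes) -> list:
    # XTS-like sketch: each block mixed with its own tweak, no chaining.
    out = []
    for n, i in enumerate(range(0, len(pt), BLOCK)):
        tweak = prf(key, n.to_bytes(8, "big"))
        out.append(xor(prf(key, xor(pt[i:i + BLOCK], tweak)), tweak))
    return out

key, iv = b"k" * BLOCK, b"\x00" * BLOCK
pt1 = bytes(4 * BLOCK)          # four zero blocks
pt2 = bytearray(pt1)
pt2[BLOCK + 1] ^= 0x01          # flip one bit inside block 1
pt2 = bytes(pt2)

cbc_diff = [a != b for a, b in zip(cbc_encrypt(key, iv, pt1), cbc_encrypt(key, iv, pt2))]
xts_diff = [a != b for a, b in zip(tweaked_encrypt(key, pt1), tweaked_encrypt(key, pt2))]
print(cbc_diff)   # chaining propagates: [False, True, True, True]
print(xts_diff)   # only the touched block: [False, True, False, False]
```

In other words: with chaining, a torn page could leave any block after the modified one inconsistent; without chaining, only the modified block is at risk.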
Re: storing an explicit nonce
Robert Haas wrote: > On Tue, Oct 5, 2021 at 1:24 PM Robert Haas wrote: > > On Mon, Oct 4, 2021 at 10:00 PM Stephen Frost wrote: > > > I do want to point out, as I think I did when we discussed this but want > > > to be sure it's also captured here- I don't think that temporary file > > > access should be forced to be block-oriented when it's naturally (in > > > very many cases) sequential. To that point, I'm thinking that we need a > > > temp file access API through which various systems work that's > > > sequential and therefore relatively similar to the existing glibc, et > > > al, APIs, but by going through our own internal API (which more > > > consistently works with the glibc APIs and provides better error > > > reporting in the event of issues, etc) we can then extend it to work as > > > an encrypted stream instead. > > > > Regarding this, would it use block-oriented access on the backend? > > > > I agree that we need a better API layer through which all filesystem > > access is routed. One of the notable weaknesses of the Cybertec patch > > is that it has too large a code footprint, > > (sent too soon) > > ...precisely because PostgreSQL doesn't have such a layer. I'm just trying to make our changes to buffile.c less invasive. Or do you mean that this module should be reworked regardless of the encryption? > But I think ultimately we do want to encrypt and decrypt in blocks, so > if we create such a layer, it should expose byte-oriented APIs but > combine the actual I/Os somehow. That's also good for cutting down the > number of system calls, which is a benefit unto itself. -- Antonin Houska Web: https://www.cybertec-postgresql.com
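A minimal sketch of the layer Robert describes - byte-oriented writes on top, block-oriented physical I/O underneath - might look like this. All names and the 8 kB block size are hypothetical; this is not the buffile.c API, just an illustration of how small sequential writes can be combined into block-sized I/Os (where per-block encryption could be applied).

```python
import io

BLOCK_SIZE = 8192  # hypothetical, mirroring PostgreSQL's default BLCKSZ

class BlockBufferedWriter:
    """Accepts writes of any size, but touches the file only in full blocks."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.buf = bytearray()
        self.physical_writes = 0   # stands in for the system-call count

    def write(self, data: bytes):
        self.buf.extend(data)
        while len(self.buf) >= BLOCK_SIZE:
            self._write_block(bytes(self.buf[:BLOCK_SIZE]))
            del self.buf[:BLOCK_SIZE]

    def _write_block(self, block: bytes):
        # A per-block cipher call would go here, before the physical write.
        self.fileobj.write(block)
        self.physical_writes += 1

    def close(self):
        if self.buf:                       # flush the final partial block
            self._write_block(bytes(self.buf))
            self.buf.clear()

sink = io.BytesIO()
w = BlockBufferedWriter(sink)
for _ in range(1000):
    w.write(b"x" * 100)    # 1000 small byte-oriented writes
w.close()
print(w.physical_writes)   # 13: twelve full 8 kB blocks plus one partial block
```

The caller keeps the convenient sequential interface, while the number of physical writes drops from 1000 to 13, which is the "combine the actual I/Os" benefit.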
Re: POC: Cleaning up orphaned files using undo logs
Thomas Munro wrote: > On Wed, Sep 29, 2021 at 8:18 AM Antonin Houska wrote: > > I'm just trying to use the existing infrastructure: the effect of DROP TABLE > > also appears to be performed by the checkpointer. However I don't know why > > the > > unlinks need to be performed by the checkpointer. > > For DROP TABLE, we leave an empty file (I've been calling it a > "tombstone file") so that GetNewRelFileNode() won't let you reuse the > same relfilenode in the same checkpoint cycle. One reason is that > wal_level=minimal has a data-eating crash recovery failure mode if you > reuse a relfilenode in a checkpoint cycle. Interesting. Is the problem that REDO of the DROP TABLE command deletes the relfilenode which already contains the new data, but the new data cannot be recovered because (due to wal_level=minimal) it's not present in WAL? In this case I suppose that the checkpoint just ensures that the DROP TABLE won't be replayed during the next crash recovery. BTW, does the attached comment fix make sense to you? The corresponding code in InitSync() is /* * Create pending-operations hashtable if we need it. Currently, we need * it if we are standalone (not under a postmaster) or if we are a * checkpointer auxiliary process. */ if (!IsUnderPostmaster || AmCheckpointerProcess()) I suspect this is also related to the commit 7ff23c6d27. Thanks for your answer, I was considering adding you to CC :-) -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index 1c78581354..ae6c5ff8e4 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -563,7 +563,7 @@ RegisterSyncRequest(const FileTag *ftag, SyncRequestType type, if (pendingOps != NULL) { - /* standalone backend or startup process: fsync state is local */ + /* standalone backend or checkpointer process: fsync state is local */ RememberSyncRequest(ftag, type); return true; }
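The tombstone mechanism Thomas describes can be sketched like this. This is purely illustrative Python; the directory layout, numbers and function name are made up (the real logic lives in GetNewRelFileNode()): a dropped relation leaves a zero-length file behind, so the allocator refuses to hand out a relfilenode whose file still exists.

```python
import os
import tempfile

def allocate_relfilenode(directory: str, start: int) -> int:
    # Refuse any number whose file still exists on disk -- even an empty
    # tombstone left behind by DROP TABLE earlier in the checkpoint cycle.
    n = start
    while os.path.exists(os.path.join(directory, str(n))):
        n += 1
    return n

d = tempfile.mkdtemp()
open(os.path.join(d, "16384"), "w").close()  # tombstone of a dropped table
n = allocate_relfilenode(d, 16384)
print(n)  # 16385: the zero-length tombstone blocks reuse of 16384
```

The next checkpoint can then remove the tombstones, because after the checkpoint a crash recovery can no longer replay the DROP against a reused relfilenode.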
Re: POC: Cleaning up orphaned files using undo logs
Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Tue, Sep 21, 2021 at 10:07:55AM +0200, Dmitry Dolgov wrote: > > On Tue, 21 Sep 2021 09:00 Antonin Houska, wrote: > > > > > Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > > > > > Yep, makes sense, thanks. I have few more questions: > > > > > > > > * The use case with orphaned files is working somewhat differently after > > > > the rebase on the latest master, do you observe it as well? The > > > > difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up > > > > an orphaned relation file immediately (only later on checkpoint) > > > > because of empty pendingUnlinks. I haven't investigated more yet, but > > > > seems like after this commit: > > > > > > > > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 > > > > Author: Thomas Munro > > > > Date: Mon Aug 2 17:32:20 2021 +1200 > > > > > > > > Run checkpointer and bgwriter in crash recovery. > > > > > > > > Start up the checkpointer and bgwriter during crash recovery > > > (except in > > > > --single mode), as we do for replication. This wasn't done back > > > in > > > > commit cdd46c76 out of caution. Now it seems like a better idea > > > to make > > > > the environment as similar as possible in both cases. There may > > > also be > > > > some performance advantages. > > > > > > > > something has to be updated (pendingOps are empty right now, so no > > > > unlink request is remembered). > > > > > > I haven't been debugging that part recently, but yes, this commit is > > > relevant, > > > thanks for pointing that out! Attached is a patch that should fix it. I'll > > > include it in the next version of the patch series, unless you tell me > > > that > > > something is still wrong. > > > > > > > Sure, but I can take a look only in a couple of days. > > Thanks for the patch. > > Hm, maybe there is some misunderstanding. My question above was about > the changed behaviour, when orphaned files (e.g. 
created relation files > after the backend was killed) are removed only by checkpointer when it > kicks in. As far as I understand, the original intention was to do this > job right away, that's why SyncPre/PostCheckpoint was invoked. But the > recent changes around checkpointer make the current implementation > insufficient. > The patch you've proposed removes invocation of SyncPre/PostCheckpoint, > do I see it correctly? In this sense it doesn't change anything, except > removing non-functioning code of course. Yes, it sounds like a misunderstanding. I thought you were complaining about code which is no longer needed. The original intention was to make sure that the files are eventually unlinked. IIRC before the commit 7ff23c6d27 the calls SyncPre/PostCheckpoint were necessary because the checkpointer wasn't running that early during the startup. Without these calls the startup process would exit without doing anything. Sorry, I see now that the comment incorrectly says "... it seems simpler ...", but in fact it was necessary. > But the question, probably > reformulated from the more design point of view, stays the same — when > and by which process such orphaned files have to be removed? I've > assumed by removing right away the previous version was trying to avoid > any kind of thunder effects of removing too many at once, but maybe I'm > mistaken here. I'm just trying to use the existing infrastructure: the effect of DROP TABLE also appears to be performed by the checkpointer. However I don't know why the unlinks need to be performed by the checkpointer. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
Amit Kapila wrote: > On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska wrote: > > > > Amit Kapila wrote: > > > > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska wrote: > > > > > > The problem of the temporary undo log is that it's loaded into local > > > > buffers > > > > and that backend can exit w/o flushing local buffers to disk, and thus > > > > we are > > > > not guaranteed to find enough information when trying to discard the > > > > undo log > > > > the backend wrote. I'm thinking about the following solutions: > > > > > > > > 1. Let the backend manage temporary undo log on its own (even the slot > > > >metadata would stay outside the shared memory, and in particular the > > > >insertion pointer could start from 1 for each session) and remove the > > > >segment files at the same moment the temporary relations are removed. > > > > > > > >However, by moving the temporary undo slots away from the shared > > > > memory, > > > >computation of oldestFullXidHavingUndo (see the PROC_HDR structure) > > > > would > > > >be affected. It might seem that a transaction which only writes undo > > > > log > > > >for temporary relations does not need to affect > > > > oldestFullXidHavingUndo, > > > >but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > > > >prevents transactions to be truncated from the CLOG too early, I > > > > wonder if > > > >the following is possible (This scenario is only applicable to the > > > > zheap > > > >storage engine [1], which is not included in this patch, but should > > > > already > > > >be considered.): > > > > > > > >A transaction creates a temporary table, does some (many) changes > > > > and then > > > >gets rolled back. The undo records are being applied and it takes > > > > some > > > >time. Since XID of the transaction did not affect > > > > oldestFullXidHavingUndo, > > > >the XID can disappear from the CLOG due to truncation. 
> > > > > > > > > By above do you mean to say that in zheap code, we don't consider XIDs > > > that operate on temp table/undo for oldestFullXidHavingUndo? > > > > I was referring to the code > > > > /* We can't process temporary undo logs. */ > > if (log->meta.persistence == UNDO_TEMP) > > continue; > > > > in undodiscard.c:UndoDiscard(). > > > > Here, I think it will just skip undo of temporary undo logs and > oldestFullXidHavingUndo should be advanced after skipping it. Right, it'll be advanced, but the transaction XID (if the transaction wrote only to temporary relations) might still be needed. > > > > > > > However zundo.c in > > > >[1] indicates that the transaction status *is* checked during undo > > > >execution, so we might have a problem. > > > > > > > > > > It would be easier to follow if you can tell which exact code are you > > > referring here? > > > > I meant the call of TransactionIdDidCommit() in > > zundo.c:zheap_exec_pending_rollback(). > > > > IIRC, this should be called for temp tables after they have exited as > this is only to apply the pending undo actions if any, and in case of > temporary undo after session exit, we shouldn't need it. I see (had to play with the debugger a bit). Currently this works because the temporary relations are dropped by AbortTransaction() -> smgrDoPendingDeletes(), before the undo execution starts. The situation will change as soon as the file removal is also handled by the undo subsystem; however, I'm still not sure how to hit the TransactionIdDidCommit() call for an XID already truncated from the CLOG. I'm starting to admit that there's no issue here: temporary undo is always applied immediately in the foreground, and thus the zheap_exec_pending_rollback() function never needs to examine an XID which no longer exists in the CLOG. > I am not able to understand what exact problem you are facing for temp > tables after the session exit. Can you please explain it a bit more? 
The problem is that the temporary undo log is loaded into backend-local buffers. Thus there's no guarantee that we'll find consistent information in the undo file even if the backend exited cleanly (local buffers are not flushed at backend exit and there's n
Re: POC: Cleaning up orphaned files using undo logs
Amit Kapila wrote: > On Fri, Sep 24, 2021 at 4:44 PM Antonin Houska wrote: > > > > Amit Kapila wrote: > > > > > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska wrote: > > > > > > > > Amit Kapila wrote: > > > > > > > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthali...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > > > > > > > > > * What happened with the idea of abandoning discard worker for the > > > > > > sake > > > > > > of simplicity? From what I see limiting everything to foreground > > > > > > undo > > > > > > could reduce the core of the patch series to the first four > > > > > > patches > > > > > > (forgetting about test and docs, but I guess it would be enough at > > > > > > least for the design review), which is already less overwhelming. > > > > > > What we can miss, at least for the cleanup of the orphaned files, is > > > > the *undo > > > > worker*. In this patch series the cleanup is handled by the startup > > > > process. > > > > > > > > > > Okay, I think various people at different point of times has suggested > > > that idea. I think one thing we might need to consider is what to do > > > in case of a FATAL error? In case of FATAL error, it won't be > > > advisable to execute undo immediately, so would we upgrade the error > > > to PANIC in such cases. I remember vaguely that for clean up of > > > orphaned files that can happen rarely and someone has suggested > > > upgrading the error to PANIC in such a case but I don't remember the > > > exact details. > > > > Do you mean FATAL error during normal operation? > > > > Yes. > > > As far as I understand, even > > zheap does not rely on immediate UNDO execution (otherwise it'd never > > introduce the undo worker), so FATAL only means that the undo needs to be > > applied later so it can be discarded. 
> > > > Yeah, zheap either applies undo later via background worker or next > time before dml operation if there is a need. > > > As for the orphaned files cleanup feature with no undo worker, we might need > > PANIC to ensure that the undo is applied during restart and that it can be > > discarded, otherwise the unapplied undo log would stay there until the next > > (regular) restart and it would block discarding. However upgrading FATAL to > > PANIC just because the current transaction created a table seems quite > > rude. > > > > True, I guess but we can once see in what all scenarios it can > generate FATAL during that operation. By "that operation" you mean "CREATE TABLE"? It's not about FATAL during CREATE TABLE; rather, it's about FATAL anytime during a transaction. Whichever operation caused the FATAL error, we'd need to upgrade it to PANIC as long as the transaction has some undo. Although the postgres core probably does not raise FATAL errors too often (OOM conditions seem to be the typical cause), I'm still not enthusiastic about the idea that the undo feature turns such errors into PANIC. I wonder what the reason to avoid undoing transactions on FATAL is. If it's about possibly long duration of the undo execution, deletion of orphaned files (relations or whole databases) via undo shouldn't make things worse because currently FATAL also triggers this sort of cleanup immediately, it's just implemented in different ways. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
Amit Kapila wrote: > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska wrote: > > > > Amit Kapila wrote: > > > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthali...@gmail.com> > > > wrote: > > > > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > > > > > * What happened with the idea of abandoning discard worker for the sake > > > > of simplicity? From what I see limiting everything to foreground undo > > > > could reduce the core of the patch series to the first four patches > > > > (forgetting about test and docs, but I guess it would be enough at > > > > least for the design review), which is already less overwhelming. > > What we can miss, at least for the cleanup of the orphaned files, is the > > *undo > > worker*. In this patch series the cleanup is handled by the startup process. > > > > Okay, I think various people at different point of times has suggested > that idea. I think one thing we might need to consider is what to do > in case of a FATAL error? In case of FATAL error, it won't be > advisable to execute undo immediately, so would we upgrade the error > to PANIC in such cases. I remember vaguely that for clean up of > orphaned files that can happen rarely and someone has suggested > upgrading the error to PANIC in such a case but I don't remember the > exact details. Do you mean FATAL error during normal operation? As far as I understand, even zheap does not rely on immediate UNDO execution (otherwise it'd never introduce the undo worker), so FATAL only means that the undo needs to be applied later so it can be discarded. As for the orphaned files cleanup feature with no undo worker, we might need PANIC to ensure that the undo is applied during restart and that it can be discarded, otherwise the unapplied undo log would stay there until the next (regular) restart and it would block discarding. However upgrading FATAL to PANIC just because the current transaction created a table seems quite rude. 
So the undo worker might be needed even for this patch? Or do you mean FATAL error when executing the UNDO? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > Antonin Houska wrote: > > > > > Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > > > > * By throwing at the patchset `make installcheck` I'm getting from time > > > > to time > > > > and error on the restart: > > > > > > > > TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)", > > > > File: "undorecordset.c", Line: 1098, PID: 6055) > > > > > > > > From what I see XLogReadBufferForRedoExtended finds an invalid buffer > > > > and > > > > returns BLK_NOTFOUND. The commentary says: > > > > > > > > If the block was not found, then it must be discarded later in > > > > the WAL. > > > > > > > > and continues with skip = false, but fails to get a page from an > > > > invalid > > > > buffer few lines later. It seems that the skip flag is supposed to be > > > > used > > > > this situation, should it also guard the BufferGetPage part? > > > > > > I could see this sometime too, but can't reproduce it now. It's also not > > > clear > > > to me how XLogReadBufferForRedoExtended() can return BLK_NOTFOUND, as the > > > whole undo log segment is created at once, even if only part of it is > > > needed - > > > see allocate_empty_undo_segment(). > > > > I could eventually reproduce the problem. The root cause was that WAL > > records > > were created even for temporary / unlogged undo, and thus only empty pages > > could be found during replay. I've fixed that and also setup regular test > > for > > the BLK_NOTFOUND value. That required a few more fixes to UndoReplay(). > > > > Attached is a new version. > > Yep, makes sense, thanks. I have few more questions: > > * The use case with orphaned files is working somewhat differently after > the rebase on the latest master, do you observe it as well? 
The > difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up > an orphaned relation file immediately (only later on checkpoint) > because of empty pendingUnlinks. I haven't investigated more yet, but > seems like after this commit: > > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 > Author: Thomas Munro > Date: Mon Aug 2 17:32:20 2021 +1200 > > Run checkpointer and bgwriter in crash recovery. > > Start up the checkpointer and bgwriter during crash recovery (except > in > --single mode), as we do for replication. This wasn't done back in > commit cdd46c76 out of caution. Now it seems like a better idea to > make > the environment as similar as possible in both cases. There may also > be > some performance advantages. > > something has to be updated (pendingOps are empty right now, so no > unlink request is remembered). I haven't been debugging that part recently, but yes, this commit is relevant, thanks for pointing that out! Attached is a patch that should fix it. I'll include it in the next version of the patch series, unless you tell me that something is still wrong. -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/src/backend/access/undo/undorecordset.c b/src/backend/access/undo/undorecordset.c index 59eba7dfb6..9d05824141 100644 --- a/src/backend/access/undo/undorecordset.c +++ b/src/backend/access/undo/undorecordset.c @@ -2622,14 +2622,6 @@ ApplyPendingUndo(void) } } - /* - * Some undo actions may unlink files. Since the checkpointer is not - * guaranteed to be up, it seems simpler to process the undo request - * ourselves in the way the checkpointer would do. - */ - SyncPreCheckpoint(); - SyncPostCheckpoint(); - /* Cleanup. */ chunktable_destroy(sets); }
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Michail Nikolaev wrote: > Hello. > > Added a check for standby promotion with the long transaction to the > test (code and docs are unchanged). I'm trying to continue the review, sorry for the delay. Following are a few questions about the code: * Does the masking need to happen in the AM code, e.g. _bt_killitems()? I'd expect that the RmgrData.rm_fpi_mask can do all the work. Maybe you're concerned about clearing the "LP-safe-on-standby" bits after promotion, but I wouldn't consider this a problem: once the standby is allowed to set the hint bits (i.e. minRecoveryPoint is high enough, see IsIndexLpDeadAllowed() -> XLogNeedsFlush()), promotion shouldn't break anything because it should not allow minRecoveryPoint to go backwards. * How about modifying rm_mask() instead of introducing rm_fpi_mask()? Perhaps a boolean argument can be added to distinguish the purpose of the masking. * Are you sure it's o.k. to use mask_lp_flags() here? It sets the item flags to LP_UNUSED unconditionally, which IMO should only be done by VACUUM. I think you only need to revert the effect of prior ItemIdMarkDead(), so you only need to change the status LP_DEAD to LP_NORMAL if the tuple still has storage. (And maybe add an assertion to ItemIdMarkDead() confirming that it's only used for LP_NORMAL items?) As far as I understand, the current code only uses mask_lp_flags() during WAL consistency check on copies of pages which don't eventually get written to disk. * IsIndexLpDeadAllowed() ** is bufmgr.c the best location for this function? ** the header comment should explain the minLsn argument. ** comment /* It is always allowed on primary if *all_dead. */ should probably be /* It is always allowed on primary if ->all_dead. */ * comment: XLOG_HEAP2_CLEAN has been renamed to XLOG_HEAP2_PRUNE in PG14. 
On regression tests: * Is the purpose of the repeatable read (RR) snapshot to test that heap_hot_search_buffer() does not set deadness->all_dead if some transaction can still see a tuple of the chain? If so, I think the RR snapshot does not have to be used in the tests because this patch does not really affect the logic: heap_hot_search_buffer() only sets deadness->all_dead to false, just like it sets *all_dead in the current code. Besides that, IsIndexLpDeadAllowed() too can avoid setting of the LP_DEAD flag on an index tuple (at least until the commit record of the deleting/updating transaction gets flushed to disk), so it can hide the behaviour of heap_hot_search_buffer(). * Unless I miss something, the tests check that the hint bits are not propagated from primary (or they are propagated but marked non-safe), however there's no test to check that standby does set the hint bits itself. * I'm also not sure if promotion needs to be tested. What's specific about the promoted cluster from the point of view of this feature? The only thing I can think of is clearing of the "LP-safe-on-standby" bits, but, as I said above, I'm not sure if the tests ever let standby to set those bits before the promotion. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
Amit Kapila wrote: > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska wrote: > > The problem of the temporary undo log is that it's loaded into local buffers > > and that backend can exit w/o flushing local buffers to disk, and thus we > > are > > not guaranteed to find enough information when trying to discard the undo > > log > > the backend wrote. I'm thinking about the following solutions: > > > > 1. Let the backend manage temporary undo log on its own (even the slot > >metadata would stay outside the shared memory, and in particular the > >insertion pointer could start from 1 for each session) and remove the > >segment files at the same moment the temporary relations are removed. > > > >However, by moving the temporary undo slots away from the shared memory, > >computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > >be affected. It might seem that a transaction which only writes undo log > >for temporary relations does not need to affect oldestFullXidHavingUndo, > >but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > >prevents transactions to be truncated from the CLOG too early, I wonder > > if > >the following is possible (This scenario is only applicable to the zheap > >storage engine [1], which is not included in this patch, but should > > already > >be considered.): > > > >A transaction creates a temporary table, does some (many) changes and > > then > >gets rolled back. The undo records are being applied and it takes some > >time. Since XID of the transaction did not affect > > oldestFullXidHavingUndo, > >the XID can disappear from the CLOG due to truncation. > > > > By above do you mean to say that in zheap code, we don't consider XIDs > that operate on temp table/undo for oldestFullXidHavingUndo? I was referring to the code /* We can't process temporary undo logs. */ if (log->meta.persistence == UNDO_TEMP) continue; in undodiscard.c:UndoDiscard(). 
> > > However zundo.c in > >[1] indicates that the transaction status *is* checked during undo > >execution, so we might have a problem. > > > > It would be easier to follow if you can tell which exact code are you > referring here? I meant the call of TransactionIdDidCommit() in zundo.c:zheap_exec_pending_rollback(). -- Antonin Houska Web: https://www.cybertec-postgresql.com
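The race discussed in this exchange can be sketched as a toy model (hypothetical names and values, not PostgreSQL code): if an XID that still has pending temporary undo does not hold back oldestFullXidHavingUndo, CLOG truncation may discard the transaction's status before the pending rollback looks it up via TransactionIdDidCommit().

```python
# Toy model of the CLOG-truncation race (illustrative only, not PostgreSQL code).

clog = {100: "aborted"}        # XID 100 rolled back; status still in CLOG
oldest_xid_having_undo = 200   # XID 100 (temp undo only) was NOT counted here

def truncate_clog(cutoff):
    """CLOG entries below the cutoff are assumed to be no longer needed."""
    for xid in list(clog):
        if xid < cutoff:
            del clog[xid]

def did_commit(xid):
    """Stand-in for TransactionIdDidCommit() during undo execution."""
    if xid not in clog:
        raise LookupError("XID %d already truncated from CLOG" % xid)
    return clog[xid] == "committed"

truncate_clog(oldest_xid_having_undo)

# The pending rollback of XID 100 now needs the status that is gone:
try:
    did_commit(100)
except LookupError as e:
    print(e)   # XID 100 already truncated from CLOG
```

The point is only that whatever mechanism tracks the oldest XID having undo must also account for temporary undo, or the status lookup during undo execution can fail.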
Re: POC: Cleaning up orphaned files using undo logs
Amit Kapila wrote: > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > * What happened with the idea of abandoning discard worker for the sake > > of simplicity? From what I see limiting everything to foreground undo > > could reduce the core of the patch series to the first four patches > > (forgetting about test and docs, but I guess it would be enough at > > least for the design review), which is already less overwhelming. > > > > I think the discard worker would be required even if we decide to > apply all the undo in the foreground. We need to forget/remove the > undo of committed transactions as well which we can't remove > immediately after the commit. I think I proposed foreground discarding at some point, but you reminded me that the undo may still be needed for some time even after transaction commit. Thus the discard worker is indispensable. What we can do without, at least for the cleanup of the orphaned files, is the *undo worker*. In this patch series the cleanup is handled by the startup process. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
Antonin Houska wrote: > tsunakawa.ta...@fujitsu.com wrote: > > > I'm crawling like a snail to read the patch set. Below are my first set of > > review comments, which are all minor. > > Thanks. I've added the patch to the upcoming CF [1], so it possibly gets more review and makes some progress. I've marked myself as the author so it's clear who will try to respond to the reviews. It's clear that other people did much more work on the feature than I did so far - they are welcome to add themselves to the author list. [1] https://commitfest.postgresql.org/33/3228/ -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: decoupling table and index vacuum
Andres Freund wrote: > On 2021-04-21 11:21:31 -0400, Robert Haas wrote: > > This scheme adds a lot of complexity, which is a concern, but it seems > > to me that it might have several benefits. One is concurrency. You > > could have one process gathering dead TIDs and adding them to the > > dead-TID fork while another process is vacuuming previously-gathered > > TIDs from some index. > > I think it might even open the door to using multiple processes > gathering dead TIDs for the same relation. I think the possible concurrency improvements are themselves a valid reason to do the decoupling. Or rather it's hard to imagine how the current implementation of VACUUM can get parallel workers involved in gathering the dead heap TIDs efficiently. Currently, a single backend gathers the heap TIDs, and it can then launch several parallel workers to remove the TIDs from indexes. If parallel workers gathered the heap TIDs, then (w/o the decoupling) the parallel index processing would be a problem because a parallel worker cannot launch other parallel workers. > > In fact, every index could be getting vacuumed at the same time, and > > different indexes could be removing different TID ranges. > > We kind of have this feature right now, due to parallel vacuum... -- Antonin Houska Web: https://www.cybertec-postgresql.com
Possible optimization of heap_lock_tuple()
It seems that a concurrent UPDATE can restart heap_lock_tuple() even if it's not necessary. Is the attached proposal correct and worth applying? -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 6ac07f2fda..c3310ab4f9 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -4553,7 +4553,7 @@ l3: */ if (!HeapTupleHeaderIsOnlyLocked(tuple->t_data) && ((tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED) || - !updated)) + (!updated && follow_updates))) goto l3; /* Things look okay, so we can skip sleeping */
Re: storing an explicit nonce
Bruce Momjian wrote: > On Tue, May 25, 2021 at 04:48:21PM -0700, Andres Freund wrote: > > Hi, > > > > On 2021-05-25 17:29:03 -0400, Bruce Momjian wrote: > > > So, let me ask --- I thought CTR basically took an encrypted stream of > > > bits and XOR'ed them with the data. If that is true, then why are > > > changing hint bits a problem? We already can see some of the bit stream > > > by knowing some bytes of the page. > > > > A *single* reuse of the nonce in CTR reveals nearly all of the > > plaintext. As you say, the data is XORed with the key stream. Reusing > > the nonce means that you reuse the key stream. Which in turn allows you > > to do: > > (data ^ stream) ^ (data' ^ stream) > > which can be simplified to > > (data ^ data') > > thereby leaking all of data except the difference between data and > > data'. That's why it's so crucial to ensure that stream *always* differs > > between two rounds of encrypting "related" data. > > > > We can't just "hope" that data doesn't change and use CTR. > > My point was about whether we need to change the nonce, and hence > WAL-log full page images if we change hint bits. If we don't and > reencrypt the page with the same nonce, don't we only expose the hint > bits? I was not suggesting we avoid changing the nonce in non-hint-bit > cases. > > I don't understand your computation above. You decrypt the page into > shared buffers, you change a hint bit, and rewrite the page. You are > re-XOR'ing the buffer copy with the same key and nonce. Doesn't that > only change the hint bits in the new write? The way I view things is that CTR mode encrypts each individual bit independently of any other bit on the page. For non-hint bits data = data', so (data ^ data') is always zero, regardless of the actual values of the data. So I agree with you that by reusing the nonce we only expose the hint bits. -- Antonin Houska Web: https://www.cybertec-postgresql.com
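The XOR argument quoted above is easy to demonstrate. The sketch below (hypothetical page contents, not PostgreSQL code) shows that encrypting two versions of a page with the same CTR keystream leaks exactly data ^ data', which is zero everywhere except the bits that actually changed - i.e. only the flipped hint bits are exposed:

```python
import os

def ctr_xor(data: bytes, keystream: bytes) -> bytes:
    # CTR mode boils down to XORing the plaintext with a key/nonce stream.
    return bytes(d ^ s for d, s in zip(data, keystream))

keystream = os.urandom(8)        # same key + same nonce => same stream

page = b"\x00\x01\x02\x03\x04\x05\x06\x07"
page2 = bytes([page[0] ^ 0x10]) + page[1:]   # flip one "hint bit"

c1 = ctr_xor(page, keystream)
c2 = ctr_xor(page2, keystream)

# XORing the two ciphertexts cancels the keystream entirely:
leak = bytes(a ^ b for a, b in zip(c1, c2))
assert leak == bytes(a ^ b for a, b in zip(page, page2))
print(leak.hex())   # 1000000000000000 -- only the changed bit is revealed
```

This also illustrates Andres's stronger point: if the two page versions differ in more than hint bits, every differing bit becomes visible to an attacker who captures both ciphertexts.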
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Michail Nikolaev wrote: > After some correspondence with Peter Geoghegan (1) and his ideas, I > have reworked the patch a lot and now it is much more simple with even > better performance (no new WAL or conflict resolution, hot standby > feedback is unrelated). My review that started in [1] continues here. (Please note that code.patch does not apply to the current master branch.) I think I understand your approach now and couldn't find a problem by reading the code. What I consider worth improving is documentation, both code comments and nbtree/README. Especially for the problem discussed in [1] it should be explained what would happen if kill_prior_tuple_min_lsn was not checked. Also, in IsIndexLpDeadAllowed() you say that invalid deadness->latest_removed_xid means the following: /* * Looks like it is tuple cleared by heap_page_prune_execute, * we must be sure if LSN of XLOG_HEAP2_CLEAN (or any subsequent * updates) less than minRecoveryPoint to avoid MVCC failure * after crash recovery. */ However I think there's one more case: if heap_hot_search_buffer() considers all tuples in the chain to be "surely dead", but HeapTupleHeaderAdvanceLatestRemovedXid() skips them all for this reason: /* * Ignore tuples inserted by an aborted transaction or if the tuple was * updated/deleted by the inserting transaction. * * Look for a committed hint bit, or if no xmin bit is set, check clog. */ I think that the dead tuples produced this way should never be visible on the standby (and even if they were, they would change the page LSN so your algorithm would treat them correctly) so I see no correctness problem. But it might be worth explaining better the meaning of invalid "latest_removed_xid" in comments. In the nbtree/README, you say "... if the commit record of latestRemovedXid is more ..." but it's not clear to me what "latestRemovedXid" is. If you mean the scan->kill_prior_tuple_min_lsn field, you probably need more words to explain it. 
* IsIndexLpDeadAllowed()

/* It all always allowed on primary if *all_dead. */

should probably be

/* It is always allowed on primary if *all_dead. */

* gistkillitems()

As the function is only called if (so->numKilled > 0), I think both the "killedsomething" and "dirty" variables should always have the same value, so one variable should be enough. Assert(so->numKilled) would be appropriate in that case. The situation is similar for btree and hash indexes.

doc.patch:

"+applying the fill page write."

[1] https://www.postgresql.org/message-id/61470.1620647290%40antos

-- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Michail Nikolaev wrote: > > Sorry, I missed the fact that your example can be executed inside BEGIN - > > END > > block, in which case minRecoveryPoint won't advance after each command. > > No, the block is not executed as a single transaction, all commands > are separated transactions (see below) > > > Actually I think that a commit record should be replayed > > more often than XLOG_RUNNING_XACTS, shouldn't it? > > Yes, but replaying commit records DOES NOT affect minRecoveryPoint in > almost all cases. > > UpdateMinRecoveryPoint is called by XLogFlush, but xact_redo_commit > calls XLogFlush only in two cases: > * DropRelationFiles is called (some relation are dropped) > * If ForceSyncCommit was used on primary - few “heavy” commands, like > DropTableSpace, CreateTableSpace, movedb, etc. > > But “regular” commit record is replayed without XLogFlush and, as > result, without UpdateMinRecoveryPoint. OK, I missed this. Thanks for the explanation. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Michail Nikolaev wrote: > Hello, Antonin. > > > I'm trying to review the patch, but not sure if I understand this problem, > > please see my comment below. > > Thanks a lot for your attention. It is strongly recommended to look at > version N3 (1) because it is a much more elegant, easy, and reliable > solution :) But the minRecoveryPoint-related issue affects it anyway. Indeed I'm reviewing (1), but I wanted to discuss this particular question in context, so I replied here. > > Why doesn't minRecoveryPoint get updated to 20? IMO that should happen by > > replaying the commit record. And if the standby happens to crash before the > > commit record could be replayed, no query should see the deletion and thus > > no > > hint bit should be set in the index. > > minRecoveryPoint is not affected by replaying the commit record in > most cases. It is updated in a lazy way, something like this: > minRecoveryPoint = max LSN of flushed page. Version 3 of a patch > contains a code_optional.patch to move minRecoveryPoint more > aggressively to get additional performance on standby (based on > Peter’s answer in (2). > So, “minRecoveryPoint will go here” is not because of “STANDBY INSERTS > NEW ROW IN INDEX” it is just a random event. > Michail. Sorry, I missed the fact that your example can be executed inside BEGIN - END block, in which case minRecoveryPoint won't advance after each command. 
I'll continue my review by replying to (1)

> [1]:
> https://www.postgresql.org/message-id/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH%3DcMXZSf%2Bnzvg%40mail.gmail.com
> [2]:
> https://www.postgresql.org/message-id/CAH2-WzkSUcuFukhJdSxHFgtL6zEQgNhgOzNBiTbP_4u%3Dk6igAg%40mail.gmail.com
> (“Also, btw, do you know any reason to keep minRecoveryPoint at a low
> value?”)

I'm not an expert in this area (I'm reviewing this patch also to learn more about recovery and replication), but after brief research I think that postgres tries not to update the control file too frequently, see the comments in UpdateMinRecoveryPoint(). I don't know if what you do in code_optional.patch would be a problem. Actually, I think that a commit record should be replayed more often than XLOG_RUNNING_XACTS, shouldn't it?

-- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
I'm trying to review the patch, but not sure if I understand this problem, please see my comment below. Michail Nikolaev wrote: > Oh, I just realized that it seems like I was too naive to allow > standby to set LP_DEAD bits this way. > There is a possible consistency problem in the case of low > minRecoveryPoint value (because hint bits do not move PageLSN > forward). > > Something like this: > > LSN=10 STANDBY INSERTS NEW ROW IN INDEX (index_lsn=10) > <---minRecoveryPoint will go here > LSN=20 STANDBY DELETES ROW FROM HEAP, INDEX UNTACHED (index_lsn=10) Why doesn't minRecoveryPoint get updated to 20? IMO that should happen by replaying the commit record. And if the standby happens to crash before the commit record could be replayed, no query should see the deletion and thus no hint bit should be set in the index. >REPLICA SCANS INDEX AND SET hint bits (index_lsn=10) >INDEX IS FLUSHED (minRecoveryPoint=index_lsn=10) >CRASH > > On crash recovery, a standby will be able to handle queries after > LSN=10. But the index page contains hints bits from the future > (LSN=20). > So, need to think here. -- Antonin Houska Web: https://www.cybertec-postgresql.com
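The scenario in the message above can be sketched as a toy model (illustrative Python, not PostgreSQL code; the LSN values follow the example): because hint-bit-only changes do not advance the page LSN, flushing such a page does not push minRecoveryPoint past the heap deletion that justified the hint bit.

```python
# Toy model: minRecoveryPoint only advances when a page flush observes a
# higher page LSN, and hint-bit-only changes leave the page LSN untouched.

min_recovery_point = 0

def flush(page):
    global min_recovery_point
    min_recovery_point = max(min_recovery_point, page["lsn"])

index_page = {"lsn": 10, "lp_dead": False}  # LSN=10: row inserted into index
heap_delete_lsn = 20                        # LSN=20: heap row deleted, index untouched

# A scan sees the dead heap tuple and sets the index hint bit;
# the page LSN is deliberately not advanced for hint-bit-only changes.
index_page["lp_dead"] = True

flush(index_page)      # minRecoveryPoint becomes 10, not 20
# ... crash here ...

# Crash recovery declares the standby consistent at minRecoveryPoint=10,
# yet the flushed hint bit depends on a deletion at LSN=20:
assert min_recovery_point == 10 and heap_delete_lsn > min_recovery_point
print("hint bit from the future:", heap_delete_lsn > min_recovery_point)
```

In this model the persisted hint bit reflects a deletion that recovery may not have replayed yet, which is exactly the consistency concern raised above.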
Re: Built-in connection pooler
Konstantin Knizhnik wrote:

> People asked me to resubmit built-in connection pooler patch to commitfest.
> Rebased version of connection pooler is attached.

I've reviewed the patch but haven't read the entire thread thoroughly. I hope that I don't duplicate many comments posted earlier by others. (Please note that the patch does not apply to the current master, I had to reset the head of my repository to the appropriate commit.)

Documentation / user interface
--

* session_pool_size (config.sgml)

I wonder if "The default value is 10, so up to 10 backends will serve each database," should rather be "The default value is 10, so up to 10 backends will serve each database/user combination." However, when I read the code, I think that each proxy counts the size of the pool separately, so the total number of backends used for a particular database/user combination seems to be

session_pool_size * connection_proxies

Since the feature uses shared memory statistics anyway, shouldn't it only count the total number of backends per database/user? It would need some locking, but the actual pools (hash tables) could still be local to the proxy processes.

* connection_proxies

(I've noticed that Ryan Lambert questioned this variable upthread.) I think this variable makes the configuration less straightforward from the user perspective. Cannot the server launch additional proxies dynamically, as needed, e.g. based on the shared memory statistics that the patch introduces? I see that postmaster would have to send the sockets in a different way. How about adding a "proxy launcher" process that would take care of the scheduling and launching of new proxies?

* multitenant_proxy

I thought the purpose of this setting is to reduce the number of backends needed, but could not find evidence in the code. In particular, client_attach() always retrieves the backend from the appropriate pool, and backend_reschedule() does so as well. Thus the role of both client and backend should always match.
What piece of information am I missing?

* typo (2 occurrences in config.sgml)

"stanalone" -> "standalone"

Design / coding
---

* proxy.c:backend_start() does not change the value of the "host" parameter to the socket directory, so I assume the proxy connects to the backend via the TCP protocol. I think the unix socket should be preferred for this connection if the platform has it, however:

* is libpq necessary for the proxy to connect to the backend at all? Maybe postgres.c:ReadCommand() can be adjusted so that the backend can communicate with the proxy just via the plain socket. I don't like the idea of server components communicating via libpq (do we need anything else of the libpq connection than the socket?) as such, but especially these includes in proxy.c look weird:

#include "../interfaces/libpq/libpq-fe.h"
#include "../interfaces/libpq/libpq-int.h"

* How does the proxy recognize connections to the walsender? I haven't tested that, but it's obvious that these connections should not be proxied.

* ConnectionProxyState is in shared memory, so access to its fields should be synchronized.

* StartConnectionProxies() is only called from PostmasterMain(), so I'm not sure the proxies get restarted after a crash. Perhaps PostmasterStateMachine() needs to call it too, after calling StartupDataBase().

* Why do you need the Channel.magic integer field? Wouldn't a boolean field "active" be sufficient?

** In proxy_loop(), I've noticed (chan->magic == ACTIVE_CHANNEL_MAGIC) tests inside the branch

else if (chan->magic == ACTIVE_CHANNEL_MAGIC)

Since neither channel_write() nor channel_read() seems to change the value, I think those tests are not necessary.

* Comment lines are often too long.

* pgindent should be applied to the patch at some point.

I can spend more time reviewing the patch during the next CF.

-- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: WIP: Aggregation push-down
David Steele wrote: > On 7/3/20 6:07 AM, Laurenz Albe wrote: > > On Thu, 2020-07-02 at 14:39 +0200, Daniel Gustafsson wrote: > >> This version now fails to apply to HEAD, with what looks like like a > >> trivial > >> error in the expected test output. Can you please submit a rebased > >> version so > >> we can see it run in the patch tester CI? I'm marking the entry as > >> Waiting on > >> Author in the meantime. > > > > I have rebased the patch against current HEAD, it passes "make > > installcheck". > > This needs another rebase so marked as Waiting on Author: > http://cfbot.cputube.org/patch_32_1247.log Besides the fact that I don't have time to work on it now, I'm not able to convince myself to put more effort into it. If there has been almost no progress for several years, I don't believe further rebases are anything but wasted effort. So I've withdrawn the patch, also to save CFMs from examining the history again and again uselessly. The code is free so anyone can continue if he thinks it makes sense. If it finally finds its way into the PG core and if a meaningful part of my code remains, I'd appreciate it if I was mentioned in the release notes. Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: POC: Cleaning up orphaned files using undo logs
tsunakawa.ta...@fujitsu.com wrote: > I'm crawling like a snail to read the patch set. Below are my first set of > review comments, which are all minor. Thanks. I will check your comments when I'll be preparing the next version of the patch. -- Antonin Houska Web: https://www.cybertec-postgresql.com
pg_config_h.in not up-to-date
When I run "autoreconf" on the master branch, git generates the diff below. Shouldn't it just be applied? I suppose someone changed configure.ac and forgot to update the generated file. -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 55cab4d2bf..806eabac81 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -899,7 +899,7 @@ /* Define to select named POSIX semaphores. */ #undef USE_NAMED_POSIX_SEMAPHORES -/* Define to build with OpenSSL support. (--with-ssl=openssl) */ +/* Define to 1 if you have OpenSSL support. */ #undef USE_OPENSSL /* Define to 1 to build with PAM support. (--with-pam) */
Re: POC: Cleaning up orphaned files using undo logs
Bruce Momjian wrote: > On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote: > > Antonin Houska wrote: > > > Well, on repeated run of the test I could also hit the first one. I could > > > fix > > > it and will post a new version of the patch (along with some other small > > > changes) this week. > > > > Attached is the next version. Changes done: > > Yikes, this patch is 23k lines, and most of it looks like added lines of > code. Is this size expected? Yes, earlier versions of this patch, e.g. [1], were of comparable size. It's not really an "ordinary patch". [1] https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com