backtrace_functions emits trace for any elog

2023-09-04 Thread Ilya Gladyshev
Hi,
I used backtrace_functions to debug one of my ideas and found its behavior 
counter-intuitive and contradictory to its own docs. I think the GUC is supposed 
to dump a backtrace only on elog(ERROR) (should it also be done for higher 
levels? not sure about this), but in fact it does so for any log level. I have 
attached a patch that checks the log level before attaching the backtrace.
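
For reference, the change amounts to one extra condition in the place where
errfinish() collects the backtrace; something along these lines (a sketch
against src/backend/utils/error/elog.c, the attached patch is authoritative):

/* Collect backtrace only for elevel >= ERROR, if enabled */
if (!edata->backtrace &&
	edata->elevel >= ERROR &&		/* new check */
	edata->funcname &&
	backtrace_functions &&
	matches_backtrace_functions(edata->funcname))
	set_backtrace(edata, 2);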

Regards,
Ilya


0001-check-elevel-before-attaching-backtrace.patch
Description: Binary data


Re: Progress report of CREATE INDEX for nested partitioned tables

2023-03-16 Thread Ilya Gladyshev



> On 16 Mar 2023, at 04:07, Justin Pryzby wrote:
> 
> On Tue, Mar 14, 2023 at 06:58:14PM +0400, Ilya Gladyshev wrote:
>>> The only change from the current patch is (3).  (1) still calls
>>> count_leaf_partitions(), but only once.  I'd prefer that to rearranging
>>> the progress reporting to set the TOTAL in ProcessUtilitySlow().
>> 
>> As for reusing a TOTAL calculated outside of DefineIndex: as far as I can 
>> see, ProcessUtilitySlow is not the only call site for DefineIndex (although I 
>> don’t know whether all of them need progress tracking); for instance, ALTER 
>> TABLE calls DefineIndex to create indexes for constraints. So I feel like 
>> rearranging the progress reporting would result in unnecessary code 
>> duplication at those call sites, and passing in an optional parameter seems 
>> easier here, if we are going to optimize it after all. Especially if 
>> back-patching is a non-issue.
> 
> Yeah.  See attached.  I don't like duplicating the loop.  Is this really
> the right direction to go ?
> 
> I haven't verified if the child tables are locked in all the paths which
> would call count_leaf_partitions().  But why is it important to lock
> them for this?  If they weren't locked before, that'd be a pre-existing
> problem...
> <0001-fix-CREATE-INDEX-progress-report-with-nested-partiti.patch>

I’m not sure what the general policy on locking is, but I have checked ALTER 
TABLE ADD INDEX, and all the partitions seem to be locked on the first entry to 
DefineIndex there. All other call sites pass in the parentIndexId, which means 
the progress tracking machinery will not be initialized, so I think we don’t 
need to do locking in count_leaf_partitions(). 

The approach in the patch looks good to me. Some nitpicks on the patch: 
1. There’s an unnecessary second call to get_rel_relkind in ProcessUtilitySlow; 
we can just use what’s already in the relkind variable.
2. We can also combine the else and if to have one less level of nesting, like 
this:
+   else if (!RELKIND_HAS_PARTITIONS(child_relkind))

3. There was a part of the comment saying "If the index was built by calling 
DefineIndex() recursively, the called function is responsible for updating the 
progress report for built indexes."; I think it is still useful to keep it 
there.






Re: Progress report of CREATE INDEX for nested partitioned tables

2023-03-14 Thread Ilya Gladyshev



> On 14 Mar 2023, at 18:34, Justin Pryzby wrote:
> 
> On Mon, Mar 13, 2023 at 10:42:59AM -0400, Tom Lane wrote:
>> Justin Pryzby  writes:
>>> On Sun, Mar 12, 2023 at 06:25:13PM -0400, Tom Lane wrote:
 I agree that adding such a field to IndexStmt would be a very bad idea.
 However, adding another parameter to DefineIndex doesn't seem like a
 problem.
>> 
>>> It's a problem since this is a bug and it's desirable to backpatch a
>>> fix, right ?
>> 
>> I do not think this is important enough to justify a back-patch.
> 
> That's fine with me, but it comes as a surprise, and it might invalidate
> earlier discussions, which were working under the constraint of
> maintaining a compatible ABI.
> 
>>> Incrementing by 0 sounds terrible, since someone who has intermediate
>>> partitioned tables is likely to always see 0% done.
>> 
>> How so?  The counter will increase after there's some actual work done,
>> ie building an index.  If there's no indexes to build then it hardly
>> matters, because the command will complete in very little time.
> 
> I misunderstood your idea as suggesting to skip progress reporting for
> *every* intermediate partitioned index, and its leaves.  But I guess
> what you meant is to skip progress reporting when ATTACHing an
> intermediate partitioned index. That seems okay, since 1) intermediate
> partitioned tables are probably rare, and 2) ATTACH is fast, so the
> effect is indistinguishable from querying the progress report a few
> moments later.

+1 that counting attached partitioned indexes as 0 is fine. 

> The idea would be for:
> 1) TOTAL to show the number of direct and indirect leaf partitions;
> 2) update progress while building direct or indirect indexes;
> 3) ATTACHing intermediate partitioned tables to increment by 0;
> 4) ATTACHing a direct child should continue to increment by 1,
> since that common case already works as expected and shouldn't be
> changed.
> 
> The only change from the current patch is (3).  (1) still calls
> count_leaf_partitions(), but only once.  I'd prefer that to rearranging
> the progress reporting to set the TOTAL in ProcessUtilitySlow().
> 
> -- 
> Justin

As for reusing a TOTAL calculated outside of DefineIndex: as far as I can see, 
ProcessUtilitySlow is not the only call site for DefineIndex (although I don’t 
know whether all of them need progress tracking); for instance, ALTER TABLE 
calls DefineIndex to create indexes for constraints. So I feel like rearranging 
the progress reporting would result in unnecessary code duplication at those 
call sites, and passing in an optional parameter seems easier here, if we are 
going to optimize it after all. Especially if back-patching is a non-issue.





Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Ilya Gladyshev


> On 1 Feb 2023, at 20:27, Matthias van de Meent wrote:
> 
> On Wed, 1 Feb 2023 at 16:53, Justin Pryzby wrote:
>> 
>> On Wed, Feb 01, 2023 at 04:21:35PM +0100, Matthias van de Meent wrote:
>>> On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev wrote:
>>>>> On 1 Feb 2023, at 16:01, Alvaro Herrera wrote:
>>>>> Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
>>>>> lookup for every single element therein ... this sounds slow.
>>>>> 
>>>>> In one of the callsites, we already have the partition descriptor
>>>>> available.  We could just scan partdesc->is_leaf[] and add one for each
>>>>> 'true' value we see there.
>>>> 
>>>> The problem is that partdesc contains only direct children of the table 
>>>> and we need all the children down the inheritance tree to count the total 
>>>> number of leaf partitions in the first callsite.
>>>> 
>>>>> In the other callsite, we had the table open just a few lines before the
>>>>> place you call count_leaf_partitions.  Maybe we can rejigger things by
>>>>> examining its state before closing it: if relkind is not partitioned we
>>>>> know leaf_partitions=0, and only if partitioned we count leaf partitions.
>>>>> I think that would save some work.  I also wonder if it's worth writing
>>>>> a bespoke function for counting leaf partitions rather than relying on
>>>>> find_all_inheritors.
>>>> 
>>>> Sure, added this condition to avoid the extra work here.
>>>> 
>>> 
>>>>   When creating an index on a partitioned table, this column is set to
>>>> -   the total number of partitions on which the index is to be created.
>>>> +   the total number of leaf partitions on which the index is to be 
>>>> created or attached.
>>> 
>>> I think we should also add a note about the (now) non-constant nature
>>> of the value, something along the lines of "This value is updated as
>>> we're processing and discovering partitioned tables in the partition
>>> hierarchy".
>> 
>> But the TOTAL is constant, right ?  Updating the total when being called
>> recursively is the problem these patches fix.
> 
> If that's the case, then I'm not seeing the 'fix' part of the patch. I
> thought this patch was fixing the provably incorrect TOTAL value where
> DONE > TOTAL due to the recursive operation overwriting the DONE/TOTAL
> values instead of updating them.
> 
> In HEAD we set TOTAL to whatever number partitioned table we're
> currently processing has - regardless of whether we're the top level
> statement.
> With the patch we instead add the number of child relations to that
> count, for which REL_HAS_STORAGE(child) -- or at least, in the v3
> posted by Ilya. Approximately immediately after updating that count we
> recurse to the child relations, and that only returns once it is done
> creating the indexes, so both TOTAL and DONE go up as we process more
> partitions in the hierarchy.

The TOTAL in the patch is set only when processing the top-level parent and it 
is not updated when we recurse, so yes, it is constant. From v3:

@@ -1219,8 +1243,14 @@ DefineIndex(Oid relationId,
 			Relation	parentIndex;
 			TupleDesc	parentDesc;
 
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
-										 nparts);
+			if (!OidIsValid(parentIndexId))
+			{
+				int			total_parts;
+
+				total_parts = count_leaf_partitions(relationId);
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
+											 total_parts);
+			}


It is set to the total number of children on all levels of the hierarchy, not 
just the current one, so the total value doesn’t need to be updated later, 
because it is set to the correct value from the very beginning. 

It is the DONE counter that is updated, and when we attach an index of a 
partition that is itself a partitioned table (like a2 in your example, if it 
already had an index created), it will be updated by the number of children of 
the partition.

@@ -1431,9 +1463,25 @@ DefineIndex(Oid rela
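
To spell out the DONE side in code (names here are illustrative, not the
literal v3 hunk; count_leaf_partitions and pgstat_progress_incr_param come from
the patches in this thread): when an existing index of a sub-partitioned child
is attached rather than built, the counter advances by that child's whole leaf
count in one step:

if (child_index_attached)	/* hypothetical flag: a matching index existed */
{
	int			done_parts = 1;

	if (get_rel_relkind(child_relid) == RELKIND_PARTITIONED_TABLE)
		done_parts = count_leaf_partitions(child_relid);

	pgstat_progress_incr_param(PROGRESS_CREATEIDX_PARTITIONS_DONE,
							   done_parts);
}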

Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Ilya Gladyshev

> On 1 Feb 2023, at 16:01, Alvaro Herrera wrote:
> 
> Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
> lookup for every single element therein ... this sounds slow. 
> 
> In one of the callsites, we already have the partition descriptor
> available.  We could just scan partdesc->is_leaf[] and add one for each
> 'true' value we see there.

The problem is that partdesc contains only direct children of the table and we 
need all the children down the inheritance tree to count the total number of 
leaf partitions in the first callsite.

> In the other callsite, we had the table open just a few lines before the
> place you call count_leaf_partitions.  Maybe we can rejigger things by
> examining its state before closing it: if relkind is not partitioned we
> know leaf_partitions=0, and only if partitioned we count leaf partitions.
> I think that would save some work.  I also wonder if it's worth writing
> a bespoke function for counting leaf partitions rather than relying on
> find_all_inheritors.

Sure, added this condition to avoid the extra work here.
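
For the archives, the added shortcut is essentially this (an illustrative
sketch; variable names are not taken from the patch):

/* Only walk the inheritance tree when the relation is actually partitioned */
if (get_rel_relkind(relid) == RELKIND_PARTITIONED_TABLE)
	leaf_parts = count_leaf_partitions(relid);
else
	leaf_parts = 0;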


v3-0001-create-index-progress-increment.patch
Description: Binary data





Re: Progress report of CREATE INDEX for nested partitioned tables

2023-01-31 Thread Ilya Gladyshev


> On 1 Feb 2023, at 08:29, Justin Pryzby wrote:
> 
> On Tue, Jan 31, 2023 at 07:32:20PM +0400, Ilya Gladyshev wrote:
>>> On 17 Jan 2023, at 23:44, Tomas Vondra wrote:
>>> Do we actually need the new parts_done field? I mean, we already do
>>> track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the
>>> st_progress_param array. Can't we simply read it from there? Then we
>>> would not have ABI issues with the new field added to IndexStmt.
>> 
>> I think it’s a good approach and it could be useful outside of scope of this 
>> patch too. So I have attached a patch, that introduces 
>> pgstat_progress_incr_param function for this purpose. There’s one thing I am 
>> not sure about, IIUC, we can assume that the only process that can write 
>> into MyBEEntry of the current backend is the current backend itself, 
>> therefore looping to get consistent reads from this array is not required. 
>> Please correct me, if I am wrong here.
> 
> Thanks for the updated patch.
> 
> I think you're right - pgstat_begin_read_activity() is used for cases
> involving other backends.  It ought to be safe for a process to read its
> own status bits, since we know they're not also being written.
> 
> You changed DefineIndex() to update progress for the leaf indexes' when
> called recursively.  The caller updates the progress for "attached"
> indexes, but not created ones.  That allows providing fine-granularity
> progress updates when using intermediate partitions, right ?  (Rather
> than updating the progress by more than one at a time in the case of
> intermediate partitioning).
> 
> If my understanding is right, that's subtle, and adds a bit of
> complexity to the current code, so could use careful commentary.  I
> suggest:
> 
> * If the index was attached, update progress for all its direct and
> * indirect leaf indexes all at once.  If the index was built by calling
> * DefineIndex() recursively, the called function is responsible for
> * updating the progress report for built indexes.
> 
> ...
> 
> * If this is the top-level index, we're done. When called recursively
> * for child tables, the done partition counter is incremented now,
> * rather than in the caller.

Yes, you are correct about the intended behavior; I have added your comments to 
the patch.

> I guess you know that there were compiler warnings (related to your
> question).
> https://cirrus-ci.com/task/6571212386598912
> 
> pgstat_progress_incr_param() could call pgstat_progress_update_param()
> rather than using its own Assert() and WRITE_ACTIVITY calls.  I'm not
> sure which I prefer, though.
> 
> Also, there are whitespace/tab/style issues in
> pgstat_progress_incr_param().
> 
> -- 
> Justin

Thank you for the review; I have fixed the aforementioned issues in v2.


v2-0001-create-index-progress-increment.patch
Description: Binary data



Re: Progress report of CREATE INDEX for nested partitioned tables

2023-01-31 Thread Ilya Gladyshev

> On 17 Jan 2023, at 23:44, Tomas Vondra wrote:
> Do we actually need the new parts_done field? I mean, we already do
> track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the
> st_progress_param array. Can't we simply read it from there? Then we
> would not have ABI issues with the new field added to IndexStmt.

I think it’s a good approach, and it could be useful outside the scope of this 
patch too. So I have attached a patch that introduces a 
pgstat_progress_incr_param function for this purpose. There’s one thing I am 
not sure about: IIUC, we can assume that the only process that can write into 
MyBEEntry of the current backend is the current backend itself, therefore 
looping to get consistent reads from this array is not required. Please correct 
me if I am wrong here.
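
For context, the new helper is just an increment counterpart of
pgstat_progress_update_param; roughly like this (a sketch, the attached patch
is authoritative):

void
pgstat_progress_incr_param(int index, int64 incr)
{
	volatile PgBackendStatus *beentry = MyBEEntry;

	if (!beentry || !pgstat_track_activities)
		return;

	Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);

	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
	beentry->st_progress_param[index] += incr;
	PGSTAT_END_WRITE_ACTIVITY(beentry);
}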



0001-create-index-progress-increment.patch
Description: Binary data


Re: Progress report of CREATE INDEX for nested partitioned tables

2023-01-09 Thread Ilya Gladyshev
On Sun, 2023-01-08 at 10:48 -0600, Justin Pryzby wrote:
> On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote:
> > We have the common problem of too many patches.
> > 
> > https://www.postgresql.org/message-id/a15f904a70924ffa4ca25c3c744cff31e0e6e143.camel%40gmail.com
> > This changes the progress reporting to show indirect children as
> > "total", and adds a global variable to track recursion into
> > DefineIndex(), allowing it to be incremented without the value
> > being
> > lost to the caller.
> > 
> > https://www.postgresql.org/message-id/20221211063334.GB27893%40telsasoft.com
> > This also counts indirect children, but only increments the
> > progress
> > reporting in the parent.  This has the disadvantage that when
> > intermediate partitions are in use, the done_partitions counter
> > will
> > "jump" from (say) 20 to 30 without ever hitting 21-29.
> > 
> > https://www.postgresql.org/message-id/20221213044331.GJ27893%40telsasoft.com
> > This has two alternate patches:
> > - One patch changes to only update progress reporting of *direct*
> >   children.  This is minimal, but discourages any future plan to
> > track
> >   progress involving intermediate partitions with finer
> > granularity.
> > - An alternative patch adds IndexStmt.nparts_done, and allows
> > reporting
> >   fine-grained progress involving intermediate partitions.
> > 
> > https://www.postgresql.org/message-id/flat/039564d234fc3d014c555a7ee98be69a9e724836.ca...@gmail.com
> > This also reports progress of intermediate children.  The first
> > patch
> > does it by adding an argument to DefineIndex() (which isn't okay to
> > backpatch).  And an alternate patch does it by adding to IndexStmt.
> > 
> > @committers: Is it okay to add nparts_done to IndexStmt ?
> 
> Any hint about this ?
> 
> This should be resolved before the "CIC on partitioned tables" patch,
> which I think is otherwise done.

I suggest that we move on with the IndexStmt patch and see what the
committers have to say about it. I have brushed the patch up a bit,
fixing TODOs and adding docs as per our discussion above.

From 490d8afa7cb952e5b3947d81d85bf9690499e8a3 Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Tue, 13 Dec 2022 22:04:45 +0400
Subject: [PATCH v2] parts_done via IndexStmt

---
 doc/src/sgml/monitoring.sgml   |  4 +--
 src/backend/bootstrap/bootparse.y  |  2 ++
 src/backend/commands/indexcmds.c   | 57 +++---
 src/backend/parser/parse_utilcmd.c |  2 ++
 src/include/nodes/parsenodes.h |  1 +
 5 files changed, 59 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index cf220c3bcb..2c32a92364 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6565,7 +6565,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
   
   
When creating an index on a partitioned table, this column is set to
-   the total number of partitions on which the index is to be created.
+   the total number of leaf partitions on which the index is to be created or attached.
This field is 0 during a REINDEX.
   
  
@@ -6576,7 +6576,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
   
   
When creating an index on a partitioned table, this column is set to
-   the number of partitions on which the index has been created.
+   the number of leaf partitions on which the index has been created or attached.
This field is 0 during a REINDEX.
   
  
diff --git a/src/backend/bootstrap/bootparse.y b/src/backend/bootstrap/bootparse.y
index 86804bb598..dad9c38d90 100644
--- a/src/backend/bootstrap/bootparse.y
+++ b/src/backend/bootstrap/bootparse.y
@@ -296,6 +296,7 @@ Boot_DeclareIndexStmt:
 	stmt->concurrent = false;
 	stmt->if_not_exists = false;
 	stmt->reset_default_tblspc = false;
+	stmt->parts_done = -1;
 
 	/* locks and races need not concern us in bootstrap mode */
 	relationId = RangeVarGetRelid(stmt->relation, NoLock,
@@ -348,6 +349,7 @@ Boot_DeclareUniqueIndexStmt:
 	stmt->concurrent = false;
 	stmt->if_not_exists = false;
 	stmt->reset_default_tblspc = false;
+	stmt->parts_done = -1;
 
 	/* locks and races need not concern us in bootstrap mode */
 	relationId = RangeVarGetRelid(stmt->relation, NoLock,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 8afc006f89..be4d1c884e 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -130,6 +130,29 @@ typedef struct ReindexErrorInfo
 	char		relkind;
 } ReindexErrorInfo;
 
+
+/*
+ * Count the number of direct and indirect leaf partitions, excl

Re: Progress report of CREATE INDEX for nested partitioned tables

2022-12-13 Thread Ilya Gladyshev
On Mon, 2022-12-12 at 22:43 -0600, Justin Pryzby wrote:
> On Mon, Dec 12, 2022 at 11:39:23PM +0400, Ilya Gladyshev wrote:
> > 
> > > Could you check what I've written as a counter-proposal ?
> > 
> > I think that this might be a good solution to start with, it gives
> > us the opportunity to improve the granularity later without any
> > surprising changes for the end user. We could use this patch for
> > previous versions and make more granular output in the latest. What
> > do you think?
> 
> Somehow, it hadn't occured to me that my patch "lost granularity" by
> incrementing the progress bar by more than one...  Shoot.
> 
> > I actually think that the progress view would be better off without
> > the total number of partitions, 
> 
> Just curious - why ?

We don't really know how many indexes we are going to create, unless we
have some kind of preliminary "planning" stage where we accumulate all
the relations that will need to have indexes created (rather than
attached). And if someone wants the total, it can be calculated
manually without this view; that is less user-friendly, but if we can't
do it well, I would leave it up to the user.

> 
> > With this in mind, I think your proposal to exclude catalog-only
> > indexes sounds reasonable to me, but I feel like the docs are off
> > in this case, because the attached indexes are not created, but we
> > pretend like they are in this metric, so we should fix one or the
> > other.
> 
> I agree that the docs should indicate whether we're counting "all
> partitions", "direct partitions", and whether or not that includes
> partitioned partitions, or just leaf partitions.

Agreed. I think the docs should also be explicit about the attached
indexes, if we decide to count them as "created".

> I have another proposal: since the original patch 3.5 years ago
> didn't
> consider or account for sub-partitions, let's not start counting them
> now.  It was never defined whether they were included or not (and I
> guess that they're not common) so we can take this opportunity to
> clarify the definition.

I had this thought initially, but then I decided that it's not what I
would want if I were to track progress of multi-level partitioned
tables (but yeah, I guess it's pretty uncommon). In this respect, I
like your initial counter-proposal more, because it leaves us room to
improve this in the future. Otherwise, if we commit to reporting only
top-level partitions now, I'm not sure we will have the opportunity to
change this.


> Alternately, if it's okay to add nparts_done to the IndexStmt, then
> that's easy.

Yeah, or we could add another argument to DefineIndex. I don't know if
that is OK, or which option is better here in terms of compatibility
and interface, so I have tried both of them; see the attached patches.



From 4f55526f8a534d6b2f27b6b17cadaee0370e4cbc Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Tue, 13 Dec 2022 19:42:53 +0400
Subject: [PATCH] parts_done via DefineIndex args

---
 src/backend/commands/indexcmds.c | 52 +++-
 src/backend/commands/tablecmds.c |  7 +++--
 src/backend/tcop/utility.c   |  3 +-
 src/include/commands/defrem.h|  3 +-
 4 files changed, 53 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b5b860c3ab..9fa9109b9e 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -129,6 +129,29 @@ typedef struct ReindexErrorInfo
 	char		relkind;
 } ReindexErrorInfo;
 
+
+/*
+ * Count the number of direct and indirect leaf partitions, excluding foreign
+ * tables.
+ */
+static int
+num_leaf_partitions(Oid relid)
+{
+   int nleaves = 0;
+   List*childs = find_all_inheritors(relid, NoLock, NULL);
+   ListCell *lc;
+
+   foreach(lc, childs)
+   {
+   Oid partrelid = lfirst_oid(lc);
+   if (RELKIND_HAS_STORAGE(get_rel_relkind(partrelid)))
+   nleaves++;
+   }
+
+   list_free(childs);
+   return nleaves;
+}
+
 /*
  * CheckIndexCompatible
  * Determine whether an existing index definition is compatible with a
@@ -530,7 +553,8 @@ DefineIndex(Oid relationId,
bool check_rights,
bool check_not_in_use,
bool skip_build,
-   bool quiet)
+   bool quiet,
+   int *parts_done)
 {
 	bool		concurrent;
 	char	   *indexRelationName;
@@ -568,6 +592,7 @@ DefineIndex(Oid relationId,
Oid root_save_userid;
int root_save_sec_context;
int ro

Re: Progress report of CREATE INDEX for nested partitioned tables

2022-12-12 Thread Ilya Gladyshev

> Could you check what I've written as a counter-proposal ?

I think that this might be a good solution to start with; it gives us the 
opportunity to improve the granularity later without any surprising changes for 
the end user. We could use this patch for previous versions and make the output 
more granular in the latest. What do you think?

> As long as we're changing partitions_done to include nested
> sub-partitions, it seems to me like we should exclude intermediate
> "catalog-only" partitioned indexes, and count only physical leaf
> partitions.  Should it alo exclude any children with matching indexes,
> which will also be catalog-only changes?  Probably not.
> 
> The docs say:
> |When creating an index on a partitioned table, this column is set to the
> |total number of partitions on which the index is to be created. This
> |field is 0 during a REINDEX.

I agree with you on catalog-only partitioned indexes, but I think that in a 
perfect world we should exclude all the relations where the index isn’t 
actually created, which means excluding attached indexes as well. However, IMO 
doing it this way would require too much of a code rewrite for quite a minor 
feature (but we could do it, of course). I actually think that the progress 
view would be better off without the total number of partitions, but I’m not 
sure we have this option now. With this in mind, I think your proposal to 
exclude catalog-only indexes sounds reasonable to me, but I feel like the docs 
are off in this case: the attached indexes are not created, but we pretend that 
they are in this metric, so we should fix one or the other.

> 
>> I changed current behaviour to report the total number of partitions in
>> the inheritance tree and fixed recursion in the attached patch. I used
>> a static variable to keep the counter to avoid ABI breakage of
>> DefineIndex, so that we could backpatch this to previous versions.
> 
> I wrote a bunch of assertions for this, which seems to have uncovered an
> similar issue with COPY progress reporting, dating to 8a4f618e7.  I'm
> not sure the assertions are okay.  I imagine they may break other
> extensions, as with file_fdw.
> 
> -- 
> Justin
> <0001-fix-progress-reporting-of-nested-partitioned-indexes.patch>



Progress report of CREATE INDEX for nested partitioned tables

2022-12-10 Thread Ilya Gladyshev
Hi,

I have noticed that progress reporting for CREATE INDEX of partitioned
tables seems to be working poorly for nested partitioned tables. In
particular, it overwrites the total and done partition counts when it
recurses down to child partitioned tables, and it only reports top-level
partitions. So it's not hard to see something like this during CREATE
INDEX now:

postgres=# select partitions_total, partitions_done from
pg_stat_progress_create_index ;
 partitions_total | partitions_done 
--+-
1 |   2
(1 row)


I changed current behaviour to report the total number of partitions in
the inheritance tree and fixed recursion in the attached patch. I used
a static variable to keep the counter to avoid ABI breakage of
DefineIndex, so that we could backpatch this to previous versions.

Thanks,
Ilya Gladyshev
From 342caa3d4ce273b14a6998c20c32b862d2d4c890 Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Fri, 9 Dec 2022 23:17:29 +0400
Subject: [PATCH] report top parent progress for CREATE INDEX

---
 src/backend/commands/indexcmds.c | 36 ++--
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b5b860c3ab..80557dad8d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -129,6 +129,12 @@ typedef struct ReindexErrorInfo
 	char		relkind;
 } ReindexErrorInfo;
 
+/*
+ * Used to report the number of partitions,
+ * that were processed during index creation.
+ */
+static int create_index_parts_done;
+
 /*
  * CheckIndexCompatible
  *		Determine whether an existing index definition is compatible with a
@@ -1218,8 +1224,18 @@ DefineIndex(Oid relationId,
 			Relation	parentIndex;
 			TupleDesc	parentDesc;
 
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
-		 nparts);
+			if (!OidIsValid(parentIndexId))
+			{
+List *inhs;
+
+/* Reset counter of done partitions when processing root index */
+create_index_parts_done = 0;
+inhs = find_all_inheritors(relationId, NoLock, NULL);
+/* inheritance tree size without root itself */
+pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
+			 list_length(inhs) - 1);
+list_free(inhs);
+			}
 
 			/* Make a local copy of partdesc->oids[], just for safety */
 			memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
@@ -1431,8 +1447,6 @@ DefineIndex(Oid relationId,
 		   child_save_sec_context);
 }
 
-pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_DONE,
-			 i + 1);
 free_attrmap(attmap);
 			}
 
@@ -1465,13 +1479,17 @@ DefineIndex(Oid relationId,
 
 		/*
 		 * Indexes on partitioned tables are not themselves built, so we're
-		 * done here.
+		 * done here. If it is a child partitioned table, increment done parts counter.
 		 */
 		AtEOXact_GUC(false, root_save_nestlevel);
 		SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 		table_close(rel, NoLock);
 		if (!OidIsValid(parentIndexId))
 			pgstat_progress_end_command();
+		else
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_DONE,
+		 ++create_index_parts_done);
+
 		return address;
 	}
 
@@ -1483,9 +1501,15 @@ DefineIndex(Oid relationId,
 		/* Close the heap and we're done, in the non-concurrent case */
 		table_close(rel, NoLock);
 
-		/* If this is the top-level index, we're done. */
+		/*
+		 * If this is the top-level index, we're done, otherwise, increment
+		 * done partition counter.
+		 */
 		if (!OidIsValid(parentIndexId))
 			pgstat_progress_end_command();
+		else
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_DONE,
+		 ++create_index_parts_done);
 
 		return address;
 	}
-- 
2.30.2



Re: CREATE INDEX CONCURRENTLY on partitioned index

2022-12-09 Thread Ilya Gladyshev
On Sun, 2022-12-04 at 13:09 -0600, Justin Pryzby wrote:
> 
> This beavior is fixed.  I re-factored and re-implented to use
> DefineIndex() for building indexes concurrently rather than
> reindexing.
> That makes the patch smaller, actually, and has the added benefit of
> splitting off the "Concurrently" part of DefineIndex() into a
> separate
> function.

Nice, I think it turned out pretty concise. I played around with the
patch quite a bit and didn't find any major problems. The only minor thing
that I can note is that we should skip the top parent index itself in
the loop, so as not to increment the pg_stat counter; something like this:
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index cfab45b999..9049540b5b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1515,6 +1515,9 @@ DefineIndex(Oid relationId,
 			Oid			indrelid = lfirst_oid(lc);
 			Oid			tabrelid = IndexGetRelation(indrelid, false);
 
+			if (indrelid == indexRelationId)
+				continue;
+
 			if (RELKIND_HAS_STORAGE(get_rel_relkind(indrelid)) &&
 				!get_index_isvalid(indrelid))
 			{
> 
> BTW, it causes the patch to fail to apply in cfbot when you send an
> additional (002) supplementary patch without including the original
> (001) patch.  You can name it *.txt to avoid the issue.
>  
> https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches.3F
> 
> Thanks for looking.
> 
My bad, didn't know about this, thanks for the link.

On a side note, I noticed that reindex behaviour is strange on
partitioned tables: it doesn't mark partitioned tables as valid after
reindexing the children. As far as I could understand from the code and
mailing lists, this is the intended behaviour, but I can't quite
understand the rationale for it. Do you know why it is done this way?





Re: CREATE INDEX CONCURRENTLY on partitioned index

2022-12-03 Thread Ilya Gladyshev
Hi,

Thank you, Justin and Alexander, for working on this. I have reviewed and
tested the latest patch; it works well, and the problems mentioned
previously are all fixed. I like the idea of sharing code between reindex
and index creation, but I have noticed some peculiarities as a user. 

The reporting is somewhat confusing, as it switches to reporting for
reindex concurrently while building child indexes; this should be fixed
with the simple patch I have attached. Another thing that I have
noticed is that REINDEX, which is used under the hood, creates new
indexes with the suffix _ccnew, and if the index building fails, the
indexes that could not be built will keep the _ccnew suffix in their names.
This can actually be seen in your test:

ERROR:  could not create unique index "idxpart2_a_idx2_ccnew"

I find it quite confusing and I don't think that this is the expected
behavior (if it is, I think it should be documented, like it is for
REINDEX). As an example of the problems that it might entail, DROP INDEX
will not drop all the invalid indexes in the inheritance tree, because
it will leave the _ccnew indexes in place, which is OK for reindex
concurrently, but that's not how C-I-C works now. I think that fixing
this problem requires some heavy code rewrite and I'm not quite sure
how to go about it; if you have any ideas, I will be happy to try them
out.


Thanks,
Ilya
From 8eb9fd7ce7d34c5c323c47b60a7f883f360ef090 Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Sat, 3 Dec 2022 18:20:03 +0400
Subject: [PATCH] turn off reindex reporting for create

---
 src/backend/commands/indexcmds.c | 36 +++-
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a2775931e2..b3c713037f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -3804,14 +3804,16 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
 
 		/* Don't overwrite CREATE INDEX command */
 		if (!(params->options & REINDEXOPT_REPORT_CREATE_PART))
+		{
 			pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX,
 	  idx->tableId);
 
-		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
-		progress_vals[1] = 0;	/* initializing */
-		progress_vals[2] = idx->indexId;
-		progress_vals[3] = idx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+			progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
+			progress_vals[1] = 0;	/* initializing */
+			progress_vals[2] = idx->indexId;
+			progress_vals[3] = idx->amId;
+			pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		}
 
 		/* Choose a temporary relation name for the new index */
 		concurrentName = ChooseRelationName(get_rel_name(idx->indexId),
@@ -3967,13 +3969,15 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
 		 * table involved.  Don't overwrite CREATE INDEX command.
 		 */
 		if (!(params->options & REINDEXOPT_REPORT_CREATE_PART))
+		{
 			pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, newidx->tableId);
 
-		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
-		progress_vals[1] = PROGRESS_CREATEIDX_PHASE_BUILD;
-		progress_vals[2] = newidx->indexId;
-		progress_vals[3] = newidx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+			progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
+			progress_vals[1] = PROGRESS_CREATEIDX_PHASE_BUILD;
+			progress_vals[2] = newidx->indexId;
+			progress_vals[3] = newidx->amId;
+			pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		}
 
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
@@ -4033,14 +4037,16 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
 		 * table involved. Don't overwrite CREATE INDEX command.
 		 */
 		if (!(params->options & REINDEXOPT_REPORT_CREATE_PART))
+		{
 			pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX,
 	  newidx->tableId);
 
-		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
-		progress_vals[1] = PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN;
-		progress_vals[2] = newidx->indexId;
-		progress_vals[3] = newidx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+			progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
+			progress_vals[1] = PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN;
+			progress_vals[2] = newidx->indexId;
+			progress_vals[3] = newidx->amId;
+			pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		}
 
 		validate_index(newidx->tableId, newidx->indexId, snapshot);
 
-- 
2.30.2



Re: postgres_fdw binary protocol support

2022-11-24 Thread Ilya Gladyshev



> On 22 Nov 2022, at 17:10, Ashutosh Bapat wrote:
> 
> Hi Illya,
> 
> On Mon, Nov 21, 2022 at 8:50 PM Ilya Gladyshev
>  wrote:
>> 
>> Hi everyone,
>> 
>> I have made a patch that introduces support for libpq binary protocol
>> in postgres_fdw. The idea is simple, when a user knows that the foreign
>> server is binary compatible with the local and his workload could
>> somehow benefit from using binary protocol, it can be switched on for a
>> particular server or even a particular table.
>> 
> 
> Why do we need this feature? If it's for performance then do we have
> performance numbers?
Yes, it is for performance, but I have yet to do the benchmarks. My initial idea 
was that the binary protocol must be more efficient than text, because as I 
understand it, that’s the whole point of it. However, the minor tests that I 
have done do not prove this, and I couldn’t find any benchmarks for it online, 
so I will do further tests to find a use case for it.
> About the patch itself, I see a lot of if (binary) {} else {} block
> which are repeated. It will be good if we can add functions/macros to
> avoid duplication.
Yeah, that’s true; I have some ideas about improving it.



postgres_fdw binary protocol support

2022-11-21 Thread Ilya Gladyshev
Hi everyone,

I have made a patch that introduces support for the libpq binary protocol
in postgres_fdw. The idea is simple: when a user knows that the foreign
server is binary compatible with the local one and their workload could
somehow benefit from using the binary protocol, it can be switched on for a
particular server or even a particular table. 

The patch adds a new foreign server and table option 'binary_format'
(off by default) and implements serialization/deserialization of query
results and parameters for the binary protocol. I have tested the patch by
switching the foreign servers in the postgres_fdw.sql tests to binary
format; the only diff was in the text of the error for parsing an invalid
integer value, so it worked as expected for the test. There are a few minor
issues I don't like in the code, and I am yet to write the tests and
docs for it. It would be great to get some feedback and understand
whether this is a welcome feature before proceeding with all of the
abovementioned.
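
For anyone unfamiliar with what the binary protocol means at the libpq level:
the difference is essentially which format codes are passed to PQexecParams().
A minimal client-side illustration (plain libpq usage against a hypothetical
players table, not code from the patch):

#include <stdint.h>
#include <arpa/inet.h>		/* htonl */
#include <libpq-fe.h>

/* Send one int4 parameter in binary and request binary-format results. */
static PGresult *
fetch_player_binary(PGconn *conn, int32_t id)
{
	uint32_t	netval = htonl((uint32_t) id);	/* int4 is big-endian on the wire */
	Oid			types[1] = {23};		/* INT4OID */
	const char *values[1] = {(const char *) &netval};
	int			lengths[1] = {sizeof(netval)};
	int			formats[1] = {1};		/* 1 = binary parameter */

	return PQexecParams(conn,
						"SELECT id, name FROM players WHERE id = $1",
						1, types, values, lengths, formats,
						1);		/* resultFormat = 1: binary results */
}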

Thanks,
Ilya Gladyshev
From 2cb72df03ed94d55cf51531a2d21a7d3369ae27b Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Sat, 19 Nov 2022 17:47:49 +0400
Subject: [PATCH] postgres_fdw libpq binary proto support

---
 contrib/postgres_fdw/option.c   |   6 +-
 contrib/postgres_fdw/postgres_fdw.c | 389 
 contrib/postgres_fdw/postgres_fdw.h |   1 +
 3 files changed, 338 insertions(+), 58 deletions(-)

diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index fa80ee2a55..f96cb79b42 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -125,7 +125,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 			strcmp(def->defname, "truncatable") == 0 ||
 			strcmp(def->defname, "async_capable") == 0 ||
 			strcmp(def->defname, "parallel_commit") == 0 ||
-			strcmp(def->defname, "keep_connections") == 0)
+			strcmp(def->defname, "keep_connections") == 0 ||
+			strcmp(def->defname, "binary_format") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -253,6 +254,9 @@ InitPgFdwOptions(void)
 		/* async_capable is available on both server and table */
 		{"async_capable", ForeignServerRelationId, false},
 		{"async_capable", ForeignTableRelationId, false},
+		/* async_capable is available on both server and table */
+		{"binary_format", ForeignServerRelationId, false},
+		{"binary_format", ForeignTableRelationId, false},
 		{"parallel_commit", ForeignServerRelationId, false},
 		{"keep_connections", ForeignServerRelationId, false},
 		{"password_required", UserMappingRelationId, false},
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 8d7500abfb..9344b6f5fc 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -76,6 +76,8 @@ enum FdwScanPrivateIndex
 	FdwScanPrivateRetrievedAttrs,
 	/* Integer representing the desired fetch_size */
 	FdwScanPrivateFetchSize,
+	/* Boolean flag showing whether to use binary or text libpq protocol */
+	FdwScanPrivateBinaryFormat,
 
 	/*
 	 * String describing join i.e. names of relations being joined and types
@@ -128,7 +130,8 @@ enum FdwDirectModifyPrivateIndex
 	/* Integer list of attribute numbers retrieved by RETURNING */
 	FdwDirectModifyPrivateRetrievedAttrs,
 	/* set-processed flag (as a Boolean node) */
-	FdwDirectModifyPrivateSetProcessed
+	FdwDirectModifyPrivateSetProcessed,
+	FdwDirectModifyPrivateBinaryFormat
 };
 
 /*
@@ -154,6 +157,7 @@ typedef struct PgFdwScanState
 	FmgrInfo   *param_flinfo;	/* output conversion functions for them */
 	List	   *param_exprs;	/* executable expressions for param values */
 	const char **param_values;	/* textual values of query parameters */
+	int *param_lengths;
 
 	/* for storing result tuples */
 	HeapTuple  *tuples;			/* array of currently-retrieved tuples */
@@ -172,6 +176,7 @@ typedef struct PgFdwScanState
 	MemoryContext temp_cxt;		/* context for per-tuple temporary data */
 
 	int			fetch_size;		/* number of tuples per fetch */
+	bool binary_format; /* whether to use libpq binary or text format */
 } PgFdwScanState;
 
 /*
@@ -195,6 +200,7 @@ typedef struct PgFdwModifyState
 	int			batch_size;		/* value of FDW option "batch_size" */
 	bool		has_returning;	/* is there a RETURNING clause? */
 	List	   *retrieved_attrs;	/* attr numbers retrieved by RETURNING */
+	bool binary_format;
 
 	/* info about parameters for prepared statement */
 	AttrNumber	ctidAttno;		/* attnum of input resjunk ctid column */
@@ -225,7 +231,7 @@ typedef struct PgFdwDirectModifyState
 	bool		has_returning;	/* is there a RETURNING clause? */
 	List	   *retrieved_attrs;	/* attr numbers retrieved by RETURNING */
 	bool		set_processed;	/* do we set the command es_processed? */
-
+	bool binary_format;
 	/* for remote query execution */
 	PGconn	   *co

Re: Segfault on logical replication to partitioned table with foreign children

2022-11-03 Thread Ilya Gladyshev
On Wed, 2022-11-02 at 12:37 -0400, Tom Lane wrote:
> Since we're getting pretty close to the next set of back-branch
> releases,
> I went ahead and pushed a minimal fix along the lines suggested by
> Shi Yu.
> (I realized that there's a second ExecFindPartition call in worker.c
> that
> also needs a check.)  We still can at leisure think about refactoring
> CheckSubscriptionRelkind to avoid unnecessary lookups.  I think that
> is something we should do only in HEAD; it'll just be a marginal
> savings,
> not worth the risks of API changes in stable branches.  The other
> loose
> end is whether to worry about a regression test case.  I'm inclined
> not
> to bother.  The only thing that isn't getting exercised is the actual
> ereport, which probably isn't in great need of routine testing.
> 
> regards, tom lane

I agree that early checks limit some of the functionality that was
available before, so I guess the only way to preserve it is to do only
the last-minute checks after routing; fair enough. As for the
refactoring of the premature lookup, I have done some work on that in
the previous patch, and I think we can just use it. I have attached a
separate patch with the refactoring.
From 004c63a8eba777be739f062cdc9b7ddcf2eac253 Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Thu, 3 Nov 2022 11:39:24 +0400
Subject: [PATCH] Delay namespace and relname lookups until error

---
 src/backend/commands/subscriptioncmds.c| 12 
 src/backend/executor/execReplication.c | 13 ++---
 src/backend/replication/logical/relation.c |  2 +-
 src/backend/replication/logical/worker.c   |  8 ++--
 src/include/executor/executor.h|  2 +-
 5 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f0cec2ad5e..0c3ad698eb 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -700,12 +700,14 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 			{
 RangeVar   *rv = (RangeVar *) lfirst(lc);
 Oid			relid;
+Relation sub_rel;
 
 relid = RangeVarGetRelid(rv, AccessShareLock, false);
+sub_rel = RelationIdGetRelation(relid);
 
 /* Check for supported relkind. */
-CheckSubscriptionRelkind(get_rel_relkind(relid),
-		 rv->schemaname, rv->relname);
+CheckSubscriptionRelkind(sub_rel, rv->schemaname, rv->relname);
+RelationClose(sub_rel);
 
 AddSubscriptionRelState(subid, relid, table_state,
 		InvalidXLogRecPtr);
@@ -861,12 +863,14 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
 		{
 			RangeVar   *rv = (RangeVar *) lfirst(lc);
 			Oid			relid;
+			Relation sub_rel;
 
 			relid = RangeVarGetRelid(rv, AccessShareLock, false);
+			sub_rel = RelationIdGetRelation(relid);
 
 			/* Check for supported relkind. */
-			CheckSubscriptionRelkind(get_rel_relkind(relid),
-	 rv->schemaname, rv->relname);
+			CheckSubscriptionRelkind(sub_rel, rv->schemaname, rv->relname);
+			RelationClose(sub_rel);
 
 			pubrel_local_oids[off++] = relid;
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 6014f2e248..a0a2d326db 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -649,16 +649,23 @@ CheckCmdReplicaIdentity(Relation rel, CmdType cmd)
 /*
  * Check if we support writing into specific relkind.
  *
- * The nspname and relname are only needed for error reporting.
+ * If nspname and relname are not NULL, they are used for error reporting, otherwise these values are fetched from relcache.
  */
 void
-CheckSubscriptionRelkind(char relkind, const char *nspname,
-		 const char *relname)
+CheckSubscriptionRelkind(Relation rel, const char *nspname, const char *relname)
 {
+	char		relkind = rel->rd_rel->relkind;
+
 	if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE)
+	{
+		if (relname == NULL)
+			relname = RelationGetRelationName(rel);
+		if (nspname == NULL)
+			nspname = get_namespace_name(RelationGetNamespace(rel));
 		ereport(ERROR,
 (errcode(ERRCODE_WRONG_OBJECT_TYPE),
  errmsg("cannot use relation \"%s.%s\" as logical replication target",
 		nspname, relname),
  errdetail_relkind_not_supported(relkind)));
+	}
 }
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index e989047681..e98031e272 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -393,7 +393,7 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 		entry->localreloid = relid;
 
 		/* Check for supported relkind. */
-		CheckSubscriptionRelkind(entry->localrel->rd_rel->relkind,
+		CheckSubscriptionRelkind(entry->localrel,

Re: Segfault on logical replication to partitioned table with foreign children

2022-10-31 Thread Ilya Gladyshev
On Mon, 2022-10-31 at 03:20 +, shiy.f...@fujitsu.com wrote:
> On Sun, Oct 30, 2022 9:39 PM Tom Lane  wrote:
> > 
> > What I'm wondering about is whether we can refactor this code
> > to avoid so many usually-useless catalog lookups.  Pulling the
> > namespace name, in particular, is expensive and we generally
> > are not going to need the result.  In the child-rel case it'd
> > be much better to pass the opened relation and let the error-check
> > subroutine work from that.  Maybe we should just do it like that
> > at the start of the recursion, too?  Or pass the relid and let
> > the subroutine look up the names only in the error case.
> > 
> > A completely different line of thought is that this doesn't seem
> > like a terribly bulletproof fix, since children could get added to
> > a partitioned table after we look.  Maybe it'd be better to check
> > the relkind at the last moment before we do something that depends
> > on a child table being a relation.
> > 
> 
> I agree. So maybe we can add this check in
> apply_handle_tuple_routing().
> 
> diff --git a/src/backend/replication/logical/worker.c
> b/src/backend/replication/logical/worker.c
> index 5250ae7f54..e941b68e4b 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -2176,6 +2176,10 @@ apply_handle_tuple_routing(ApplyExecutionData
> *edata,
>     Assert(partrelinfo != NULL);
>     partrel = partrelinfo->ri_RelationDesc;
> 
> +   /* Check for supported relkind. */
> +   CheckSubscriptionRelkind(partrel->rd_rel->relkind,
> +    relmapentry-
> >remoterel.nspname, relmapentry->remoterel.relname);
> +
>     /*
>  * To perform any of the operations below, the tuple must
> match the
>  * partition's rowtype. Convert if needed or just copy, using
> a dedicated
> 
> 
> Regards,
> Shi yu

I have verified that the current patch handles the attaching of new
partitions to the target partitioned table by throwing an error on an
attempt to insert into a foreign table inside the logical replication
worker. I have refactored the code to minimize cache lookups, but I have
yet to write the tests for this. See the attached patch for the
refactored version.


From 56086185324e12e9db6add40f84bc7e60867f1d6 Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Tue, 1 Nov 2022 01:42:44 +0400
Subject: [PATCH v2] check relkind of subscription target recursively

---
 src/backend/commands/subscriptioncmds.c| 18 
 src/backend/executor/execReplication.c | 51 ++
 src/backend/replication/logical/relation.c |  6 +--
 src/include/executor/executor.h|  4 +-
 4 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f0cec2ad5e..2ae89f9e03 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -700,12 +700,13 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 			{
 RangeVar   *rv = (RangeVar *) lfirst(lc);
 Oid			relid;
+Relation rel;
 
 relid = RangeVarGetRelid(rv, AccessShareLock, false);
-
-/* Check for supported relkind. */
-CheckSubscriptionRelkind(get_rel_relkind(relid),
-		 rv->schemaname, rv->relname);
+rel = RelationIdGetRelation(relid);
+/* Check for supported relkind recursively. */
+CheckSubscriptionRelation(rel, rv->schemaname, rv->relname);
+RelationClose(rel);
 
 AddSubscriptionRelState(subid, relid, table_state,
 		InvalidXLogRecPtr);
@@ -861,12 +862,13 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
 		{
 			RangeVar   *rv = (RangeVar *) lfirst(lc);
 			Oid			relid;
+			Relation rel;
 
 			relid = RangeVarGetRelid(rv, AccessShareLock, false);
-
-			/* Check for supported relkind. */
-			CheckSubscriptionRelkind(get_rel_relkind(relid),
-	 rv->schemaname, rv->relname);
+			rel = RelationIdGetRelation(relid);
+			/* Check for supported relkind recursively. */
+			CheckSubscriptionRelation(rel, rv->schemaname, rv->relname);
+			RelationClose(rel);
 
 			pubrel_local_oids[off++] = relid;
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 6014f2e248..9da0b6b3a1 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -16,9 +16,11 @@
 
 #include "access/genam.h"
 #include "access/relscan.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "catalog/pg_inherits.h"
 #include "commands/trigger.h

Re: Segfault on logical replication to partitioned table with foreign children

2022-10-31 Thread Ilya Gladyshev
On Sun, 2022-10-30 at 16:52 +0100, Alvaro Herrera wrote:
> On 2022-Oct-28, ilya.v.gladys...@gmail.com wrote:
> 
> > This will cause a segfault or raise an assert, because inserting
> > into
> > foreign tables via logical replication is not possible. The
> > solution I
> > propose is to add recursive checks of relkind for children of a
> > target,
> > if the target is a partitioned table.
> 
> However, I imagine that the only reason we don't support this is that
> the code hasn't been written yet. I think it would be better to write
> that code, so that we don't have to raise any error at all (unless
> the
> foreign table is something that doesn't support DML, in which case we
> would have to raise an error).  Of course, we still have to fix it in
> backbranches, but we can just do it as a targeted check at the moment
> of
> insert/update, not at the moment of subscription create/alter.
> 

Sure, this patch is just a quick fix. A proper implementation of
logical replication into foreign tables would be a much more difficult
undertaking. I think this patch is simple enough: the checks in the
patch are performed both on subscription DDL and when the relation is
opened for logical replication, so it gives both early feedback and
last-minute checks as well. All the code infrastructure for these kinds
of checks is already in place, so I think it's a good idea to use it.

P.S. Sorry for duplicating the message; I forgot to cc the mailing list the
first time.





Re: Partial aggregates pushdown

2021-11-01 Thread Ilya Gladyshev



On 01.11.2021 13:30, Alexander Pyhalov wrote:

Peter Eisentraut wrote on 2021-11-01 12:47:

On 21.10.21 12:55, Alexander Pyhalov wrote:
Now aggregates with internal states can be pushed down, if they are 
marked as pushdown safe (this flag is set to true for min/max/sum),
have internal states and associated converters. Converters are 
called locally, they transform aggregate result to serialized 
internal representation.
As converters don't have access to internal aggregate state, partial 
aggregates like avg() are still not pushable.


It seems to me that the system should be able to determine from the
existing aggregate catalog entry whether an aggregate can be pushed
down.  For example, it could check aggtranstype != internal and
similar.  A separate boolean flag should not be necessary.


Hi.
I think we can't infer this property from existing flags. For example, 
if I have avg() with bigint[] argtranstype, it doesn't mean we can 
push it down. We also couldn't decide if a partial aggregate is safe to 
push down based on aggfinalfn presence (for example, it is defined for 
sum(numeric), but we can push it down).


I think one potential way to do it would be to allow pushing down 
aggregates that EITHER have state of the same type as their return type, 
OR have a conversion function that converts their return value to the 
type of their state.
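
A hypothetical sketch of that rule (none of these names come from the posted
patch; aggresulttype would be the aggregate's prorettype, and converterfn the
proposed converter function, if any):

static bool
partial_agg_pushdown_ok(Form_pg_aggregate aggform, Oid aggresulttype,
						Oid converterfn)
{
	/* The remote final result can be fed straight back in as partial state. */
	if (aggform->aggtranstype == aggresulttype)
		return true;

	/* Otherwise we need a converter from the result back to the state type. */
	return OidIsValid(converterfn);
}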






Re: Partial aggregates pushdown

2021-11-01 Thread Ilya Gladyshev

Hi,

On 21.10.2021 13:55, Alexander Pyhalov wrote:

Hi. Updated patch.
Now aggregates with internal states can be pushed down, if they are 
marked as pushdown safe (this flag is set to true for min/max/sum),

have internal states and associated converters.


I don't quite understand why this is restricted only to aggregates that 
have an 'internal' state; I feel that any aggregate that has a function to 
convert its final result back to the aggregate state should be eligible 
for pushdown. While I couldn't come up with a useful example for this, 
except maybe an aggregate whose aggfinalfn is used purely for cosmetic 
purposes (e.g. formatting the result into a string), I still feel that it 
is an unnecessary restriction.


A few minor review notes to the patch:


+static List *build_conv_list(RelOptInfo *foreignrel);

this should probably be up top among other declarations.


@@ -1433,6 +1453,48 @@ postgresGetForeignPlan(PlannerInfo *root,
                         outer_plan);
 }

+/*
+ * Generate attinmeta if there are some converters:
+ * they are expecxted to return BYTEA, but real input type is likely different.
+ */


typo in word "expec*x*ted".


@@ -139,10 +147,13 @@ typedef struct PgFdwScanState
                              * for a foreign join scan. */
 TupleDesc    tupdesc;        /* tuple descriptor of scan */
  AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
+    AttInMetadata *rcvd_attinmeta;    /* metadata for received tuples, NULL if
+                                       * there's no converters */


Looks like rcvd_attinmeta is redundant and you could use attinmeta for 
conversion metadata.


One last thing: the patch needs to be rebased, as it doesn't apply cleanly
on top of current master.


Thanks,

Ilya Gladyshev




Re: Update of partition key on foreign server

2021-08-26 Thread Ilya Gladyshev
On 2 Aug 2021, at 15:29, Ilya Gladyshev <i.gladys...@postgrespro.ru> wrote:

Hi,

I am currently looking into a partition constraint violation that occurs
on update of a partition key on a foreign server. To reproduce it you can
run:

On server 1 using port 5432:

  create extension postgres_fdw;
  create table players (id integer, name text) partition by list(name);
  create table players_0 partition of players for values in ('name1');
  create server neighbor foreign data wrapper postgres_fdw
      options (host 'localhost', port '5433', dbname 'postgres');
  create foreign table players_1 partition of players for values in ('name2')
      server neighbor;
  create user mapping for current_user server neighbor;

On server 2 using port 5433:

  create extension postgres_fdw;
  create server neighbor foreign data wrapper postgres_fdw
      options (host 'localhost', port '5432', dbname 'postgres');
  create table players (id integer, name text) partition by list(name);
  create table players_1 partition of players for values in ('name2');
  create user mapping for current_user server neighbor;
  create foreign table players_0 partition of players for values in ('name1')
      server neighbor;
  insert into players values (1, 'name1');
  update players set name='name2' where name='name1';

  ERROR:  new row for relation "players_0" violates partition constraint
  DETAIL:  Failing row contains (1, name2).
  CONTEXT:  remote SQL command: UPDATE public.players_0 SET name =
  'name2'::text WHERE ((name = 'name1'::text))

From what I have read on the mailing list, I understand that this is a
known problem, but I haven't found any thread discussing it in particular.
Is this something that needs fixing? If it is, I want to try to do it, but
I'm wondering if there are any known caveats and looking for any tips on
how to implement it.

My understanding is that this should be implemented in a similar way to
how the row is routed from a local partition in ExecCrossPartitionUpdate,
so the update should be replaced with a foreign delete + local/foreign
insert. In addition, a direct update should be forbidden when the query
modifies the partition key. I'm probably missing a lot of details (feel
free to point them out), but is the general idea correct? I will be
grateful for any feedback.

Thanks,
Ilya Gladyshev

  

I have developed a simple patch to fix this; while I'm not fully satisfied
with it, it gets the job done.

From what I have read in the docs, partition constraints are currently not
checked for foreign tables, because they must be enforced on the remote
server anyway (there might be many other ways to access the data source
besides the FDW), and thus there's no point in doing that locally. However,
in the case of a foreign partition update, checking the constraints locally
can be used to allow routing rows from remote partitions, so I don't feel
like this point is valid in this case.

Message [1] also mentions the possibility of triggers existing on the
foreign server, and transforming the update into a delete + insert will
cause these triggers to be omitted. While that is true, I feel like
partition pruning already has a similar effect, as it allows foreign
partitions to be skipped entirely. The only way I came up with to both fix
the update and keep the triggers working is to use the parent partitioned
table as the target of the foreign update (the FDW request would then be
"UPDATE public.players …"); while this is possible, it requires the foreign
server to have the same partitioned table, which seems quite restrictive to
me.

Please let me know if I'm missing something or there is a better way to do
it.

Thanks,
Ilya Gladyshev

[1] https://www.postgresql.org/message-id/5A93F487.4080602%40lab.ntt.co.jp

0001-Partition-key-update-in-foreign-table.patch
Description: Binary data


Re: Per query FDW network stat collection

2021-08-24 Thread Ilya Gladyshev



On 24.08.2021 12:19, Julien Rouhaud wrote:

However I'm not sure that having a new "network" option is the best
way for that.  It seems confusing as IIUC it won't be catching all
network activity (like fe/be activity, or network disk...) but only
FDW activity.  I think it would be better to have those information
retrieved when using the verbose option rather than a new one.
Similarly, I'm afraid that INSTRUMENT_NETWORK could be misleading,
although I don't have any better proposal right now.


I am also doubtful about this naming. Initially, I wanted to add fe/be
activity as one of the metrics, but then decided to restrict myself to
FDW for now. However, I decided to leave "network" as it is, because to
me it makes sense to have all the network-related metrics under a single
explain option (and perhaps a single instrumentation flag), in case more
are added later. The struct fields used internally for collection state
explicitly that they are meant to be used only for FDW stats, and the
explain output also mentions that the displayed stats are for FDW
network activity.






Per query FDW network stat collection

2021-08-24 Thread Ilya Gladyshev

Hello,

I have implemented per-query network stat collection for FDWs. It is done
in a similar way to how buffer and WAL stats are collected, and the result
can be seen with a new NETWORK option for the EXPLAIN command:


explain (analyze, network) insert into itrtest values (2, 'blah');

                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Insert on itrtest  (cost=0.00..0.01 rows=0 width=0) (actual time=0.544..0.544 rows=0 loops=1)
   Network: FDW bytes sent=197 received=72, wait_time=0.689
   ->  Result  (cost=0.00..0.01 rows=1 width=36) (actual time=0.003..0.003 rows=1 loops=1)
 Planning Time: 0.025 ms
 Execution Time: 0.701 ms
(5 rows)
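
The wait_time part of that line is tied to the track_fdw_wait_timing GUC
the patch adds, so a session reproducing the output above would look
roughly like this (whether the GUC needs to be enabled explicitly depends
on the default chosen in the patch):

  set track_fdw_wait_timing = on;
  explain (analyze, network) insert into itrtest values (2, 'blah');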

I am yet to add corresponding columns to pg_stat_statements, write tests 
and documentation, but before I go ahead with that, I would like to know 
what the community thinks about the patch.


Regards,

Ilya Gladyshev


>From 3ffbe071480672189c2e03d7e54707c77ba58b0b Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Mon, 23 Aug 2021 21:37:31 +0300
Subject: [PATCH] adds per query FDW network usage stats

Adds means for collecting network usage stats and outputting it in
explain with NETWORK option. Implements network stats collection for
postgres_fdw via adding a hook for stat collection to libpq.
---
 contrib/postgres_fdw/connection.c| 57 -
 contrib/postgres_fdw/postgres_fdw.c  | 28 +
 src/backend/access/heap/vacuumlazy.c |  5 ++-
 src/backend/access/nbtree/nbtsort.c  | 14 ---
 src/backend/commands/explain.c   | 62 
 src/backend/executor/execParallel.c  | 28 ++---
 src/backend/executor/instrument.c| 57 +
 src/backend/utils/misc/guc.c | 10 -
 src/include/commands/explain.h   |  1 +
 src/include/executor/execParallel.h  |  1 +
 src/include/executor/instrument.h| 25 ---
 src/interfaces/libpq/exports.txt |  1 +
 src/interfaces/libpq/fe-misc.c   |  2 +
 src/interfaces/libpq/fe-secure.c |  4 ++
 src/interfaces/libpq/libpq-fe.h  |  5 ++-
 15 files changed, 271 insertions(+), 29 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 82aa14a65de..3f479a74ba1 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -571,10 +571,22 @@ void
 do_sql_command(PGconn *conn, const char *sql)
 {
 	PGresult   *res;
+	instr_time start, duration;
+
+	if (track_fdw_wait_timing)
+		INSTR_TIME_SET_CURRENT(start);
 
 	if (!PQsendQuery(conn, sql))
 		pgfdw_report_error(ERROR, NULL, conn, false, sql);
 	res = pgfdw_get_result(conn, sql);
+
+	if (track_fdw_wait_timing)
+	{
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		INSTR_TIME_ADD(pgNetUsage.fdw_wait_time, duration);
+	}
+
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		pgfdw_report_error(ERROR, res, conn, true, sql);
 	PQclear(res);
@@ -684,10 +696,14 @@ GetPrepStmtNumber(PGconn *conn)
 PGresult *
 pgfdw_exec_query(PGconn *conn, const char *query, PgFdwConnState *state)
 {
+	PGresult *res;
+	instr_time start, duration;
 	/* First, process a pending asynchronous request, if any. */
 	if (state && state->pendingAreq)
 		process_pending_request(state->pendingAreq);
 
+	if (track_fdw_wait_timing)
+		INSTR_TIME_SET_CURRENT(start);
 	/*
 	 * Submit a query.  Since we don't use non-blocking mode, this also can
 	 * block.  But its risk is relatively small, so we ignore that for now.
@@ -696,7 +712,14 @@ pgfdw_exec_query(PGconn *conn, const char *query, PgFdwConnState *state)
 		pgfdw_report_error(ERROR, NULL, conn, false, query);
 
 	/* Wait for the result. */
-	return pgfdw_get_result(conn, query);
+	res = pgfdw_get_result(conn, query);
+	if (track_fdw_wait_timing)
+	{
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		INSTR_TIME_ADD(pgNetUsage.fdw_wait_time, duration);
+	}
+	return res;
 }
 
 /*
@@ -717,6 +740,10 @@ pgfdw_get_result(PGconn *conn, const char *query)
 	/* In what follows, do not leak any PGresults on an error. */
 	PG_TRY();
 	{
+		instr_time start, duration;
+		if (track_fdw_wait_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
 		for (;;)
 		{
 			PGresult   *res;
@@ -750,6 +777,13 @@ pgfdw_get_result(PGconn *conn, const char *query)
 			PQclear(last_res);
 			last_res = res;
 		}
+
+		if (track_fdw_wait_timing)
+		{
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			INSTR_TIME_ADD(pgNetUsage.fdw_wait_time, duration);
+		}
 	}
 	PG_CATCH();
 	{
@@ -893,7 +927,18 @@ pgfdw_xact_callback(XactEvent event, void *arg)
 	 */
 	if (entry->have_prep_stmt && entry->have_error)
 	{
+		instr_time start, duration;
+		if (track_fdw_wait_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
 		res = PQexec(entry->conn, "DEALLOCATE