Re: Skipping logical replication transactions on subscriber side

2021-07-05 Thread Masahiko Sawada
On Mon, Jul 5, 2021 at 7:33 PM Alexey Lesovsky  wrote:
>
> Hi,
> Have a few notes about pg_stat_logical_replication_error from the DBA point 
> of view (which will use this view in the future).

Thank you for the comments!

> 1. As I understand it, this view might contain many errors related to 
> different subscriptions. It is better to name 
> "pg_stat_logical_replication_errors" using the plural form (like this done 
> for stat views for tables, indexes, functions).

Agreed.

> Also, I'd like to suggest thinking twice about the view name (and function 
> used in view DDL) -  "pg_stat_logical_replication_error" contains very common 
> "logical replication" words, but the view contains errors related to 
> subscriptions only. In the future there could be other kinds of errors 
> related to logical replication, but not related to subscriptions - what will 
> you do?

Is pg_stat_subscription_errors or
pg_stat_logical_replication_apply_errors better?

> 2. Add a field with database name or id - it helps to quickly understand to 
> which database the subscription belongs.

Agreed.

> 3. Add a counter field with total number of errors - it helps to calculate 
> errors rates and aggregations (sum), and don't lose information about errors 
> between view checks.

Do you mean to increment the error count if the error (command, xid,
and relid) is the same as the previous one, or to have the total
number of errors per subscription? And what can we infer from the
error rates and aggregations?

> 4. Add text of last error (if it will not be too expensive).

Agreed.

> 5. Rename the "action" field to "command", as I know this is right from 
> terminology point of view.

Okay.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Pipeline mode and PQpipelineSync()

2021-07-05 Thread Boris Kolpackov
Alvaro Herrera  writes:

> Ah, yes it does.  I can reproduce this now.  I thought PQconsumeInput
> was sufficient, but it's not: you have to have the PQgetResult in there
> too.  Looking ...

Any progress on fixing this?




Re: "debug_invalidate_system_caches_always" is too long

2021-07-05 Thread Bharath Rupireddy
On Tue, Jul 6, 2021 at 12:43 AM Tom Lane  wrote:
>
> Noah Misch  writes:
> > On Sun, Jul 04, 2021 at 04:27:13PM -0400, Tom Lane wrote:
> >> However, I think we should also give serious consideration to
> >> "debug_clobber_cache" or "debug_clobber_cache_always" for continuity
> >> with past practice (though it still feels like "always" is a good
> >> word to lose now).  "debug_clobber_caches" is another reasonable
> >> variant.
>
> > https://en.wikipedia.org/wiki/Clobbering refers to cases where storage had 
> > no
> > changes to its accessibility but now contains different data.  That doesn't
> > match InvalidateSystemCaches() especially well, so I think dropping that 
> > word
> > has been a good step.  Some other shorter terms could be debug_flush_caches,
> > debug_rebuild_caches, or debug_expire_caches.  (debug_caches is tempting, 
> > but
> > that may ensnare folks looking for extra logging rather than a big 
> > slowdown.)
>
> I like "debug_flush_caches" --- it's short and accurate.

Do we always flush the cache entries to disk? Sometimes we just
invalidate the cache entries in the registered invalidation callbacks,
right? Since we already use the term "clobber" in the user-visible
config option --clobber-cache, isn't it more consistent to use
debug_clobber_caches?

Regards,
Bharath Rupireddy.




Re: Detecting File Damage & Inconsistencies

2021-07-05 Thread Amit Kapila
On Fri, Jul 2, 2021 at 8:29 PM Simon Riggs  wrote:
>
> On Fri, Jul 2, 2021 at 5:34 AM Craig Ringer
>  wrote:
> >
>
> > If you don't think the sorts of use cases I presented are worth the trouble 
> > that's fair enough. I'm not against adding it on the commit record. It's 
> > just that with logical decoding moving toward a streaming model I suspect 
> > only having it at commit time may cause us some pain later.
>
> I think you have some good ideas about how to handle larger
> transactions with streaming. As a separate patch it might be worth
> keeping track of transaction size and logging something when a
> transaction gets too large.
>

If we want this additional information for streaming mode in logical
replication then can't we piggyback it on the very first record
written for a transaction when this info is required? Currently, we do
something similar for logging the top-level xid for subtransactions in
XLogRecordAssemble().


-- 
With Regards,
Amit Kapila.




Re: Re: parallel distinct union and aggregate support patch

2021-07-05 Thread David Rowley
On Tue, 30 Mar 2021 at 22:33, bu...@sohu.com  wrote:
> I have written a plan with similar functions. It is known that the following 
> two situations do not work well.

I read through this thread and also wondered about a Parallel
Partition type operator.  It also seems to me that if it could be done
this way then you could just plug in existing nodes to get Sorting and
Aggregation rather than having to modify existing nodes to get them to
do what you need.

From what I've seen looking over the thread, a few people suggested
this and I didn't see anywhere where you responded to them about the
idea.  Just so you're aware, contributing to PostgreSQL is not a case
of throwing code at a wall and seeing which parts stick.  You need to
interact and respond to people reviewing your work. This is especially
true for the people who actually have the authority to merge any of
your work with the main code repo.

It seems to me you might be getting off to a bad start and you might
not be aware of this process. So I hope this email will help put you
on track.

Some of the people that you've not properly responded to include:

Thomas Munro:  PostgreSQL committer. Wrote Parallel Hash Join.
Robert Haas: PostgreSQL committer. Wrote much of the original parallel
query code.
Heikki Linnakangas: PostgreSQL committer. Worked on many parts of the
planner and executor. Also works for the company that develops
Greenplum, a massively parallel processing RDBMS based on Postgres.

You might find other information in [1].

If I wanted to do what you want to do, I think those 3 people might be
some of the last people I'd pick to ignore questions from! :-)

Also, I'd say copying in Tom Lane randomly when he's not shown
any interest in the patch here is likely not a good way of making
forward progress.  You might find that it bubbles up on his radar
if you start constructively interacting with the people who have
questioned your design.  I'd say that should be your next step.

The probability of anyone merging any of your code without properly
discussing the design with the appropriate people is either very
close to zero or actually zero.

I hope this email helps you get on track.

David

[1] https://www.postgresql.org/community/contributors/




Re: ECPG doesn't compile CREATE AS EXECUTE properly.

2021-07-05 Thread Michael Paquier
On Thu, Jul 01, 2021 at 06:45:25PM +0900, Kyotaro Horiguchi wrote:
> Separating "CREATE TABLE AS EXECUTE" from ExecuteStmt would be cleaner
> but I avoided changing the syntax tree. Instead the attached makes a
> distinction in $$.type of ExecuteStmt between NULL and "" to signal
> whether the returned name is the name of a prepared statement or a full
> statement.

I am not so sure, and using an empty string makes the code a bit
harder to follow.  How would that look with the grammar split you have
in mind?  Maybe that makes the code more consistent with the PREPARE
block a couple of lines above?
--
Michael


signature.asc
Description: PGP signature


Re: simplifying foreign key/RI checks

2021-07-05 Thread Amit Langote
On Tue, Jul 6, 2021 at 1:56 AM vignesh C  wrote:
> The 2nd patch does not apply on Head, please post a rebased version:
> error: patch failed: src/backend/utils/adt/ri_triggers.c:337
> error: src/backend/utils/adt/ri_triggers.c: patch does not apply

Thanks for the heads up.

Rebased patches attached.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


v9-0001-Export-get_partition_for_tuple.patch
Description: Binary data


v9-0002-Avoid-using-SPI-for-some-RI-checks.patch
Description: Binary data


Re: Atomic rename feature for Windows.

2021-07-05 Thread Michael Paquier
On Mon, Jul 05, 2021 at 04:53:06PM +0300, Victor Spirin wrote:
> This patch related to this post:
> https://www.postgresql.org/message-id/CAEepm%3D0FV-k%2B%3Dd9z08cW%3DZXoR1%3Dkw9wdpkP6WAuOrKJdz-8ujg%40mail.gmail.com

How does that cope with durable_rename_excl() where rename() is used
on Windows?  The problems that 909b449 has somewhat "fixed" were
annoying for the users as it prevented WAL segment recycling, so we
need to be sure that this does not cause more harm. 

> + /*
> + * CHECKSUM_TYPE_NONE defined in the winioctl.h when _WIN32_WINNT >= _WIN32_WINNT_WIN10
> + */
> +#ifdef CHECKSUM_TYPE_NONE
> +#undef CHECKSUM_TYPE_NONE
> +#endif

Okay.  Should this be renamed separately then to avoid conflicts?

> - * get support for GetLocaleInfoEx() with locales. For everything else
> + * Studio 2015 the minimum requirement is Windows 10 (0x0A00) to get support for SetFileInformationByHandle.
> + * The minimum requirement is Windows Vista (0x0600) to get support for GetLocaleInfoEx() with locales.
> + * For everything else
>   * the minimum version is Windows XP (0x0501).
>   */
>  #if defined(_MSC_VER) && _MSC_VER >= 1900
> -#define MIN_WINNT 0x0600
> +#define MIN_WINNT 0x0A00
>  #else
>  #define MIN_WINNT 0x0501
>  #endif

This is a large bump for Studio >= 2015 I am afraid.  That does not
seem acceptable, as it means losing support for GetLocaleInfoEx()
across older versions.

> +#if defined(WIN32) && !defined(__CYGWIN__) && defined(_WIN32_WINNT_WIN10) && _WIN32_WINNT >= _WIN32_WINNT_WIN10
> +
> +#include 
> +
> +/*
> + * win10_rename - uses SetFileInformationByHandle function with FILE_RENAME_FLAG_POSIX_SEMANTICS flag for atomic rename file
> + * working only on Windows 10 or later and _WIN32_WINNT must be >= _WIN32_WINNT_WIN10
> + */
> +static int win10_rename(wchar_t const* from, wchar_t const* to)

Having win10_rename(), a wrapper for pgrename_win10(), which is itself
an extra wrapper for pgrename(), is confusing.  Could you reduce the
layers of functions here?  In the end we just want an extra runtime
option for pgrename().  Note that pgrename_win10() could be static to
dirmod.c, and it seems to me that you just want a small function to do
the path conversion anyway.  It would be better to avoid using
malloc() in those code paths as well, as the backend could end up
calling them.  We should be able to remove the malloc()s with local
variables large enough to hold those paths, no?
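
A sketch of what that could look like, purely for illustration: the
convert_path() helper and the use of MAXPGPATH-sized stack buffers are
placeholders of mine, not code from the patch.

    static int
    convert_path(const char *path, wchar_t *buf, int buflen)
    {
        /* -1 => source is NUL-terminated; output includes the terminator */
        if (MultiByteToWideChar(CP_ACP, 0, path, -1, buf, buflen) == 0)
            return -1;          /* conversion failed, or buffer too small */
        return 0;
    }

    int
    pgrename_win10(const char *from, const char *to)
    {
        wchar_t     from_w[MAXPGPATH];
        wchar_t     to_w[MAXPGPATH];

        if (convert_path(from, from_w, MAXPGPATH) < 0 ||
            convert_path(to, to_w, MAXPGPATH) < 0)
            return -1;
        return win10_rename(from_w, to_w);
    }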

> + # manifest with ms_compatibility:supportedOS tags for using IsWindows10OrGreater() function
> + print $o "\n1 24 \"src/port/win10.manifest\"\n";
> +
>   close($o);
>   close($i);
>   }
> diff --git a/src/port/win10.manifest b/src/port/win10.manifest
> new file mode 100644

It would be good to not require that.  Those extra files make the
long-term maintenance harder.
--
Michael


signature.asc
Description: PGP signature


RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-07-05 Thread houzj.f...@fujitsu.com
On Sunday, July 4, 2021 1:44 PM Dilip Kumar  wrote:
> On Fri, Jul 2, 2021 at 8:16 PM Robert Haas  wrote:
> >
> > On Wed, Jun 30, 2021 at 11:46 PM Greg Nancarrow 
> wrote:
> > > I personally think "(b) provide an option to the user to specify
> > > whether inserts can be parallelized on a relation" is the preferable
> > > option.
> > > There seems to be too many issues with the alternative of trying to
> > > determine the parallel-safety of a partitioned table automatically.
> > > I think (b) is the simplest and most consistent approach, working
> > > the same way for all table types, and without the overhead of (a).
> > > Also, I don't think (b) is difficult for the user. At worst, the
> > > user can use the provided utility-functions at development-time to
> > > verify the intended declared table parallel-safety.
> > > I can't really see some mixture of (a) and (b) being acceptable.
> >
> > Yeah, I'd like to have it be automatic, but I don't have a clear idea
> > how to make that work nicely. It's possible somebody (Tom?) can
> > suggest something that I'm overlooking, though.
> 
> In general, for the non-partitioned table, where we don't have much overhead
> of checking the parallel safety and invalidation is also not a big problem,
> I am tempted to provide an automatic parallel safety check.  This would enable
> parallelism for more cases wherever it is suitable, without user intervention.
> OTOH, I understand that providing automatic checking might be very costly if
> the number of partitions is large.  Can't we provide some middle way where
> parallelism is enabled by default for a normal table, but for a partitioned
> table it is disabled by default and the user has to mark it safe to enable
> parallelism?  I agree that such behavior might sound a bit hackish.

About the invalidation for non-partitioned tables, I think it still has a
problem: when a function's parallel safety changes, it's expensive to judge
whether the function is related to an index, a trigger, or some other
table-related object by using pg_depend, because we can only make that
judgement in each backend when it accepts an invalidation message.  If we
don't do that, it means that whenever a function's parallel safety changes,
we invalidate every relation's cached safety, which does not look very nice
to me.

So, I personally think "(b) provide an option to the user to specify whether
inserts can be parallelized on a relation" is the preferable option.

Best regards,
houzj


Re: Evaluate expression at planning time for two more cases

2021-07-05 Thread David Rowley
On Tue, 9 Mar 2021 at 05:13, Ibrar Ahmed  wrote:
> It was a minor change therefore I rebased the patch, please take a look.

I only had a quick look at the v3 patch.

+ rel = table_open(rte->relid, NoLock);
+ att = TupleDescAttr(rel->rd_att, var->varattno - 1);

+ if (att->attnotnull && !check_null_side(context->root, relid, context))

This is not really an acceptable way to determine the notnull
attribute value. Andy Fan proposes a much better way in [1].
RelOptInfo is meant to cache the required Relation data that we need
during query planning. IIRC, Andy's patch correctly uses this and does
so in an efficient way.

In any case, as you can see there's a bit of other work going on to
smarten up the planner around NULL value detection. The UniqueKeys
patch requires this and various other things have come up that really
should be solved.

Surafel, I'd suggest we return this patch with feedback; maybe you
could instead help review the other patches regarding the NOT
NULL tracking, and come back to this once the dust has settled
and everyone is clear on how we determine if a column is NULL or not.

Let me know your thoughts.

David

[1] 
https://www.postgresql.org/message-id/caku4awpqjaqjwq2x-ar9g3+zhrzu1k8hnp7a+_mluov-n5a...@mail.gmail.com




Re: Evaluate expression at planning time for two more cases

2021-07-05 Thread David Rowley
On Tue, 8 Sept 2020 at 13:46, Tom Lane  wrote:
>
> I've been doing some handwaving about changing the representation
> of Vars, with an eye to making it clear by inspection whether a
> given Var is nullable by some lower outer join [2].  If that work
> ever comes to fruition then the need for "check_null_side" would
> go away.  So maybe we should put this idea on the back burner
> until that happens.

I looked at this patch too. I agree that we should delay adding any
new smarts in regards to NULL or NOT NULL until we have some more
robust infrastructure to make this sort of patch easier and cheaper.

My vote is to just return this patch with feedback.  Maybe Surafel
will be interested in pursuing this later when we have better
infrastructure or perhaps helping review the patch you're talking
about.

David




Re: Can a child process detect postmaster death when in pg_usleep?

2021-07-05 Thread Michael Paquier
On Mon, Jul 05, 2021 at 09:42:29PM +0530, Bharath Rupireddy wrote:
> I agree. I'm attaching the patch that replaces pg_usleep with
> WaitLatch for {pre, post}_auth_delay. I'm also attaching Michael's
> latest patch stop-backup-latch-v2.patch, just for the sake of cfbot.

I don't object to the argument that switching to a latch for this code
path could be good for responsiveness, but this switch is less
attractive than the others, as wait events are not available in
pg_stat_activity at authentication startup.  That's the case for normal
backends and WAL senders, but not for autovacuum workers using
post_auth_delay, if I read the code correctly.

Anyway, it is worth noting that the patch as proposed breaks
post_auth_delay.  MyLatch is set when reaching WaitLatch() for
post_auth_delay after loading the options, so the use of WL_LATCH_SET
is not right.  I think that this comes from SwitchToSharedLatch() in
InitProcess().  And it does not seem quite right to me to just blindly
reset the latch before doing the wait in this code path.  Perhaps we
could just use (WL_TIMEOUT | WL_EXIT_ON_PM_DEATH) to do the job.
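
A sketch of what that suggestion could look like; the wait event name
below is a placeholder of mine, not an existing constant:

    /* Hypothetical post_auth_delay wait: WL_LATCH_SET is deliberately
     * omitted, so a set latch cannot cut the sleep short; only the
     * timeout or postmaster death ends the wait. */
    if (PostAuthDelay > 0)
        (void) WaitLatch(MyLatch,
                         WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         PostAuthDelay * 1000L,        /* seconds -> ms */
                         WAIT_EVENT_POST_AUTH_DELAY);  /* placeholder name */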

The one for pg_stop_backup() has been applied, no objections to that.
--
Michael


signature.asc
Description: PGP signature


Re: Minor typo in generate_useful_gather_paths comment

2021-07-05 Thread David Rowley
On Tue, 6 Jul 2021 at 06:55, James Coleman  wrote:
> While re-reading this code I found a small typo and fixed it (making
> the comment more explicit at the same time).

Thanks. Pushed (9ee91cc58).

David




Re: Removed extra memory allocations from create_list_bounds

2021-07-05 Thread David Rowley
On Tue, 6 Jul 2021 at 05:03, Justin Pryzby  wrote:
>
> Also, if you're going to remove the initializations here, maybe you'd also
> change i and j to C99 "for" declarations like "for (int i=0, j=0; ...)"
>
> -   PartitionListValue **all_values = NULL;
> -   ListCell   *cell;
> -   int i = 0;
> -   int ndatums = 0;
> +   PartitionListValue *all_values;
> +   int i;
> +   int j;
> +   int ndatums;
>
> Same in get_non_null_list_datum_count()

I tend to only get motivated to use that for new code that does not
exist in back-branches.  I'll maybe stop doing that when we no longer
have to support the pre-C99 versions of the code.
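
For readers, the difference in question is purely one of declaration
scope; a trivial standalone illustration:

    #include <stdio.h>

    int
    main(void)
    {
        int     vals[] = {3, 0, 7, 0, 5};
        int     count1 = 0;
        int     count2 = 0;

        /* pre-C99 style: counter declared at the top of the block */
        int     i;

        for (i = 0; i < 5; i++)
            if (vals[i] != 0)
                count1++;

        /* C99 style: counter scoped to the loop itself */
        for (int j = 0; j < 5; j++)
            if (vals[j] != 0)
                count2++;

        printf("%d %d\n", count1, count2);  /* prints: 3 3 */
        return 0;
    }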

David




Re: Removed extra memory allocations from create_list_bounds

2021-07-05 Thread David Rowley
On Tue, 6 Jul 2021 at 04:45, Justin Pryzby  wrote:
> If you wanted to further squish the patches together, I don't mind being a
> co-author.

Thanks for looking at the patches.

I fixed the couple of things that you mentioned and pushed all 4
patches as a single commit (53d86957e)

David




Re: Asymmetric partition-wise JOIN

2021-07-05 Thread Zhihong Yu
On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov 
wrote:

> On 18/6/21 15:02, Alexander Pyhalov wrote:
> > Andrey Lepikhov писал 2021-05-27 07:27:
> >> Next version of the patch.
> >> For searching any problems I forced this patch during 'make check'
> >> tests. Some bugs were found and fixed.
> >
> > Hi.
> > I've tested this patch and haven't found issues, but I have some
> comments.
> Thank you for review!
> > Variable names - child_join_rel and child_join_relids seem to be
> > inconsistent with rest of the file
> fixed
>
> > When build_child_join_rel() can return NULL?
> Fixed
> > Missing word:
> > each bitmapset variable will large => each bitmapset variable will be
> large
> Fixed
> > What hook do you refer to?
> Removed
> > You've changed function to copy appinfo, so now comment is incorrect.
> Thanks, fixed
> > Do I understand correctly that now parent_relids also contains relids of
> > relations from 'global' inner relation, which we join to childs?
> Yes
>
> --
> regards,
> Andrey Lepikhov
> Postgres Professional
>
Hi,

relations because it could cause CPU and memory huge consumption
during reparameterization of NestLoop path.

CPU and memory huge consumption -> huge consumption of CPU and memory

+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one
of
+ * these JOINs are impossible to build.

at least one of these JOINs are impossible to build. -> at least one
of these JOINs is impossible to build.

+* Can't imagine situation when join relation already exists.
But in
+* the 'partition_join' regression test it happens.
+* It may be an indicator of possible problems.

Should a log be added in the above case ?

+is_asymmetric_join_capable(PlannerInfo *root,

is_asymmetric_join_capable -> is_asymmetric_join_feasible

Cheers


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2021-07-05 Thread James Coleman
On Mon, Jul 5, 2021 at 8:08 AM Ronan Dunklau  wrote:
>
> > Ok, I reproduced that case, just not using a GROUP BY: when adding the GROUP
> > BY, a sort node is added in both cases (master and your patch), except that
> > with your patch we sort on both keys and that doesn't really incur a
> > performance penalty.
> >
> > I think the overhead occurs because in the ExecAgg case, we use the
> > tuplesort_*_datum API as an optimization when we have a single column as an
> > input, which the ExecSort code doesn't. Maybe it would be worth it to try to
> > use that API in sort nodes too, when it can be done.
>
> Please find attached a POC patch to do just that.
>
> The switch to the single-datum tuplesort is done when there is only one
> attribute, it is byval (to avoid having to deal with copy of the references
> everywhere) and we are not in bound mode (to also avoid having to move things
> around).
>
> A naive run on make check pass on this, but I may have overlooked things.
>
> Should I add this separately to the commitfest ?

It seems like a pretty obvious win on its own, and, I'd expect, will
need less discussion than David's patch, so my vote is to make it a
separate thread. The patch tester wants the full series attached each
time, and even without that it's difficult to discuss multiple patches
on a single thread.

If you make a separate thread and CF entry, please CC me and add me as
a reviewer on the CF entry.

One thing from a quick read through of the patch: your changes near
the end of ExecSort, in ExecInitSort, and in execnodes.h need
indentation fixes.

Thanks,
James
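
For readers unfamiliar with the tuplesort_*_datum API discussed in this
thread, a minimal usage sketch, assuming the signatures in tuplesort.h;
error handling and memory-context management are omitted, work_mem is
the usual GUC, and this is illustrative backend-style code, not the
POC patch itself:

    Datum           val;
    bool            isnull;
    Tuplesortstate *state;

    /* begin a single-datum sort of int4 values */
    state = tuplesort_begin_datum(INT4OID,          /* type being sorted */
                                  Int4LessOperator, /* "<" operator for int4 */
                                  InvalidOid,       /* default collation */
                                  false,            /* nulls last */
                                  work_mem,
                                  NULL,             /* no parallel coordination */
                                  false);           /* no random access needed */

    tuplesort_putdatum(state, Int32GetDatum(2), false);
    tuplesort_putdatum(state, Int32GetDatum(1), false);
    tuplesort_performsort(state);

    while (tuplesort_getdatum(state, true, &val, &isnull, NULL))
        elog(DEBUG1, "next value: %d", DatumGetInt32(val));

    tuplesort_end(state);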




Re: "debug_invalidate_system_caches_always" is too long

2021-07-05 Thread Tom Lane
Noah Misch  writes:
> On Sun, Jul 04, 2021 at 04:27:13PM -0400, Tom Lane wrote:
>> However, I think we should also give serious consideration to
>> "debug_clobber_cache" or "debug_clobber_cache_always" for continuity
>> with past practice (though it still feels like "always" is a good
>> word to lose now).  "debug_clobber_caches" is another reasonable
>> variant.

> https://en.wikipedia.org/wiki/Clobbering refers to cases where storage had no
> changes to its accessibility but now contains different data.  That doesn't
> match InvalidateSystemCaches() especially well, so I think dropping that word
> has been a good step.  Some other shorter terms could be debug_flush_caches,
> debug_rebuild_caches, or debug_expire_caches.  (debug_caches is tempting, but
> that may ensnare folks looking for extra logging rather than a big slowdown.)

I like "debug_flush_caches" --- it's short and accurate.

regards, tom lane




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2021-07-05 Thread James Coleman
On Sat, Jun 12, 2021 at 11:07 AM David Rowley  wrote:
>
> A few years ago I wrote a patch to implement the missing aggregate
> combine functions for array_agg and string_agg [1].  In the end, the
> patch was rejected due to some concern [2] that if we allow these
> aggregates to run in parallel then it might mess up the order in which
> values are being aggregated by some unsuspecting user who didn't
> include an ORDER BY clause in the aggregate.   It was mentioned at the
> time that due to nodeAgg.c performing so terribly with ORDER BY
> aggregates, we couldn't possibly ask users who were upset by this
> change to include an ORDER BY in their aggregate functions.
>
> I'd still quite like to finish off the remaining aggregate functions
> that still don't have a combine function, so to get that going again
> I've written some code that gets the query planner onboard with
> picking a plan that allows nodeAgg.c to skip doing any internal sorts
> when the input results are already sorted according to what's required
> by the aggregate function.
>
> Of course, the query could have many aggregates all with differing
> ORDER BY clauses. Since we reuse the same input for each, it might not
> be possible to feed each aggregate with suitably sorted input.   When
> the input is not sorted, nodeAgg.c still must perform the sort as it
> does today.  So unfortunately we can't remove the nodeAgg.c code for
> that.
>
> The best I could come up with is just picking a sort order that suits
> the first ORDER BY aggregate, then spin through the remaining ones to
> see if there's any with a more strict order requirement that we can
> use to suit that one and the first one. The planner does something
> similar for window functions already, although it's not quite as
> comprehensive to look beyond the first window for windows with a more
> strict sort requirement.

I think this is a reasonable choice. I could imagine a more complex
method, say, counting the number of aggregates benefiting from a given
sort, and choosing the one that benefits the most (and this could be
further complicated by ranking based on "cost" -- not costing in the
normal sense since we don't have that at this point), but I think it'd
take a lot of convincing that that was valuable.

> This allows us to give presorted input to both aggregates in the following 
> case:
>
> SELECT agg(a ORDER BY a),agg2(a ORDER BY a,b) ...
>
> but just the first agg in this one:
>
> SELECT agg(a ORDER BY a),agg2(a ORDER BY c) ...
>
> In order to make DISTINCT work, I had to add a new expression
> evaluation operator to allow filtering if the current value is the
> same as the last value.  The current unpatched code implements
> distinct when reading back the sorted tuplestore data.  Since I don't
> have a tuplestore with the pre-sorted version, I needed to cache the
> last Datum, or last tuple somewhere.  I ended up putting that in the
> AggStatePerTransData struct. I'm not quite sure if I like that, but I
> don't really see where else it could go.

That sounds like what nodeIncrementalSort.c's isCurrentGroup() does,
except it's just implemented inline. Not anything you need to change
in this patch, but noting it in case it triggers a thought valuable
for you or me later on.
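
Stripped of the executor machinery (Datum comparisons, per-transition
caching), the "compare against the last value kept" idea reduces to a
standalone sketch like this:

    #include <stdio.h>

    /* Keep the first of each run of equal values in a sorted array and
     * return the new length; the executor does the equivalent with
     * Datum comparisons against a cached previous value. */
    static int
    dedup_sorted(int *a, int n)
    {
        int     out = 0;

        for (int i = 0; i < n; i++)
            if (out == 0 || a[i] != a[out - 1])
                a[out++] = a[i];
        return out;
    }

    int
    main(void)
    {
        int     v[] = {1, 1, 2, 3, 3, 3, 4};
        int     n = dedup_sorted(v, 7);

        for (int i = 0; i < n; i++)
            printf("%d ", v[i]);    /* prints: 1 2 3 4 */
        printf("\n");
        return 0;
    }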

> When testing the performance of all this I found that when a suitable
> index exists to provide pre-sorted input for the aggregation that the
> performance does improve. Unfortunately, it looks like things get more
> complex when no index exists.  In this case, since we're setting
> pathkeys to tell the planner we need a plan that provides pre-sorted
> input to the aggregates, the planner will add a sort below the
> aggregate node.  I initially didn't see any problem with that as it
> just moves the sort to a Sort node rather than having it done
> implicitly inside nodeAgg.c.  The problem is, it just does not perform
> as well.  I guess this is because when the sort is done inside
> nodeAgg.c that the transition function is called in a tight loop while
> reading records back from the tuplestore.  In the patched version,
> there's an additional node transition in between nodeAgg and nodeSort
> and that causes slower performance.  For now, I'm not quite sure what
> to do about that.  We set the plan pathkeys well before we could
> possibly decide if asking for pre-sorted input for the aggregates
> would be a good idea or not.

This opens up another path for significant plan benefits too: if
there's now an explicit sort node, then it's possible for that node to
be an incremental sort node, which isn't something nodeAgg is capable
of utilizing currently.

> Please find attached my WIP patch.  It's WIP due to what I mentioned
> in the above paragraph and also because I've not bothered to add JIT
> support for the new expression evaluation steps.

I looked this over (though didn't get a chance to play with it).

I'm wondering about the changes to the test output in tuplesort.out.
It looks like where a 

Minor typo in generate_useful_gather_paths comment

2021-07-05 Thread James Coleman
While re-reading this code I found a small typo and fixed it (making
the comment more explicit at the same time).

Thanks,
James


v1-0001-Fix-typo-in-comment.patch
Description: Binary data


Re: Race condition in InvalidateObsoleteReplicationSlots()

2021-07-05 Thread vignesh C
On Mon, Jul 5, 2021 at 10:30 PM Álvaro Herrera  wrote:
>
> On 2021-Jul-05, vignesh C wrote:
>
> > On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera  
> > wrote:
> > >
> > > On 2021-Jun-20, Tom Lane wrote:
> > >
> > > > Actually ... isn't there a second race, in the opposite direction?
> > > > IIUC, the point of this is that once we force some WAL to be sent
> > > > to the frozen sender/receiver, they'll be killed for failure to
> > > > respond.  But the advance_wal call is not the only possible cause
> > > > of that; a background autovacuum for example could emit some WAL.
> > > > So I fear it's possible for the 'to release replication slot'
> > > > message to come out before we capture $logstart.  I think you
> > > > need to capture that value before the kill not after.
> > >
> > > I accounted for all those things and pushed again.
> >
> > I saw that this patch is pushed. If there is no pending work left for
> > this, can we change the commitfest entry to Committed.
>
> There is none that I'm aware of, please mark it committed.  Thanks

Thanks for confirming, I have marked it as committed.

Regards,
Vignesh




Re: Removed extra memory allocations from create_list_bounds

2021-07-05 Thread Justin Pryzby
Also, if you're going to remove the initializations here, maybe you'd also
change i and j to C99 "for" declarations like "for (int i=0, j=0; ...)"

-   PartitionListValue **all_values = NULL;
-   ListCell   *cell;
-   int i = 0;
-   int ndatums = 0;
+   PartitionListValue *all_values;
+   int i;
+   int j;
+   int ndatums;

Same in get_non_null_list_datum_count()

-- 
Justin




Re: Race condition in InvalidateObsoleteReplicationSlots()

2021-07-05 Thread Álvaro Herrera
On 2021-Jul-05, vignesh C wrote:

> On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera  
> wrote:
> >
> > On 2021-Jun-20, Tom Lane wrote:
> >
> > > Actually ... isn't there a second race, in the opposite direction?
> > > IIUC, the point of this is that once we force some WAL to be sent
> > > to the frozen sender/receiver, they'll be killed for failure to
> > > respond.  But the advance_wal call is not the only possible cause
> > > of that; a background autovacuum for example could emit some WAL.
> > > So I fear it's possible for the 'to release replication slot'
> > > message to come out before we capture $logstart.  I think you
> > > need to capture that value before the kill not after.
> >
> > I accounted for all those things and pushed again.
> 
> I saw that this patch is pushed. If there is no pending work left for
> this, can we change the commitfest entry to Committed.

There is none that I'm aware of, please mark it committed.  Thanks

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)




Re: simplifying foreign key/RI checks

2021-07-05 Thread vignesh C
On Sun, Apr 4, 2021 at 1:51 PM Amit Langote  wrote:
>
> On Fri, Apr 2, 2021 at 11:55 PM Zhihong Yu  wrote:
> >
> > Hi,
> >
> > +   skip = !ExecLockTableTuple(erm->relation, , markSlot,
> > +  estate->es_snapshot, 
> > estate->es_output_cid,
> > +  lockmode, erm->waitPolicy, _needed);
> > +   if (skip)
> >
> > It seems the variable skip is only used above. The variable is not needed - 
> > if statement can directly check the return value.
> >
> > + * Locks tuple with given TID with given lockmode following given 
> > wait
> >
> > given appears three times in the above sentence. Maybe the following is bit 
> > easier to read:
> >
> > Locks tuple with the specified TID, lockmode following given wait policy
> >
> > + * Checks whether a tuple containing the same unique key as extracted from 
> > the
> > + * tuple provided in 'slot' exists in 'pk_rel'.
> >
> > I think 'same' is not needed here since the remaining part of the sentence 
> > has adequately identified the key.
> >
> > +   if (leaf_pk_rel == NULL)
> > +   goto done;
> >
> > It would be better to avoid goto by including the cleanup statements in the 
> > if block and return.
> >
> > +   if (index_getnext_slot(scan, ForwardScanDirection, outslot))
> > +   found = true;
> > +
> > +   /* Found tuple, try to lock it in key share mode. */
> > +   if (found)
> >
> > Since found is only assigned in one place, the two if statements can be 
> > combined into one.
>
> Thanks for taking a look.  I agree with most of your suggestions and
> have incorporated them in the v8 just posted.

The 2nd patch does not apply on Head, please post a rebased version:
error: patch failed: src/backend/utils/adt/ri_triggers.c:337
error: src/backend/utils/adt/ri_triggers.c: patch does not apply

Regards,
Vignesh




Re: Pre-allocating WAL files

2021-07-05 Thread vignesh C
On Mon, Jun 7, 2021 at 8:48 PM Bossart, Nathan  wrote:
>
> On 12/25/20, 12:09 PM, "Andres Freund"  wrote:
> > When running write heavy transactional workloads I've many times
> > observed that one needs to run the benchmarks for quite a while till
> > they get to their steady state performance. The most significant reason
> > for that is that initially WAL files will not get recycled, but need to
> > be freshly initialized. That's 16MB of writes that need to synchronously
> > finish before a small write transaction can even start to be written
> > out...
> >
> > I think there's two useful things we could do:
> >
> > 1) Add pg_wal_preallocate(uint64 bytes) that ensures (bytes +
> >segment_size - 1) / segment_size WAL segments exist from the current
> >point in the WAL. Perhaps with the number of bytes defaulting to
> >min_wal_size if not explicitly specified?
> >
> > 2) Have checkpointer (we want walwriter to run with low latency to flush
> >out async commits etc) occasionally check if WAL files need to be
> >pre-allocated.
> >
> >Checkpointer already tracks the amount of WAL that's expected to be
> >generated till the end of the checkpoint, so it seems like it's a
> >pretty good candidate to do so.
> >
> >To keep checkpointer pre-allocating when idle we could signal it
> >whenever a record has crossed a segment boundary.
> >
> >
> > With a plain pgbench run I see a 2.5x reduction in throughput in the
> > periods where we initialize WAL files.
>
> I've been exploring this independently a bit and noticed this message.
> Attached is a proof-of-concept patch for a separate "WAL allocator"
> process that maintains a pool of WAL-segment-sized files that can be
> claimed whenever a new segment file is needed.  An early version of
> this patch attempted to spread the I/O like non-immediate checkpoints
> do, but I couldn't point to any real benefit from doing so, and it
> complicated things quite a bit.
>
> I like the idea of trying to bake this into an existing process such
> as the checkpointer.  I'll admit that creating a new process just for
> WAL pre-allocation feels a bit heavy-handed, but it was a nice way to
> keep this stuff modularized.  I can look into moving this
> functionality into the checkpointer process if this is something that
> folks are interested in.
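
As an aside for readers, the rounding in Andres's point 1 above is
ordinary ceiling division; a trivial standalone check:

    #include <stdint.h>
    #include <stdio.h>

    /* ceil(bytes / segment_size), per the formula quoted above */
    static uint64_t
    wal_segments_needed(uint64_t bytes, uint64_t segment_size)
    {
        return (bytes + segment_size - 1) / segment_size;
    }

    int
    main(void)
    {
        uint64_t    seg = 16 * 1024 * 1024;     /* default 16MB WAL segments */

        /* 80MB fits in exactly 5 segments; one extra byte needs a 6th */
        printf("%llu %llu\n",
               (unsigned long long) wal_segments_needed(80 * 1024 * 1024, seg),
               (unsigned long long) wal_segments_needed(80 * 1024 * 1024 + 1, seg));
        return 0;
    }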

Thanks for posting the patch, the patch no more applies on Head:
Applying: wal segment pre-allocation
error: patch failed: src/backend/access/transam/xlog.c:3283
error: src/backend/access/transam/xlog.c: patch does not apply

Can you rebase the patch and post, it might help if someone is picking
it up for review.

Regards,
Vignesh




Re: Race condition in InvalidateObsoleteReplicationSlots()

2021-07-05 Thread vignesh C
On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera  wrote:
>
> On 2021-Jun-20, Tom Lane wrote:
>
> > Actually ... isn't there a second race, in the opposite direction?
> > IIUC, the point of this is that once we force some WAL to be sent
> > to the frozen sender/receiver, they'll be killed for failure to
> > respond.  But the advance_wal call is not the only possible cause
> > of that; a background autovacuum for example could emit some WAL.
> > So I fear it's possible for the 'to release replication slot'
> > message to come out before we capture $logstart.  I think you
> > need to capture that value before the kill not after.
>
> I accounted for all those things and pushed again.

I saw that this patch is pushed. If there is no pending work left for
this, can we change the commitfest entry to Committed.

Regards,
Vignesh




Re: Excessive cost of OpClassCache flushes in CLOBBER_CACHE_ALWAYS mode

2021-07-05 Thread Tom Lane
Andrew Dunstan  writes:
> On 7/4/21 3:57 PM, Tom Lane wrote:
>> I'm now a little dubious about my claim that this would have helped find
>> any bugs.  Invalidating a finished OpClassCache entry does not model any
>> real-world scenario, because as noted elsewhere in LookupOpclassInfo,
>> once such a cache entry is populated it is kept for the rest of the
>> session.  Also, the claim in the comment that we need this to test
>> a cache flush during load seems like nonsense: if we have
>> debug_invalidate_system_caches_always turned on, then we'll test
>> the effects of such flushes throughout the initial population of
>> a cache entry.  Doing it over and over again adds nothing.
>> 
>> So I'm now fairly strongly tempted to remove this code outright
>> (effectively reverting 03ffc4d6d).  Another possibility now that
>> we have debug_invalidate_system_caches_always is to increase the
>> threshold at which this happens, making it more like
>> CLOBBER_CACHE_RECURSIVE.

> If we don't think it's adding anything useful just rip it out. We don't
> generally keep code hanging around just on the off chance it might be
> useful some day.

I did a little more research about the origins of 03ffc4d6d.
I found this thread:

https://www.postgresql.org/message-id/flat/2988.1196271930%40sss.pgh.pa.us#a7dd1ce92f5470ba5ad2e1be03d40802

That points back to commit 0b56be834, which fixed the oversight that
OpclassOidIndexId had not been marked as a critical system index,
and claims that making LookupOpclassInfo invalidate entries would
have helped identify that.

To test that, I tried removing OpclassOidIndexId from the list of
critical system indexes.  I found that (a) if I also remove the
invalidation code in LookupOpclassInfo, then things seem to proceed
normally even in CCA mode, but (b) with that code, we get into
infinite recursion, continually trying to rebuild that index's
relcache entry.  Once criticalRelcachesBuilt is set, LookupOpclassInfo
assumes it can use catalog index scans for all opclasses, even those
needed for indexes it's going to rely on, which is why the recursion
happens.  This might arguably be a bug in that CLOBBER_CACHE_ENABLED
code itself: maybe it needs to avoid invalidating the entries for
OID_BTREE_OPS_OID and INT2_BTREE_OPS_OID.  If I make it do so, the
infinite recursion disappears even without OpclassOidIndexId as a
critical index.

So on the one hand maybe that logic is too simplistic, but on the
other hand it might help identify missed critical indexes in future
too.  That puts it in the same category as the sorts of bugs that
"debug_invalidate_system_caches_always > 1" is meant to help with,
so now I'm content to change it that way.

regards, tom lane




Re: Removed extra memory allocations from create_list_bounds

2021-07-05 Thread Justin Pryzby
On Tue, Jul 06, 2021 at 01:48:52AM +1200, David Rowley wrote:
> On Wed, 19 May 2021 at 05:28, Nitin Jadhav  
> wrote:
> > I have rebased all the patches on top of
> > 'v2_0001-removed_extra_mem_alloc_from_create_list_bounds.patch'.
> > Attaching all the patches here.
> 
> I had a look over these and I think what's being done here is fine.

Thanks for looking.

0001 is missing a newline before create_list_bounds()

0003 is missing pfree(all_bounds), which I had as 0005.
It 1) allocates all_bounds; 2) allocates rbounds; 3) copies all_bounds into
rbounds; 4) allocates boundDatums; 5) copies rbounds into boundDatums; 6) frees
rbounds; 7) returns boundInfo with boundinfo->datums.

> The good news is, we can just give partition_bounds_copy() the same
> treatment. 0004 does that.

+1

> I've attached another set of patches. I squashed all the changes to
> each create_*_bounds function into a patch of their own. Perhaps 0002
> and 0003 which handle range and hash partitioning can be one patch
> since Justin seemed to write that one.  I kept 0001 separate since
> that's Nitin's patch plus Justin's extra parts. It seems easier to
> credit properly having the patches broken out like this.  I think it's
> excessive to break down 0001 into Nitin and Justin's individual parts.

If you wanted to further squish the patches together, I don't mind being a
co-author.

Cheers,
-- 
Justin




Re: Can a child process detect postmaster death when in pg_usleep?

2021-07-05 Thread Bharath Rupireddy
On Mon, Jul 5, 2021 at 9:25 PM Stephen Frost  wrote:
> In general, I agree with Tom's up-thread comment about children hanging
> around after postmaster death making things more difficult for debugging
> and just in general, so I'm in favor of trying to eliminate as many
> cases where that's happening as we reasonably can without impacting
> performance by checking too often.

I agree. I'm attaching the patch that replaces pg_usleep with
WaitLatch for {pre, post}_auth_delay. I'm also attaching Michael's
latest patch stop-backup-latch-v2.patch, just for the sake of cfbot.

Regards,
Bharath Rupireddy.


stop-backup-latch-v2.patch
Description: Binary data


v1-0003-Use-a-WaitLatch-for-Pre-and-Post-Auth-Delay.patch
Description: Binary data


Re: Can a child process detect postmaster death when in pg_usleep?

2021-07-05 Thread Bharath Rupireddy
On Mon, Jul 5, 2021 at 7:33 AM Michael Paquier  wrote:
>
> On Fri, Jul 02, 2021 at 12:03:07PM +0530, Bharath Rupireddy wrote:
> > My bad. I was talking about the cases when do_pg_stop_backup is called
> > while the server is in recovery mode i.e. backup_started_in_recovery =
> > RecoveryInProgress(); evaluates to true. I'm not sure in these cases
> > whether we should replace pg_usleep with WaitLatch. If yes, whether we
> > should use procLatch/MyLatch or recoveryWakeupLatch as they are
> > currently serving different purposes.
>
> It seems to me that you should re-read the description of
> recoveryWakeupLatch at the top of xlog.c and check for which purpose
> it exists, which is, in this case, to wake up the startup process to
> accelerate WAL replay.  So do_pg_stop_backup() has no business with
> it.

Hm. The shared recoveryWakeupLatch is owned by the startup
process, which waits on it, and other backends/processes use it to
wake the startup process up.

> Switching pg_stop_backup() to use a latch rather than pg_usleep() has
> benefits:
> - It simplifies the wait event handling.
> - The process waiting for the last WAL segment to be archived will be
> more responsive on signals like SIGHUP and on postmaster death.
>
> These don't sound bad to me to apply here, so 0002 could be simplified
> as attached.

The attached stop-backup-latch-v2.patch looks good to me.

Regards,
Bharath Rupireddy.




Re: Can a child process detect postmaster death when in pg_usleep?

2021-07-05 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Fri, Jul 02, 2021 at 12:03:07PM +0530, Bharath Rupireddy wrote:
> > My bad. I was talking about the cases when do_pg_stop_backup is called
> > while the server is in recovery mode i.e. backup_started_in_recovery =
> > RecoveryInProgress(); evaluates to true. I'm not sure in these cases
> > whether we should replace pg_usleep with WaitLatch. If yes, whether we
> > should use procLatch/MyLatch or recoveryWakeupLatch as they are
> > currently serving different purposes.
> 
> It seems to me that you should re-read the description of
> recoveryWakeupLatch at the top of xlog.c and check for which purpose
> it exists, which is, in this case, to wake up the startup process to
> accelerate WAL replay.  So do_pg_stop_backup() has no business with
> it.
> 
> Switching pg_stop_backup() to use a latch rather than pg_usleep() has
> benefits:
> - It simplifies the wait event handling.
> - The process waiting for the last WAL segment to be archived will be
> more responsive on signals like SIGHUP and on postmaster death.

Yes, agreed.

> These don't sound bad to me to apply here, so 0002 could be simplified
> as attached.

Took a quick look and the patch looks good to me.

In general, I agree with Tom's up-thread comment about children hanging
around after postmaster death making things more difficult for debugging
and just in general, so I'm in favor of trying to eliminate as many
cases where that's happening as we reasonably can without impacting
performance by checking too often.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2021-07-05 Thread Ranier Vilela
Em seg., 5 de jul. de 2021 às 12:07, Ronan Dunklau 
escreveu:

> Le lundi 5 juillet 2021, 16:51:59 CEST Ranier Vilela a écrit :
> > >Please find attached a POC patch to do just that.
> > >
> > >The switch to the single-datum tuplesort is done when there is only one
> > >attribute, it is byval (to avoid having to deal with copy of the
> >
> > references
> >
> > >everywhere) and we are not in bound mode (to also avoid having to move
> >
> > things
> >
> > >around).
> >
> > Hi, nice results!
> >
> > I have a few suggestions and questions to your patch:
>
> Thank you for those !
> >
> > 1. Why do you moved the declaration of variable *plannode?
> > I think this is unnecessary, extend the scope.
>
> Sorry, I should have cleaned it up before sending.
>
> >
> > 2. Why do you declare a new variable TupleDesc out_tuple_desc at
> > ExecInitSort?
> > I think this is unnecessary too, maybe I didn't notice something.
>
> Same as the above, thanks for the two.
> >
> > 3. I inverted the order of the check at this line; I think "!node->bounded" is
> > cheaper than TupleDescAttr(tupDesc, 0)->attbyval
>
> I'm not sure it matters since it's done once per sort but Ok
> >
> > 4. Once that you changed the order of execution, this test is unlikely
> that
> > happens, so add unlikely helps the branch.
>
> Ok.
>
> >
> > 5. I think that you add a invariant inside the loop
> > "if(node->is_single_val)"?
> > Would not be better two fors?
>
> Ok for me.
>
> >
> > For you convenience, I attached a v2 version (with styles changes),
> please
> > take a look and can you repeat yours tests?
>
> Tested it quickly, and did not see any change performance wise that cannot
> be
> attributed to noise on my laptop but it's fine.
>
Thanks for testing again.


> Thank you for the fixes !
>
You are welcome.

regards,
Ranier Vilela


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2021-07-05 Thread Ronan Dunklau
Le lundi 5 juillet 2021, 16:51:59 CEST Ranier Vilela a écrit :
> >Please find attached a POC patch to do just that.
> >
> >The switch to the single-datum tuplesort is done when there is only one
> >attribute, it is byval (to avoid having to deal with copy of the
> 
> references
> 
> >everywhere) and we are not in bound mode (to also avoid having to move
> 
> things
> 
> >around).
> 
> Hi, nice results!
> 
> I have a few suggestions and questions to your patch:

Thank you for those !
> 
> 1. Why do you moved the declaration of variable *plannode?
> I think this is unnecessary, extend the scope.

Sorry, I should have cleaned it up before sending.

> 
> 2. Why do you declare a new variable TupleDesc out_tuple_desc at
> ExecInitSort?
> I think this is unnecessary too, maybe I didn't notice something.

Same as the above, thanks for the two.
> 
> 3. I inverted the order of the check at this line; I think "!node->bounded" is
> cheaper than TupleDescAttr(tupDesc, 0)->attbyval

I'm not sure it matters since it's done once per sort but Ok
> 
> 4. Once that you changed the order of execution, this test is unlikely that
> happens, so add unlikely helps the branch.

Ok.

> 
> 5. I think that you add a invariant inside the loop
> "if(node->is_single_val)"?
> Would not be better two fors?

Ok for me.

> 
> For you convenience, I attached a v2 version (with styles changes), please
> take a look and can you repeat yours tests?

Tested it quickly, and did not see any change performance wise that cannot be 
attributed to noise on my laptop but it's fine.

Thank you for the fixes !

> 
> regards,
> Ranier Vilela


-- 
Ronan Dunklau






Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2021-07-05 Thread Ranier Vilela
>Please find attached a POC patch to do just that.

>The switch to the single-datum tuplesort is done when there is only one
>attribute, it is byval (to avoid having to deal with copy of the
references
>everywhere) and we are not in bound mode (to also avoid having to move
things
>around).
Hi, nice results!

I have a few suggestions and questions about your patch:

1. Why did you move the declaration of the variable *plannode?
I think this is unnecessary; it only extends the variable's scope.

2. Why do you declare a new variable TupleDesc out_tuple_desc in
ExecInitSort?
I think this is unnecessary too, but maybe I missed something.

3. I inverted the order of the check at this line; I think "!node->bounded"
is cheaper than TupleDescAttr(tupDesc, 0)->attbyval.

4. Once the order of execution is changed, this test is unlikely to
happen, so adding unlikely() helps the branch.

5. I think you added an invariant check inside the loop:
"if (node->is_single_val)". Would it not be better to use two for loops?

For your convenience, I attached a v2 version (with style changes); please
take a look, and can you repeat your tests?
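
(For readers, the unlikely() in point 4 is PostgreSQL's branch-prediction
hint, a macro in c.h wrapping __builtin_expect on GCC-compatible
compilers; a standalone equivalent:)

    #include <stdio.h>

    /* standalone stand-in for the macro defined in PostgreSQL's c.h */
    #define unlikely(x) __builtin_expect((x) != 0, 0)

    static int
    classify(int v)
    {
        if (unlikely(v < 0))        /* hint: the negative case is cold */
            return -1;
        return 1;
    }

    int
    main(void)
    {
        printf("%d %d\n", classify(5), classify(-5));   /* prints: 1 -1 */
        return 0;
    }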

regards,
Ranier Vilela


v2-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patch
Description: Binary data


Re: Disable WAL logging to speed up data loading

2021-07-05 Thread Stephen Frost
Greetings,

* osumi.takami...@fujitsu.com (osumi.takami...@fujitsu.com) wrote:
> On Monday, July 5, 2021 10:32 AM Michael Paquier wrote:
> > On Sun, Jul 04, 2021 at 11:02:01AM -0400, Stephen Frost wrote:
> > > Rather than RfC, the appropriate status seems like it should be
> > > Rejected, as otherwise it's just encouraging someone to ultimately
> > > waste their time rebasing and updating the patch when it isn't going
> > > to ever actually be committed.
> > 
> > Yes, agreed with that.
> Thanks for paying attention to this thread.
> 
> This can't be helped; I've set the patch status as you suggested,
> because I judged the community would not accept this idea.

Thanks.  Hopefully this will encourage those interested in minimal WAL
logging for data loading performance to instead work on additional
optimizations for the existing 'minimal' WAL level.

Thanks again,

Stephen


signature.asc
Description: PGP signature


Atomic rename feature for Windows.

2021-07-05 Thread Victor Spirin

Hi

I used the SetFileInformationByHandle function with the 
FILE_RENAME_FLAG_POSIX_SEMANTICS flag for the file rename function.


1) The _WIN32_WINNT variable needs to be increased to 0x0A00 (Windows 
10).  Fixed conflict with #undef CHECKSUM_TYPE_NONE


2) The SetFileInformationByHandle function works correctly only on 
Windows 10 and higher.


The app must have a manifest to check the Windows version using the 
IsWindows10OrGreater() function. I added a manifest to all postgres 
projects and disabled the GenerateManifest option on windows projects.


This patch related to this post: 
https://www.postgresql.org/message-id/CAEepm%3D0FV-k%2B%3Dd9z08cW%3DZXoR1%3Dkw9wdpkP6WAuOrKJdz-8ujg%40mail.gmail.com



--
Victor Spirin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/include/common/checksum_helper.h 
b/src/include/common/checksum_helper.h
index cac7570ea1..2d533c93a6 100644
--- a/src/include/common/checksum_helper.h
+++ b/src/include/common/checksum_helper.h
@@ -26,6 +26,13 @@
  * MD5 here because any new that does need a cryptographically strong checksum
  * should use something better.
  */
+
+ /*
+ * CHECKSUM_TYPE_NONE defined in the winioctl.h when _WIN32_WINNT >= _WIN32_WINNT_WIN10
+ */
+#ifdef CHECKSUM_TYPE_NONE
+#undef CHECKSUM_TYPE_NONE
+#endif
 typedef enum pg_checksum_type
 {
CHECKSUM_TYPE_NONE,
diff --git a/src/include/port/win32.h b/src/include/port/win32.h
index d8ae49e22d..d91555f5c0 100644
--- a/src/include/port/win32.h
+++ b/src/include/port/win32.h
@@ -12,12 +12,13 @@
 /*
  * Make sure _WIN32_WINNT has the minimum required value.
  * Leave a higher value in place. When building with at least Visual
- * Studio 2015 the minimum requirement is Windows Vista (0x0600) to
- * get support for GetLocaleInfoEx() with locales. For everything else
+ * Studio 2015 the minimum requirement is Windows 10 (0x0A00) to get support for SetFileInformationByHandle.
+ * The minimum requirement is Windows Vista (0x0600) to get support for GetLocaleInfoEx() with locales.
+ * For everything else
  * the minimum version is Windows XP (0x0501).
  */
 #if defined(_MSC_VER) && _MSC_VER >= 1900
-#define MIN_WINNT 0x0600
+#define MIN_WINNT 0x0A00
 #else
 #define MIN_WINNT 0x0501
 #endif
diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 763bc5f915..fbd051699d 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -39,6 +39,110 @@
 #endif
 #endif
 
+
+#if defined(WIN32) && !defined(__CYGWIN__) && defined(_WIN32_WINNT_WIN10) && _WIN32_WINNT >= _WIN32_WINNT_WIN10
+
+#include 
+
+/*
+ * win10_rename - uses SetFileInformationByHandle function with 
FILE_RENAME_FLAG_POSIX_SEMANTICS flag for atomic rename file
+ * working only on Windows 10 or later and  _WIN32_WINNT must be >= 
_WIN32_WINNT_WIN10
+ */
+static int win10_rename(wchar_t const* from, wchar_t const* to)
+{
+
+   int err;
+   size_t len;
+   FILE_RENAME_INFO* rename_info;
+   HANDLE f_handle;
+
+   f_handle = CreateFileW(from,
+   GENERIC_READ | GENERIC_WRITE | DELETE,
+   FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
+   NULL,
+   OPEN_EXISTING,
+   FILE_ATTRIBUTE_NORMAL,
+   NULL);
+
+
+   if (f_handle == INVALID_HANDLE_VALUE)
+   {
+   err = GetLastError();
+   _dosmaperr(err);
+   return -1;
+   }
+
+
+   len = wcslen(to);
+   rename_info = (FILE_RENAME_INFO*)malloc(sizeof(FILE_RENAME_INFO) + (len + 1) * sizeof(wchar_t));
+   rename_info->ReplaceIfExists = TRUE;
+   rename_info->Flags = FILE_RENAME_FLAG_POSIX_SEMANTICS | 
FILE_RENAME_FLAG_REPLACE_IF_EXISTS;
+
+   rename_info->RootDirectory = NULL;
+   rename_info->FileNameLength = len;
+   memcpy(rename_info->FileName, to, (len + 1) * sizeof(wchar_t));
+   if (!SetFileInformationByHandle(f_handle, FileRenameInfoEx, rename_info, sizeof(FILE_RENAME_INFO) + (len + 1) * sizeof(wchar_t)))
+   {
+   err = GetLastError();
+
+   _dosmaperr(err);
+   CloseHandle(f_handle);
+   free(rename_info);
+   return -1;
+   }
+
+   CloseHandle(f_handle);
+
+   return 0;
+}
+
+
+/*
+ * pgrename_win10 - converts arguments from Windows ANSI code page to wchar_t and calls win10_rename function
+ */
+int
+pgrename_win10(const char *from, const char *to)
+{
+   wchar_t *from_w, *to_w;
+   size_t size, wsize;
+   int wlen;
+   int ret;
+
+   size = strlen(from) + 1;
+   wsize = size * sizeof(wchar_t);
+
+   from_w = (wchar_t*)malloc(wsize);
+   wlen = MultiByteToWideChar(CP_ACP, 0, (LPCCH)from, (int)size, (LPWSTR)from_w, (int)size);
+   if (wlen == 0)
+   {
+   free(from_w);
+   return -1;
+   }
+
+   size = strlen(to) + 1;
+   wsize = size * sizeof(wchar_t);
+
+   to_w = (wchar_t*)malloc(wsize);
+   wlen = 

Re: Removed extra memory allocations from create_list_bounds

2021-07-05 Thread David Rowley
On Wed, 19 May 2021 at 05:28, Nitin Jadhav
 wrote:
> I have rebased all the patches on top of
> 'v2_0001-removed_extra_mem_alloc_from_create_list_bounds.patch'.
> Attaching all the patches here.

I had a look over these and I think what's being done here is fine.

I think this will help speed up building the partition bound.
Unfortunately, it won't help any for speeding up things like finding
the correct partition during SELECT  or DML on partitioned tables.
The reason for this is that RelationBuildPartitionDesc first builds
the partition bound using the functions being modified here, but it
then copies it into the relcache into a memory context for the
partition using partition_bounds_copy().  It looks like
partition_bounds_copy() suffers from the same palloc explosion type
problem as is being fixed in each of the create_*_bounds() functions
here.   The good news is, we can just give partition_bounds_copy() the
same treatment. 0004 does that.
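
To make the pattern concrete, here's a minimal sketch of the treatment
(not the actual patch; ndatums and natts are placeholder names): one
palloc for every Datum slot up front, with the per-bound pointers
indexing into it, instead of one palloc per bound:

    /* single allocation instead of ndatums separate pallocs */
    Datum  *alldatums = (Datum *) palloc(sizeof(Datum) * ndatums * natts);

    for (int i = 0; i < ndatums; i++)
        boundinfo->datums[i] = &alldatums[i * natts];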

I tried to see if the better cache locality of the
PartitionBoundInfo's datum array would help speed up inserts into a
partitioned table. I figured a fairly small binary search in a LIST
partitioned table of 10 partitions might have all Datum visited all on
the same cache line. However, I was unable to see any performance
gains. I think the other work being done is likely just going to drown
out any gains in cache efficiency in the binary search.  COPY into a
partitioned table might have more chance of becoming a little faster,
but I didn't try.

I've attached another set of patches. I squashed all the changes to
each create_*_bounds function into a patch of their own. Perhaps 0002
and 0003 which handle range and hash partitioning can be one patch
since Justin seemed to write that one.  I kept 0001 separate since
that's Nitin's patch plus Justin's extra parts. It seems easier to
credit properly having the patches broken out like this.  I think it's
excessive to break down 0001 into Nitin and Justin's individual parts.

I did make a few adjustments to the patches, renaming a variable or two,
and I changed how we assign the boundinfo->datums[i] pointers to take
the address of the Nth element rather than incrementing the variable
pointing to the array each time.  I personally like p = &array[i];
more than p = array; array++, but others may not feel the same.

Nitin and Justin, are you both able to have another look over these
and let me know what you think.  If all is well I'd like to push all 4
patches.

David


v3-0001-Reduce-the-number-of-pallocs-in-create_list_bound.patch
Description: Binary data


v3-0004-Reduce-the-number-of-pallocs-in-partition_bounds_.patch
Description: Binary data


v3-0003-Reduce-the-number-of-palloc-calls-in-create_range.patch
Description: Binary data


v3-0002-Reduce-the-number-of-pallocs-in-create_hash_bound.patch
Description: Binary data


Re: Removing unneeded self joins

2021-07-05 Thread Andrey Lepikhov

On 2/7/21 01:56, Hywel Carver wrote:
On Wed, Jun 30, 2021 at 12:21 PM Andrey Lepikhov  wrote:

I think, here we could ask more general question: do we want to
remove a
'IS NOT NULL' clause from the clause list if the rest of the list
implicitly implies it?


My suggestion was not to remove it, but to avoid adding it in the first 
place. When your optimisation has found a join on a group of columns 
under a uniqueness constraint, you would do something like this (forgive 
the pseudo-code)


foreach(column, join_clause) {
   if(column.nullable) { // This condition is what I'm suggesting is added
     add_null_test(column, IS_NOT_NULL);
   }
}

But it may be that that's not possible or practical at this point in the 
code.
I think such an option would require implementing new machinery to prove
that an arbitrary column couldn't produce a NULL value.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-07-05 Thread Anastasia Lubennikova
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

I looked through the patch. Looks good to me. 

CFbot tests are passing and, as I gather from the thread, nobody opposes this
refactoring, so I'm moving it to RFC status.

The new status of this patch is: Ready for Committer


Re: Use extended statistics to estimate (Var op Var) clauses

2021-07-05 Thread Dean Rasheed
On Sun, 13 Jun 2021 at 21:28, Tomas Vondra
 wrote:
>
> Here is a slightly updated version of the patch
>

Hi,

I have looked at this in some more detail, and it all looks pretty
good, other than some mostly cosmetic stuff.

The new code in statext_is_compatible_clause_internal() is a little
hard to follow because some of the comments aren't right (e.g. when
checking clause_expr2, it isn't an (Expr op Const) or (Const op Expr)
as the comment says). Rather than trying to comment on each
conditional branch, it might be simpler to just have a single
catch-all comment at the top, and also remove the "return true" in the
middle, to make it something like:

/*
 * Check Vars appearing on either side by recursing, and make a note of
 * any expressions.
 */
if (IsA(clause_expr, Var))
{
if (!statext_is_compatible_clause_internal(...))
return false;
}
else
*exprs = lappend(*exprs, clause_expr);

if (clause_expr2)
{
if (IsA(clause_expr2, Var))
{
if (!statext_is_compatible_clause_internal(...))
return false;
}
else
*exprs = lappend(*exprs, clause_expr2);
}

return true;

Is the FIXME comment in examine_opclause_args() necessary? The check
for a single relation has already been done in
clause[list]_selectivity_ext(), and I'm not sure what
examine_opclause_args() would do differently.

In mcv_get_match_bitmap(), perhaps do the RESULT_IS_FINAL() checks
first in each loop.

Also in mcv_get_match_bitmap(), the 2 "First check whether the
constant is below the lower boundary ..." comments don't make any
sense to me. Were those perhaps copied and pasted from somewhere else?
They should perhaps say "Otherwise, compare the MCVItem with the
constant" and "Otherwise compare the values from the MCVItem using the
clause operator", or something like that.

But other than such cosmetic things, I think the patch is good, and
gives some nice estimate improvements.

Regards,
Dean




Fix possible variable declaration uninitialized (src/backend/utils/adt/varlena.c)

2021-07-05 Thread Ranier Vilela
Hi,

This is not a live bug.
I think this is worth fixing, just for the sake of style and code
correctness.
As a bonus, we have a reduced scope and standardized return.

regards,
Ranier Vilela


fix_possible_decl_var_uninitialized_varlena.patch
Description: Binary data


Re: Numeric multiplication overflow errors

2021-07-05 Thread Ranier Vilela
Em seg., 5 de jul. de 2021 às 09:02, David Rowley 
escreveu:

> On Mon, 5 Jul 2021 at 23:07, Ranier Vilela  wrote:
> >
> > Em seg., 5 de jul. de 2021 às 06:44, Dean Rasheed <
> dean.a.rash...@gmail.com> escreveu:
> >> Note, however, that it won't make any difference to performance in the
> >> way that you're suggesting -- elog() in Postgres is used for "should
> >> never happen, unless there's a software bug" errors, rather than, say,
> >> "might happen for certain invalid inputs" errors, so init_var() should
> >> always be called in these functions.
> >
> > I agree that in this case, most of the time, elog is not called.
>
> You may have misunderstood what Dean meant.  elog(ERROR) calls are now
> exclusively for "cannot happen" cases.  If someone gets one of these
> then there's a bug to fix or something else serious has gone wrong
> with the hardware.
>
> The case you seem to be talking about would fit better if the code in
> question had been ereport(ERROR).
>
> I don't disagree that the initialisation is better to happen after the
> elog. I'm just mentioning this as I wanted to make sure you knew the
> difference between elog(ERROR) and ereport(ERROR).
>
I understand the difference now, thanks for clarifying.

regards,
Ranier Vilela


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2021-07-05 Thread Ronan Dunklau
> Ok, I reproduced that case, just not using a group by: by adding the group
> by a sort node is added in both cases (master and your patch), except that
> with your patch we sort on both keys and that doesn't really incur a
> performance penalty.
> 
> I think the overhead occurs because in the ExecAgg case, we use the
> tuplesort_*_datum API as an optimization when we have a single column as an
> input, which the ExecSort code doesn't. Maybe it would be worth it to try to
> use that API in sort nodes too, when it can be done.

Please find attached a POC patch to do just that.

The switch to the single-datum tuplesort is done when there is only one 
attribute, it is byval (to avoid having to deal with copy of the references 
everywhere) and we are not in bound mode (to also avoid having to move things 
around).

A naive run on make check pass on this, but I may have overlooked things.

Should I add this separately to the commitfest ?

For the record, the times I got on my laptop, on master VS david's patch VS 
both. Values are an average of 100 runs, as reported by pgbench --no-vacuum -f 
 -t 100. There is a good amount of noise, but the simple "select one 
ordered column case" seems worth the optimization.

Only shared_buffers and work_mem have been set to 2GB each.

Setup 1: single table, 1 000 000 tuples, no index
CREATE TABLE tbench (
  a int,
  b int
);

INSERT INTO tbench (a, b) SELECT a, b FROM generate_series(1, 100) a,
generate_series(1, 10000) b;


Test 1: Single-column ordered select (order by b since the table is already 
sorted by a)
select b from tbench order by b;

master: 303.661ms
with mine: 148.571ms

Test 2: Ordered sum (using b so that the input is not presorted)
select sum(b order by b) from tbench;

master: 112.379ms
with david's patch: 144.469ms
with david's patch + mine: 97ms

Test 3: Ordered sum + group by
select b, sum(a order by a) from tbench GROUP BY b;

master: 316.117ms
with david's patch: 297.079
with david's patch + mine: 294.601

Setup 2: same as before, but adding an index on (b, a)
CREATE INDEX ON tbench (b, a);

Test 2: Ordered sum:
select sum(a order by a) from tbench;

master: 111.847 ms
with david's patch: 48.088
with david's patch + mine: 47.678 ms

Test 3: Ordered sum + group by:
select a, sum(b order by b) from tbench GROUP BY a;

master: 76.873 ms
with david's patch: 61.105
with david's patch + mine:   62.672 ms


-- 
Ronan Dunklau

From a833baf025f69762fb1f076bf1ef9c986dcbe824 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau 
Date: Mon, 5 Jul 2021 12:45:26 +0200
Subject: [PATCH] Allow Sort nodes to use the fast "single datum" tuplesort.

The sorting code in nodeagg made use of this API, but not the regular
nodesort one. This is a POC seeing how it can be done.
---
 src/backend/executor/nodeSort.c | 62 +
 src/include/nodes/execnodes.h   |  1 +
 2 files changed, 48 insertions(+), 15 deletions(-)

diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..03873b0d58 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -44,6 +44,7 @@ ExecSort(PlanState *pstate)
 	ScanDirection dir;
 	Tuplesortstate *tuplesortstate;
 	TupleTableSlot *slot;
+	Sort	   *plannode = (Sort *) node->ss.ps.plan;
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -64,7 +65,6 @@ ExecSort(PlanState *pstate)
 
 	if (!node->sort_Done)
 	{
-		Sort	   *plannode = (Sort *) node->ss.ps.plan;
 		PlanState  *outerNode;
 		TupleDesc	tupDesc;
 
@@ -85,16 +85,29 @@ ExecSort(PlanState *pstate)
 
 		outerNode = outerPlanState(node);
 		tupDesc = ExecGetResultType(outerNode);
+		if ((tupDesc->natts == 1) &&
+			(TupleDescAttr(tupDesc, 0) ->attbyval) &&
+			!node->bounded)
+		{
 
-		tuplesortstate = tuplesort_begin_heap(tupDesc,
-			  plannode->numCols,
-			  plannode->sortColIdx,
-			  plannode->sortOperators,
-			  plannode->collations,
-			  plannode->nullsFirst,
-			  work_mem,
-			  NULL,
-			  node->randomAccess);
+			node->is_single_val = true;
+			tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+   plannode->sortOperators[0],
+   plannode->collations[0],
+   plannode->nullsFirst[0],
+   work_mem,
+   NULL,
+   node->randomAccess);
+		} else
+			tuplesortstate = tuplesort_begin_heap(tupDesc,
+  plannode->numCols,
+  plannode->sortColIdx,
+  plannode->sortOperators,
+  plannode->collations,
+  plannode->nullsFirst,
+  work_mem,
+  NULL,
+  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
@@ -109,8 +122,15 @@ ExecSort(PlanState *pstate)
 
 			if (TupIsNull(slot))
 break;
-
-			tuplesort_puttupleslot(tuplesortstate, slot);
+			if(node->is_single_val)
+			{
+

Re: Numeric multiplication overflow errors

2021-07-05 Thread David Rowley
On Mon, 5 Jul 2021 at 23:07, Ranier Vilela  wrote:
>
> Em seg., 5 de jul. de 2021 às 06:44, Dean Rasheed  
> escreveu:
>> Note, however, that it won't make any difference to performance in the
>> way that you're suggesting -- elog() in Postgres is used for "should
>> never happen, unless there's a software bug" errors, rather than, say,
>> "might happen for certain invalid inputs" errors, so init_var() should
>> always be called in these functions.
>
> I agree that in this case, most of the time, elog is not called.

You may have misunderstood what Dean meant.  elog(ERROR) calls are now
exclusively for "cannot happen" cases.  If someone gets one of these
then there's a bug to fix or something else serious has gone wrong
with the hardware.

The case you seem to be talking about would fit better if the code in
question had been ereport(ERROR).

I don't disagree that the initialisation is better to happen after the
elog. I'm just mentioning this as I wanted to make sure you knew the
difference between elog(ERROR) and ereport(ERROR).
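
To spell out the distinction with a minimal sketch (the conditions here
are illustrative, not taken from the patch):

    /* internal invariant; if this fires, it's a bug in the code */
    if (res == NULL)
        elog(ERROR, "unexpected NULL result");

    /* user-triggerable condition; report with a proper SQLSTATE */
    if (result.dscale > NUMERIC_DSCALE_MAX)
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("value overflows numeric format")));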

David




Re: Excessive cost of OpClassCache flushes in CLOBBER_CACHE_ALWAYS mode

2021-07-05 Thread Andrew Dunstan


On 7/4/21 3:57 PM, Tom Lane wrote:
> Over in [1] it is demonstrated that with CLOBBER_CACHE_ALWAYS enabled,
> initdb accounts for a full 50% of the runtime of "make check-world"
> (well, actually of the buildfarm cycle, which is not quite exactly
> that but close).  Since initdb certainly doesn't cost that much
> normally, I wondered why it is so negatively affected by CCA.  Some
> perf measuring led me to LookupOpclassInfo, and specifically this bit:
>
> /*
>  * When testing for cache-flush hazards, we intentionally disable the
>  * operator class cache and force reloading of the info on each call. This
>  * is helpful because we want to test the case where a cache flush occurs
>  * while we are loading the info, and it's very hard to provoke that if
>  * this happens only once per opclass per backend.
>  */
> #ifdef CLOBBER_CACHE_ENABLED
> if (debug_invalidate_system_caches_always > 0)
> opcentry->valid = false;
> #endif
>
> Diking that out halves initdb's CCA runtime.  Turns out it also
> roughly halves the runtime of the core regression tests under CCA,
> so this doesn't explain why initdb seems so disproportionately
> affected by CCA.
>
> However, seeing that this single choice is accounting for half the
> cost of CCA testing, we really have to ask whether it's worth that.
> This code was added by my commit 03ffc4d6d of 2007-11-28, about which
> I wrote:
>
> Improve test coverage of CLOBBER_CACHE_ALWAYS by having it also force
> reloading of operator class information on each use of LookupOpclassInfo.
> Had this been in place a year ago, it would have helped me find a bug
> in the then-new 'operator family' code.  Now that we have a build farm
> member testing CLOBBER_CACHE_ALWAYS on a regular basis, it seems worth
> expending a little bit of effort here.
>
> I'm now a little dubious about my claim that this would have helped find
> any bugs.  Invalidating a finished OpClassCache entry does not model any
> real-world scenario, because as noted elsewhere in LookupOpclassInfo,
> once such a cache entry is populated it is kept for the rest of the
> session.  Also, the claim in the comment that we need this to test
> a cache flush during load seems like nonsense: if we have
> debug_invalidate_system_caches_always turned on, then we'll test
> the effects of such flushes throughout the initial population of
> a cache entry.  Doing it over and over again adds nothing.
>
> So I'm now fairly strongly tempted to remove this code outright
> (effectively reverting 03ffc4d6d).  Another possibility now that
> we have debug_invalidate_system_caches_always is to increase the
> threshold at which this happens, making it more like
> CLOBBER_CACHE_RECURSIVE.
>
> Thoughts?
>
>   


If we don't think it's adding anything useful just rip it out. We don't
generally keep code hanging around just on the off chance it might be
useful some day.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: "debug_invalidate_system_caches_always" is too long

2021-07-05 Thread Andrew Dunstan


On 7/4/21 4:27 PM, Tom Lane wrote:
> As I've been poking around in this area, I find myself growing
> increasingly annoyed at the new GUC name
> "debug_invalidate_system_caches_always".  It is too d*mn long.
> It's a serious pain to type in any context where you don't have
> autocomplete to help you.  I've kept referring to this type of
> testing as CLOBBER_CACHE_ALWAYS testing, even though that name is
> now obsolete, just because it's so much shorter.  I think we need
> to reconsider this name while we still can.
>
> I do agree with the "debug_" prefix given that it's now visible to
> users.  However, it doesn't seem that hard to save some space in
> the rest of the name.  The word "system" is adding nothing of value,
> and the word "always" seems rather confusing --- if it does
> something "always", why is there more than one level?  So a simple
> proposal is to rename it to "debug_invalidate_caches".
>
> However, I think we should also give serious consideration to
> "debug_clobber_cache" or "debug_clobber_cache_always" for continuity
> with past practice (though it still feels like "always" is a good
> word to lose now).  "debug_clobber_caches" is another reasonable
> variant.
>


+1 for debug_invalidate_caches - it seems to have the most content and
least noise. Second choice would be debug_clobber_caches.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Numeric multiplication overflow errors

2021-07-05 Thread Ranier Vilela
Em seg., 5 de jul. de 2021 às 06:44, Dean Rasheed 
escreveu:

> On Fri, 2 Jul 2021 at 19:48, Ranier Vilela  wrote:
> >
> > If you allow me a small suggestion.
> > Move the initializations of the variable tmp_var to after check if the
> function can run.
> > Saves some cycles, when not running.
> >
>
> OK, thanks. I agree, on grounds of neatness and consistency with
> nearby code, so I've done it that way.
>
Thanks.


> Note, however, that it won't make any difference to performance in the
> way that you're suggesting -- elog() in Postgres is used for "should
> never happen, unless there's a software bug" errors, rather than, say,
> "might happen for certain invalid inputs" errors, so init_var() should
> always be called in these functions.
>
I agree that in this case, most of the time, elog is not called.
But written this way, the code follows the principle of not doing
unnecessary work until it is absolutely necessary.
Following this principle generally leads to better performance.

regards,
Ranier Vilela


Re: [PATCH] Hooks at XactCommand level

2021-07-05 Thread Gilles Darold

Le 01/07/2021 à 18:47, Tom Lane a écrit :

Nicolas CHAHWEKILIAN  writes:

As far as I am concerned, I am totally awaiting for this kind of feature
exposed here, for one single reason at this time : the extension
pg_statement_rollback will be much more valuable with the ability of
processing "rollback to savepoint" without the need for explicit
instruction from client side (and this patch is giving this option).

What exactly do these hooks do that isn't done as well or better
by the RegisterXactCallback and RegisterSubXactCallback mechanisms?
Perhaps we need to define some additional event types for those?
Or pass more data to the callback functions?

I quite dislike inventing a hook that's defined as "run during
start_xact_command", because there is basically nothing that's
not ad-hoc about that function: it's internal to postgres.c
and both its responsibilities and its call sites have changed
over time.  I think anyone hooking into that will be displeased
by the stability of their results.



Sorry Tom, it seems that I have totally misinterpreted your comments, 
google translate was not a great help for my understanding but Julien 
was. Thanks Julien.



I'm attaching a new patch v4 that removes the need for any hook and adds
new events, XACT_EVENT_COMMAND_START and SUBXACT_EVENT_COMMAND_START, that
can be caught in the xact callbacks when a new command is to be executed.
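
For illustration, a minimal sketch of how an extension could consume the
new event (the callback body is hypothetical; only the event name comes
from the patch, and RegisterXactCallback() is the existing core API):

static void
my_xact_callback(XactEvent event, void *arg)
{
	if (event == XACT_EVENT_COMMAND_START)
		elog(DEBUG1, "a new command is about to start");
}

void
_PG_init(void)
{
	RegisterXactCallback(my_xact_callback, NULL);
}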



--
Gilles Darold
http://www.darold.net/

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 441445927e..3b5f6bfc2d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6118,3 +6118,32 @@ MarkSubTransactionAssigned(void)
 
 	CurrentTransactionState->assigned = true;
 }
+
+/*
+ * CallXactStartCommand
+ *
+ * Wrapper over CallXactCallbacks or CallSubXactCallbacks called in postgres.c
+ * at the end of start_xact_command(). It allows user-defined code to be
+ * executed at the start of any command through a registered xact callback.
+ * This function does nothing if a transaction is not started.
+ *
+ * The related events XactEvent/SubXactEvent are XACT_EVENT_COMMAND_START and
+ * SUBXACT_EVENT_COMMAND_START.
+ */
+void
+CallXactStartCommand(void)
+{
+	TransactionState s = CurrentTransactionState;
+
+	if (s->blockState == TBLOCK_DEFAULT || s->blockState == TBLOCK_STARTED)
+		return;
+
+	/*
+	 * Call start-of-xact callbacks with start command event
+	 */
+	if (s->parent && s->subTransactionId)
+		CallSubXactCallbacks(SUBXACT_EVENT_COMMAND_START, s->subTransactionId,
+			 s->parent->subTransactionId);
+	else
+		CallXactCallbacks(XACT_EVENT_COMMAND_START);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8cea10c901..a0f4a17c51 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2708,6 +2708,16 @@ start_xact_command(void)
 		!get_timeout_active(CLIENT_CONNECTION_CHECK_TIMEOUT))
 		enable_timeout_after(CLIENT_CONNECTION_CHECK_TIMEOUT,
 			 client_connection_check_interval);
+
+	/*
+	 * Notify registered xact callbacks that a new command is about to be
+	 * executed, allowing user-defined code to run before any new
+	 * statement. It is the responsibility of the user code to take care
+	 * of the state of the transaction, which can be in an ABORT state.
+	 * The related xact events are XACT_EVENT_COMMAND_START and
+	 * SUBXACT_EVENT_COMMAND_START.
+	 */
+	CallXactStartCommand();
 }
 
 static void
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 134f6862da..190fc7151b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -119,7 +119,8 @@ typedef enum
 	XACT_EVENT_PREPARE,
 	XACT_EVENT_PRE_COMMIT,
 	XACT_EVENT_PARALLEL_PRE_COMMIT,
-	XACT_EVENT_PRE_PREPARE
+	XACT_EVENT_PRE_PREPARE,
+	XACT_EVENT_COMMAND_START
 } XactEvent;
 
 typedef void (*XactCallback) (XactEvent event, void *arg);
@@ -129,7 +130,8 @@ typedef enum
 	SUBXACT_EVENT_START_SUB,
 	SUBXACT_EVENT_COMMIT_SUB,
 	SUBXACT_EVENT_ABORT_SUB,
-	SUBXACT_EVENT_PRE_COMMIT_SUB
+	SUBXACT_EVENT_PRE_COMMIT_SUB,
+	SUBXACT_EVENT_COMMAND_START
 } SubXactEvent;
 
 typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
@@ -467,4 +469,6 @@ extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
 
+extern void CallXactStartCommand(void);
+
 #endif			/* XACT_H */


Re: Skipping logical replication transactions on subscriber side

2021-07-05 Thread Alexey Lesovsky
Hi,
Have a few notes about pg_stat_logical_replication_error from the DBA point
of view (which will use this view in the future).
1. As I understand it, this view might contain many errors related to
different subscriptions. It is better to name
"pg_stat_logical_replication_errors" using the plural form (like this done
for stat views for tables, indexes, functions). Also, I'd like to suggest
thinking twice about the view name (and function used in view DDL) -
"pg_stat_logical_replication_error" contains very common "logical
replication" words, but the view contains errors related to subscriptions
only. In the future there could be other kinds of errors related to logical
replication, but not related to subscriptions - what will you do?
2. Add a field with database name or id - it helps to quickly understand to
which database the subscription belongs.
3. Add a counter field with total number of errors - it helps to calculate
errors rates and aggregations (sum), and don't lose information about
errors between view checks.
4. Add text of last error (if it will not be too expensive).
5. Rename the "action" field to "command", as I know this is right from
terminology point of view.

Finally, the view might look like this:

postgres(1:25250)=# select * from pg_stat_logical_replication_errors;
 subname | datid | relid | command | xid | total |         last_failure          |     last_failure_text
---------+-------+-------+---------+-----+-------+-------------------------------+---------------------------
 sub_1   | 12345 | 16384 | INSERT  | 736 |   145 | 2021-06-27 12:12:45.142675+09 | something goes wrong...
 sub_2   | 12346 | 16458 | UPDATE  | 845 |    12 | 2021-06-27 12:16:01.458752+09 | hmm, something goes wrong

Regards, Alexey

On Mon, Jul 5, 2021 at 2:59 PM Masahiko Sawada 
wrote:

> On Thu, Jun 17, 2021 at 6:20 PM Masahiko Sawada 
> wrote:
> >
> > On Thu, Jun 17, 2021 at 3:24 PM Masahiko Sawada 
> wrote:
> > >
> > > > Now, if this function is used by super
> > > > users then we can probably trust that they provide the XIDs that we
> > > > can trust to be skipped but OTOH making a restriction to allow these
> > > > functions to be used by superusers might restrict the usage of this
> > > > repair tool.
> > >
> > > If we specify the subscription id or name, maybe we can allow also the
> > > owner of subscription to do that operation?
> >
> > Ah, the owner of the subscription must be superuser.
>
> I've attached PoC patches.
>
> 0001 patch introduces the ability to skip transactions on the
> subscriber side. We can specify XID to the subscription by like ALTER
> SUBSCRIPTION test_sub SET SKIP TRANSACTION 100. The implementation
> seems straightforward except for setting origin state. After skipping
> the transaction we have to update the session origin state so that we
> can start streaming the transaction next to the one that we just
> skipped in case of the server crash or restarting the apply worker. We
> set origin state to the commit WAL record. However, since we skip all
> changes we don’t write any WAL even if we call CommitTransaction() at
> the end of the skipped transaction. So the patch sets the origin state
> to the transaction that updates the pg_subscription system catalog to
> reset the skip XID. I think we need a discussion of this part.
>
> With 0002 and 0003 patches, we report the error information in server
> logs and the stats view, respectively. 0002 patch adds errcontext for
> messages that happened during applying the changes:
>
> ERROR:  duplicate key value violates unique constraint "hoge_pkey"
> DETAIL:  Key (c)=(1) already exists.
> CONTEXT:  during apply of "INSERT" for relation "public.hoge" in
> transaction with xid 736 committs 2021-06-27 12:12:30.053887+09
>
> 0003 patch adds pg_stat_logical_replication_error statistics view
> discussed on another thread[1]. The apply worker sends the error
> information to the stats collector if an error happens during applying
> changes. We can check those errors as follow:
>
> postgres(1:25250)=# select * from pg_stat_logical_replication_error;
>  subname  | relid | action | xid | last_failure
> --+---++-+---
>  test_sub | 16384 | INSERT | 736 | 2021-06-27 12:12:45.142675+09
> (1 row)
>
> I added only columns required for the skipping transaction feature to
> the view for now.
>
> Please note that those patches are meant to evaluate the concept we've
> discussed so far. Those don't have the doc update yet.
>
> Regards,
>
> [1]
> https://www.postgresql.org/message-id/DB35438F-9356-4841-89A0-412709EBD3AB%40enterprisedb.com
>
>
> --
> Masahiko Sawada
> EDB:  https://www.enterprisedb.com/
>


-- 
Best regards, Alexey V. Lesovsky


Re: Skipping logical replication transactions on subscriber side

2021-07-05 Thread Amit Kapila
On Mon, Jul 5, 2021 at 3:16 PM Amit Kapila  wrote:
>
> On Thu, Jul 1, 2021 at 6:31 PM Masahiko Sawada  wrote:
> >
> > On Thu, Jul 1, 2021 at 12:56 PM Amit Kapila  wrote:
>
> Instead of using the syntax "ALTER SUBSCRIPTION name SET SKIP
> TRANSACTION Iconst", isn't it better to use it as a subscription
> option like Mark has done for his patch (disable_on_error)?
>
> I am slightly nervous about this way of allowing the user to skip the
> errors because if it is not used carefully then it can easily lead to
> inconsistent data on the subscriber. I agree that as only superusers
> will be allowed to use this option and we can document clearly the
> side-effects, the risk could be reduced but is that sufficient?
>

I see that users can create a similar effect by using
pg_replication_origin_advance() and it is mentioned in the docs that
careless use of this function can lead to inconsistently replicated
data. So, this new way doesn't seem to be any more dangerous than what
we already have.

-- 
With Regards,
Amit Kapila.




Re: psql - add SHOW_ALL_RESULTS option

2021-07-05 Thread Peter Eisentraut

On 12.06.21 11:41, Fabien COELHO wrote:
The patch includes basic AUTOCOMMIT and ON_ERROR_ROLLBACK tests, which 
did not exist before, at all.


I looked at these tests first.  The tests are good, they increase 
coverage.  But they don't actually test the issue that was broken by the 
previous patch, namely the situation where autocommit is off and the 
user manually messes with the savepoints.  I applied the tests against 
the previous patch and there was no failure.  So the tests are useful, 
but they don't really help this patch.  Would you like to enhance the 
tests a bit to cover this case?  I think we could move forward with 
these tests then.





Re: Numeric multiplication overflow errors

2021-07-05 Thread Dean Rasheed
On Sun, 4 Jul 2021 at 09:43, David Rowley  wrote:
>
> On Sat, 3 Jul 2021 at 11:04, Dean Rasheed  wrote:
> > Thinking about this more, I think it's best not to risk back-patching.
> > It *might* be safe, but it's difficult to really be sure of that. The
> > bug itself is pretty unlikely to ever happen in practice, hence the
> > lack of prior complaints, and in fact I only found it by an
> > examination of the code. So it doesn't seem to be worth the risk.
>
> That seems like good logic to me.  Perhaps we can reconsider that
> decision if users complain about it.

Thanks. Pushed to master only.

I think the other part (avoiding overflows in numeric_mul) is fairly
straightforward and uncontentious, so barring objections, I'll push
and back-patch it in a couple of days or so.

Regards,
Dean
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index bc71326..2a0f68f
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -233,6 +233,7 @@ struct NumericData
  */
 
 #define NUMERIC_DSCALE_MASK			0x3FFF
+#define NUMERIC_DSCALE_MAX			NUMERIC_DSCALE_MASK
 
 #define NUMERIC_SIGN(n) \
 	(NUMERIC_IS_SHORT(n) ? \
@@ -2958,7 +2959,11 @@ numeric_mul_opt_error(Numeric num1, Numeric num2, bool *have_error)
 	 * Unlike add_var() and sub_var(), mul_var() will round its result. In the
 	 * case of numeric_mul(), which is invoked for the * operator on numerics,
 	 * we request exact representation for the product (rscale = sum(dscale of
-	 * arg1, dscale of arg2)).
+	 * arg1, dscale of arg2)).  If the exact result has more digits after the
+	 * decimal point than can be stored in a numeric, we round it.  Rounding
+	 * after computing the exact result ensures that the final result is
+	 * correctly rounded (rounding in mul_var() using a truncated product
+	 * would not guarantee this).
 	 */
	init_var_from_num(num1, &arg1);
	init_var_from_num(num2, &arg2);
@@ -2966,6 +2971,9 @@ numeric_mul_opt_error(Numeric num1, Numeric num2, bool *have_error)
	init_var(&result);
	mul_var(&arg1, &arg2, &result, arg1.dscale + arg2.dscale);
 
+	if (result.dscale > NUMERIC_DSCALE_MAX)
+		round_var(&result, NUMERIC_DSCALE_MAX);
+
	res = make_result_opt_error(&result, have_error);
 
	free_var(&result);
diff --git a/src/test/regress/expected/numeric.out b/src/test/regress/expected/numeric.out
new file mode 100644
index 4ad4851..385e963
--- a/src/test/regress/expected/numeric.out
+++ b/src/test/regress/expected/numeric.out
@@ -2145,6 +2145,12 @@ select 47699
  47685231
 (1 row)
 
+select trim_scale((0.1 - 2e-16383) * (0.1 - 3e-16383));
+ trim_scale 
+
+   0.01
+(1 row)
+
 --
 -- Test some corner cases for division
 --
diff --git a/src/test/regress/sql/numeric.sql b/src/test/regress/sql/numeric.sql
new file mode 100644
index 3784c52..7e17c28
--- a/src/test/regress/sql/numeric.sql
+++ b/src/test/regress/sql/numeric.sql
@@ -1044,6 +1044,8 @@ select 47709
 
 select 4769 * ;
 
+select trim_scale((0.1 - 2e-16383) * (0.1 - 3e-16383));
+
 --
 -- Test some corner cases for division
 --


Re: Asymmetric partition-wise JOIN

2021-07-05 Thread Andrey Lepikhov

On 18/6/21 15:02, Alexander Pyhalov wrote:

Andrey Lepikhov писал 2021-05-27 07:27:

Next version of the patch.
For searching any problems I forced this patch during 'make check'
tests. Some bugs were found and fixed.


Hi.
I've tested this patch and haven't found issues, but I have some comments.

Thank you for review!
> Variable names - child_join_rel and child_join_relids seem to be
> inconsistent with rest of the file

fixed

> When build_child_join_rel() can return NULL?

Fixed

> Missing word:
> each bitmapset variable will large => each bitmapset variable will be large

Fixed

> What hook do you refer to?

Removed

> You've changed function to copy appinfo, so now comment is incorrect.

Thanks, fixed

> Do I understand correctly that now parent_relids also contains relids of
> relations from 'global' inner relation, which we join to childs?

Yes

--
regards,
Andrey Lepikhov
Postgres Professional
From d653b4f65486e2cefafda61811390c4095eae371 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Fri, 2 Apr 2021 11:02:20 +0500
Subject: [PATCH] Asymmetric partitionwise join.

Teach optimizer to consider partitionwise join of non-partitioned
table with each partition of partitioned table.

Disallow asymmetric machinery for joining of two partitioned (or appended)
relations because it could cause CPU and memory huge consumption
during reparameterization of NestLoop path.
---
 src/backend/optimizer/path/joinpath.c|   9 +
 src/backend/optimizer/path/joinrels.c| 172 +++
 src/backend/optimizer/plan/setrefs.c |  17 +-
 src/backend/optimizer/util/appendinfo.c  |  37 ++-
 src/backend/optimizer/util/pathnode.c|   9 +-
 src/backend/optimizer/util/relnode.c |  19 +-
 src/include/optimizer/paths.h|   7 +-
 src/test/regress/expected/partition_join.out | 306 +++
 src/test/regress/sql/partition_join.sql  | 126 
 9 files changed, 672 insertions(+), 30 deletions(-)

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index b67b517770..7a1a7b2e86 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root,
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
   jointype, );
+
+   /*
+    * 7. If the outer relation is derived from partitioned tables, consider
+    * distributing the inner relation to every partition leaf prior to
+    * appending these leaves.
+    */
+   try_asymmetric_partitionwise_join(root, joinrel,
+                                     outerrel, innerrel,
+                                     jointype, &extra);
 }
 
 /*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 0dbe2ac726..2839b7acc3 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
 
 #include "miscadmin.h"
 #include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
@@ -1551,6 +1552,177 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
 }
 
+/*
+ * Build a RelOptInfo for the JOIN of each partition of the outer relation
+ * with the inner relation. Return a List of such RelOptInfo's, or NIL if at
+ * least one of these JOINs is impossible to build.
+ */
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+                                         RelOptInfo *joinrel,
+                                         AppendPath *append_path,
+                                         RelOptInfo *inner_rel,
+                                         JoinType jointype,
+                                         JoinPathExtraData *extra)
+{
+   List*result = NIL;
+   ListCell*lc;
+
+   foreach (lc, append_path->subpaths)
+   {
+   Path*child_path = lfirst(lc);
+   RelOptInfo  *child_rel = child_path->parent;
+   Relids  child_joinrelids;
+   Relids  parent_relids;
+   RelOptInfo  *child_joinrel;
+   SpecialJoinInfo *child_sjinfo;
+   List*child_restrictlist;
+
+       child_joinrelids = bms_union(child_rel->relids, inner_rel->relids);
+   parent_relids = bms_union(append_path->path.parent->relids,
+  

Re: Doc chapter for Hash Indexes

2021-07-05 Thread Amit Kapila
On Tue, Jun 29, 2021 at 2:21 PM Amit Kapila  wrote:
>
> On Sat, Jun 26, 2021 at 3:43 PM Amit Kapila  wrote:
> >
>
> I am planning to go through the patch once again and would like to
> commit and backpatch till v10 in a day to two unless someone thinks
> otherwise.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2021-07-05 Thread Amit Kapila
On Thu, Jul 1, 2021 at 6:31 PM Masahiko Sawada  wrote:
>
> On Thu, Jul 1, 2021 at 12:56 PM Amit Kapila  wrote:
> >
> >
> > Don't we want to clear stats at drop subscription as well? We do drop
> > database stats in dropdb via pgstat_drop_database, so I think we need
> > to clear subscription stats at the time of drop subscription.
>
> Yes, it needs to be cleared. In the 0003 patch, pgstat_vacuum_stat()
> sends the message to clear the stats. I think it's better to have
> pgstat_vacuum_stat() do that job similar to dropping replication slot
> statistics rather than relying on the single message send at DROP
> SUBSCRIPTION. I've considered doing both: sending the message at DROP
> SUBSCRIPTION and periodical checking by pgstat_vacuum_stat(), but
> dropping subscription not setting a replication slot is able to
> rollback. So we need to send it only at commit time. Given that we
> don’t necessarily need the stats to be updated immediately, I think
> it’s reasonable to go with only a way of pgstat_vacuum_stat().
>

Okay, that makes sense. Can we consider sending the multiple ids in
one message as we do for relations or functions in
pgstat_vacuum_stat()? That will reduce some message traffic. BTW, do
we have some way to avoid wrapping around the OID before we clean up
via pgstat_vacuum_stat()?


> > In the 0003 patch, if I am reading it correctly then the patch is not
> > doing anything for tablesync worker. It is not clear to me at this
> > stage what exactly we want to do about it? Do we want to just ignore
> > errors from tablesync worker and let the system behave as it is
> > without this feature? If we want to do anything then I think the way
> > to skip the initial table sync would be to behave like the user has
> > given 'copy_data' option as false.
>
> It might be better to have also sync workers report errors, even if
> SKIP TRANSACTION feature doesn’t support anything for initial table
> synchronization. From the user perspective, The initial table
> synchronization is also the part of logical replication operations. If
> we report only error information of applying logical changes, it could
> confuse users.
>
> But I’m not sure about the way to skip the initial table
> synchronization. Once we set `copy_data` to false, all table
> synchronizations are disabled. Some of them might have been able to
> synchronize successfully. It might be useful if the user can disable
> the table initialization for the particular tables.
>

True but I guess the user can wait for all the tablesyncs to either
finish or get an error corresponding to the table sync. After that, it
can use 'copy_data' as false. This is not a very good method but I
don't see any other option. In any case, I guess logging errors from
tablesyncs is not a bad idea.

Instead of using the syntax "ALTER SUBSCRIPTION name SET SKIP
TRANSACTION Iconst", isn't it better to use it as a subscription
option like Mark has done for his patch (disable_on_error)?

I am slightly nervous about this way of allowing the user to skip the
errors because if it is not used carefully then it can easily lead to
inconsistent data on the subscriber. I agree that as only superusers
will be allowed to use this option and we can document clearly the
side-effects, the risk could be reduced but is that sufficient? It is
not that we don't have any other tool which allows users to make their
data inconsistent (one recent example is functions
(heap_force_kill/heap_force_freeze) in pg_surgery module) if not used
carefully but it might be better to not expose such tools.

OTOH, if we use the error infrastructure of this patch and allow users
to just disable the subscription on error as was proposed by Mark then
that can't lead to any inconsistency.

What do you think?

-- 
With Regards,
Amit Kapila.




Re: Numeric multiplication overflow errors

2021-07-05 Thread Dean Rasheed
On Fri, 2 Jul 2021 at 19:48, Ranier Vilela  wrote:
>
> If you allow me a small suggestion.
> Move the initializations of the variable tmp_var to after check if the 
> function can run.
> Saves some cycles, when not running.
>

OK, thanks. I agree, on grounds of neatness and consistency with
nearby code, so I've done it that way.

Note, however, that it won't make any difference to performance in the
way that you're suggesting -- elog() in Postgres is used for "should
never happen, unless there's a software bug" errors, rather than, say,
"might happen for certain invalid inputs" errors, so init_var() should
always be called in these functions.

Regards,
Dean




Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors

2021-07-05 Thread Yugo NAGATA
Hello Ishii-san,

On Thu, 01 Jul 2021 09:03:42 +0900 (JST)
Tatsuo Ishii  wrote:

> > v13 patches gave a compiler warning...
> > 
> > $ make >/dev/null
> > pgbench.c: In function ‘commandError’:
> > pgbench.c:3071:17: warning: unused variable ‘command’ [-Wunused-variable]
> >   const Command *command = sql_script[st->use_file].commands[st->command];
> >  ^~~

Hmm, we'll get the warning when --enable-cassert is not specified.
I'll fix it.

> There is a typo in the doc (more over ->  moreover).
> 
> >of all transaction tries; more over, you cannot use an unlimited 
> > number
> 
> of all transaction tries; moreover, you cannot use an unlimited number
> 

Thanks. I'll fix.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-07-05 Thread Andrey Borodin



> On Jul 1, 2021, at 20:59, Mark Dilger 
> wrote:
> 
> 
> 
>> On Jun 29, 2021, at 6:25 PM, Mark Dilger  
>> wrote:
>> 
>> Please find attached a new set of patches.
> 
> And again, this time attaching a fifth patch which includes the work to allow 
> users who belong to the right security role to SET and ALTER SYSTEM SET 
> variables without being a superuser.

I'm not sure, but maybe we should allow replication role to change 
session_replication_role?

Best regards, Andrey Borodin.



Re: [PATCH] Pull general SASL framework out of SCRAM

2021-07-05 Thread Michael Paquier
On Wed, Jun 30, 2021 at 10:30:12PM +, Jacob Champion wrote:
> Done in v3, with a second patch for the code motion.

I have gone through that, tweaking the documentation you have added as
that's the meat of the patch, reworking a bit the declarations of the
callbacks (no need for several typedefs here) and doing some small
format changes to make the indentation happy.  And that looks pretty
good.  It is a bit sad that the SCRAM part cannot be completely
unplugged from the auth part, because of the call to the free function
and the HBA checks, but adding more wrappers to accommodate that
is not really worth it.  So I'd like to apply that to clarify this
code layer, without the TODOs.

-   pg_be_scram_get_mechanisms(port, _mechs);
-   /* Put another '\0' to mark that list is finished. */
-   appendStringInfoChar(_mechs, '\0');
I was wondering for a couple of seconds if it would not be better to
let the last '\0' being set within the callback, but what you have
here looks better.

-   if (!pg_fe_scram_channel_bound(conn->sasl_state))
+   if (!conn->sasl || !conn->sasl->channel_bound(conn->sasl_state))
conn->sasl should be set in this code path.  This style is safer.

The top comment of scram_init() still mentioned
pg_be_scram_get_mechanisms(), while it should be
scram_get_mechanisms().

PG_MAX_SASL_MESSAGE_LENGTH can stay within auth-sasl.c.

> I added a first pass at API documentation as well. This exposed some
> additional front-end TODOs that I added inline, but they should
> probably be dealt with independently of the refactor:
> 
> - Zero-length client responses are legal in the SASL framework;
> currently we use zero as a sentinel for "don't send a response".

Check.

> - I don't think it's legal for a client to refuse a challenge from the
> server without aborting the exchange, so we should probably check to
> make sure that client responses are non-NULL in the success case.

Hmm.  Do the RFCs tell us anything about that?
--
Michael
From 3bfae6e6f563acfe63f2ae44feb5799f4e49324b Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 5 Jul 2021 17:16:15 +0900
Subject: [PATCH v4] Generalize SASL exchange code for the backend and the
 frontend

---
 src/include/libpq/auth.h |   2 +
 src/include/libpq/sasl.h | 136 +++
 src/include/libpq/scram.h|  13 +-
 src/backend/libpq/Makefile   |   1 +
 src/backend/libpq/auth-sasl.c| 196 +++
 src/backend/libpq/auth-scram.c   |  51 ---
 src/backend/libpq/auth.c | 167 +--
 src/interfaces/libpq/fe-auth-sasl.h  | 130 ++
 src/interfaces/libpq/fe-auth-scram.c |  40 --
 src/interfaces/libpq/fe-auth.c   |  23 +++-
 src/interfaces/libpq/fe-auth.h   |  11 +-
 src/interfaces/libpq/fe-connect.c|   6 +-
 src/interfaces/libpq/libpq-int.h |   2 +
 src/tools/pgindent/typedefs.list |   2 +
 14 files changed, 558 insertions(+), 222 deletions(-)
 create mode 100644 src/include/libpq/sasl.h
 create mode 100644 src/backend/libpq/auth-sasl.c
 create mode 100644 src/interfaces/libpq/fe-auth-sasl.h

diff --git a/src/include/libpq/auth.h b/src/include/libpq/auth.h
index 3610fae3ff..3d6734f253 100644
--- a/src/include/libpq/auth.h
+++ b/src/include/libpq/auth.h
@@ -21,6 +21,8 @@ extern bool pg_krb_caseins_users;
 extern char *pg_krb_realm;
 
 extern void ClientAuthentication(Port *port);
+extern void sendAuthRequest(Port *port, AuthRequest areq, const char *extradata,
+			int extralen);
 
 /* Hook for plugins to get control in ClientAuthentication() */
 typedef void (*ClientAuthentication_hook_type) (Port *, int);
diff --git a/src/include/libpq/sasl.h b/src/include/libpq/sasl.h
new file mode 100644
index 00..f119a62d68
--- /dev/null
+++ b/src/include/libpq/sasl.h
@@ -0,0 +1,136 @@
+/*-
+ *
+ * sasl.h
+ *	  Defines the SASL mechanism interface for the backend.
+ *
+ * Each SASL mechanism defines a frontend and a backend callback structure.
+ *
+ * See src/interfaces/libpq/fe-auth-sasl.h for the frontend counterpart.
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/libpq/sasl.h
+ *
+ *-
+ */
+
+#ifndef PG_SASL_H
+#define PG_SASL_H
+
+#include "lib/stringinfo.h"
+#include "libpq/libpq-be.h"
+
+/* Status codes for message exchange */
+#define PG_SASL_EXCHANGE_CONTINUE		0
+#define PG_SASL_EXCHANGE_SUCCESS		1
+#define PG_SASL_EXCHANGE_FAILURE		2
+
+/*
+ * Backend SASL mechanism callbacks.
+ *
+ * To implement a backend mechanism, declare a pg_be_sasl_mech struct with
+ * appropriate callback implementations.  Then pass the mechanism to
+ * CheckSASLAuth() during ClientAuthentication(), once the server has decided
+ * which 

Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-07-05 Thread David Rowley
On Wed, 30 Jun 2021 at 05:11, David Christensen
 wrote:
> 1) A basic refactor of the existing code to easily handle expanding the
> units we use into a table-based format.  This also includes changing the
> return value of `pg_size_bytes()` from an int64 into a numeric, and
> minor test adjustments to reflect this.

This is not quite what I had imagined when you said about adding a
table to make it easier to add new units in the future. I expected a
single table that handles all units, not just the ones above kB and
not one for each function.

There are actually two pg_size_pretty functions, one for BIGINT and
one for NUMERIC. I see you only changed the NUMERIC version.  I'd
expect them both to have the same treatment and use the same table so
there's consistency between the two functions.

The attached is more like what I had in mind. There's a very small net
reduction in lines of code with this and it also helps keep
pg_size_pretty() and pg_size_pretty_numeric() in sync.

I don't really like the fact that I had to add the doHalfRound field
to get the same rounding behaviour as the original functions.  I'm
wondering if it would just be too clever to track how many bits
we've shifted right by in pg_size_pretty* and compare that to the
value of multishift for the current unit and do appropriate rounding
to get us to the value of multishift.  In theory, we could just keep
calling the half_rounded macro until we make it to the multishift
value.  My current thoughts are that it's unlikely that anyone would
twiddle with the size_pretty_units array in such a way for the extra
code to be worth it. Maybe someone else feels differently.

David


tidy_up_pg_size_pretty_functions.patch
Description: Binary data


Re: Increase value of OUTER_VAR

2021-07-05 Thread Andrey Lepikhov

On 2/7/21 21:23, Tom Lane wrote:

So I'm inclined to propose pushing this and seeing what happens.


+1
But why does the Index type still get used for indexing range table entries?
For example:
- we pass an int resultRelation value to create_modifytable_path() as the
Index nominalRelation value.

- exec_rt_fetch(Index) calls list_nth(int).
- generate_subquery_vars() accepts an 'Index varno' value

It looks sloppy. Do you plan to change this in the next commits?

--
regards,
Andrey Lepikhov
Postgres Professional




Re: Diagnostic comment in LogicalIncreaseXminForSlot

2021-07-05 Thread Masahiko Sawada
On Fri, May 21, 2021 at 6:00 PM Ashutosh Bapat
 wrote:
>
>
>
> On Fri, May 21, 2021 at 11:26 AM Amit Kapila  wrote:
>>
>> On Thu, May 20, 2021 at 5:43 PM Ashutosh Bapat
>>  wrote:
>> >
>> > Hi
>> > LogicalIncreaseRestartDecodingForSlot() has a debug log to report a
>> > new restart_lsn. But the corresponding function for catalog_xmin.
>> > Here's a patch to add the same.
>> >
>>
>> I think this can be useful. One minor comment:
>> + elog(DEBUG1, "got new catalog_xmin %u at %X/%X", xmin,
>> + (uint32) (current_lsn >> 32), (uint32) current_lsn);
>>
>> Isn't it better to use LSN_FORMAT_ARGS for current_lsn?
>
>
> Thanks for reminding me about that. :).
>
> Attached revised patch.
>
>>
>> Also, there
>> doesn't seem to be any urgency for adding this, so you can register it
>> for the next CF so that we can add this when the branch opens for
>> PG-15.
>
>
> It's there in CF. I am fine with PG-15. It will be good to patch the 
> back-branches to have this extra diagnostic information available.

The patch looks good to me.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




[Patch] change the return value of PQsendFlushRequest

2021-07-05 Thread zhangj...@fujitsu.com
Hi all

The return value of function PQsendFlushRequest is 1 or 0.
---

Sends a request for the server to flush its output buffer.

int PQsendFlushRequest(PGconn *conn);




Returns 1 for success.  Returns 0 on any failure.

-
But in the following code, false is returned.
I think it would be better to change it to 0.

int PQsendFlushRequest(PGconn *conn)
{
..
if (conn->asyncStatus != PGASYNC_IDLE &&
conn->pipelineStatus == PQ_PIPELINE_OFF)
{
        appendPQExpBufferStr(&conn->errorMessage,
                             libpq_gettext("another command is already in progress\n"));
return false;  ※
}
..
}
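
For context, a hedged sketch of how the function is typically called in
pipeline mode (error handling trimmed for brevity; not from the patch):

    PQenterPipelineMode(conn);
    PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
    if (PQsendFlushRequest(conn) == 0)   /* 1 on success, 0 on failure */
        fprintf(stderr, "%s", PQerrorMessage(conn));
    PQconsumeInput(conn);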

Best Regards!
Zhangjie



PQsendFlushRequest.patch
Description: PQsendFlushRequest.patch


Re: detailed error message of pg_waldump

2021-07-05 Thread Masahiko Sawada
On Wed, Jun 16, 2021 at 5:36 PM Kyotaro Horiguchi
 wrote:
>
> Thanks!
>
> At Wed, 16 Jun 2021 16:52:11 +0900, Masahiko Sawada  
> wrote in
> > On Fri, Jun 4, 2021 at 5:35 PM Kyotaro Horiguchi
> >  wrote:
> > >
> > > In a very common operation of accidentally specifying a recycled
> > > segment, pg_waldump often returns the following obscure message.
> > >
> > > $ pg_waldump 0001002D
> > > pg_waldump: fatal: could not find a valid record after 0/2D00
> > >
> > > The more detailed message is generated internally and we can use it.
> > > That looks like the following.
> > >
> > > $ pg_waldump 0001002D
> > > pg_waldump: fatal: unexpected pageaddr 0/2400 in log segment 
> > > 0001002D, offset 0
> > >
> > > Is it work doing?
> >
> > Perhaps we need both? The current message describes where the error
> > happened and the message internally generated describes the details.
> > It seems to me that both are useful. For example, if we find an error
> > during XLogReadRecord(), we show both as follows:
> >
> >if (errormsg)
> >fatal_error("error in WAL record at %X/%X: %s",
> >LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
> >errormsg);
>
> Yeah, I thought that it might be a bit verbose and lengthy but actually
> we have another place where we do that. One more point is whether we
> have a case where first_record is invalid but errormsg is NULL
> there. WALDumpReadPage immediately exits so we should always have a
> message in that case according to the comment in ReadRecord.
>
> > * We only end up here without a message when XLogPageRead()
> > * failed - in that case we already logged something. In
> > * StandbyMode that only happens if we have been triggered, so we
> > * shouldn't loop anymore in that case.
>
> So that can be an assertion.
>
> Now the messages looks like this.
>
> $ pg_waldump /home/horiguti/data/data_work/pg_wal/00020010
> pg_waldump: fatal: could not find a valid record after 0/0: unexpected 
> pageaddr 0/900 in log segment 00020010, offset 0
>

Thank you for updating the patch!

+ *
+ * The returned pointer (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogFindNextRecord or XLogReadRecord.
  */

The comment of XLogReadRecord() also has a similar description. Should
we update it as well?

BTW is this patch registered to the current commitfest? I could not find it.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2021-07-05 Thread Ronan Dunklau
Le vendredi 2 juillet 2021, 10:39:44 CEST David Rowley a écrit :
> On Fri, 2 Jul 2021 at 19:54, Ronan Dunklau  wrote:
> > I don't know if it's acceptable, but in the case where you add both an
> > aggregate with an ORDER BY clause, and another aggregate without the
> > clause, the output for the unordered one will change and use the same
> > ordering, maybe suprising the unsuspecting user. Would that be acceptable
> > ?
> 
> That's a good question. There was an argument in [1] that mentions
> that there might be a group of people who rely on aggregation being
> done in a certain order where they're not specifying an ORDER BY
> clause in the aggregate.  If that group of people exists, then it's
> possible they might be upset in the scenario that you describe.
> 
> I also think that it's going to be pretty hard to make significant
> gains in this area if we are too scared to make changes to undefined
> behaviour.  You wouldn't have to look too hard in the pgsql-general
> mailing list to find someone complaining that their query output is in
> the wrong order on some query that does not have an ORDER BY. We
> pretty much always tell those people that the order is undefined
> without an ORDER BY. I'm not too sure why Tom in [1] classes the ORDER
> BY aggregate people any differently. We'll be stuck forever here and
> in many other areas if we're too scared to change the order of
> aggregation. You could argue that something like parallelism has
> changed that for people already. I think the multi-batch Hash
> Aggregate code could also change this.

I would agree with that.

> 
> > I was curious about the performance implication of that additional
> > transition, and could not reproduce a significant difference. I may be
> > doing something wrong: how did you highlight it?
> 
> It was pretty basic. I just created a table with two columns and no
> index and did something like SELECT a,SUM(b ORDER BY b) from ab GROUP
> BY 1;  the new code will include a Sort due to lack of any index and
> the old code would have done a sort inside nodeAgg.c. I imagine that
> the overhead comes from the fact that in the patched version nodeAgg.c
> must ask its subnode (nodeSort.c) for the next tuple each time,
> whereas unpatched nodeAgg.c already has all the tuples in a tuplestore
> and can fetch them very quickly in a tight loop.

Ok, I reproduced that case, just without the GROUP BY: once the GROUP BY is 
added, a Sort node is added in both cases (master and your patch), except that 
with your patch we sort on both keys, and that doesn't really incur a 
performance penalty.

I think the overhead occurs because in the ExecAgg case we use the 
tuplesort_*_datum API as an optimization when we have a single column as 
input, which the ExecSort code doesn't. Maybe it would be worth trying to 
use that API in Sort nodes too, when it can be done.
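
To sketch the idea (my guess at the mechanics, not from any posted patch):
the setup in ExecSort() could branch on the number of sort keys, similar to
what nodeAgg.c does for single-input ordered aggregates. Only the begin call
is shown; the tuple put/get paths would need matching tuplesort_putdatum()/
tuplesort_getdatum() changes:

/* Sketch only: use the datum sort API for single-column sorts. */
if (plannode->numCols == 1)
{
	Form_pg_attribute attr = TupleDescAttr(tupDesc,
										   plannode->sortColIdx[0] - 1);

	tuplesortstate = tuplesort_begin_datum(attr->atttypid,
										   plannode->sortOperators[0],
										   plannode->collations[0],
										   plannode->nullsFirst[0],
										   work_mem, NULL,
										   node->randomAccess);
}
else
	tuplesortstate = tuplesort_begin_heap(tupDesc, plannode->numCols,
										  plannode->sortColIdx,
										  plannode->sortOperators,
										  plannode->collations,
										  plannode->nullsFirst,
										  work_mem, NULL,
										  node->randomAccess);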


> 
> David
> 
> [1] https://www.postgresql.org/message-id/6538.1522096067%40sss.pgh.pa.us


-- 
Ronan Dunklau







Re: rand48 replacement

2021-07-05 Thread Yura Sokolov

Fabien COELHO wrote on 2021-07-04 23:29:
The important property of determinism is that if I set a seed, and 
then make an identical set of calls to the random API, the results 
will be identical every time, so that it's possible to write tests 
with predictable/repeatable results.


Hmmm… I like my stronger determinism definition more than this one:-)

I would work around that by deriving another 128 bit generator 
instead of splitmix 64 bit, but that is overkill.


Not really relevant now, but I'm pretty sure that's impossible to do.
You might try it as an interesting academic exercise, but I believe
it's a logical impossibility.


Hmmm… I was simply thinking of seeding a new pg_prng_state from the
main pg_prng_state with some transformation, and then iterate over the
second PRNG, pretty much like I did with splitmix, but with 128 bits
so that your #states argument does not apply, i.e. something like:

 /* select in a range with bitmask rejection */
 uint64 pg_prng_u64_range(pg_prng_state *state, uint64 range)
 {
/* always advance state once */
uint64 next = xoroshiro128ss(state);
uint64 val;

if (range >= 2)
{
uint64 mask = mask_u64(range-1);

val = next & mask;

if (val >= range)
{
/* copy and update current prng state */
pg_prng_state iterator = *state;

iterator.s0 ^= next;
iterator.s1 += UINT64CONST(0x9E3779B97f4A7C15);

/* iterate till val in [0, range) */
while ((val = xoroshiro128ss(&iterator) & mask) >= range)
;
}
}
else
val = 0;

return val;
 }

The initial pseudo-random sequence is left to proceed, and a new PRNG
is basically forked for iterating on the mask, if needed.


I believe most "range" values are small, much smaller than UINT32_MAX.
In this case, according to [1], the fastest method is Lemire's (I'd take
the original version from [2]).

Therefore a combined pg_prng_u64_range() could branch on the range value:

uint64 pg_prng_u64_range(pg_prng_state *state, uint64 range)
{
  uint64 val = xoroshiro128ss(state);
  uint64 m;
  if ((range & (range-1)) == 0) /* handle all power-of-2 cases (and 0) */
    return range != 0 ? val & (range-1) : 0;
  if (likely(range < PG_UINT32_MAX/32))
  {
    /*
     * Daniel Lemire's unbiased range random algorithm based on
     * rejection sampling:
     * https://lemire.me/blog/2016/06/30/fast-random-shuffling/
     */
    m = (uint32)val * range;
    if ((uint32)m < range)
    {
      /* the threshold must be 2^32 mod range, so compute it in 32 bits */
      uint32 t = (-(uint32)range) % (uint32)range;
      while ((uint32)m < t)
        m = (uint32)xoroshiro128ss(state) * range;
    }
    return m >> 32;
  }
  /* Apple's mask method */
  m = mask_u64(range-1);
  val &= m;
  while (val >= range)
    val = xoroshiro128ss(state) & m;
  return val;
}
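
(By the way, mask_u64() isn't shown in either snippet; presumably it is the
usual bit-smearing helper, something like the sketch below — my assumption,
not code from the patch. It returns the smallest mask of the form 2^k - 1
that covers its argument, e.g. mask_u64(6) == 7.)

static inline uint64
mask_u64(uint64 u)
{
  /* smear the highest set bit downward into all lower positions */
  u |= u >> 1;
  u |= u >> 2;
  u |= u >> 4;
  u |= u >> 8;
  u |= u >> 16;
  u |= u >> 32;
  return u;
}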

The mask method could also be faster when range is close to the mask.
For example, a fast check for "range is within 1/1024 of the mask" is
  range < (range/512 + (range & (range*2)))

And then the method choice could look like:
  if (likely(range < UINT32_MAX/32 && range > (range/512 + (range & (range*2)))))

But I don't know whether the additional condition is worth the difference
or not.

[1] https://www.pcg-random.org/posts/bounded-rands.html
[2] https://lemire.me/blog/2016/06/30/fast-random-shuffling/

regards,
Sokolov Yura




Re: Transactions involving multiple postgres foreign servers, take 2

2021-07-05 Thread Masahiro Ikeda



On 2021/06/30 10:05, Masahiko Sawada wrote:
> On Fri, Jun 25, 2021 at 9:53 AM Masahiro Ikeda  
> wrote:
>>
>> Hi Jamison-san, sawada-san,
>>
>> Thanks for testing!
>>
>> FWIW, I tested using pgbench with the "--rate=" option to check that the
>> server can execute transactions at a stable throughput. As Sawada-san said,
>> the latest patch resolves the second phase of 2PC asynchronously, so
>> it's difficult to control the throughput without the "--rate=" option.
>>
>> I also worried about what to do when the error happens, because increasing
>> "max_prepared_foreign_transactions" doesn't help. Since overload can also
>> trigger the error, is it better to add that case to the HINT message?
>>
>> BTW, if Sawada-san has already developed a way to run the resolver processes
>> in parallel, why not measure the performance improvement? Although
>> Robert-san, Tsunakawa-san and others are discussing which architecture is
>> best, one discussion point is that there is a performance risk in adopting
>> the asynchronous approach. If we have promising solutions, I think we can
>> move the discussion forward.
> 
> Yeah, if we can asynchronously resolve the distributed transactions
> without worrying about max_prepared_foreign_transaction error, it
> would be good. But we will need synchronous resolution at some point.
> I think we at least need to discuss it at this point.
> 
> I've attached the new version patch that incorporates the comments
> from Fujii-san and Ikeda-san I got so far. We launch a resolver
> process per foreign server, committing prepared foreign transactions
> on foreign servers in parallel. To get a better performance based on
> the current architecture, we can have multiple resolver processes per
> foreign server but it seems not easy to tune it in practice. Perhaps
> is it better if we simply have a pool of resolver processes and we
> assign a resolver process to the resolution of one distributed
> transaction one by one? That way, we need to launch resolver processes
> as many as the concurrent backends using 2PC.

Thanks for updating the patches.

I have tested on my local laptop and the summary is the following.

(1) The latest patch (v37) improves throughput by 1.5 times compared to v36.

I expected it to improve by 2.0 times because the workload is such that one
transaction accesses two remote servers... I think the reason it doesn't is
that the disk is the bottleneck and I couldn't prepare a separate disk for
each PostgreSQL server. If I could, I think the performance could improve by
2.0 times.


(2) With the latest patch (v37), the throughput with foreign_twophase_commit =
required is about 36% of the case with foreign_twophase_commit = disabled.

Although the throughput is improved, the absolute performance is not good;
that may be the fate of 2PC. I think the reason is that the number of WAL
writes increases considerably, and the disk writes on my laptop are the
bottleneck. I want to know the results of testing in richer environments if
someone can do so.


(3) The latest patch (v37) has no overhead if foreign_twophase_commit =
disabled. On the contrary, performance improved by 3%, though that may be
within the margin of error.



The test details are the following.

# condition

* 1 coordinator and 3 foreign servers

* 4 instances share one SSD disk.

* one transaction queries two different foreign servers.

``` fxact_update.pgbench
\set id random(1, 100)

\set partnum 3
\set p1 random(1, :partnum)
\set p2 ((:p1 + 1) % :partnum) + 1

BEGIN;
UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
COMMIT;
```

* pgbench generates the load. I increased ${RATE} little by little until the
"maximum number of foreign transactions reached" error happened.

```
pgbench -f fxact_update.pgbench -R ${RATE} -c 8 -j 8 -T 180
```

* parameters
max_prepared_transactions = 100
max_prepared_foreign_transactions = 200
max_foreign_transaction_resolvers = 4


# test source code patterns

1. 2pc patches(v36) based on 6d0eb385 (foreign_twophase_commit = required).
2. 2pc patches(v37) based on 2595e039 (foreign_twophase_commit = required).
3. 2pc patches(v37) based on 2595e039 (foreign_twophase_commit = disabled).
4. 2595e039 without 2pc patches(v37).


# results

1. tps = 241.8000
   latency average = 10.413ms

2. tps = 359.017519 (1.5x compared to 1.; about 36% of 3.)
   latency average = 15.427ms

3. tps = 987.372220 (1.03x compared to 4.)
   latency average = 8.102ms

4. tps = 955.984574
   latency average = 8.368ms

The disk is the bottleneck in my environment because disk utilization is
almost 100% in every pattern. If a separate disk for each instance could be
prepared, I think we could expect more performance improvement.


>> In my understanding, there are three improvement ideas. The first is to make
>> the resolver processes run in parallel. The second is to send "COMMIT/ABORT
>> PREPARED" to remote servers in bulk. The third is to stop syncing the WAL
>> in remove_fdwxact() after resolving is done, 

Re: Yet another fast GiST build

2021-07-05 Thread Emre Hasegeli
I tried reviewing the remaining patches.  They seem to work correctly
and pass the tests on my laptop.

> In this pattern I flipped PointerGetDatum(a) to PointerGetDatum(ra.lower),
> because it seems correct to me. I've followed a rule of thumb: every sort
> function must extract and use "lower" somehow. Though I'm a bit suspicious
> of numeric. Is it a regular varlena?

As far as I understand, we cannot use the sortsupport functions from
the btree operator classes because the btree_gist extension handles
things differently.  This is unfortunate and a source of bugs [1], but
we cannot do anything about it.

Given that the lower and upper datums must be the same for the leaf
nodes, it makes sense to me to compare one of them.

Using numeric_cmp() for numeric is in line with using bttextcmp() for text.

> +   /*
> +* Numeric has abbreviation routines in numeric.c, but we don't try to use
> +* them here. Maybe later.
> +*/

This is also true for text.  Perhaps we should also add a comment there.

> PFA patchset with v6 intact + two fixes of discovered issues.

> +   /* Use byteacmp(), like gbt_bitcmp() does */

We can improve this comment by incorporating Heikki's previous email:

> Ok, I think I understand that now. In btree_gist, the *_cmp() function
> operates on non-leaf values, and *_lt(), *_gt() et al operate on leaf
> values. For all other datatypes, the leaf and non-leaf representation is
> the same, but for bit/varbit, the non-leaf representation is different.
> The leaf representation is VarBit, and non-leaf is just the bits without
> the 'bit_len' field. That's why it is indeed correct for gbt_bitcmp() to
> just use byteacmp(), whereas gbt_bitlt() et al compares the 'bit_len'
> field separately. That's subtle, and 100% uncommented.
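
For instance, the expanded comment could read roughly like this (my wording,
just condensing the paragraph above):

/*
 * In btree_gist, the *_cmp() function operates on non-leaf values, while
 * *_lt(), *_gt() et al operate on leaf values.  For most datatypes the two
 * representations are the same, but for bit/varbit they differ: the leaf
 * representation is a VarBit, and the non-leaf one is just the bits without
 * the 'bit_len' field.  That is why gbt_bitcmp() can simply use byteacmp(),
 * whereas gbt_bitlt() et al must compare 'bit_len' separately.
 */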

I think patch number 3 should be squashed into patch number 1.

I couldn't understand patch number 2 "Remove DEBUG1 verification".  It
seems like something rather useful.

[1] 
https://www.postgresql.org/message-id/flat/201010112055.o9BKtZf7011251%40wwwmaster.postgresql.org




Re: "debug_invalidate_system_caches_always" is too long

2021-07-05 Thread Bharath Rupireddy
On Mon, Jul 5, 2021 at 1:57 AM Tom Lane  wrote:
>
> As I've been poking around in this area, I find myself growing
> increasingly annoyed at the new GUC name
> "debug_invalidate_system_caches_always".  It is too d*mn long.
> It's a serious pain to type in any context where you don't have
> autocomplete to help you.  I've kept referring to this type of
> testing as CLOBBER_CACHE_ALWAYS testing, even though that name is
> now obsolete, just because it's so much shorter.  I think we need
> to reconsider this name while we still can.
>
> I do agree with the "debug_" prefix given that it's now visible to
> users.  However, it doesn't seem that hard to save some space in
> the rest of the name.  The word "system" is adding nothing of value,
> and the word "always" seems rather confusing --- if it does
> something "always", why is there more than one level?  So a simple
> proposal is to rename it to "debug_invalidate_caches".
>
> However, I think we should also give serious consideration to
> "debug_clobber_cache" or "debug_clobber_cache_always" for continuity
> with past practice (though it still feels like "always" is a good
> word to lose now).  "debug_clobber_caches" is another reasonable
> variant.
>
> Thoughts?

+1. IMO, debug_clobber_caches is better because it is simple.  And
also, since the invalidation happens on multiple system caches,
debug_clobber_caches is preferable to debug_clobber_cache.

Regards,
Bharath Rupireddy.