Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-07 Thread Masahiko Sawada
On Wed, Jul 7, 2021 at 11:25 PM Matthias van de Meent
 wrote:
>
> On Wed, 7 Jul 2021 at 13:47, Masahiko Sawada  wrote:
> >
> > Hi all,
> >
> > Index vacuuming is one of the most time-consuming processes in lazy
> > vacuuming, and lazy_tid_reaped() is a large part of it. The attached
> > flame graph shows a profile of a vacuum on a table that has one index
> > and 80 million live rows and 20 million dead rows, where
> > lazy_tid_reaped() accounts for about 47% of the total vacuum execution
> > time.
> >
> > [...]
> >
> > Overall, 'rtbm' has a much better lookup performance and good memory
> > usage especially if there are relatively many dead tuples. However, in
> > some cases, 'intset' and 'array' have a better memory usage.
>
> Those are some great results, with a good path to meaningful improvements.
>
> > Feedback is very welcome. Thank you for reading the email through to the 
> > end.
>
> The currently available infrastructure for TIDs is quite ill-defined for
> TableAM authors [0], and other TableAMs might want to use more than
> just the 11 bits in use by max-BLCKSZ HeapAM MaxHeapTuplesPerPage to
> identify tuples. (MaxHeapTuplesPerPage is 1169 at the maximum 32k
> BLCKSZ, which requires 11 bits to fit).
>
> Could you also check what the (performance, memory) impact would be if
> these proposed structures were to support the maximum
> MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset
> numbers that could be supported by our current TID struct?

I think tbm will be the most affected by the memory impact of the
larger maximum MaxHeapTuplesPerPage. For example, with 32kB blocks
(MaxHeapTuplesPerPage = 1169), even if there is only one dead tuple in
a block, it will always require at least 147 bytes per block.

Rtbm chooses the container type (array, bitmap, or run) depending on
the number and distribution of dead tuples in a block, and only bitmap
containers can be searched in O(1). Since run containers depend on the
distribution of dead tuples within a block, let's compare just the
array and bitmap containers.

With 8kB blocks (MaxHeapTuplesPerPage = 291), a bitmap container needs
at most 36 bytes. In other words, when compared to an array container,
a bitmap will be chosen if there are more than 18 dead tuples in a
block. With 32kB blocks (MaxHeapTuplesPerPage = 1169), a bitmap
container needs at most 147 bytes, so a bitmap container will be
chosen if there are more than 74 dead tuples in a block. And with the
full uint16 range (MaxHeapTuplesPerPage = 65535), at most 8192 bytes
are needed, so a bitmap container will be chosen if there are more
than 4096 dead tuples in a block. Therefore, in any case, if more than
about 6% of the tuples in a block are garbage, a bitmap container will
be chosen, bringing faster lookup performance. (Of course, if a run
container is chosen, the container size gets smaller but the lookup
performance is O(log N).) But if the number of dead tuples in the
table is small and MaxHeapTuplesPerPage is large, an array container
is likely to be chosen and the lookup performance becomes O(log N).
Still, that should be faster than the plain array data structure
because the range of search targets within an array container is much
smaller.
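
To make the size trade-off concrete, here is a minimal standalone sketch
of the selection rule described above (the names and the threshold
computation are illustrative only, not the actual rtbm code):

#include <stdint.h>

typedef enum
{
    CONTAINER_ARRAY,            /* sorted uint16 offsets, 2 bytes per dead tuple */
    CONTAINER_BITMAP            /* one bit per possible offset in the block */
} ContainerType;

static ContainerType
choose_container(int ndead, int max_offsets_per_block)
{
    /* full bitmap size for this block size, in bytes */
    int         bitmap_bytes = (max_offsets_per_block + 7) / 8;
    /* array container size grows with the number of dead tuples */
    int         array_bytes = ndead * (int) sizeof(uint16_t);

    /* pick whichever representation is smaller */
    return (array_bytes > bitmap_bytes) ? CONTAINER_BITMAP : CONTAINER_ARRAY;
}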

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: bugfix: when the blocksize is 32k, the function page_header of pageinspect returns negative numbers.

2021-07-07 Thread Michael Paquier
On Thu, Jul 08, 2021 at 12:49:25AM -0400, Tom Lane wrote:
> Since we're already past beta2, I'm not sure that's a good idea.  We
> can't really treat pageinspect 1.9 as something that the world has
> never seen.

Yeah, that's why I would object to new changes in 1.9 and
REL_14_STABLE.  So my take would be to just have 1.10, and do those
changes on HEAD.
--
Michael




Re: bugfix: when the blocksize is 32k, the function page_header of pageinspect returns negative numbers.

2021-07-07 Thread Tom Lane
Justin Pryzby  writes:
> On Thu, Jul 08, 2021 at 01:15:12PM +0900, Michael Paquier wrote:
>> Changes in the object set of an extension require a new SQL script
>> that changes the objects to reflect the change.  So, in this case,
>> what you need to do is to create pageinspect--1.9--1.10.sql, assuming
>> that the new extension version is 1.10 and change page_header()
>> accordingly.

> I think it's common (and preferred) for changes in extension versions to be
> "folded" together within a major release of postgres.  Since v1.9 is new in
> pg14 (commit 756ab2912), this change can be included in the same, new version.

Since we're already past beta2, I'm not sure that's a good idea.  We
can't really treat pageinspect 1.9 as something that the world has
never seen.

regards, tom lane




Re: list of extended statistics on psql (\dX)

2021-07-07 Thread Tatsuro Yamada

Hi Tomas and Justin,

On 2021/06/07 4:47, Tomas Vondra wrote:

Here's a slightly more complete patch, tweaking the regression tests a
bit to detect this.



I tested your patch on PG14beta2 and PG15devel.
And they work fine.
===
 All 209 tests passed.
===

Next time I create a feature on psql, I will be careful to add
a check for schema visibility rules. :-D

Thanks,
Tatsuro Yamada







Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-07-07 Thread David Rowley
On Thu, 8 Jul 2021 at 13:31, David Rowley  wrote:
> It feels like if we're going to fix this negative rounding thing then
> we should maybe do it and backpatch a fix then rebase this work on top
> of that.

Here's a patch which I believe makes pg_size_pretty() and
pg_size_pretty_numeric() match in regards to negative values.

Maybe this plus your regression test would be ok to back-patch?

David


adjust_pg_size_pretty_rounding_for_negative_values.patch
Description: Binary data


Re: bugfix: when the blocksize is 32k, the function page_header of pageinspect returns negative numbers.

2021-07-07 Thread Justin Pryzby
On Thu, Jul 08, 2021 at 01:15:12PM +0900, Michael Paquier wrote:
> On Thu, Jul 08, 2021 at 09:12:26AM +0530, Bharath Rupireddy wrote:
> > +1. int32 makes sense because the maximum allowed block size is 32768
> > and smallint with range -32768 to +32767 can't hold it. Internally,
> > lower, upper, special are treated as uint16. I looked at the patch,
> > how about using "int4" instead of just "int", just for readability?
> > And, do we need to change in pageinspect--1.1--1.2.sql and
> > pageinspect--1.0--1.1.sql along with pageinspect--1.5.sql?
> 
> Changes in the object set of an extension require a new SQL script
> that changes the objects to reflect the change.  So, in this case,
> what you need to do is to create pageinspect--1.9--1.10.sql, assuming
> that the new extension version is 1.10 and change page_header()
> accordingly.

I think it's common (and preferred) for changes in extension versions to be
"folded" together within a major release of postgres.  Since v1.9 is new in
pg14 (commit 756ab2912), this change can be included in the same, new version.

> You also need to be careful about compatibility with past versions of
> this extension, as the code you are changing to use int8 could be used
> at runtime by older versions of pageinspect where int4 is used.  I
> would suggest to test that with some new extra tests in
> oldextversions.sql.

I think you can refer to this prior commit for guidance.

commit f18aa1b203930ed28cfe42e82d3418ae6277576d
Author: Peter Eisentraut 
Date:   Tue Jan 19 10:28:05 2021 +0100

pageinspect: Change block number arguments to bigint

-- 
Justin




Re: bugfix: when the blocksize is 32k, the function page_header of pageinspect returns negative numbers.

2021-07-07 Thread Michael Paquier
On Thu, Jul 08, 2021 at 09:12:26AM +0530, Bharath Rupireddy wrote:
> +1. int32 makes sense because the maximum allowed block size is 32768
> and smallint with range -32768 to +32767 can't hold it. Internally,
> lower, upper, special are treated as uint16. I looked at the patch,
> how about using "int4" instead of just "int", just for readability?
> And, do we need to change in pageinspect--1.1--1.2.sql and
> pageinspect--1.0--1.1.sql along with pageinspect--1.5.sql?

Changes in the object set of an extension require a new SQL script
that changes the objects to reflect the change.  So, in this case,
what you need to do is to create pageinspect--1.9--1.10.sql, assuming
that the new extension version is 1.10 and change page_header()
accordingly.

You also need to be careful about compatibility with past versions of
this extension, as the code you are changing to use int8 could be used
at runtime by older versions of pageinspect where int4 is used.  I
would suggest to test that with some new extra tests in
oldextversions.sql.

The patch, and your suggestions, are incorrect on those aspects.
--
Michael




Re: More time spending with "delete pending"

2021-07-07 Thread Alexander Lakhin
Hello Michael,

06.07.2021 11:33, Michael Paquier wrote:
> On Sun, Mar 14, 2021 at 06:00:00PM +0300, Alexander Lakhin wrote:
>> I believe that the patch attached to [1] should fix this issue. The
>> patch still applies to master and makes the demotest (attached to [2])
>> pass. Also I've prepared a trivial patch that makes pgwin32_open() use
>> the original stat() function (as in the proposed change for _pgstat64()).
> Hmm.  Knowing that _pgfstat64() has some special handling related to
> files pending for deletion, do we really need that on HEAD?
Yes, this fix is needed for HEAD (b9734c13) as the simple test attached
to [1] shows:
(You can put that script in src\test\modules\test_misc\t and perform
`vcregress taptest src\test\modules\test_misc`.)
t/001_delete_pending.pl . # Looks like your test exited with 2
before it could output anything.

and
src\test\modules\test_misc\tmp_check\log\regress_log_001_delete_pending
contains:
# Postmaster PID for node "node" is 1616
error running SQL: 'psql::1: ERROR:  could not stat file
"pg_wal/dummy": Permission denied'
while running 'psql -XAtq -d port=64889 host=127.0.0.1 dbname='postgres'
-f - -v ON_ERROR_STOP=1' with sql 'select count(*) > 0 as ok from
pg_ls_waldir();' at C:/src/postgresql/src/test/perl/PostgresNode.pm line
1771.

As Tom Lane noted above, the code added with bed90759f is dubious
(_NtQueryInformationFile() can not be used to handle the "delete
pending" state as CreateFile() returns INVALID_HANDLE_VALUE in this case.)
Probably that change should be reverted. Should I do it along with the
proposed fix?

[1]
https://www.postgresql.org/message-id/c3427edf-d7c0-ff57-90f6-b5de3bb62709%40gmail.com

Best regards,
Alexander




RE: Added schema level support for publication.

2021-07-07 Thread houzj.f...@fujitsu.com
On Wednesday, June 30, 2021 7:43 PM vignesh C  wrote:
> Thanks for reporting this issue; the attached v9 patch fixes it. This
> also fixes the other issue you reported at [1].

Hi,

I had a look at the patch; please consider the following comments.

(1)
-   if (pub->alltables)
+   if (pub->pubtype == PUBTYPE_ALLTABLES)
    {
        publish = true;
        if (pub->pubviaroot && am_partition)
            publish_as_relid = llast_oid(get_partition_ancestors(relid));
    }

+   if (pub->pubtype == PUBTYPE_SCHEMA)
+   {
+       Oid         schemaId = get_rel_namespace(relid);
+       List       *pubschemas = GetPublicationSchemas(pub->oid);
+
+       if (list_member_oid(pubschemas, schemaId))
+       {

It might be better to use "else if" for the second check here,
like: else if (pub->pubtype == PUBTYPE_SCHEMA)

Besides, we already have the {schemaoid, pubid} pair here, so it might
be better to scan the PUBLICATIONSCHEMAMAP cache instead of invoking
GetPublicationSchemas(), which scans the whole table.


(2)
+   /* Identify which schemas should be dropped. */
+   foreach(oldlc, oldschemaids)
+   {
+       Oid         oldschemaid = lfirst_oid(oldlc);
+       ListCell   *newlc;
+       bool        found = false;
+
+       foreach(newlc, schemaoidlist)
+       {
+           Oid         newschemaid = lfirst_oid(newlc);
+
+           if (newschemaid == oldschemaid)
+           {
+               found = true;
+               break;
+           }
+       }

It seems we can use " if (list_member_oid(schemaoidlist, oldschemaid)) "
to replace the second foreach loop.

(3)
There are some test case changes in the 0001 patch; it might be better to move
them to the 0002 patch.


(4)
+   case OBJECT_PUBLICATION_SCHEMA:
+       objnode = (Node *) list_make2(linitial(name), linitial(args));
+       break;
    case OBJECT_USER_MAPPING:
        objnode = (Node *) list_make2(linitial(name), linitial(args));
        break;

Does it look better to merge these two switch cases?
Like:
case OBJECT_PUBLICATION_SCHEMA:
case OBJECT_USER_MAPPING:
objnode = (Node *) list_make2(linitial(name), linitial(args));
break;

best regards,
houzj


Re: bugfix: when the blocksize is 32k, the function page_header of pageinspect returns negative numbers.

2021-07-07 Thread Bharath Rupireddy
On Thu, Jul 8, 2021 at 6:26 AM Quan Zongliang  wrote:
>
> If the block size is 32k, the function page_header of the pageinspect
> module returns negative numbers:
>
> postgres=# select * from page_header(get_raw_page('t1',0));
>      lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> -----------+----------+-------+-------+-------+---------+----------+---------+-----------
>  0/174CF58 |        0 |     0 |    28 | 32736 |  -32768 |   -32768 |       4 |         0
> (1 row)
>
>
> This patch changes the output parameters lower, upper, special and
> pagesize to int32.
>
> postgres=# select * from page_header(get_raw_page('t1',0));
>      lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> -----------+----------+-------+-------+-------+---------+----------+---------+-----------
>  0/19EA640 |        0 |     0 |    28 | 32736 |   32768 |    32768 |       4 |         0
> (1 row)

+1. int32 makes sense because the maximum allowed block size is 32768
and smallint with range -32768 to +32767 can't hold it. Internally,
lower, upper, special are treated as uint16. I looked at the patch,
how about using "int4" instead of just "int", just for readability?
And, do we need to change in pageinspect--1.1--1.2.sql and
pageinspect--1.0--1.1.sql along with pageinspect--1.5.sql?

Regards,
Bharath Rupireddy.




RE: Transactions involving multiple postgres foreign servers, take 2

2021-07-07 Thread k.jami...@fujitsu.com
On Wed, June 30, 2021 10:06 (GMT+9), Masahiko Sawada wrote:
> I've attached the new version patch that incorporates the comments from
> Fujii-san and Ikeda-san I got so far. We launch a resolver process per foreign
> server, committing prepared foreign transactions on foreign servers in 
> parallel.

Hi Sawada-san,
Thank you for the latest set of patches.
I've noticed from cfbot that the regression test failed, and I also could not 
compile it.

== running regression test queries==
test test_fdwxact ... FAILED   21 ms
== shutting down postmaster   ==
==
 1 of 1 tests failed. 
==

> To get a better performance based on the current architecture, we can have
> multiple resolver processes per foreign server but it seems not easy to tune 
> it
> in practice. Perhaps is it better if we simply have a pool of resolver 
> processes
> and we assign a resolver process to the resolution of one distributed
> transaction one by one? That way, we need to launch resolver processes as
> many as the concurrent backends using 2PC.

Yes, finding the right values to tune for max_foreign_prepared_transactions and
max_prepared_transactions seems difficult. If we set the number of resolver
processes to the number of concurrent backends using 2PC, how do we determine
the value of max_foreign_transaction_resolvers? It might be good to gather some
statistics to judge the value; then we can compare the performance against the
V37 version.

-
Also, this is a bit of a side topic. I know we've been discussing how to
improve/fix the resolver process bottlenecks, and Takahashi-san provided the
details in the thread above about where V37 has problems. (I am joining the
testing too.)

I am not sure if this has been brought up before, given how long this thread
has run over the years. But I think we need to consider preventing the
resolver process from looping indefinitely while trying to resolve a prepared
foreign transaction. Currently, when a crashed foreign server is recovered
during resolution retries, the information is recovered from WAL and files,
and the resolver process resumes the foreign transaction resolution. However,
what if we cannot (or intentionally do not want to) recover the crashed server
for a long time?

An idea is to make the resolver process stop automatically after some maximum
number of retries.
We could call the parameter foreign_transaction_resolution_max_retry_count.
There may be a better name, but I followed the naming pattern from your patch.

The server downtime can be estimated considering the proposed parameter
foreign_transaction_resolution_retry_interval (default 10s) from the
patch set.
In addition, according to docs, "a foreign server using the postgres_fdw
foreign data wrapper can have the same options that libpq accepts in
connection strings", so the connect_timeout set during CREATE SERVER can
also affect it.

Example:
CREATE SERVER's connect_timeout setting = 5s
foreign_transaction_resolution_retry_interval = 10s
foreign_transaction_resolution_max_retry_count = 3

Estimated total time before resolver stops: 
= (5s) * (3 + 1) + (10s) * (3) = 50 s

00s: 1st connect start
05s: 1st connect timeout
(retry interval)
15s: 2nd connect start (1st retry)
20s: 2nd connect timeout
(retry interval)
30s: 3rd connect start (2nd retry)
35s: 3rd connect timeout
(retry interval)
45s: 4th connect start (3rd retry)
50s: 4th connect timeout
(resolver process stops)

Then the resolver process will not wait indefinitely and will stop after
some time depending on the setting of the above parameters.
This could be the automatic implementation of pg_stop_foreign_xact_resolver.
Assuming that resolver is stopped, then the crashed server is
decided to be restored, the user can then execute pg_resolve_foreign_xact().
Do you think the idea is feasible and we can add it as part of the patch sets?
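
To illustrate, here is a minimal standalone sketch of the retry cap I have in
mind (the helper try_resolve_foreign_xact() below is just a stub standing in
for one resolution attempt, including its connect_timeout; it is not the
actual resolver code):

#include <stdbool.h>
#include <unistd.h>

/* Illustrative values matching the example above. */
static const int retry_interval_secs = 10;  /* foreign_transaction_resolution_retry_interval */
static const int max_retry_count = 3;       /* proposed foreign_transaction_resolution_max_retry_count */

/* Stub standing in for one resolution attempt against the foreign server. */
static bool
try_resolve_foreign_xact(void)
{
    return false;                           /* pretend the server stays unreachable */
}

int
main(void)
{
    /* initial attempt plus max_retry_count retries, then give up */
    for (int attempt = 0; attempt <= max_retry_count; attempt++)
    {
        if (try_resolve_foreign_xact())
            return 0;                       /* resolved */
        if (attempt < max_retry_count)
            sleep(retry_interval_secs);     /* wait before the next attempt */
    }

    /* resolver stops; the user can later run pg_resolve_foreign_xact() manually */
    return 1;
}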

Regards,
Kirk Jamison


Re: Column Filtering in Logical Replication

2021-07-07 Thread Peter Smith
Hi, I was wondering: if/when a subset of columns is specified, does
that mean it will be possible for the table to be replicated to a
*smaller* table on the subscriber side?

e.g. Can a table with 7 columns be replicated to a table with 2 columns?

table tab1(a,b,c,d,e,f,g) --> CREATE PUBLICATION pub1 FOR TABLE
tab1(a,b)  --> table tab1(a,b)

~~

I thought maybe that should be possible, but the expected behaviour
for that scenario was not very clear to me from the thread/patch
comments. And the new TAP test uses the tab1 table created exactly the
same for pub/sub, so I couldn't tell from the test code either.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Refactor "mutually exclusive options" error reporting code in parse_subscription_options

2021-07-07 Thread Bharath Rupireddy
On Thu, Jul 8, 2021 at 6:24 AM Peter Smith  wrote:
> OK. I created a new thread called "parse_subscription_options -
> suggested improvements" here [1]

Thanks. I closed the CF entry for this thread.

Regards,
Bharath Rupireddy.




Re: Why ALTER SUBSCRIPTION ... SET (slot_name='none') requires subscription disabled?

2021-07-07 Thread Japin Li


On Thu, 08 Jul 2021 at 01:32, Ranier Vilela  wrote:
>> OTOH, when I execute ALTER SUBSCRIPTION ... SET (slot_name=''), it doesn't
>> complain. However, SELECT pg_create_logical_replication_slot('', 'pgoutput')
>> complains that the slot name is too short. Although the slot will be created
>> at the publisher, which validates the slot name, IMO we can also validate the
>> slot_name in parse_subscription_options() to get the error early.
>> Attached fixes it. Any thoughts?
> I think that this fix is better placed after the check for the name being
> equal to "none", since most of the time it will be "none".
> While at it, reducing the strlen overhead in ReplicationSlotValidateName
> might be worth it.
>
> For convenience, I have attached a new version.
>

Thanks for your review! LGTM.

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-07-07 Thread David Rowley
On Thu, 8 Jul 2021 at 01:32, Dean Rasheed  wrote:
> Hmm, this looked easy, but...
>
> It occurred to me that there ought to be regression tests for the edge
> cases where it steps from one unit to the next. So, in the style of
> the existing regression tests, I tried the following:
>
> SELECT size, pg_size_pretty(size), pg_size_pretty(-1 * size) FROM
> (VALUES (10239::bigint), (10240::bigint),
> (10485247::bigint), (10485248::bigint),
> (10736893951::bigint), (10736893952::bigint),
> (10994579406847::bigint), (10994579406848::bigint),
> (11258449312612351::bigint), (11258449312612352::bigint)) x(size);


>  11258449312612352 | 10240 TB   | -10239 TB

Hmm, yeah, I noticed that pg_size_pretty(bigint) vs
pg_size_pretty(numeric) didn't always match when I was testing this
patch, but I was more focused on having my results matching the
unpatched version than I was with making sure bigint and numeric
matched.

I imagine this must date back to 8a1fab36ab.  Do you feel like it's
something this patch should fix?  I was mostly hoping to keep this
patch about rewriting the code to both make it easier to add new units
and also to make it easier to keep all 3 functions in sync.

It feels like if we're going to fix this negative rounding thing then
we should maybe do it and backpatch a fix then rebase this work on top
of that.

What are your thoughts?

David




Re: row filtering for logical replication

2021-07-07 Thread Greg Nancarrow
On Thu, Jul 8, 2021 at 10:34 AM Euler Taveira  wrote:
>
> Greg, I like your suggestion and have already integrated it (I replaced
> ExecAllocTableSlot() with MakeSingleTupleTableSlot() because we don't need the
> List).

Yes I agree, I found the same thing, it's not needed.

> I'm still working on a new version to integrate all the suggestions that you
> and Peter made. I have code similar to Peter's plan cache and I'm working on
> merging both ideas together. I'm done for today but I'll continue tomorrow.
>

I also realised that my 0005 patch wasn't handling RelationSyncEntry
invalidation, so I've updated it.
For completeness, I'm posting the complete patch set with the updates,
so you can look at it and compare with yours, and also it'll keep the
cfbot happy until you post your updated patch.

Regards,
Greg Nancarrow
Fujitsu Australia


v17-0001-Row-filter-for-logical-replication.patch
Description: Binary data


v17-0002-PS-tmp-describe-intermediate-test-steps.patch
Description: Binary data


v17-0003-PS-tmp-add-more-comments-for-expected-results.patch
Description: Binary data


v17-0004-PS-POC-Implement-a-plan-cache-for-pgoutput.patch
Description: Binary data


v17-0005-Improve-row-filtering-performance.patch
Description: Binary data


bugfix: when the blocksize is 32k, the function page_header of pageinspect returns negative numbers.

2021-07-07 Thread Quan Zongliang


If the block size is 32k, the function page_header of the pageinspect 
module returns negative numbers:


postgres=# select * from page_header(get_raw_page('t1',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/174CF58 |        0 |     0 |    28 | 32736 |  -32768 |   -32768 |       4 |         0
(1 row)


This patch changes the output parameters lower, upper, special and 
pagesize to int32.


postgres=# select * from page_header(get_raw_page('t1',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/19EA640 |        0 |     0 |    28 | 32736 |   32768 |    32768 |       4 |         0
(1 row)
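
For reference, a tiny standalone illustration of the wraparound (not part of
the patch; the cast below only mimics what happens when a 32768-byte value is
forced into a signed 16-bit output column, on the usual two's-complement
platforms):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint16_t pagesize = 32768;      /* e.g. PageGetPageSize() with 32k blocks */

    printf("as int16: %d\n", (int16_t) pagesize);   /* prints -32768 */
    printf("as int32: %d\n", (int32_t) pagesize);   /* prints 32768 */
    return 0;
}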


--
Quan Zongliang
diff --git a/contrib/pageinspect/pageinspect--1.5.sql 
b/contrib/pageinspect/pageinspect--1.5.sql
index 1e40c3c97e..f71eff19c5 100644
--- a/contrib/pageinspect/pageinspect--1.5.sql
+++ b/contrib/pageinspect/pageinspect--1.5.sql
@@ -23,10 +23,10 @@ CREATE FUNCTION page_header(IN page bytea,
 OUT lsn pg_lsn,
 OUT checksum smallint,
 OUT flags smallint,
-OUT lower smallint,
-OUT upper smallint,
-OUT special smallint,
-OUT pagesize smallint,
+OUT lower int,
+OUT upper int,
+OUT special int,
+OUT pagesize int,
 OUT version smallint,
 OUT prune_xid xid)
 AS 'MODULE_PATHNAME', 'page_header'
diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c
index 7272b21016..af04c1a688 100644
--- a/contrib/pageinspect/rawpage.c
+++ b/contrib/pageinspect/rawpage.c
@@ -316,10 +316,10 @@ page_header(PG_FUNCTION_ARGS)
values[0] = LSNGetDatum(lsn);
values[1] = UInt16GetDatum(page->pd_checksum);
values[2] = UInt16GetDatum(page->pd_flags);
-   values[3] = UInt16GetDatum(page->pd_lower);
-   values[4] = UInt16GetDatum(page->pd_upper);
-   values[5] = UInt16GetDatum(page->pd_special);
-   values[6] = UInt16GetDatum(PageGetPageSize(page));
+   values[3] = Int32GetDatum(page->pd_lower);
+   values[4] = Int32GetDatum(page->pd_upper);
+   values[5] = Int32GetDatum(page->pd_special);
+   values[6] = Int32GetDatum(PageGetPageSize(page));
values[7] = UInt16GetDatum(PageGetPageLayoutVersion(page));
values[8] = TransactionIdGetDatum(page->pd_prune_xid);
 


Re: Refactor "mutually exclusive options" error reporting code in parse_subscription_options

2021-07-07 Thread Peter Smith
On Wed, Jul 7, 2021 at 1:35 PM Bharath Rupireddy
 wrote:
>
> On Wed, Jul 7, 2021 at 5:33 AM Peter Smith  wrote:
> > PSA my patch which includes all the fixes mentioned above.
>
> I agree with Amit to start a separate thread to discuss these points.
> IMO, we can close this thread. What do you think?
>

OK. I created a new thread called "parse_subscription_options -
suggested improvements" here [1]

--
[1] 
https://www.postgresql.org/message-id/CAHut%2BPtXHfLgLHDDJ8ZN5f5Be_37mJoxpEsRg8LNmm4XCr06Rw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.




parse_subscription_options - suggested improvements

2021-07-07 Thread Peter Smith
This is a re-posting of my original mail from here [1]
Created this new discussion thread for it as suggested here [2]

[1] 
https://www.postgresql.org/message-id/CAHut%2BPtT0--Tf%3DK_YOmoyB3RtakUOYKeCs76aaOqO2Y%2BJt38kQ%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAA4eK1L0GT5-RJrya4q9ve%3DVi8hQM_SeHhJekmWMnOqsCh3KbQ%40mail.gmail.com

===

On Tue, Jul 6, 2021 at 6:21 PM Amit Kapila
 wrote:
>
> On Fri, Jul 2, 2021 at 12:36 PM Amit Kapila 
>  wrote:
> >
> > On Fri, Jul 2, 2021 at 8:35 AM Alvaro Herrera 
> >  wrote:
> > >
> > > > The latest patch sent by Bharath looks good to me. Would you like to
> > > > commit it or shall I take care of it?
> > >
> > > Please, go ahead.
> > >
> >
> > Okay, I'll push it early next week (by Tuesday) unless there are more
> > comments or suggestions. Thanks!
> >
>
> Pushed!

Yesterday, I needed to refactor a lot of code due to this push [1].

The refactoring exercise caused me to study these v11 changes much more deeply.

IMO there are a few improvements that should be made:

//

1. Zap 'opts' up-front

+ *
+ * Caller is expected to have cleared 'opts'.

This comment is putting the onus on the caller to "do the right thing".

I think that hopeful expectations about input should be removed - the
function should simply be responsible for zapping the SubOpts struct
up-front itself. That makes the code more readable, and it removes any
potential risk of garbage being unintentionally passed in that struct.

/* Start out with cleared opts. */
memset(opts, 0, sizeof(SubOpts));

Alternatively, at least there should be an assertion for some sanity check.

Assert(opts->specified_opts == 0);



2. Remove redundant conditions

/* Check for incompatible options from the user. */
- if (enabled && *enabled_given && *enabled)
+ if (opts->enabled &&
+ IsSet(supported_opts, SUBOPT_ENABLED) &&
+ IsSet(opts->specified_opts, SUBOPT_ENABLED))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
/*- translator: both %s are strings of the form "option = value" */
errmsg("%s and %s are mutually exclusive options",
"connect = false", "enabled = true")));

- if (create_slot && create_slot_given && *create_slot)
+ if (opts->create_slot &&
+ IsSet(supported_opts, SUBOPT_CREATE_SLOT) &&
+ IsSet(opts->specified_opts, SUBOPT_CREATE_SLOT))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("%s and %s are mutually exclusive options",
"connect = false", "create_slot = true")));

- if (copy_data && copy_data_given && *copy_data)
+ if (opts->copy_data &&
+ IsSet(supported_opts, SUBOPT_COPY_DATA) &&
+ IsSet(opts->specified_opts, SUBOPT_COPY_DATA))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("%s and %s are mutually exclusive options",
"connect = false", "copy_data = true")));

By definition, this function only allows an option to be in
"specified_opts" if that option is also in "supported_opts". So, there is
really no need in the above code to check again that it is supported.

It can be simplified like this:

/* Check for incompatible options from the user. */
- if (enabled && *enabled_given && *enabled)
+ if (opts->enabled &&
+ IsSet(opts->specified_opts, SUBOPT_ENABLED))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
/*- translator: both %s are strings of the form "option = value" */
errmsg("%s and %s are mutually exclusive options",
"connect = false", "enabled = true")));

- if (create_slot && create_slot_given && *create_slot)
+ if (opts->create_slot &&
+ IsSet(opts->specified_opts, SUBOPT_CREATE_SLOT))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("%s and %s are mutually exclusive options",
"connect = false", "create_slot = true")));

- if (copy_data && copy_data_given && *copy_data)
+ if (opts->copy_data &&
+ IsSet(opts->specified_opts, SUBOPT_COPY_DATA))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("%s and %s are mutually exclusive options",
"connect = false", "copy_data = true")));

-

3. Remove redundant conditions

Same as 2. Here are more examples of conditions where the redundant
checking of "supported_opts" can be removed.

/*
* Do additional checking for disallowed combination when slot_name = NONE
* was used.
*/
- if (slot_name && *slot_name_given && !*slot_name)
+ if (!opts->slot_name &&
+ IsSet(supported_opts, SUBOPT_SLOT_NAME) &&
+ IsSet(opts->specified_opts, SUBOPT_SLOT_NAME))
{
- if (enabled && *enabled_given && *enabled)
+ if (opts->enabled &&
+ IsSet(supported_opts, SUBOPT_ENABLED) &&
+ IsSet(opts->specified_opts, SUBOPT_ENABLED))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
/*- translator: both %s are strings of the form "option = value" */
errmsg("%s and %s are mutually exclusive options",
"slot_name = NONE", "enabled = true")));

- if (create_slot && create_slot_given && *create_slot)
+ if (opts->create_slot &&
+ IsSet(supported_opts, SUBOPT_CREATE_SLOT) &&
+ IsSet(opts->specified_opts, SUBOPT_CREATE_SLOT))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
+ /*- translator: both %s are strings of the form "option = 

Re: row filtering for logical replication

2021-07-07 Thread Euler Taveira
On Wed, Jul 7, 2021, at 2:24 AM, Greg Nancarrow wrote:
> I found that with the call to ExecInitExtraTupleSlot() in
> pgoutput_row_filter(), then the performance of pgoutput_row_filter()
> degrades considerably over the 100,000 invocations, and on my system
> it took about 43 seconds to filter and send to the subscriber.
> However, by caching the tuple table slot in RelationSyncEntry, this
> duration can be dramatically reduced by 38+ seconds.
> A further improvement can be made using this in combination with
> Peter's plan cache (v16-0004).
> I've attached a patch for this, which relies on the latest v16-0001
> and v16-0004 patches posted by Peter Smith (noting that v16-0001 is
> identical to your previously-posted 0001 patch).
> Also attached is a graph (created by Peter Smith – thanks!) detailing
> the performance improvement.
Greg, I like your suggestion and have already integrated it (I replaced
ExecAllocTableSlot() with MakeSingleTupleTableSlot() because we don't need the
List). I'm still working on a new version to integrate all the suggestions that
you and Peter made. I have code similar to Peter's plan cache and I'm working
on merging both ideas together. I'm done for today, but I'll continue tomorrow.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors

2021-07-07 Thread Yugo NAGATA
Hello Fabien,

I attached the updated patch (v14)!

On Wed, 30 Jun 2021 17:33:24 +0200 (CEST)
Fabien COELHO  wrote:

> >> --report-latencies -> --report-per-command: should we keep supporting
> >> the previous option?
> >
> > Ok. Although now the option is not only for latencies, considering users who
> > are using the existing option, I'm fine with this. I got back this to the
> > previous name.
> 
> Hmmm. I liked the new name! My point was whether we need to support the 
> old one as well for compatibility, or whether we should not bother. I'm 
> still wondering. As I think that the new name is better, I'd suggest to 
> keep it.

Ok. I misunderstood it. I have changed the option name back to report-per-command.

If we keep report-latencies, I can imagine the following choices:
- use report-latencies to print only latency information
- use report-latencies as an alias of report-per-command for compatibility
  and remove it at an appropriate time (that is, treat it as deprecated)

Among these, I prefer the latter because ISTM we would not need many options
for reporting information per command. However, I actually wonder whether we
need to keep the previous one at all if we plan to remove it eventually.

> >> --failures-detailed: if we bother to run with handling failures, should
> >> it always be on?
> >
> > If we print other failures that cannot be retried in future, it could a lot
> > of lines and might make some users who don't need details of failures 
> > annoyed.
> > Moreover, some users would always need information of detailed failures in 
> > log,
> > and others would need only total numbers of failures.
> 
> Ok.
> 
> > Currently we handle only serialization and deadlock failures, so the number 
> > of
> > lines printed and the number of columns of logging is not large even under 
> > the
> > failures-detail, but if we have a chance to handle other failures in future,
> > ISTM adding this option makes sense considering users who would like simple
> > outputs.
> 
> Hmmm. What kind of failures could be managed with retries? I guess that on 
> a connection failure we can try to reconnect, but otherwise it is less 
> clear that other failures make sense to retry.

Indeed, there would be few failures that we should retry, and I cannot imagine
any other than serialization, deadlock, and connection failures for now.
However, considering reporting the number of failed transactions and their
causes in the future, as you said

> Given that we manage errors, ISTM that we should not necessarily
> stop on other not retried errors, but rather count/report them and
> possibly proceed. 

, we could define a few more kinds of failures. At least we can consider
meta-command and other SQL command errors in addition to serialization,
deadlock, and connection failures. So, the total number of kinds of failures
would be at least five, and always reporting all of them would result in a lot
of lines and columns in the logging.

> >> --debug-errors: I'm not sure we should want a special debug mode for that,
> >> I'd consider integrating it with the standard debug, or just for 
> >> development.
> >
> > I think --debug is a debug option for telling users the pgbench's internal
> > behaviors, that is, which client is doing what. On other hand, 
> > --debug-errors
> > is for telling users what error caused a retry or a failure in detail. For
> > users who are not interested in pgbench's internal behavior (sending a 
> > command,
> > receiving a result, ... ) but interested in actual errors raised during 
> > running
> > script, this option seems useful.
> 
> Ok. The this is not really about debug per se, but a verbosity setting?

I think so.

> Maybe --verbose-errors would make more sense? I'm unsure. I'll think about 
> it.

Agreed. This seems more appropriate than the previous one, so I changed the
name to --verbose-errors.

> > Sorry, I couldn't understand your suggestion. Is this about the order of 
> > case
> > statements or pg_log_error?
> 
> My sentence got mixed up. My point was about the case order, so that they 
> are put in a more logical order when reading all the cases.

Ok. Considering the logical order, I moved WAIT_ROLLBACK_RESULT to between
ERROR and RETRY, because WAIT_ROLLBACK_RESULT comes after the ERROR state,
and RETRY comes after ERROR or WAIT_ROLLBACK_RESULT.

> >> Currently, ISTM that the retry on error mode is implicitely always on.
> >> Do we want that? I'd say yes, but maybe people could disagree.
> >
> > The default values of max-tries is 1, so the retry on error is off.
> 
> > Failed transactions are retried only when the user wants it and
> > specifies a valid value to max-treis.
> 
> Ok. My point is that we do not stop on such errors, whereas before ISTM 
> that we would have stopped, so somehow the default behavior has changed 
> and the previous behavior cannot be reinstated with an option. Maybe that 
> is not bad, but this is a behavioral change which needs to be documented 
> and argumented.

I understood. Indeed, there is a behavioural 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-07 Thread Peter Geoghegan
On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan  wrote:
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.

Maybe for something like rtbm.c (which is inspired by Roaring
bitmaps), you would really want to use an "intersection" operation for
this. The TIDs that we need to physically delete from the leaf page
inside btvacuumpage() are the intersection of two bitmaps: our bitmap
of all TIDs on the leaf page, and our bitmap of all TIDs that need to
be deleted by the ongoing btbulkdelete() call.

Obviously the typical case is that most TIDs in the index do *not* get
deleted -- needing to delete more than ~20% of all TIDs in the index
will be rare. Ideally it would be very cheap to figure out that a TID
does not need to be deleted at all. Something a little like a negative
cache (but not a true negative cache). This is a little bit like how
hash joins can be made faster by adding a Bloom filter -- most hash
probes don't need to join a tuple in the real world, and we can make
these hash probes even faster by using a Bloom filter as a negative
cache.

If you had the list of TIDs from a leaf page sorted for batch
processing, and if you had roaring bitmap style "chunks" with
"container" metadata stored in the data structure, you could then use
merging/intersection -- that has some of the same advantages. I think
that this would be a lot more efficient than having one binary search
per TID. Most TIDs from the leaf page can be skipped over very
quickly, in large groups. It's very rare for VACUUM to need to delete
TIDs from completely random heap table blocks in the real world (some
kind of pattern is much more common).

When this merging process finds 1 TID that might really be deletable
then it's probably going to find much more than 1 -- better to make
that cache miss take care of all of the TIDs together. Also seems like
the CPU could do some clever prefetching with this approach -- it
could prefetch TIDs where the initial chunk metadata is insufficient
to eliminate them early -- these are the groups of TIDs that will have
many TIDs that we actually need to delete. ISTM that improving
temporal locality through batching could matter a lot here.
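
As a minimal sketch of that merge-style intersection (assuming both sides are
available as sorted arrays of TIDs encoded as 64-bit keys, e.g. the block
number shifted left by 16 bits OR'd with the offset number; purely
illustrative, not the rtbm code):

#include <stddef.h>
#include <stdint.h>

/* Returns the number of leaf-page TIDs that also appear in the dead-TID set. */
static size_t
intersect_sorted_tids(const uint64_t *leaf_tids, size_t nleaf,
                      const uint64_t *dead_tids, size_t ndead,
                      uint64_t *deletable)
{
    size_t      i = 0, j = 0, nout = 0;

    while (i < nleaf && j < ndead)
    {
        if (leaf_tids[i] < dead_tids[j])
            i++;                            /* leaf TID not dead: skipped cheaply */
        else if (leaf_tids[i] > dead_tids[j])
            j++;
        else
        {
            deletable[nout++] = leaf_tids[i];   /* present in both: delete it */
            i++;
            j++;
        }
    }
    return nout;
}

With container metadata (as in the roaring-bitmap layout), whole chunks on the
dead-TID side could be skipped without touching the individual entries, which
is where most of the win over per-TID binary searches would come from.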

-- 
Peter Geoghegan




Re: Atomic rename feature for Windows.

2021-07-07 Thread Victor Spirin

Thanks!

In this version of the patch, calls to malloc have been removed. 
Hopefully MAX_PATH is long enough for filenames.



How does that cope with durable_rename_excl() where rename() is used
on Windows?  The problems that 909b449 has somewhat "fixed" were
annoying for the users as it prevented WAL segment recycling, so we
need to be sure that this does not cause more harm.


I tested this patch to resolve the error message "could not rename 
temporary statistics file "pg_stat_tmp/pgstat.tmp" to 
"pg_stat_tmp/pgstat.stat": Permission denied".  (I have a patch option 
to rename a temporary file for statistics only.)



+ /*
+ * CHECKSUM_TYPE_NONE defined in the winioctl.h when _WIN32_WINNT >= 
_WIN32_WINNT_WIN10
+ */
+#ifdef CHECKSUM_TYPE_NONE
+#undef CHECKSUM_TYPE_NONE
+#endif

Okay.  Should this be renamed separately then to avoid conflicts?


Renaming CHECKSUM_TYPE_NONE in checksum_helper.h is the best way to go.


  #if defined(_MSC_VER) && _MSC_VER >= 1900
-#define MIN_WINNT 0x0600
+#define MIN_WINNT 0x0A00
  #else
  #define MIN_WINNT 0x0501
  #endif
This is a large bump for Studio >= 2015 I am afraid.  That does not
seem acceptable, as it means losing support for GetLocaleInfoEx()
across older versions.

It seems that the MIN_WINNT value (0x0600 or 0x0A00) does not affect the
use of the GetLocaleInfoEx() function.



+   # manifest with ms_compatibility:supportedOS tags for using 
IsWindows10OrGreater() function
+   print $o "\n1 24 \"src/port/win10.manifest\"\n";
+
close($o);
close($i);
}
diff --git a/src/port/win10.manifest b/src/port/win10.manifest
new file mode 100644

It would be good to not require that.  Those extra files make the
long-term maintenance harder.
The IsWindows10OrGreater() function only works properly if there is a manifest
with a supportedOS Id="{8e0f7a12-bfb3-4fe8-b9a5-48fd50a15a9a}" /> entry.


"Applications not manifested for Windows 10 return false, even if the 
current operating system version is Windows 10."



Victor Spirin
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company

06.07.2021 4:43, Michael Paquier writes:

On Mon, Jul 05, 2021 at 04:53:06PM +0300, Victor Spirin wrote:

This patch related to this post:
https://www.postgresql.org/message-id/CAEepm%3D0FV-k%2B%3Dd9z08cW%3DZXoR1%3Dkw9wdpkP6WAuOrKJdz-8ujg%40mail.gmail.com

How does that cope with durable_rename_excl() where rename() is used
on Windows?  The problems that 909b449 has somewhat "fixed" were
annoying for the users as it prevented WAL segment recycling, so we
need to be sure that this does not cause more harm.


+ /*
+ * CHECKSUM_TYPE_NONE defined in the winioctl.h when _WIN32_WINNT >= 
_WIN32_WINNT_WIN10
+ */
+#ifdef CHECKSUM_TYPE_NONE
+#undef CHECKSUM_TYPE_NONE
+#endif

Okay.  Should this be renamed separately then to avoid conflicts?


- * get support for GetLocaleInfoEx() with locales. For everything else
+ * Studio 2015 the minimum requirement is Windows 10 (0x0A00) to get support 
for SetFileInformationByHandle.
+ * The minimum requirement is Windows Vista (0x0600) get support for 
GetLocaleInfoEx() with locales.
+ * For everything else
   * the minimum version is Windows XP (0x0501).
   */
  #if defined(_MSC_VER) && _MSC_VER >= 1900
-#define MIN_WINNT 0x0600
+#define MIN_WINNT 0x0A00
  #else
  #define MIN_WINNT 0x0501
  #endif

This is a large bump for Studio >= 2015 I am afraid.  That does not
seem acceptable, as it means losing support for GetLocaleInfoEx()
across older versions.


+#if defined(WIN32) && !defined(__CYGWIN__) && defined(_WIN32_WINNT_WIN10) && 
_WIN32_WINNT >= _WIN32_WINNT_WIN10
+
+#include 
+
+/*
+ * win10_rename - uses SetFileInformationByHandle function with 
FILE_RENAME_FLAG_POSIX_SEMANTICS flag for atomic rename file
+ * working only on Windows 10 or later and  _WIN32_WINNT must be >= 
_WIN32_WINNT_WIN10
+ */
+static int win10_rename(wchar_t const* from, wchar_t const* to)

Having win10_rename(), a wrapper for pgrename_win10(), which is itself
an extra wrapper for pgrename(), is confusing.  Could you reduce the
layers of functions here.  At the end we just want an extra runtime
option for pgrename().  Note that pgrename_win10() could be static to
dirmod.c, and it seems to me that you just want a small function to do
the path conversion anyway.  It would be better to avoid using
malloc() in those code paths as well, as the backend could finish by
calling that.  We should be able to remove the malloc()s with local
variables large enough to hold those paths, no?


+   # manifest with ms_compatibility:supportedOS tags for using 
IsWindows10OrGreater() function
+   print $o "\n1 24 \"src/port/win10.manifest\"\n";
+
close($o);
close($i);
}
diff --git a/src/port/win10.manifest b/src/port/win10.manifest
new file mode 100644

It would be good to not require that.  Those extra files make the
long-term maintenance harder.
--
Michael
diff --git 

Re: A few assorted typos in comments

2021-07-07 Thread Thomas Munro
On Thu, Jul 8, 2021 at 10:12 AM Daniel Gustafsson  wrote:
> Spotted a few random typos in comments while reading code, will apply these
> tomorrow or so unless there are objections.

LGTM.




A few assorted typos in comments

2021-07-07 Thread Daniel Gustafsson
Spotted a few random typos in comments while reading code, will apply these
tomorrow or so unless there are objections.

--
Daniel Gustafsson   https://vmware.com/



assorted-typos.patch
Description: Binary data


Re: Replace remaining castNode(…, lfirst(…)) and friends calls with l*_node()

2021-07-07 Thread Daniel Gustafsson
> On 7 Jul 2021, at 21:12, Dagfinn Ilmari Mannsåker  wrote:

> Here's a patch to convert the remaining ones.

I haven't tested it yet, but +1 on the idea of cleaning these up making the
codebase consistent.

--
Daniel Gustafsson   https://vmware.com/





Re: Outdated comments about proc->sem in lwlock.c

2021-07-07 Thread Daniel Gustafsson
> On 3 Jun 2021, at 04:07, Thomas Munro  wrote:

> Here's a patch to remove the misleading comments.

While I'm not an expert in the area, reading the referenced commit and the code
with the now-removed comments, I think this is correct.

--
Daniel Gustafsson   https://vmware.com/





Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-07 Thread Peter Geoghegan
On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada  wrote:
> Currently, the TIDs of dead tuples are stored in an array that is
> collectively allocated at the start of lazy vacuum and TID lookup uses
> bsearch(). There are the following challenges and limitations:
>
> 1. Don't allocate more than 1GB. There was a discussion to eliminate
> this limitation by using MemoryContextAllocHuge() but there were
> concerns about point 2[1].

I think that the main problem with the 1GB limitation is that it is
surprising -- it can cause disruption when we first exceed the magical
limit of ~174 million TIDs. This can cause us to dirty index pages a
second time when we might have been able to just do it once with
sufficient memory for TIDs. OTOH there are actually cases where having
less memory for TIDs makes performance *better* because of locality
effects. This perverse behavior with memory sizing isn't a rare case
that we can safely ignore -- unfortunately it's fairly common.

My point is that we should be careful to choose the correct goal.
Obviously memory use matters. But it might be more helpful to think of
memory use as just a proxy for what truly matters, not a goal in
itself. It's hard to know what this means (what is the "real goal"?),
and hard to measure it even if you know for sure. It could still be
useful to think of it like this.

> A run container is selected in this test case, using 4 bytes for each block.
>
>   Execution Time  Memory Usage
> array   8,883.03600,008,248
> intset   7,358.23100,671,488
> tbm758.81 100,671,544
> rtbm   764.33  29,384,816
>
> Overall, 'rtbm' has a much better lookup performance and good memory
> usage especially if there are relatively many dead tuples. However, in
> some cases, 'intset' and 'array' have a better memory usage.

This seems very promising.

I wonder how much you have thought about the index AM side. It makes
sense to initially evaluate these techniques using this approach of
separating the data structure from how it is used by VACUUM -- I think
that that was a good idea. But at the same time there may be certain
important theoretical questions that cannot be answered this way --
questions about how everything "fits together" in a real VACUUM might
matter a lot. You've probably thought about this at least a little
already. Curious to hear how you think it "fits together" with the
work that you've done already.

The loop inside btvacuumpage() makes each loop iteration call the
callback -- this is always a call to lazy_tid_reaped() in practice.
And that's where we do binary searches. These binary searches are
usually where we see a huge number of cycles spent when we look at
profiles, including the profile that produced your flame graph. But I
worry that that might be a bit misleading -- the way that profilers
attribute costs is very complicated and can never be fully trusted.
While it is true that lazy_tid_reaped() often accesses main memory,
which will of course add a huge amount of latency and make it a huge
bottleneck, the "big picture" is still relevant.

I think that the compiler currently has to make very conservative
assumptions when generating the machine code used by the loop inside
btvacuumpage(), which calls through an opaque function pointer at
least once per loop iteration -- anything can alias, so the compiler
must be conservative. The data dependencies are hard for both the
compiler and the CPU to analyze. The cost of using a function pointer
compared to a direct function call is usually quite low, but there are
important exceptions -- cases where it prevents other useful
optimizations. Maybe this is an exception.

I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
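
A rough standalone sketch of that two-loop shape (the names below are
hypothetical stand-ins, not the actual nbtree/vacuumlazy.c interfaces):

#include <stdbool.h>
#include <stddef.h>

typedef struct TidStub          /* stand-in for ItemPointerData */
{
    unsigned int    block;
    unsigned short  offset;
} TidStub;

/* Stub for a single batched call into vacuumlazy.c per leaf page. */
static void
batch_is_dead(const TidStub *tids, size_t ntids, bool *deletable)
{
    for (size_t i = 0; i < ntids; i++)
        deletable[i] = false;   /* real code would consult the dead-TID store */
}

static void
vacuum_leaf_page_batched(const TidStub *page_tids, size_t ntids, bool *deletable)
{
    /* Pass 1: collect the page's TIDs and look them all up in one call. */
    batch_is_dead(page_tids, ntids, deletable);

    /* Pass 2: physically delete the index tuples flagged in pass 1. */
    for (size_t i = 0; i < ntids; i++)
    {
        if (deletable[i])
        {
            /* _bt_delitems_vacuum()-style work would happen here */
        }
    }
}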

This approach would make btbulkdelete() similar to
_bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
an independent idea to your ideas -- I imagine that this would work
far better when combined with a more compact data structure, which is
naturally more capable of batch processing than a simple array of
TIDs. Maybe this will help the compiler and the CPU to fully
understand the *natural* data dependencies, so that they can be as
effective as possible in making the code run fast. It's possible that
a modern CPU will be able to *hide* the latency more intelligently
than what we have today. The latency is such a big problem that we may
be able to justify "wasting" other CPU 

Re: Enhanced error message to include hint messages for redundant options error

2021-07-07 Thread Daniel Gustafsson
> On 6 Jul 2021, at 17:08, vignesh C  wrote:

> The patch was not applying on the head because of the recent commit
> "8aafb02616753f5c6c90bbc567636b73c0cbb9d4", attached patch which is
> rebased on HEAD.

I sort of like the visual cue of seeing ereport(ERROR ..  since it makes it
clear it will break execution then and there; this will require a lookup for
anyone who doesn't know the function by heart.  That being said, reducing
duplicated boilerplate has clear value, and this reduces the risk of introducing
strings which are complicated to translate.  On the whole I think this is a net
win, and the patch looks pretty good.

-   DefElem*defel = (DefElem *) lfirst(option);
+   defel = (DefElem *) lfirst(option);
Any particular reason to include this in the patch?

--
Daniel Gustafsson   https://vmware.com/





Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-07-07 Thread David Christensen


Tom Lane writes:

> David Christensen  writes:
>> Enclosed is the patch to change the return type to numeric, as well as one 
>> for expanding units to
>> add PB and EB.
>
> Can we really get away with changing the return type?  That would
> by no stretch of the imagination be free; one could expect breakage
> of a few user views, for example.

Hmm, that's a good point, and we can't really make the return type polymorphic
(since there isn't a source type for the given return value).

> Independently of that, I'm pretty much -1 on going further than PB.
> Even if the additional abbreviations you mention are actually recognized
> standards, I think not that many people are familiar with them, and such
> input is way more likely to be a typo than intended data.

If we do go ahead and restrict the expansion to just PB, the return value of 
pg_size_bytes()  would
still support up to 8192 PB before running into range limitations.  I assume 
it's not worth creating
a pg_size_bytes_numeric() with the full range of supported units, but that is 
presumably an option
as well.
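
As a quick sanity check of that 8192 PB figure (PB here meaning binary
petabytes, i.e. powers of 1024, as pg_size_pretty uses), a throwaway snippet:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* bytes per binary petabyte: 1024^5 */
    double  bytes_per_pb = 1024.0 * 1024 * 1024 * 1024 * 1024;

    printf("bigint max ~= %.0f PB\n", (double) INT64_MAX / bytes_per_pb);   /* prints 8192 */
    return 0;
}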

Best,

David




Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-07-07 Thread Tom Lane
David Christensen  writes:
> Enclosed is the patch to change the return type to numeric, as well as one 
> for expanding units to
> add PB and EB.

Can we really get away with changing the return type?  That would
by no stretch of the imagination be free; one could expect breakage
of a few user views, for example.

Independently of that, I'm pretty much -1 on going further than PB.
Even if the additional abbreviations you mention are actually recognized
standards, I think not that many people are familiar with them, and such
input is way more likely to be a typo than intended data.

regards, tom lane




Re: PostgreSQL 14 backend crash on incorrect trigger

2021-07-07 Thread Tom Lane
Alexander Pyhalov  writes:
> The following test case makes postgresql backend crash. The trigger is 
> incorrect, but this didn't crash postgresql before

Yup, that's a silly oversight.  Fixed, thanks for the report!

regards, tom lane




Replace remaining castNode(…, lfirst(…)) and friends calls with l*_node()

2021-07-07 Thread Dagfinn Ilmari Mannsåker
Hi hackers,

We've had node-casting versions of the list extraction macros since
2017, but several cases of the written-out version have been added since
then (even by Tom, who added the l*_node() macros).

Here's a patch to convert the remaining ones.  The macros were
back-patched to all supported branches, so this won't create any
new back-patching hazards.

- ilmari

>From be3556361bcba2f332f51b75aa13946359335ae2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Wed, 7 Jul 2021 11:44:23 +0100
Subject: [PATCH] Use l*_node() family of functions where appropriate
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Instead of castNode(…, lfoo(…))
---
 src/backend/commands/publicationcmds.c |  2 +-
 src/backend/commands/tablecmds.c   |  8 
 src/backend/executor/execPartition.c   |  2 +-
 src/backend/executor/nodeValuesscan.c  |  2 +-
 src/backend/optimizer/util/paramassign.c   |  2 +-
 src/backend/parser/parse_clause.c  |  2 +-
 src/backend/parser/parse_utilcmd.c |  8 
 src/backend/partitioning/partbounds.c  | 18 +-
 src/backend/rewrite/rewriteSearchCycle.c   |  8 
 src/backend/tcop/postgres.c|  2 +-
 src/backend/utils/adt/ruleutils.c  |  6 +++---
 src/pl/plpgsql/src/pl_exec.c   |  2 +-
 src/test/modules/test_predtest/test_predtest.c |  4 ++--
 13 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 95c253c8e0..08f94fa324 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -515,7 +515,7 @@ OpenTableList(List *tables)
 	 */
 	foreach(lc, tables)
 	{
-		RangeVar   *rv = castNode(RangeVar, lfirst(lc));
+		RangeVar   *rv = lfirst_node(RangeVar, lc);
 		bool		recurse = rv->inh;
 		Relation	rel;
 		Oid			myrelid;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 97a9725df7..ba71ceed64 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4770,7 +4770,7 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 
 			foreach(lcmd, subcmds)
 ATExecCmd(wqueue, tab,
-		  castNode(AlterTableCmd, lfirst(lcmd)),
+		  lfirst_node(AlterTableCmd, lcmd),
 		  lockmode, pass, context);
 
 			/*
@@ -12753,7 +12753,7 @@ ATPostAlterTypeParse(Oid oldId, Oid oldRelId, Oid refRelId, char *cmd,
 
 			foreach(lcmd, stmt->cmds)
 			{
-AlterTableCmd *cmd = castNode(AlterTableCmd, lfirst(lcmd));
+AlterTableCmd *cmd = lfirst_node(AlterTableCmd, lcmd);
 
 if (cmd->subtype == AT_AddIndex)
 {
@@ -16571,7 +16571,7 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
 	/* take care of any partition expressions */
 	foreach(l, partspec->partParams)
 	{
-		PartitionElem *pelem = castNode(PartitionElem, lfirst(l));
+		PartitionElem *pelem = lfirst_node(PartitionElem, l);
 
 		if (pelem->expr)
 		{
@@ -16608,7 +16608,7 @@ ComputePartitionAttrs(ParseState *pstate, Relation rel, List *partParams, AttrNu
 	attn = 0;
 	foreach(lc, partParams)
 	{
-		PartitionElem *pelem = castNode(PartitionElem, lfirst(lc));
+		PartitionElem *pelem = lfirst_node(PartitionElem, lc);
 		Oid			atttype;
 		Oid			attcollation;
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920b06..5c723bc54e 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -585,7 +585,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 
 		foreach(ll, wcoList)
 		{
-			WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
+			WithCheckOption *wco = lfirst_node(WithCheckOption, ll);
 			ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
			   &mtstate->ps);
 
diff --git a/src/backend/executor/nodeValuesscan.c b/src/backend/executor/nodeValuesscan.c
index 5de1429fda..00bd1271e4 100644
--- a/src/backend/executor/nodeValuesscan.c
+++ b/src/backend/executor/nodeValuesscan.c
@@ -285,7 +285,7 @@ ExecInitValuesScan(ValuesScan *node, EState *estate, int eflags)
 	i = 0;
 	foreach(vtl, node->values_lists)
 	{
-		List	   *exprs = castNode(List, lfirst(vtl));
+		List	   *exprs = lfirst_node(List, vtl);
 
 		scanstate->exprlists[i] = exprs;
 
diff --git a/src/backend/optimizer/util/paramassign.c b/src/backend/optimizer/util/paramassign.c
index ebb424112b..61b856a959 100644
--- a/src/backend/optimizer/util/paramassign.c
+++ b/src/backend/optimizer/util/paramassign.c
@@ -431,7 +431,7 @@ process_subquery_nestloop_params(PlannerInfo *root, List *subplan_params)
 
 	foreach(lc, subplan_params)
 	{
-		PlannerParamItem *pitem = castNode(PlannerParamItem, lfirst(lc));
+		PlannerParamItem *pitem = lfirst_node(PlannerParamItem, lc);
 
 		if (IsA(pitem->item, Var))
 		{
diff --git 

TOAST questions

2021-07-07 Thread Maciek Sakrejda
Hello,

(I hope it's okay to ask general internals questions here; if this list is
strictly for development, I'll keep my questions on -general but since I'm
asking about internal behavior, this seemed more appropriate.)

I was playing around with inspecting TOAST tables in order to understand
the mechanism better, and I ran across a strange issue: I've created a
table that has a text column, inserted and then deleted some data, and the
TOAST table still has some entries even though the owning table is now
empty:

maciek=# SELECT reltoastrelid::regclass FROM pg_class WHERE relname =
'users';
   reltoastrelid
---
 pg_toast.pg_toast_4398034
(1 row)

maciek=# select chunk_id, chunk_seq from pg_toast.pg_toast_4398034;
 chunk_id | chunk_seq
--+---
  4721665 | 0
  4721665 | 1
(2 rows)

maciek=# select * from users;
 id | created_at | is_admin | username
++--+--
(0 rows)

I've tried to reproduce this with a new table in the same database, but
could not see the same behavior (the TOAST table entries are deleted when I
delete rows from the main table). This is 11.12. Is this expected?

In case it's relevant, this table originally had the first three columns. I
inserted one row, then added the text column, set its STORAGE to EXTERNAL,
and set toast_tuple_target to the minimum of 128. I inserted a few more
rows until I found one large enough to go in the TOAST table (It looks like
Postgres attempts to compress values and store them inline first even when
STORAGE is EXTERNAL? I don't recall the exact size, but I had to use a
value much larger than 128 before it hit the TOAST table. The TOAST docs
allude to this behavior but just making sure I understand correctly.), then
I deleted the rows with non-NULL values in the text column, and noticed the
TOAST table entries were still there. So I deleted everything in the users
table and still see the two rows above in the TOAST table. I've tried this
sequence of steps again with a new table and could not reproduce the issue.

Thanks,
Maciek


Re: Numeric x^y for negative x

2021-07-07 Thread Dean Rasheed
On Wed, 7 Jul 2021 at 18:57, Zhihong Yu  wrote:
>
> +   (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
> +errmsg("value overflows numeric format")));
>
> Here is an example of existing error message which I think is more readable 
> than 'overflows numeric format':
>
>  errmsg("bigint out of range")));
>
> Maybe rephrase as: value is out of range
>

Hmm, I don't know. That's the error that has been thrown by lots of
numeric functions for a long time now, and it seems fine to me.

Regards,
Dean




Re: Case expression pushdown

2021-07-07 Thread Gilles Darold

Le 07/07/2021 à 18:55, Gilles Darold a écrit :

Le 07/07/2021 à 18:50, Gilles Darold a écrit :


Great, I am changing the state in the commitfest to "Ready for committers".


I'm attaching the v5 patch again as it doesn't appear in the Latest 
attachment list in the commitfest.




And the review summary:


This patch allows pushing CASE expressions to foreign servers, so that:

  - more types of updates could be executed directly
  - full foreign table scans can be avoided
  - more push-down of aggregate functions

The patch compiles, and the regression tests with asserts enabled passed 
successfully.

There is a compiler warning but it is not related to this patch:

    deparse.c: In function ‘foreign_expr_walker.isra.0’:
    deparse.c:891:28: warning: ‘collation’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]

  891 |   outer_cxt->collation = collation;
  |   ~^~~
    deparse.c:874:10: warning: ‘state’ may be used uninitialized in 
this function [-Wmaybe-uninitialized]

  874 |  else if (state == outer_cxt->state)
  |  ^

The regression tests for this feature contain the use cases where push-down 
of the CASE clause is useful.

Nested CASE are also part of the regression tests.

The patch adds insignificant overhead by processing a CASE expression further 
than before, but overall it brings a major performance improvement 
for queries on foreign tables that use a CASE WHEN clause in a predicate 
or in an aggregate function.



This patch does what it claims to do, with no problem detected; as expected, 
the CASE clause is not pushed down when a local table is involved in the CASE 
expression or if a non-default collation is used.


Ready for committers.





Re: badly calculated width of emoji in psql

2021-07-07 Thread Pavel Stehule
st 7. 7. 2021 v 20:03 odesílatel Jacob Champion 
napsal:

> On Mon, 2021-04-05 at 14:07 +0900, Kyotaro Horiguchi wrote:
> > At Fri, 2 Apr 2021 11:51:26 +0200, Pavel Stehule <
> pavel.steh...@gmail.com> wrote in
> > > with this patch, the formatting is correct
> >
> > I think the hardest point of this issue is that we don't have a
> > reasonable authoritative source that determines character width. And
> > that the presentation is heavily dependent on environment.
>
> > Unicode 9 and/or 10 defines the character properties "Emoji" and
> > "Emoji_Presentation", and tr51[1] says that
> >
> > > Emoji are generally presented with a square aspect ratio, which
> > > presents a problem for flags.
> > ...
> > > Current practice is for emoji to have a square aspect ratio, deriving
> > > from their origin in Japanese. For interoperability, it is recommended
> > > that this practice be continued with current and future emoji. They
> > > will typically have about the same vertical placement and advance
> > > width as CJK ideographs. For example:
> >
> > Ok, even putting aside flags, the first table in [2] asserts that "#",
> > "*", "0-9" are emoji characters.  But we and I think no-one never
> > present them in two-columns.  And the table has many mysterious holes
> > I haven't looked into.
>
> I think that's why Emoji_Presentation is false for those characters --
> they _could_ be presented as emoji if the UI should choose to do so, or
> if an emoji presentation selector is used, but by default a text
> presentation would be expected.
>
> > We could Emoji_Presentation=yes for the purpose, but for example,
> > U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE) has the property
> > Emoji_Presentation=yes but U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE
> > WITH VERTICAL BAR) does not for a reason uncertaion to me.  It doesn't
> > look like other than some kind of mistake.
>
> That is strange.
>
> > About environment, for example, U+23E9 is an emoji, and
> > Emoji_Presentation=yes, but it is shown in one column on my
> > xterm. (I'm not sure what font am I using..)
>
> I would guess that's the key issue here. If we choose a particular
> width for emoji characters, is there anything keeping a terminal's font
> from doing something different anyway?
>
> Furthermore, if the stream contains an emoji presentation selector
> after a code point that would normally be text, shouldn't we change
> that glyph to have an emoji "expected width"?
>
> I'm wondering if the most correct solution would be to have the user
> tell the client what width to use, using .psqlrc or something.
>

Gnome terminal does it - VTE does it - there is an option for how to display chars
whose width is not well specified.


> > A possible compromise is that we treat all Emoji=yes characters
> > excluding ASCII characters as double-width and manually merge the
> > fragmented regions into reasonably larger chunks.
>
> We could also keep the fragments as-is and generate a full interval
> table, like common/unicode_combining_table.h. It looks like there's
> roughly double the number of emoji intervals as combining intervals, so
> hopefully adding a second binary search wouldn't be noticeably slower.
>
> --
>
> In your opinion, would the current one-line patch proposal make things
> strictly better than they are today, or would it have mixed results?
> I'm wondering how to help this patch move forward for the current
> commitfest, or if we should maybe return with feedback for now.
>

We can check how these chars are printed in the most common terminals in their
modern versions. I am afraid that it can be problematic to find a solution that
works everywhere, because some libraries on some platforms are pretty
obsolete.

Regards

Pavel


> --Jacob
>


Re: DELETE CASCADE

2021-07-07 Thread David Christensen


David G. Johnston writes:

> Having the defined FK behaviors be more readily changeable, while not
> mitigating this need, is IMO a more important feature to implement.  If
> there is a reason that cannot be implemented (besides no one has bothered
> to take the time) then I would consider that reason to also apply to
> prevent implementing this work-around.
>
> David J.

I assume this would look something like:

ALTER TABLE foo ALTER CONSTRAINT my_fkey ON UPDATE CASCADE ON DELETE RESTRICT

with omitted referential_action implying preserving the existing one.

Seems if we were going to tackle this particular problem, there would be two 
possible approaches
here:

1) Change the definitions of the RI_FKey_* constraints for (at least) 
RI_FKey_*_del() to instead
share a single function definition RI_FKey_del() and then pass the 
constraint type operation
(restrict, cascade, no action, etc.) in as a trigger argument instead of having 
separate functions for
each constraint type here.  This would then ensure that the dispatch function 
could both change the
constraint just by modifying the trigger arguments, as well as allowing for 
potentially different behavior
depending on how the underlying function is called (see the rough sketch after 
the second approach below).

2) Keep the existing RI trigger functions, but allow an ALTER CONSTRAINT 
variant to replace the
trigger function with the new desired value, preserving (or transforming, as 
necessary) the original
arguments.
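
For illustration, here is a very rough, untested sketch of what the dispatch in
(1) could look like; the helpers (ri_del_cascade() and friends) are
hypothetical, and the real RI triggers currently read this information from
pg_constraint rather than from tgargs:

#include "postgres.h"
#include "fmgr.h"
#include "commands/trigger.h"

/* hypothetical helpers -- would hold the logic of today's separate triggers */
static Datum ri_del_cascade(TriggerData *trigdata);
static Datum ri_del_restrict(TriggerData *trigdata);
static Datum ri_del_noaction(TriggerData *trigdata);

Datum
RI_FKey_del(PG_FUNCTION_ARGS)
{
	TriggerData *trigdata;
	const char *action;

	if (!CALLED_AS_TRIGGER(fcinfo))
		elog(ERROR, "RI_FKey_del: not called by trigger manager");

	trigdata = (TriggerData *) fcinfo->context;

	/* the referential action is carried as the first trigger argument */
	action = (trigdata->tg_trigger->tgnargs > 0) ?
		trigdata->tg_trigger->tgargs[0] : "noaction";

	if (strcmp(action, "cascade") == 0)
		return ri_del_cascade(trigdata);
	else if (strcmp(action, "restrict") == 0)
		return ri_del_restrict(trigdata);
	else
		return ri_del_noaction(trigdata);
}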

A few things off-hand:

- pg_trigger locking will be necessary as we change the underlying args for the 
tables in
  question. This seems unavoidable.

- partitions; do we need to lock them all in both solutions, or can we get away 
without it in the
  first approach?

- with the first solution you would lose the self-describing name of the 
trigger functions
  themselves (moving to the trigger args instead); while it is a change in a 
very long-standing
  behavior/design, it *should* be an implementation detail, and allows things 
like the explicit
  DELETE [ RESTRICT | CASCADE ] the original patch was pushing for.

- probably more I haven't thought of.

Best,

David




Re: badly calculated width of emoji in psql

2021-07-07 Thread Jacob Champion
On Mon, 2021-04-05 at 14:07 +0900, Kyotaro Horiguchi wrote:
> At Fri, 2 Apr 2021 11:51:26 +0200, Pavel Stehule  
> wrote in 
> > with this patch, the formatting is correct
> 
> I think the hardest point of this issue is that we don't have a
> reasonable authoritative source that determines character width. And
> that the presentation is heavily dependent on environment.

> Unicode 9 and/or 10 defines the character properties "Emoji" and
> "Emoji_Presentation", and tr51[1] says that
> 
> > Emoji are generally presented with a square aspect ratio, which
> > presents a problem for flags.
> ...
> > Current practice is for emoji to have a square aspect ratio, deriving
> > from their origin in Japanese. For interoperability, it is recommended
> > that this practice be continued with current and future emoji. They
> > will typically have about the same vertical placement and advance
> > width as CJK ideographs. For example:
> 
> Ok, even putting aside flags, the first table in [2] asserts that "#",
> "*", "0-9" are emoji characters.  But we and I think no-one never
> present them in two-columns.  And the table has many mysterious holes
> I haven't looked into.

I think that's why Emoji_Presentation is false for those characters --
they _could_ be presented as emoji if the UI should choose to do so, or
if an emoji presentation selector is used, but by default a text
presentation would be expected.

> We could Emoji_Presentation=yes for the purpose, but for example,
> U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE) has the property
> Emoji_Presentation=yes but U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE
> WITH VERTICAL BAR) does not for a reason uncertaion to me.  It doesn't
> look like other than some kind of mistake.

That is strange.

> About environment, for example, U+23E9 is an emoji, and
> Emoji_Presentation=yes, but it is shown in one column on my
> xterm. (I'm not sure what font am I using..)

I would guess that's the key issue here. If we choose a particular
width for emoji characters, is there anything keeping a terminal's font
from doing something different anyway?

Furthermore, if the stream contains an emoji presentation selector
after a code point that would normally be text, shouldn't we change
that glyph to have an emoji "expected width"?

I'm wondering if the most correct solution would be to have the user
tell the client what width to use, using .psqlrc or something.

> A possible compromise is that we treat all Emoji=yes characters
> excluding ASCII characters as double-width and manually merge the
> fragmented regions into reasonably larger chunks.

We could also keep the fragments as-is and generate a full interval
table, like common/unicode_combining_table.h. It looks like there's
roughly double the number of emoji intervals as combining intervals, so
hopefully adding a second binary search wouldn't be noticeably slower.
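
To be concrete, the lookup itself could look roughly like the existing
combining-character search. This is only a self-contained sketch, not the
actual wchar.c code, and the table passed in would be a generated emoji
interval table analogous to unicode_combining_table.h:

/* a codepoint interval, as in the generated combining-character table */
struct cp_interval
{
	unsigned int first;
	unsigned int last;
};

/*
 * Binary search over a sorted, non-overlapping interval table; returns 1
 * if ucs falls inside one of the intervals, 0 otherwise.
 */
static int
cp_in_table(unsigned int ucs, const struct cp_interval *table, int max)
{
	int		min = 0;

	if (ucs < table[0].first || ucs > table[max].last)
		return 0;

	while (max >= min)
	{
		int		mid = (min + max) / 2;

		if (ucs > table[mid].last)
			min = mid + 1;
		else if (ucs < table[mid].first)
			max = mid - 1;
		else
			return 1;
	}

	return 0;
}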

--

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

--Jacob


Re: Numeric x^y for negative x

2021-07-07 Thread Zhihong Yu
On Wed, Jul 7, 2021 at 10:37 AM Dean Rasheed 
wrote:

> On Thu, 1 Jul 2021 at 14:17, Dean Rasheed 
> wrote:
> >
> > On Tue, 29 Jun 2021 at 12:08, Dean Rasheed 
> wrote:
> > >
> > > Numeric x^y is supported for x < 0 if y is an integer, but this
> > > currently fails if y is outside the range of an int32
> >
> > I've been doing some more testing of this, and I spotted another
> > problem with numeric_power().
> >
> > [loss of precision and overflow errors]
> >
> > I think we should attempt to avoid all such overflow errors,
> > that are actually underflows, and return zero instead.
> >
>
> Finally getting back to this ... attached is an updated patch that now
> includes a fix for the loss-of-precision bug and the overflow errors.
> I don't think it's really worth trying to split these up, since
> they're all somewhat interrelated.
>
> Regards,
> Dean
>
Hi,

+   (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+errmsg("value overflows numeric format")));

Here is an example of existing error message which I think is more readable
than 'overflows numeric format':

 errmsg("bigint out of range")));

Maybe rephrase as: value is out of range

Cheers


Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-07-07 Thread David Christensen

David Rowley writes:

> On Wed, 7 Jul 2021 at 02:46, David Christensen
>  wrote:
>> if we do decide to expand the units table there will be a
>> few additional changes (most significantly, the return value of 
>> `pg_size_bytes()` will need to switch
>> to `numeric`).
>
> I wonder if it's worth changing pg_size_bytes() to return NUMERIC
> regardless of if we add any additional units or not.
>
> Would you like to create 2 patches, one to change the return type and
> another to add the new units, both based on top of the v2 patch I sent
> earlier?
>
> David

Enclosed is the patch to change the return type to numeric, as well as one for 
expanding units to
add PB and EB.

If we decide to expand further, the current implementation will need to change, 
as
ZB and YB need shifts of 70 and 80 bits respectively, so int64 isn't big enough 
to hold
them. (I fixed this particular issue in the original version of this patch, so 
there is at least a
blueprint of how to fix it.)
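
For reference, one way around that would be to build the multiplier with
numeric arithmetic instead of an int64 shift. A rough, hypothetical sketch
(with "unit_shift" standing in for the 70/80-bit shift amounts; this is not
the code in the attached patches):

	/* hypothetical: multiplier = 2^unit_shift computed as a numeric */
	Datum		base = DirectFunctionCall1(int8_numeric, Int64GetDatum(2));
	Datum		power = DirectFunctionCall1(int8_numeric,
											Int64GetDatum(unit_shift));
	Datum		mult = DirectFunctionCall2(numeric_power, base, power);

	num = DatumGetNumeric(DirectFunctionCall2(numeric_mul,
											  NumericGetDatum(num),
											  mult));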

I figured that PB and EB are probably good enough additions at this point, so 
we can debate whether
to add the others.

Best,

David

>From 57bd2eafff5404313426a10f63b0b7098314998a Mon Sep 17 00:00:00 2001
From: David Christensen 
Date: Wed, 7 Jul 2021 11:46:09 -0500
Subject: [PATCH] Make pg_size_bytes() return numeric instead of bigint

---
 src/backend/utils/adt/dbsize.c   |  7 ---
 src/include/catalog/pg_proc.dat  |  2 +-
 src/test/regress/expected/dbsize.out | 17 +
 src/test/regress/sql/dbsize.sql  |  6 --
 4 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index 4b7331d85c..a5a3ebe34e 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -813,10 +813,11 @@ pg_size_bytes(PG_FUNCTION_ARGS)
 		}
 	}
 
-	result = DatumGetInt64(DirectFunctionCall1(numeric_int8,
-			   NumericGetDatum(num)));
+	/* now finally truncate, since this is always in integer-like units */
+	num = DatumGetNumeric(DirectFunctionCall1(numeric_ceil,
+			  NumericGetDatum(num)));
 
-	PG_RETURN_INT64(result);
+	PG_RETURN_NUMERIC(num);
 }
 
 /*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fde251fa4f..73326e3618 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -7187,7 +7187,7 @@
   prosrc => 'pg_size_pretty_numeric' },
 { oid => '3334',
   descr => 'convert a size in human-readable format with size units into bytes',
-  proname => 'pg_size_bytes', prorettype => 'int8', proargtypes => 'text',
+  proname => 'pg_size_bytes', prorettype => 'numeric', proargtypes => 'text',
   prosrc => 'pg_size_bytes' },
 { oid => '2997',
   descr => 'disk space usage for the specified table, including TOAST, free space and visibility map',
diff --git a/src/test/regress/expected/dbsize.out b/src/test/regress/expected/dbsize.out
index e901a2c92a..2950e017d0 100644
--- a/src/test/regress/expected/dbsize.out
+++ b/src/test/regress/expected/dbsize.out
@@ -101,6 +101,19 @@ SELECT size, pg_size_bytes(size) FROM
  -.0 gb | 0
 (8 rows)
 
+-- valid inputs outside bigint range (previous errors)
+SELECT pg_size_bytes('9223372036854775807.9');
+pg_size_bytes
+-
+ 9223372036854775808
+(1 row)
+
+SELECT pg_size_bytes('1e100');
+ pg_size_bytes 
+---
+ 1
+(1 row)
+
 -- invalid inputs
 SELECT pg_size_bytes('1 AB');
 ERROR:  invalid size: "1 AB"
@@ -114,10 +127,6 @@ SELECT pg_size_bytes('1 AB A');
 ERROR:  invalid size: "1 AB A"
 DETAIL:  Invalid size unit: "AB A".
 HINT:  Valid units are "bytes", "kB", "MB", "GB", and "TB".
-SELECT pg_size_bytes('9223372036854775807.9');
-ERROR:  bigint out of range
-SELECT pg_size_bytes('1e100');
-ERROR:  bigint out of range
 SELECT pg_size_bytes('1e100');
 ERROR:  value overflows numeric format
 SELECT pg_size_bytes('1 byte');  -- the singular "byte" is not supported
diff --git a/src/test/regress/sql/dbsize.sql b/src/test/regress/sql/dbsize.sql
index d10a4d7f68..e88184e9c9 100644
--- a/src/test/regress/sql/dbsize.sql
+++ b/src/test/regress/sql/dbsize.sql
@@ -30,12 +30,14 @@ SELECT size, pg_size_bytes(size) FROM
  (VALUES ('-1.'), ('-1.kb'), ('-1. kb'), ('-0. gb'),
  ('-.1'), ('-.1kb'), ('-.1 kb'), ('-.0 gb')) x(size);
 
+-- valid inputs outside bigint range (previous errors)
+SELECT pg_size_bytes('9223372036854775807.9');
+SELECT pg_size_bytes('1e100');
+
 -- invalid inputs
 SELECT pg_size_bytes('1 AB');
 SELECT pg_size_bytes('1 AB A');
 SELECT pg_size_bytes('1 AB A');
-SELECT pg_size_bytes('9223372036854775807.9');
-SELECT pg_size_bytes('1e100');
 SELECT 

Re: Numeric x^y for negative x

2021-07-07 Thread Dean Rasheed
On Thu, 1 Jul 2021 at 14:17, Dean Rasheed  wrote:
>
> On Tue, 29 Jun 2021 at 12:08, Dean Rasheed  wrote:
> >
> > Numeric x^y is supported for x < 0 if y is an integer, but this
> > currently fails if y is outside the range of an int32
>
> I've been doing some more testing of this, and I spotted another
> problem with numeric_power().
>
> [loss of precision and overflow errors]
>
> I think we should attempt to avoid all such overflow errors,
> that are actually underflows, and return zero instead.
>

Finally getting back to this ... attached is an updated patch that now
includes a fix for the loss-of-precision bug and the overflow errors.
I don't think it's really worth trying to split these up, since
they're all somewhat interrelated.

Regards,
Dean
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index bc71326..1481538
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -3937,11 +3937,6 @@ numeric_power(PG_FUNCTION_ARGS)
 (errcode(ERRCODE_INVALID_ARGUMENT_FOR_POWER_FUNCTION),
  errmsg("zero raised to a negative power is undefined")));
 
-	if (sign1 < 0 && !numeric_is_integral(num2))
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_ARGUMENT_FOR_POWER_FUNCTION),
- errmsg("a negative number raised to a non-integer power yields a complex result")));
-
 	/*
 	 * Initialize things
 	 */
@@ -9754,12 +9749,18 @@ exp_var(const NumericVar *arg, NumericVa
 	 */
	val = numericvar_to_double_no_overflow(&x);
 
-	/* Guard against overflow */
+	/* Guard against overflow/underflow */
 	/* If you change this limit, see also power_var()'s limit */
 	if (Abs(val) >= NUMERIC_MAX_RESULT_SCALE * 3)
-		ereport(ERROR,
-(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
- errmsg("value overflows numeric format")));
+	{
+		if (val > 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+	 errmsg("value overflows numeric format")));
+		zero_var(result);
+		result->dscale = rscale;
+		return;
+	}
 
 	/* decimal weight = log10(e^x) = x * log10(e) */
 	dweight = (int) (val * 0.434294481903252);
@@ -10117,6 +10118,8 @@ log_var(const NumericVar *base, const Nu
 static void
 power_var(const NumericVar *base, const NumericVar *exp, NumericVar *result)
 {
+	int			res_sign;
+	NumericVar	abs_base;
 	NumericVar	ln_base;
 	NumericVar	ln_num;
 	int			ln_dweight;
@@ -10160,9 +10163,37 @@ power_var(const NumericVar *base, const
 		return;
 	}
 
+	init_var(&abs_base);
	init_var(&ln_base);
	init_var(&ln_num);
 
+	/*
+	 * If base is negative, insist that exp be an integer.  The result is then
+	 * positive if exp is even and negative if exp is odd.
+	 */
+	if (base->sign == NUMERIC_NEG)
+	{
+		/* check that exp is an integer */
+		if (exp->ndigits > 0 && exp->ndigits > exp->weight + 1)
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_ARGUMENT_FOR_POWER_FUNCTION),
+	 errmsg("a negative number raised to a non-integer power yields a complex result")));
+
+		/* test if exp is odd or even */
+		if (exp->ndigits > 0 && exp->ndigits == exp->weight + 1 &&
+			(exp->digits[exp->ndigits - 1] & 1))
+			res_sign = NUMERIC_NEG;
+		else
+			res_sign = NUMERIC_POS;
+
+		/* then work with abs(base) below */
+		set_var_from_var(base, &abs_base);
+		abs_base.sign = NUMERIC_POS;
+		base = &abs_base;
+	}
+	else
+		res_sign = NUMERIC_POS;
+
 	/*--
 	 * Decide on the scale for the ln() calculation.  For this we need an
 	 * estimate of the weight of the result, which we obtain by doing an
@@ -10193,25 +10224,31 @@ power_var(const NumericVar *base, const
 
	val = numericvar_to_double_no_overflow(&ln_num);
 
-	/* initial overflow test with fuzz factor */
+	/* initial overflow/underflow test with fuzz factor */
 	if (Abs(val) > NUMERIC_MAX_RESULT_SCALE * 3.01)
-		ereport(ERROR,
-(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
- errmsg("value overflows numeric format")));
+	{
+		if (val > 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+	 errmsg("value overflows numeric format")));
+		zero_var(result);
+		result->dscale = NUMERIC_MAX_DISPLAY_SCALE;
+		return;
+	}
 
 	val *= 0.434294481903252;	/* approximate decimal result weight */
 
-	/* choose the result scale */
+	/* choose the result scale and the scale for the real calculation */
 	rscale = NUMERIC_MIN_SIG_DIGITS - (int) val;
 	rscale = Max(rscale, base->dscale);
 	rscale = Max(rscale, exp->dscale);
 	rscale = Max(rscale, NUMERIC_MIN_DISPLAY_SCALE);
-	rscale = Min(rscale, NUMERIC_MAX_DISPLAY_SCALE);
 
-	/* set the scale for the real exp * ln(base) calculation */
 	local_rscale = rscale + (int) val - ln_dweight + 8;
 	local_rscale = Max(local_rscale, NUMERIC_MIN_DISPLAY_SCALE);
 
+	rscale = Min(rscale, NUMERIC_MAX_DISPLAY_SCALE);
+
 	/* and do the real calculation */
 
	ln_var(base, &ln_base, local_rscale);
@@ -10220,8 +10257,12 @@ power_var(const NumericVar *base, const
 
	exp_var(&ln_num, result, rscale);
 
+	if (res_sign == NUMERIC_NEG && result->ndigits > 0)
+		result->sign = 

re: Why ALTER SUBSCRIPTION ... SET (slot_name='none') requires subscription disabled?

2021-07-07 Thread Ranier Vilela
>OTOH, when I execute ALTER SUBSCRIPTION ... SET (slot_name=''), it doesn't
complain. However,
>SELECT select pg_create_logical_replication_slot('', 'pgoutput') complains
slot name is too
>short. Although, the slot will be created at publisher, and validate the
slot name, IMO, we
>can also validate the slot_name in parse_subscription_options() to get the
error early.
>Attached fixes it. Any thoughts?
I think that this fix is better placed after the check for whether the name is
equal to "none", since most of the time it will be "none".
Meanwhile, reducing the strlen overhead in
ReplicationSlotValidateName might also be worth it.
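
To make the ordering concrete, something along these lines is what I mean
(an illustration of the order of the checks only, not the exact
parse_subscription_options() code, which works through pointers):

	/* handle the common "none" case first, validate only otherwise */
	if (strcmp(slot_name, "none") == 0)
		slot_name = NULL;		/* no slot wanted, nothing to validate */
	else
		ReplicationSlotValidateName(slot_name, ERROR);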

For convenience, I have attached a new version.

regards,
Ranier Vilela


v2-validate-slot_name-in-parse_subscription_options.patch
Description: Binary data


reduce_overhead_strlen_replicationslotvalidatename.patch
Description: Binary data


Re: Pipeline mode and PQpipelineSync()

2021-07-07 Thread Alvaro Herrera
On 2021-Jul-07, Boris Kolpackov wrote:

> // Try to minimize the chance of blocking the server by first processing
> // the result and then sending more queries.
> //
> if (FD_ISSET (sock, &rds))
> {
>   if (PQconsumeInput (conn) == 0)
> assert (false);
> 
>   while (PQisBusy (conn) == 0)
>   {
> //fprintf (stderr, "PQgetResult %zu\n", rn);
> 
> PGresult* res = PQgetResult (conn);
> assert (res != NULL);
> ExecStatusType stat = PQresultStatus (res);

Hmm ... aren't you trying to read more results than you sent queries?  I
think there should be a break out of that block when that happens (which
means the read of the PGRES_PIPELINE_SYNC needs to be out of there too).
With this patch, the program seems to work well for me.

***
*** 94,112 
while (PQisBusy (conn) == 0)
{
  //fprintf (stderr, "PQgetResult %zu\n", rn);
  
  PGresult* res = PQgetResult (conn);
  assert (res != NULL);
  ExecStatusType stat = PQresultStatus (res);
  
- if (stat == PGRES_PIPELINE_SYNC)
- {
-   assert (wdone && rn == n);
-   PQclear (res);
-   rdone = true;
-   break;
- }
- 
  if (stat == PGRES_FATAL_ERROR)
  {
const char* s = PQresultErrorField (res, PG_DIAG_SQLSTATE);
--- 94,110 
while (PQisBusy (conn) == 0)
{
  //fprintf (stderr, "PQgetResult %zu\n", rn);
+ if (rn >= wn)
+ {
+   if (wdone)
+ rdone = true;
+   break;
+ }
  
  PGresult* res = PQgetResult (conn);
  assert (res != NULL);
  ExecStatusType stat = PQresultStatus (res);
  
  if (stat == PGRES_FATAL_ERROR)
  {
const char* s = PQresultErrorField (res, PG_DIAG_SQLSTATE);
***
*** 190,195 
--- 188,201 
  break; // Blocked or done.
}
  }
+ 
+ if (rdone)
+ {
+   PGresult *res = PQgetResult(conn);
+   assert(PQresultStatus(res) == PGRES_PIPELINE_SYNC);
+   PQclear(res);
+   break;
+ }
}
  
if (PQexitPipelineMode (conn) == 0 ||
***
*** 246,248 
--- 252,269 
  PQclear (res);
}
  }
+ 
+ int main(int argc, char **argv)
+ {
+   PGconn *conn = PQconnectdb("");
+   if (PQstatus(conn) != CONNECTION_OK)
+   {
+ fprintf(stderr, "connection failed: %s\n",
+ PQerrorMessage(conn));
+ return 1;
+   }
+ 
+   test(conn);
+ }
+ 
+ 


-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: Case expression pushdown

2021-07-07 Thread Gilles Darold

Le 07/07/2021 à 18:50, Gilles Darold a écrit :


Great, I am changing the state in the commitfest to "Ready for committers".


I'm attaching the v5 patch again as it doesn't appear in the Latest 
attachment list in the commitfest.



--
Gilles Darold
MigOps Inc

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 31919fda8c..6177a7b252 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -186,6 +186,8 @@ static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 	deparse_expr_cxt *context);
 
+static void deparseCaseExpr(CaseExpr *node, deparse_expr_cxt *context);
+
 /*
  * Helper functions
  */
@@ -509,6 +511,54 @@ foreign_expr_walker(Node *node,
 	state = FDW_COLLATE_UNSAFE;
 			}
 			break;
+		case T_CaseTestExpr:
+			{
+CaseTestExpr *c = (CaseTestExpr *) node;
+
+/*
+ * Collation rule is same as for function nodes.
+ */
+collation = c->collation;
+if (collation == InvalidOid)
+	state = FDW_COLLATE_NONE;
+else if (inner_cxt.state == FDW_COLLATE_SAFE &&
+		 collation == inner_cxt.collation)
+	state = FDW_COLLATE_SAFE;
+else if (collation == DEFAULT_COLLATION_OID)
+	state = FDW_COLLATE_NONE;
+else
+	state = FDW_COLLATE_UNSAFE;
+			}
+			break;
+		case T_CaseExpr:
+			{
+ListCell   *lc;
+ 
+/* Recurse to case clause subexpressions. */
+foreach(lc, ((CaseExpr *) node)->args)
+{
+	if (!foreign_expr_walker((Node *) lfirst(lc),
+			 glob_cxt, &inner_cxt))
+		return false;
+}
+			}
+			break;
+		case T_CaseWhen:
+			{
+CaseWhen   *whenExpr = (CaseWhen *) node;
+ 
+/* Recurse to case clause expression. */
+if (!foreign_expr_walker((Node *) whenExpr->expr,
+		 glob_cxt, &inner_cxt))
+	return false;
+/* Recurse to result expression. */
+if (!foreign_expr_walker((Node *) whenExpr->result,
+		 glob_cxt, &inner_cxt))
+	return false;
+/* Don't apply exprType() to the case when expr. */
+check_type = false;
+			}
+			break;
 		case T_OpExpr:
 		case T_DistinctExpr:	/* struct-equivalent to OpExpr */
 			{
@@ -2462,6 +2512,9 @@ deparseExpr(Expr *node, deparse_expr_cxt *context)
 		case T_Aggref:
 			deparseAggref((Aggref *) node, context);
 			break;
+		case T_CaseExpr:
+			deparseCaseExpr((CaseExpr *) node, context);
+			break;
 		default:
 			elog(ERROR, "unsupported expression type for deparse: %d",
  (int) nodeTag(node));
@@ -3179,6 +3232,52 @@ appendAggOrderBy(List *orderList, List *targetList, deparse_expr_cxt *context)
 	}
 }
 
+/*
+ * Deparse CASE expression
+ */
+static void
+deparseCaseExpr(CaseExpr *node, deparse_expr_cxt *context)
+{
+	StringInfo  buf = context->buf;
+	ListCell   *lc = NULL;
+ 
+	appendStringInfoString(buf, "(CASE");
+ 
+	/* If this is a CASE arg WHEN clause process arg first */
+	if (node->arg != NULL)
+	{
+		appendStringInfoString(buf, " ");
+		deparseExpr(node->arg, context);
+	}
+ 
+	/* Add each condition/result of the CASE clause */
+	foreach(lc, node->args)
+	{
+		CaseWhen   *whenclause = (CaseWhen *) lfirst(lc);
+ 
+		/* WHEN */
+		appendStringInfoString(buf, " WHEN ");
+		if (node->arg == NULL)  /* CASE WHEN */
+			deparseExpr(whenclause->expr, context);
+		else/* CASE arg WHEN */
+			deparseExpr(lsecond(((OpExpr *) whenclause->expr)->args), context);
+ 
+		/* THEN */
+		appendStringInfoString(buf, " THEN ");
+		deparseExpr(whenclause->result, context);
+	}
+ 
+	/* add ELSE if needed */
+	if (node->defresult != NULL)
+	{
+		appendStringInfoString(buf, " ELSE ");
+		deparseExpr(node->defresult, context);
+	}
+ 
+	/* append END */
+	appendStringInfoString(buf, " END)");
+}
+
 /*
  * Print the representation of a parameter to be sent to the remote side.
  *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index b510322c4e..dfef4efd79 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5561,6 +5561,150 @@ UPDATE ft2 d SET c2 = CASE WHEN random() >= 0 THEN d.c2 ELSE 0 END
 
 UPDATE ft2 d SET c2 = CASE WHEN random() >= 0 THEN d.c2 ELSE 0 END
   FROM ft2 AS t WHERE d.c1 = t.c1 AND d.c1 > 1000;
+-- Test CASE pushdown
+EXPLAIN (VERBOSE, COSTS OFF)
+UPDATE ft2 d SET c2 = CASE WHEN c2 > 0 THEN c2 ELSE 0 END
+WHERE c1 > 1000;
+   QUERY PLAN   
+
+ Update on public.ft2 d
+   ->  Foreign Update on public.ft2 d
+ Remote SQL: UPDATE "S 1"."T 1" SET c2 = (CASE WHEN (c2 > 0) THEN c2 ELSE 0 END) WHERE (("C 1" > 1000))
+(3 rows)
+
+UPDATE ft2 d SET c2 = CASE WHEN c2 > 0 THEN c2 ELSE 0 END
+WHERE c1 > 1000;
+-- CASE in WHERE clause
+EXPLAIN (VERBOSE, COSTS 

Re: Case expression pushdown

2021-07-07 Thread Gilles Darold

Le 07/07/2021 à 17:39, Alexander Pyhalov a écrit :

Hi.

Gilles Darold писал 2021-07-07 15:02:


Le 22/06/2021 à 15:39, Alexander Pyhalov a écrit :


Seino Yuki писал 2021-06-22 16:03:
On 2021-06-16 01:29, Alexander Pyhalov wrote:
Hi.

Ashutosh Bapat писал 2021-06-15 16:24:
Looks quite useful to me. Can you please add this to the next
commitfest?

Addded to commitfest. Here is an updated patch version.


Thanks for posting the patch.
I agree with this content.


+ Foreign Scan on public.ft2 (cost=156.58..165.45 rows=394
width=14)

 It's not a big issue, but is there any intention behind the pattern
of
outputting costs in regression tests?

Hi.

No, I don't think it makes much sense. Updated tests (also added case
with empty else).

The patch doesn't apply anymore to master, I join an update of your
patch update in attachment. This is your patch rebased and untouched
minus a comment in the test and renamed to v4.

I could have miss something but I don't think that additional struct
elements case_args in structs foreign_loc_cxt and deparse_expr_cxt are
necessary. They look to be useless.


I thought we should compare arg collation and expression collation and 
didn't suggest, that we can take CaseTestExpr's collation directly, 
without deriving it from CaseExpr's arg. Your version of course looks 
saner.




The patch will also be more clear if the CaseWhen node was handled
separately in foreign_expr_walker() instead of being handled in the
T_CaseExpr case. By this way the T_CaseExpr case just need to call
recursively foreign_expr_walker(). I also think that code in
T_CaseTestExpr should just check the collation, there is nothing more
to do here like you have commented the function deparseCaseTestExpr().
This function can be removed as it does nothing if the case_args
elements are removed.

There is a problem the regression test with nested CASE clauses:


EXPLAIN (VERBOSE, COSTS OFF)
SELECT c1,c2,c3 FROM ft2 WHERE CASE CASE WHEN c2 > 0 THEN c2 END
WHEN 100 THEN 601 WHEN c2 THEN c2 ELSE 0 END > 600 ORDER BY c1;


the original query use "WHERE CASE CASE WHEN" but the remote query is
not the same in the plan:


Remote SQL: SELECT "C 1", c2, c3 FROM "S 1"."T 1" WHERE (((CASE WHEN
((CASE WHEN (c2 > 0) THEN c2 ELSE NULL::integer END) = 100) THEN 601
WHEN ((CASE WHEN (c2 > 0) THEN c2 ELSE NULL::integer END) = c2) THEN
c2 ELSE 0 END) > 600)) ORDER BY "C 1" ASC NULLS LAST


Here this is "WHERE (((CASE WHEN ((CASE WHEN" I expected it to be
unchanged to "WHERE (((CASE (CASE WHEN".


I'm not sure this is an issue (as we change CASE A WHEN B ... to CASE 
WHEN (A=B)),

and expressions should be free from side effects, but again your version
looks better.



Right, it returns the same result, but I think it is confusing not to 
see the same CASE form in the remote query.





Thanks for improving the patch, it looks saner now.



Great, I am changing the state in the commitfest to "Ready for committers".


--
Gilles Darold
MigOps Inc





Re: PostgreSQL-13.3 parser.y with positional references by named references

2021-07-07 Thread Tom Lane
Domingo Alvarez Duarte  writes:
> What if the generated parser/lexer be present in the tarball distributions ?

It is.  The discussion here is about developer convenience (how painful
is it to read or modify a rule) versus developer convenience (what hoops
have you got to jump through to install a version of Bison that will
work, when building from a git checkout).  Note however that the set of
developers affected by the second aspect is much larger than those
affected by the first.  Lots of people who work on PG never mess with
the grammar.

regards, tom lane




Re: Case expression pushdown

2021-07-07 Thread Alexander Pyhalov

Hi.

Gilles Darold писал 2021-07-07 15:02:


Le 22/06/2021 à 15:39, Alexander Pyhalov a écrit :


Seino Yuki писал 2021-06-22 16:03:
On 2021-06-16 01:29, Alexander Pyhalov wrote:
Hi.

Ashutosh Bapat писал 2021-06-15 16:24:
Looks quite useful to me. Can you please add this to the next
commitfest?

Addded to commitfest. Here is an updated patch version.


Thanks for posting the patch.
I agree with this content.


+ Foreign Scan on public.ft2  (cost=156.58..165.45 rows=394
width=14)

 It's not a big issue, but is there any intention behind the pattern
of
outputting costs in regression tests?

Hi.

No, I don't think it makes much sense. Updated tests (also added case
with empty else).

The patch doesn't apply anymore to master, I join an update of your
patch update in attachment. This is your patch rebased and untouched
minus a comment in the test and renamed to v4.

I could have miss something but I don't think that additional struct
elements case_args in structs foreign_loc_cxt and deparse_expr_cxt are
necessary. They look to be useless.


I thought we should compare arg collation and expression collation and 
didn't suggest that we could take CaseTestExpr's collation directly, 
without deriving it from CaseExpr's arg. Your version of course looks 
saner.




The patch will also be more clear if the CaseWhen node was handled
separately in foreign_expr_walker() instead of being handled in the
T_CaseExpr case. By this way the T_CaseExpr case just need to call
recursively foreign_expr_walker(). I also think that code in
T_CaseTestExpr should just check the collation, there is nothing more
to do here like you have commented the function deparseCaseTestExpr().
This function can be removed as it does nothing if the case_args
elements are removed.

There is a problem the regression test with nested CASE clauses:


EXPLAIN (VERBOSE, COSTS OFF)
SELECT c1,c2,c3 FROM ft2 WHERE CASE CASE WHEN c2 > 0 THEN c2 END
WHEN 100 THEN 601 WHEN c2 THEN c2 ELSE 0 END > 600 ORDER BY c1;


the original query use "WHERE CASE CASE WHEN" but the remote query is
not the same in the plan:


Remote SQL: SELECT "C 1", c2, c3 FROM "S 1"."T 1" WHERE (((CASE WHEN
((CASE WHEN (c2 > 0) THEN c2 ELSE NULL::integer END) = 100) THEN 601
WHEN ((CASE WHEN (c2 > 0) THEN c2 ELSE NULL::integer END) = c2) THEN
c2 ELSE 0 END) > 600)) ORDER BY "C 1" ASC NULLS LAST


Here this is "WHERE (((CASE WHEN ((CASE WHEN" I expected it to be
unchanged to "WHERE (((CASE (CASE WHEN".


I'm not sure this is an issue (as we change CASE A WHEN B ... to CASE 
WHEN (A=B)),

and expressions should be free from side effects, but again your version
looks better.

Thanks for improving the patch, it looks saner now.
--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: unexpected data loaded into database when used COPY FROM

2021-07-07 Thread Tom Lane
"jian...@fujitsu.com"  writes:
> When I used COPY FROM command on windows, I found that If the line data ends 
> with a backslash and carriage return/newlines(\r\n),COPY FROM mishandle the 
> line .
> As a result, there were unexpected data loaded into database.

If what you're saying is that backslash-\r-\n results in the \r being
taken as a data character, there is exactly nothing unexpected about that.

regards, tom lane




Re: PostgreSQL-13.3 parser.y with positional references by named references

2021-07-07 Thread Domingo Alvarez Duarte
I understand the concerns but I would not qualify it as a "minor 
developer-convenience feature".


I'm not impartial because the initial suggestion was mine, but just to add 
more options to be considered:


What if the generated parser/lexer were present in the tarball distributions?

Cheers !

On 7/7/21 17:14, Tom Lane wrote:

ilm...@ilmari.org (Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?=) writes:

Peter Eisentraut  writes:

On 04.07.21 17:58, Tom Lane wrote:

When is "some time now"?

release 2.5 (2011-05-14)

Do we support building on RHEL6? It only ships bison 2.4, so that would
mean people building on that would have to install it seprately.

A quick look through the buildfarm shows these animals that would be
unhappy:

 sysname|  snapshot   |l
  gaur  | 2021-07-03 22:56:25 | configure: using bison (GNU Bison) 1.875
  prairiedog| 2021-07-07 06:38:15 | configure: using bison (GNU Bison) 1.875
  locust| 2021-07-07 07:15:22 | configure: using bison (GNU Bison) 2.3
  longfin   | 2021-07-07 04:39:09 | configure: using bison (GNU Bison) 2.3
  sifaka| 2021-07-07 04:33:58 | configure: using bison (GNU Bison) 2.3
  anole | 2021-07-01 15:50:38 | configure: using bison (GNU Bison) 2.4.1
  gharial   | 2021-07-05 08:00:48 | configure: using bison (GNU Bison) 2.4.1
  walleye   | 2021-07-07 06:55:35 | configure: using bison (GNU Bison) 2.4.2
  jacana| 2021-07-06 03:00:44 | Jul 05 23:00:49 configure: using bison 
(GNU Bison) 2.4.2

(hmm, almost half of those are mine :-().  The main thing I take away
from this is that Apple is still shipping 2.3, which means that requiring
2.5 would completely break the ability to build on macOS without using
anything from homebrew or macports.  That seems like moving the goalposts
pretty far for a minor developer-convenience feature.

regards, tom lane





Re: Pipeline mode and PQpipelineSync()

2021-07-07 Thread Alvaro Herrera
On 2021-Jul-07, Boris Kolpackov wrote:

> Alvaro Herrera  writes:
> 
> > On 2021-Jul-07, Boris Kolpackov wrote:
> > 
> > > I don't get any difference in behavior with this patch. That is, I
> > > still get the unexpected NULL result. Does it make a difference for
> > > your reproducer?
> > 
> > Yes, the behavior changes for my repro.   Is it possible for you to
> > share a full program I can compile and run, plesse?
> 
> Here is the test sans the connection setup:

Thanks, looking now.  (I was trying to compile libodb and everything, and
I went down a rabbit hole of configure failing with mysterious m4 errors ...)

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: "debug_invalidate_system_caches_always" is too long

2021-07-07 Thread Tom Lane
Peter Eisentraut  writes:
> The clobbering doesn't actually happen unless you turn on 
> CLOBBER_FREED_MEMORY, so it would be good to keep that separate.

Fair point.  What do you think of the alternative proposals
"debug_flush_caches", "debug_discard_caches", etc?

regards, tom lane




Re: PostgreSQL-13.3 parser.y with positional references by named references

2021-07-07 Thread Tom Lane
ilm...@ilmari.org (Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?=) writes:
> Peter Eisentraut  writes:
>> On 04.07.21 17:58, Tom Lane wrote:
>>> When is "some time now"?

>> release 2.5 (2011-05-14)

> Do we support building on RHEL6? It only ships bison 2.4, so that would
> mean people building on that would have to install it seprately.

A quick look through the buildfarm shows these animals that would be
unhappy:

sysname|  snapshot   |l 
 gaur  | 2021-07-03 22:56:25 | configure: using bison (GNU Bison) 1.875
 prairiedog| 2021-07-07 06:38:15 | configure: using bison (GNU Bison) 1.875
 locust| 2021-07-07 07:15:22 | configure: using bison (GNU Bison) 2.3
 longfin   | 2021-07-07 04:39:09 | configure: using bison (GNU Bison) 2.3
 sifaka| 2021-07-07 04:33:58 | configure: using bison (GNU Bison) 2.3
 anole | 2021-07-01 15:50:38 | configure: using bison (GNU Bison) 2.4.1
 gharial   | 2021-07-05 08:00:48 | configure: using bison (GNU Bison) 2.4.1
 walleye   | 2021-07-07 06:55:35 | configure: using bison (GNU Bison) 2.4.2
 jacana| 2021-07-06 03:00:44 | Jul 05 23:00:49 configure: using bison 
(GNU Bison) 2.4.2

(hmm, almost half of those are mine :-().  The main thing I take away
from this is that Apple is still shipping 2.3, which means that requiring
2.5 would completely break the ability to build on macOS without using
anything from homebrew or macports.  That seems like moving the goalposts
pretty far for a minor developer-convenience feature.

regards, tom lane




Re: Pipeline mode and PQpipelineSync()

2021-07-07 Thread Boris Kolpackov
Alvaro Herrera  writes:

> On 2021-Jul-07, Boris Kolpackov wrote:
> 
> > I don't get any difference in behavior with this patch. That is, I
> > still get the unexpected NULL result. Does it make a difference for
> > your reproducer?
> 
> Yes, the behavior changes for my repro.   Is it possible for you to
> share a full program I can compile and run, plesse?

Here is the test sans the connection setup:

---

#include 

#include 
#include 
#include 
#include 
#include 
#include 

// Note: hack.
//
#include 
#define htonll(x) ((((long long)htonl(x)) << 32) + htonl((x) >> 32))

static const size_t columns = 3;

struct data
{
  long long id;
  long long idata;
  const char* sdata;
};

static char* values[columns];
static int lengths[columns];
static int formats[columns] = {1, 1, 1};

static const unsigned int types[columns] = {
  20, // int8
  20, // int8
  25  // text
};

static void
init (const struct data* d)
{
  values[0] = (char*)&d->id;
  lengths[0] = sizeof (d->id);

  values[1] = (char*)&d->idata;
  lengths[1] = sizeof (d->idata);

  values[2] = (char*)d->sdata;
  lengths[2] = strlen (d->sdata);
}

static void
execute (PGconn* conn, const struct data* ds, size_t n)
{
  int sock = PQsocket (conn);
  assert (sock != -1);

  if (PQsetnonblocking (conn, 1) == -1 ||
  PQenterPipelineMode (conn) == 0)
assert (false);

  // True if we've written and read everything, respectively.
  //
  bool wdone = false;
  bool rdone = false;

  size_t wn = 0;
  size_t rn = 0;

  while (!rdone)
  {
fd_set wds;
if (!wdone)
{
  FD_ZERO (&wds);
  FD_SET (sock, &wds);
}

fd_set rds;
FD_ZERO (&rds);
FD_SET (sock, &rds);

if (select (sock + 1, &rds, wdone ? NULL : &wds, NULL, NULL) == -1)
{
  if (errno == EINTR)
continue;

  assert (false);
}

// Try to minimize the chance of blocking the server by first processing
// the result and then sending more queries.
//
if (FD_ISSET (sock, &rds))
{
  if (PQconsumeInput (conn) == 0)
assert (false);

  while (PQisBusy (conn) == 0)
  {
//fprintf (stderr, "PQgetResult %zu\n", rn);

PGresult* res = PQgetResult (conn);
assert (res != NULL);
ExecStatusType stat = PQresultStatus (res);

if (stat == PGRES_PIPELINE_SYNC)
{
  assert (wdone && rn == n);
  PQclear (res);
  rdone = true;
  break;
}

if (stat == PGRES_FATAL_ERROR)
{
  const char* s = PQresultErrorField (res, PG_DIAG_SQLSTATE);

  if (strcmp (s, "23505") == 0)
fprintf (stderr, "duplicate id at %zu\n", rn);
}

PQclear (res);
assert (rn != n);
++rn;

// We get a NULL result after each query result.
//
{
  PGresult* end = PQgetResult (conn);
  assert (end == NULL);
}
  }
}

if (!wdone && FD_ISSET (sock, &wds))
{
  // Send queries until we get blocked (write-biased). This feels like
  // a better overall strategy to keep the server busy compared to
  // sending one query at a time and then re-checking if there is
  // anything to read because the results of INSERT/UPDATE/DELETE are
  // presumably small and quite a few of them can get buffered before
  // the server gets blocked.
  //
  for (;;)
  {
if (wn != n)
{
  //fprintf (stderr, "PQsendQueryPrepared %zu\n", wn);

  init (ds + wn);

  if (PQsendQueryPrepared (conn,
   "persist_object",
   (int)(columns),
   values,
   lengths,
   formats,
   1) == 0)
assert (false);

  if (++wn == n)
  {
if (PQpipelineSync (conn) == 0)
  assert (false);
  }
}

// PQflush() result:
//
//  0  --  success (queue is now empty)
//  1  --  blocked
// -1  --  error
//
int r = PQflush (conn);
assert (r != -1);

if (r == 0)
{
  if (wn != n)
  {
// If we continue here, then we are write-biased. And if we
// break, then we are read-biased.
//
#if 1
break;
#else
continue;
#endif
  }

  wdone = true;
}

break; // Blocked or done.
  }
}
  }

  if (PQexitPipelineMode (conn) == 0 ||
  PQsetnonblocking (conn, 0) == -1)
assert (false);
}

static void
test (PGconn* conn)
{
  const size_t batch = 500;
  struct data ds[batch];

  for (size_t i = 0; i != batch; ++i)
  {
ds[i].id = htonll (i == batch / 2 ? i - 1 : i); // Cause duplicate PK.
ds[i].idata = htonll (i);
ds[i].sdata = "abc";
  }

  // Prepare the statement.
  

Re: [PATCH] Pull general SASL framework out of SCRAM

2021-07-07 Thread Jacob Champion
On Wed, 2021-07-07 at 14:08 +0900, Michael Paquier wrote:
> On Tue, Jul 06, 2021 at 06:20:49PM +, Jacob Champion wrote:
> > On Mon, 2021-07-05 at 17:17 +0900, Michael Paquier wrote:
> > 
> > > Hmm.  Does the RFCs tell us anything about that?
> > 
> > Just in general terms:
> > 
> > >Each authentication exchange consists of a message from the client to
> > >the server requesting authentication via a particular mechanism,
> > >followed by one or more pairs of challenges from the server and
> > >responses from the client, followed by a message from the server
> > >indicating the outcome of the authentication exchange.  (Note:
> > >exchanges may also be aborted as discussed in Section 3.5.)
> > 
> > So a challenge must be met with a response, or the exchange must be
> > aborted. (And I don't think our protocol implementation provides a
> > client abort message; if something goes wrong, we just tear down the
> > connection.)
> 
> Thanks.  At the same time, section 3.5 also says that the client may
> send a message to abort.  So one can interpret that the client has
> also the choice to abort without sending a response back to the
> server?  Or I am just interpreting incorrectly the use of "may" in
> this context?

That's correct. But the client may not simply ignore the challenge and
keep the exchange open waiting for a new one, as pg_SASL_continue()
currently allows. That's what my TODO is referring to.

--Jacob



Re: [PATCH] Make jsonapi usable from libpq

2021-07-07 Thread Jacob Champion
On Wed, 2021-07-07 at 01:42 -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > It seems to me that this does not address yet the problems with the
> > palloc/pstrdup in jsonapi.c though, which would need to rely on
> > malloc() if we finish to use this code in libpq.  I am not sure yet
> > that we have any need to do that yet as we may finish by not using
> > OAUTH as SASL mechanism at the end in core.  So perhaps it would be
> > better to just give up on this thread for now?
> 
> Yeah, I think there's nothing to do here unless we decide that we
> have to have JSON-parsing ability inside libpq ... which is a
> situation I think we should try hard to avoid.

I'm working on a corrected version of the allocation for the OAuth
proof of concept, so we can see what it might look like there. I will
withdraw this one from the commitfest. Thanks for all the feedback!

--Jacob


Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-07 Thread Matthias van de Meent
On Wed, 7 Jul 2021 at 13:47, Masahiko Sawada  wrote:
>
> Hi all,
>
> Index vacuuming is one of the most time-consuming processes in lazy
> vacuuming. lazy_tid_reaped() is a large part among them. The attached
> the flame graph shows a profile of a vacuum on a table that has one index
> and 80 million live rows and 20 million dead rows, where
> lazy_tid_reaped() accounts for about 47% of the total vacuum execution
> time.
>
> [...]
>
> Overall, 'rtbm' has a much better lookup performance and good memory
> usage especially if there are relatively many dead tuples. However, in
> some cases, 'intset' and 'array' have a better memory usage.

Those are some great results, with a good path to meaningful improvements.

> Feedback is very welcome. Thank you for reading the email through to the end.

The current available infrastructure for TIDs is quite ill-defined for
TableAM authors [0], and other TableAMs might want to use more than
just the 11 bits in use by max-BLCKSZ HeapAM MaxHeapTuplesPerPage to
identify tuples. (MaxHeapTuplesPerPage is 1169 at the maximum 32k
BLCKSZ, which requires 11 bits to fit).
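
(Just to make the arithmetic explicit, a trivial standalone computation of
how many bits those offset-number maxima need:)

#include <stdio.h>

/* number of bits needed to store values up to and including "max" */
static int
bits_needed(unsigned int max)
{
	int		bits = 0;

	while (max > 0)
	{
		bits++;
		max >>= 1;
	}
	return bits;
}

int
main(void)
{
	printf("%d\n", bits_needed(1169));	/* 11 bits */
	printf("%d\n", bits_needed(65535));	/* 16 bits: full uint16 range */
	return 0;
}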

Could you also check what the (performance, memory) impact would be if
these proposed structures were to support the maximum
MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset
numbers that could be supported by our current TID struct?

Kind regards,

Matthias van de Meent

[0] 
https://www.postgresql.org/message-id/flat/0bbeb784050503036344e1f08513f13b2083244b.camel%40j-davis.com




Re: Hook for extensible parsing.

2021-07-07 Thread Julien Rouhaud
On Wed, Jul 7, 2021 at 8:45 PM Jim Mlodgenski  wrote:
>
> The test module is very useful to show how to use the hook but it isn't
> very useful to the general user like most other things in contrib. It probably
> fits better in src/test/modules

I agree that it's not useful at all to eventually have it as a
contrib, but it's somewhat convenient at this stage to be able to
easily test the hook, possibly with different behavior.

But as I said, if there's an agreement on the approach and the
implementation, I don't think that it would make sense to keep it even
in the src/test/modules.  A full bison parser, even with a limited
grammar, will have about 99% of noise when it comes to demonstrate how
the hook is supposed to work, which basically is having a "single
query" parser or a "full input string" parser.  I'm not even convinced
that flex/bison will be the preferred choice for someone who wants to
implement a custom parser.

I tried to add really thorough comments in the various parts of the
patch to make it clear how to do that and how the system will react
depending on what a hook does.  I also added some protection to catch
inconsistent hook implementation.  I think that's the best way to help
external parser authors to implement what they want, and I'll be happy
to improve the comments if necessary.  But if eventually people would
like to have a real parser in the tree, for testing or guidance, I
will of course take care of doing the required changes and moving the
demo parser in src/test/modules.




PostgreSQL 14 backend crash on incorrect trigger

2021-07-07 Thread Alexander Pyhalov

Hi.

The following test case makes the PostgreSQL backend crash. The trigger is 
incorrect, but this didn't crash PostgreSQL before


commit 86dc90056dfdbd9d1b891718d2e5614e3e432f35 (HEAD)
Author: Tom Lane 
Date:   Wed Mar 31 11:52:34 2021 -0400

Rework planning and execution of UPDATE and DELETE.


Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x7f1032cc6905 in find_modifytable_subplan (root=0x563f05e61458, 
plan=0x563f06020e48, rtindex=1, subplan_index=0) at postgres_fdw.c:2373
2373            else if (IsA(subplan, Result) && IsA(outerPlan(subplan), Append))

(gdb) bt
#0  0x7f1032cc6905 in find_modifytable_subplan (root=0x563f05e61458, 
plan=0x563f06020e48, rtindex=1, subplan_index=0) at postgres_fdw.c:2373
#1  0x7f1032cc6a44 in postgresPlanDirectModify (root=0x563f05e61458, 
plan=0x563f06020e48, resultRelation=1, subplan_index=0) at 
postgres_fdw.c:2433
#2  0x563f035f2876 in make_modifytable (root=0x563f05e61458, 
subplan=0x563f05e626e8, operation=CMD_DELETE, canSetTag=true, 
nominalRelation=1, rootRelation=0, partColsUpdated=false, 
resultRelations=0x563f06020b88, updateColnosLists=0x0, 
withCheckOptionLists=0x0,
returningLists=0x0, rowMarks=0x0, onconflict=0x0, epqParam=0) at 
createplan.c:7007
#3  0x563f035e9ab3 in create_modifytable_plan (root=0x563f05e61458, 
best_path=0x563f05e62168) at createplan.c:2746
#4  0x563f035e5936 in create_plan_recurse (root=0x563f05e61458, 
best_path=0x563f05e62168, flags=1) at createplan.c:530
#5  0x563f035e54be in create_plan (root=0x563f05e61458, 
best_path=0x563f05e62168) at createplan.c:347
#6  0x563f035f8810 in standard_planner (parse=0x563f05e60c08, 
query_string=0x563f05ffd8d0 "DELETE FROM test_remote WHERE 
row(i,j)=row(new.i,new.j)", cursorOptions=2048, 
boundParams=0x563f05e980d8) at planner.c:407
#7  0x563f035f84f4 in planner (parse=0x563f05e60c08, 
query_string=0x563f05ffd8d0 "DELETE FROM test_remote WHERE 
row(i,j)=row(new.i,new.j)", cursorOptions=2048, 
boundParams=0x563f05e980d8) at planner.c:271
#8  0x563f0373f9c7 in pg_plan_query (querytree=0x563f05e60c08, 
query_string=0x563f05ffd8d0 "DELETE FROM test_remote WHERE 
row(i,j)=row(new.i,new.j)", cursorOptions=2048, 
boundParams=0x563f05e980d8) at postgres.c:847
#9  0x563f0373fb15 in pg_plan_queries (querytrees=0x563f05e60bb0, 
query_string=0x563f05ffd8d0 "DELETE FROM test_remote WHERE 
row(i,j)=row(new.i,new.j)", cursorOptions=2048, 
boundParams=0x563f05e980d8) at postgres.c:939
#10 0x563f038dade2 in BuildCachedPlan (plansource=0x563f06027590, 
qlist=0x563f05e60bb0, boundParams=0x563f05e980d8, queryEnv=0x0) at 
plancache.c:936
#11 0x563f038db4d8 in GetCachedPlan (plansource=0x563f06027590, 
boundParams=0x563f05e980d8, owner=0x563f05f6c478, queryEnv=0x0) at 
plancache.c:1218
#12 0x563f03542235 in _SPI_execute_plan (plan=0x563f060c9ac0, 
paramLI=0x563f05e980d8, snapshot=0x0, crosscheck_snapshot=0x0, 
read_only=false, allow_nonatomic=false, fire_triggers=true, tcount=0, 
caller_dest=0x0, plan_owner=0x563f05f6c478) at spi.c:2405
#13 0x563f0353ebb9 in SPI_execute_plan_with_paramlist 
(plan=0x563f060c9ac0, params=0x563f05e980d8, read_only=false, tcount=0) 
at spi.c:651
#14 0x7f1032c2f504 in exec_stmt_execsql (estate=0x7ffcab5c6420, 
stmt=0x563f05bfb868) at pl_exec.c:4214
#15 0x7f1032c2a9bc in exec_stmts (estate=0x7ffcab5c6420, 
stmts=0x563f05bfb8c0) at pl_exec.c:2059
#16 0x7f1032c2b854 in exec_stmt_if (estate=0x7ffcab5c6420, 
stmt=0x563f06028700) at pl_exec.c:2481
#17 0x7f1032c2a842 in exec_stmts (estate=0x7ffcab5c6420, 
stmts=0x563f06028758) at pl_exec.c:2003
#18 0x7f1032c2a579 in exec_stmt_block (estate=0x7ffcab5c6420, 
block=0x563f060288c0) at pl_exec.c:1910
#19 0x7f1032c29cac in exec_toplevel_block (estate=0x7ffcab5c6420, 
block=0x563f060288c0) at pl_exec.c:1608
#20 0x7f1032c28730 in plpgsql_exec_trigger (func=0x563f05f6d940, 
trigdata=0x7ffcab5c6880) at pl_exec.c:1024
#21 0x7f1032c43319 in plpgsql_call_handler (fcinfo=0x7ffcab5c6700) 
at pl_handler.c:268
#22 0x563f034a3e2a in ExecCallTriggerFunc (trigdata=0x7ffcab5c6880, 
tgindx=0, finfo=0x563f05e7aa60, instr=0x0, 
per_tuple_context=0x563f05c763d0) at trigger.c:2141
#23 0x563f034a7674 in AfterTriggerExecute (estate=0x563f05e7a2b0, 
event=0x563f05e1a580, relInfo=0x563f05e7a738, trigdesc=0x563f05e7a950, 
finfo=0x563f05e7aa60, instr=0x0, per_tuple_context=0x563f05c763d0, 
trig_tuple_slot1=0x0, trig_tuple_slot2=0x0) at trigger.c:4034
#24 0x563f034a7b8d in afterTriggerInvokeEvents 
(events=0x563f05f3c540, firing_id=1, estate=0x563f05e7a2b0, 
delete_ok=false) at trigger.c:4250
#25 0x563f034a8353 in AfterTriggerEndQuery (estate=0x563f05e7a2b0) 
at trigger.c:4587
#26 0x563f034dc8eb in standard_ExecutorFinish 
(queryDesc=0x563f060279a0) at execMain.c:436
#27 0x563f034dc7c5 in ExecutorFinish (queryDesc=0x563f060279a0) at 
execMain.c:404
#28 0x563f03745edc in ProcessQuery (plan=0x563f05fc8100, 
sourceText=0x563f05ad74a0 

Why ALTER SUBSCRIPTION ... SET (slot_name='none') requires subscription disabled?

2021-07-07 Thread Japin Li

Hi, hackers

The documentation [1] says:

When dropping a subscription that is associated with a replication slot on the
remote host (the normal state), DROP SUBSCRIPTION will connect to the remote
host and try to drop the replication slot as part of its operation. This is
necessary so that the resources allocated for the subscription on the remote
host are released. If this fails, either because the remote host is not
reachable or because the remote replication slot cannot be dropped or does not
exist or never existed, the DROP SUBSCRIPTION command will fail. To proceed in
this situation, disassociate the subscription from the replication slot by
executing ALTER SUBSCRIPTION ... SET (slot_name = NONE).

However, when I try this, it complains that the subscription is enabled and that
this command requires the subscription to be disabled. Why do we need this limitation?

In src/backend/commands/subscriptioncmds.c:

    if (IsSet(opts.specified_opts, SUBOPT_SLOT_NAME))
    {
        if (sub->enabled && !opts.slot_name)
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("cannot set %s for enabled subscription",
                            "slot_name = NONE")));

        if (opts.slot_name)
            values[Anum_pg_subscription_subslotname - 1] =
                DirectFunctionCall1(namein, CStringGetDatum(opts.slot_name));
        else
            nulls[Anum_pg_subscription_subslotname - 1] = true;
        replaces[Anum_pg_subscription_subslotname - 1] = true;
    }


OTOH, when I execute ALTER SUBSCRIPTION ... SET (slot_name = ''), it doesn't
complain. However, SELECT pg_create_logical_replication_slot('', 'pgoutput')
complains that the slot name is too short. Although the slot will be created at
the publisher, which validates the slot name, IMO we can also validate the
slot_name in parse_subscription_options() to report the error early.
The attached patch fixes this. Any thoughts?

[1] https://www.postgresql.org/docs/current/sql-dropsubscription.html


-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index eb88d877a5..fa07203b03 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -173,6 +173,8 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 			opts->specified_opts |= SUBOPT_SLOT_NAME;
 			opts->slot_name = defGetString(defel);
 
+			ReplicationSlotValidateName(opts->slot_name, ERROR);
+
 			/* Setting slot_name = NONE is treated as no slot name. */
 			if (strcmp(opts->slot_name, "none") == 0)
 opts->slot_name = NULL;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 57f7dd9b0a..d1ff3531a4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -85,6 +85,8 @@ ERROR:  invalid connection string syntax: missing "=" after "foobar" in connecti
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
 ALTER SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist2';
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = 'newname');
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = '');
+ERROR:  replication slot name "" is too short
 -- fail
 ALTER SUBSCRIPTION regress_doesnotexist CONNECTION 'dbname=regress_doesnotexist2';
 ERROR:  subscription "regress_doesnotexist" does not exist
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098c14..1500a96e2b 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -64,6 +64,7 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
 ALTER SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist2';
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = 'newname');
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = '');
 
 -- fail
 ALTER SUBSCRIPTION regress_doesnotexist CONNECTION 'dbname=regress_doesnotexist2';


Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors

2021-07-07 Thread Yugo NAGATA
On Wed, 07 Jul 2021 21:50:16 +0900 (JST)
Tatsuo Ishii  wrote:

> >> Well, "that's very little, let's ignore it" is not technically a right
> >> direction IMO.
> > 
> > Hmmm, It seems to me these failures are ignorable because with regard to 
> > failures
> > due to -T they occur only the last transaction of each client and do not 
> > affect
> > the result such as TPS and latency of successfully processed transactions.
> > (although I am not sure for what sense you use the word "technically"...)
> 
> "My application button does not respond once in 100 times. It's just
> 1% error rate. You should ignore it." I would say this attitude is not
> technically correct.

I cannot understand what you want to say. Can reporting the number of transactions
that failed intentionally be treated the same as the error rate of your
application's button?

> > However, maybe I am missing something. Could you please tell me what do you 
> > think
> > the actual harm for users about failures due to -D is?
> 
> I don't know why you are referring to -D.

Sorry. It's just a typo, as you can imagine.
I am asking what you think the actual harm for users is when retrying is
terminated due to the -T option.

> >> That's necessarily true in practice. By the time when -T is about to
> >> expire, transactions are all finished in finite time as you can see
> >> the result I showed. So it's reasonable that the very last cycle of
> >> the benchmark will finish in finite time as well.
> > 
> > Your script may finish in finite time, but others may not.
> 
> That's why I said "practically". In other words "in most cases the
> scenario will finish in finite time".

Sure.

> > Indeed, it is possible an execution of a query takes a long or infinite
> > time. However, its cause would a problematic query in the custom script
> > or other problems occurs on the server side. These are not problem of
> > pgbench and, pgbench itself can't control either. On the other hand, the
> > unlimited number of tries is a behaviours specified by the pgbench option,
> > so I think pgbench itself should internally avoid problems caused from its
> > behaviours. That is, if max-tries=0 could cause infinite or much longer
> > benchmark time more than user expected due to too many retries, I think
> > pgbench should avoid it.
> 
> I would say that's user's responsibility to avoid infinite running
> benchmarking. Remember, pgbench is a tool for serious users, not for
> novice users.

Of course, users themselves should be careful about problematic scripts, but it
would be better for pgbench to avoid such problems beforehand if it can.
 
> Or, we should terminate the last cycle of benchmark regardless it is
> retrying or not if -T expires. This will make pgbench behaves much
> more consistent.

Hmmm, indeed this might make the behaviour a bit more consistent, but I am not
sure such a behavioural change benefits users.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-07-07 Thread Dean Rasheed
On Wed, 7 Jul 2021 at 03:47, David Rowley  wrote:
>
> Updated patch attached.
>

Hmm, this looked easy, but...

It occurred to me that there ought to be regression tests for the edge
cases where it steps from one unit to the next. So, in the style of
the existing regression tests, I tried the following:

SELECT size, pg_size_pretty(size), pg_size_pretty(-1 * size) FROM
(VALUES (10239::bigint), (10240::bigint),
(10485247::bigint), (10485248::bigint),
(10736893951::bigint), (10736893952::bigint),
(10994579406847::bigint), (10994579406848::bigint),
(11258449312612351::bigint), (11258449312612352::bigint)) x(size);

              size | pg_size_pretty | pg_size_pretty
-------------------+----------------+----------------
             10239 | 10239 bytes    | -10239 bytes
             10240 | 10 kB          | -10 kB
          10485247 | 10239 kB       | -10 MB
          10485248 | 10 MB          | -10 MB
       10736893951 | 10239 MB       | -10 GB
       10736893952 | 10 GB          | -10 GB
    10994579406847 | 10239 GB       | -10 TB
    10994579406848 | 10 TB          | -10 TB
 11258449312612351 | 10239 TB       | -10239 TB
 11258449312612352 | 10240 TB       | -10239 TB
(10 rows)

SELECT size, pg_size_pretty(size), pg_size_pretty(-1 * size) FROM
(VALUES (10239::numeric), (10240::numeric),
(10485247::numeric), (10485248::numeric),
(10736893951::numeric), (10736893952::numeric),
(10994579406847::numeric), (10994579406848::numeric),
(11258449312612351::numeric), (11258449312612352::numeric)) x(size);

              size | pg_size_pretty | pg_size_pretty
-------------------+----------------+----------------
             10239 | 10239 bytes    | -10239 bytes
             10240 | 10 kB          | -10 kB
          10485247 | 10239 kB       | -10239 kB
          10485248 | 10 MB          | -10 MB
       10736893951 | 10239 MB       | -10239 MB
       10736893952 | 10 GB          | -10 GB
    10994579406847 | 10239 GB       | -10239 GB
    10994579406848 | 10 TB          | -10 TB
 11258449312612351 | 10239 TB       | -10239 TB
 11258449312612352 | 10240 TB       | -10240 TB
(10 rows)

Under the assumption that what we're trying to achieve here is
schoolbook rounding (ties away from zero), the numeric results are
correct and the bigint results are wrong.

The reason is that bit shifting isn't the same as division for
negative numbers, since bit shifting rounds towards negative infinity
whereas division rounds towards zero (truncates), which is what I
think we really need.
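
To make that concrete, a tiny standalone C sketch (not the pg_size_pretty()
code itself; note that right-shifting a negative value is technically
implementation-defined in C, though common compilers do an arithmetic shift):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    int64_t     size = -10485247;   /* one of the values from the tables above */

    /* bit shift rounds towards negative infinity */
    printf("size >> 10  = %lld\n", (long long) (size >> 10));   /* -10240 */

    /* integer division truncates towards zero */
    printf("size / 1024 = %lld\n", (long long) (size / 1024));  /* -10239 */

    return 0;
}

That difference is roughly why the bigint column above jumps to -10 MB while
the numeric column still reports -10239 kB.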

Regards,
Dean




Re: Warn if initdb's --sync-only option is mixed with other options

2021-07-07 Thread Daniel Gustafsson
> On 7 Jul 2021, at 04:23, Gurjeet Singh  wrote:

> I'm not able to come up with an exact situation to prove this, but
> this behaviour seems potentially dangerous. The user might mix the
> --sync-only option with other options, but would be extremely
> surprised if those other options didn't take effect.

Is there a plausible real-world situation where a user runs --sync-only
together with other arguments, misses the fact that the other arguments
didn't take effect, and suffers bad consequences?

> I _think_ we should throw an error if the user specifies any options
> that are being ignored. But an error might break someone's automation
> (perhaps for their own good), since the current behaviour has been in
> place for a very long time, so I'm willing to settle for at least a
> warning in such a case.

We typically don't issue warnings for incompatible arguments, but rather error
out, and I'm not convinced this warrants breaking that.  If we are going to do
anything I think we should error out; if we decide to do something then we
consider the scripts that will break to already be broken.

A slightly confusing aspect of this, however, is that the error message for
--sync-only when -D or PGDATA isn't set says "will reside" when in fact it
should say "is residing" (or something along those lines):

  $ ./bin/initdb --sync-only
  initdb: error: no data directory specified
  You must identify the directory where the data for this database system
  will reside.  Do this with either the invocation option -D or the
  environment variable PGDATA.

I doubt it's worth complicating the code for this fringe case though.

--
Daniel Gustafsson   https://vmware.com/





Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

2021-07-07 Thread Ranier Vilela
>You can check the more details in the attached patch. Any feedback is
welcome.

I have tiny comments about your patch:

1. name of file is uniquekey.c?

+ * pathkeys.c
+ *  Utilities for maintaining uniquekey.

2. Variable "PathKey *pathkey" at function: add_uniquekey_for_uniqueindex,
can have scope reduced.

+ indexpr_item = list_head(unique_index->indexprs);
+ for (c = 0; c < unique_index->nkeycolumns; c++)
+ {
+ PathKey *pathkey;

3. Variable "int c = 0" has a redundant initialization at function
add_uniquekey_for_uniqueindex.

4. Is there a misspelled word here?

"/* We can't *guarantee* an FuncExpr will not return NULLs */"

5. Variable "int i = -1" has a redundant initialization at function
uniquekey_contains_in.

6. __attribute__ ((unused)) at function build_composited_uniquekey is
incompatible with MSVC.

7. Postgres uses a newline after variable declarations.

regards,

Ranier Vilela


Re: PostgreSQL-13.3 parser.y with positional references by named references

2021-07-07 Thread Dagfinn Ilmari Mannsåker
Peter Eisentraut  writes:

> On 04.07.21 17:58, Tom Lane wrote:
>> Domingo Alvarez Duarte  writes:
>>> Here https://gist.github.com/mingodad/49291e0e9505522c66fcd3fcea4a939d I
>>> posted the postgresql-13.3/src/backend/parser/gram.y with positional
>>> references by named references that is supported by bison for some time now.
>>
>> When is "some time now"?
>
> release 2.5 (2011-05-14)

Do we support building on RHEL6? It only ships bison 2.4, so that would
mean people building on that would have to install it separately.

- ilmari




Re: track_planning causing performance regression

2021-07-07 Thread Julien Rouhaud
On Wed, Jul 7, 2021 at 8:57 PM Fujii Masao  wrote:
>
> Pushed. Thanks!

Thanks!




Re: track_planning causing performance regression

2021-07-07 Thread Fujii Masao




On 2021/07/07 18:09, Julien Rouhaud wrote:

On Thu, Jul 1, 2021 at 4:28 PM Fujii Masao  wrote:


I'm fine with this. So what about the following diff? I added  tag.

 pg_stat_statements.track_planning controls whether
 planning operations and duration are tracked by the module.
 Enabling this parameter may incur a noticeable performance penalty,
-  especially when a fewer kinds of queries are executed on many
-  concurrent connections.
+  especially when statements with identical query structure are executed
+  by many concurrent connections which compete to update a small number of
+  pg_stat_statements entries.
 The default value is off.
 Only superusers can change this setting.


It seems perfect, thanks!


Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors

2021-07-07 Thread Tatsuo Ishii
>> Well, "that's very little, let's ignore it" is not technically a right
>> direction IMO.
> 
> Hmmm, It seems to me these failures are ignorable because with regard to 
> failures
> due to -T they occur only the last transaction of each client and do not 
> affect
> the result such as TPS and latency of successfully processed transactions.
> (although I am not sure for what sense you use the word "technically"...)

"My application button does not respond once in 100 times. It's just
1% error rate. You should ignore it." I would say this attitude is not
technically correct.

> However, maybe I am missing something. Could you please tell me what do you 
> think
> the actual harm for users about failures due to -D is?

I don't know why you are referring to -D.

>> That's necessarily true in practice. By the time when -T is about to
>> expire, transactions are all finished in finite time as you can see
>> the result I showed. So it's reasonable that the very last cycle of
>> the benchmark will finish in finite time as well.
> 
> Your script may finish in finite time, but others may not.

That's why I said "practically". In other words "in most cases the
scenario will finish in finite time".

> Indeed, it is possible an execution of a query takes a long or infinite
> time. However, its cause would a problematic query in the custom script
> or other problems occurs on the server side. These are not problem of
> pgbench and, pgbench itself can't control either. On the other hand, the
> unlimited number of tries is a behaviours specified by the pgbench option,
> so I think pgbench itself should internally avoid problems caused from its
> behaviours. That is, if max-tries=0 could cause infinite or much longer
> benchmark time more than user expected due to too many retries, I think
> pgbench should avoid it.

I would say that's user's responsibility to avoid infinite running
benchmarking. Remember, pgbench is a tool for serious users, not for
novice users.

Or, we should terminate the last cycle of benchmark regardless it is
retrying or not if -T expires. This will make pgbench behaves much
more consistent.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: Hook for extensible parsing.

2021-07-07 Thread Jim Mlodgenski
On Wed, Jul 7, 2021 at 5:26 AM Julien Rouhaud  wrote:
>
> Also, if this patch is eventually committed and having some code to
> experience the hook is wanted it would probably be better to have a
> very naive parser (based on a few strcmp() calls or something like
> that) to validate the behavior rather than having a real parser.
>

The test module is very useful to show how to use the hook but it isn't
very useful to the general user like most other things in contrib. It probably
fits better in src/test/modules




Re: Cosmic ray hits integerset

2021-07-07 Thread Joe Conway

On 7/7/21 2:53 AM, Jakub Wartak wrote:

Hi, asking out of pure technical curiosity about "the rhinoceros": what kind
of animal is it? A physical box or a VM? How could one get dmidecode(1) /
dmesg(1) / mcelog(1) output from it (e.g. does it run ECC or not)?



Rhinoceros is just a VM on a simple desktop machine. Nothing fancy.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: Incorrect usage of strtol, atoi for non-numeric junk inputs

2021-07-07 Thread Bharath Rupireddy
On Fri, Jun 4, 2021 at 10:23 PM Alvaro Herrera  wrote:
>
> On 2021-Jun-04, Bharath Rupireddy wrote:
>
> > On Fri, Jun 4, 2021 at 8:58 PM Alvaro Herrera  
> > wrote:
>
> > > I would suggest that the best way forward in this area is to rebase both
> > > these patches on current master.
> >
> > Thanks. I will read both the threads [1], [2] and try to rebase the
> > patches. If at all I get to rebase them, do you prefer the patches to
> > be in this thread or in a new thread?
>
> Thanks, that would be helpful.  This thread is a good place.

I'm unable to spend time on this work as promised. I'd be happy if
someone could take it forward, although it's not critical work (IMO)
that needs immediate focus. I will try to spend time maybe later this
year.

Regards,
Bharath Rupireddy.




Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation

2021-07-07 Thread Bharath Rupireddy
On Thu, May 27, 2021 at 10:28 PM Bharath Rupireddy
 wrote:
> I'm not taking the patch, attaching v5 again here to make cfbot happy
> and for further review.

Attaching v6 patch rebased onto the latest master.

Regards,
Bharath Rupireddy.


v6-0001-Improve-publication-error-messages.patch
Description: Binary data


Re: Case expression pushdown

2021-07-07 Thread Gilles Darold

On 22/06/2021 at 15:39, Alexander Pyhalov wrote:

Seino Yuki wrote on 2021-06-22 16:03:

On 2021-06-16 01:29, Alexander Pyhalov wrote:

Hi.

Ashutosh Bapat wrote on 2021-06-15 16:24:
Looks quite useful to me. Can you please add this to the next 
commitfest?




Addded to commitfest. Here is an updated patch version.


Thanks for posting the patch.
I agree with this content.


+ Foreign Scan on public.ft2 (cost=156.58..165.45 rows=394 width=14)

It's not a big issue, but is there any intention behind the pattern of
outputting costs in regression tests?


Hi.

No, I don't think it makes much sense. Updated tests (also added case 
with empty else).



The patch doesn't apply to master anymore, so I've attached an updated
version. This is your patch rebased and otherwise untouched, minus a comment
in the test, and renamed to v4.

I could have missed something, but I don't think the additional struct
elements case_args in structs foreign_loc_cxt and deparse_expr_cxt are
necessary. They appear to be useless.

The patch would also be clearer if the CaseWhen node were handled
separately in foreign_expr_walker() instead of inside the T_CaseExpr case.
That way the T_CaseExpr case just needs to call foreign_expr_walker()
recursively. I also think the code in T_CaseTestExpr should just check the
collation; there is nothing more to do there, as you noted in your comment
on deparseCaseTestExpr(). That function can be removed, since it does
nothing once the case_args elements are removed.



There is a problem in the regression test with nested CASE clauses:

   EXPLAIN (VERBOSE, COSTS OFF)
   SELECT c1,c2,c3 FROM ft2 WHERE CASE CASE WHEN c2 > 0 THEN c2 END
   WHEN 100 THEN 601 WHEN c2 THEN c2 ELSE 0 END > 600 ORDER BY c1;

the original query uses "WHERE CASE CASE WHEN" but the remote query in the
plan is not the same:


   Remote SQL: SELECT "C 1", c2, c3 FROM "S 1"."T 1" WHERE (((CASE WHEN
   ((CASE WHEN (c2 > 0) THEN c2 ELSE NULL::integer END) = 100) THEN 601
   WHEN ((CASE WHEN (c2 > 0) THEN c2 ELSE NULL::integer END) = c2) THEN
   c2 ELSE 0 END) > 600)) ORDER BY "C 1" ASC NULLS LAST

Here this is "WHERE (((CASE WHEN ((CASE WHEN" I expected it to be 
unchanged to "WHERE (((CASE (CASE WHEN".



Also, I would like the following regression tests to be added. They test
that a CASE clause inside an aggregate function is pushed down along with
the aggregate function itself. This was the original use case that I wanted
to address with this feature.


   -- CASE in aggregate function, both must be pushed down
   EXPLAIN (VERBOSE, COSTS OFF)
   SELECT sum(CASE WHEN mod(c1, 4) = 0 THEN 1 ELSE 2 END) FROM ft1;
   -- Same but without the ELSE clause
   EXPLAIN (VERBOSE, COSTS OFF)
   SELECT sum(CASE WHEN mod(c1, 4) = 0 THEN 1 END) FROM ft1;


For convenience I'm attaching a new patch v5 that changes the code
following my comments above, fixes the nested CASE issue, and adds more
regression tests.



Best regards,

--
Gilles Darold

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 31919fda8c..f2441d4945 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -87,6 +87,7 @@ typedef struct foreign_loc_cxt
 {
 	Oid			collation;		/* OID of current collation, if any */
 	FDWCollateState state;		/* state of current collation choice */
+	Expr	   *case_arg;		/* the last case arg to inspect */
 } foreign_loc_cxt;
 
 /*
@@ -101,6 +102,7 @@ typedef struct deparse_expr_cxt
  * a base relation. */
 	StringInfo	buf;			/* output buffer to append to */
 	List	  **params_list;	/* exprs that will become remote Params */
+	List	   *case_args;		/* list of args to deparse CaseTestExpr */
 } deparse_expr_cxt;
 
 #define REL_ALIAS_PREFIX	"r"
@@ -186,6 +188,9 @@ static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 	deparse_expr_cxt *context);
 
+static void deparseCaseExpr(CaseExpr *node, deparse_expr_cxt *context);
+static void deparseCaseTestExpr(CaseTestExpr *node, deparse_expr_cxt *context);
+
 /*
  * Helper functions
  */
@@ -254,6 +259,7 @@ is_foreign_expr(PlannerInfo *root,
 		glob_cxt.relids = baserel->relids;
 	loc_cxt.collation = InvalidOid;
 	loc_cxt.state = FDW_COLLATE_NONE;
+	loc_cxt.case_arg = NULL;
 	if (!foreign_expr_walker((Node *) expr, &glob_cxt, &loc_cxt))
 		return false;
 
@@ -312,6 +318,7 @@ foreign_expr_walker(Node *node,
 	/* Set up inner_cxt for possible recursion to child nodes */
 	inner_cxt.collation = InvalidOid;
 	inner_cxt.state = FDW_COLLATE_NONE;
+	inner_cxt.case_arg = outer_cxt->case_arg;
 
 	switch (nodeTag(node))
 	{
@@ -509,6 +516,62 @@ foreign_expr_walker(Node *node,
 	state = FDW_COLLATE_UNSAFE;
 			}
 			break;
+		case T_CaseExpr:
+			{
+CaseExpr   *ce = (CaseExpr *) node;
+ListCell   *arg;
+
+if (ce->arg)
+	inner_cxt.case_arg = ce->arg;
+ 
+foreach(arg, ce->args)
+{
+	CaseWhen   *w = lfirst_node(CaseWhen, arg);

[PoC] Improve dead tuple storage for lazy vacuum

2021-07-07 Thread Masahiko Sawada
Hi all,

Index vacuuming is one of the most time-consuming processes in lazy
vacuuming. lazy_tid_reaped() is a large part among them. The attached
the flame graph shows a profile of a vacuum on a table that has one index
and 80 million live rows and 20 million dead rows, where
lazy_tid_reaped() accounts for about 47% of the total vacuum execution
time.

lazy_tid_reaped() is essentially an existence check; for every index
tuple, it checks if the TID of the heap it points to exists in the set
of TIDs of dead tuples. The maximum size of dead tuples is limited by
maintenance_work_mem, and if the upper limit is reached, the heap scan
is suspended, index vacuum and heap vacuum are performed, and then
heap scan is resumed again. Therefore, in terms of the performance of
index vacuuming, there are two important factors: the performance of
lookup TIDs from the set of dead tuples and its memory usage. The
former is obvious whereas the latter affects the number of Index
vacuuming. In many index AMs, index vacuuming (i.e., ambulkdelete)
performs a full scan of the index, so it is important in terms of
performance to avoid index vacuuming from being executed more than
once during lazy vacuum.

Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:

1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].

2. Allocate the whole memory space at once.

3. Slow lookup performance (O(logN)).
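
For reference, a condensed sketch of that existing lookup (modeled on
lazy_tid_reaped() and vac_cmp_itemptr() in vacuumlazy.c, simplified and with
the dead-tuple array passed in directly; assumes the usual PostgreSQL headers
plus <stdlib.h> for bsearch()):

/* Compare two TIDs by (block, offset). */
static int
vac_cmp_itemptr(const void *left, const void *right)
{
    BlockNumber lblk = ItemPointerGetBlockNumber((ItemPointer) left);
    BlockNumber rblk = ItemPointerGetBlockNumber((ItemPointer) right);
    OffsetNumber loff,
                roff;

    if (lblk != rblk)
        return (lblk < rblk) ? -1 : 1;

    loff = ItemPointerGetOffsetNumber((ItemPointer) left);
    roff = ItemPointerGetOffsetNumber((ItemPointer) right);
    if (loff != roff)
        return (loff < roff) ? -1 : 1;
    return 0;
}

/* Existence check: is this index tuple's heap TID among the dead TIDs? */
static bool
lazy_tid_reaped(ItemPointer itemptr, ItemPointerData *dead_tuples, int num_dead_tuples)
{
    return bsearch(itemptr, dead_tuples, num_dead_tuples,
                   sizeof(ItemPointerData), vac_cmp_itemptr) != NULL;
}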

I’ve done some experiments in this area and would like to share the
results and discuss ideas.

Problems Solutions
===

Firstly, I've considered using existing data structures:
IntegerSet(src/backend/lib/integerset.c)  and
TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but
only either point 2 or 3. IntegerSet uses lower memory thanks to
simple-8b encoding but is slow at lookup, still O(logN), since it’s a
tree structure. On the other hand, TIDBitmap has a good lookup
performance, O(1), but could unnecessarily use larger memory in some
cases since it always allocates the space for bitmap enough to store
all possible offsets. With 8kB blocks, the maximum number of line
pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the
bitmap is 40 bytes long and we always need 46 bytes in total per block
including other meta information.
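
(The 40 bytes come from rounding the bitmap up to whole bitmapwords; a quick
sketch of that arithmetic, assuming a 64-bit bitmapword -- a 32-bit one gives
the same 40 bytes here. The remaining bytes of the 46 are per-block metadata
in the page table entry, not shown.)

#include <stdio.h>

int
main(void)
{
    int     max_tuples_per_page = 291;  /* MaxHeapTuplesPerPage with 8kB blocks */
    int     bits_per_word = 64;         /* assumed bitmapword width */
    int     words = (max_tuples_per_page + bits_per_word - 1) / bits_per_word;

    printf("bitmap bytes per heap page: %d\n", words * 8);  /* 40 */
    return 0;
}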

So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3]  (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.

The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.

For example, if there are two dead tuples at offset 1 and 150, it uses
the array container that has an array of two 2-byte integers
representing 1 and 150, using 4 bytes in total. If we used the bitmap
container in this case, we would need 20 bytes instead. On the other
hand, if there are consecutive 20 dead tuples from offset 1 to 20, it
uses the run container that has an array of 2-byte integers. The first
value in each pair represents a starting offset number, whereas the
second value represents its length. Therefore, in this case, the run
container uses only 4 bytes in total. Finally, if there are dead
tuples at every other offset from 1 to 100, it uses the bitmap
container that has an uncompressed bitmap, using 13 bytes. We need
another 16 bytes per block entry for hash table entry.

The lookup complexity of a bitmap container is O(1), whereas that of
an array or a run container is O(N) or O(logN); but the number of
elements in those two containers should not be large, so it would not be a
problem.
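
To make the choice among the three container types concrete, here is a
hypothetical sketch of how a per-block encoder could pick the smallest
representation. The type names, the fixed-size bitmap sizing (based on
MaxHeapTuplesPerPage; the stored bitmap can still be truncated at the highest
set bit), and the tie-breaking are illustrative assumptions, not the actual
code of the prototype:

#include <stddef.h>
#include <stdint.h>

typedef enum RTbmContainerType
{
    RTBM_CONTAINER_ARRAY,       /* sorted uint16 offsets, 2 bytes each */
    RTBM_CONTAINER_BITMAP,      /* one bit per possible offset, O(1) lookup */
    RTBM_CONTAINER_RUN          /* (start, length) uint16 pairs */
} RTbmContainerType;

/*
 * Pick the cheapest encoding for one heap block, given the number of dead
 * offsets, the number of consecutive runs they form, and the largest offset
 * number that can appear on a page (MaxHeapTuplesPerPage).
 */
static RTbmContainerType
rtbm_choose_container(int noffsets, int nruns, int max_offsets_per_page)
{
    size_t      array_bytes = (size_t) noffsets * sizeof(uint16_t);
    size_t      run_bytes = (size_t) nruns * 2 * sizeof(uint16_t);
    size_t      bitmap_bytes = (size_t) (max_offsets_per_page + 7) / 8;

    if (bitmap_bytes <= array_bytes && bitmap_bytes <= run_bytes)
        return RTBM_CONTAINER_BITMAP;
    if (run_bytes < array_bytes)
        return RTBM_CONTAINER_RUN;
    return RTBM_CONTAINER_ARRAY;
}

Under those assumptions, the three examples above come out as array (4 bytes),
run (4 bytes), and bitmap, respectively.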

Evaluation


Before implementing this idea and integrating it with lazy vacuum
code, I've implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance[4]. It has some functions: generating
TIDs for both index tuples and dead tuples, loading dead tuples to the
data structure, simulating 

Re: Update maintenance_work_mem/autovacuum_work_mem to reflect the 1GB limitation with VACUUM

2021-07-07 Thread David Rowley
On Sun, 4 Jul 2021 at 22:38, David Rowley  wrote:
> I could do with a 2nd opinion about if we should just adjust the
> maximum value for the autovacuum_work_mem GUC to 1GB in master.
>
> I'm also not sure if since we'd not backpatch the GUC max value
> adjustment if we need to document the upper limit in the manual.

I was just looking at this again and I see that GIN indexes are able
to use more than 1GB of memory during VACUUM. That discovery makes me
think having the docs say that vacuum cannot use more than 1GB of
memory is at best misleading and more likely just incorrect.

Right now I'm considering if it might just be better to revert
ec34040af and call it quits here.

David




Re: Pipeline mode and PQpipelineSync()

2021-07-07 Thread Alvaro Herrera
On 2021-Jul-07, Boris Kolpackov wrote:

> I don't get any difference in behavior with this patch. That is, I
> still get the unexpected NULL result. Does it make a difference for
> your reproducer?

Yes, the behavior changes for my repro.  Is it possible for you to
share a full program I can compile and run, please?  Thanks


-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
 sources, so let's move on."   (Nathaniel Smith)




Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors

2021-07-07 Thread Yugo NAGATA
On Wed, 07 Jul 2021 16:11:23 +0900 (JST)
Tatsuo Ishii  wrote:

> > Indeed, as Ishii-san pointed out, some users might not want to terminate
> > retrying transactions due to -T. However, the actual negative effect is only
> > printing the number of failed transactions. The other result that users 
> > want to
> > know, such as tps, are almost not affected because they are measured for
> > transactions processed successfully. Actually, the percentage of failed
> > transaction is very little, only 0.347%.
> 
> Well, "that's very little, let's ignore it" is not technically a right
> direction IMO.

Hmmm, It seems to me these failures are ignorable because with regard to 
failures
due to -T they occur only the last transaction of each client and do not affect
the result such as TPS and latency of successfully processed transactions.
(although I am not sure for what sense you use the word "technically"...)

However, maybe I am missing something. Could you please tell me what do you 
think
the actual harm for users about failures due to -D is?

> > In the existing behaviour, running transactions are never terminated due to
> > the -T option. However, ISTM that this would be based on an assumption
> > that a latency of each transaction is small and that a timing when we can
> > finish the benchmark would come soon.  On the other hand, when transactions 
> > can 
> > be retried unlimitedly, it may take a long time more than expected, and we 
> > can
> > not guarantee that this would finish successfully in limited 
> > time.Therefore,  
> > terminating the benchmark by giving up to retry the transaction after time
> > expiration seems reasonable under unlimited retries.
> 
> That's necessarily true in practice. By the time when -T is about to
> expire, transactions are all finished in finite time as you can see
> the result I showed. So it's reasonable that the very last cycle of
> the benchmark will finish in finite time as well.

Your script may finish in finite time, but others may not. However, 
considering only serialization and deadlock errors, almost all transactions
would eventually finish in finite time. In the previous version of the
patch, errors other than serialization or deadlock could be retried, and
that easily caused unlimited retrying. Now only those two kinds of errors
can be retried; nevertheless, it is unclear to me whether we can assume
that retrying will finish in finite time. If we can assume that, maybe
we can remove the restriction that --max-tries=0 must be used with
--latency-limit or -T.

> Of course if a benchmark cycle takes infinite time, this will be a
> problem. However same thing can be said to non-retry
> benchmarks. Theoretically it is possible that *one* benchmark cycle
> takes forever. In this case the only solution will be just hitting ^C
> to terminate pgbench. Why can't we have same assumption with
> --max-tries=0 case?

Indeed, it is possible an execution of a query takes a long or infinite
time. However, its cause would a problematic query in the custom script
or other problems occurs on the server side. These are not problem of
pgbench and, pgbench itself can't control either. On the other hand, the
unlimited number of tries is a behaviours specified by the pgbench option,
so I think pgbench itself should internally avoid problems caused from its
behaviours. That is, if max-tries=0 could cause infinite or much longer
benchmark time more than user expected due to too many retries, I think
pgbench should avoid it.

> > In the sense that we don't
> > terminate running transactions forcibly, this don't change the existing 
> > behaviour. 
> 
> This statement seems to be depending on your perosnal assumption.

Ok. If we regard a transaction as still running even while it is being
retried after an error, terminating the retry may imply forcibly terminating
a running transaction.

> I still don't understand why you think that --max-tries non 0 case
> will *certainly* finish in finite time whereas --max-tries=0 case will
> not.

I just mean that --max-tries greater than zero will prevent pgbench from 
retrying a
transaction forever.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: rand48 replacement

2021-07-07 Thread Dean Rasheed
On Wed, 7 Jul 2021 at 04:00, Yura Sokolov  wrote:
>
> Anyway, excuse me for heating this discussion cause of such
> non-essential issue.
> I'll try to control myself and don't proceed it further.
>

Whilst it has been interesting learning and discussing all these
different techniques, I think it's probably best to stick with the
bitmask method, rather than making the code too complex and difficult
to follow. The bitmask method has the advantage of being very simple,
easy to understand and fast (fastest in many of the benchmarks, and
close enough in others to make me think that the difference won't
matter for our purposes).

To test the current patch, I hacked up a simple SQL-callable server
function: random(bigint, bigint) returns bigint, similar to the one in
pgbench. After doing so, I couldn't help thinking that it would be
useful to have such a function in core, so maybe that could be a
follow-on patch. Anyway, that led to the following observations:

Firstly, there's a bug in the existing mask_u64() code -- if
pg_leftmost_one_pos64(u) returns 63, you end up with a mask equal to
0, and it breaks down.

Secondly, I think it would be simpler to implement this as a bitshift,
rather than a bitmask, using the high bits from the random number.
That might not make much difference for xoroshiro**, but in general,
PRNGs tend to be weaker in the lower bits, so it seems preferable on
that basis. But also, it makes the code simpler and less error-prone.

Finally, I think it would be better to treat the upper bound of the
range as inclusive. Doing so makes the function able to cover all
possible 64-bit ranges. It would then be easy (perhaps in another
follow-on patch) to make the pgbench random() function work for all
64-bit bounds (as long as max >= min), without the weird overflow
checking it currently has.

Putting those 3 things together, the code (minus comments) becomes:

if (range > 0)
{
int rshift = 63 - pg_leftmost_one_pos64(range);

do
{
val = xoroshiro128ss(state) >> rshift;
}
while (val > range);
}
else
val = 0;

which reduces the complexity a bit.
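
For what it's worth, here is one way that loop could be wrapped into an
inclusive-bounds helper. This is only a sketch: the state struct and
xoroshiro128ss() are assumed to come from the proposed patch, and
pg_leftmost_one_pos64() is the existing helper from
src/include/port/pg_bitutils.h (declared here just for the sketch):

#include <stdint.h>

/* assumed state layout: xoroshiro128** keeps two 64-bit words */
typedef struct rng_state
{
    uint64_t    s0;
    uint64_t    s1;
} rng_state;

extern uint64_t xoroshiro128ss(rng_state *state);       /* from the patch */
extern int pg_leftmost_one_pos64(uint64_t word);        /* pg_bitutils.h */

/* Uniform random value in [rmin, rmax], bounds inclusive; assumes rmax >= rmin. */
static uint64_t
rand_in_range(rng_state *state, uint64_t rmin, uint64_t rmax)
{
    uint64_t    range = rmax - rmin;    /* the full uint64 range is representable */
    uint64_t    val = 0;

    if (range > 0)
    {
        int         rshift = 63 - pg_leftmost_one_pos64(range);

        do
        {
            val = xoroshiro128ss(state) >> rshift;
        } while (val > range);
    }

    return rmin + val;
}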

Regards,
Dean




Re: Pipeline mode and PQpipelineSync()

2021-07-07 Thread Boris Kolpackov
Alvaro Herrera  writes:

> Can you please try with this patch?

I don't get any difference in behavior with this patch. That is, I
still get the unexpected NULL result. Does it make a difference for
your reproducer?




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-07 Thread Ronan Dunklau
On Tuesday, July 6, 2021 at 17:37:53 CEST, James Coleman wrote:
> Yes and no. When incremental sort has to do a full sort there will
> always be at least 2 attributes. But in prefix sort mode (see
> prefixsort_state) only non-presorted columns are sorted (i.e., if
> given a,b already sorted by a, then only b is sorted). So the
> prefixsort_state could use this optimization.

The optimization is not for when we actually sort on a single key, but for when
we get a single attribute in / out of the tuplesort.  Since sorting always adds
resjunk entries for the keys being sorted on, I don't think we can ever end up
in a situation where the optimization would kick in, since the entries for the
already-performed-sort keys will need to be present in the output.

Maybe if instead of adding resjunk entries to the whole query's targetlist, 
sort and incrementalsort nodes were able to do a projection from the input 
(needed tle + resjunk sorting tle) to a tuple containing only the needed tle 
on output before actually sorting it, it would be possible, but that would be 
quite a big design change.

In the meantime I fixed some formatting issues, please find attached a new 
patch.


-- 
Ronan Dunklau

diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..9d8b0a77da 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
  *		which saves the results in a temporary file or memory. After the
  *		initial call, returns a tuple from the file with each call.
  *
+ *		The tuplesort can either occur on the whole tuple (this is the nominal
+ *		case) or, when the input / output tuple consists of only one attribute,
+ *		we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
  *		Conditions:
  *		  -- none.
  *
@@ -86,32 +90,61 @@ ExecSort(PlanState *pstate)
 		outerNode = outerPlanState(node);
 		tupDesc = ExecGetResultType(outerNode);
 
-		tuplesortstate = tuplesort_begin_heap(tupDesc,
-			  plannode->numCols,
-			  plannode->sortColIdx,
-			  plannode->sortOperators,
-			  plannode->collations,
-			  plannode->nullsFirst,
-			  work_mem,
-			  NULL,
-			  node->randomAccess);
+		/*
+		 * Switch to the tuplesort_*_datum interface when we have only one
+		 * key, as it is much more efficient especially when the type is
+		 * pass-by-value.
+		 */
+		if (tupDesc->natts == 1)
+		{
+			node->is_single_val = true;
+			tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+   plannode->sortOperators[0],
+   plannode->collations[0],
+   plannode->nullsFirst[0],
+   work_mem,
+   NULL,
+   node->randomAccess);
+		}
+		else
+			tuplesortstate = tuplesort_begin_heap(tupDesc,
+  plannode->numCols,
+  plannode->sortColIdx,
+  plannode->sortOperators,
+  plannode->collations,
+  plannode->nullsFirst,
+  work_mem,
+  NULL,
+  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
 
 		/*
-		 * Scan the subplan and feed all the tuples to tuplesort.
+		 * Scan the subplan and feed all the tuples to tuplesort, using either
+		 * the _putdatum or _puttupleslot API as appropriate.
 		 */
-
-		for (;;)
-		{
-			slot = ExecProcNode(outerNode);
-
-			if (TupIsNull(slot))
-break;
-
-			tuplesort_puttupleslot(tuplesortstate, slot);
-		}
+		if (node->is_single_val)
+			for (;;)
+			{
+slot = ExecProcNode(outerNode);
+
+if (TupIsNull(slot))
+	break;
+slot_getsomeattrs(slot, 1);
+tuplesort_putdatum(tuplesortstate,
+   slot->tts_values[0],
+   slot->tts_isnull[0]);
+			}
+		else
+			for (;;)
+			{
+slot = ExecProcNode(outerNode);
+
+if (TupIsNull(slot))
+	break;
+tuplesort_puttupleslot(tuplesortstate, slot);
+			}
 
 		/*
 		 * Complete the sort.
@@ -150,9 +183,18 @@ ExecSort(PlanState *pstate)
 	 * next fetch from the tuplesort.
 	 */
 	slot = node->ss.ps.ps_ResultTupleSlot;
-	(void) tuplesort_gettupleslot(tuplesortstate,
-  ScanDirectionIsForward(dir),
-  false, slot, NULL);
+	if (node->is_single_val)
+	{
+		ExecClearTuple(slot);
+		if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+			   &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+			ExecStoreVirtualTuple(slot);
+	}
+	else
+		(void) tuplesort_gettupleslot(tuplesortstate,
+	  ScanDirectionIsForward(dir),
+	  false, slot, NULL);
+
 	return slot;
 }
 
@@ -191,6 +233,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	sortstate->bounded = false;
 	sortstate->sort_Done = false;
 	sortstate->tuplesortstate = NULL;
+	sortstate->is_single_val = false;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/utils/sort/tuplesort.c 

Re: Hook for extensible parsing.

2021-07-07 Thread Julien Rouhaud
Thanks for the review Jim!

On Wed, Jul 7, 2021 at 3:26 AM Jim Mlodgenski  wrote:
>
> On Sat, Jun 12, 2021 at 4:29 AM Julien Rouhaud  wrote:
>
> The patches all build properly and pass all regressions tests.

Note that the cfbot reports a compilation error on windows.  That's on
the grammar extension part, so I'm not really interested in trying
to fix that for now, as it's mostly a quick POC to demonstrate how one
could implement a different grammar and validate that everything works
as expected.

Also, if this patch is eventually committed and having some code to
experience the hook is wanted it would probably be better to have a
very naive parser (based on a few strcmp() calls or something like
that) to validate the behavior rather than having a real parser.

> > pg_parse_query() will instruct plugins to parse a query at a time.  They're
> > free to ignore that mode if they want to implement the 3rd mode.  If so, 
> > they
> > should either return multiple RawStmt, a single RawStmt with a 0 or
> > strlen(query_string) stmt_len, or error out.  Otherwise, they will implement
> > either mode 1 or 2, and they should always return a List containing a single
> > RawStmt with properly set stmt_len, even if the underlying statement is 
> > NULL.
> > This is required to properly skip valid strings that don't contain a
> > statements, and pg_parse_query() will skip RawStmt that don't contain an
> > underlying statement.
>
> Wouldn't we want to only loop through the individual statements if parser_hook
> exists? The current patch seems to go through the new code path regardless
> of the hook being grabbed.

I did think about it, but I eventually chose to write it this way.
Having a different code path for the no-hook situation won't make the
with-hook code any easier (it should only remove some check for the
hook in some places that have 2 or 3 other checks already).  On the
other hand, having a single code path avoids some (minimal) code
duplication, and also ensures that the main loop is actively tested
even without the hook being set.  That's not 100% coverage, but it's
better than nothing.  Performance-wise, it shouldn't make any
noticeable difference for the no-hook case.




Re: track_planning causing performance regression

2021-07-07 Thread Julien Rouhaud
On Thu, Jul 1, 2021 at 4:28 PM Fujii Masao  wrote:
>
> I'm fine with this. So what about the following diff? I added  
> tag.
>
> pg_stat_statements.track_planning controls whether
> planning operations and duration are tracked by the module.
> Enabling this parameter may incur a noticeable performance penalty,
> -  especially when a fewer kinds of queries are executed on many
> -  concurrent connections.
> +  especially when statements with identical query structure are executed
> +  by many concurrent connections which compete to update a small number 
> of
> +  pg_stat_statements entries.
> The default value is off.
> Only superusers can change this setting.

It seems perfect, thanks!




Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

2021-07-07 Thread Andy Fan
On Tue, Jul 6, 2021 at 5:34 PM David Rowley  wrote:
>
> On Sun, 4 Jul 2021 at 02:08, Andy Fan  wrote:
> >  I'd start to work on UniqueKey again, it would be great that we can target 
> > it
> >  to PG 15. The attached patch is just for the notnull_attrs. Since we can't 
> > say
> >  a column is nullable or not without saying in which resultset, So I think 
> > attaching
> > it to RelOptInfo is unavoidable. Here is how my patch works.
>
> I'd also like to see this work progress for PG15.


Thank you David!

I am re-designing/re-implementing the UniqueKey, and it is better to have
a design review as soon as possible. This write-up is for that. To make the
review easier, I have also uploaded my incomplete patch (correct and runnable,
with test cases).

Main changes are:
1. Use EC instead of expr, to cover more UniqueKey case.
2. Redesign the UniqueKey as below:

@@ -246,6 +246,7 @@ struct PlannerInfo
* subquery outputs */

List   *eq_classes; /* list of active EquivalenceClasses */
+ List   *unique_exprs; /* List of unique expr */

  bool ec_merging_done; /* set true once ECs are canonical */

+typedef struct UniqueKey
+{
+ NodeTag type;
+ Bitmapset *unique_expr_indexes;
+ bool multi_nulls;
+} UniqueKey;
+

PlannerInfo.unique_exprs is a List of unique exprs.  A unique expr is a set of
EquivalenceClasses. For example:

CREATE TABLE T1(A INT NOT NULL, B INT NOT NULL, C INT,  pk INT primary key);
CREATE UNIQUE INDEX ON t1(a, b);

SELECT DISTINCT * FROM T1 WHERE a = c;

Then we would have PlannerInfo.unique_exprs as below
[
[EC(a, c), EC(b)],
[EC(pk)]
]

RelOptInfo(t1) would have 2 UniqueKeys.
UniqueKey1 {unique_expr_indexes=bms{0}, multinull=false]
UniqueKey2 {unique_expr_indexes=bms{1}, multinull=false]

The design will benefit many-table join cases. For example, consider a 10-table
join where each table has a primary key (a, b).  Then we would have UniqueKeys
like this:

JoinRel{1,2,3,4} -  {t1.a, t1.b, t2.a, t2.b, t3.a, t3.b t4.a t4.b}
JoinRel{1,2,3,4, 5} -  {t1.a, t1.b, t2.a, t2.b, t3.a, t3.b t4.a t4.b t5.a t5.b}

This would be memory-consuming, and building such UniqueKeys would be
CPU-consuming as well. With the new design, we can store it as:

PlannerInfo.unique_exprs =
[
[t1.a, t1.b], -- EC is ignored in document.
[t2.a, t2.b],
[t3.a, t3.b],
[t4.a, t4.b],
[t5.a, t5.b],
[t6.a, t6.b],
[t7.a, t7.b],
[t8.a, t8.b],
[t9.a, t9.b],
[t10.a, t10.b],
]

JoinRel{1,2,3,4} -  Bitmapset{0,1,2,3} -- one bitmapword.
JoinRel{1,2,3,4,5} -  Bitmapset{0,1,2,3,4} -- one bitmapword.

3. Define a new SingleRow node and use it in joinrel as well.

+typedef struct SingleRow
+{
+ NodeTag type;
+ Index relid;
+} SingleRow;

SELECT * FROM t1, t2 WHERE t2.pk = 1;

PlannerInfo.unique_exprs
[
[t1.a, t1.b],
SingleRow{relid=2}
]

JoinRel{t1} - Bitmapset{0}
JoinRel{t2} - Bitmapset{1}
JoinRelt{1, 2} Bitmapset{0, 1} -- SingleRow will never be expanded to dedicated
exprs.

4. Cut useless UniqueKeys entirely at the baserel stage, based on
   root->distinct_pathkey.  If we want to use it anywhere else, I think this
   design is OK as well, for example for a GROUP BY UniqueKey.

5. Implemented the relation_is_distinct_for(root, rel, distinct_pathkey)
   effectively.  Here I used distinct_pathkey rather than
   Query->distinctClause.

Since the ECs in PlannerInfo.unique_exprs point to the same objects as
PathKey.pk_eqclass, we can compare the addresses directly with '==' rather than
equal(a, b). (equal() checks the address as well, so even if I used equal(),
the performance would be good.) SingleRow is handled for this case as well.
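
Coming back to the joinrel composition in point 2, here is a guess at the
shape of the composition helper under this design (hypothetical sketch only;
the actual build_composited_uniquekey() in the attached patch may differ, and
T_UniqueKey is assumed to be a node tag added by the patch):

static UniqueKey *
make_composite_uniquekey(const UniqueKey *outer_ukey, const UniqueKey *inner_ukey)
{
    UniqueKey  *ukey = (UniqueKey *) palloc0(sizeof(UniqueKey));

    ukey->type = T_UniqueKey;

    /* the union of both sides' unique exprs is unique for the join */
    ukey->unique_expr_indexes = bms_union(outer_ukey->unique_expr_indexes,
                                          inner_ukey->unique_expr_indexes);

    /* be pessimistic about multiple NULLs if either side could have them */
    ukey->multi_nulls = outer_ukey->multi_nulls || inner_ukey->multi_nulls;

    return ukey;
}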

You can check the more details in the attached patch.  Any feedback is welcome.

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


v1-0001-design-uniquekey-v2.patch
Description: Binary data


Re: Remove useless int64 range checks on BIGINT sequence MINVALUE/MAXVALUE values

2021-07-07 Thread David Rowley
On Wed, 7 Jul 2021 at 00:06, Greg Nancarrow  wrote:
> I think if you're going to reject this patch, a brief comment should
> be added to that code to justify why that existing superfluous check
> is worthwhile.

It seems strange to add a comment to explain why it's there. If we're
going to the trouble of doing that, then we should just remove it and
add a very small comment to mention why INT8 sequences don't need to
be checked.

Patch attached

David


remove_surplus_int8_seq_range_check.patch
Description: Binary data


Re: "debug_invalidate_system_caches_always" is too long

2021-07-07 Thread Peter Eisentraut

On 04.07.21 22:27, Tom Lane wrote:

I do agree with the "debug_" prefix given that it's now visible to
users.  However, it doesn't seem that hard to save some space in
the rest of the name.  The word "system" is adding nothing of value,
and the word "always" seems rather confusing --- if it does
something "always", why is there more than one level?  So a simple
proposal is to rename it to "debug_invalidate_caches".


I think we can definitely drop the "always".  Not so much the "system", 
since there are other caches, but it would be ok if we want it shorter.



However, I think we should also give serious consideration to
"debug_clobber_cache" or "debug_clobber_cache_always" for continuity
with past practice (though it still feels like "always" is a good
word to lose now).  "debug_clobber_caches" is another reasonable
variant.


The clobbering doesn't actually happen unless you turn on 
CLOBBER_FREED_MEMORY, so it would be good to keep that separate.





Re: PostgreSQL-13.3 parser.y with positional references by named references

2021-07-07 Thread Peter Eisentraut

On 04.07.21 17:58, Tom Lane wrote:

Domingo Alvarez Duarte  writes:

Here https://gist.github.com/mingodad/49291e0e9505522c66fcd3fcea4a939d I
posted the postgresql-13.3/src/backend/parser/gram.y with positional
references by named references that is supported by bison for some time now.


When is "some time now"?


release 2.5 (2011-05-14)


I do see the point about it being annoying to update $N references
when a rule is changed.  But this solution has enough downsides that
I'm not sure it's a net win.  Maybe if it were applied selectively,
to just the longer DDL productions, it'd be worth doing?


I agree that it should be applied selectively.




Re: trivial improvement to system_or_bail

2021-07-07 Thread Daniel Gustafsson
> On 6 Jul 2021, at 23:59, Alvaro Herrera  wrote:

> Failures now look like this, respectively:
> 
> Bailout called.  Further testing stopped:  failed to execute command "finitdb 
> -D 
> /home/alvherre/Code/pgsql-build/master/src/test/recovery/tmp_check/t_019_replslot_limit_primary_data/pgdata
>  -A trust -N --wal-segsize=1": No such file or directory
> 
> Bailout called.  Further testing stopped:  command "initdb -0D 
> /home/alvherre/Code/pgsql-build/master/src/test/recovery/tmp_check/t_019_replslot_limit_primary_data/pgdata
>  -A trust -N --wal-segsize=1" exited with value 1
> 
> Bailout called.  Further testing stopped:  command "initdb -0D 
> /home/alvherre/Code/pgsql-build/master/src/test/recovery/tmp_check/t_019_replslot_limit_primary_data/pgdata
>  -A trust -N --wal-segsize=1" died with signal 11
> 
> 
> Previously it was just
> 
> Bailout called.  Further testing stopped:  system initdb failed

That is no doubt going to be helpful, thanks!

--
Daniel Gustafsson   https://vmware.com/





Re: ExecRTCheckPerms() and many prunable partitions

2021-07-07 Thread Amit Langote
On Wed, Jul 7, 2021 at 1:41 PM David Rowley  wrote:
> On Fri, 2 Jul 2021 at 12:41, Amit Langote  wrote:
> > I'll mark the CF entry as WoA, unless you'd rather I just mark it RwF.
>
> I've set it to waiting on author. It was still set to needs review.

Sorry it slipped my mind to do that and thanks.

> If you think you'll not get time to write the patch during this CF,
> feel free to bump it out.

I will try to post an update next week if not later this week,
hopefully with an updated patch.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Re: Why does XLOG_FPI_FOR_HINT always need backups?

2021-07-07 Thread Kyotaro Horiguchi
Hello.

At Tue, 6 Jul 2021 20:42:23 +0800, "zwj" <757634...@qq.com> wrote in 
> But I wonder whether it is necessary or not when my file system can protect
> database blocks from being torn. And I read a comment in the function
> MarkBufferDirtyHint:
> 
> /*
>  * If we need to protect hint bit updates from torn writes, WAL-log a
>  * full page image of the page. This full page image is only necessary
>  * if the hint bit update is the first change to the page since the
>  * last checkpoint.
>  *
>  * We don't check full_page_writes here because that logic is included
>  * when we call XLogInsert() since the value changes dynamically.
>  */
> 
> However, the code tells me it has nothing to do with full_page_writes. I
> can't figure it out.

The doc of wal_log_hints says "*even* for non-critical modifications
of so-called hint bits", which seems to me to imply that it follows
full_page_writes (and I think it would be nonsense otherwise, as you
suspect).

XLogSaveBufferForHint has set REGBUF_FORCE_IMAGE since 2c03216d83116,
when the symbol was introduced. As I understand it, XLogInsert did not
have the ability to enforce FPIs before that commit. The code comment
above is older than that commit. So it seems to me a thinko that
XLogSaveBufferForHint sets REGBUF_FORCE_IMAGE.

I think the attached fixes that thinko.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From d0d3d52953eda6451c212f4994fdc21a8284f6c1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 7 Jul 2021 15:34:41 +0900
Subject: [PATCH] Make FPI_FOR_HINT follow standard FPI emitting policy

Commit 2c03216d831160bedd72d45f712601b6f7d03f1c changed
XLogSaveBufferForHint to enforce FPI but the caller didn't intend to
do so.  Restore the intended behavior that FPI_FOR_HINT follows
full_page_writes.
---
 src/backend/access/transam/xloginsert.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3d2c9c3e8c..e596a0470a 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -287,8 +287,6 @@ XLogRegisterBlock(uint8 block_id, RelFileNode *rnode, ForkNumber forknum,
 {
 	registered_buffer *regbuf;
 
-	/* This is currently only used to WAL-log a full-page image of a page */
-	Assert(flags & REGBUF_FORCE_IMAGE);
 	Assert(begininsert_called);
 
 	if (block_id >= max_registered_block_id)
@@ -995,7 +993,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
 
 	if (lsn <= RedoRecPtr)
 	{
-		int			flags;
+		int			flags = 0;
 		PGAlignedBlock copied_buffer;
 		char	   *origdata = (char *) BufferGetBlock(buffer);
 		RelFileNode rnode;
@@ -1022,7 +1020,6 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
 
 		XLogBeginInsert();
 
-		flags = REGBUF_FORCE_IMAGE;
 		if (buffer_std)
 			flags |= REGBUF_STANDARD;
 
-- 
2.27.0



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2021-07-07 Thread Tatsuo Ishii
> Indeed, as Ishii-san pointed out, some users might not want to terminate
> retrying transactions due to -T. However, the actual negative effect is only
> printing the number of failed transactions. The other results that users want
> to know, such as tps, are mostly unaffected because they are measured only
> for transactions processed successfully. Actually, the percentage of failed
> transactions is very small, only 0.347%.

Well, "that's very little, let's ignore it" is not technically a right
direction IMO.

> In the existing behaviour, running transactions are never terminated due to
> the -T option. However, ISTM that this is based on an assumption that the
> latency of each transaction is small and that a point where we can finish the
> benchmark will come soon. On the other hand, when transactions can be retried
> unlimitedly, it may take much longer than expected, and we cannot guarantee
> that this will finish successfully in a limited time. Therefore, terminating
> the benchmark by giving up on retrying the transaction after the time expires
> seems reasonable under unlimited retries.

That's not necessarily true in practice. By the time -T is about to
expire, transactions have all finished in finite time, as you can see
from the result I showed. So it's reasonable to expect that the very
last cycle of the benchmark will finish in finite time as well.

Of course, if a benchmark cycle takes infinite time, this will be a
problem. However, the same thing can be said of non-retry benchmarks.
Theoretically it is possible that *one* benchmark cycle takes forever;
in that case the only solution is to hit ^C to terminate pgbench. Why
can't we make the same assumption for the --max-tries=0 case?

> In the sense that we don't terminate running transactions forcibly, this
> doesn't change the existing behaviour.

This statement seems to depend on your personal assumption.

I still don't understand why you think that the non-zero --max-tries
case will *certainly* finish in finite time whereas the --max-tries=0
case will not.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




unexpected data loaded into database when using COPY FROM

2021-07-07 Thread jian...@fujitsu.com
Hi 

When I used the COPY FROM command on Windows, I found that if the line data
ends with a backslash followed by a carriage return/newline (\r\n), COPY FROM
mishandles the line.
As a result, unexpected data was loaded into the database.

The following case can reproduce this issue.

Data file:
lines ending with carriage return/newline (\r\n)
  - test.txt --
  AAA\   ★there is only one backslash character (\) at the end of the line.
  BBB
  ---

Data loading:
#CREATE TABLE copytest( a TEXT);
#COPY copytest FROM '/test.txt';

Data in database:
# SELECT * FROM copytest;
   a
---
 AAA\r  ★\r is loaded unexpectedly
 BBB
(2 rows)
--

In this case, would it be better to throw an error to the user than to load
the unexpected data into the database?
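
(For comparison, a minimal sketch, not taken from the report: in COPY text
format a literal backslash in the data is normally written doubled, so only
the unescaped trailing backslash hits this corner case.)

  CREATE TABLE copytest (a TEXT);
  COPY copytest FROM STDIN;
  AAA\\
  BBB
  \.
  SELECT * FROM copytest;   -- expected rows: 'AAA\' and 'BBB'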

Regards,


RE: Cosmic ray hits integerset

2021-07-07 Thread Jakub Wartak
Hi,

Asking out of pure technical curiosity about "the rhinoceros" - what kind of
animal is it? A physical box or a VM? How could one get dmidecode(1) /
dmesg(1) / mcelog(1) output from it (e.g. does it run ECC memory or not)?

-J.

> -Original Message-
> From: Alvaro Herrera 
> Sent: Tuesday, June 22, 2021 4:21 PM
> To: Thomas Munro 
> Cc: pgsql-hackers 
> Subject: Re: Cosmic ray hits integerset
> 
> On 2021-Jun-22, Thomas Munro wrote:
> 
> > Hi,
> >
> > Here's a curious one-off failure in test_integerset:
> >
> > +ERROR:  iterate returned wrong value; got 519985430528, expected
> > +485625692160
> 
> Cosmic rays indeed.  The base-2 representation of the expected value is
> 111000100010001100011000000000000000000
> and that of the actual value is
> 111100100010001100011000000000000000000
> 
> There's a single bit of difference.
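
(Just to double-check the single-bit claim, a quick SQL verification of my
own, not part of the original report:)

  SELECT 519985430528 # 485625692160 AS flipped_bit;
  -- 34359738368, i.e. 2^35: exactly one bit differs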





Re: Skipping logical replication transactions on subscriber side

2021-07-07 Thread Masahiko Sawada
On Tue, Jul 6, 2021 at 6:33 PM Amit Kapila  wrote:
>
> On Tue, Jul 6, 2021 at 12:30 PM Masahiko Sawada  wrote:
> >
> > On Mon, Jul 5, 2021 at 6:46 PM Amit Kapila  wrote:
> > >
> > > On Thu, Jul 1, 2021 at 6:31 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Thu, Jul 1, 2021 at 12:56 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > >
> > > > > Don't we want to clear stats at drop subscription as well? We do drop
> > > > > database stats in dropdb via pgstat_drop_database, so I think we need
> > > > > to clear subscription stats at the time of drop subscription.
> > > >
> > > > Yes, it needs to be cleared. In the 0003 patch, pgstat_vacuum_stat()
> > > > sends the message to clear the stats. I think it's better to have
> > > > pgstat_vacuum_stat() do that job similar to dropping replication slot
> > > > statistics rather than relying on the single message send at DROP
> > > > SUBSCRIPTION. I've considered doing both: sending the message at DROP
> > > > SUBSCRIPTION and periodical checking by pgstat_vacuum_stat(), but
> > > > dropping subscription not setting a replication slot is able to
> > > > rollback. So we need to send it only at commit time. Given that we
> > > > don’t necessarily need the stats to be updated immediately, I think
> > > > it’s reasonable to go with only a way of pgstat_vacuum_stat().
> > > >
> > >
> > > Okay, that makes sense. Can we consider sending the multiple ids in
> > > one message as we do for relations or functions in
> > > pgstat_vacuum_stat()? That will reduce some message traffic.
> >
> > Yes. Since subscriptions are objects that are not frequently created
> > and dropped I prioritized not to increase the message type. But if we
> > do that for subscriptions, is it better to do that for replication
> > slots as well? It seems to me that the lifetime of subscriptions and
> > replication slots are similar.
> >
>
> Yeah, I think it makes sense to do for both, we can work on slots
> patch separately. I don't see a reason why we shouldn't send a single
> message for multiple clear/drop entries.

+1

>
> > >
> > > True but I guess the user can wait for all the tablesyncs to either
> > > finish or get an error corresponding to the table sync. After that, it
> > > can use 'copy_data' as false. This is not a very good method but I
> > > don't see any other option. I guess whatever is the case logging
> > > errors from tablesyncs is anyway not a bad idea.
> > >
> > > Instead of using the syntax "ALTER SUBSCRIPTION name SET SKIP
> > > TRANSACTION Iconst", isn't it better to use it as a subscription
> > > option like Mark has done for his patch (disable_on_error)?
> >
> > According to the doc, ALTER SUBSCRIPTION ... SET is used to alter
> > parameters originally set by CREATE SUBSCRIPTION. Therefore, we can
> > specify a subset of parameters that can be specified by CREATE
> > SUBSCRIPTION. It makes sense to me for 'disable_on_error' since it can
> > be specified by CREATE SUBSCRIPTION. Whereas SKIP TRANSACTION stuff
> > cannot be done. Are you concerned about adding a syntax to ALTER
> > SUBSCRIPTION?
> >
>
> Both for additional syntax and consistency with disable_on_error.
> Isn't it just a current implementation that Alter only allows to
> change parameters supported by Create? Is there a reason why we can't
> allow Alter to set/change some parameters not supported by Create?

I think there is no reason for that, but looking at ALTER TABLE I
thought there was such a policy. The skipping-transaction feature seems
somewhat different from the disable_on_error feature: the former is a
way to deal with a problem on the spot, whereas the latter is a
persistent setting of a subscription. Anyway, if we use a subscription
option, can we reset the XID by setting it to 0? Or do we need ALTER
SUBSCRIPTION RESET?
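
For illustration, the two alternatives would look something like the
following (hypothetical syntax, and the option name skip_xid is just a
placeholder; nothing is implemented yet):

  ALTER SUBSCRIPTION mysub SET SKIP TRANSACTION 740;  -- dedicated syntax
  ALTER SUBSCRIPTION mysub SET (skip_xid = 740);      -- subscription option,
                                                      -- like disable_on_error
  ALTER SUBSCRIPTION mysub SET (skip_xid = 0);        -- possible way to reset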

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/