Re: [PATCH] Identify LWLocks in tracepoints

2021-03-10 Thread Peter Eisentraut

On 10.03.21 06:38, Craig Ringer wrote:
On Wed, 3 Mar 2021 at 20:50, David Steele wrote:


On 1/22/21 6:02 AM, Peter Eisentraut wrote:

This patch set no longer applies:
http://cfbot.cputube.org/patch_32_2927.log
.

Can we get a rebase? Also marked Waiting on Author.


Rebased as requested.


In patch 0001, why was the TRACE_POSTGRESQL_LWLOCK_RELEASE() call moved? 
 Is there some correctness issue?  If so, we should explain that (at 
least in the commit message, or as a separate patch).





Re: Parallel INSERT (INTO ... SELECT ...)

2021-03-10 Thread Amit Langote
On Wed, Mar 10, 2021 at 6:18 PM Amit Kapila  wrote:
> On Mon, Mar 8, 2021 at 7:19 PM Amit Langote  wrote:
> > I just read through v25 and didn't find anything to complain about.
>
> Thanks a lot, pushed now! Amit L., your inputs are valuable for this work.

Glad I could be of help.  Really appreciate that you credited me as
author for whatever small part I contributed.

> Now, coming back to Hou-San's patch to introduce a GUC and reloption
> for this feature, I think both of those make sense to me because when
> the feature is enabled via GUC, one might want to disable it for
> partitioned tables? Do we agree on that part or someone thinks
> otherwise?

Agree to have both.

> The other points to bikeshed could be:
> 1. The name of GUC and reloption. The two proposals at hand are
> enable_parallel_dml and enable_parallel_insert. I would prefer the
> second (enable_parallel_insert) because updates/deletes might not have
> a similar overhead.

Sounds reasonable.

> 2. Should we keep the default value of GUC to on or off? It is
> currently off. I am fine keeping it off for this release and we can
> always turn it on in the later releases if required. Having said that,
> I see the value in keeping it on because in many cases Insert ...
> Select will be used for large data and there we will see a benefit of
> parallelism and users facing trouble (who have a very large number of
> partitions with less data to query) can still disable the parallel
> insert for that particular table. Also, the other benefit of keeping
> it on till at least the beta period is that this functionality will
> get tested and if we found reports of regression then we can turn it
> off for this release as well.

This makes sense too.

--
Amit Langote
EDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2021-03-10 Thread Antonin Houska
tsunakawa.ta...@fujitsu.com  wrote:

> I'm crawling like a snail to read the patch set.  Below are my first set of 
> review comments, which are all minor.

Thanks. I will check your comments when I prepare the next version of
the patch.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Queries for PostgreSQL DDL convert

2021-03-10 Thread Julien Rouhaud
On Wed, Mar 10, 2021 at 04:09:49PM +0800, 杨逸存 wrote:
> Dear hacker:
>   I am a student from Nanjing University. I am having some trouble 
> with DDL statement conversion. I have several MySQL DDL statements 
> from the MySQL dump command. Now I want to convert the statements' grammar so 
> that they can be supported by PostgreSQL. However, I have tried the tool 
> 'SQL Data Definition Language (DDL) Conversion to MySQL, PostgreSQL and 
> MS-SQL' on the pgsql wiki website, but it seems to have some bugs and it 
> cannot convert the contents of my .sql files. So is there any tool available for 
> me to accomplish this?
>   The mail's attachment is an example I want to convert. Looking 
> forward to your reply.

Ora2pg (http://ora2pg.darold.net/documentation.html) supports migration from
mysql to postgres.  It should be able to take care of the DDL conversion for
you.




Re: [PATCH] pg_permissions

2021-03-10 Thread Joel Jacobson
New version attached.

Changes:

* Added documentation in catalogs.sgml
* Dropped "objsubid" from pg_ownerships since columns have no owner, only tables

Do we prefer "pg_permissions" or "pg_privileges"?

I can see "privileges" occur 2325 times in the sources,
while "permissions" occur only 1097 times.

Personally, I would prefer "pg_permissions" since it seems more common in 
general.
"database permissions" gives 195 000 000 results on Google,
while "database privileges" only gives 46 800 000 Google results.

If we would have consistently used only "privileges" so far,
I would vote for "pg_privileges", but since there is already a mix,
I think "pg_permissions" would be nicer, it's easier to get right typing also.

/Joel


0004-pg_permissions-and-pg_ownerships.patch
Description: Binary data


Re: Freeze the inserted tuples during CTAS?

2021-03-10 Thread Masahiko Sawada
On Wed, Mar 10, 2021 at 3:57 PM Paul Guo  wrote:
>
> > On Mar 3, 2021, at 1:35 PM, Masahiko Sawada  wrote:
> >> On Sun, Feb 21, 2021 at 4:46 PM Paul Guo  wrote:
> >> Attached is the v2 version that fixes a test failure due to plan change 
> >> (bitmap index scan -> index only scan).
>
> > I think this is a good idea.
>
> > BTW, how much does this patch affect the CTAS performance? I expect
> > it's negligible but If there is much performance degradation due to
> > populating visibility map, it might be better to provide a way to
> > disable it.
>
> Yes,  this is a good suggestion. I did a quick test yesterday.
>
> Configuration: shared_buffers = 1280M and the test system memory is 7G.
>
> Test queries:
>   checkpoint;
>   \timing
>   create table t1 (a, b, c, d) as select i,i,i,i from 
> generate_series(1,20000000) i;
>   \timing
>   select pg_size_pretty(pg_relation_size('t1'));
>
> Here are the running time:
>
> HEAD   : Time: 10299.268 ms (00:10.299)  + 1537.876 ms (00:01.538)
> Patch  : Time: 12257.044 ms (00:12.257)  + 14.247 ms
>
> The table size is 800+MB so the table should be all in the buffer. I was 
> surprised
> to see the patch increases the CTAS time by 19.x%, and also it is not better 
> than
> "CTAS+VACUUM" on HEAD version. In theory the visibility map buffer change 
> should
> not affect that much. I looked at related code again (heap_insert()). I 
> believe
> the overhead could decrease along with some discussed CTAS optimization
> solutions (multi-insert, or raw-insert, etc).
>
> I tested 'copy' also. COPY FREEZE does not involve much more overhead than COPY
> according to the experiment results below. COPY uses multi-insert. There seems
> to be no other difference from CTAS when writing a new table.
>
> COPY TO + VACUUM
> Time: 8826.995 ms (00:08.827) + 1599.260 ms (00:01.599)
> COPY TO FREEZE + VACUUM
> Time: 8836.107 ms (00:08.836) + 13.581 ms
>
> So maybe think about doing freeze in CTAS after optimizing the CTAS 
> performance
> later?

Thank you for testing. That's interesting.

I've also done some benchmarks for CTAS (2GB table creation) and got
similar results:

Patched : 44 sec
HEAD : 34 sec

Since CREATE MATERIALIZED VIEW is also internally treated as CTAS, I
got similar results even for CREATE MATVIEW.

After investigation, it seems to me that the cause of performance
degradation is that heap_insert() set PD_ALL_VISIBLE when inserting a
tuple for the first time on the page (L2133 in heapam.c). This
requires every subsequent heap_insert() to pin a visibility map buffer
(see RelationGetBufferForTuple()). This problem doesn't exist in
heap_multi_insert() since it reads vm buffer once to fill the heap
page with tuples.

Given such relatively big performance degradation, it seems to me that
we should do some optimization for heap_insert() first. Otherwise, we
will end up imposing those costs on all users.

> By the way, ‘REFRESH MatView’ does freeze by default. Matview is quite 
> similar to CTAS.
> I did test it also and the conclusion is similar to that of CTAS. Not sure 
> why FREEZE was
> enabled though, maybe I missed something?

I’m not sure why setting the visibility map and PD_ALL_VISIBLE
during REFRESH MATVIEW is enabled by default. By commits 7db0cd2145 and
39b66a91b, heap_insert() and heap_multi_insert() set visibility map
bits if HEAP_INSERT_FROZEN is specified. Looking at the commit
messages, those changes seem to be intended for COPY FREEZE, but they
indeed affect REFRESH MATVIEW as well. However, I could not find any discussion
or mention of REFRESH MATVIEW in either those threads or the commit
messages.

One reason that might justify such behavior is that materialized
views are read-only. Since the visibility map bits are never cleared,
even if there is some cost to setting the visibility map during the refresh,
setting VM bits and PD_ALL_VISIBLE at creation might win. On the other
hand, a table created by CTAS is read-write. The user might not want
to pay a cost while creating a table if the table is updated
frequently after creation. Not sure. That being said, I think this
performance degradation of REFRESH MATVIEW could be a problem. There
is no way to avoid the degradation, and we could also rely on autovacuum
to set visibility map bits on materialized views instead. I'll start a new
thread to discuss that.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-10 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> Now, coming back to Hou-San's patch to introduce a GUC and reloption
> for this feature, I think both of those make sense to me because when
> the feature is enabled via GUC, one might want to disable it for
> partitioned tables? Do we agree on that part or someone thinks
> otherwise?
> 
> The other points to bikeshed could be:
> 1. The name of GUC and reloption. The two proposals at hand are
> enable_parallel_dml and enable_parallel_insert. I would prefer the
> second (enable_parallel_insert) because updates/deletes might not have
> a similar overhead.
> 
> 2. Should we keep the default value of GUC to on or off? It is
> currently off. I am fine keeping it off for this release and we can
> always turn it on in the later releases if required. Having said that,
> I see the value in keeping it on because in many cases Insert ...
> Select will be used for large data and there we will see a benefit of
> parallelism and users facing trouble (who have a very large number of
> partitions with less data to query) can still disable the parallel
> insert for that particular table. Also, the other benefit of keeping
> it on till at least the beta period is that this functionality will
> get tested and if we found reports of regression then we can turn it
> off for this release as well.
> 
> Thoughts?


TBH, I'm a bit unsure, but the names with _insert would be OK.

The reason why Oracle has the ENABLE PARALLEL DML clause and parallel DML is 
disabled by default is probably the following critical restriction (and 
they have other, less severe restrictions on parallel execution).  Our 
implementation does not have restrictions like this.

"Serial or parallel statements that attempt to access a table that has been 
modified in parallel within the same transaction are rejected with an error 
message."

"A transaction can contain multiple parallel DML statements that modify 
different tables, but after a parallel DML statement modifies a table, no 
subsequent serial or parallel statement (DML or query) can access the same 
table again in that transaction."


OTOH, I'm a bit concerned about whether we would have a common reason to 
disable parallel INSERT, UPDATE and DELETE.  Oracle describes a difference in 
space usage as follows.  I wonder if something similar would apply to our 
parallel UPDATE/DELETE, particularly UPDATE.  At least, we already know 
parallel INSERT results in larger tables and indexes compared to serial 
execution.

"Parallel UPDATE uses the existing free space in the object, while direct-path 
INSERT gets new extents for the data."

"Space usage characteristics may be different in parallel than serial execution 
because multiple concurrent child transactions modify the object."

Even if that's the case, we can add _update parameters, although it is unfortunate 
to trouble users with having to be aware of multiple parameters.

Or, if it turns out that we need _update and/or _delete parameters, we 
can opt to rename _insert to _dml, and keep the _insert parameter as an old, 
hidden synonym.


Regards
Takayuki Tsunakawa



Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)

2021-03-10 Thread Michael Paquier
On Thu, Mar 11, 2021 at 04:37:39PM +1300, Thomas Munro wrote:
> Michael, when you said "That's pretty hack-ish, still efficient" in
> reference to this code:
> 
>> -   if (IsUnderPostmaster && !PostmasterIsAlive())
>> +   if (IsUnderPostmaster &&
>> +#ifndef USE_POSTMASTER_DEATH_SIGNAL
>> +   count++ % 1024 == 0 &&
>> +#endif
>> +   !PostmasterIsAlive())
> 
> Is that an objection, and do you see a specific better way?

I'd like to believe that there are more elegant ways to write that,
but based on the numbers you are giving, there is too much gain here
to ignore it.  I would avoid 1024 as a hardcoded value though, so you
could just stick that in a #define or such.  So please feel free to go
ahead.  Thanks for asking.
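
For illustration, the rate-limited check with the interval behind a macro could
look like this (a sketch; the macro and helper names are illustrative, not actual
PostgreSQL code):

    #define POSTMASTER_POLL_RATE_LIMIT 1024

    static bool
    postmaster_possibly_dead(void)
    {
        static uint32 count = 0;

        /* Only pay for the real check (a system call) once every
         * POSTMASTER_POLL_RATE_LIMIT iterations of the recovery loop. */
        return (count++ % POSTMASTER_POLL_RATE_LIMIT) == 0 &&
               !PostmasterIsAlive();
    }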

> I know that someone just needs to write a Windows patch to get us a
> postmaster death signal when the postmaster's event fires, and then
> the problem will go away on Windows.  I still want this change,
> because we don't have such a patch yet, and even when someone writes
> that, there are still a couple of Unixes that could benefit.

Wow.  This probably means that we would be able to get rid of
USE_POSTMASTER_DEATH_SIGNAL?
--
Michael




Re: shared-memory based stats collector

2021-03-10 Thread Fujii Masao




On 2021/03/11 13:42, Kyotaro Horiguchi wrote:

At Wed, 10 Mar 2021 19:21:00 -0800, Andres Freund  wrote in

Hi,

Two minor nits:


Thanks for the comments!


On 2021-03-10 21:47:51 +0900, Fujii Masao wrote:

+/* Shared memory area for archiver process */
+typedef struct PgArchData
+{
+   Latch  *latch;  /* latch to wake the archiver 
up */
+   slock_t mutex;  /* locks this struct */
+} PgArchData;
+


It doesn't really matter, but it'd be pretty trivial to avoid needing a
spinlock for this kind of thing. Just store the pgprocno of the archiver
in PgArchData.


Looks promising.


You mean that the spinlock is not necessary if we do the following?

- Save the pgprocno of the archiver in PgArchData, instead of a latch and mutex.
- Set PgArch->pgprocno at the startup of the archiver.
- Reset PgArch->pgprocno to INVALID_PGPROCNO at pgarch_die().
- XLogArchiveNotify() sets the latch (i.e.,
  ProcGlobal->allProcs[PgArch->pgprocno].procLatch) if PgArch->pgprocno is not
  INVALID_PGPROCNO (see the sketch below).

Right?
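
A rough sketch of that last step, assuming the proposed PgArch->pgprocno field
(this is not existing code, just an illustration of the idea):

    /* in XLogArchiveNotify(), no spinlock needed */
    int         arch_pgprocno = PgArch->pgprocno;

    if (arch_pgprocno != INVALID_PGPROCNO)
        SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);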





While getting rid of the spinlock doesn't seem like a huge win, it does
seem nicer that we'd automatically have a way to find data about the
archiver (e.g. pid).


PGPROC GetAuxProcessInfo(AuxProcType type)?


I don't think this new function is necessary.

ISTM that Andres said it's worth adding pgprocno into PgArch because
it enables us to get information about the archiver more easily by
using that pgprocno. For example, we can get the pid of the archiver via
ProcGlobal->allProcs[PgArch->pgprocno].pid. That is, he thinks that
adding pgprocno has several merits. I agree with that. Maybe I'm misunderstanding
his comment, though...





 * checkpointer to exit as well, otherwise not. The archiver, 
stats,
 * and syslogger processes are disregarded since they are not
 * connected to shared memory; we also disregard dead_end 
children
 * here. Walsenders are also disregarded, they will be 
terminated
 * later after writing the checkpoint record, like the archiver
 * process.
 */


This comment in PostmasterStateMachine() is outdated now.


Right. Will fix a bit later.


Thanks!

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Huge memory consumption on partitioned table with FKs

2021-03-10 Thread Tatsuro Yamada

On 2021/03/11 9:39, Amit Langote wrote:

On Thu, Mar 11, 2021 at 4:25 AM Tom Lane  wrote:

Amit Langote  writes:

On Wed, Mar 10, 2021 at 8:37 AM Tom Lane  wrote:

Hmm.  So, the key point is that the values coming from the partitioned
child table are injected into the test query as parameters, not as
column references, thus it doesn't matter *to the test query* what
numbers the referencing columns have in that child.  We just have to
be sure we pass the right parameter values.



Right.


I did some cosmetic fooling with this (mostly, rewriting the comments
YA time) and pushed it.


Perfect.   Thanks for your time on this.



Thanks for fixing the problem! :-D


Regards,
Tatsuro Yamada





Re: shared-memory based stats collector

2021-03-10 Thread Kyotaro Horiguchi
At Wed, 10 Mar 2021 19:21:00 -0800, Andres Freund  wrote in 
> Hi,
> 
> Two minor nits:

Thanks for the comments!

> On 2021-03-10 21:47:51 +0900, Fujii Masao wrote:
> > +/* Shared memory area for archiver process */
> > +typedef struct PgArchData
> > +{
> > +   Latch  *latch;  /* latch to wake the archiver 
> > up */
> > +   slock_t mutex;  /* locks this struct */
> > +} PgArchData;
> > +
> 
> It doesn't really matter, but it'd be pretty trivial to avoid needing a
> spinlock for this kind of thing. Just store the pgprocno of the archiver
> in PgArchData.

Looks promising.

> While getting rid of the spinlock doesn't seem like a huge win, it does
> seem nicer that we'd automatically have a way to find data about the
> archiver (e.g. pid).

PGPROC GetAuxProcessInfo(AuxProcType type)?

> >  * checkpointer to exit as well, otherwise not. The archiver, 
> > stats,
> >  * and syslogger processes are disregarded since they are not
> >  * connected to shared memory; we also disregard dead_end 
> > children
> >  * here. Walsenders are also disregarded, they will be 
> > terminated
> >  * later after writing the checkpoint record, like the archiver
> >  * process.
> >  */
> 
> This comment in PostmasterStateMachine() is outdated now.

Right. Will fix a bit later.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Custom compression methods

2021-03-10 Thread Dilip Kumar
 On Thu, Mar 11, 2021 at 8:50 AM Justin Pryzby  wrote:
>
> Looking at v23-0002-alter-table-set-compression, ATRewriteTable() was calling
> CompareCompressionMethodAndDecompress().

While changing the compression method, the user might just be interested
in compressing future tuples with the new compression method and not
want to rewrite all the old tuples.  So IMHO, forcing a rewrite
whenever the compression method is changed, without any option,
doesn't look that great.

> I think what's wanted here is that decompression should only happen when the
> tuple uses a different compression than the column's currently set 
> compression.
> So there's no overhead in the usual case.  I guess CLUSTER and INSERT SELECT
> should do the same.
>
> This is important to allow someone to get rid of LZ4 compression, if they want
> to get rid of that dependency.
>
> But it's also really desirable for admins to be able to "migrate" existing
> data.  People will want to test this, and I guess INSERT SELECT and CLUSTER 
> are
> the obvious ways.

For INSERT SELECT we were already doing this in the older version, so we can
include that code here. We will also have to include the patches for
decompressing data before forming composite types, because without
that we cannot ensure that lz4 data does not exist anywhere in the table.
That said, even with that we cannot ensure it doesn't exist anywhere
in the system, because it might exist in the WAL, and crash
recovery might bring that lz4-compressed data back.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: libpq debug log

2021-03-10 Thread Kyotaro Horiguchi
At Thu, 11 Mar 2021 04:12:57 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Kyotaro Horiguchi 
> > Right. So something like this?
> > 
> > unsigned char p;
> > 
> > p = buf + *cursor;
> > result = (uint32) (*p << 24) + (*(p + 1)) << 16 + ...);
> 
> Yes, that would work (if p is a pointer), but I think memcpy() is enough like 
> pqGetInt() does.

On second thought, memcpy onto a local variable doesn't harm. I agreed
to you.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: shared-memory based stats collector

2021-03-10 Thread Andres Freund
Hi,

On 2021-03-10 17:51:37 +0900, Kyotaro Horiguchi wrote:
> From ed2fb2fca47fccbf9af1538688aab8334cf6470c Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi 
> Date: Fri, 13 Mar 2020 16:58:03 +0900
> Subject: [PATCH v52 1/7] sequential scan for dshash
> 
> Dshash did not allow scan the all entries sequentially. This adds the
> functionality. The interface is similar but a bit different both from
> that of dynahash and simple dshash search functions. One of the most
> significant differences is the sequential scan interface of dshash
> always needs a call to dshash_seq_term when scan ends. Another is
> locking. Dshash holds partition lock when returning an entry,
> dshash_seq_next() also holds lock when returning an entry but callers
> shouldn't release it, since the lock is essential to continue a
> scan. The seqscan interface allows entry deletion while a scan. The
> in-scan deletion should be performed by dshash_delete_current().

*while a scan is in progress
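
For reference, the usage pattern that commit message describes would look roughly
like this (a sketch: dshash_seq_init()'s exact arguments are my assumption from the
patch, and MyEntry/entry_is_expired are placeholders):

    dshash_seq_status status;
    MyEntry    *entry;

    dshash_seq_init(&status, hash_table, true);    /* true = exclusive partition locks */
    while ((entry = dshash_seq_next(&status)) != NULL)
    {
        if (entry_is_expired(entry))
            dshash_delete_current(&status);        /* in-scan deletion */
    }
    dshash_seq_term(&status);                      /* always required; releases the lock */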



> +void *
> +dshash_seq_next(dshash_seq_status *status)
> +{
> + dsa_pointer next_item_pointer;
> +
> + if (status->curitem == NULL)
> + {
> + int partition;
> +
> + Assert(status->curbucket == 0);
> + Assert(!status->hash_table->find_locked);
> +
> + /* first shot. grab the first item. */
> + partition =
> + PARTITION_FOR_BUCKET_INDEX(status->curbucket,
> +
> status->hash_table->size_log2);
> + LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
> +   status->exclusive ? LW_EXCLUSIVE : 
> LW_SHARED);
> + status->curpartition = partition;

What does "first shot" mean here?


> +/*
> + * Terminates the seqscan and release all locks.
> + *
> + * Should be always called when finishing or exiting a seqscan.
> + */
> +void
> +dshash_seq_term(dshash_seq_status *status)
> +{
> + status->hash_table->find_locked = false;
> + status->hash_table->find_exclusively_locked = false;
> +
> + if (status->curpartition >= 0)
> + LWLockRelease(PARTITION_LOCK(status->hash_table, 
> status->curpartition));
> +}

I think it'd be good if you added an assertion preventing this from
being called twice.

status->curpartition should definitely be reset to something < 0 after
releasing the lock, to avoid that happening again in case of such a bug.



> +/* Get the current entry while a seq scan. */
> +void *
> +dshash_get_current(dshash_seq_status *status)
> +{
> + return ENTRY_FROM_ITEM(status->curitem);
> +}

What is this used for? It'd probably be good to assert that there's a
scan in progress too, and that locks are held - otherwise this isn't
safe, right?


> +void *
> +dshash_find_extended(dshash_table *hash_table, const void *key,
> +  bool exclusive, bool nowait, bool 
> insert, bool *found)
> +{
> + dshash_hash hash = hash_key(hash_table, key);
> + size_t  partidx = PARTITION_FOR_HASH(hash);
> + dshash_partition *partition = &hash_table->control->partitions[partidx];
> + LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
>   dshash_table_item *item;
> - hash = hash_key(hash_table, key);
> - partition_index = PARTITION_FOR_HASH(hash);
> - partition = &hash_table->control->partitions[partition_index];
> -
> - Assert(hash_table->control->magic == DSHASH_MAGIC);
> - Assert(!hash_table->find_locked);
> + /* must be exclusive when insert allowed */
> + Assert(!insert || (exclusive && found != NULL));
>  
>  restart:
> - LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
> -   LW_EXCLUSIVE);
> + if (!nowait)
> + LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
> + else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
> +
> lockmode))
> + return NULL;

I think this code (and probably also some in the previous patch) would
be more readable if you introduced a local variable to store
PARTITION_LOCK(hash_table, partidx) in. There's four repetitions of
this...
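
Something like this, presumably (sketch only):

    LWLock     *partition_lock = PARTITION_LOCK(hash_table, partidx);

    if (!nowait)
        LWLockAcquire(partition_lock, lockmode);
    else if (!LWLockConditionalAcquire(partition_lock, lockmode))
        return NULL;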


>   ensure_valid_bucket_pointers(hash_table);
>  
>   /* Search the active bucket. */
>   item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, 
> hash));
>  
>   if (item)
> - *found = true;
> + {
> + if (found)
> + *found = true;
> + }
>   else
>   {
> - *found = false;
> + if (found)
> + *found = false;
> +
> + if (!insert)
> + {
> + /* The caller didn't told to add a new entry. */

s/told/tell us/


These nested ifs here make me think that the interface is too
complicated...



> +typedef struct StatsShmemStruct
> +{
> +   

RE: libpq debug log

2021-03-10 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> Right. So something like this?
> 
> unsigned char p;
> 
> p = buf + *cursor;
> result = (uint32) (*p << 24) + (*(p + 1)) << 16 + ...);

Yes, that would work (if p is a pointer), but I think memcpy() is enough like 
pqGetInt() does.


Regards
Takayuki Tsunakawa





Re: [PATCH] Provide more information to filter_prepare

2021-03-10 Thread Amit Kapila
On Wed, Mar 10, 2021 at 4:26 PM Markus Wanner
 wrote:
>
> On 10.03.21 11:18, Amit Kapila wrote:
> > On Tue, Mar 9, 2021 at 2:14 PM Markus Wanner
> >  wrote:
> >> currently, only the gid is passed on to the filter_prepare callback.
> >> While we probably should not pass a full ReorderBufferTXN (as we do for
> >> most other output plugin callbacks), a bit more information would be
> >> nice, I think.
> >
> > How the proposed 'xid' parameter can be useful? What exactly plugins
> > want to do with it?
>
> The xid is the very basic identifier for transactions in Postgres.  Any
> output plugin that interacts with Postgres in any way slightly more
> interesting than "filter by gid prefix" is very likely to come across a
> TransactionId.
>
> It allows for basics like checking if the transaction to decode still is
> in progress, for example.
>

But this happens when we are decoding the prepare, so it is clear that the
transaction is prepared; why is any additional check needed?

>  Or in a much more complex scenario, decide on
> whether or not to filter based on properties the extension stored during
> processing the transaction.
>

What in this can't be done with the GID, and how would the XID achieve it?

-- 
With Regards,
Amit Kapila.




Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)

2021-03-10 Thread Thomas Munro
On Tue, Mar 2, 2021 at 2:10 PM Thomas Munro  wrote:
> ...  One question I haven't
> got to the bottom of: is it a problem for the startup process that CVs
> use CHECK_FOR_INTERRUPTS()?

This was a red herring.  The startup process already reaches CFI() via
various paths, as I figured out pretty quickly with a debugger.  So
I'd like to go ahead and commit these patches.

Michael, when you said "That's pretty hack-ish, still efficient" in
reference to this code:

> -   if (IsUnderPostmaster && !PostmasterIsAlive())
> +   if (IsUnderPostmaster &&
> +#ifndef USE_POSTMASTER_DEATH_SIGNAL
> +   count++ % 1024 == 0 &&
> +#endif
> +   !PostmasterIsAlive())

Is that an objection, and do you see a specific better way?

I know that someone just needs to write a Windows patch to get us a
postmaster death signal when the postmaster's event fires, and then
the problem will go away on Windows.  I still want this change,
because we don't have such a patch yet, and even when someone writes
that, there are still a couple of Unixes that could benefit.




Re: shared-memory based stats collector

2021-03-10 Thread Andres Freund
Hi,

Two minor nits:

On 2021-03-10 21:47:51 +0900, Fujii Masao wrote:
> +/* Shared memory area for archiver process */
> +typedef struct PgArchData
> +{
> + Latch  *latch;  /* latch to wake the archiver 
> up */
> + slock_t mutex;  /* locks this struct */
> +} PgArchData;
> +

It doesn't really matter, but it'd be pretty trivial to avoid needing a
spinlock for this kind of thing. Just store the pgprocno of the archiver
in PgArchData.

While getting rid of the spinlock doesn't seem like a huge win, it does
seem nicer that we'd automatically have a way to find data about the
archiver (e.g. pid).




>* checkpointer to exit as well, otherwise not. The archiver, 
> stats,
>* and syslogger processes are disregarded since they are not
>* connected to shared memory; we also disregard dead_end 
> children
>* here. Walsenders are also disregarded, they will be 
> terminated
>* later after writing the checkpoint record, like the archiver
>* process.
>*/

This comment in PostmasterStateMachine() is outdated now.



Greetings,

Andres Freund




Re: [HACKERS] Custom compression methods

2021-03-10 Thread Justin Pryzby
On Thu, Mar 11, 2021 at 08:17:46AM +0530, Dilip Kumar wrote:
> On Thu, Mar 11, 2021 at 2:21 AM Robert Haas  wrote:
> >
> > On Wed, Mar 10, 2021 at 6:52 AM Dilip Kumar  wrote:
> > > The pending comment is providing a way to rewrite a table and
> > > re-compress the data with the current compression method.
> >
> > I spent some time poking at this yesterday and ran couldn't figure out
> > what was going on here. There are two places where we rewrite tables.
> > One is the stuff in cluter.c, which handles VACUUM FULL and CLUSTER.
> > That eventually calls reform_and_rewrite_tuple(), which deforms the
> > old tuple and creates a new one, but it doesn't seem like there's
> > anything in there that would expand toasted values, whether external
> > or inline compressed. But I think that can't be right, because it
> > seems like then you'd end up with toast pointers into the old TOAST
> > relation, not the new one, which would cause failures later. So I must
> > be missing something here. The other place where we rewrite tables is
> > in ATRewriteTable() as part of the ALTER TABLE machinery. I don't see
> > anything there to force detoasting either.
> 
> We do expand the external values; see the call stack below.
> 
> #0  detoast_external_attr (attr=0x2d61a80) at detoast.c:50
> #1  0x0055bd53 in toast_tuple_init
> #2  0x0050554d in heap_toast_insert_or_update
> #3  0x0050ad5b in raw_heap_insert
> #4  0x0050a9a1 in rewrite_heap_tuple
> #5  0x00502325 in reform_and_rewrite_tuple
> #6  0x004ff47c in heapam_relation_copy_for_cluster
> 
> But that is only if there are external attributes; we do nothing for
> the inline compressed values.  In raw_heap_insert(), we only call
> heap_toast_insert_or_update() if HeapTupleHasExternal(tup) is true.  So if
> we want to do something about inline compressed data then we will have
> to do something in reform_and_rewrite_tuple because there we deform
> and form the tuple again so we have an opportunity to decompress any
> compressed data.
> 
> > But, I am not really convinced that we need to solve this problem by
> > adding new ALTER TABLE syntax. I'd be happy enough if CLUSTER, VACUUM
> > FULL, and versions of ALTER TABLE that already force a rewrite would
> > cause the compression to be redone also.
> 
> Okay.  Maybe for directly compressed data we can do that without
> affecting the performance of unrelated paths, but I am again worried
> about the composite type.  Basically, if we have some row type then we
> have to deform the tuple stored in the row type and process it fully.
> That said, I think in the older version we already had patches at
> [1], basically the first 3 patches will ensure that we will never have
> any compressed data in the composite type so we will not have to worry
> about those as well.
> 
>  Honestly, even if the user
> > had to fall back on creating a new table and doing INSERT INTO newtab
> > SELECT * FROM oldtab I would consider that to be not a total
> > showstopper for this .. assuming of course that it actually works. If
> > it doesn't, we have big problems. Even without the pg_am stuff, we
> > still need to make sure that we don't just blindly let compressed
> > values wander around everywhere. When we insert into a table column
> > with a compression method, we should recompress any data that is
> > compressed using some other method.
> 
> Well, it used to work in the older version[1] so we can make it work
> here as well without much effort.

Looking at v23-0002-alter-table-set-compression, ATRewriteTable() was calling
CompareCompressionMethodAndDecompress().

I think what's wanted here is that decompression should only happen when the
tuple uses a different compression than the column's currently set compression.
So there's no overhead in the usual case.  I guess CLUSTER and INSERT SELECT
should do the same.
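
A sketch of the per-value check I mean (get_compression_method() and
recompress_datum() are placeholders, not the patch's actual functions, and
attcompression is the per-column setting this patch set adds):

    if (!isnull && VARATT_IS_COMPRESSED(DatumGetPointer(value)) &&
        get_compression_method(value) != att->attcompression)
        value = recompress_datum(value, att->attcompression);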

This is important to allow someone to get rid of LZ4 compression, if they want
to get rid of that dependency.

But it's also really desirable for admins to be able to "migrate" existing
data.  People will want to test this, and I guess INSERT SELECT and CLUSTER are
the obvious ways.

-- 
Justin




Re: libpq debug log

2021-03-10 Thread Kyotaro Horiguchi
At Thu, 11 Mar 2021 03:01:16 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Kyotaro Horiguchi 
> > The output functions copy message bytes into local variable but the
> > same effect can be obtained by just casing via typed pointer type.
> > 
> >   uint32 tmp4;
> >   ..
> >   memcpy(&tmp4, buf + *cursor, 4);
> >   result = (int) pg_ntoh32(tmp4);
> > 
> > can be written as
> > 
> >   result = pg_ntoh32(* (uint32 *) (buf + *cursor));
> 
> I'm afraid we need to memcpy() because of memory alignment.

Right. So something like this?

unsigned char p;

p = buf + *cursor;
result = (uint32) (*p << 24) + (*(p + 1)) << 16 + ...);
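
Both ways can be written alignment-safely in plain C; for reference, a
self-contained sketch (ntohl() stands in for pg_ntoh32() here):

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    /* copy into an aligned local, then convert from network byte order */
    static uint32_t
    read_be32_memcpy(const unsigned char *buf, int cursor)
    {
        uint32_t    tmp4;

        memcpy(&tmp4, buf + cursor, 4);     /* safe for any alignment */
        return ntohl(tmp4);
    }

    /* assemble from individual bytes; alignment-safe and endian-independent */
    static uint32_t
    read_be32_shift(const unsigned char *buf, int cursor)
    {
        const unsigned char *p = buf + cursor;

        return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
               ((uint32_t) p[2] << 8) | (uint32_t) p[3];
    }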


> > I think we can get rid of copying in the output functions for String
> > and Bytes in different ways.
> 
> I haven't looked at this code, but you sound right.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Replace buffer I/O locks with condition variables (reviving an old patch)

2021-03-10 Thread Julien Rouhaud
On Thu, Mar 11, 2021 at 03:54:06PM +1300, Thomas Munro wrote:
> On Thu, Mar 11, 2021 at 3:27 PM Julien Rouhaud  wrote:
> > -   /*
> > -* It would be nice to include the I/O locks in the BufferDesc, but that
> > -* would increase the size of a BufferDesc to more than one cache line,
> > -* and benchmarking has shown that keeping every BufferDesc aligned on a
> > -* cache line boundary is important for performance.  So, instead, the
> > -* array of I/O locks is allocated in a separate tranche.  Because those
> > -* locks are not highly contended, we lay out the array with minimal
> > -* padding.
> > -*/
> > -   size = add_size(size, mul_size(NBuffers, 
> > sizeof(LWLockMinimallyPadded)));
> > +   /* size of I/O condition variables */
> > +   size = add_size(size, mul_size(NBuffers,
> > +  
> > sizeof(ConditionVariableMinimallyPadded)));
> >
> > Should we keep for now some similar comment mentioning why we don't put 
> > the cv
> > in the BufferDesc even though it would currently fit the 64B target size?
> 
> I tried to write some words along those lines, but it seemed hard to
> come up with a replacement message about a thing we're not doing
> because of other currently proposed patches.  The situation could
> change, and it seemed to be a strange place to put this comment
> anyway, far away from the relevant struct.

Yeah, I agree that it's not the best place to document the size consideration.

> Ok, let me try that
> again... what do you think of this, as a new comment for BufferDesc,
> next to the existing discussion of the 64 byte rule?
> 
> --- a/src/include/storage/buf_internals.h
> +++ b/src/include/storage/buf_internals.h
> @@ -174,6 +174,10 @@ typedef struct buftag
>   * Be careful to avoid increasing the size of the struct when adding or
>   * reordering members.  Keeping it below 64 bytes (the most common CPU
>   * cache line size) is fairly important for performance.
> + *
> + * Per-buffer I/O condition variables are kept outside this struct in a
> + * separate array.  They could be moved in here and still fit under that
> + * limit on common systems, but for now that is not done.
>   */
>  typedef struct BufferDesc
>  {

I was mostly thinking about something like "leave room for now as other features
could make better use of that space", but I'm definitely fine with this
comment!




Re: libpq debug log

2021-03-10 Thread Kyotaro Horiguchi
At Thu, 11 Mar 2021 01:51:55 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> Alvaro-san, Tom-san, Horiguchi-san,
> 
> From: Kyotaro Horiguchi 
> > +1 for the thanks for the quick work.  I have some random comments
> > after a quick look on it.
> 
> Thank you very much for giving many comments.  And we're sorry to
> have caused you trouble.  I told Iwata-san yesterday to modify the
> patch to use the logging function interface that Tom-san, Alvaro-san
> and I agreed on, which I'd review after the new revision has been

Yeah, I agreed at the time. Sorry for not responding it but had no
room in my mind to do that:p

> posted, and to let others know she is modifying the patch again so that
> they won't start reviewing.  But I should have made that request on
> hackers.
> 
> We're catching up with your feedback and will post the updated patch.
> Then I'll review it.
> 
> We appreciate your help so much.

My pleasure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: libpq debug log

2021-03-10 Thread Kyotaro Horiguchi
At Wed, 10 Mar 2021 20:38:20 -0500, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > Maybe I'm missing something, but the above doesn't seem working.  I
> > didn't see such message when making a SSL connection.
> 
> As far as that goes, I don't see any point in trying to trace
> connection-related messages, because there is no way to enable
> tracing on a connection before it's created.

Yeah, agreed. In the previous version tracing functions are called
during protocol negotiation but that no longer happenes.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: libpq debug log

2021-03-10 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> The output functions copy message bytes into local variable but the
> same effect can be obtained by just casing via typed pointer type.
> 
>   uint32 tmp4;
>   ..
>   memcpy(&tmp4, buf + *cursor, 4);
>   result = (int) pg_ntoh32(tmp4);
> 
> can be written as
> 
>   result = pg_ntoh32(* (uint32 *) (buf + *cursor));

I'm afraid we need to memcpy() because of memory alignment.

> I think we can get rid of copying in the output functions for String
> and Bytes in different ways.

I haven't looked at this code, but you sound right.


Regards
Takayuki Tsunakawa





Re: Replace buffer I/O locks with condition variables (reviving an old patch)

2021-03-10 Thread Thomas Munro
On Thu, Mar 11, 2021 at 3:27 PM Julien Rouhaud  wrote:
> -   /*
> -* It would be nice to include the I/O locks in the BufferDesc, but that
> -* would increase the size of a BufferDesc to more than one cache line,
> -* and benchmarking has shown that keeping every BufferDesc aligned on a
> -* cache line boundary is important for performance.  So, instead, the
> -* array of I/O locks is allocated in a separate tranche.  Because those
> -* locks are not highly contended, we lay out the array with minimal
> -* padding.
> -*/
> -   size = add_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded)));
> +   /* size of I/O condition variables */
> +   size = add_size(size, mul_size(NBuffers,
> +  sizeof(ConditionVariableMinimallyPadded)));
>
> Should we keep for now some similar comment mentioning why we don't put the 
> cv
> in the BufferDesc even though it would currently fit the 64B target size?

I tried to write some words along those lines, but it seemed hard to
come up with a replacement message about a thing we're not doing
because of other currently proposed patches.  The situation could
change, and it seemed to be a strange place to put this comment
anyway, far away from the relevant struct.  Ok, let me try that
again... what do you think of this, as a new comment for BufferDesc,
next to the existing discussion of the 64 byte rule?

--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -174,6 +174,10 @@ typedef struct buftag
  * Be careful to avoid increasing the size of the struct when adding or
  * reordering members.  Keeping it below 64 bytes (the most common CPU
  * cache line size) is fairly important for performance.
+ *
+ * Per-buffer I/O condition variables are kept outside this struct in a
+ * separate array.  They could be moved in here and still fit under that
+ * limit on common systems, but for now that is not done.
  */
 typedef struct BufferDesc
 {




Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-03-10 Thread Fujii Masao



On 2021/03/11 9:38, Masahiro Ikeda wrote:

On 2021-03-10 17:08, Fujii Masao wrote:

On 2021/03/10 14:11, Masahiro Ikeda wrote:

On 2021-03-09 17:51, Fujii Masao wrote:

On 2021/03/05 8:38, Masahiro Ikeda wrote:

On 2021-03-05 01:02, Fujii Masao wrote:

On 2021/03/04 16:14, Masahiro Ikeda wrote:

On 2021-03-03 20:27, Masahiro Ikeda wrote:

On 2021-03-03 16:30, Fujii Masao wrote:

On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
 wrote:


I pgindented the patches.


... XLogWrite, which is invoked during an
XLogFlush request (see ...).  This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"


Thanks, I fixed them.


wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."


I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.


"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in) This is undesirable because ..."


Thanks, I fixed it.


I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics.  This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite.  Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not.  It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL).  Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead.  In either case, this
half-and-half implementation seems undesirable.


OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)


Thanks for updating the patches!



I added the infrastructure code to communicate the WAL receiver stats messages 
between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?


On second thought, this idea seems not good, because those stats are
collected across multiple walreceivers, while the other values in
pg_stat_wal_receiver are only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thoughts?


OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 
patch.


Thanks for updating the patches! I'm now reading 001 patch.

+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+    (sync_method == SYNC_METHOD_FSYNC ||
+ sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+ sync_method == SYNC_METHOD_FDATASYNC))
+    {

Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?


Thanks for the comments.
I added the above code in v12 patch.
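
For illustration, the early return being discussed is roughly this (a sketch,
not the exact v12 code):

    /* at the top of issue_xlog_fsync(): nothing to do if fsync is disabled
     * or the chosen sync method already syncs at write time */
    if (!enableFsync ||
        sync_method == SYNC_METHOD_OPEN ||
        sync_method == SYNC_METHOD_OPEN_DSYNC)
        return;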



+    /*
+ * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+ * the overhead in WAL-writing.
+ */
+    if (rc & WL_TIMEOUT)
+    pgstat_send_wal();

On second thought, this change means that it always takes wal_writer_delay
before the walwriter's WAL stats are sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would needlessly be out of date for those seconds.
So I'm thinking of withdrawing my previous comment; it's OK to send
the stats every time XLogBackgroundFlush() is called. Thoughts?


Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500 msec, wal_writer_delay's
default value is 200 msec, and it may be set to an even shorter time.


Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.



Why don't we use another way to check the timestamp?

+   /*
+    * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+ 

Re: [HACKERS] Custom compression methods

2021-03-10 Thread Dilip Kumar
On Thu, Mar 11, 2021 at 2:21 AM Robert Haas  wrote:
>
> On Wed, Mar 10, 2021 at 6:52 AM Dilip Kumar  wrote:
> > The pending comment is providing a way to rewrite a table and
> > re-compress the data with the current compression method.
>
> I spent some time poking at this yesterday and ran couldn't figure out
> what was going on here. There are two places where we rewrite tables.
> One is the stuff in cluter.c, which handles VACUUM FULL and CLUSTER.
> That eventually calls reform_and_rewrite_tuple(), which deforms the
> old tuple and creates a new one, but it doesn't seem like there's
> anything in there that would expand toasted values, whether external
> or inline compressed. But I think that can't be right, because it
> seems like then you'd end up with toast pointers into the old TOAST
> relation, not the new one, which would cause failures later. So I must
> be missing something here. The other place where we rewrite tables is
> in ATRewriteTable() as part of the ALTER TABLE machinery. I don't see
> anything there to force detoasting either.

We do expand the external values; see the call stack below.

#0  detoast_external_attr (attr=0x2d61a80) at detoast.c:50
#1  0x0055bd53 in toast_tuple_init
#2  0x0050554d in heap_toast_insert_or_update
#3  0x0050ad5b in raw_heap_insert
#4  0x0050a9a1 in rewrite_heap_tuple
#5  0x00502325 in reform_and_rewrite_tuple
#6  0x004ff47c in heapam_relation_copy_for_cluster

But that is only if there are external attributes; we do nothing for
the inline compressed values.  In raw_heap_insert(), we only call
heap_toast_insert_or_update() if HeapTupleHasExternal(tup) is true.  So if
we want to do something about inline compressed data then we will have
to do something in reform_and_rewrite_tuple because there we deform
and form the tuple again so we have an opportunity to decompress any
compressed data.

> But, I am not really convinced that we need to solve this problem by
> adding new ALTER TABLE syntax. I'd be happy enough if CLUSTER, VACUUM
> FULL, and versions of ALTER TABLE that already force a rewrite would
> cause the compression to be redone also.

Okay.  Maybe for directly compressed data we can do that without
affecting the performance of unrelated paths, but I am again worried
about the composite type.  Basically, if we have some row type then we
have to deform the tuple stored in the row type and process it fully.
That said, I think in the older version we already had patches at
[1], basically the first 3 patches will ensure that we will never have
any compressed data in the composite type so we will not have to worry
about those as well.

 Honestly, even if the user
> had to fall back on creating a new table and doing INSERT INTO newtab
> SELECT * FROM oldtab I would consider that to be not a total
> showstopper for this .. assuming of course that it actually works. If
> it doesn't, we have big problems. Even without the pg_am stuff, we
> still need to make sure that we don't just blindly let compressed
> values wander around everywhere. When we insert into a table column
> with a compression method, we should recompress any data that is
> compressed using some other method.

Well, it used to work in the older version[1] so we can make it work
here as well without much effort.

[1] 
https://www.postgresql.org/message-id/CAFiTN-u%3D2-qaLTod3isQmXuSU0s0_bR%2BRcUQL-vSvH%3DMJbEd7Q%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-10 Thread houzj.f...@fujitsu.com
> > 2. Should we keep the default value of GUC to on or off? It is
> > currently off. I am fine keeping it off for this release and we can
> > always turn it on in the later releases if required. Having said that,
> > I see the value in keeping it on because in many cases Insert ...
> > Select will be used for large data and there we will see a benefit of
> > parallelism and users facing trouble (who have a very large number of
> > partitions with less data to query) can still disable the parallel
> > insert for that particular table. Also, the other benefit of keeping
> > it on till at least the beta period is that this functionality will
> > get tested and if we found reports of regression then we can turn it
> > off for this release as well.
> >
> > Thoughts?
> 
> IMHO, we should keep it on because in most cases it is going to give a
> benefit to the user, and for the specific scenario where a table has a
> lot of partitions it can be turned off using the reloption.  And, if for some
> users most of the tables are like that, always having a lot of partitions,
> then the user can turn it off using the GUC.
> 
I think your suggestion makes sense.
If no one has other opinions, I will post a new version with the default set to 
enable_parallel_insert=on soon.

Best regards,
houzj



Re: [HACKERS] Custom compression methods

2021-03-10 Thread Justin Pryzby
This includes a patch to use pkgconfig, in an attempt to build on mac, which
currently fails like:

https://cirrus-ci.com/task/5993712963551232?command=build#L126
checking for LZ4_compress in -llz4... no
configure: error: library 'lz4' is required for LZ4 support

-- 
Justin
>From 601ed9e2966b29e1603fd07f69b8068404753810 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 5 Mar 2021 09:33:08 +0530
Subject: [PATCH 1/5] Built-in compression method

Add syntax allowing a compression method to be specified.
As of now there are only two options for the built-in compression
method (pglz, lz4), which can be set while creating a table
or adding a new column.  There is no option for altering the
compression method of an existing column.

Dilip Kumar based on the patches from Ildus Kurbangaliev.
Design input from Robert Haas and Tomas Vondra.
Reviewed by Robert Haas, Tomas Vondra, Alexander Korotkov and Justin Pryzby

Discussions:
https://www.postgresql.org/message-id/20171213151818.75a20...@postgrespro.ru
https://www.postgresql.org/message-id/CA%2BTgmoaKDW1Oi9V%3Djc9hOGyf77NbkNEABuqgHD1Cq%3D%3D1QsOcxg%40mail.gmail.com
https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
https://www.postgresql.org/message-id/20201005160355.byp74sh3ejsv7wrj%40development
https://www.postgresql.org/message-id/CAFiTN-tzTTT2oqWdRGLv1dvvS5MC1W%2BLE%2B3bqWPJUZj4GnHOJg%40mail.gmail.com
---
 configure | 118 +++
 configure.ac  |  18 +
 doc/src/sgml/catalogs.sgml|  10 +
 doc/src/sgml/ref/create_table.sgml|  33 +-
 doc/src/sgml/ref/psql-ref.sgml|  11 +
 src/backend/access/brin/brin_tuple.c  |   5 +-
 src/backend/access/common/Makefile|   1 +
 src/backend/access/common/detoast.c   |  71 ++--
 src/backend/access/common/indextuple.c|   3 +-
 src/backend/access/common/toast_compression.c | 308 ++
 src/backend/access/common/toast_internals.c   |  54 ++-
 src/backend/access/common/tupdesc.c   |   8 +
 src/backend/access/table/toast_helper.c   |   5 +-
 src/backend/bootstrap/bootstrap.c |   5 +
 src/backend/catalog/genbki.pl |   3 +
 src/backend/catalog/heap.c|   4 +
 src/backend/catalog/index.c   |   1 +
 src/backend/catalog/toasting.c|   6 +
 src/backend/commands/tablecmds.c  | 110 +++
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/nodes/equalfuncs.c|   1 +
 src/backend/nodes/nodeFuncs.c |   2 +
 src/backend/nodes/outfuncs.c  |   1 +
 src/backend/parser/gram.y |  26 +-
 src/backend/parser/parse_utilcmd.c|   9 +
 src/backend/utils/adt/varlena.c   |  41 +++
 src/bin/pg_dump/pg_backup.h   |   1 +
 src/bin/pg_dump/pg_dump.c |  39 +++
 src/bin/pg_dump/pg_dump.h |   1 +
 src/bin/pg_dump/t/002_pg_dump.pl  |  12 +-
 src/bin/psql/describe.c   |  31 +-
 src/bin/psql/help.c   |   2 +
 src/bin/psql/settings.h   |   1 +
 src/bin/psql/startup.c|  10 +
 src/include/access/detoast.h  |   8 +
 src/include/access/toast_compression.h| 119 +++
 src/include/access/toast_helper.h |   1 +
 src/include/access/toast_internals.h  |  20 +-
 src/include/catalog/pg_attribute.h|   8 +-
 src/include/catalog/pg_proc.dat   |   4 +
 src/include/nodes/parsenodes.h|   2 +
 src/include/parser/kwlist.h   |   1 +
 src/include/pg_config.h.in|   3 +
 src/include/postgres.h|  14 +-
 src/test/regress/expected/compression.out | 247 ++
 src/test/regress/expected/compression_1.out   | 240 ++
 src/test/regress/parallel_schedule|   2 +-
 src/test/regress/pg_regress_main.c|   4 +-
 src/test/regress/serial_schedule  |   1 +
 src/test/regress/sql/compression.sql  | 102 ++
 src/tools/msvc/Solution.pm|   1 +
 src/tools/pgindent/typedefs.list  |   1 +
 52 files changed, 1645 insertions(+), 85 deletions(-)
 create mode 100644 src/backend/access/common/toast_compression.c
 create mode 100644 src/include/access/toast_compression.h
 create mode 100644 src/test/regress/expected/compression.out
 create mode 100644 src/test/regress/expected/compression_1.out
 create mode 100644 src/test/regress/sql/compression.sql

diff --git a/configure b/configure
index fad817bb38..761a27965d 100755
--- a/configure
+++ b/configure
@@ -699,6 +699,7 @@ with_gnu_ld
 LD
 LDFLAGS_SL
 LDFLAGS_EX
+with_lz4
 with_zlib
 with_system_tzdata
 with_libxslt
@@ -864,6 +865,7 @@ with_libxml
 with_libxslt
 

Re: Replace buffer I/O locks with condition variables (reviving an old patch)

2021-03-10 Thread Julien Rouhaud
On Thu, Mar 11, 2021 at 10:48:40AM +1300, Thomas Munro wrote:
> On Tue, Mar 9, 2021 at 6:24 PM Julien Rouhaud  wrote:
> >
> > +1 for adding the cv into BufferDesc.  That brings the struct size to 
> > exactly
> > 64 bytes on x86 64 bits architecture.  This won't add any extra overhead to
> > LOCK_DEBUG cases, as it was already exceeding the 64B threshold, if that 
> > even
> > was a concern.
> 
> I also checked that it's 64B on an Arm box.  Not sure about POWER.
> But... despite the fact that it looks like a good change in isolation,
> I decided to go back to the separate array in this initial commit,
> because the AIO branch also wants to add a new BufferDesc member[1].

Ok!

> I may come back to that change, if we can make some more space (seems
> entirely doable, but I'd like to look into that separately).

-   /*
-* It would be nice to include the I/O locks in the BufferDesc, but that
-* would increase the size of a BufferDesc to more than one cache line,
-* and benchmarking has shown that keeping every BufferDesc aligned on a
-* cache line boundary is important for performance.  So, instead, the
-* array of I/O locks is allocated in a separate tranche.  Because those
-* locks are not highly contended, we lay out the array with minimal
-* padding.
-*/
-   size = add_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded)));
+   /* size of I/O condition variables */
+   size = add_size(size, mul_size(NBuffers,
+  sizeof(ConditionVariableMinimallyPadded)));

Should we keep for now some similar comment mentioning why we don't put the cv
in the BufferDesc even though it would currently fit the 64B target size?

> Thanks for the review.  And of course to Robert for writing the patch.  
> Pushed.

Great!




Re: Removing vacuum_cleanup_index_scale_factor

2021-03-10 Thread Masahiko Sawada
On Thu, Mar 11, 2021 at 10:12 AM Peter Geoghegan  wrote:
>
> On Tue, Mar 9, 2021 at 7:42 PM Peter Geoghegan  wrote:
> > My current plan is to commit everything within the next day or two.
> > This includes backpatching to Postgres 13 only.
>
> Pushed, thanks.

Great! Thank you!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: libpq debug log

2021-03-10 Thread tsunakawa.ta...@fujitsu.com
Alvaro-san, Tom-san, Horiguchi-san,

From: Kyotaro Horiguchi 
> +1 for the thanks for the quick work.  I have some random comments
> after a quick look on it.

Thank you very much for the many comments, and we're sorry to have caused you 
trouble.  I told Iwata-san yesterday to modify the patch to use the logging 
function interface that Tom-san, Alvaro-san and I agreed on, said that I'd 
review it after the new revision has been posted, and asked her to let others 
know she is modifying the patch again so that they won't start reviewing it.  
But I should have made that request on -hackers.

We're catching up with your feedback and will post the updated patch.  Then 
I'll review it.

We appreciate your help so much.


Regards
Takayuki Tsunakawa





Re: libpq debug log

2021-03-10 Thread Tom Lane
Kyotaro Horiguchi  writes:
> Maybe I'm missing something, but the above doesn't seem working.  I
> didn't see such message when making a SSL connection.

As far as that goes, I don't see any point in trying to trace
connection-related messages, because there is no way to enable
tracing on a connection before it's created.

regards, tom lane




Re: 64-bit XIDs in deleted nbtree pages

2021-03-10 Thread Peter Geoghegan
On Mon, Mar 1, 2021 at 7:25 PM Peter Geoghegan  wrote:
> Attached is v8 of the patch series, which has new patches. No real
> changes compared to v7 for the first patch, though.

Here is another bitrot-fix-only revision, v9. Just the recycling patch again.

I'll commit this when we get your patch committed. Still haven't
decided on exactly how more aggressive we should be. For example the
use of the heap relation within _bt_newly_deleted_pages_recycle()
might have unintended consequences for recycling efficiency with some
workloads, since it doesn't agree with _bt_getbuf() (it is still "more
ambitious" than _bt_getbuf(), at least for now).

-- 
Peter Geoghegan


v9-0001-Recycle-pages-deleted-during-same-VACUUM.patch
Description: Binary data


Re: fdatasync performance problem with large number of DB files

2021-03-10 Thread Thomas Munro
On Thu, Mar 11, 2021 at 2:25 PM Tom Lane  wrote:
> Trolling the net, I found a newer-looking version of the man page,
> and behold it says
>
>In mainline kernel versions prior to 5.8, syncfs() will fail only
>when passed a bad file descriptor (EBADF).  Since Linux 5.8,
>syncfs() will also report an error if one or more inodes failed
>to be written back since the last syncfs() call.
>
> So this means that in less-than-bleeding-edge kernels, syncfs can
> only be regarded as a dangerous toy.  If we expose an option to use
> it, there had better be large blinking warnings in the docs.

Agreed.  Perhaps we could also try to do something programmatic about that.

Its fsync() was also pretty rough for the first 28 years.




Re: fdatasync performance problem with large number of DB files

2021-03-10 Thread Tom Lane
Thomas Munro  writes:
> On Thu, Mar 11, 2021 at 1:16 PM Tom Lane  wrote:
>> I'm a little skeptical about the "simple" part.  At minimum, you'd
>> have to syncfs() each tablespace, since we have no easy way to tell
>> which of them are on different filesystems.  (Although, if we're
>> presuming this is Linux-only, we might be able to tell with some
>> unportable check or other.)

> Right, the patch knows about that:

I noticed that the syncfs man page present in RHEL8 seemed a little
squishy on the critical question of error reporting.  It promises
that syncfs will wait for I/O completion, but it doesn't say in so
many words that I/O errors will be reported (and the list of
applicable errno codes is only EBADF, not very reassuring).

Trolling the net, I found a newer-looking version of the man page,
and behold it says

   In mainline kernel versions prior to 5.8, syncfs() will fail only
   when passed a bad file descriptor (EBADF).  Since Linux 5.8,
   syncfs() will also report an error if one or more inodes failed
   to be written back since the last syncfs() call.

So this means that in less-than-bleeding-edge kernels, syncfs can
only be regarded as a dangerous toy.  If we expose an option to use
it, there had better be large blinking warnings in the docs.

regards, tom lane




Re: fdatasync performance problem with large number of DB files

2021-03-10 Thread Thomas Munro
On Thu, Mar 11, 2021 at 2:00 PM Fujii Masao  wrote:
> On 2021/03/11 8:30, Thomas Munro wrote:
> > I've run into a couple of users who have just commented that recursive
> > fsync() code out!
>
> BTW, we can skip that recursive fsync() by disabling fsync GUC even without
> commenting out the code?

Those users wanted fsync=on because they wanted to recover to a normal
online system after a crash, but they believed that the preceding
fsync of the data directory was useless, because replaying the WAL
should be enough.  IMHO they were nearly on the right track, and the
prototype patch I linked earlier as [2] was my attempt to find the
specific reasons why that doesn't work and fix them.  So far, I
figured out that you still have to remember to fsync the WAL files
(otherwise you're replaying WAL that potentially hasn't reached the
disk), and data files holding blocks that recovery decided to skip due
to BLK_DONE (otherwise you might decide to skip replay because of a
higher LSN that is on a page that is in the kernel's cache but not yet
on disk).




Re: libpq debug log

2021-03-10 Thread Kyotaro Horiguchi
At Wed, 10 Mar 2021 10:31:27 -0300, "'alvhe...@alvh.no-ip.org'" 
 wrote in 
> On 2021-Mar-10, iwata@fujitsu.com wrote:
> 
> > Hi all,
> > 
> > Following all reviewer's advice, I have created a new patch.
> > 
> > In this patch, I add only two tracing entry points; I call 
> > pqTraceOutputMsg(PGconn conn, int msgCursor, PGCommSource commsource) in 
> > pqParseInput3 () and pqPutMsgEnd () to output log.
> > The argument contains message first byte offset called msgCursor because it 
> > is simpler to specify the buffer pointer in the caller. 
> > 
> > In pqTraceOutputMsg(), the content common to all protocol messages (first 
> > timestamp, < or >, first 1 byte, and length) are output first and after 
> > that each protocol message contents is output. I referred to pqParseInput3 
> > () , fe-exec.c and documentation page for that part.
> > 
> > This fix almost eliminates if(conn->Pfdebug) that was built into the code 
> > for other features.
> 
> This certainly looks much better!  Thanks for getting it done so
> quickly.
> 
> I didn't review the code super closely.  I do see a compiler warning:

+1 for the thanks for the quick work.  I have some random comments
after a quick look on it.

The individual per-type output functions are designed to fit the
corresponding pqGet*() functions.  On the other hand, we now read the
message bytes knowing the exact format in advance, as you do in
pqTraceOutputMsg().  So the complexity in those functions can be
eliminated by splitting them into more type-specific output functions.
For example, pqTraceOutputInt() becomes far simpler when split into
pqTraceOutputInt16() and -32().

The output functions copy message bytes into a local variable, but the
same effect can be obtained by just casting via a typed pointer:

  uint32 tmp4;
  ..
  memcpy(&tmp4, buf + *cursor, 4);
  result = (int) pg_ntoh32(tmp4);

can be written as

  result = pg_ntoh32(* (uint32 *) (buf + *cursor));

I think we can get rid of copying in the output functions for String
and Bytes in different ways.


Now that we manage our own reading pointer within pqTraceOutputMsg(),
the per-type output functions can work on just the reading pointer
instead of a buffer pointer, a cursor, and a length.  *I*'d want to
make the output functions move the reading pointer by themselves.
pqTraceOutputMsg() can be simplified as follows by doing this.  An
end-of-message pointer may be useful to check for read overruns on the
message buffer.

  switch (id) {
case 'R':
  pqTraceOutputInt32(, endptr, conn->Pfdebug);
  fputc('\n', conn->Pfdebug);
  break;
case 'K':
  pqTraceOutputInt32(, endptr, conn->Pfdebug);
  pqTraceOutputInt32(, endptr, conn->Pfdebug);
...
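
Concretely, such a pointer-advancing helper might look something like the
sketch below (the exact signature is an assumption, not the posted patch;
pg_ntoh32 comes from port/pg_bswap.h):

/*
 * Sketch only: print a 4-byte big-endian integer at *p and advance *p
 * past it.  "p" is the caller's reading pointer into the message buffer;
 * "end" marks the end of the current message.
 */
static void
pqTraceOutputInt32(const char **p, const char *end, FILE *f)
{
    uint32      tmp4;

    if (*p + 4 > end)
        return;                 /* malformed; let the caller report it */
    memcpy(&tmp4, *p, 4);
    fprintf(f, " %d", (int) pg_ntoh32(tmp4));
    *p += 4;
}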


+   char   *prefix = commsource == MSGDIR_FROM_BACKEND ? "<" : ">";
+   int LogEnd = commsource == MSGDIR_FROM_BACKEND ? conn->inCursor : conn->outMsgEnd;
+   char   *logBuffer = commsource == MSGDIR_FROM_BACKEND ? conn->inBuffer
..
+   if (commsource == MSGDIR_FROM_BACKEND)
+   id = logBuffer[LogCursor++];

Repeated use of the ternary operator on the same condition makes the
code hard to read.  It is simpler to consolidate them into one if-else
statement.


+   result = (int) pg_ntoh32(result32);
+   if (result == NEGOTIATE_SSL_CODE)

Maybe I'm missing something, but the above doesn't seem working.  I
didn't see such message when making a SSL connection.


+   /* Protocol 2.0 does not support tracing. */
+   if (PG_PROTOCOL_MAJOR(conn->pversion) < 3)
+   return;

We can write that message out into the tracing file.

+void
+PQtraceSetFlags(PGconn *conn, int flags)
+{
+   if (conn == NULL)
+   return;
+   /* Protocol 2.0 is not supported. */
+   if (PG_PROTOCOL_MAJOR(conn->pversion) < 3)
+   return;
+   /* If PQtrace() failed, do nothing. */
+   if (conn->Pfdebug == NULL)
+   return;
+   conn->traceFlags = flags;

Pfdebug is always NULL for the V2 protocol there, so couldn't this be an
assertion?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Removing vacuum_cleanup_index_scale_factor

2021-03-10 Thread Peter Geoghegan
On Tue, Mar 9, 2021 at 7:42 PM Peter Geoghegan  wrote:
> My current plan is to commit everything within the next day or two.
> This includes backpatching to Postgres 13 only.

Pushed, thanks.

-- 
Peter Geoghegan




Re: Huge memory consumption on partitioned table with FKs

2021-03-10 Thread Keisuke Kuroda
Hi hackers,

> >
> > I did some cosmetic fooling with this (mostly, rewriting the comments
> > YA time) and pushed it.
>
> Perfect.   Thanks for your time on this.
>

Thank you for your help! I'm glad to solve it.

-- 
Keisuke Kuroda
NTT Software Innovation Center
keisuke.kuroda.3...@gmail.com




RE: Enhance traceability of wal_level changes for backup management

2021-03-10 Thread osumi.takami...@fujitsu.com
On Tuesday, March 9, 2021 11:13 PM David Steele 
> On 3/7/21 9:45 PM, osumi.takami...@fujitsu.com wrote:
> > On Sun, Mar 7, 2021 3:48 AM Peter Eisentraut
>  wrote:
> >> On 28.01.21 01:44, osumi.takami...@fujitsu.com wrote:
>  (1) writing the time or LSN in the control file to indicate
>  when/where wal_level is changed to 'minimal'
>  from upper level to invalidate the old backups or make alerts to users.
> >>> I attached the first patch which implementes this idea.
> >>> It was aligned by pgindent and shows no regression.
> >>
> >> It's not clear to me what this is supposed to accomplish.  I read the
> >> thread, but it's still not clear.
> >> What is one supposed to do with this information?
> > OK. The basic idea is to enable backup management tools to recognize
> > wal_level drop between *snapshots*.
> > When you have a snapshot of the cluster at one time and another one at
> > different time, with this new parameter, you can see if anything that
> > causes discontinuity from the drop happens in the middle of the two
> > snapshots without efforts to have a look at the WALs in between.
> 
> As a backup software author, I don't see this feature as very useful.
> 
> The problem is that there are lots of ways for WAL to go missing so
> monitoring the WAL archive for gaps is essential and this feature would not
> replace that requirement. The only extra information you'd get is the ability 
> to
> classify the most recent gap as "intentional", maybe.
> 
> So, -1 from me.
Thanks for giving me a meaningful viewpoint.
Let me sleep on whether the new parameter really offers no help in any case.


Best Regards,
Takamichi Osumi





Re: Boundary value check in lazy_tid_reaped()

2021-03-10 Thread Masahiko Sawada
On Wed, Mar 10, 2021 at 11:53 PM Peter Eisentraut
 wrote:
>
> On 10.03.21 02:29, Masahiko Sawada wrote:
> >> There is no noticeable effect of inlining lazy_tid_reaped(). So it
> >> would be better to not do that.
> >
> > Attached the patch that doesn't inline lazy_tid_reaped().
>
> committed

Thank you!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: fdatasync performance problem with large number of DB files

2021-03-10 Thread Fujii Masao




On 2021/03/11 8:30, Thomas Munro wrote:

On Thu, Mar 11, 2021 at 11:38 AM Thomas Munro  wrote:

On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
 wrote:

* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?


As discussed on that other thread, I don't think sync() is an option
(it doesn't wait on all OSes or in the standard and it doesn't report
errors).  syncfs() on Linux 5.8+ looks like a good candidate though,
and I think we'd consider a patch like that.  I mean, I even posted
one[1] in that other thread.  There will of course be cases where
that's slower (small database sharing filesystem with other software
that has a lot of dirty data to write back).


Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14.  Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.


+1 to push this kind of change into v14!!


I've run into a couple of users who have just commented that recursive
fsync() code out!


BTW, we can skip that recursive fsync() by disabling fsync GUC even without
commenting out the code?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pgbench - add pseudo-random permutation function

2021-03-10 Thread Bruce Momjian
On Mon, Mar  8, 2021 at 11:50:43AM +0100, Fabien COELHO wrote:
> 
> > > What are your current thoughts?
> > 
> > Thanks for prodding.  I still think it's a useful feature.  However I
> > don't think I'll have to time to get it done on the current commitfest.
> > I suggest to let it sit in the commitfest to see if somebody else will
> > pick it up -- and if not, we move it to the next one, with apologies to
> > author and reviewers.
> > 
> > I may have time to become familiar or at least semi-comfortable with all
> > that weird math in it by then.
> 
> Yep.
> 
> Generating a parametric good-quality low-cost (but not
> cryptographically-secure) pseudo-random permutations on arbitrary sizes (not
> juste power of two sizes) is not a trivial task, I had to be quite creative
> to achieve it, hence the "weird" maths. I had a lot of bad
> not-really-working ideas before the current status of the patch.
> 
> The code could be simplified if we assume that PG_INT128_TYPE will be
> available on all relevant architectures, and accept the feature not to be
> available if not.

Maybe Dean Rasheed can help because of his math background --- CC'ing him.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: shared-memory based stats collector

2021-03-10 Thread Kyotaro Horiguchi
At Wed, 10 Mar 2021 21:47:51 +0900, Fujii Masao  
wrote in 
> Attached is the updated version of the 0003 patch. Barring any
> objection,
> I will commit this patch.
> 
> 
> -#include "storage/latch.h"
> -#include "storage/proc.h"
> 
> I removed these because they are no longer necessary.

Mmm. Sorry for the garbage.

> logical replication worker,
> parallel worker, background
> writer,
> client backend, checkpointer,
> +   archiver,
> startup, walreceiver,
> walsender and walwriter.
> 
> In the document about pg_stat_activity, possible values in
> backend_type
> column are all listed. I added "archiver" into the list.
> 
> BTW, those values were originally listed in alphabetical order,
> but ISTM that they were gotten out of order at a certain moment.
> So they should be listed in alphabetical order again. This should
> be implemented as a separate patch.

Thanks for adding it.

They are also loosely sorted by function or characteristics. I'm not
sure which ordering is better, but in any case they should be ordered
based on a clear criterion.

> -PgArchData *PgArch = NULL;
> +static PgArchData *PgArch = NULL;
> 
> I marked PgArchData as static because it's used only in pgarch.c.

Right.

> - ShmemInitStruct("Archiver ", PgArchShmemSize(), );
> + ShmemInitStruct("Archiver Data", PgArchShmemSize(), );
> 
> I found that the original shmem name ended with unnecessary space
> character.
> I replaced it with "Archiver Data".

Oops. The trailing space is where I stopped writing the string to try
to find a better word; in the meanwhile my attention drifted to
something else and I left it unfinished.  I don't object to "Archiver
Data".  Thanks for completing it.

> In reaper(), I moved the code block for archiver to the original
> location.

Agreed.

> I ran pgindent for the files that the patch modifies.

Yeah, I forgot to add the new struct to typedefs.list.  I intentionally
omitted clearing the newly-acquired shared memory, but doing so does no
harm.

So, I'm fine with it. Thanks for taking this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Improve join selectivity estimation using extended statistics

2021-03-10 Thread Tomas Vondra
Hi Konstantin,

Thanks for working on this! Using extended statistics to improve join
cardinality estimates was definitely on my radar, and this patch seems
like a good start.

I had two basic ideas about how we might improve join estimates:

(a) use per-table extended statistics to estimate join conditions

(b) invent multi-table extended statistics (requires inventing how to
sample the tables in a coordinated way, etc.)

This patch aims to do (a) which is perfectly reasonable - I think we can
achieve significant improvements this way. I have some ideas about (b),
but it seems harder and for a separate thread/patch.


The patch includes some *very* interesting ideas, but I think it does
them too late and at the wrong level of abstraction. I mean that:

1) I don't think the code in clausesel.c should deal with extended
statistics directly - it requires far too much knowledge about different
types of extended stats, what clauses are supported by them, etc.
Allowing stats on expressions will make this even worse.

Better do that in extended_stats.c, like statext_clauselist_selectivity.

2) In clauselist_selectivity_ext, functional dependencies are applied in
the part that processes the remaining clauses, not in the part that
estimates using extended statistics. That seems a bit confusing, and I
suspect it may lead to issues - for example, it only processes the
clauses incrementally, in a particular order. That probably affects the
result, because it affects which functional dependencies we can apply.

In the example query that's not an issue, because it only has two Vars,
so it either can't apply anything (with one Var) or it can apply
everything (with two Vars). But with 3 or more Vars the order would
certainly matter, so it's problematic.


Moreover, it seems a bit strange that this considers dependencies only
on the inner relation. Can't that lead to issues with different join
orders producing different cardinality estimates?


I think a better approach would be either to modify the existing block
dealing with extended stats for a single relation so that it also
handles join conditions, or perhaps to invent a separate block dealing
with *pairs* of relations. Either way, it should deal with *all* join
clauses for that pair of relations at once, not one by one.

As for the exact implementation, I'd imagine the overall logic to be
something like this (for clauses on two joined relations):

- pick a subset of clauses with the same type of extended statistics on
both sides (MCV, ndistinct, ...), repeat until we can't apply more
statistics

- estimate remaining clauses either using functional dependencies or in
the regular (old) way


As for how to use other types of extended statistics, I think eqjoinsel
could serve as an inspiration. We should look for an MCV list and
ndistinct stats on both sides of the join (possibly on some subset of
clauses), and then do the same thing eqjoinsel does, just with multiple
columns.

Note: I'm not sure what to do when we find the stats only on one side.
Perhaps we should assume the other side does not have correlations and
use per-column statistics (seems reasonable), or maybe just not apply
anything (seems a bit too harsh).

Anyway, if there are some non-estimated clauses, we could try applying
functional dependencies similarly to what this patch does. It's also
consistent with statext_clauselist_selectivity - that also tries to
apply MCV lists first, and only then we try functional dependencies.


BTW, should this still rely on oprrest (e.g. F_EQSEL)? That's the
selectivity function for restriction (non-join) clauses, so maybe we
should be looking at oprjoin when dealing with joins? Not sure.


One bit that I find *very* interesting is the calc_joinrel_size_estimate
part, with this comment:

  /*
   * Try to take in account functional dependencies between attributes
   * of clauses pushed-down to joined relations and retstrictlist
   * clause. Right now we consider only case of restrictlist consists of
   * one clause.
   */

If I understand the comment and the code after it, it essentially tries
to apply extended statistics from both the join clauses and filters at
the relation level. That is, with a query like

SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b = 10

we would be looking at statistics on t1(a,b), because we're interested
in estimating conditional probability distribution

   P(t1.a = a? | t1.b = 10)

I think that's extremely interesting and powerful, because it allows us
to "restrict" the multi-column MCV lists; we could probably estimate the
number of distinct "a" values in rows with "b=10" like:

ndistinct(a,b) / ndistinct(b)

and do various interesting stuff like this.
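
For example, with made-up numbers just to illustrate the arithmetic: if the
extended statistics on t1(a,b) say ndistinct(a,b) = 2000 and the per-column
stats say ndistinct(b) = 100, then for the rows matching "b=10" we'd estimate
roughly 2000 / 100 = 20 distinct values of "a", rather than assuming all of
ndistinct(a) values are still possible.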

That will require some improvements to the extended statistics code (to
allow passing a list of conditions), but that's quite doable. I think
the code actually did something like that originally ;-)


Obviously, none of this is achievable for PG14, as we're in the middle
of the last CF. But if you're 

Re: [HACKERS] Custom compression methods

2021-03-10 Thread Justin Pryzby
On Wed, Mar 10, 2021 at 03:50:48PM -0500, Robert Haas wrote:
> On Wed, Mar 10, 2021 at 6:52 AM Dilip Kumar  wrote:
> > The pending comment is providing a way to rewrite a table and
> > re-compress the data with the current compression method.
> 
> I spent some time poking at this yesterday and ran couldn't figure out
> what was going on here. There are two places where we rewrite tables.
> One is the stuff in cluster.c, which handles VACUUM FULL and CLUSTER.
> That eventually calls reform_and_rewrite_tuple(), which deforms the
> old tuple and creates a new one, but it doesn't seem like there's
> anything in there that would expand toasted values, whether external
> or inline compressed. But I think that can't be right, because it
> seems like then you'd end up with toast pointers into the old TOAST
> relation, not the new one, which would cause failures later. So I must
> be missing something here.

I did this the empirical way.

postgres=# CREATE TABLE t (a text compression lz4);
postgres=# INSERT INTO t SELECT repeat('a',9);
postgres=# ALTER TABLE t ALTER a SET COMPRESSION pglz;
postgres=# VACUUM FULL t;
postgres=# SELECT pg_column_compression(a) FROM t;
pg_column_compression | lz4

Actually, a --without-lz4 build can *also* VACUUM FULL the lz4 table.

It looks like VACUUM FULL t processes t but not its toast table (which is
strange to me, since VACUUM processes both, but then autovacuum also processes
them independently).

-- 
Justin




Re: Huge memory consumption on partitioned table with FKs

2021-03-10 Thread Amit Langote
On Thu, Mar 11, 2021 at 4:25 AM Tom Lane  wrote:
> Amit Langote  writes:
> > On Wed, Mar 10, 2021 at 8:37 AM Tom Lane  wrote:
> >> Hmm.  So, the key point is that the values coming from the partitioned
> >> child table are injected into the test query as parameters, not as
> >> column references, thus it doesn't matter *to the test query* what
> >> numbers the referencing columns have in that child.  We just have to
> >> be sure we pass the right parameter values.
>
> > Right.
>
> I did some cosmetic fooling with this (mostly, rewriting the comments
> YA time) and pushed it.

Perfect.   Thanks for your time on this.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-03-10 Thread Masahiro Ikeda

On 2021-03-10 17:08, Fujii Masao wrote:

On 2021/03/10 14:11, Masahiro Ikeda wrote:

On 2021-03-09 17:51, Fujii Masao wrote:

On 2021/03/05 8:38, Masahiro Ikeda wrote:

On 2021-03-05 01:02, Fujii Masao wrote:

On 2021/03/04 16:14, Masahiro Ikeda wrote:

On 2021-03-03 20:27, Masahiro Ikeda wrote:

On 2021-03-03 16:30, Fujii Masao wrote:

On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
 wrote:


I pgindented the patches.


... XLogWrite, which is invoked during 
an
XLogFlush request (see ...).  This is 
also

incremented by the WAL receiver during replication.

("which normally called" should be "which is normally 
called" or
"which normally is called" if you want to keep true to the 
original)
You missed the adding the space before an opening 
parenthesis here and

elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly 
query the

operating system..."
", because" -> "as"


Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This 
is also

incremented by the WAL receiver during replication."


I skipped changing it since I separated the stats for the WAL 
receiver

in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this 
event is
reported in wal_buffers_full in) This is undesirable 
because ..."


Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require 
explicitly
computing the sync statistics but does require computing the 
write
statistics.  This is because of the presence of 
issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite.  Additionally, I 
observe that
the XLogWrite code path calls pgstat_report_wait_*() while 
the WAL
receiver path does not.  It seems technically 
straight-forward to
refactor here to avoid the almost-duplicated logic in the 
two places,
though I suspect there may be a trade-off for not adding 
another
function call to the stack given the importance of WAL 
processing
(though that seems marginalized compared to the cost of 
actually
writing the WAL).  Or, as Fujii noted, go the other way and 
don't have
any shared code between the two but instead implement the 
WAL receiver
one to use pg_stat_wal_receiver instead.  In either case, 
this

half-and-half implementation seems undesirable.


OK, as Fujii-san mentioned, I separated the WAL receiver 
stats.

(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)


Thanks for updating the patches!


I added the infrastructure code to communicate the WAL 
receiver stats messages between the WAL receiver and the 
stats collector, and
the stats for WAL receiver is counted in 
pg_stat_wal_receiver.

What do you think?


On second thought, this idea seems not good. Because those 
stats are

collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver 
process running
at that moment. IOW, it seems strange that some values show 
dynamic
stats and the others show collected stats, even though they 
are in

the same view pg_stat_wal_receiver. Thought?


OK, I fixed it.
The stats collected in the WAL receiver is exposed in 
pg_stat_wal view in v11 patch.


Thanks for updating the patches! I'm now reading 001 patch.

+    /* Check whether the WAL file was synced to disk right now 
*/

+    if (enableFsync &&
+    (sync_method == SYNC_METHOD_FSYNC ||
+ sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+ sync_method == SYNC_METHOD_FDATASYNC))
+    {

Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or 
open_data_sync,

to simplify the code more?


Thanks for the comments.
I added the above code in v12 patch.
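
(For reference, a minimal sketch of the shape of that early return -- the
symbol names are from xlog.c/xlogdefs.h, but the actual v12 patch may differ:)

void
issue_xlog_fsync(int fd, XLogSegNo segno)
{
    /*
     * Nothing to do if fsync is disabled, or if the WAL file was opened
     * with O_SYNC/O_DSYNC so the write itself already synced the data.
     */
    if (!enableFsync ||
        sync_method == SYNC_METHOD_OPEN ||
        sync_method == SYNC_METHOD_OPEN_DSYNC)
        return;

    /* ... time the sync and issue fsync/fdatasync as before ... */
}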



+    /*
+ * Send WAL statistics only if WalWriterDelay has 
elapsed to minimize

+ * the overhead in WAL-writing.
+ */
+    if (rc & WL_TIMEOUT)
+    pgstat_send_wal();

On second thought, this change means that it always takes 
wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() 
is called.
For example, if wal_writer_delay is set to several seconds, some 
values in
pg_stat_wal would be not up-to-date meaninglessly for those 
seconds.
So I'm thinking to withdraw my previous comment and it's ok to 
send

the stats every after XLogBackgroundFlush() is called. Thought?


Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.


Yeah, if wal_writer_delay is set to very small value, there is a 
risk
that the WAL stats are sent too frequently. I agree that's a 
problem.




Why don't to make another way to check the timestamp?

+   /*
+    * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+   

Re: fdatasync performance problem with large number of DB files

2021-03-10 Thread Thomas Munro
On Thu, Mar 11, 2021 at 1:16 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Thinking about this some more, if you were to propose a patch like
> > that syncfs() one but make it a configurable option, I'd personally be
> > in favour of trying to squeeze it into v14.  Others might object on
> > commitfest procedural grounds, I dunno, but I think this is a real
> > operational issue and that's a fairly simple and localised change.
> > I've run into a couple of users who have just commented that recursive
> > fsync() code out!
>
> I'm a little skeptical about the "simple" part.  At minimum, you'd
> have to syncfs() each tablespace, since we have no easy way to tell
> which of them are on different filesystems.  (Although, if we're
> presuming this is Linux-only, we might be able to tell with some
> unportable check or other.)

Right, the patch knows about that:

+/*
+ * On Linux, we don't have to open every single file one by one.  We can
+ * use syncfs() to sync whole filesystems.  We only expect filesystem
+ * boundaries to exist where we tolerate symlinks, namely pg_wal and the
+ * tablespaces, so we call syncfs() for each of those directories.
+ */
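
(For readers unfamiliar with syncfs(): it takes an open file descriptor and
syncs the whole filesystem containing it, so each such call boils down to
roughly the following -- a sketch, not the actual patch:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Sketch: ask the kernel to sync the whole filesystem containing "path".
 * Real code would report open()/syncfs() failures; note that kernels
 * before 5.8 only ever report EBADF from syncfs().
 */
static void
do_syncfs_sketch(const char *path)
{
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return;
    (void) syncfs(fd);
    close(fd);
}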




Re: fdatasync performance problem with large number of DB files

2021-03-10 Thread Tom Lane
Thomas Munro  writes:
> Thinking about this some more, if you were to propose a patch like
> that syncfs() one but make it a configurable option, I'd personally be
> in favour of trying to squeeze it into v14.  Others might object on
> commitfest procedural grounds, I dunno, but I think this is a real
> operational issue and that's a fairly simple and localised change.
> I've run into a couple of users who have just commented that recursive
> fsync() code out!

I'm a little skeptical about the "simple" part.  At minimum, you'd
have to syncfs() each tablespace, since we have no easy way to tell
which of them are on different filesystems.  (Although, if we're
presuming this is Linux-only, we might be able to tell with some
unportable check or other.)

regards, tom lane




Re: fdatasync performance problem with large number of DB files

2021-03-10 Thread Thomas Munro
On Thu, Mar 11, 2021 at 11:38 AM Thomas Munro  wrote:
> On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
>  wrote:
> > * is there a knob missing we can configure?
> > * can we get an opt-in knob to use a single sync() call instead of a
> > recursive fsync()?
> > * would you be open to merging a patch providing said knob?
> > * is there something else we missed?
>
> As discussed on that other thread, I don't think sync() is an option
> (it doesn't wait on all OSes or in the standard and it doesn't report
> errors).  syncfs() on Linux 5.8+ looks like a good candidate though,
> and I think we'd consider a patch like that.  I mean, I even posted
> one[1] in that other thread.  There will of course be cases where
> that's slower (small database sharing filesystem with other software
> that has a lot of dirty data to write back).

Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14.  Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.
I've run into a couple of users who have just commented that recursive
fsync() code out!

I'd probably make it an enum-style GUC, because I intend to do some
more work on my "precise" alternative, though not in time for this
release, and it could just as well be an option too.




Re: libpq debug log

2021-03-10 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-Mar-10, Tom Lane wrote:
>> After studying this further, I think we should apply the attached
>> patch to remove that responsibility from pqParseInput3's subroutines.
>> This will allow a single trace call near the bottom of pqParseInput3
>> to handle all cases that that function processes.

> Works for me.

I dug into the git history a little and concluded that the reason for
doing that in the subroutines is that 92785dac2 made it so for this
reason:

+   /*
+* Advance inStart to show that the "D" message has been processed.  We
+* must do this before calling the row processor, in case it longjmps.
+*/
+   conn->inStart = conn->inCursor;

We later gave up on allowing user-defined row processor callbacks,
but we failed to undo this messiness.  Looking at what 92785dac2 did
confirms something else I'd been eyeing, which is why these subroutines
have their own checks for having consumed the right amount of data
instead of letting the master check in pqParseInput3 take care of it.
They didn't use to do that, but that check was hoisted up to before the
row processor call, so we wouldn't expose data from a corrupt message to
user code.  So I think we can undo that too.

92785dac2 only directly hacked getRowDescriptions and getAnotherTuple,
but it looks like similar error handling was stuck into
getParamDescriptions later.

regards, tom lane




Re: PROXY protocol support

2021-03-10 Thread Jacob Champion
On Tue, 2021-03-09 at 11:25 +0100, Magnus Hagander wrote:
> I've also added some trivial tests (man that took an ungodly amount of
> fighting perl -- it's clearly been a long time since I used perl
> properly).

Yeah. The tests I'm writing for this and NSS have been the same way;
it's a real problem. I'm basically writing supplemental tests in Python
as the "daily driver", then trying to port whatever is easiest (not
much) into Perl, when I get time.

== More Notes ==

Some additional spec-compliance stuff:

>   /* Lower 4 bits hold type of connection */
>   if (proxyheader.fam == 0)
>   {
>   /* LOCAL connection, so we ignore the address included */
>   }

(fam == 0) is the UNSPEC case, which isn't necessarily LOCAL. We have
to do something different for the LOCAL case:

> - \x0 : LOCAL : [...] The receiver must accept this connection as
> valid and must use the real connection endpoints and discard the
> protocol block including the family which is ignored.

So we should ignore the entire "protocol block" (by which I believe
they mean the protocol-and-address-family byte) in the case of LOCAL,
and just accept it with the original address info intact. That seems to
match the sample code in the back of the spec. The current behavior in
the patch will apply the PROXY behavior incorrectly if the sender sends
a LOCAL header with something other than UNSPEC -- which is strange
behavior but not explicitly prohibited as far as I can see.

We also need to reject all connections that aren't either LOCAL or
PROXY commands:

> - other values are unassigned and must not be emitted by senders.
> Receivers must drop connections presenting unexpected values here.

...and naturally it'd be Nice (tm) if the tests covered those corner
cases.
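
Putting those two rules together, the command handling could look roughly
like this (the "ver_cmd" field name is an assumption about the patch's header
struct, not its actual naming):

    int         cmd = proxyheader.ver_cmd & 0x0F;   /* low 4 bits = command */

    if (cmd == 0x0)
    {
        /*
         * LOCAL: accept the connection, keep the real socket endpoints,
         * and skip the address block entirely, whatever the family says.
         */
    }
    else if (cmd == 0x1)
    {
        /* PROXY: decode the address block according to proxyheader.fam. */
    }
    else
    {
        /* Unassigned command: the spec requires dropping the connection. */
    }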

Over on the struct side:

> + struct
> + {   /* for TCP/UDP 
> over IPv4, len = 12 */
> + uint32  src_addr;
> + uint32  dst_addr;
> + uint16  src_port;
> + uint16  dst_port;
> + }   ip4;
> ... snip ...
> + /* TCPv4 */
> + if (proxyaddrlen < 12)
> + {

Given the importance of these hardcoded lengths matching reality, is it
possible to add some static assertions to make sure that sizeof() == 12 and so on? That would also save any poor souls who are
using compilers with nonstandard struct-packing behavior.
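
For example, something along these lines near the top of the header-parsing
function (StaticAssertStmt is from c.h; the struct and member names here are
placeholders for whatever the patch actually calls them):

    /* Placeholder names: adjust to the patch's actual struct/union. */
    StaticAssertStmt(sizeof(((proxy_v2_header *) 0)->addr.ip4) == 12,
                     "PROXY v2 IPv4 address block must be 12 bytes");
    /* ...and similarly for the IPv6 and Unix-socket address blocks. */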

--Jacob


Re: Columns correlation and adaptive query optimization

2021-03-10 Thread Tomas Vondra
On 3/10/21 3:00 AM, Tomas Vondra wrote:
> Hello Konstantin,
> 
> 
> Sorry for not responding to this thread earlier. I definitely agree the
> features proposed here are very interesting and useful, and I appreciate
> you kept rebasing the patch.
> 
> I think the patch improving join estimates can be treated as separate,
> and I see it already has a separate CF entry - it however still points
> to this thread, which will be confusing. I suggest we start a different
> thread for it, to keep the discussions separate.
> 

D'oh! I must have been confused yesterday, because now I see there
already is a separate thread [1] for the join selectivity patch. So you
can ignore this.

regards

[1]
https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc...@postgrespro.ru

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Speeding up GIST index creation for tsvectors

2021-03-10 Thread John Naylor
I wrote:
> I'll post a patch soon that builds on that, so you can see what I mean.

I've attached where I was imagining this heading, as a text file to avoid
distracting the cfbot. Here are the numbers I got with your test on the
attached, as well as your 0001, on x86-64 Clang 10, default siglen:

master:
739ms

v3-0001
692ms

attached POC
665ms

The small additional speed up is not worth it, given the code churn and
complexity, so I don't want to go this route after all. I think the way to
go is a simplified version of your 0001 (not 0002), with only a single
function, for gist and intarray only, and a style that better matches the
surrounding code. If you look at my xor functions in the attached text
file, you'll get an idea of what it should look like. Note that it got the
above performance without ever trying to massage the pointer alignment. I'm
a bit uncomfortable with the fact that we can't rely on alignment, but
maybe there's a simple fix somewhere in the gist code.
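
(To make the shape of the POC concrete: the pg_popcount_mask64() calls in the
diff below imply a helper roughly like the following -- a sketch, not the
attached text file:)

/*
 * Sketch: count the 1-bits under "mask" across a buffer of uint64 words.
 * "bytes" is the buffer size in bytes and must be a multiple of 8.
 * pg_popcount64() is the existing helper from port/pg_bitutils.h.
 */
static uint64
pg_popcount_mask64(const uint64 *words, size_t bytes, uint64 mask)
{
    uint64      popcnt = 0;

    for (size_t i = 0; i < bytes / sizeof(uint64); i++)
        popcnt += pg_popcount64(words[i] & mask);

    return popcnt;
}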

--
John Naylor
EDB: http://www.enterprisedb.com
 src/backend/access/heap/visibilitymap.c |  13 +-
 src/backend/nodes/bitmapset.c   |  23 +--
 src/backend/utils/adt/tsgistidx.c   |  14 +-
 src/include/port/pg_bitutils.h  |  12 +-
 src/port/pg_bitutils.c  | 240 ++--
 5 files changed, 214 insertions(+), 88 deletions(-)

diff --git a/src/backend/access/heap/visibilitymap.c 
b/src/backend/access/heap/visibilitymap.c
index e198df65d8..d910eb2875 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -388,7 +388,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, 
BlockNumber *all_fro
{
Buffer  mapBuffer;
uint64 *map;
-   int i;
 
/*
 * Read till we fall off the end of the map.  We assume that 
any extra
@@ -409,17 +408,11 @@ visibilitymap_count(Relation rel, BlockNumber 
*all_visible, BlockNumber *all_fro
StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
 "unsupported MAPSIZE");
if (all_frozen == NULL)
-   {
-   for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
-   nvisible += pg_popcount64(map[i] & 
VISIBLE_MASK64);
-   }
+   nvisible = pg_popcount_mask64(map, MAPSIZE, 
VISIBLE_MASK64);
else
{
-   for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
-   {
-   nvisible += pg_popcount64(map[i] & 
VISIBLE_MASK64);
-   nfrozen += pg_popcount64(map[i] & 
FROZEN_MASK64);
-   }
+   nvisible = pg_popcount_mask64(map, MAPSIZE, 
VISIBLE_MASK64);
+   nfrozen = pg_popcount_mask64(map, MAPSIZE, 
FROZEN_MASK64);
}
 
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 649478b0d4..5368c72255 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -452,7 +452,6 @@ bms_is_member(int x, const Bitmapset *a)
 int
 bms_member_index(Bitmapset *a, int x)
 {
-   int i;
int bitnum;
int wordnum;
int result = 0;
@@ -466,14 +465,8 @@ bms_member_index(Bitmapset *a, int x)
bitnum = BITNUM(x);
 
/* count bits in preceding words */
-   for (i = 0; i < wordnum; i++)
-   {
-   bitmapword  w = a->words[i];
-
-   /* No need to count the bits in a zero word */
-   if (w != 0)
-   result += bmw_popcount(w);
-   }
+   result = pg_popcount((const char *) a->words,
+wordnum * BITS_PER_BITMAPWORD 
/ BITS_PER_BYTE);
 
/*
 * Now add bits of the last word, but only those before the item. We can
@@ -645,22 +638,14 @@ bms_get_singleton_member(const Bitmapset *a, int *member)
 int
 bms_num_members(const Bitmapset *a)
 {
-   int result = 0;
int nwords;
-   int wordnum;
 
if (a == NULL)
return 0;
nwords = a->nwords;
-   for (wordnum = 0; wordnum < nwords; wordnum++)
-   {
-   bitmapword  w = a->words[wordnum];
 
-   /* No need to count the bits in a zero word */
-   if (w != 0)
-   result += bmw_popcount(w);
-   }
-   return result;
+   return pg_popcount((const char *) a->words,
+  nwords * BITS_PER_BITMAPWORD / 
BITS_PER_BYTE);
 }
 
 /*
diff --git a/src/backend/utils/adt/tsgistidx.c 

Re: fdatasync performance problem with large number of DB files

2021-03-10 Thread Thomas Munro
On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
 wrote:
> * pg_basebackup receives a streaming backup (via [2] fsync_dir_recurse
> or fsync_pgdata) unless --no-sync is specified
> * postgres starts up unclean (via [3] SyncDataDirectory)
>
> We run multiple postgres clusters and some of those clusters have many
> (~450) databases (one database-per-customer) meaning that the postgres
> data directory has around 700,000 files.
>
> On one of our less loaded servers this takes ~7 minutes to complete, but
> on another [4] this takes ~90 minutes.

Ouch.

> My questions are:
>
> * is there a knob missing we can configure?
> * can we get an opt-in knob to use a single sync() call instead of a
> recursive fsync()?
> * would you be open to merging a patch providing said knob?
> * is there something else we missed?

As discussed on that other thread, I don't think sync() is an option
(it doesn't wait on all OSes or in the standard and it doesn't report
errors).  syncfs() on Linux 5.8+ looks like a good candidate though,
and I think we'd consider a patch like that.  I mean, I even posted
one[1] in that other thread.  There will of course be cases where
that's slower (small database sharing filesystem with other software
that has a lot of dirty data to write back).

I also wrote a WAL-and-checkpoint based prototype[2], which, among
other advantages such as being faster, not ignoring errors and not
triggering collateral write-back storms, happens to work on all
operating systems.  On the other hand it requires a somewhat dogmatic
switch in thinking about the meaning of checkpoints (I mean, it
requires humans to promise not to falsify checkpoints by copying
databases around underneath us), which may be hard to sell (I didn't
try very hard), and there may be subtleties I have missed...

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGKT6XiPiEJrqeOFGi7RYCGzbBysF9pyWwv0-jm-oNajxg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CA%2BhUKGKHhDNnN6fxf6qrAx9h%2BmjdNU2Zmx7ztJzFQ0C5%3Du3QPg%40mail.gmail.com




Re: automatic analyze: readahead - add "IO read time" log message

2021-03-10 Thread Tomas Vondra
On 3/8/21 8:42 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
>> On 2/10/21 11:10 PM, Stephen Frost wrote:
>>> * Heikki Linnakangas (hlinn...@iki.fi) wrote:
 On 05/02/2021 23:22, Stephen Frost wrote:
> Unless there's anything else on this, I'll commit these sometime next
> week.

 One more thing: Instead of using 'effective_io_concurrency' GUC directly,
 should call get_tablespace_maintenance_io_concurrency().
>>>
>>> Ah, yeah, of course.
>>>
>>> Updated patch attached.
>>
>> A couple minor comments:
>>
>> 1) I think the patch should be split into two parts, one adding the
>> track_io_timing, one adding the prefetching.
> 
> This was already done..
> 

Not sure what you mean by "done"? I see the patch still does both
changes related to track_io_timing and prefetching.

>> 2) I haven't tried but I'm pretty sure there'll be a compiler warning
>> about 'prefetch_maximum' being unused without USE_PREFETCH defined.
> 
> Ah, that part is likely true, moved down into the #ifdef block to
> address that, which also is good since it should avoid mistakenly using
> it outside of the #ifdef's later on by mistake too.
> 

OK

>> 3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?
> 
> Perhaps..
> 
>> This makes the code rather hard to read, IMHO. It seems to me we can
>> move the code around a bit and merge some of the #ifdef blocks - see the
>> attached patch. Most of this is fairly trivial, with the exception of
>> moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
>> does not materially change the behavior, but perhaps I'm wrong.
> 
> but I don't particularly like doing the prefetch right before we run
> vacuum_delay_point() and potentially sleep.
> 

Why? Is that just a matter of personal preference (fair enough) or is
there a reason why that would be wrong/harmful?

I think e.g. prefetch_targblock could be moved to the next #ifdef, which
will eliminate the one-line ifdef.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: libpq debug log

2021-03-10 Thread Alvaro Herrera
On 2021-Mar-10, Tom Lane wrote:

> After studying this further, I think we should apply the attached
> patch to remove that responsibility from pqParseInput3's subroutines.
> This will allow a single trace call near the bottom of pqParseInput3
> to handle all cases that that function processes.

Works for me.

> BTW, while looking at this I concluded that getParamDescriptions
> is actually buggy: if it's given a malformed message that seems
> to need more data than the message length specifies, it just
> returns EOF, resulting in an infinite loop.  This function apparently
> got missed while converting the logic from v2 (where that was the
> right thing) to v3 (where it ain't).  So that bit needs to be
> back-patched.

Ah, that makes sense.

> I'm tempted to back-patch the whole thing though.

+0.5 on that.  I think we may be happy that all branches are alike
(though it doesn't look like this will cause any subtle bugs -- breakage
will be fairly obvious).

-- 
Álvaro Herrera39°49'30"S 73°17'W
"Most hackers will be perfectly comfortable conceptualizing users as entropy
 sources, so let's move on."   (Nathaniel Smith)




Re: GiST comment improvement

2021-03-10 Thread Bruce Momjian
On Tue, Mar  2, 2021 at 11:40:21AM -0500, Bruce Momjian wrote:
> In talking to Teodor this week, I have written the attached C comment
> patch which improves our description of GiST's NSN and GistBuildLSN
> values.

Patch applied.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





fdatasync performance problem with large number of DB files

2021-03-10 Thread Michael Brown
I initially posted this on the pgsql-general mailing list [5] but was
advised to also post this to the -hackers list as it deals with internals.

We've encountered a production performance problem with pg13 related to
how it fsyncs the whole data directory in certain scenarios, similar to
what Paul (bcc'ed) described in a post to pgsql-hackers [1].

Background:

We've observed the full recursive fsync is triggered when

* pg_basebackup receives a streaming backup (via [2] fsync_dir_recurse
or fsync_pgdata) unless --no-sync is specified
* postgres starts up unclean (via [3] SyncDataDirectory)

We run multiple postgres clusters and some of those clusters have many
(~450) databases (one database-per-customer) meaning that the postgres
data directory has around 700,000 files.

On one of our less loaded servers this takes ~7 minutes to complete, but
on another [4] this takes ~90 minutes.

Obviously this is an untenable risk. We've modified our process that
bootstraps a replica via pg_basebackup to instead do "pg_basebackup
--no-sync…" followed by a "sync", but we don't have any way to do the
equivalent for postgres startup.

I presume the reason postgres doesn't blindly run a sync() is that we
don't know what other I/O is on the system and it'd be rude to affect
other services. That makes sense, except that in our environment the
work done by the recursive fsync is orders of magnitude more disruptive
than a sync().

My questions are:

* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?

Thanks!

[1]:
https://www.postgresql.org/message-id/flat/caeet0zhgnbxmi8yf3ywsdzvb3m9cbdsgzgftxscq6agcbzc...@mail.gmail.com
[2]:
https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/pg_basebackup.c#L2181
[3]:
https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L6495
[4]: It should be identical config-wise. It isn't starved for IO but
does have other regular write workloads
[5]:
https://www.postgresql-archive.org/fdatasync-performance-problem-with-large-number-of-DB-files-td6184094.html

-- 
Michael Brown
Civilized Discourse Construction Kit, Inc.
https://www.discourse.org/





Re: Replace buffer I/O locks with condition variables (reviving an old patch)

2021-03-10 Thread Thomas Munro
On Tue, Mar 9, 2021 at 6:24 PM Julien Rouhaud  wrote:
> > The old I/O lock array was the only user of struct
> > LWLockMinimallyPadded, added in commit 6150a1b08a9, and it seems kinda
> > strange to leave it in the tree with no user.  Of course it's remotely
> > possible there are extensions using it (know of any?).  In the
> > attached, I've ripped that + associated commentary out, because it's
> > fun to delete dead code.  Objections?
>
> None from me.  I don't know of any extension relying on it, and neither does
> codesearch.debian.net.  I would be surprised to see any extension actually
> relying on that anyway.

Thanks for checking!

> > Since the whole reason for that out-of-line array in the first place
> > was to keep BufferDesc inside one cache line, and since it is in fact
> > possible to put a new condition variable into BufferDesc without
> > exceeding 64 bytes on a 64 bit x86 box, perhaps we should just do that
> > instead?  I haven't yet considered other architectures or potential
> > member orders.
>
> +1 for adding the cv into BufferDesc.  That brings the struct size to exactly
> 64 bytes on x86 64 bits architecture.  This won't add any extra overhead to
> LOCK_DEBUG cases, as it was already exceeding the 64B threshold, if that even
> was a concern.

I also checked that it's 64B on an Arm box.  Not sure about POWER.
But... despite the fact that it looks like a good change in isolation,
I decided to go back to the separate array in this initial commit,
because the AIO branch also wants to add a new BufferDesc member[1].
I may come back to that change, if we can make some more space (seems
entirely doable, but I'd like to look into that separately).

> > I wonder if we should try to preserve user experience a little harder,
> > for the benefit of people who have monitoring queries that look for
> > this condition.  Instead of inventing a new wait_event value, let's
> > just keep showing "BufferIO" in that column.  In other words, the
> > change is that wait_event_type changes from "LWLock" to "IPC", which
> > is a pretty good summary of this patch.  Done in the attached.  Does
> > this make sense?
>
> I think it does make sense, and it's good to preserve this value.
>
> Looking at the patch itself, I don't have much to add it all looks sensible 
> and
> I agree with the arguments in the first mail.  All regression tests pass and
> documentation builds.

I found one more thing to tweak: a reference in the README.

> I'm marking this patch as RFC.

Thanks for the review.  And of course to Robert for writing the patch.  Pushed.

[1] 
https://github.com/anarazel/postgres/blob/aio/src/include/storage/buf_internals.h#L190




Re: libpq debug log

2021-03-10 Thread Tom Lane
I wrote:
> Or we could rethink the logic.  It's not quite clear to me, after
> all this time, why getRowDescriptions() et al are allowed to
> move inStart themselves rather than letting the main loop in
> pqParseInput3 do it.  It might well be an artifact of having not
> rewritten the v2 logic too much.

After studying this further, I think we should apply the attached
patch to remove that responsibility from pqParseInput3's subroutines.
This will allow a single trace call near the bottom of pqParseInput3
to handle all cases that that function processes.

There are still some cowboy advancements of inStart in the functions
concerned with COPY processing, but it doesn't look like there's
much to be done about that.  Those spots will need their own trace
calls.

BTW, while looking at this I concluded that getParamDescriptions
is actually buggy: if it's given a malformed message that seems
to need more data than the message length specifies, it just
returns EOF, resulting in an infinite loop.  This function apparently
got missed while converting the logic from v2 (where that was the
right thing) to v3 (where it ain't).  So that bit needs to be
back-patched.  I'm tempted to back-patch the whole thing though.

regards, tom lane

diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c
index 2ca8c057b9..233456fd90 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -290,8 +290,6 @@ pqParseInput3(PGconn *conn)
 		/* First 'T' in a query sequence */
 		if (getRowDescriptions(conn, msgLength))
 			return;
-		/* getRowDescriptions() moves inStart itself */
-		continue;
 	}
 	else
 	{
@@ -337,8 +335,7 @@ pqParseInput3(PGconn *conn)
 case 't':		/* Parameter Description */
 	if (getParamDescriptions(conn, msgLength))
 		return;
-	/* getParamDescriptions() moves inStart itself */
-	continue;
+	break;
 case 'D':		/* Data Row */
 	if (conn->result != NULL &&
 		conn->result->resultStatus == PGRES_TUPLES_OK)
@@ -346,8 +343,6 @@ pqParseInput3(PGconn *conn)
 		/* Read another tuple of a normal query response */
 		if (getAnotherTuple(conn, msgLength))
 			return;
-		/* getAnotherTuple() moves inStart itself */
-		continue;
 	}
 	else if (conn->result != NULL &&
 			 conn->result->resultStatus == PGRES_FATAL_ERROR)
@@ -462,7 +457,6 @@ handleSyncLoss(PGconn *conn, char id, int msgLength)
  * command for a prepared statement) containing the attribute data.
  * Returns: 0 if processed message successfully, EOF to suspend parsing
  * (the latter case is not actually used currently).
- * In the former case, conn->inStart has been advanced past the message.
  */
 static int
 getRowDescriptions(PGconn *conn, int msgLength)
@@ -577,9 +571,6 @@ getRowDescriptions(PGconn *conn, int msgLength)
 	/* Success! */
 	conn->result = result;
 
-	/* Advance inStart to show that the "T" message has been processed. */
-	conn->inStart = conn->inCursor;
-
 	/*
 	 * If we're doing a Describe, we're done, and ready to pass the result
 	 * back to the client.
@@ -603,9 +594,6 @@ advance_and_error:
 	if (result && result != conn->result)
 		PQclear(result);
 
-	/* Discard the failed message by pretending we read it */
-	conn->inStart += 5 + msgLength;
-
 	/*
 	 * Replace partially constructed result with an error result. First
 	 * discard the old result to try to win back some memory.
@@ -624,6 +612,12 @@ advance_and_error:
 	appendPQExpBuffer(>errorMessage, "%s\n", errmsg);
 	pqSaveErrorResult(conn);
 
+	/*
+	 * Show the message as fully consumed, else pqParseInput3 will overwrite
+	 * our error with a complaint about that.
+	 */
+	conn->inCursor = conn->inStart + 5 + msgLength;
+
 	/*
 	 * Return zero to allow input parsing to continue.  Subsequent "D"
 	 * messages will be ignored until we get to end of data, since an error
@@ -635,12 +629,8 @@ advance_and_error:
 /*
  * parseInput subroutine to read a 't' (ParameterDescription) message.
  * We'll build a new PGresult structure containing the parameter data.
- * Returns: 0 if completed message, EOF if not enough data yet.
- * In the former case, conn->inStart has been advanced past the message.
- *
- * Note that if we run out of data, we have to release the partially
- * constructed PGresult, and rebuild it again next time.  Fortunately,
- * that shouldn't happen often, since 't' messages usually fit in a packet.
+ * Returns: 0 if processed message successfully, EOF to suspend parsing
+ * (the latter case is not actually used currently).
  */
 static int
 getParamDescriptions(PGconn *conn, int msgLength)
@@ -690,23 +680,16 @@ getParamDescriptions(PGconn *conn, int msgLength)
 	/* Success! */
 	conn->result = result;
 
-	/* Advance inStart to show that the "t" message has been processed. */
-	conn->inStart = conn->inCursor;
-
 	return 0;
 
 not_enough_data:
-	PQclear(result);
-	return EOF;
+	errmsg = 

Re: libpq debug log

2021-03-10 Thread 'alvhe...@alvh.no-ip.org'
On 2021-Mar-10, Tom Lane wrote:

> Or we could rethink the logic.  It's not quite clear to me, after
> all this time, why getRowDescriptions() et al are allowed to
> move inStart themselves rather than letting the main loop in
> pqParseInput3 do it.  It might well be an artifact of having not
> rewritten the v2 logic too much.

I would certainly prefer that the logic stays put for the time being,
while I finalize the pipelining stuff.

-- 
Álvaro Herrera   Valdivia, Chile




Re: libpq debug log

2021-03-10 Thread 'alvhe...@alvh.no-ip.org'
On 2021-Mar-10, Tom Lane wrote:

> "'alvhe...@alvh.no-ip.org'"  writes:
> > After staring at it a couple of times, I think that the places in
> > pqParseInput3() where there's a comment "... moves inStart itself" and
> > then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU
> > those are the places where the message in question would not reach the
> > "Successfully consumed this message" block that prints each message.
> 
> Yeah, the whole business of just when a message has been "consumed"
> is a stumbling block for libpq tracing.  It was a real mess with the
> existing code and v2 protocol, because we could actually try to parse
> a message more than once, with the first few tries deciding that the
> message wasn't all here yet (but nonetheless emitting partial trace
> output).

Hmm, that makes sense, but the issue I'm reporting is not the same,
unless I misunderstand you.

> Now that v2 is dead, the logic to abort because of the message not
> being all arrived yet is basically useless: only the little bit of
> code concerned with the message length word really needs to cope with
> that.  It's tempting to go through and get rid of all the now-unreachable
> "return"s and such, but it seemed like it would be a lot of code churn for
> not really that much gain.

That sounds like an interesting exercise, and I bet it'd bring a lot of
code readability improvements.

> I didn't look at the new version of the patch yet, so I'm not
> sure whether the issues it still has are related to this.

The issues I noticed are related to separate messages rather than one
message split in pieces -- for example several DataRow messages are
processed internally in a loop, rather than each individually.  The
number of places that need to be adjusted for things to AFAICT work
correctly are few enough; ISTM that the attached patch is sufficient.

(The business with the "logged" boolean is necessary so that we print to
the trace file any of those messages even if they are deviating from the
protocol.)

-- 
Álvaro Herrera   Valdivia, Chile
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)




Re: [HACKERS] Custom compression methods

2021-03-10 Thread Robert Haas
On Wed, Mar 10, 2021 at 6:52 AM Dilip Kumar  wrote:
> The pending comment is providing a way to rewrite a table and
> re-compress the data with the current compression method.

I spent some time poking at this yesterday and couldn't figure out
what was going on here. There are two places where we rewrite tables.
One is the stuff in cluster.c, which handles VACUUM FULL and CLUSTER.
That eventually calls reform_and_rewrite_tuple(), which deforms the
old tuple and creates a new one, but it doesn't seem like there's
anything in there that would expand toasted values, whether external
or inline compressed. But I think that can't be right, because it
seems like then you'd end up with toast pointers into the old TOAST
relation, not the new one, which would cause failures later. So I must
be missing something here. The other place where we rewrite tables is
in ATRewriteTable() as part of the ALTER TABLE machinery. I don't see
anything there to force detoasting either.

That said, I think that using the word REWRITE may not really capture
what we're on about. Leaving aside the question of exactly what the
CLUSTER code does today, you could in theory rewrite the main table by
just taking all the tuples and putting them into a new relfilenode.
And then you could do the same thing with the TOAST table. And despite
having fully rewritten both tables, you wouldn't have done anything
that helps with this problem because you haven't deformed the tuples
at any point. Now as it happens we do have code -- in
reform_and_rewrite_tuple() -- that does deform and reform the tuples,
but it doesn't take care of this problem either. We might need to
distinguish between rewriting the table, which is mostly about getting
a new relfilenode, and some other word that means doing this.

But, I am not really convinced that we need to solve this problem by
adding new ALTER TABLE syntax. I'd be happy enough if CLUSTER, VACUUM
FULL, and versions of ALTER TABLE that already force a rewrite would
cause the compression to be redone also. Honestly, even if the user
had to fall back on creating a new table and doing INSERT INTO newtab
SELECT * FROM oldtab I would consider that to be not a total
showstopper for this .. assuming of course that it actually works. If
it doesn't, we have big problems. Even without the pg_am stuff, we
still need to make sure that we don't just blindly let compressed
values wander around everywhere. When we insert into a table column
with a compression method, we should recompress any data that is
compressed using some other method.
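
To make that last point concrete, here is a minimal sketch (not the committed
behaviour, and not reform_and_rewrite_tuple() itself) of how a rewrite path
could force detoasting so that re-insertion recompresses values with the
destination column's method. The function name is hypothetical; it assumes the
detoast_attr() and heap_deform_tuple()/heap_form_tuple() APIs and omits
memory-context and error handling:

#include "postgres.h"
#include "access/detoast.h"
#include "access/htup_details.h"

/*
 * Sketch only: deform a heap tuple and flatten any varlena attribute stored
 * in a non-plain form (external, compressed, or short header) so that the
 * subsequent insert recompresses it for the target column.
 */
static HeapTuple
reform_with_detoast(HeapTuple tup, TupleDesc desc, Datum *values, bool *isnull)
{
	heap_deform_tuple(tup, desc, values, isnull);

	for (int i = 0; i < desc->natts; i++)
	{
		Form_pg_attribute att = TupleDescAttr(desc, i);

		if (!isnull[i] && att->attlen == -1 &&
			VARATT_IS_EXTENDED(DatumGetPointer(values[i])))
			values[i] = PointerGetDatum(
				detoast_attr((struct varlena *) DatumGetPointer(values[i])));
	}

	return heap_form_tuple(desc, values, isnull);
}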

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_amcheck contrib application

2021-03-10 Thread Robert Haas
On Wed, Mar 10, 2021 at 11:10 AM Mark Dilger
 wrote:
> Once again, I think you are right and have removed the objectionable 
> behavior, but
>
> The --startblock and --endblock options make the most sense when the user is 
> only checking one table, like
>
>   pg_amcheck --startblock=17 --endblock=19 --table=my_schema.my_corrupt_table
>
> because the user likely has some knowledge about that table, perhaps from a 
> prior run of pg_amcheck.  The --startblock and --endblock arguments are a bit 
> strange when used globally, as relations don't all have the same number of 
> blocks, so
>
>   pg_amcheck --startblock=17 --endblock=19 mydb
>
> will very likely emit lots of error messages for tables which don't have 
> blocks in that range.  That's not entirely pg_amcheck's fault, as it just did 
> what the user asked, but it also doesn't seem super helpful.  I'm not going 
> to do anything about it in this release.

+1 to all that. I tend toward the opinion that trying to make
--startblock and --endblock do anything useful in the context of
checking multiple relations is not really possible, and therefore we
just shouldn't put any effort into it. But if user feedback shows
otherwise, we can always do something about it later.

> After running 'make installcheck', if I delete all entries from pg_class 
> where relnamespace = 'pg_toast'::regclass, by running 'pg_amcheck 
> regression', I get lines that look like this:
>
> heap relation "regression"."public"."quad_poly_tbl":
> ERROR:  could not open relation with OID 17177

In this here example, the first line ends in a colon.

> relation "regression"."public"."functional_dependencies", block 28, offset 
> 54, attribute 0
> attribute 0 with length 4294967295 ends at offset 50 beyond total tuple 
> length 43

But this here one does not. Seems like it should be consistent.

The QUALIFIED_NAME_FIELDS macro doesn't seem to be used anywhere,
which is good, because macros with unbalanced parentheses are usually
not a good plan; and a macro that expands to a comma-separated list of
things is suspect too.

"invalid skip options\n" seems too plural.

With regard to your use of strtol() for --{start,end}block, telling
the user that their input is garbage seems pejorative, even though it
may be accurate. Compare:

[rhaas EDBAS]$ pg_dump -jdsgdsgd
pg_dump: error: invalid number of parallel jobs
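
For illustration, a minimal sketch of the gentler wording, assuming a plain
strtol() check in a frontend helper (the function and message names here are
made up, not pg_amcheck's actual code):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: validate a --startblock/--endblock style argument. */
static bool
parse_block_number(const char *arg, const char *optname, long *result)
{
	char	   *endptr;

	errno = 0;
	*result = strtol(arg, &endptr, 10);
	if (errno != 0 || endptr == arg || *endptr != '\0' || *result < 0)
	{
		fprintf(stderr, "pg_amcheck: error: invalid %s\n", optname);
		return false;
	}
	return true;
}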

In the message "relation end block argument precedes start block
argument\n", I think you could lose both instances of the word
"argument" and probably the word "relation" as well. I actually don't
know why all of these messages about start and end block mention
"relation". It's not like there is some other kind of
non-relation-related start block with which it could be confused.

The comment for run_command() explains some things about the cparams
argument, but those things are false. In fact the argument is unused.

Usual PostgreSQL practice when freeing memory in e.g.
verify_heap_slot_handler is to set the pointers to NULL as well. The
performance cost of this is trivial, and it makes debugging a lot
easier should somebody accidentally write code to access one of those
things after it's been freed.
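
A tiny illustration of that convention (the struct and field names are
hypothetical, not pg_amcheck's):

#include <stdlib.h>

struct pattern_info
{
	char	   *db_regex;
	char	   *rel_regex;
};

/* Sketch: free members and reset them so any stale access fails loudly. */
static void
free_pattern_info(struct pattern_info *pat)
{
	free(pat->db_regex);
	pat->db_regex = NULL;
	free(pat->rel_regex);
	pat->rel_regex = NULL;
}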

The documentation says that -D "does exclude any database that was
listed explicitly as dbname on the command line, nor does it exclude
the database chosen in the absence of any dbname argument." The first
part of this makes complete sense to me, but I'm not sure about the
second part. If I type pg_amcheck --all -D 'r*', I think I'm expecting
that "rhaas" won't be checked. Likewise, if I say pg_amcheck -d
'bob*', I think I only want to check the bob-related databases and not
rhaas.

I suggest documenting --endblock as "Check table blocks up to and
including the specified ending block number. An error will occur if a
relation being checked has fewer than this number of blocks." And
similarly for --startblock: "Check table blocks beginning with the
specified block number. An error will occur, etc." Perhaps even
mention something like "This option is probably only useful when
checking a single table." Also, the documentation here isn't clear
that this affects only table checking, not index checking.

It appears that pg_amcheck sometimes makes dummy connections to the
database that don't do anything, e.g. pg_amcheck -t 'q*' resulted in:

2021-03-10 15:00:14.273 EST [95473] LOG:  connection received: host=[local]
2021-03-10 15:00:14.274 EST [95473] LOG:  connection authorized:
user=rhaas database=rhaas application_name=pg_amcheck
2021-03-10 15:00:14.286 EST [95473] LOG:  statement: SELECT
pg_catalog.set_config('search_path', '', false);
2021-03-10 15:00:14.290 EST [95464] DEBUG:  forked new backend,
pid=95474 socket=11
2021-03-10 15:00:14.291 EST [95464] DEBUG:  server process (PID 95473)
exited with exit code 0
2021-03-10 15:00:14.291 EST [95474] LOG:  connection received: host=[local]
2021-03-10 15:00:14.293 EST [95474] LOG:  connection authorized:
user=rhaas database=rhaas 

Re: libpq debug log

2021-03-10 Thread Tom Lane
"'alvhe...@alvh.no-ip.org'"  writes:
> After staring at it a couple of times, I think that the places in
> pqParseInput3() where there's a comment "... moves inStart itself" and
> then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU
> those are the places where the message in question would not reach the
> "Successfully consumed this message" block that prints each message.

After digging around a little, I remember that there are a *bunch*
of places in fe-protocol3.c that advance inStart.  The cleanest way
to shove this stuff in without rethinking that logic would be to
call pqTraceOutputMsg immediately before each such advance (using
the old inStart value as the message start address).  A possible
advantage of doing it like that is that we'd be aware by that point
of whether we think the message is good or bad (or whether it was
good but we failed to process it, perhaps because of OOM).  Seems
like that could be a useful thing to include in the log.

Or we could rethink the logic.  It's not quite clear to me, after
all this time, why getRowDescriptions() et al are allowed to
move inStart themselves rather than letting the main loop in
pqParseInput3 do it.  It might well be an artifact of having not
rewritten the v2 logic too much.

regards, tom lane




Re: Occasional tablespace.sql failures in check-world -jnn

2021-03-10 Thread Andres Freund
Hi,

On 2021-03-10 15:40:38 +0900, Michael Paquier wrote:
> On Mon, Mar 08, 2021 at 11:53:57AM +0100, Peter Eisentraut wrote:
> > On 09.12.20 08:55, Michael Paquier wrote:
> >> ...  Because we may still introduce this problem again if some new
> >> stuff uses src/test/pg_regress in a way similar to pg_upgrade,
> >> triggering again tablespace-setup.  Something like the attached may be
> >> enough, though I have not spent much time checking the surroundings,
> >> Windows included.
> > 
> > This patch looks alright to me.
> 
> So, I have spent more time checking the surroundings of this patch,
> and finally applied it.  Thanks for the review, Peter.

Thanks!

Greetings,

Andres Freund




Re: libpq debug log

2021-03-10 Thread Tom Lane
"'alvhe...@alvh.no-ip.org'"  writes:
> After staring at it a couple of times, I think that the places in
> pqParseInput3() where there's a comment "... moves inStart itself" and
> then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU
> those are the places where the message in question would not reach the
> "Successfully consumed this message" block that prints each message.

Yeah, the whole business of just when a message has been "consumed"
is a stumbling block for libpq tracing.  It was a real mess with the
existing code and v2 protocol, because we could actually try to parse
a message more than once, with the first few tries deciding that the
message wasn't all here yet (but nonetheless emitting partial trace
output).

Now that v2 is dead, the logic to abort because of the message not
being all arrived yet is basically useless: only the little bit of
code concerned with the message length word really needs to cope with
that.  It's tempting to go through and get rid of all the now-unreachable
"return"s and such, but it seemed like it would be a lot of code churn for
not really that much gain.

I didn't look at the new version of the patch yet, so I'm not
sure whether the issues it still has are related to this.

regards, tom lane




Re: libpq debug log

2021-03-10 Thread 'alvhe...@alvh.no-ip.org'
On 2021-Mar-10, 'alvhe...@alvh.no-ip.org' wrote:

> I also found that DataRow messages are not being printed.  This seems to
> fix that in the normal case and singlerow, at least in pipeline mode.
> Didn't verify the non-pipeline mode.

Hm, and RowDescription ('T') messages are also not being printed ... so
I think there's some larger problem here, and perhaps it needs to be
fixed using a different approach.

After staring at it a couple of times, I think that the places in
pqParseInput3() where there's a comment "... moves inStart itself" and
then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU
those are the places where the message in question would not reach the
"Successfully consumed this message" block that prints each message.

-- 
Álvaro Herrera39°49'30"S 73°17'W
Maybe there's lots of data loss but the records of data loss are also lost.
(Lincoln Yeoh)




Re: Huge memory consumption on partitioned table with FKs

2021-03-10 Thread Tom Lane
Amit Langote  writes:
> On Wed, Mar 10, 2021 at 8:37 AM Tom Lane  wrote:
>> Hmm.  So, the key point is that the values coming from the partitioned
>> child table are injected into the test query as parameters, not as
>> column references, thus it doesn't matter *to the test query* what
>> numbers the referencing columns have in that child.  We just have to
>> be sure we pass the right parameter values.

> Right.

I did some cosmetic fooling with this (mostly, rewriting the comments
YA time) and pushed it.

regards, tom lane




Re: Speeding up GIST index creation for tsvectors

2021-03-10 Thread John Naylor
On Mon, Mar 8, 2021 at 8:43 AM Amit Khandekar 
wrote:
>
> On Wed, 3 Mar 2021 at 23:32, John Naylor 
wrote:

> > 0001:
> >
> > + /*
> > + * We can process 64-bit chunks only if both are mis-aligned by the
same
> > + * number of bytes.
> > + */
> > + if (b_aligned - b == a_aligned - a)
> >
> > The obvious question here is: how often are they identically
misaligned? You
> > don't indicate that your measurements differ in a bimodal fashion, so
does
> > that mean they happen to be always (mis)aligned?
>
> I ran CREATE INDEX on tsvector columns using the tsearch.data that I
> had attached upthread, with some instrumentation; here are the
> proportions :
> 1. In 15% of the cases, only one among a and b was aligned. The other
> was offset from the 8-byte boundary by 4 bytes.
> 2. 6% of the cases, both were offset by 4 bytes, i.e. identically
misaligned.
> 3. Rest of the cases, both were aligned.

That's somewhat strange. I would have assumed always aligned, or usually
not. It sounds like we could require them both to be aligned and not bother
with the byte-by-byte prologue. I also wonder if it's worth it to memcpy to
a new buffer if the passed pointer is not aligned.
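
As a point of reference, here is a minimal sketch of the "require both
aligned, else fall back" shape discussed above, assuming pg_popcount64() and
pg_number_of_ones[] from port/pg_bitutils.h; this is not the patch's
pg_xorcount_long():

#include "postgres.h"
#include "port/pg_bitutils.h"

/* Sketch: count differing bits between two equal-length buffers. */
static inline uint64
xorcount_sketch(const char *a, const char *b, int bytes)
{
	uint64		count = 0;

	if (((uintptr_t) a & 7) == 0 && ((uintptr_t) b & 7) == 0)
	{
		const uint64 *a64 = (const uint64 *) a;
		const uint64 *b64 = (const uint64 *) b;

		for (; bytes >= 8; bytes -= 8)
			count += pg_popcount64(*a64++ ^ *b64++);
		a = (const char *) a64;
		b = (const char *) b64;
	}

	/* remaining (or unaligned) bytes, one at a time */
	for (; bytes > 0; bytes--)
		count += pg_number_of_ones[(unsigned char) (*a++ ^ *b++)];

	return count;
}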

> > + /* For smaller lengths, do simple byte-by-byte traversal */
> > + if (bytes <= 32)
> >
> > You noted upthread:
> >
> > > Since for the gist index creation of some of these types the default
> > > value for siglen is small (8-20), I tested with small siglens. For
> > > siglens <= 20, particularly for values that are not multiples of 8
> > > (e.g. 10, 13, etc), I see a  1-7 % reduction in speed of index
> > > creation. It's probably because of
> > > an extra function call for pg_xorcount(); and also might be due to the
> > > extra logic in pg_xorcount() which becomes prominent for shorter
> > > traversals. So for siglen less than 32, I kept the existing method
> > > using byte-by-byte traversal.
> >
> > I wonder if that can be addressed directly, while cleaning up the loops
and
> > alignment checks in pg_xorcount_long() a little. For example, have a
look at
> > pg_crc32c_armv8.c -- it might be worth testing a similar approach.
>
> Yeah, we can put the bytes <= 32 condition inside pg_xorcount_long().
> I avoided that to not hamper the <= 32 scenarios. Details explained
> below for "why inline pg_xorcount is calling global function"
>
> > Also, pardon my ignorance, but where can I find the default siglen for
various types?
> Check SIGLEN_DEFAULT.

Okay, so it's hard-coded in various functions in contrib modules. In that
case, let's just keep the existing coding for those. In fact, the comments
that got removed by your patch specifically say:

/* Using the popcount functions here isn't likely to win */

...which your testing confirmed. The new function should be used only for
Gist and possibly Intarray, whose default is 124. That means we can't get
rid of hemdistsign(), but that's fine. Alternatively, we could say that a
small regression is justifiable reason to refactor all call sites, but I'm
not proposing that.

> > (I did make sure to remove  indirect calls  from the retail functions
> > in [1], in case we want to go that route).
>
> Yeah, I quickly had a look at that. I am still going over that thread.
> Thanks for the exhaustive analysis there.

I'll post a patch soon that builds on that, so you can see what I mean.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: get rid of tags in the docs?

2021-03-10 Thread Tom Lane
John Naylor  writes:
> On Thu, Feb 4, 2021 at 11:31 AM Tom Lane  wrote:
>> +1, nobody italicizes those in normal usage.

> Now that protocol v2 is gone, here's a patch to remove those tags.

Pushed.

regards, tom lane




Re: Consider parallel for lateral subqueries with limit

2021-03-10 Thread David Steele

On 12/7/20 6:45 PM, James Coleman wrote:

On Sun, Dec 6, 2020 at 7:34 PM Brian Davis  wrote:


Played around with this a bit, here's a non-correlated subquery that gets us to 
that if statement


While I haven't actually tracked down to guarantee this is handled
elsewhere, a thought experiment -- I think -- shows it must be so.
Here's why: suppose we don't have a limit here, but the query return
order is different in different backends. Then we would have the same
problem you bring up. In that case this code is already setting
consider_parallel=true on the rel. So I don't think we're changing any
behavior here.


So it looks like you and Brian are satisfied that this change is not 
allowing bad behavior.


Seems like an obvious win. Hopefully we can get some other concurring 
opinions.


Regards,
--
-David
da...@pgmasters.net




Re: Add header support to text format and matching feature

2021-03-10 Thread David Steele

On 12/7/20 6:40 PM, Rémi Lapeyre wrote:

Hi, here’s a rebased version of the patch.


Michael, since the issue of duplicated options has been fixed, does either
of these patches look like it is ready for commit?


Regards,
--
-David
da...@pgmasters.net




Re: cryptohash: missing locking functions for OpenSSL <= 1.0.2?

2021-03-10 Thread Jacob Champion
On Wed, 2021-03-10 at 17:21 +0900, Michael Paquier wrote:
> Do you mean something like the attached?

Yes! Patch LGTM.

--Jacob


Re: WIP: System Versioned Temporal Table

2021-03-10 Thread Vik Fearing
On 3/10/21 5:49 PM, Surafel Temesgen wrote:
> hi Ibrar,
> thank you for rebasing
> 
> On Mon, Mar 8, 2021 at 9:34 AM Ibrar Ahmed  wrote:
> 
>>
>>> Since the get_row_start_time_col_name() and get_row_end_time_col_name()
>>> are similar, IMO we can pass a flag to get StartTime/EndTime column name,
>>> thought?
>>>
>>>
> For me your option is better.  I will change to it in my next
> patch if no objection

I have plenty of objection.  I'm sorry that I am taking so long with my
review.  I am still working on it and it is coming soon, I promise.
-- 
Vik Fearing




Re: Evaluate expression at planning time for two more cases

2021-03-10 Thread Surafel Temesgen
Hi Ibrar,


On Mon, Mar 8, 2021 at 8:13 AM Ibrar Ahmed  wrote:

>
> It was a minor change therefore I rebased the patch, please take a look.
>

It is perfect thank you

regards
Surafel


Re: WIP: System Versioned Temporal Table

2021-03-10 Thread Surafel Temesgen
hi Ibrar,
thank you for rebasing

On Mon, Mar 8, 2021 at 9:34 AM Ibrar Ahmed  wrote:

>
>> Since the get_row_start_time_col_name() and get_row_end_time_col_name()
>> are similar, IMO we can pass a flag to get StartTime/EndTime column name,
>> thought?
>>
>>
For me your option is better.  I will change to it in my next
patch if no objection


regards
Surafel


Re: WIP: document the hook system

2021-03-10 Thread David Fetter
On Wed, Mar 10, 2021 at 09:38:39AM -0500, David Steele wrote:
> On 3/9/21 12:20 PM, Bruce Momjian wrote:
> > On Sat, Mar  6, 2021 at 08:32:43PM -0500, Tom Lane wrote:
> > > I think that the best you should hope for here is that people are
> > > willing to add a short, not-too-detailed para to a markup-free
> > > plain-text README file that lists all the hooks.  As soon as it
> > > gets any more complex than that, either the doco aspect will be
> > > ignored, or there simply won't be any more hooks.
> > > 
> > > (I'm afraid I likewise don't believe in the idea of carrying a test
> > > module for each hook.  Again, requiring that is a good way to
> > > ensure that new hooks just won't happen.)
> > 
> > Agreed.  If you document the hooks too much, it allows them to drift
> > away from matching the code, which makes the hook documentation actually
> > worse than having no hook documentation at all.
> 
There doesn't seem to be agreement on how to proceed here, so closing.
> 
> David, if you do decide to proceed with a README then it would probably be
> best to create a new thread/entry.

Thanks for the work on this and the helpful feedback!

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [PATCH] Covering SPGiST index

2021-03-10 Thread David Steele

On 12/4/20 12:31 PM, Pavel Borisov wrote:

The cfbot's still unhappy --- looks like you omitted a file from the
patch?

You are right, thank you. Corrected this. PFA v13


Tom, do the changes as enumerated in [1] look like they are going in the 
right direction?


Regards,
--
-David
da...@pgmasters.net

[1] 
https://www.postgresql.org/message-id/CALT9ZEEszJUwsXMWknXQ3k_YbGtQaQwiYxxEGZ-pFGRUDSXdtQ%40mail.gmail.com





Re: Huge memory consumption on partitioned table with FKs

2021-03-10 Thread Amit Langote
On Wed, Mar 10, 2021 at 8:37 AM Tom Lane  wrote:
> Amit Langote  writes:
> > On Fri, Mar 5, 2021 at 6:00 AM Tom Lane  wrote:
> >> This claim seems false on its face:
> >>> All child constraints of a given foreign key constraint can use the
> >>> same RI query and the resulting plan, that is, no need to create as
> >>> many copies of the query and the plan as there are partitions, as
> >>> happens now due to the child constraint OID being used in the key
> >>> for ri_query_cache.
>
> > The quoted comment could have been written to be clearer about this,
> > but it's not talking about the table that is to be queried, but the
> > table whose RI trigger is being executed.  In all the cases except one
> > (mentioned below), the table that is queried is the same irrespective
> > of which partition's trigger is being executed, so we're basically
> > creating the same plan separately for each partition.
>
> Hmm.  So, the key point is that the values coming from the partitioned
> child table are injected into the test query as parameters, not as
> column references, thus it doesn't matter *to the test query* what
> numbers the referencing columns have in that child.  We just have to
> be sure we pass the right parameter values.

Right.

>  But ... doesn't the code
> use riinfo->fk_attnums[] to pull out the values to be passed?

Yes, from a slot that belongs to the child table.

> IOW, I now get the point about being able to share the SPI plans,
> but I'm still dubious about sharing the RI_ConstraintInfo cache entries.

There was actually a proposal upthread about sharing the
RI_ConstraintInfo too, but we decided to not pursue that for now.

> It looks to me like the v4 patch is now actually not sharing the
> cache entries, ie their hash key is just the child constraint OID
> same as before;

Yeah, you may see that we're only changing ri_BuildQueryKey() in the
patch affecting only ri_query_cache, but not ri_LoadConstraintInfo()
which leaves ri_constraint_cache unaffected.

> but the comments are pretty confused about this.

I've tried improving the comment in ri_BuildQueryKey() a bit to make
clear what is and what is not being shared between partitions.

> It might be simpler if you add just one new field which is the OID of
> the constraint that we're building the SPI query from, which might be
> either equal to constraint_id, or the OID of some parent constraint.
> In particular it's not clear to me why we need both constraint_parent
> and constraint_root_id.

Yeah, I think constraint_parent is a leftover from some earlier
hacking.  I have removed it.

Attached updated patch.

--
Amit Langote
EDB: http://www.enterprisedb.com


v5-0001-Share-SPI-plan-between-partitions-in-some-RI-trig.patch
Description: Binary data


Re: Boundary value check in lazy_tid_reaped()

2021-03-10 Thread Peter Eisentraut

On 10.03.21 02:29, Masahiko Sawada wrote:

There is no noticeable effect of inlining lazy_tid_reaped(). So it
would be better to not do that.


Attached the patch that doesn't inline lazy_tid_reaped().


committed




Re: WIP: document the hook system

2021-03-10 Thread David Steele

On 3/9/21 12:20 PM, Bruce Momjian wrote:

On Sat, Mar  6, 2021 at 08:32:43PM -0500, Tom Lane wrote:

I think that the best you should hope for here is that people are
willing to add a short, not-too-detailed para to a markup-free
plain-text README file that lists all the hooks.  As soon as it
gets any more complex than that, either the doco aspect will be
ignored, or there simply won't be any more hooks.

(I'm afraid I likewise don't believe in the idea of carrying a test
module for each hook.  Again, requiring that is a good way to
ensure that new hooks just won't happen.)


Agreed.  If you document the hooks too much, it allows them to drift
away from matching the code, which makes the hook documentation actually
worse than having no hook documentation at all.


There doesn't seem to be agreement on how to proceed here, so closing.

David, if you do decide to proceed with a README then it would probably 
be best to create a new thread/entry.


Regards,
--
-David
da...@pgmasters.net




Re: libpq debug log

2021-03-10 Thread 'alvhe...@alvh.no-ip.org'
I also found that DataRow messages are not being printed.  This seems to
fix that in the normal case and singlerow, at least in pipeline mode.
Didn't verify the non-pipeline mode.

-- 
Álvaro Herrera39°49'30"S 73°17'W
"Nunca se desea ardientemente lo que solo se desea por razón" (F. Alexandre)
diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c
index 68f0f6a081..e8db5edb71 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -869,6 +869,9 @@ getAnotherTuple(PGconn *conn, int msgLength)
 		goto advance_and_error;
 	}
 
+	if (conn->Pfdebug)
+		pqTraceOutputMsg(conn, conn->inStart, MSGDIR_FROM_BACKEND);
+
 	/* Advance inStart to show that the "D" message has been processed. */
 	conn->inStart = conn->inCursor;
 


Re: [PATCH] SET search_path += octopus

2021-03-10 Thread David Steele

On 12/1/20 9:49 AM, Abhijit Menon-Sen wrote:


Therefore, for lack of any plausible way to proceed, I do not plan to
work on this feature any more.


It sounds like this CF entry should be closed. I'll do that on March 12 
unless somebody beats me to it.


Regards,
--
-David
da...@pgmasters.net




Re: Parallel INSERT (INTO ... SELECT ...)

2021-03-10 Thread Dilip Kumar
On Wed, Mar 10, 2021 at 2:48 PM Amit Kapila  wrote:
>
> On Mon, Mar 8, 2021 at 7:19 PM Amit Langote  wrote:
> >
> > Hi Amit
> >
> > On Mon, Mar 8, 2021 at 10:18 PM Amit Kapila  wrote:
> > > On Mon, Mar 8, 2021 at 3:54 PM Greg Nancarrow  wrote:
> > > > I've attached an updated set of patches with the suggested locking 
> > > > changes.
> >
> > (Thanks Greg.)
> >
> > > Amit L, others, do let me know if you have still more comments on
> > > 0001* patch or if you want to review it further?
> >
> > I just read through v25 and didn't find anything to complain about.
> >
>
> Thanks a lot, pushed now! Amit L., your inputs are valuable for this work.
>
> Now, coming back to Hou-San's patch to introduce a GUC and reloption
> for this feature, I think both of those make sense to me because when
> the feature is enabled via GUC, one might want to disable it for
> partitioned tables? Do we agree on that part or someone thinks
> otherwise?

I think it makes sense to have both.

>
> The other points to bikeshed could be:
> 1. The name of GUC and reloption. The two proposals at hand are
> enable_parallel_dml and enable_parallel_insert. I would prefer the
> second (enable_parallel_insert) because updates/deletes might not have
> a similar overhead.

I agree enable_parallel_insert makes more sense.

> 2. Should we keep the default value of GUC to on or off? It is
> currently off. I am fine keeping it off for this release and we can
> always turn it on in the later releases if required. Having said that,
> I see the value in keeping it on because in many cases Insert ...
> Select will be used for large data and there we will see a benefit of
> parallelism and users facing trouble (who have a very large number of
> partitions with less data to query) can still disable the parallel
> insert for that particular table. Also, the other benefit of keeping
> it on till at least the beta period is that this functionality will
> get tested and if we found reports of regression then we can turn it
> off for this release as well.
>
> Thoughts?

IMHO, we should keep it on because in most cases it is going
to give a benefit to the user, and in the specific scenario where
a table has a lot of partitions it can be turned off using the
reloption.  And, if for some users most of the tables are like that,
always having a lot of partitions, then the user can turn it
off using the GUC.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Lowering the ever-growing heap->pd_lower

2021-03-10 Thread Matthias van de Meent
On Tue, 9 Mar 2021 at 22:35, Tom Lane  wrote:
>
> Matthias van de Meent  writes:
> > The only two existing mechanisms that I could find (in the access/heap
> > directory) that possibly could fail on shrunken line pointer arrays;
> > being xlog recovery (I do not have enough knowledge on recovery to
> > determine if that may touch pages that have shrunken line pointer
> > arrays, or if those situations won't exist due to never using dirtied
> > pages in recovery) and backwards table scans on non-MVCC snapshots
> > (which would be fixed in the attached patch).
>
> I think you're not visualizing the problem properly.
>
> The case I was concerned about back when is that there are various bits of
> code that may visit a page with a predetermined TID in mind to look at.
> An index lookup is an obvious example, and another one is chasing an
> update chain's t_ctid link.  You might argue that if the tuple was dead
> enough to be removed, there should be no such in-flight references to
> worry about, but I think such an assumption is unsafe.  There is not, for
> example, any interlock that ensures that a process that has obtained a TID
> from an index will have completed its heap visit before a VACUUM that
> subsequently removed that index entry also removes the heap tuple.

I am aware of this problem. I will admit that I did not detect all
potentially problematic accesses, so I'll show you my work.

> So, to accept a patch that shortens the line pointer array, what we need
> to do is verify that every such code path checks for an out-of-range
> offset before trying to fetch the target line pointer.  I believed
> back in 2007 that there were, or once had been, code paths that omitted
> such a range check, assuming that they could trust the TID they had
> gotten from $wherever to point at an extant line pointer array entry.
> Maybe things are all good now, but I think you should run around and
> examine every place that checks for tuple deadness to see if the offset
> it used is known to be within the current page bounds.

In my search for problematic accesses, I make the following assumptions:
* PageRepairFragmentation as found in bufpage is only applicable to
heap pages; this function is not applied to other pages in core
postgres. So, any problems that occur are with due to access with an
OffsetNumber > PageGetMaxOffsetNumber.
* Items [on heap pages] are only extracted after using PageGetItemId
for that item on the page, whilst holding a lock.

Under those assumptions, I ran "grep PageGetItemId" over the src
directory. For all 227 results (as of 68b34b23) I checked if the page
accessed (or item accessed thereafter) was a heap page or heap tuple.
After analysis of the relevant references, I had the following results
(attached full report, containing a line with the file & line number,
and code line of the call, followed by a line containing the usage
type):

Count of usage type - usage type
4  - Not a call (comment)
7  - Callsite guarantees bounds
8  - Has assertion ItemIdIsNormal (asserts item is not removed; i.e.
concurrent vacuum should not have been able to remove this item)
39  - Has bound checks
6  - Not a heap page (brin)
1  - Not a heap page (generic index)
24  - Not a heap page (gin)
21  - Not a heap page (gist)
14  - Not a heap page (hash)
60  - Not a heap page (nbtree)
1  - Not a heap page (sequence)
36  - Not a heap page (spgist)
2  - OffsetNumber is generated by PageAddItem
2  - problem case 1 (table scan)
1  - Problem case 2 (xlog)
1  - Problem case 3 (invalid redirect pointers)

The 3 problem cases were classified based on the origin of the
potentially invalid pointer.

Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup

The table scan maintains a state which contains a page-bound
OffsetNumber, which it uses as a cursor whilst working through the
pages of the relation. In forward scans, the bounds of the page are
re-validated at the start of the 'while (linesleft > 0)' loop at 681,
but for backwards scans this check is invalid, because it incorrectly
assumes that the last OffsetNumber is guaranteed to still exist.

For MVCC snapshots, this is true (the previously returned value must
have been visible in its snapshot, therefore cannot have been vacuumed
because the snapshot is still alive), but non-mvcc snapshots may have
returned a dead tuple, which is now vacuumed and truncated away.

The first occurrence of this issue is easily fixed with the changes as
submitted in patch v2.

The second problem case (on line 811) is for forward scans, where the
line pointer array could have been truncated to 0 length. As the code
uses a hardcoded offset of FirstOffsetNumber (=1), that might point
into arbitrary data. The reading of this arbitrary data is prevented by
the 'while (linesleft > 0)' check, because at that point linesleft
will be PageGetMaxOffsetNumber, which would then equal 0.
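
For illustration only (this is not the v2 patch itself), the re-validation
described above boils down to clamping a remembered offset against the page's
current bounds before it is dereferenced:

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Sketch: re-check a previously saved OffsetNumber against the page's
 * current line pointer array before calling PageGetItemId() on it.
 */
static OffsetNumber
clamp_resume_offset(Page page, OffsetNumber saved)
{
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

	if (saved > maxoff)
		saved = maxoff;		/* InvalidOffsetNumber if the page is now empty */

	return saved;
}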

Problem case 2: xlog; heapam.c line 8796, in heap_xlog_freeze_page

This is in the replay of transaction 

Re: proposal: unescape_text function

2021-03-10 Thread David Steele

On 12/2/20 1:30 PM, Pavel Stehule wrote:
On Wed, Dec 2, 2020 at 11:37 AM Pavel Stehule wrote:
On Wed, Dec 2, 2020 at 9:23 AM Peter Eisentraut wrote:


Heh.  The fact that there is a table of two dozen possible
representations kind of proves my point that we should be
deliberate in
picking one.

I do see Oracle unistr() on that list, which appears to be very
similar
to what you are trying to do here.  Maybe look into aligning
with that.

unistr is a primitive form of proposed function.  But it can be used
as a base. The format is compatible with our  "4.1.2.3. String
Constants with Unicode Escapes".

What do you think about the following proposal?

1. unistr(text) .. compatible with Postgres unicode escapes - it is
enhanced against Oracle, because Oracle's unistr doesn't support 6
digits unicodes.

2. there can be optional parameter "prefix" with default "\". But
with "\u" it can be compatible with Java or Python.

What do you think about it?

I thought about it a little bit more, and the prefix specification doesn't
make much sense (even more so if we implement this functionality as the
function "unistr"). I removed the optional argument and renamed the function
to "unistr". The functionality is the same. Now it supports the Oracle
convention, Java and Python (for Python U) and \+XX. These
formats were already supported. The compatibility with Oracle is nice.


Peter, it looks like Pavel has aligned this function with unistr() as 
you suggested. Thoughts?


Regards,
--
-David
da...@pgmasters.net




Re: get rid of tags in the docs?

2021-03-10 Thread John Naylor
On Thu, Feb 4, 2021 at 11:31 AM Tom Lane  wrote:
>
> John Naylor  writes:
> > While looking at the proposed removal of the v2 protocol, I noticed
that we
> > italicize some, but not all, instances of 'per se', 'pro forma', and 'ad
> > hoc'. I'd say these are widespread enough in formal registers of English
> > that they hardly need to be called out as foreign, so I propose removing
> > the tags for those words.
>
> +1, nobody italicizes those in normal usage.

Now that protocol v2 is gone, here's a patch to remove those tags.

> > The other case is 'voilà', found in rules.sgml. The case for italics
here
> > is stronger, but looking at that file, I actually think a more
> > generic-sounding phrase here would be preferable.
>
> Yeah, seeing that we only use that in one place, I think we could do
> without it.  Looks like something as pedestrian as "The results are:"
> would do fine.

Done that way.

--
John Naylor
EDB: http://www.enterprisedb.com


v1-0001-Get-rid-of-foreignphrase-tags-in-the-docs.patch
Description: Binary data


Re: libpq debug log

2021-03-10 Thread 'alvhe...@alvh.no-ip.org'
On 2021-Mar-10, iwata@fujitsu.com wrote:

> Hi all,
> 
> Following all reviewer's advice, I have created a new patch.
> 
> In this patch, I add only two tracing entry points; I call 
> pqTraceOutputMsg(PGconn conn, int msgCursor, PGCommSource commsource) in 
> pqParseInput3 () and pqPutMsgEnd () to output log.
> The argument contains message first byte offset called msgCursor because it 
> is simpler to specify the buffer pointer in the caller. 
> 
> In pqTraceOutputMsg(), the content common to all protocol messages (first 
> timestamp, < or >, first 1 byte, and length) are output first and after that 
> each protocol message contents is output. I referred to pqParseInput3 () , 
> fe-exec.c and documentation page for that part.
> 
> This fix almost eliminates if(conn->Pfdebug) that was built into the code for 
> other features.

This certainly looks much better!  Thanks for getting it done so
quickly.

I didn't review the code super closely.  I do see a compiler warning:

/pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c: In function 
'pqTraceOutputNchar':
/pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:373:2: warning: 
argument 1 null where non-null expected [-Wnonnull]
  memcpy(v, buf + *cursor, len);
  ^

and then when I try to run the "libpq_pipeline" program from the other
thread, I get a crash with this backtrace:

Program received signal SIGSEGV, Segmentation fault.
__memmove_avx_unaligned_erms () at 
../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:288
288 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No existe el 
fichero o el directorio.
(gdb) bt
#0  __memmove_avx_unaligned_erms () at 
../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:288
#1  0x77f9b50b in pqTraceOutputNchar (buf=buf@entry=0x55564660 "P", 
LogEnd=LogEnd@entry=42, cursor=cursor@entry=0x7fffdeb4, 
pfdebug=0x55569320, len=1)
at /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:373
#2  0x77f9bc25 in pqTraceOutputMsg (conn=conn@entry=0x55560260, 
msgCursor=, 
commsource=commsource@entry=MSGDIR_FROM_FRONTEND)
at /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:476
#3  0x77f96280 in pqPutMsgEnd (conn=conn@entry=0x55560260)
at /pgsql/source/pipeline/src/interfaces/libpq/fe-misc.c:533
...

The attached patch fixes it, though I'm not sure that that's the final
form we want it to have (since we'd alloc and free repeatedly, making it
slow -- perhaps better to use a static, or ...? ).
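
One copy-free variant (just a sketch meant to live in libpq-trace.c alongside
the code in the patch below, not a proposed final form) would be to print the
bytes straight out of the caller's buffer instead of copying them; the
trailing-separator handling of the original function is omitted here:

/* Sketch of a copy-free pqTraceOutputNchar(): no temporary allocation. */
static void
pqTraceOutputNchar_noalloc(char *buf, int *cursor, FILE *pfdebug, int len)
{
	fprintf(pfdebug, "\'");
	pqTraceOutputBinary(buf + *cursor, len, pfdebug);
	fprintf(pfdebug, "\'");
	*cursor += len;
}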

-- 
Álvaro Herrera   Valdivia, Chile
Essentially, you're proposing Kevlar shoes as a solution for the problem
that you want to walk around carrying a loaded gun aimed at your foot.
(Tom Lane)
diff --git a/src/interfaces/libpq/libpq-trace.c b/src/interfaces/libpq/libpq-trace.c
index ef294fa556..5d3b40a1d0 100644
--- a/src/interfaces/libpq/libpq-trace.c
+++ b/src/interfaces/libpq/libpq-trace.c
@@ -368,7 +368,15 @@ pqTraceOutputBinary(char *v, int length, FILE *pfdebug)
 static void
 pqTraceOutputNchar(char *buf, int LogEnd, int *cursor, FILE *pfdebug, int len)
 {
-	char	*v = '\0';
+	char	*v;
+
+	v = malloc(len);
+	if (v == NULL)
+	{
+		fprintf(pfdebug, "'..%d chars..'", len);
+		*cursor += len;
+		return;
+	}
 
 	memcpy(v, buf + *cursor, len);
 	*cursor += len;
@@ -377,6 +385,8 @@ pqTraceOutputNchar(char *buf, int LogEnd, int *cursor, FILE *pfdebug, int len)
 	pqTraceOutputBinary(v, len, pfdebug);
 	fprintf(pfdebug, "\'");
 
+	free(v);
+
 	if (*cursor < LogEnd)
 		fprintf(pfdebug, " ");
 	else


Re: error_severity of brin work item

2021-03-10 Thread David Steele

On 12/1/20 5:25 PM, Justin Pryzby wrote:

On Tue, Dec 01, 2020 at 03:57:24PM -0300, Alvaro Herrera wrote:


Another idea is if perform_work_item() were responsible for discarding
relations which disappear.  Currently it does this, which is racy since it
holds no lock.


That has the property that it remains contained in autovacuum.c, but no
other advantages I think.


It has the advantage that it moves all the try_open stuff out of brin.

I started implementing this, and then realized that the try_open stuff *has* to
be in the brin_summarize function, to handle the case that someone passes a
non-index, since it's SQL exposed.
So maybe we should use your LockOid patch now, and refactor in the future if we
add additional work-item types.


Thoughts on this, Álvaro? I can see that the first version of this patch 
was not ideal but the rework seems to have stalled. Since it is a bug 
perhaps it would be better to get something in as Justin suggests?


Regards,
--
-David
da...@pgmasters.net




Re: Change JOIN tutorial to focus more on explicit joins

2021-03-10 Thread David Steele

On 12/1/20 3:38 AM, Jürgen Purtz wrote:

On 30.11.20 21:25, David G. Johnston wrote:

Sorry, I managed to overlook the most recent patch.

I admitted my use of parentheses was incorrect and I don't see anyone 
else defending them.  Please remove them.


Minor typos:

"the database compare" -> needs an "s" (compares)

"In this case, the definition how to compare their rows." -> remove, 
redundant with the first sentence


"The results from the older implicit syntax, and the newer explicit 
JOIN/ON syntax, are identical" -> move the commas around to what is 
shown here




OK. Patch attached.


Peter, you committed some of this patch originally. Do you think the 
rest of the patch is now in shape to be committed?


Regards,
--
-David
da...@pgmasters.net




Re: shared-memory based stats collector

2021-03-10 Thread Fujii Masao



On 2021/03/10 17:51, Kyotaro Horiguchi wrote:

At Wed, 10 Mar 2021 15:20:43 +0900, Fujii Masao  
wrote in

On 2021/03/10 12:10, Kyotaro Horiguchi wrote:

Agreed. I moved the code back to the original place and added the crash
handling code. And I added a phrase to the comment.
+* Was it the archiver?  If exit status is zero (normal) or one (FATAL
+* exit), we assume everything is all right just like normal backends
+* and just try to restart a new one so that we immediately retry
+* archiving of remaining files. (If fail, we'll try again in future



"of" of "archiving of remaining" should be replaced with "the", or removed?


Either will do.  I don't mind turning the gerund (archiving) into a
gerund phrase (archiving remaining files).


Just for the record. Previously LogChildExit() was called and the following LOG
message was output when the archiver reported a FATAL error. OTOH the patch
prevents that, so the following LOG message is not output at FATAL exit of the
archiver. But I don't think that the message is required in that case
because a FATAL message indicating the same thing is already output.
Therefore, I'm ok with the patch.

LOG:  archiver process (PID 46418) exited with exit code 1


Yeah, that's the same behavior as walreceiver.


I read v50_003 patch.

When the archiver dies, ProcGlobal->archiverLatch should be reset to NULL,
like walreceiver does in WalRcvDie()?

Unlike walwriter and checkpointer, the archiver, like walreceiver,
may die while the server is running. Leaving the latch pointer
alone may lead to nudging a wrong process that took over the same
procarray slot. Added pgarch_die() to do that.


Thanks!

+   if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+   SetLatch(ProcGlobal->archiverLatch);

The latch can be reset to NULL in pgarch_die() between the if-condition and
SetLatch(), and which would be problematic. Probably we should protect
the access to the latch by using spinlock, like we do for walreceiver's latch?


Ugg. Right.  I remember that bug.  I moved the archiverLatch out
of ProcGlobal to a dedicated local struct PgArch and placed a spinlock
alongside it.
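
A minimal sketch of that pattern, modeled on how walreceiver guards its latch
pointer; the struct layout and function name here are assumptions, not
necessarily what v52 does:

#include "postgres.h"
#include "storage/latch.h"
#include "storage/spin.h"

typedef struct PgArchData
{
	slock_t		mutex;			/* protects latch */
	Latch	   *latch;			/* archiver's latch, or NULL if not running */
} PgArchData;

/* set up during shared-memory initialization (not shown) */
static PgArchData *PgArch;

/* Wake the archiver, tolerating concurrent archiver exit. */
static void
pgarch_wakeup(void)
{
	Latch	   *latch;

	SpinLockAcquire(&PgArch->mutex);
	latch = PgArch->latch;
	SpinLockRelease(&PgArch->mutex);

	if (latch)
		SetLatch(latch);
}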

Thanks for the review!  v52 is attached.


Thanks! I applied minor and cosmetic changes to the 0003 patch as follows.
Attached is the updated version of the 0003 patch. Barring any objection,
I will commit this patch.


-#include "storage/latch.h"
-#include "storage/proc.h"

I removed these because they are no longer necessary.


logical replication worker,
parallel worker, background 
writer,
client backend, checkpointer,
+   archiver,
startup, walreceiver,
walsender and walwriter.

In the document about pg_stat_activity, possible values in backend_type
column are all listed. I added "archiver" into the list.

BTW, those values were originally listed in alphabetical order,
but ISTM that they were gotten out of order at a certain moment.
So they should be listed in alphabetical order again. This should
be implemented as a separate patch.


-PgArchData *PgArch = NULL;
+static PgArchData *PgArch = NULL;

I marked PgArchData as static because it's used only in pgarch.c.


-   ShmemInitStruct("Archiver ", PgArchShmemSize(), );
+   ShmemInitStruct("Archiver Data", PgArchShmemSize(), );

I found that the original shmem name ended with an unnecessary space character.
I replaced it with "Archiver Data".


In reaper(), I moved the code block for archiver to the original location.


I ran pgindent for the files that the patch modifies.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3335d71eba..b82b2f6d66 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -935,6 +935,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
logical replication worker,
parallel worker, background 
writer,
client backend, checkpointer,
+   archiver,
startup, walreceiver,
walsender and walwriter.
In addition, background workers registered by extensions may have
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1c5a4f8b5a..26b023e754 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -25,11 +25,11 @@
 #include "common/archive.h"
 #include "miscadmin.h"
 #include "postmaster/startup.h"
+#include "postmaster/pgarch.h"
 #include "replication/walsender.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
-#include "storage/pmsignal.h"
 
 /*
  * Attempt to retrieve the 

Re: Allow batched insert during cross-partition updates

2021-03-10 Thread Georgios


‐‐‐ Original Message ‐‐‐
On Wednesday, March 10, 2021 1:23 PM, Georgios  
wrote:

>
>
> Hi,
>
> ‐‐‐ Original Message ‐‐‐
> On Thursday, February 18, 2021 10:54 AM, Amit Langote amitlangot...@gmail.com 
> wrote:
>
> > On Thu, Feb 18, 2021 at 6:52 PM Amit Langote amitlangot...@gmail.com wrote:
> >
> > > Based on the discussion at:
> > > https://www.postgresql.org/message-id/6929d485-2d2a-da46-3681-4a400a3d794f%40enterprisedb.com
> > > I'm posting the patch for $subject here in this new thread and I'll
> > > add it to the next CF per Tomas' advice.
> >
> > Done:https://commitfest.postgresql.org/32/2992/
>
> apparently I did not receive the review comment I sent via the commitfest app.
> Apologies for the chatter. Find the message-id here:
https://www.postgresql.org/message-id/161530518971.29967.9368488207318158252.pgcf%40coridan.postgresql.org

I continued looking a bit at the patch, yet I am either failing to see the fix
or I am looking at the wrong thing. Please find attached a small repro of what
my expectations were.

As you can see in the repro, I would expect the
 UPDATE local_root_remote_partitions SET a = 2;
to move the tuples to remote_partition_2 in the same transaction.
However this is not the case, with or without the patch.

Is my expectation of this patch wrong?

Cheers,
//Georgios

>
> > Amit Langote
> > EDB: http://www.enterprisedb.com



repro.sql
Description: application/sql


Re: Allow batched insert during cross-partition updates

2021-03-10 Thread Georgios


Hi,

‐‐‐ Original Message ‐‐‐
On Thursday, February 18, 2021 10:54 AM, Amit Langote  
wrote:

> On Thu, Feb 18, 2021 at 6:52 PM Amit Langote amitlangot...@gmail.com wrote:
>
> > Based on the discussion at:
> > https://www.postgresql.org/message-id/6929d485-2d2a-da46-3681-4a400a3d794f%40enterprisedb.com
> > I'm posting the patch for $subject here in this new thread and I'll
> > add it to the next CF per Tomas' advice.
>
> Done:https://commitfest.postgresql.org/32/2992/
>
> --

apparently I did not receive the review comment I sent via the commitfest app.
Apologies for the chatter. Find the message-id here:



>
> Amit Langote
> EDB: http://www.enterprisedb.com






Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation

2021-03-10 Thread Euler Taveira
On Wed, Mar 10, 2021, at 2:14 AM, Bharath Rupireddy wrote:
> While providing thoughts on [1], I observed that the error messages
> that are emitted while adding foreign, temporary and unlogged tables
> can be improved a bit from the existing [2] to [3]. For instance, the
> existing message when foreign table is tried to add into the
> publication "f1" is not a table" looks odd. Because it says that the
> foreign table is not a table at all.
I wouldn't mix [regular|partitioned|temporary|unlogged] tables with foreign
tables. Although they have a pg_class entry in common, foreign tables aren't
"real" tables (external storage); they even have different DDLs to handle them
(CREATE TABLE vs CREATE FOREIGN TABLE).

postgres=# CREATE PUBLICATION testpub FOR TABLE f1;
ERROR:  "f1" is not a table
DETAIL:  Only tables can be added to publications.

I agree that "f1 is not a table" is a confusing message at first because
foreign table has "table" as description. Maybe if we apply the negation in
both messages it would be clear (using the same wording as system tables).

ERROR:  "f1" is a foreign table
DETAIL:  Foreign tables cannot be added to publications.
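
Just to illustrate how that wording might be raised from
check_publication_add_relation() (a sketch, not the actual patch; the helper
name is made up):

#include "postgres.h"
#include "catalog/pg_class.h"
#include "utils/rel.h"

/* Sketch: reject a foreign table with the reworded message/detail. */
static void
reject_foreign_table(Relation targetrel)
{
	if (targetrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
		ereport(ERROR,
				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
				 errmsg("\"%s\" is a foreign table",
						RelationGetRelationName(targetrel)),
				 errdetail("Foreign tables cannot be added to publications.")));
}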


--
Euler Taveira
EDB   https://www.enterprisedb.com/


RE: Parallel Inserts in CREATE TABLE AS

2021-03-10 Thread houzj.f...@fujitsu.com
> Seems like v22 patch was failing in cfbot for one of the unstable test cases.
> Attaching v23 patch set with modification in 0003 and 0004 patches. No
> changes to 0001 and 0002 patches. Hopefully cfbot will be happy with v23.
> 
> Please consider v23 for further review.
Hi,

I was looking into the latest patch, here are some comments:

1)
 * (Note that we do allow CREATE TABLE AS, SELECT INTO, and CREATE
 * MATERIALIZED VIEW to use parallel plans, but as of now, only the 
leader
 * backend writes into a completely new table.  In the future, we can

Since we will support parallel insert with CTAS, these existing comments need
to be changed.

2)
In parallel.sgml

The query writes any data or locks any database rows. If a query
contains a data-modifying operation either at the top level or within
a CTE, no parallel plans for that query will be generated.  As an
exception, the commands CREATE TABLE ... AS,

Same as 1), we'd better mention that we now support parallel insert with CTAS.

3)
ExecInitParallelPlan(PlanState *planstate, EState *estate,
 Bitmapset *sendParams, int nworkers,
-int64 tuples_needed)
+int64 tuples_needed,
+ParallelInsertCmdKind parallel_ins_cmd,
+void *parallel_ins_info)

Is it better to place ParallelInsertCmdKind in struct ParallelInsertCTASInfo?
We could pass fewer parameters in some places.
Like:
typedef struct ParallelInsertCTASInfo
{
+   ParallelInsertCmdKind parallel_ins_cmd;
IntoClause *intoclause;
Oid objectid;

} ParallelInsertCTASInfo;

Best regards,
houzj


Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation

2021-03-10 Thread Bharath Rupireddy
On Wed, Mar 10, 2021 at 1:27 PM Jeevan Ladhe
 wrote:
>
> On Wed, Mar 10, 2021 at 10:44 AM Bharath Rupireddy 
>  wrote:
>>
>> Hi,
>>
>> While providing thoughts on [1], I observed that the error messages
>> that are emitted while adding foreign, temporary and unlogged tables
>> can be improved a bit from the existing [2] to [3].
>
> +1 for improving the error messages here.

Thanks for taking a look at the patch.

>> Attaching a small patch. Thoughts?
>
> I had a look at the patch and it looks good to me. However, I think after
> you have added the specific kind of table type in the error message itself,
> now the error details seem to be giving redundant information, but others 
> might
> have different thoughts.

The error detail is there to give a bit of information about which
relation types are unsupported with the CREATE PUBLICATION statement.
But with the error message now showing the type of relation, the
detail message looks redundant to me as well. If agreed, I can remove
it. Thoughts?

> The patch itself looks good otherwise. Also the make check and postgres_fdw
> check looking good.

Thanks.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



