Re: Asynchronous Append on postgres_fdw nodes.

2021-03-19 Thread Etsuro Fujita
On Fri, Mar 19, 2021 at 9:57 PM Justin Pryzby  wrote:
> is_async_capable_path() should probably have a "break" for case T_ForeignPath.

Good catch!  Will fix.

> little typos:
> aready
> sigle
> givne
> a event

Lots of typos.  :-(  Will fix.

Thank you for the review!

Best regards,
Etsuro Fujita




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-19 Thread Julien Rouhaud
On Fri, Mar 19, 2021 at 08:10:54PM -0400, Bruce Momjian wrote:
> On Sat, Mar 20, 2021 at 01:03:16AM +0100, Hannu Krosing wrote:
> > It would be really convenient if user-visible serialisations of the query id
> > had something that identifies the computation method.
> > 
> > maybe prefix 'N' for internal, 'S' for pg_stat_statements etc.
> > 
> > This would immediately show in logs at what point the id calculator was 
> > changed
> 
> Yeah, but it's an integer, and I don't think we want to change that.

Also, with Bruce's approach of asking extensions to error out if they would
overwrite a queryid, the only way to change the calculation method is a
restart.  So only one source can exist in the system.

Hopefully that's a big enough hammer that administrators will know what method
they're using.




Re: pglz compression performance, take two

2021-03-19 Thread Justin Pryzby
On Fri, Mar 19, 2021 at 01:29:14PM -0700, Mark Dilger wrote:
> Robert Haas just committed Dilip Kumar's LZ4 compression, 
> bbe0a81db69bd10bd166907c3701492a29aca294.
> 
> Is this pglz compression patch still relevant?  How does the LZ4 compression 
> compare on your hardware?

I think it's still relevant, since many people may not end up with binaries
--with-lz4 (I'm thinking of cloud providers).  PGLZ is what existing data uses,
and people may not want to/know to migrate to shiny new features, but they'd
like it if their queries were 20% faster after upgrading without needing to.

Also, Dilip's patch is only for TOAST compression, and pglz is also being used
for wal_compression - Andrey has a short patch to implement lz4 for that:
https://commitfest.postgresql.org/32/3015/

-- 
Justin




Re: Replication slot stats misgivings

2021-03-19 Thread Amit Kapila
On Sat, Mar 20, 2021 at 9:25 AM Amit Kapila  wrote:
>
> On Sat, Mar 20, 2021 at 12:22 AM Andres Freund  wrote:
> >
> > And then more generally about the feature:
> > - If a slot was used to stream out a large amount of changes (say an
> >   initial data load), but then replication is interrupted before the
> >   transaction is committed/aborted, stream_bytes will not reflect the
> >   many gigabytes of data we may have sent.
> >
>
> We can probably update the stats each time we spill or stream the
> transaction data, but it was not clear at that stage whether or how
> much it would be useful.
>
> > - It seems weird that we went to the trouble of inventing replication
> >   slot stats, but then limit them to logical slots, and even there don't
> >   record the obvious things like the total amount of data sent.
> >
>
> Won't spill_bytes and stream_bytes give you the amount of data sent?
>
> >
> > I think the best way to address the more fundamental "pgstat related"
> > complaints is to change how replication slot stats are
> > "addressed". Instead of using the slots name, report stats using the
> > index in ReplicationSlotCtl->replication_slots.
> >
> > That removes the risk of running out of "replication slot stat slots":
> > If we lose a drop message, the index eventually will be reused and we
> > likely can detect that the stats were for a different slot by comparing
> > the slot name.
> >
>
> This idea is worth exploring to address the complaints but what do we
> do when we detect that the stats are from a different slot? It would
> have a mix of stats from the old and new slot. We probably need to
> reset it after we detect that.
>

What if the user creates a slot with the same name after dropping the
old one and it gets the same index?  The chances are low, but it is
still a possibility; maybe that is okay.

> What if, at some frequency (say whenever we
> run out of indexes), we check whether the slots we are maintaining in
> pgstat.c have some stale slot entries (the entry exists but the actual
> slot has been dropped)?
>

A similar drawback (the user creating a slot with the same name after
dropping it) exists with this approach as well.

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] logical decoding of two-phase transactions

2021-03-19 Thread Ajin Cherian
On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila  wrote:

> On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian  wrote:
> >
> > Missed the patch - 0001, resending.
> >
>
> I have made miscellaneous changes in the patch which includes
> improving comments, error messages, and miscellaneous coding
> improvements. The most notable one is that we don't need an additional
> parameter in walrcv_startstreaming, if the two_phase option is set
> properly. My changes are in v63-0002-Misc-changes-by-Amit, if you are
> fine with those, then please merge them in the next version. I have
> omitted the dev-logs patch but feel free to submit it. I have one
> question:
>
>
I am fine with these changes; I see that Peter has already merged them.

thanks,
Ajin Cherian
Fujitsu Australia


Re: Fix DROP TABLESPACE on Windows with ProcSignalBarrier?

2021-03-19 Thread Thomas Munro
On Sat, Mar 6, 2021 at 12:10 PM Daniel Gustafsson  wrote:
> > On 3 Mar 2021, at 23:19, Thomas Munro  wrote:
> > That's why I release and reacquire that LWLock.  But does that break some
> > logic?
>
> One clear change to current behavior is naturally that a concurrent
> TablespaceCreateDbspace can happen while barrier absorption is performed.
> Given where we are that might not be a problem, but I don't have enough
> caffeine at the moment to conclude anything there.  Testing by inducing
> concurrent calls while absorption was stalled didn't trigger anything, but I'm
> sure I didn't test every scenario. Do you see anything off the cuff?

Now I may have the opposite problem (too much coffee) but it looks
like it should work about as well as it does today.  At this new point
where we released the LWLock, all we've really done is possibly unlink
some empty database directories in destroy_tablespace_directories(),
and that's harmless, they'll be recreated on demand if we abandon
ship.  If TablespaceCreateDbspace() happened while we were absorbing
the barrier and not holding the lock in this new code, then a
concurrent mdcreate() is running and so we have a race where we'll
again try to drop all empty directories, and it'll try to create its
relfile in the new empty directory, and one of us will fail (possibly
with an ugly ENOENT error message).  But that's already the case in
the master branch: mdcreate() could have run TablespaceCreateDbspace()
before we acquire the lock in the master branch, and (with
pathological enough scheduling) it could reach its attempt to create
its relfile after DropTableSpace() has unlinked the empty directory.

The interlocking here is hard to follow.  I wonder why we don't use
heavyweight locks to do per-tablespace interlocking between
DefineRelation() and DropTableSpace().  I'm sure this question is
hopelessly naive and I should probably go and read some history.




Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.

2021-03-19 Thread Masahiro Ikeda

When only the stats collector exits by SIGQUIT, with the patch
FreeWaitEventSet() is also skipped. Is this ok?


Thanks, I fixed it.


I'm still not sure if FreeWaitEventSet() is actually necessary in that
exit case. Could you tell me why you thought FreeWaitEventSet() is
necessary in that case?


Sorry, there is no evidence that we should call it.
I thought that calling FreeWaitEventSet() is good manners after using
CreateWaitEventSet(), and the performance impact of calling it seems
negligible.

(Please let me know if my understanding is wrong.)
The reason I called this a matter of manners is that I think there is no
problem even without calling FreeWaitEventSet() before the process exits,
regardless of fast or immediate shutdown.  Since the "WaitEventSet *wes" is
a process-local resource, it will be released by the OS even if
FreeWaitEventSet() is not called.

I'm now changing my mind: skipping it is better because the code can be
simpler, and it's unimportant for the above reason, especially when an
immediate shutdown is requested, which is bad manners in itself.

BTW, SysLoggerMain() creates a WaitEventSet but doesn't call
FreeWaitEventSet().  Is it better to call FreeWaitEventSet() before
exiting there too?



Since it's not good to do many things in a signal handler, even when
FreeWaitEventSet() is really necessary, it should be called at subsequent
startup of the stats collector instead of calling it in the handler
at the end? BTW, I found bgworker calls FreeWaitEventSet() via
ShutdownLatchSupport() at its startup. But I'm also not sure if this
is really necessary at the startup...


OK, I understood that I need to change the signal handler's implementation
if FreeWaitEventSet() is necessary.

In my understanding from the following commit message, an ordinary bgworker
which doesn't access shared memory doesn't use the latch which the
postmaster installed.  So, ShutdownLatchSupport() is called at its startup.
Thoughts?

2021/3/1 commit: 83709a0d5a46559db016c50ded1a95fd3b0d3be6
```
The signal handler is now installed in all postmaster children by
InitializeLatchSupport().  Those wishing to disconnect from it should
call ShutdownLatchSupport().
```

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-19 Thread Jan Wieck

On 3/8/21 11:58 AM, Tom Lane wrote:

The answer up to now has been "raise max_locks_per_transaction enough
so you don't see the failure".  Having now consumed a little more
caffeine, I remember that that works in pg_upgrade scenarios too,
since the user can fiddle with the target cluster's postgresql.conf
before starting pg_upgrade.

So it seems like the path of least resistance is

(a) make pg_upgrade use --single-transaction when calling pg_restore

(b) document (better) how to get around too-many-locks failures.


That would first require fixing how pg_upgrade creates the databases. It
uses "pg_restore --create", which is mutually exclusive with
--single-transaction because we cannot create a database inside a
transaction. Along the way pg_upgrade also mangles pg_database.datdba
(all databases are owned by postgres after an upgrade; I will submit a
separate patch for that, as I consider it a bug in itself).


All that aside, the entire approach doesn't scale.

With a hacked-up pg_upgrade that runs "createdb" first before calling
pg_restore with --single-transaction, I can upgrade 1M large objects with

max_locks_per_transaction = 5300
max_connections = 100

which contradicts the docs. I need to find out where that math went off
the rails, because that config should only have room for 530,000 locks,
not 1M. The same test fails with max_locks_per_transaction = 5200.
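
(For reference, the documented sizing of the shared lock table is

    max_locks_per_transaction * (max_connections + max_prepared_transactions)

objects, so assuming max_prepared_transactions = 0 that configuration has
room for 5300 * (100 + 0) = 530,000 locks, which is where the figure above
comes from.)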


But this would mean that one has to modify the postgresql.conf to 
something like 530,000 max_locks_per_transaction at 100 max_connections 
in order to actually run a successful upgrade of 100M large objects. 
This config requires 26GB of memory just for locks. Add to that the 
memory pg_restore needs to load the entire TOC before even restoring a 
single object.


Not going to work. But tests are still ongoing ...


Regards, Jan


--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: a verbose option for autovacuum

2021-03-19 Thread Masahiko Sawada
On Fri, Mar 19, 2021 at 3:14 PM Michael Paquier  wrote:
>
> On Thu, Mar 18, 2021 at 11:30:46PM +0900, Masahiko Sawada wrote:
> > Sorry, I attached the wrong version patch. So attached the right one.
>
> Thanks.  I have been hacking on that, and I think that we could do
> more in terms of integration of the index stats into LVRelStats to
> help with debugging issues, mainly, but also to open the door at
> allowing autovacuum to use the parallel path in the future.

Thank you for updating the patch!

> Hence,
> for consistency, I think that we should change the following things in
> LVRelStats:
> - Add the number of indexes.  It looks rather unusual to not track
> down the number of indexes directly in the structure anyway, as the
> stats array gets added there.
> - Add all the index names, for parallel and non-parallel mode.

Agreed with those two changes.

> - Replace the index name in the error callback by an index number,
> pointing back to its location in indstats and indnames.

I like this idea but I'm not sure the approach that the patch took
improved the code. Please see my concern below.

>
> As lazy_vacuum_index() requires the index number to be set internally
> to it, this means that we need to pass it down across
> vacuum_indexes_leader(), lazy_parallel_vacuum_indexes(), but that
> seems like an acceptable compromise to me for now.  I think that it
> would be good to tighten a bit more the relationship between the index
> stats in the DSM for the parallel case and the ones in local memory,
> but what we have here looks enough to me so we could figure out that
> as a future step.
>
> Sawada-san, what do you think?  Attached is the patch I have finished
> with.

With this idea, lazy_vacuum_index() will become:

static void
lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
  LVDeadTuples *dead_tuples, double reltuples,
  LVRelStats *vacrelstats, int indnum)

and the caller calls this function as follow:

for (idx = 0; idx < nindexes; idx++)
lazy_vacuum_index(Irel[idx], &(vacrelstats->indstats[idx]),
  vacrelstats->dead_tuples,
  vacrelstats->old_live_tuples, vacrelstats,
  idx);

It's not bad but it seems a bit redundant to me. We pass idx despite
also passing Irel[idx] and &(vacrelstats->indstats[idx]). I think your
first idea, done in the v4 patch (saving index names at the beginning of
heap_vacuum_rel() for autovacuum logging purposes only), and the idea of
deferring closing the indexes until the end of heap_vacuum_rel() so that
we can refer to the index name for autovacuum logging, are simpler.

BTW the patch led to a crash in my environment. The problem is here:

 static void
-vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
+vacuum_one_index(Relation indrel,
 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-LVDeadTuples *dead_tuples, LVRelStats *vacrelstats)
+LVDeadTuples *dead_tuples, LVRelStats *vacrelstats,
+int indnum)
 {
IndexBulkDeleteResult *bulkdelete_res = NULL;
+   IndexBulkDeleteResult *stats;

We need to initialize *stats with NULL here.

And while looking at the change of vacuum_one_index() I found another problem:

@@ -2349,17 +2400,20 @@ vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
         * Update the pointer to the corresponding bulk-deletion result if
         * someone has already updated it.
         */
-       if (shared_indstats->updated && *stats == NULL)
-           *stats = bulkdelete_res;
+       if (shared_indstats->updated)
+           stats = bulkdelete_res;
    }
+   else
+       stats = vacrelstats->indstats[indnum];

    /* Do vacuum or cleanup of the index */
    if (lvshared->for_cleanup)
-       lazy_cleanup_index(indrel, stats, lvshared->reltuples,
-                          lvshared->estimated_count, vacrelstats);
+       lazy_cleanup_index(indrel, &stats, lvshared->reltuples,
+                          lvshared->estimated_count, vacrelstats,
+                          indnum);
    else
-       lazy_vacuum_index(indrel, stats, dead_tuples,
-                         lvshared->reltuples, vacrelstats);
+       lazy_vacuum_index(indrel, &stats, dead_tuples,
+                         lvshared->reltuples, vacrelstats, indnum);

/*
 * Copy the index bulk-deletion result returned from ambulkdelete and

If shared_indstats is NULL (e.g., we do "stats =
vacrelstats->indstats[indnum];"), vacrelstats->indstats[indnum] is not
updated since we pass &stats.  I think we should pass
&(vacrelstats->indstats[indnum]) instead in this case.

Previously, we updated the element of the pointer array of index
statistics to point to either local memory or the DSM.

Re: Replication slot stats misgivings

2021-03-19 Thread Amit Kapila
On Sat, Mar 20, 2021 at 12:22 AM Andres Freund  wrote:
>
> And then more generally about the feature:
> - If a slot was used to stream out a large amount of changes (say an
>   initial data load), but then replication is interrupted before the
>   transaction is committed/aborted, stream_bytes will not reflect the
>   many gigabytes of data we may have sent.
>

We can probably update the stats each time we spill or stream the
transaction data, but it was not clear at that stage whether or how
much it would be useful.

> - It seems weird that we went to the trouble of inventing replication
>   slot stats, but then limit them to logical slots, and even there don't
>   record the obvious things like the total amount of data sent.
>

Won't spill_bytes and stream_bytes give you the amount of data sent?

>
> I think the best way to address the more fundamental "pgstat related"
> complaints is to change how replication slot stats are
> "addressed". Instead of using the slots name, report stats using the
> index in ReplicationSlotCtl->replication_slots.
>
> That removes the risk of running out of "replication slot stat slots":
> If we lose a drop message, the index eventually will be reused and we
> likely can detect that the stats were for a different slot by comparing
> the slot name.
>

This idea is worth exploring to address the complaints but what do we
do when we detect that the stats are from a different slot? It would
have a mix of stats from the old and new slot. We probably need to
reset it after we detect that. What if, at some frequency (say whenever we
run out of indexes), we check whether the slots we are maintaining in
pgstat.c have some stale slot entries (the entry exists but the actual
slot has been dropped)?

> It also makes it easy to handle the issue of max_replication_slots being
> lowered and there still being stats for a slot - we simply can skip
> restoring that slots data, because we know the relevant slot can't exist
> anymore. And we can make the initial pgstat_report_replslot() during
> slot creation use a
>

Here, your last sentence seems to be incomplete.

> I'm wondering if we should just remove the slot name entirely from the
> pgstat.c side of things, and have pg_stat_get_replication_slots()
> inquire about slots by index as well and get the list of slots to report
> stats for from slot.c infrastructure.
>

But in your idea, how will you detect that some of the stats are from
an already-dropped slot?

I'll create an entry for this in PG14 Open items wiki.

-- 
With Regards,
Amit Kapila.




Re: pspg pager is finished

2021-03-19 Thread Julien Rouhaud
On Sat, Mar 20, 2021 at 04:34:30AM +0100, Pavel Stehule wrote:
> Hi
> 
> I finished work on pspg.
> 
> https://github.com/okbob/pspg
> 
> Now it has special features like rows or block selection by mouse, and
> export related data to file or to clipboard in csv or tsv or insert
> formats. Some basic features like sorting data per selected columns are
> possible too.
> 
> I hope this tool will serve well, and so work with Postgres (or other
> supported databases) in the terminal will be more comfortable and more
> efficient.

Thanks a lot for that tool Pavel.  It has been my favorite psql pager for
years!




pspg pager is finished

2021-03-19 Thread Pavel Stehule
Hi

I finished work on pspg.

https://github.com/okbob/pspg

Now it has special features like rows or block selection by mouse, and
export related data to file or to clipboard in csv or tsv or insert
formats. Some basic features like sorting data per selected columns are
possible too.

I hope this tool will serve well, and so work with Postgres (or other
supported databases) in the terminal will be more comfortable and more
efficient.

Regards

Pavel Stehule


Re: [HACKERS] Custom compression methods

2021-03-19 Thread Robert Haas
On Fri, Mar 19, 2021 at 6:38 PM Alvaro Herrera  wrote:
> I suggest to add comments to this effect,
> perhaps as the attached (feel free to reword, I think mine is awkward.)

It's not bad, although "the decompressed version of the full datum"
might be a little better. I'd probably say instead: "This method might
decompress the entire datum rather than just a slice, if slicing is
not supported." or something of to that effect. Feel free to commit
something you like.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Robert Haas
On Fri, Mar 19, 2021 at 8:25 PM Tom Lane  wrote:
> Extrapolating from the way we've dealt with similar issues
> in the past, I think the structure of pg_dump's output ought to be:
>
> 1. SET default_toast_compression = 'source system's value'
> in among the existing passel of SETs at the top.  Doesn't
> matter whether or not that is the compiled-in value.
>
> 2. No mention of compression in any CREATE TABLE command.
>
> 3. For any column having a compression option different from
> the default, emit ALTER TABLE SET ... to set that option after
> the CREATE TABLE.  (You did implement such a SET, I trust.)

Actually, *I* didn't implement any of this. But ALTER TABLE sometab
ALTER somecol SET COMPRESSION somealgo works.

This sounds like a reasonable approach.

> There might be scope for a dump option to suppress mention
> of compression altogether (comparable to, eg, --no-tablespaces).
> But I think that's optional.  In any case, we don't want
> to put people in a position where they should have used such
> an option and now they have no good way to recover their
> dump to the system they want to recover to.

The patch already has --no-toast-compression.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




recovery_init_sync_method=wal

2021-03-19 Thread Thomas Munro
Hello hackers,

I'm starting a new thread and CF entry for the material for r15 from
the earlier thread[1] that introduced the recovery_init_sync_method
GUC for r14.  I wrote a summary of this topic as I see it, while it's
still fresh on my mind from working on commit 61752afb, starting from
what problem this solves.

TL;DR: Here's a patch that adds a less pessimistic, faster starting
crash recovery mode based on first principles.

=== Background ===

Why do we synchronise the data directory before we run crash recovery?

1.  WAL: It's not safe for changes to data pages to reach the disk
before the WAL.  This is the basic log-before-data rule.  Suppose we
didn't do that.  If our online cluster crashed after calling
pwrite() but before calling fdatasync(), the WAL data
we later read in crash recovery may differ from what's really on disk,
and it'd be dangerous to replay its contents, because its effects on
data pages might then be written to disk at any time and break the
rule.  If that happens and you lose power and then run crash recovery
a second time, now you have some phantom partial changes already
applied but no WAL left to redo them, leading to hazards including
xids being recycled, effects of committed transactions being partially
lost, multi-page changes being half done, and other such corruption.

2.  Data files: We can't skip changes to a data page based on the page
header's LSN if the page is not known to be on disk (that is, it is
clean in PostgreSQL's buffer pool, but possibly dirty in the kernel's
page cache).  Otherwise, the end-of-recovery checkpoint will do
nothing for the page (assuming nothing else marks the page dirty in
our buffer pool before that), so we'll complete the checkpoint, and
allow the WAL to be discarded.  Then we might lose power before the
kernel gets a chance to write the data page to disk, and when the
lights come back on we'll run crash recovery but we don't replay that
forgotten change from before the bogus checkpoint, and we have lost
committed changes.  (I don't think this can happen with
full_page_writes=on, because in that mode we never skip pages and
always do a full replay, which has various tradeoffs.)

I believe those are the fundamental reasons.  If you know of more
reasons, I'd love to hear them.

Why don't we synchronise the data directory for a clean startup?

When we start up the database from a shutdown checkpoint, we take the
checkpoint at face value.  A checkpoint is a promise that all changes
up to a given LSN have been durably stored on disk.  There are a
couple of cases where that isn't true:

1.  You were previously running with fsync=off.  That's OK, we told
you not to do that.  Checkpoints created without fsync barriers to
enforce the strict WAL-then-data-then-checkpoint protocol are
forgeries.

2.  You made a file system-level copy of a cluster that you shut down
cleanly first, using cp, tar, scp, rsync, xmodem etc.  Now you start
up the copy.  Its checkpoint is a forgery.  (Maybe our manual should
mention this problem under "25.2. File System Level Backup" where it
teaches you to rsync your cluster.)

How do the existing recovery_init_sync_method modes work?

You can think of recovery_init_sync_method as different "query plans"
for finding dirty buffers in the kernel's page cache to sync.

1.  fsync:  Go through the directory tree and call fsync() on each
file, just in case that file had some dirty pages.  This is a terrible
plan if the files aren't currently in the kernel's VFS cache, because
it could take up to a few milliseconds to get each one in there
(random read to slower SSDs or network storage or IOPS-capped cloud
storage).  If there really is a lot of dirty data, that's a good bet,
because the files must have been in the VFS cache already.  But if
there are one million mostly read-only tables, it could take ages just
to *open* all the files, even though there's not much to actually
write out.

2.  syncfs:  Go through the kernel page cache instead, looking for
dirty data in the small number of file systems that contain our
directories.  This is driven by data that is already in the kernel's
cache, so we avoid the need to perform I/O to search for dirty data.
That's great if your cluster is running mostly alone on the file
system in question, but it's not great if you're running another
PostgreSQL cluster on the same file system, because now we generate
extra write I/O when it finds incidental other stuff to write out.

These are both scatter gun approaches that can sometimes do a lot of
useless work, and I'd like to find a precise version that uses
information we already have about what might be dirty according to the
meaning of a checkpoint and a transaction log.  The attached patch
does that as follows:

1.  Sync the WAL using fsync(), to enforce the log-before-data rule.
That's moved into the existing loop that scans the WAL files looking
for temporary files to unlink.  (I suppose it should perhaps do the
"presync" trick too.  Not 

Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-19 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 5:07 PM Tom Lane  wrote:
> Doubt that it matters.  The workflow would have to be "commit and push
> the mechanical updates, then edit the tracking file, commit and push
> that".  You don't have the commit hash nailed down till you've pushed.

Okay. I have made a personal TODO list item for this. I'll pick this
up again in April, once the final CF is over.

-- 
Peter Geoghegan




Re: [HACKERS] logical decoding of two-phase transactions

2021-03-19 Thread Amit Kapila
On Sat, Mar 20, 2021 at 7:07 AM Ajin Cherian  wrote:
>
> On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila  wrote:
>>
>> On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian  wrote:
>> >
>> > Missed the patch - 0001, resending.
>> >
>>
>>
>> @@ -538,10 +550,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>> ..
>> + /* Set two_phase_at LSN only if it hasn't already been set. */
>> + if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
>> + {
>> + MyReplicationSlot->data.two_phase_at = start_lsn;
>> + slot->data.two_phase = true;
>> + ReplicationSlotMarkDirty();
>> + ReplicationSlotSave();
>> + SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
>> + }
>>
>> What if the walsender or apply worker restarts after setting
>> two_phase_at/two_phase here and updating the two_phase state in
>> pg_subscription? Won't we need to set SnapBuildSetTwoPhaseAt after
>> restart as well?
>
>
> After a restart,   two_phase_at will be set by calling  
> AllocateSnapshotBuilder with two_phase_at
>

Okay, that makes sense.

-- 
With Regards,
Amit Kapila.




Re: Logical Replication vs. 2PC

2021-03-19 Thread Amit Kapila
On Fri, Mar 19, 2021 at 9:22 PM Markus Wanner
 wrote:
>
> On 18.03.21 10:45, Amit Kapila wrote:
> > While reviewing/testing subscriber-side work for $SUBJECT [1], I
> > noticed a problem that seems to need a broader discussion, so started
> > this thread. We can get prepare for the same GID more than once for
> > the cases where we have defined multiple subscriptions for
> > publications on the same server and prepared transaction has
> > operations on tables subscribed to those subscriptions. For such
> > cases, one of the prepare will be successful and others will fail in
> > which case the server will send them again. Once the commit prepared
> > is done for the first one, the next prepare will be successful. Now,
> > this is not ideal but will work.
>
> That's assuming you're using the same gid on the subscriber, which does
> not apply to all use cases.  It clearly depends on what you try to
> achieve by decoding in two phases, obviously.
>
> We clearly don't have this issue in BDR, because we're using xids
> (together with a node id) to globally identify transactions and
> construct local (per-node) gids that don't clash.
>

So, I think you are using the publisher's xid and the subscription's
origin_id to achieve uniqueness, because both will be accessible at
prepare and commit-prepared time. Right? If so, I think that will work
out here as well. But if we think of using an xid generated on the
subscriber, then we need to keep some mapping between the original GID
sent by the publisher and the GID generated by us (origin + xid of the
subscription) because, at commit-prepared time, we won't know that xid.

> (Things get even more interesting if you take into account that users
> may reuse the same gid for different transactions.
>

Are you saying that users might use the same GID that we have
constructed internally (say by combining origin and xid: originid_xid),
and then there will be a conflict while replaying such transactions?
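
For illustration only: a prepared transaction's GID is just a string that
must be unique on the server at any given time, so such a clash surfaces
as a failed PREPARE.  A minimal sketch, with a made-up GID in the
originid_xid style mentioned above:

    BEGIN;
    -- apply the decoded changes ...
    PREPARE TRANSACTION '1_742';

    -- a second PREPARE TRANSACTION reusing the same GID fails:
    -- ERROR:  transaction identifier "1_742" is already in use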


>  Lag between
> subscriptions could then lead to blocking between different origin
> transactions...)
>

Right, and even for one subscription that can lead to blocking between
transactions. But isn't it similar to what we get for a primary key
violation while replaying transactions? In that case, we suggest users
remove the conflicting rows, so in such cases we could recommend that
users commit/rollback such prepared xacts?

-- 
With Regards,
Amit Kapila.




RE: invalid data in file backup_label problem on windows

2021-03-19 Thread wangsh.f...@fujitsu.com
David Steele  wrote:

> It's not clear to me what text editors have to do with this? Are you  
> editing the file manually?

When I execute SELECT * FROM pg_stop_backup(false, true) in psql,
the results are shown like this:

    lsn    |                      labelfile                       | spcmapfile
-----------+------------------------------------------------------+------------
 0/2000138 | START WAL LOCATION: 0/228 (file 00010002)           +|
           | CHECKPOINT LOCATION: 0/260                          +|
           | BACKUP METHOD: streamed                             +|
           | BACKUP FROM: master                                 +|
..
The results are only shown on screen, and this function does not generate
any files.  What I do is write the second field (labelfile) to a new file,
backup_label, and write the third field (spcmapfile) to tablespace_map if
the third field is not null.

Therefore, I chose a text editor to help me write the file.
I copy a line of the second field into the text editor and press the
'enter' key, repeating this until all the lines have been pasted into the
text editor, then save the file.

If this is not a good way to create the backup_label file, can you tell me
how I can create this file on Windows?


I think the real problem is that this file is opened in binary mode on
Windows.  If I use libpq to get the result and write it to the file directly
(the default action on Windows is to open files in text mode), this problem
will happen.  So I consider this a bug.

Best Regards,
Shenhao Wang





Re: [HACKERS] Custom compression methods

2021-03-19 Thread Andrew Dunstan


On 3/19/21 8:25 PM, Tom Lane wrote:
> Alvaro Herrera  writes:
>> On 2021-Mar-19, Robert Haas wrote:
>>> Well, I really do hope that some day in the bright future, pglz will
>>> no longer be the thing we're shipping as the postgresql.conf default.
>>> So we'd just be postponing the noise until then. I think we need a
>>> better idea than that.
>> Hmm, why?  In that future, we can just change the pg_dump behavior to no
>> longer dump the compression clause if it's lz4 or whatever better
>> algorithm we choose.  So I think I'm clarifying my proposal to be "dump
>> the compression clause if it's different from the compiled-in default"
>> rather than "different from the GUC default".
> Extrapolating from the way we've dealt with similar issues
> in the past, I think the structure of pg_dump's output ought to be:
>
> 1. SET default_toast_compression = 'source system's value'
> in among the existing passel of SETs at the top.  Doesn't
> matter whether or not that is the compiled-in value.
>
> 2. No mention of compression in any CREATE TABLE command.
>
> 3. For any column having a compression option different from
> the default, emit ALTER TABLE SET ... to set that option after
> the CREATE TABLE.  (You did implement such a SET, I trust.)
>
> This minimizes the chatter for the normal case where all or most
> columns have the same setting, and more importantly it allows the
> dump to be read by older PG systems (or non-PG systems, or newer
> systems built without --with-lz4) that would fail altogether
> if the CREATE TABLE commands contained compression options.
> To use the dump that way, you do have to be willing to ignore
> errors from the SET and the ALTERs ... but that beats the heck
> out of having to manually edit the dump script to get rid of
> embedded COMPRESSION clauses.
>
> I'm not sure whether we'd still need to mess around beyond
> that to make the buildfarm's existing upgrade tests happy.
> But we *must* do this much in any case, because as it stands
> this patch has totally destroyed some major use-cases for
> pg_dump.
>
> There might be scope for a dump option to suppress mention
> of compression altogether (comparable to, eg, --no-tablespaces).
> But I think that's optional.  In any case, we don't want
> to put people in a position where they should have used such
> an option and now they have no good way to recover their
> dump to the system they want to recover to.
>
>   


I'm fairly sure this prescription would satisfy the buildfarm. It sounds
pretty sane to me - I'd independently come to a very similar conclusion
before reading the above.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [HACKERS] logical decoding of two-phase transactions

2021-03-19 Thread Ajin Cherian
On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila  wrote:

> On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian  wrote:
> >
> > Missed the patch - 0001, resending.
> >
>
>
> @@ -538,10 +550,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
> ..
> + /* Set two_phase_at LSN only if it hasn't already been set. */
> + if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
> + {
> + MyReplicationSlot->data.two_phase_at = start_lsn;
> + slot->data.two_phase = true;
> + ReplicationSlotMarkDirty();
> + ReplicationSlotSave();
> + SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
> + }
>
> What if the walsender or apply worker restarts after setting
> two_phase_at/two_phase here and updating the two_phase state in
> pg_subscription? Won't we need to set SnapBuildSetTwoPhaseAt after
> restart as well?
>

After a restart,   two_phase_at will be set by calling
AllocateSnapshotBuilder with two_phase_at

@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
  ctx->reorder = ReorderBufferAllocate();
  ctx->snapshot_builder =
  AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
- need_full_snapshot, slot->data.initial_consistent_point);
+ need_full_snapshot, slot->data.two_phase_at);

and then in  AllocateSnapshotBuilder:

@@ -309,7 +306,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
  builder->initial_xmin_horizon = xmin_horizon;
  builder->start_decoding_at = start_lsn;
  builder->building_full_snapshot = need_full_snapshot;
- builder->initial_consistent_point = initial_consistent_point;
+ builder->two_phase_at = two_phase_at;

regards,
Ajin Cherian
Fujitsu Australia


Re: [PATCH] Identify LWLocks in tracepoints

2021-03-19 Thread Craig Ringer
On Sat, 20 Mar 2021, 04:21 Peter Eisentraut, <
peter.eisentr...@enterprisedb.com> wrote:

>
> On 18.03.21 07:34, Craig Ringer wrote:
> > The main reason I didn't want to add more tracepoints than were strictly
> > necessary is that Pg doesn't enable the systemtap semaphores feature, so
> > right now we do a T_NAME(lock) evaluation each time we pass a tracepoint
> > if --enable-dtrace is compiled in, whether or not anything is tracing.
> > This was fine on pg11 where it was just:
> >
> > #define T_NAME(lock) \
> >  (LWLockTrancheArray[(lock)->tranche])
> >
> > but since pg13 it instead expands to
> >
> >  GetLWTrancheName((lock)->tranche)
> >
> > where GetLWTrancheName isn't especially trivial. We'll run that function
> > every single time we pass any of these tracepoints and then discard the
> > result, which is ... not ideal. That applies so long as Pg is compiled
> > with --enable-dtrace. I've been meaning to look at enabling the
> > systemtap semaphores feature in our build so these can be wrapped in
> > unlikely(TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED()) guards, but I wanted
> > to wrap this patch set up first as there are some complexities around
> > enabling the semaphores feature.
>
> There is already support for that.  See the documentation at the end of
> this page:
>
> https://www.postgresql.org/docs/devel/dynamic-trace.html#DEFINING-TRACE-POINTS


Pretty sure it won't work right now.

To use systemtap semaphores (the _ENABLED macros) you need to run dtrace -g
to generate a probes.o then link that into postgres.

I don't think we do that. I'll double check soon.


Re: shared memory stats: high level design decisions: consistency, dropping

2021-03-19 Thread Hannu Krosing
On Sat, Mar 20, 2021 at 1:21 AM Andres Freund  wrote:
>
> Hi,
>
> On 2021-03-20 01:16:31 +0100, Hannu Krosing wrote:
> > > But now we could instead schedule stats to be removed at commit
> > time. That's not trivial of course, as we'd need to handle cases where
> > the commit fails after the commit record, but before processing the
> > dropped stats.
> >
> > We likely can not remove them at commit time, but only after the
> > oldest open snapshot moves parts that commit ?
>
> I don't see why? A dropped table is dropped, and cannot be accessed
> anymore. Snapshots don't play a role here - the underlying data is gone
> (minus a placeholder file to avoid reusing the oid, until the next
> commit).  If you run a vacuum on some unrelated table in the same
> database, the stats for a dropped table will already be removed long
> before there's no relation that could theoretically open the table.
>
> Note that table level locking would prevent a table from being dropped
> if a long-running transaction has already accessed it.

Yeah, just checked. DROP TABLE waits until the reading transaction finishes.

>
> > Would an approach where we keep stats in a structure logically similar
> > to MVCC we use for normal tables be completely unfeasible ?
>
> Yes, pretty unfeasible. Stats should work on standbys too...

I did not mean actually using MVCC and real transaction ids but rather
a similar approach, where (potentially) different stats rows are kept
for each backend.

This of course is only a win in case multiple backends can use the
same stats row. Otherwise it is easier to copy the backend's version into
backend-local memory.

But I myself do not see any problem with stats rows changing all the time.

The only worry would be parts of the same row being out of sync. This
can of course be solved by locking, but for a large number of backends
with tiny transactions this locking itself could potentially become a
problem. Here alternating between two or more versions could help, and
then it also starts to make sense to keep the copies in shared memory.



> Regards,
>
> Andres




Re: fdatasync performance problem with large number of DB files

2021-03-19 Thread David Steele

On 3/19/21 7:16 PM, Thomas Munro wrote:

Thanks Justin and David.  Replies to two emails inline:

Fair point.  Here's what I went with:

 When set to fsync, which is the default,
 PostgreSQL will recursively open and
 synchronize all files in the data directory before crash
recovery
 begins.  The search for files will follow symbolic links for the WAL
 directory and each configured tablespace (but not any other symbolic
 links).



+1


I thought about adding some text along the lines that such symlinks
are not expected, but I think you're right that what we really need is
a good place to point to.  I mean, generally you can't mess around
with the files managed by PostgreSQL and expect everything to keep
working correctly


WRT symlinks I'm not sure that's fair to say. From PG's perspective 
it's just a dir/file after all. Other than pg_wal I have seen 
pg_stat/pg_stat_tmp sometimes symlinked, plus config files, and the log dir.


pgBackRest takes a pretty liberal approach here. We preserve all 
dir/file symlinks no matter where they appear and allow all of them to 
be remapped on restore.



but it wouldn't hurt to make an explicit statement
about symlinks and where they're allowed (or maybe there is one
already and I failed to find it).  


I couldn't find it either and I would be in favor of it. For instance, 
pgBackRest forbids tablespaces inside PGDATA and when people complain 
(more often than you might imagine) we can just point to the code/docs.



There are hints though, like
pg_basebackup's documentation which tells you it won't follow or
preserve them in general, but... hmm, it also contemplates various
special subdirectories (pg_dynshmem, pg_notify, pg_replslot, ...) that
might be symlinks without saying why.


Right, pg_dynshmem is another one that I've seen symlinked. Some things 
are nice to have on fast storage. pg_notify and pg_replslot are similar 
since they get written to a lot in certain configurations.



It worries me that this needs to be explicitly "turned off" after the
initial recovery. Seems like something of a foot gun.

Since we have not offered this functionality before I'm not sure we
should rush to introduce it now. For backup solutions that do their own
syncing, syncfs() should provide excellent performance so long as the
file system is not shared, which is something the user can control (and
is noted in the docs).


Thanks.  I'm leaving the 0002 patch "on ice" until someone can explain
how you're supposed to use it without putting a hole in your foot.


+1


(One silly thing I noticed is that our comments generally think
"filesystem" is one word, but our documentation always has a space;
this patch followed the local convention in both cases!)


Personally I prefer "file system".

Regards,
--
-David
da...@pgmasters.net




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-Mar-19, Robert Haas wrote:
>> Well, I really do hope that some day in the bright future, pglz will
>> no longer be the thing we're shipping as the postgresql.conf default.
>> So we'd just be postponing the noise until then. I think we need a
>> better idea than that.

> Hmm, why?  In that future, we can just change the pg_dump behavior to no
> longer dump the compression clause if it's lz4 or whatever better
> algorithm we choose.  So I think I'm clarifying my proposal to be "dump
> the compression clause if it's different from the compiled-in default"
> rather than "different from the GUC default".

Extrapolating from the way we've dealt with similar issues
in the past, I think the structure of pg_dump's output ought to be:

1. SET default_toast_compression = 'source system's value'
in among the existing passel of SETs at the top.  Doesn't
matter whether or not that is the compiled-in value.

2. No mention of compression in any CREATE TABLE command.

3. For any column having a compression option different from
the default, emit ALTER TABLE SET ... to set that option after
the CREATE TABLE.  (You did implement such a SET, I trust.)

This minimizes the chatter for the normal case where all or most
columns have the same setting, and more importantly it allows the
dump to be read by older PG systems (or non-PG systems, or newer
systems built without --with-lz4) that would fail altogether
if the CREATE TABLE commands contained compression options.
To use the dump that way, you do have to be willing to ignore
errors from the SET and the ALTERs ... but that beats the heck
out of having to manually edit the dump script to get rid of
embedded COMPRESSION clauses.
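
For illustration (with invented table and column names), a dump following
that structure would contain something like:

    SET default_toast_compression = 'pglz';

    CREATE TABLE public.t (
        a text,
        b text
    );

    ALTER TABLE public.t ALTER COLUMN b SET COMPRESSION lz4;

so a server that doesn't know about compression options only reports
errors for the SET and the ALTER, while the CREATE TABLE itself still
works.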

I'm not sure whether we'd still need to mess around beyond
that to make the buildfarm's existing upgrade tests happy.
But we *must* do this much in any case, because as it stands
this patch has totally destroyed some major use-cases for
pg_dump.

There might be scope for a dump option to suppress mention
of compression altogether (comparable to, eg, --no-tablespaces).
But I think that's optional.  In any case, we don't want
to put people in a position where they should have used such
an option and now they have no good way to recover their
dump to the system they want to recover to.

regards, tom lane




Re: shared memory stats: high level design decisions: consistency, dropping

2021-03-19 Thread Andres Freund
Hi,

On 2021-03-20 01:16:31 +0100, Hannu Krosing wrote:
> > But now we could instead schedule stats to be removed at commit
> time. That's not trivial of course, as we'd need to handle cases where
> the commit fails after the commit record, but before processing the
> dropped stats.
> 
> We likely can not remove them at commit time, but only after the
> oldest open snapshot moves parts that commit ?

I don't see why? A dropped table is dropped, and cannot be accessed
anymore. Snapshots don't play a role here - the underlying data is gone
(minus a placeholder file to avoid reusing the oid, until the next
commit).  If you run a vacuum on some unrelated table in the same
database, the stats for a dropped table will already be removed long
before there's no relation that could theoretically open the table.

Note that table level locking would prevent a table from being dropped
if a long-running transaction has already accessed it.


> Would an approach where we keep stats in a structure logically similar
> to MVCC we use for normal tables be completely unfeasible ?

Yes, pretty unfeasible. Stats should work on standbys too...

Regards,

Andres




Re: [HACKERS] Custom compression methods

2021-03-19 Thread David Steele

On 3/19/21 8:00 PM, Andres Freund wrote:

On 2021-03-19 15:44:34 -0400, Robert Haas wrote:

I committed the core patch (0003) with a bit more editing.  Let's see
what the buildfarm thinks.


Congrats Dilip, Robert, All. The slow toast compression has been a
significant issue for a long time.


Yes, congratulations! This is a terrific improvement.

Plus, now that lz4 is part of configure it lowers the bar for other 
features that want to use it. I'm guessing there will be a few.


Thanks!
--
-David
da...@pgmasters.net




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-19 Thread Bruce Momjian
On Sat, Mar 20, 2021 at 01:03:16AM +0100, Hannu Krosing wrote:
> It would be really convenient if user-visible serialisations of the query id
> had something that identifies the computation method.
> 
> maybe prefix 'N' for internal, 'S' for pg_stat_statements etc.
> 
> This would immediately show in logs at what point the id calculator was 
> changed

Yeah, but it's an integer, and I don't think we want to change that.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: [HACKERS] Custom compression methods

2021-03-19 Thread Alvaro Herrera
On 2021-Mar-19, Robert Haas wrote:

> On Fri, Mar 19, 2021 at 6:22 PM Alvaro Herrera  
> wrote:

> > (At least, for binary upgrade surely you must make sure to apply the
> > correct setting regardless of defaults on either system).
> 
> It's not critical from a system integrity point of view; the catalog
> state just dictates what happens to new data.

Oh, okay.

> You could argue that if, in a future release, we change the default to
> lz4, it's good for pg_upgrade to migrate users to a set of column
> definitions that will use that for new data.

Agreed, that seems a worthy goal.

> > Maybe it makes sense to dump the compression clause if it is different
> > from pglz, regardless of the default on the source server.
> 
> Well, I really do hope that some day in the bright future, pglz will
> no longer be the thing we're shipping as the postgresql.conf default.
> So we'd just be postponing the noise until then. I think we need a
> better idea than that.

Hmm, why?  In that future, we can just change the pg_dump behavior to no
longer dump the compression clause if it's lz4 or whatever better
algorithm we choose.  So I think I'm clarifying my proposal to be "dump
the compression clause if it's different from the compiled-in default"
rather than "different from the GUC default".

-- 
Álvaro Herrera   Valdivia, Chile
"Para tener más hay que desear menos"




shared memory stats: high level design decisions: consistency, dropping

2021-03-19 Thread Andres Freund
Hi,

I am working on Kyotaro Horiguchi's shared memory stats patch [1] with
the goal of getting it into a shape that I'd be happy to commit.  That
thread is quite long and most are probably skipping over new messages in
it.

There are two high-level design decisions / questions that I think
warrant a wider audience (I'll keep lower level discussion in the other
thread).

In case it is not obvious, the goal of the shared memory stats patch is
to replace the existing statistics collector, to which new stats are
reported via an UDP socket, and where clients read data from the stats
collector by reading the entire database's stats from disk.

The replacement is to put the statistics into a shared memory
segment. Fixed-size stats (e.g. bgwriter, checkpointer, wal activity,
etc) being stored directly in a struct in memory. Stats for objects
where a variable number exists, e.g. tables, are addressed via a dshash
table that points to the stats that are in turn allocated using dsa.h.


1) What kind of consistency do we want from the pg_stats_* views?

Right now the first access to stats in a transaction will trigger a read
of both the global and per-database stats from disk. If the on-disk
state is too old, we'll ask the stats collector to write out a new file
a couple times.

For the rest of the transaction that in-memory state is used unless
pg_stat_clear_snapshot() is called. Which means that multiple reads from
e.g. pg_stat_user_tables will show the same results as before [2].
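
For example (with an invented table name), inside a transaction block:

    BEGIN;
    SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 't';
    -- repeated reads in the same transaction return the same snapshot
    SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 't';
    SELECT pg_stat_clear_snapshot();
    -- only now is a fresh copy of the stats fetched
    SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 't';
    COMMIT;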

That makes stats accesses quite expensive if there are lots of
objects.

But it also means that separate stats accesses - which happen all the
time - return something repeatable and kind of consistent.

Now, the stats aren't really consistent in the sense of being really
accurate: UDP messages can be lost, some of the stats generated by a TX
might not yet have been received, other transactions might not have sent
theirs yet, etc.


With the shared memory patch the concept of copying all stats for the
current database into local memory at the time of the first stats access
doesn't make sense to me.  Horiguchi-san had actually implemented that,
but I argued that that would be cargo-culting an efficiency hack
required by the old storage model forward.

The cost of doing this is substantial. On master, with a database that
contains 1 million empty tables, any stats access takes ~0.4s and
increases memory usage by 170MB.


1.1)

I hope everybody agrees that stats do not need to be the way they were
at the time of the first stats access in a transaction, even if that
first access was to a different stat object than the currently accessed
stat?


1.2)

Do we expect repeated accesses to the same stat to stay the same through
the transaction?  The easiest way to implement stats accesses is to
simply fetch the stats from shared memory ([3]). That would obviously
result in repeated accesses to the same stat potentially returning
changing results over time.

I think that's perfectly fine, desirable even, for pg_stat_*.


1.3)

What kind of consistency do we expect between columns of views like
pg_stat_all_tables?

Several of the stats views aren't based on SRFs or composite-type
returning functions, but instead fetch each stat separately:

E.g. pg_stat_all_tables:
 SELECT c.oid AS relid,
n.nspname AS schemaname,
c.relname,
pg_stat_get_numscans(c.oid) AS seq_scan,
pg_stat_get_tuples_returned(c.oid) AS seq_tup_read,
sum(pg_stat_get_numscans(i.indexrelid))::bigint AS idx_scan,
sum(pg_stat_get_tuples_fetched(i.indexrelid))::bigint + 
pg_stat_get_tuples_fetched(c.oid) AS idx_tup_fetch,
pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins,
...
pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count
   FROM pg_class c
 LEFT JOIN pg_index i ON c.oid = i.indrelid
...

Which means that if we do not cache stats, additional stats updates
could have been applied between two stats accesses, e.g. the seq_scan
from before some pgstat_report_stat() but the seq_tup_read from after.

If we instead fetch all of a table's stats in one go, we would get
consistency between the columns. But obviously that'd require changing
all the stats views.
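
As a purely hypothetical sketch of that alternative - the function name
and column list below are invented - each table's counters could come
from a single record-returning call instead of one function call per
column:

    SELECT c.oid AS relid, s.seq_scan, s.seq_tup_read, s.n_tup_ins
      FROM pg_class c,
           LATERAL pg_stat_get_table_stats(c.oid)
             AS s(seq_scan bigint, seq_tup_read bigint, n_tup_ins bigint);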

Horiguchi-san, in later iterations of the patch, attempted to address
this issue by adding a one-entry caches below
pgstat_fetch_stat_tabentry(), pgstat_fetch_stat_dbentry() etc, which is
what pg_stat_get_numscans(), pg_stat_get_db_tuples_updated() etc use.


But I think that leads to very confusing results. Access stats for the
same relation multiple times in a row? Do not see updates. Switch
between e.g. a table and its indexes? See updates.


I personally think it's fine to have short-term divergences between the
columns. The stats aren't that accurate anyway, as we don't submit them
all the time.  And that if we want consistency between columns, we
instead should replace the current view definitions with [a set of]
record-returning functions - everything else seems to lead to 

Re: fdatasync performance problem with large number of DB files

2021-03-19 Thread Thomas Munro
Thanks Justin and David.  Replies to two emails inline:

On Sat, Mar 20, 2021 at 12:01 AM Justin Pryzby  wrote:
> On Fri, Mar 19, 2021 at 06:28:46PM +1300, Thomas Munro wrote:
> > +++ b/doc/src/sgml/config.sgml
>
> > +PostgreSQL will recursively open and 
> > fsync
> > +all files in the data directory before crash recovery begins.  This
>
> Maybe it should say "data, tablespace, and WAL directories", or just "critical
> directories".

Fair point.  Here's what I went with:

When set to fsync, which is the default,
PostgreSQL will recursively open and
synchronize all files in the data directory before crash
recovery
begins.  The search for files will follow symbolic links for the WAL
directory and each configured tablespace (but not any other symbolic
links).

> > + {
> > + {"recovery_init_sync_method", PGC_POSTMASTER, 
> > ERROR_HANDLING_OPTIONS,
> > + gettext_noop("Sets the method for synchronizing the 
> > data directory before crash recovery."),
> > + },
>
> "and tablespaces and WAL"

I feel like that's getting too detailed for the GUC description?

On Sat, Mar 20, 2021 at 2:55 AM David Steele  wrote:
> I had a look at the patch and it looks good to me.

Thanks!

> Should we mention in the docs that the contents of non-standard symlinks
> will not be synced, i.e. anything other than tablespaces and pg_wal? Or
> can we point them to some docs saying not to do that (if such exists)?

Good idea.  See above for the adjustment I went with to describe the
traditional behaviour, and then I also made a similar change to the
syncfs part, which, I hope, manages to convey that the new mode
matches the existing policy on symlinks:

On Linux, syncfs may be used instead, to ask the
operating system to synchronize the whole file systems that contain the
data directory, the WAL files and each tablespace (but not any other
file systems that may be reachable through symbolic links).
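
For reference, using it is a one-line, restart-requiring change, e.g.:

    # postgresql.conf (postmaster-level setting; 'fsync' is the default,
    # 'syncfs' is only available on Linux)
    recovery_init_sync_method = syncfs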

I thought about adding some text along the lines that such symlinks
are not expected, but I think you're right that what we really need is
a good place to point to.  I mean, generally you can't mess around
with the files managed by PostgreSQL and expect everything to keep
working correctly, but it wouldn't hurt to make an explicit statement
about symlinks and where they're allowed (or maybe there is one
already and I failed to find it).  There are hints though, like
pg_basebackup's documentation which tells you it won't follow or
preserve them in general, but... hmm, it also contemplates various
special subdirectories (pg_dynshmem, pg_notify, pg_replslot, ...) that
might be symlinks without saying why.

> > Ok, I made the changes you suggested.  Let's see if anyone else would
> > like to vote for or against the concept of the 0002 patch
> > (recovery_init_sync_method=none).
>
> It worries me that this needs to be explicitly "turned off" after the
> initial recovery. Seems like something of a foot gun.
>
> Since we have not offered this functionality before I'm not sure we
> should rush to introduce it now. For backup solutions that do their own
> syncing, syncfs() should provide excellent performance so long as the
> file system is not shared, which is something the user can control (and
> is noted in the docs).

Thanks.  I'm leaving the 0002 patch "on ice" until someone can explain
how you're supposed to use it without putting a hole in your foot.

I pushed the 0001 patch.  Thanks to all who reviewed.  Of course,
further documentation improvement patches are always welcome.

(One silly thing I noticed is that our comments generally think
"filesystem" is one word, but our documentation always has a space;
this patch followed the local convention in both cases!)




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Robert Haas
On Fri, Mar 19, 2021 at 6:22 PM Alvaro Herrera  wrote:
> Do you mean the column storage strategy, attstorage?  I don't think
> that's really related, because the difference there is not a GUC setting
> but a compiled-in default for the type.  In the case of compression, I'm
> not sure it makes sense to do it like that, but I can see the clutter
> argument: if we dump compression for all columns, it's going to be super
> noisy.

I agree.

> (At least, for binary upgrade surely you must make sure to apply the
> correct setting regardless of defaults on either system).

It's not critical from a system integrity point of view; the catalog
state just dictates what happens to new data. You could argue that if,
in a future release, we change the default to lz4, it's good for
pg_upgrade to migrate users to a set of column definitions that will
use that for new data.

> Maybe it makes sense to dump the compression clause if it is different
> from pglz, regardless of the default on the source server.  Then, if the
> target server has chosen lz4 as default, *all* columns are going to end
> up as lz4, and if it hasn't, then only the ones that were lz4 in the
> source server are going to.  That seems reasonable behavior.  Also, if
> some columns are lz4 in source, and target does not have lz4, then
> everything is going to work out to not-lz4 with just a bunch of errors
> in the output.

Well, I really do hope that some day in the bright future, pglz will
no longer be the thing we're shipping as the postgresql.conf default.
So we'd just be postponing the noise until then. I think we need a
better idea than that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Alvaro Herrera
On 2021-Mar-19, Robert Haas wrote:

> Here's one from me that tries to make the handling of the LZ4 stuff
> more like what we already do for zlib, but I'm not sure if it's
> correct, or if it's what everyone wants.

This one seems to behave as expected (Debian 10, with and without
liblz4-dev).

-- 
Álvaro Herrera   Valdivia, Chile
"Just treat us the way you want to be treated + some extra allowance
 for ignorance."(Michael Brusser)




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Justin Pryzby
On Fri, Mar 19, 2021 at 10:19:49PM +0100, Tomas Vondra wrote:
> On 3/19/21 9:40 PM, Andres Freund wrote:
> > On 2021-03-19 17:35:58 -0300, Alvaro Herrera wrote:
> >> I find this behavior confusing; I'd rather have configure error out if
> >> it can't find the package support I requested, than continuing with a
> >> set of configure options different from what I gave.
> > 
> > +1
> 
> Yeah. And why does it even require pkg-config, unlike any other library
> that I'm aware of?

The discussion regarding pkg-config started here.
https://www.postgresql.org/message-id/20210309071655.GL2021%40telsasoft.com

I sent a patch adding it to allow the macOS CI to build --with-lz4.
Since LZ4 was only recently installed on the CI environments, it's possible
that's not the ideal way to do it.

-- 
Justin




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Alvaro Herrera
On 2021-Mar-19, Robert Haas wrote:

> On Fri, Mar 19, 2021 at 10:11 AM Dilip Kumar  wrote:
> > Also added a test case for vacuum full to recompress the data.
> 
> I committed the core patch (0003) with a bit more editing.  Let's see
> what the buildfarm thinks.

I updated the coverage script to use --with-lz4; results are updated.
While eyeballing the results I noticed this bit in
lz4_decompress_datum_slice():

+   /* slice decompression not supported prior to 1.8.3 */
+   if (LZ4_versionNumber() < 10803)
+   return lz4_decompress_datum(value);

which I read as returning the complete decompressed datum if slice
decompression is not supported.  I thought that was a bug, but looking
at the caller I realize that this isn't really a problem, since it's
detoast_attr_slice's responsibility to slice the result further -- no
bug, it's just wasteful.  I suggest to add comments to this effect,
perhaps as the attached (feel free to reword, I think mine is awkward.)

-- 
Álvaro Herrera39°49'30"S 73°17'W
Si no sabes adonde vas, es muy probable que acabes en otra parte.
diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index 2fef40c2e9..6c79bd2a40 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -497,6 +497,9 @@ toast_decompress_datum(struct varlena *attr)
  * Decompress the front of a compressed version of a varlena datum.
  * offset handling happens in detoast_attr_slice.
  * Here we just decompress a slice from the front.
+ *
+ * If slice decompression is not supported, the returned datum is the
+ * full decompressed datum.
  */
 static struct varlena *
 toast_decompress_datum_slice(struct varlena *attr, int32 slicelength)
diff --git a/src/backend/access/common/toast_compression.c b/src/backend/access/common/toast_compression.c
index a6f8b79a9e..080ddcf93c 100644
--- a/src/backend/access/common/toast_compression.c
+++ b/src/backend/access/common/toast_compression.c
@@ -203,6 +203,9 @@ lz4_decompress_datum(const struct varlena *value)
 
 /*
  * Decompress part of a varlena that was compressed using LZ4.
+ *
+ * If slice decompression is not supported, the returned datum is the
+ * full decompressed datum.
  */
 struct varlena *
 lz4_decompress_datum_slice(const struct varlena *value, int32 slicelength)


Re: [HACKERS] Custom compression methods

2021-03-19 Thread Alvaro Herrera
On 2021-Mar-19, Robert Haas wrote:

> > Regarding your point, that does look like clutter. We don't annotate
> > the dump with a storage clause unless it's non-default, so probably we
> > should do the same thing here. I think I gave Dilip bad advice here...
> 
> Here's a patch for that. It's a little strange because you're going to
> skip dumping the toast compression based on the default value on the
> source system, but that might not be the default on the system where
> the dump is being restored, so you could fail to recreate the state
> you had. That is avoidable if you understand how things work, but some
> people might not. I don't have a better idea, though, so let me know
> what you think of this.

Do you mean the column storage strategy, attstorage?  I don't think
that's really related, because the difference there is not a GUC setting
but a compiled-in default for the type.  In the case of compression, I'm
not sure it makes sense to do it like that, but I can see the clutter
argument: if we dump compression for all columns, it's going to be super
noisy.

(At least, for binary upgrade surely you must make sure to apply the
correct setting regardless of defaults on either system).

Maybe it makes sense to dump the compression clause if it is different
from pglz, regardless of the default on the source server.  Then, if the
target server has chosen lz4 as default, *all* columns are going to end
up as lz4, and if it hasn't, then only the ones that were lz4 in the
source server are going to.  That seems reasonable behavior.  Also, if
some columns are lz4 in source, and target does not have lz4, then
everything is going to work out to not-lz4 with just a bunch of errors
in the output.
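
For concreteness, the clauses involved look roughly like this on a server
built --with-lz4 (default_toast_compression being the GUC that supplies the
default when a column has no explicit setting):

    CREATE TABLE t (payload text COMPRESSION lz4);
    ALTER TABLE t ALTER COLUMN payload SET COMPRESSION pglz;
    SET default_toast_compression = 'lz4';

so the question is really whether pg_dump should emit the COMPRESSION clause
for every column, or only when it differs from some baseline.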

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: Bringing some sanity to RestoreGUCState()

2021-03-19 Thread Tom Lane
I wrote:
> The argument for "sending too little" comes from the race condition
> that's described in the function's comments: a variable that has
> source PGC_S_DEFAULT (ie, has never moved off its compiled-in default)
> in the leader could have just been updated in the postmaster, due to
> re-reading postgresql.conf after SIGHUP.  In that case, when the
> postmaster forks the worker it will inherit the new setting from
> postgresql.conf, and will run with that because the leader didn't send
> its value.  So we risk having a situation where parallel workers are
> using a setting that the leader won't adopt until it next goes idle.

After further study I've realized that the above can't happen, because
the existing code is considerably more magical, delicate, and badly
commented than I'd realized.

Basically it divides the GUCs into two categories: those that will
never be shipped based on their context or name (for which we assume
the worker will obtain correct values via other mechanisms), and all
others, which are shipped if they don't have their compiled-in
default values.  On the receiving side, the first loop in
RestoreGUCOptions acts to ensure that all GUCs in the second category
are at their compiled-in defaults, essentially throwing away whatever
the worker might've obtained from pg_db_role_setting or other places.
Then, after receiving and applying the shipped GUCs, we have an exact
match to the leader's state (given the assumption that the compiled-in
values are identical, anyway), without any race conditions.

The magical/fragile part of this is that the same can_skip_guc test
works for both sides of the operation; it's not really obvious that
that must be so.

Forcing all the potentially-shipped GUCs into PGC_S_DEFAULT state has
another critical and undocumented property, which is that it ensures
that set_config_option won't refuse to apply any of the incoming
settings on the basis of their source priority being lower than what
the worker already has.

So we do need RestoreGUCOptions to be doing something equivalent to
InitializeOneGUCOption, although preferably without the leaks.
That doesn't look too awful though, since we should be able to just
Assert that the stack is empty; the only thing that may need to be
freed is the current values of string variables.  I'll see about
fixing that and improving the comments while this is all swapped in.

regards, tom lane




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Robert Haas
On Fri, Mar 19, 2021 at 5:29 PM Andres Freund  wrote:
> On 2021-03-19 22:19:49 +0100, Tomas Vondra wrote:
> > Yeah. And why does it even require pkg-config, unlike any other library
> > that I'm aware of?
>
> IMO it's fine to require pkg-config to simplify the configure
> code. Especially for new optional features. Adding multiple alternative
> ways to discover libraries for something like this makes configure
> slower, without a commensurate benefit.

So, would anyone like to propose a patch to revise the logic in a way
that they like better?

Here's one from me that tries to make the handling of the LZ4 stuff
more like what we already do for zlib, but I'm not sure if it's
correct, or if it's what everyone wants.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com


redo-lz4-configuration.patch
Description: Binary data


Re: [PATCH] Hooks at XactCommand level

2021-03-19 Thread Gilles Darold
Le 12/03/2021 à 06:55, Julien Rouhaud a écrit :
> Hi,
>
> On Tue, Dec 08, 2020 at 11:15:12AM +0100, Gilles Darold wrote:
>> Based on a PoC reported in a previous thread [1] I'd like to propose new
>> hooks around transaction commands. The objective of this patch is to
>> allow PostgreSQL extension to act at start and end (including abort) of
>> a SQL statement in a transaction.
>>
>> The idea for these hooks is born from the no go given to Takayuki
>> Tsunakawa's patch[2] proposing an in core implementation of
>> statement-level rollback transaction and the pg_statement_rollback
>> extension[3] that we have developed at LzLabs. The extension
>> pg_statement_rollback has two limitation, the first one is that the
>> client still have to call the ROLLBACK TO SAVEPOINT when an error is
>> encountered and the second is that it generates a crash when PostgreSQL
>> is compiled with assert that can not be fixed at the extension level.
> This topic came up quite often on the mailing list, the last being from Álvaro
> at [1].  I think there's a general agreement that customers want that feature,
> won't stop asking for it, and many if not all forks ended up implementing it.
>
> I would still prefer if we had a way to support it in vanilla postgres, with
> of course all possible safeguards to avoid an epic fiasco.


I have added Álvaro and Takayuki to the thread; this patch is inspired
by their proposals. I wrote this patch after reading the thread and
concluding that a core implementation doesn't seem to reach consensus
and that this feature could instead be made available to users through
an extension.


> I personally think that Álvaro's previous approach, giving the ability to
> specify the rollback behavior in the TransactionStmt grammar, would be enough
> (I mean without the GUC part) to cover realistic and sensible usecases, which
> is where the client fully decides whether there's a statement level rollback
> or not.  One could probably build a custom module on top of that to introduce
> some kind of GUC to change the behavior more globally if it wants to take that
> risk.


Yes, probably. With this patch I just want to propose an external
implementation of the feature. The extension implementation "just"
requires these three hooks to provide the same feature as if it were
implemented in vanilla postgres. The feature can simply be enabled or
disabled by a custom user-defined variable before a transaction is
started, or globally for all transactions.
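
As an illustration of the intended user-visible behaviour (the GUC name
below is made up for this sketch; the real one is whatever the extension
defines):

    SET my_stmt_rollback.enabled = on;   -- hypothetical setting name

    BEGIN;
    INSERT INTO t VALUES (1);
    SELECT 1/0;                  -- fails, but only this statement is rolled back
    INSERT INTO t VALUES (2);    -- the transaction is still usable
    COMMIT;                      -- t ends up containing 1 and 2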


> If such an approach is still not wanted for core inclusion, then I'm in favor
> of adding those hooks.  There's already a published extension that tries to
> implement that (for complete fairness I'm one of the people to blame), but as
> Gilles previously mentioned this is very hackish and the currently available
> hooks makes it very hard if not impossible to have a perfect implementation.
> It's clear that people will never stop to try doing it, so at least let's make
> it possible using a custom module.
>
> It's also probably worthwhile to mention that the custom extension 
> implementing
> server side statement level rollback wasn't implemented because it wasn't
> doable in the client side, but because the client side implementation was
> causing a really big overhead due to the need of sending the extra commands,
> and putting it on the server side lead to really significant performance
> improvement.

Right, the closest existing extension to this feature is the one we
developed at LzLabs [2], but it still requires a rollback to savepoint
on the client side in case of error. The extension [3] using these hooks
doesn't have this limitation; everything is handled server side.


>> Although that I have not though about other uses for these hooks, they
>> will allow a full server side statement-level rollback feature like in
>> commercial DBMSs like DB2 and Oracle. This feature is very often
>> requested by users that want to migrate to PostgreSQL.
> I also thought about it, and I don't really see other possible usage for those
> hooks.


Yes, I don't have much imagination either for other possible uses of these
hooks, but I hope this feature can justify them in itself. I just thought
that if we exposed the query_string at the command_start hook we could
allow its modification by external modules, but this is surely the worst
idea I can produce.


>> There is no additional syntax or GUC, the patch just adds three new hooks:
>>
>>
>> * start_xact_command_hook called at end of the start_xact_command()
>> function.
>> * finish_xact_command called in finish_xact_command() just before
>> CommitTransactionCommand().
>> * abort_current_transaction_hook called after an error is encountered at
>> end of AbortCurrentTransaction().
>>
>> These hooks allow an external plugins to execute code related to the SQL
>> statements executed in a transaction.
> The only comment I have for those hooks is for the
> abort_current_transaction_hook.  AbortCurrentTransaction() can be called
> 

Re: [HACKERS] Custom compression methods

2021-03-19 Thread Robert Haas
On Fri, Mar 19, 2021 at 4:38 PM Robert Haas  wrote:
> Yes, and prion's got this concerning diff:
>
>   Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description
> ---------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
> - f1     | integer |           |          |         | plain   |             |              |
> + f1     | integer |           |          |         | plain   | pglz        |              |
>
> Since the column is not a varlena, it shouldn't have a compression
> method configured, yet on that machine it does, possibly because that
> machine uses -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE.

I could reproduce the problem with those flags. I pushed a fix.

> Regarding your point, that does look like clutter. We don't annotate
> the dump with a storage clause unless it's non-default, so probably we
> should do the same thing here. I think I gave Dilip bad advice here...

Here's a patch for that. It's a little strange because you're going to
skip dumping the toast compression based on the default value on the
source system, but that might not be the default on the system where
the dump is being restored, so you could fail to recreate the state
you had. That is avoidable if you understand how things work, but some
people might not. I don't have a better idea, though, so let me know
what you think of this.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


non-default-toast-compression-only-v1.patch
Description: Binary data


Re: [HACKERS] Custom compression methods

2021-03-19 Thread Andres Freund
On 2021-03-19 22:19:49 +0100, Tomas Vondra wrote:
> Yeah. And why does it even require pkg-config, unlike any other library
> that I'm aware of?

IMO it's fine to require pkg-config to simplify the configure
code. Especially for new optional features. Adding multiple alternative
ways to discover libraries for something like this makes configure
slower, without a commensurate benefit.




Re: shared-memory based stats collector

2021-03-19 Thread Andres Freund
Hi,

On 2021-03-10 20:26:56 -0800, Andres Freund wrote:
> > +   shhashent = dshash_find_extended(pgStatSharedHash, ,
> > +
> > create, nowait, create, );
> > +   if (shhashent)
> > +   {
> > +   if (create && !shfound)
> > +   {
> > +   /* Create new stats entry. */
> > +   dsa_pointer chunk = dsa_allocate0(area,
> > +   
> >   pgstat_sharedentsize[type]);
> > +
> > +   shheader = dsa_get_address(area, chunk);
> > +   LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
> > +   pg_atomic_init_u32(&shheader->refcount, 0);
> > +
> > +   /* Link the new entry from the hash entry. */
> > +   shhashent->body = chunk;
> > +   }
> > +   else
> > +   shheader = dsa_get_address(area, shhashent->body);
> > +
> > +   /*
> > +* We expose this shared entry now.  You might think that the 
> > entry
> > +* can be removed by a concurrent backend, but since we are 
> > creating
> > +* an stats entry, the object actually exists and used in the 
> > upper
> > +* layer. Such an object cannot be dropped until the first 
> > vacuum
> > +* after the current transaction ends.
> > +*/
> > +   dshash_release_lock(pgStatSharedHash, shhashent);
>
> I don't think you can safely release the lock before you incremented the
> refcount?  What if, once the lock is released, somebody looks up that
> entry, increments the refcount, and decrements it again? It'll see a
> refcount of 0 at the end and decide to free the memory. Then the code
> below will access already freed / reused memory, no?

Yep, it's not even particularly hard to hit:

S0: CREATE TABLE a_table();
S0: INSERT INTO a_table();
S0: disconnect
S1: set a breakpoint to just after the dshash_release_lock(), with an
 if objid == a_table_oid
S1: SELECT pg_stat_get_live_tuples('a_table'::regclass);
  (will break at above breakpoint, without having incremented the
  refcount yet)
S2: DROP TABLE a_table;
S2: VACUUM pg_class;

At that point S2's call to pgstat_vacuum_stat() will find the shared
stats entry for a_table, delete the entry from the shared hash table,
see that the stats data has a zero refcount, and free it. Once S1 wakes
up it'll use already freed (and potentially since reused) memory.
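
A minimal sketch of the reordering that closes the window (variable names as
in the quoted patch; the exact fix is of course up to the patch author) is to
pin the entry while the hash lock is still held:

    /* take a reference before anyone else can see / drop the entry */
    pg_atomic_fetch_add_u32(&shheader->refcount, 1);
    dshash_release_lock(pgStatSharedHash, shhashent);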

Greetings,

Andres Freund




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Tomas Vondra
On 3/19/21 9:40 PM, Andres Freund wrote:
> On 2021-03-19 17:35:58 -0300, Alvaro Herrera wrote:
>> I find this behavior confusing; I'd rather have configure error out if
>> it can't find the package support I requested, than continuing with a
>> set of configure options different from what I gave.
> 
> +1
> 

Yeah. And why does it even require pkg-config, unlike any other library
that I'm aware of?

checking for liblz4... no
configure: error: in `/home/ubuntu/postgres':
configure: error: The pkg-config script could not be found or is too
old.  Make sure it
is in your PATH or set the PKG_CONFIG environment variable to the full
path to pkg-config.

Alternatively, you may set the environment variables LZ4_CFLAGS
and LZ4_LIBS to avoid the need to call pkg-config.
See the pkg-config man page for more details.

To get pkg-config, see .
See `config.log' for more details

I see xml2 also mentions pkg-config in configure (next to XML2_CFLAGS),
but works fine without it.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Speeding up GIST index creation for tsvectors

2021-03-19 Thread John Naylor
On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar 
wrote:
>
> On Thu, 11 Mar 2021 at 04:25, John Naylor 
wrote:
> > Okay, so it's hard-coded in various functions in contrib modules. In
that
> > case, let's just keep the existing coding for those. In fact, the
comments
> > that got removed by your patch specifically say: /* Using the popcount
> > functions here isn't likely to win */ ...which your testing confirmed.
The
> > new function should be used only for Gist and possibly Intarray, whose
> > default is 124. That means we can't get rid of hemdistsign(), but that's
> > fine.
>
> The comment is there for all types. Since I get the performance better
> on all the types, I have kept the pg_xorcount() call for all these
> contrib modules.  I understand that since for some types default
> siglen is small, we won't get benefit. But I think we should call
> pg_xorcount() for the benefit of non-default siglen case.

The 1-7% degradation measured was from an earlier version, when
pg_xorcount_long had a lot of finicky branching and computation. Is it
still true in v3? We should answer that first. I'm interested in what
happens if you now use pg_xorcount_long in the call sites, at least in the
worst case 7% test.

> Have replaced hemdistsign() by pg_xorcount() everywhere; but with
> that, the call looks a bit clumsy because of having to type-cast the
> parameters to const unsigned char *.  I noticed that the cast to
> "unsigned char *" is required only when we use the value in the
> pg_number_of_ones array. Elsewhere we can safely use 'a' and 'b' as
> "char *". So I changed the pg_xorcount() parameters from unsigned char
> * to char *.

That matches the style of that file, so +1.
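
For readers following along, the basic shape of such a helper is just an XOR
plus a per-byte popcount; a simplified sketch (not the patch's actual
implementation, which works on larger chunks at a time) would be:

    #include "port/pg_bitutils.h"

    /* count the bits that differ between two equal-length signatures */
    static inline uint64
    xorcount_bytewise(const char *a, const char *b, size_t bytes)
    {
        uint64      count = 0;

        while (bytes-- > 0)
            count += pg_number_of_ones[(unsigned char) (*a++ ^ *b++)];
        return count;
    }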

> > I think the way to go is a simplified version of your 0001 (not 0002),
with
> > only a single function, for gist and intarray only, and a style that
better
> > matches the surrounding code. If you look at my xor functions in the
attached
> > text file, you'll get an idea of what it should look like. Note that it
got
> > the above performance without ever trying to massage the pointer
alignment.
> > I'm a bit uncomfortable with the fact that we can't rely on alignment,
but
> > maybe there's a simple fix somewhere in the gist code.
>
> In the attached 0001-v3 patch, I have updated the code to match it
> with surrounding code; specifically the usage of *a++ rather than
> a[i].
>
> Regarding the alignment changes... I have removed the code that
> handled the leading identically unaligned bytes, for lack of evidence
> that percentage of such cases is significant. Like I noted earlier,
> for the tsearch data I used, identically unaligned cases were only 6%.
> If I find scenarios where these cases can be significant after all and
> if we cannot do anything in the gist index code, then we might have to
> bring back the unaligned byte handling. I didn't get a chance to dig
> deeper into the gist index implementation to see why they are not
> always 8-byte aligned.

I find it stranger that something equivalent to char* is not randomly
misaligned, but rather only seems to land on 4-byte boundaries.

[thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.

FWIW, I anticipate some push back from the community because of the fact
that the optimization relies on statistical phenomena.

> I have kept the 0002 patch as-is. Due to significant *additional*
> speedup, over and above the 0001 improvement, I think the code
> re-arrangement done is worth it for non-x86 platforms.

For the amount of code uglification involved, we should be getting full asm
popcount support on Arm, not an attempt to kluge around the current
implementation. I'd be happy to review such an effort for PG15, by the way.

Readability counts, and as it stands I don't feel comfortable asking a
committer to read 0002.
--
John Naylor
EDB: http://www.enterprisedb.com


Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-03-19 Thread James Coleman
On Fri, Sep 11, 2020 at 5:11 PM James Coleman  wrote:
>
> On Tue, Sep 8, 2020 at 4:37 PM Heikki Linnakangas  wrote:
> >
> > On 08/09/2020 22:25, James Coleman wrote:
> > > On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas  
> > > wrote:
> > >>
> > >> You could also apply the optimization for non-Const expressions, as long
> > >> as they're stable (i.e. !contain_volatile_functions(expr)).
> > >
> > > Is that true? Don't we also have to worry about whether or not the
> > > value is stable (i.e., know when a param has changed)? There have been
> > > discussions about being able to cache stable subexpressions, and my
> > > understanding was that we'd need to have that infrastructure (along
> > > with the necessarily invalidation when the param changes) to be able
> > > to use this for non-Const expressions.
> >
> > Yeah, you're right, you'd have to also check for PARAM_EXEC Params. And
> > Vars. I think the conditions are the same as those checked in
> > match_clause_to_partition_key() in partprune.c (it's a long function,
> > search for "if (!IsA(rightop, Const))"). Not sure it's worth the
> > trouble, then. But it would be nice to handle queries like "WHERE column
> > = ANY ($1)"
>
> If I'm understanding properly you're imagining something in the form of:
>
> with x as (select '{1,2,3,4,5,6,7,8,9,10}'::int[])
> select * from t where i = any ((select * from x)::int[]);
>
> I've been playing around with this and currently have these checks:
>
> contain_var_clause((Node *) arrayarg)
> contain_volatile_functions((Node *) arrayarg)
> contain_exec_param((Node *) arrayarg, NIL)
>
> (the last one I had to modify the function to handle empty lists)
>
> If any are true, then have to disable the optimization. But for
> queries in the form above the test contain_exec_param((Node *)
> arrayarg, NIL) evaluates to true, even though we know from looking at
> the query that the array subexpression is stable for the length of the
> query.
>
> Am I misunderstanding what you're going for? Or is there a way to
> confirm the expr, although an exec param, won't change?
>
> Another interesting thing that this would imply is that we'd either have to:
>
> 1. Remove the array length check altogether,
> 2. Always use the hash when have a non-Const, but when a Const only if
> the array length check passes, or
> 3. Make this new expr op more fully featured by teaching it how to use
> either a straight loop through a deconstructed array or use the hash.
>
> That last option feeds into further discussion in the below:
>
> > >> Deconstructing the array Datum into a simple C array on first call would
> > >> be a win even for very small arrays and for AND semantics, even if you
> > >> don't use a hash table.
> > >
> > > Because you wouldn't have to repeatedly detoast it? Or some other
> > > reason I'm not thinking of? My intuition would have been that (aside
> > > from detoasting if necessary) there wouldn't be much real overhead in,
> > > for example, an array storing integers.
> >
> > Dealing with NULLs and different element sizes in the array is pretty
> > complicated. Looping through the array currently looks like this:
> >
> > /* Loop over the array elements */
> > s = (char *) ARR_DATA_PTR(arr);
> > bitmap = ARR_NULLBITMAP(arr);
> > bitmask = 1;
> >
> > for (int i = 0; i < nitems; i++)
> > {
> > Datum   elt;
> > Datum   thisresult;
> >
> > /* Get array element, checking for NULL */
> > if (bitmap && (*bitmap & bitmask) == 0)
> > {
> > fcinfo->args[1].value = (Datum) 0;
> > fcinfo->args[1].isnull = true;
> > }
> > else
> > {
> > elt = fetch_att(s, typbyval, typlen);
> > s = att_addlength_pointer(s, typlen, s);
> > s = (char *) att_align_nominal(s, typalign);
> > fcinfo->args[1].value = elt;
> > fcinfo->args[1].isnull = false;
> > }
> >
> > [do stuff with Datum/isnull]
> >
> > /* advance bitmap pointer if any */
> > if (bitmap)
> > {
> > bitmask <<= 1;
> > if (bitmask == 0x100)
> > {
> > bitmap++;
> > bitmask = 1;
> > }
> > }
> > }
> >
> > Compared with just:
> >
> > for (int i = 0; i < nitems; i++)
> > {
> > Datum   elt = datums[i];
> >
> > [do stuff with the Datum]
> > }
> >
> > I'm not sure how much difference that makes, but I presume it's not
> > zero, and it seems like an easy win when you have the code to deal with
> > the Datum array representation 

Re: [HACKERS] Custom compression methods

2021-03-19 Thread Andres Freund
On 2021-03-19 17:35:58 -0300, Alvaro Herrera wrote:
> I find this behavior confusing; I'd rather have configure error out if
> it can't find the package support I requested, than continuing with a
> set of configure options different from what I gave.

+1




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Robert Haas
On Fri, Mar 19, 2021 at 4:20 PM Tom Lane  wrote:
> Nope ... crake's displeased with your assumption that it's OK to
> clutter dumps with COMPRESSION clauses.  As am I: that is going to
> be utterly fatal for cross-version transportation of dumps.

Yes, and prion's got this concerning diff:

  Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description
 --------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
- f1     | integer |           |          |         | plain   |             |              |
+ f1     | integer |           |          |         | plain   | pglz        |              |

Since the column is not a varlena, it shouldn't have a compression
method configured, yet on that machine it does, possibly because that
machine uses -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE.

Regarding your point, that does look like clutter. We don't annotate
the dump with a storage clause unless it's non-default, so probably we
should do the same thing here. I think I gave Dilip bad advice here...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Alvaro Herrera
Hmm, if I use configure --with-lz4, I get this:

checking whether to build with LZ4 support... yes
checking for liblz4... no
configure: error: Package requirements (liblz4) were not met:

No package 'liblz4' found

Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.

Alternatively, you may set the environment variables LZ4_CFLAGS
and LZ4_LIBS to avoid the need to call pkg-config.
See the pkg-config man page for more details.
running CONFIG_SHELL=/bin/bash /bin/bash /pgsql/source/master/configure 
--enable-debug --enable-depend --enable-cassert --enable-nls 
--cache-file=/home/alvherre/run/pgconfig.master.cache --enable-thread-safety 
--with-python --with-perl --with-tcl --with-openssl --with-libxml 
--enable-tap-tests --with-tclconfig=/usr/lib/tcl8.6 PYTHON=/usr/bin/python3 
--with-llvm --prefix=/pgsql/install/master --with-pgport=55432 --no-create 
--no-recursion
...

I find this behavior confusing; I'd rather have configure error out if
it can't find the package support I requested, than continuing with a
set of configure options different from what I gave.
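
For anyone else hitting this, the two ways out are roughly (package names
below are the Debian ones; adjust for your platform):

    # either install pkg-config plus the lz4 development package ...
    apt-get install pkg-config liblz4-dev
    ./configure --with-lz4 ...

    # ... or bypass pkg-config, as the error message suggests
    ./configure --with-lz4 LZ4_CFLAGS="" LZ4_LIBS="-llz4" ...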

-- 
Álvaro Herrera39°49'30"S 73°17'W
"Postgres is bloatware by design: it was built to house
 PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)




Re: pglz compression performance, take two

2021-03-19 Thread Mark Dilger



> On Jan 28, 2021, at 2:56 AM, Andrey Borodin  wrote:
> 
> 
> 
>> 22 янв. 2021 г., в 07:48, Justin Pryzby  написал(а):
>> 
>> @cfbot: rebased
>> <0001-Reorganize-pglz-compression-code.patch>
> 
> Thanks!
> 
> I'm experimenting with TPC-C over PostgreSQL 13 on production-like cluster in 
> the cloud. Overall performance is IO-bound, but compression is burning a lot 
> energy too (according to perf top). Cluster consists of 3 nodes(only HA, no 
> standby queries) with 32 vCPU each, 128GB RAM, sync replication, 2000 
> warehouses, 240GB PGDATA.
> 
> Samples: 1M of event 'cpu-clock', 4000 Hz, Event count (approx.): 177958545079
> Overhead  Shared Object Symbol
>  18.36%  postgres  [.] pglz_compress
>   3.88%  [kernel]  [k] _raw_spin_unlock_irqrestore
>   3.39%  postgres  [.] hash_search_with_hash_value
>   3.00%  [kernel]  [k] finish_task_switch
>   2.03%  [kernel]  [k] copy_user_enhanced_fast_string
>   1.14%  [kernel]  [k] filemap_map_pages
>   1.02%  postgres  [.] AllocSetAlloc
>   0.93%  postgres  [.] _bt_compare
>   0.89%  postgres  [.] PinBuffer
>   0.82%  postgres  [.] SearchCatCache1
>   0.79%  postgres  [.] LWLockAttemptLock
>   0.78%  postgres  [.] GetSnapshotData
> 
> Overall cluster runs 862tps (52KtpmC, though only 26KtmpC is qualified on 2K 
> warehouses).
> 
> Thanks!

Robert Haas just committed Dilip Kumar's LZ4 compression, 
bbe0a81db69bd10bd166907c3701492a29aca294.

Is this pglz compression patch still relevant?  How does the LZ4 compression 
compare on your hardware?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: [PATCH] Identify LWLocks in tracepoints

2021-03-19 Thread Peter Eisentraut



On 18.03.21 07:34, Craig Ringer wrote:
The main reason I didn't want to add more tracepoints than were strictly 
necessary is that Pg doesn't enable the systemtap semaphores feature, so 
right now we do a T_NAME(lock) evaluation each time we pass a tracepoint 
if --enable-dtrace is compiled in, whether or not anything is tracing. 
This was fine on pg11 where it was just:


#define T_NAME(lock) \
         (LWLockTrancheArray[(lock)->tranche])

but since pg13 it instead expands to

         GetLWTrancheName((lock)->tranche)

where GetLWTrancheName isn't especially trivial. We'll run that function 
every single time we pass any of these tracepoints and then discard the 
result, which is ... not ideal. That applies so long as Pg is compiled 
with --enable-dtrace. I've been meaning to look at enabling the 
systemtap semaphores feature in our build so these can be wrapped in 
unlikely(TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED()) guards, but I wanted 
to wrap this patch set up first as there are some complexities around 
enabling the semaphores feature.


There is already support for that.  See the documentation at the end of 
this page: 
https://www.postgresql.org/docs/devel/dynamic-trace.html#DEFINING-TRACE-POINTS
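
That is, once the semaphore support is built in, the generated _ENABLED()
macros let a call site skip the argument evaluation entirely when nothing is
attached; roughly:

    /* only evaluate T_NAME(lock) when a tracer is actually attached */
    if (TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));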





Re: [HACKERS] Custom compression methods

2021-03-19 Thread Tom Lane
I wrote:
> Since no animals will be using --with-lz4, I'd expect vast silence.

Nope ... crake's displeased with your assumption that it's OK to
clutter dumps with COMPRESSION clauses.  As am I: that is going to
be utterly fatal for cross-version transportation of dumps.

regards, tom lane




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Tom Lane
Robert Haas  writes:
> I committed the core patch (0003) with a bit more editing.  Let's see
> what the buildfarm thinks.

Since no animals will be using --with-lz4, I'd expect vast silence.

regards, tom lane




Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2021-03-19 Thread Tom Lane
Peter Eisentraut  writes:
> Well, I had an idea that I put to work.  In most of these cases where we 
> need division, we divide an integer by a power of 10.  That can be done 
> with numeric very quickly by just shifting the weight and scale around. 
> So I wrote a function that does that specifically (look for 
> int64_div_fast_to_numeric()).  With that, the slow cases I mentioned now 
> have the same performance as the other cases that didn't have any 
> numeric division.  You just get the overhead for constructing and 
> passing around a numeric instead of a double, which can't be avoided.

Yeah, I was wondering if we could do something like that, but I hadn't
got as far as figuring a way to deal with divisors not a multiple of
NBASE.

Looking at the proposed code, I wonder if it wouldn't be better to
define the function as taking the base-10-log of the divisor, so that
you'd have the number of digits to shift (and the dscale) immediately
instead of needing repeated integer divisions to get that.  Also, the
risk of intermediate overflow here seems annoying:

+   if (unlikely(pg_mul_s64_overflow(val1, NBASE/x, )))
+   elog(ERROR, "overflow");

Maybe that's unreachable for the ranges of inputs the current patch could
create, but it seems like it makes the function distinctly less
general-purpose than one would think from its comment.  Maybe, if that
overflows, we could handle the failure by making that adjustment after
we've converted to numeric?

> So here is an intermediate patch that does this.  I haven't gotten rid 
> of all numeric_div_opt_error() calls yet, but if this seems acceptable, 
> I can work on the remaining ones.

I guess the immediate question is how much of a performance gap there
is now between the float and numeric implementations.

regards, tom lane




Re: [PATCH] Identify LWLocks in tracepoints

2021-03-19 Thread Peter Eisentraut

On 18.03.21 07:34, Craig Ringer wrote:

In patch 0001, why was the TRACE_POSTGRESQL_LWLOCK_RELEASE() call
moved?
   Is there some correctness issue?  If so, we should explain that (at
least in the commit message, or as a separate patch).


If you want I can split it out, or drop that change. I thought it was 
sufficiently inconsequential, but you're right to check.


The current tracepoint TRACE_POSTGRESQL_LWLOCK_RELEASE really means 
"releaseD". It's appropriate to emit this as soon as the lock could be 
acquired by anything else. By deferring it until we'd processed the 
waitlist and woken other backends the window during which the lock was 
reported as "held" was longer than it truly was, and it was easy to see 
one backend acquire the lock while another still appeared to hold it.


From the archeology department: The TRACE_POSTGRESQL_LWLOCK_RELEASE 
probe was in the right place until PG 9.4, but was then moved by 
ab5194e6f617a9a9e7aadb3dd1cee948a42d0755, which was a major rewrite, so 
it seems the move might have been accidental.  The documentation 
specifically states that the probe is triggered before waiters are woken 
up, which it specifically does not do at the moment.  So this looks like 
a straight bug fix to me.





Re: [PATCH] ProcessInterrupts_hook

2021-03-19 Thread Tom Lane
Robert Haas  writes:
> On Fri, Mar 19, 2021 at 3:25 PM Tom Lane  wrote:
>> I'm not very comfortable about the idea of having the postmaster set
>> child processes' latches ... that doesn't sound terribly safe from the
>> standpoint of not allowing the postmaster to mess with shared memory
>> state that could cause it to block or crash.  If we already do that
>> elsewhere, then OK, but I don't think we do.

> It should be unnecessary anyway. We changed it a while back to make
> any SIGUSR1 set the latch 

Hmm, so the postmaster could send SIGUSR1 without setting any particular
pmsignal reason?  Yeah, I suppose that could work.  Or we could recast
this as being a new pmsignal reason.

regards, tom lane




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Robert Haas
On Fri, Mar 19, 2021 at 10:11 AM Dilip Kumar  wrote:
> Also added a test case for vacuum full to recompress the data.

I committed the core patch (0003) with a bit more editing.  Let's see
what the buildfarm thinks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] ProcessInterrupts_hook

2021-03-19 Thread Robert Haas
On Fri, Mar 19, 2021 at 3:25 PM Tom Lane  wrote:
> David Steele  writes:
> > On 1/19/21 1:42 AM, Craig Ringer wrote:
> >> My suggestion, which I'm happy to post in patch form if you think it's
> >> reasonable 
>
> > Tom, Robert, and thoughts on the proposals in [1]?
> > https://www.postgresql.org/message-id/CAGRY4nyNfscmQiZBCNT7cBYnQxJLAAVCGz%2BGZAQDAco1Fbb01w%40mail.gmail.com
>
> No objection to generalizing the state passed through pmsignal.c.
>
> I'm not very comfortable about the idea of having the postmaster set
> child processes' latches ... that doesn't sound terribly safe from the
> standpoint of not allowing the postmaster to mess with shared memory
> state that could cause it to block or crash.  If we already do that
> elsewhere, then OK, but I don't think we do.

It should be unnecessary anyway. We changed it a while back to make
any SIGUSR1 set the latch 

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2021-03-19 Thread Peter Eisentraut

On 18.03.21 09:28, Peter Eisentraut wrote:
Which leads me to:  After retesting this now, with a new machine, the 
performance of the numeric implementation is brutal compared to the 
float implementation, for cases where we need numeric division, which is 
milliseconds, seconds, and epoch.  In the first two cases, I imagine we 
could rewrite this a bit to avoid a lot of the numeric work, but for the 
epoch case (which is what started this thread), there isn't enough space 
in int64 to make this work.  Perhaps int128 could be pressed into 
service, optionally.  I think it would also help if we cracked open the 
numeric APIs a bit to avoid all the repeated unpacking and packing for 
each step.


So I think we need to do a bit more thinking and work here, meaning it 
will have to be postponed.


Well, I had an idea that I put to work.  In most of these cases where we 
need division, we divide an integer by a power of 10.  That can be done 
with numeric very quickly by just shifting the weight and scale around. 
So I wrote a function that does that specifically (look for 
int64_div_fast_to_numeric()).  With that, the slow cases I mentioned now 
have the same performance as the other cases that didn't have any 
numeric division.  You just get the overhead for constructing and 
passing around a numeric instead of a double, which can't be avoided.


So here is an intermediate patch that does this.  I haven't gotten rid 
of all numeric_div_opt_error() calls yet, but if this seems acceptable, 
I can work on the remaining ones.


From 801460c46ddf7273d2c49a9af3460f2f5614599d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 19 Mar 2021 20:27:15 +0100
Subject: [PATCH v5] Change return type of EXTRACT to numeric

The previous implementation of EXTRACT mapped internally to
date_part(), which returned type double precision (since it was
implemented long before the numeric type existed).  This can lead to
imprecise output in some cases, so returning numeric would be
preferable.  Changing the return type of an existing function is a
bit risky, so instead we do the following:  We implement a new set of
functions, which are now called "extract", in parallel to the existing
date_part functions.  They work the same way internally but use
numeric instead of float8.  The EXTRACT construct is now mapped by the
parser to these new extract functions.  That way, dumps of views
etc. from old versions (which would use date_part) continue to work
unchanged, but new uses will map to the new extract functions.

Additionally, the reverse compilation of EXTRACT now reproduces the
original syntax, using the new mechanism introduced in
40c24bfef92530bd846e111c1742c2a54441c62c.

The following minor changes of behavior result from the new
implementation:

- The column name from an isolated EXTRACT call is now "extract"
  instead of "date_part".

- Extract from date now rejects inappropriate field names such as
  HOUR.  It was previously mapped internally to extract from
  timestamp, so it would silently accept everything appropriate for
  timestamp.

- Return values when extracting fields with possibly fractional
  values, such as second and epoch, now have the full scale that the
  value has internally (so, for example, '1.00' instead of just
  '1').

Discussion: 
https://www.postgresql.org/message-id/flat/42b73d2d-da12-ba9f-570a-420e0cce1...@phystech.edu
---
 doc/src/sgml/func.sgml  |  10 +-
 src/backend/parser/gram.y   |   2 +-
 src/backend/utils/adt/date.c| 370 +++
 src/backend/utils/adt/numeric.c |  56 ++
 src/backend/utils/adt/ruleutils.c   |  21 +
 src/backend/utils/adt/timestamp.c   | 670 +++-
 src/include/catalog/pg_proc.dat |  18 +
 src/include/utils/numeric.h |   1 +
 src/test/regress/expected/create_view.out   |   2 +-
 src/test/regress/expected/date.out  | 482 +++---
 src/test/regress/expected/interval.out  |  72 +--
 src/test/regress/expected/psql_crosstab.out |  12 +-
 src/test/regress/expected/time.out  |  24 +-
 src/test/regress/expected/timetz.out|  46 +-
 src/test/regress/sql/date.sql   |   6 +-
 src/test/regress/sql/timetz.sql |   4 +-
 16 files changed, 1431 insertions(+), 365 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 9492a3c6b9..0383cf3db9 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -8776,7 +8776,7 @@ Date/Time Functions
   extract
  
  extract ( field 
from timestamp )
- double precision
+ numeric
 
 
  Get timestamp subfield; see 
@@ -8790,7 +8790,7 @@ Date/Time Functions

 
  extract ( field 
from interval )
- double precision
+ numeric
 
 
  Get interval subfield; see 
@@ -9305,7 +9305,7 @@ EXTRACT, 
date_part
 well.)  

Re: pglz compression performance, take two

2021-03-19 Thread Mark Dilger



> On Jan 21, 2021, at 6:48 PM, Justin Pryzby  wrote:
> 
> @cfbot: rebased
> <0001-Reorganize-pglz-compression-code.patch>

Review comments.

First, I installed a build from master without this patch, created a test 
installation with lots of compressed text and array columns, upgraded the 
binaries to a build with this patch included, and tried to find problems with 
the data left over from the pre-patch binaries.  Everything checks out.  This 
is on little-endian mac osx intel core i9, not on any ARM platform that you are 
targeting with portions of the patch.

+/**
+ *  CPU Feature Detection*
+ **/
+/* PGLZ_FORCE_MEMORY_ACCESS
+ * By default, access to unaligned memory is controlled by `memcpy()`, which 
is safe and portable.
+ * Unfortunately, on some target/compiler combinations, the generated assembly 
is sub-optimal.
+ * The below switch allow to select different access method for improved 
performance.
+ * Method 0 (default) : use `memcpy()`. Safe and portable.
+ * Method 1 : direct access. This method is portable but violate C standard.
+ * It can generate buggy code on targets which assembly generation 
depends on alignment.
+ * But in some circumstances, it's the only known way to get the most 
performance (ie GCC + ARMv6)
+ * See 
https://fastcompression.blogspot.fr/2015/08/accessing-unaligned-memory.html for 
details.
+ * Prefer these methods in priority order (0 > 1)
+ */

The link to blogspot.fr has a lot more information than your summary in the 
code comments.  It might be hard to understand this comment if the blogspot 
article were ever to disappear.  Perhaps you could include a bit more of the 
relevant details?

+#ifndef PGLZ_FORCE_MEMORY_ACCESS   /* can be defined externally */
+#if defined(__GNUC__) && \
+  ( defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || 
defined(__ARM_ARCH_6K__) \
+  || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) || 
defined(__ARM_ARCH_6T2__) )
+#define PGLZ_FORCE_MEMORY_ACCESS 1
+#endif
+#endif

I can understand wanting to set this on gcc + ARMv6, but doesn't this belong in 
a configure script rather than directly in the compression code?

The blogspot article indicates that the author lied about alignment to the 
compiler when using gcc on ARMv6, thereby generating a fast load instruction 
which happens to work on ARMv6.  You appear to be using that same approach.  
Your #if defined(__GNUC__), seems to assume that all future versions of gcc 
will generate the instructions that you want, and not start generating some 
other set of instructions.  Wouldn't you at least need a configure test to 
verify that the version of gcc being used generates the desired assembly?  Even 
then, you'd be banking on gcc doing the same thing for the test code and for 
the pglz code, which I guess might not be true.  Have you considered using 
inline assembly instead?



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: [PATCH] ProcessInterrupts_hook

2021-03-19 Thread Tom Lane
David Steele  writes:
> On 1/19/21 1:42 AM, Craig Ringer wrote:
>> My suggestion, which I'm happy to post in patch form if you think it's 
>> reasonable 

> Tom, Robert, and thoughts on the proposals in [1]?
> https://www.postgresql.org/message-id/CAGRY4nyNfscmQiZBCNT7cBYnQxJLAAVCGz%2BGZAQDAco1Fbb01w%40mail.gmail.com

No objection to generalizing the state passed through pmsignal.c.

I'm not very comfortable about the idea of having the postmaster set
child processes' latches ... that doesn't sound terribly safe from the
standpoint of not allowing the postmaster to mess with shared memory
state that could cause it to block or crash.  If we already do that
elsewhere, then OK, but I don't think we do.

regards, tom lane




Re: Do we work with LLVM 12 on s390x?

2021-03-19 Thread Tom Lane
Andres Freund  writes:
> I think the error above comes from a "mismatch" between the clang used
> to compile bitcode, and the LLVM version linked to. Normally we're
> somewhat tolerant of differences between the two, but there was an ABI
> change at some point, leading to that error.  IIRC I hit that, but it
> vanished as soon as I used a matching libllvm and clang.

Thanks, I passed that advice on.

regards, tom lane




Re: Do we work with LLVM 12 on s390x?

2021-03-19 Thread Andres Freund
Hi,

On 2021-03-19 14:03:21 -0400, Tom Lane wrote:
> The Red Hat folk are seeing a problem with that combination:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1940964
> 
> which boils down to
> 
> > Build fails with this error:
> > ERROR:  failed to JIT module: Added modules have incompatible data layouts: 
> > E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64 (module) vs 
> > E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64 (jit)
> 
> (By "build", I imagine the reporter means "regression tests")
> 
> So I was wondering if we'd tested it yet.

Yes, I did test it not too long ago, after Christoph Berg reported
Debian s390x failing with jit. Which made me learn a bunch of s390x
assembler and discover a bug in our code that only rarely happend (iirc
something about booleans that are not exactly 0 or 1 not testing
true)...

https://www.postgresql.org/message-id/20201015222924.yyms42qjloydfvar%40alap3.anarazel.de

I think the error above comes from a "mismatch" between the clang used
to compile bitcode, and the LLVM version linked to. Normally we're
somewhat tolerant of differences between the two, but there was an ABI
change at some point, leading to that error.  IIRC I hit that, but it
vanished as soon as I used a matching libllvm and clang.

Greetings,

Andres Freund




Replication slot stats misgivings

2021-03-19 Thread Andres Freund
Hi,

I started to write this as a reply to
https://postgr.es/m/20210318015105.dcfa4ceybdjubf2i%40alap3.anarazel.de
but I think it doesn't really fit under that header anymore.

On 2021-03-17 18:51:05 -0700, Andres Freund wrote:
> It does make it easier for the shared memory stats patch, because if
> there's a fixed number + location, the relevant stats reporting doesn't
> need to go through a hashtable with the associated locking.  I guess
> that may have colored my perception that it's better to just have a
> statically sized memory allocation for this.  Noteworthy that SLRU stats
> are done in a fixed size allocation as well...

As part of reviewing the replication slot stats patch I looked at
replication slot stats a fair bit, and I've a few misgivings. First,
about the pgstat.c side of things:

- If somehow slot stat drop messages got lost (remember pgstat
  communication is lossy!), we'll just stop maintaining stats for slots
  created later, because there'll eventually be no space for keeping
  stats for another slot.

- If max_replication_slots was lowered across a restart,
  pgstat_read_statfile() will happily write beyond the end of
  replSlotStats.

- pgstat_reset_replslot_counter() acquires ReplicationSlotControlLock. I
  think pgstat.c has absolutely no business doing things on that level.

- We do a linear search through all replication slots whenever receiving
  stats for a slot. Even though there'd be a perfectly good index to
  just use all throughout - the slots index itself. It looks to me like
  slots stat reports can be fairly frequent in some workloads, so that
  doesn't seem great.

- PgStat_ReplSlotStats etc use slotname[NAMEDATALEN]. Why not just NameData?

- pgstat_report_replslot() already has a lot of stats parameters, it
  seems likely that we'll get more. Seems like we should just use a
  struct of stats updates.


And then more generally about the feature:
- If a slot was used to stream out a large amount of changes (say an
  initial data load), but then replication is interrupted before the
  transaction is committed/aborted, stream_bytes will not reflect the
  many gigabytes of data we may have sent.
- It seems weird that we went to the trouble of inventing replication
  slot stats, but then limit them to logical slots, and even there don't
  record the obvious things like the total amount of data sent.


I think the best way to address the more fundamental "pgstat related"
complaints is to change how replication slot stats are
"addressed". Instead of using the slots name, report stats using the
index in ReplicationSlotCtl->replication_slots.
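
(For example, the "address" could just be the slot's offset in the shared
array, something like:

    /* sketch: stats are keyed by the slot's position, not its name */
    static inline int
    replication_slot_stats_index(ReplicationSlot *slot)
    {
        return (int) (slot - ReplicationSlotCtl->replication_slots);
    }

which stays valid for the life of the slot and is trivially bounded by
max_replication_slots.)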

That removes the risk of running out of "replication slot stat slots":
If we lose a drop message, the index eventually will be reused and we
likely can detect that the stats were for a different slot by comparing
the slot name.

It also makes it easy to handle the issue of max_replication_slots being
lowered and there still being stats for a slot - we simply can skip
restoring that slot's data, because we know the relevant slot can't exist
anymore. And we can make the initial pgstat_report_replslot() during
slot creation use a

I'm wondering if we should just remove the slot name entirely from the
pgstat.c side of things, and have pg_stat_get_replication_slots()
inquire about slots by index as well and get the list of slots to report
stats for from slot.c infrastructure.

Greetings,

Andres Freund




Re: non-HOT update not looking at FSM for large tuple update

2021-03-19 Thread Matthias van de Meent
On Fri, 19 Mar 2021 at 19:16, John Naylor  wrote:
>
> On Thu, Mar 18, 2021 at 5:30 PM Matthias van de Meent 
>  wrote:
> >
> > This is slightly more verbose, but I think this clarifies the
> > reasoning why we need this a bit better. Feel free to reject or adapt
> > as needed.
>
> I like this in general, but it still has some rough edges. I've made another 
> attempt in v5 incorporating your suggestions. Let me know what you think.

That is indeed better.

I believe this is ready, so I've marked it as RFC in the commitfest application.

With regards,

Matthias van de Meent.




Re: Proposal: Save user's original authenticated identity for logging

2021-03-19 Thread Jacob Champion
On Fri, 2021-03-19 at 16:54 +, Jacob Champion wrote:
> One additional improvement I would suggest, now that the rotation logic
> is simpler than it was in my original patch, is to rotate the logfile
> regardless of whether the test is checking the logs or not. (Similarly,
> we can manually rotate after the block of test_query() calls.) That way
> it's harder to match the last test's output.

The same effect can be had by moving the log rotation to the top of the
test that needs it, so I've done it that way in v7.

> The tradeoff is that if you need to check for log message order, or for
> multiple instances of overlapping patterns, you still need some sort of
> search-forward functionality.

Turns out it's easy now to have our cake and eat it too; a single if
statement can implement the same search-forward functionality that was
spread across multiple places before. So I've done that too.

Much nicer, thank you for the suggestion!

--Jacob
From 29b2de27e6370c883bfab9fa307c9764a1ffd0e1 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Wed, 3 Feb 2021 11:04:42 -0800
Subject: [PATCH v7 1/3] test/kerberos: rotate logs between tests

The log content tests searched through the entire log file, from the
beginning, for every match. If two tests shared the same expected log
content, it was possible for the second test to get a false positive by
matching against the first test's output. (This could be seen by
modifying one of the expected-failure tests to expect the same output as
a previous happy-path test.)

To fix, rotate the logs before every test that checks log files. This
has the advantage that we no longer wait for a long timeout on a failed
match.

Original implementation by Michael Paquier, with a change by me to
rotate at the beginning of every test instead of the end.
---
 src/test/kerberos/t/001_auth.pl | 33 +++--
 1 file changed, 11 insertions(+), 22 deletions(-)

diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index 079321bbfc..df62c28a10 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -20,7 +20,7 @@ use Time::HiRes qw(usleep);
 
 if ($ENV{with_gssapi} eq 'yes')
 {
-	plan tests => 34;
+	plan tests => 26;
 }
 else
 {
@@ -170,7 +170,6 @@ $node->append_conf(
 	'postgresql.conf', qq{
 listen_addresses = '$hostaddr'
 krb_server_keyfile = '$keytab'
-logging_collector = on
 log_connections = on
 # these ensure stability of test results:
 log_rotation_age = 0
@@ -187,6 +186,14 @@ sub test_access
 {
 	my ($node, $role, $query, $expected_res, $gssencmode, $test_name, $expect_log_msg) = @_;
 
+	if ($expect_log_msg ne '')
+	{
+		# Rotate to a new log file to prevent previous tests' log output from
+		# matching during this one.
+		$node->rotate_logfile;
+		$node->restart;
+	}
+
 	# need to connect over TCP/IP for Kerberos
 	my ($res, $stdoutres, $stderrres) = $node->psql(
 		'postgres',
@@ -212,27 +219,9 @@ sub test_access
 	# Verify specified log message is logged in the log file.
 	if ($expect_log_msg ne '')
 	{
-		my $current_logfiles = slurp_file($node->data_dir . '/current_logfiles');
-		note "current_logfiles = $current_logfiles";
-		like($current_logfiles, qr|^stderr log/postgresql-.*log$|,
-			 'current_logfiles is sane');
-
-		my $lfname = $current_logfiles;
-		$lfname =~ s/^stderr //;
-		chomp $lfname;
-
-		# might need to retry if logging collector process is slow...
-		my $max_attempts = 180 * 10;
-		my $first_logfile;
-		for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
-		{
-			$first_logfile = slurp_file($node->data_dir . '/' . $lfname);
-			last if $first_logfile =~ m/\Q$expect_log_msg\E/;
-			usleep(100_000);
-		}
-
+		my $first_logfile = slurp_file($node->logfile);
 		like($first_logfile, qr/\Q$expect_log_msg\E/,
-			 'found expected log file content');
+		 'found expected log file content');
 	}
 
 	return;
-- 
2.25.1

From db83791b1c95fdbc21322dba9bb5f1ea0e73c837 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 8 Feb 2021 10:53:20 -0800
Subject: [PATCH v7 2/3] ssl: store client's DN in port->peer_dn

Original patch by Andrew Dunstan:

https://www.postgresql.org/message-id/fd96ae76-a8e3-ef8e-a642-a592f5b76771%40dunslane.net

but I've taken out the clientname=DN functionality; all that will be
needed for the next patch is the DN string.
---
 src/backend/libpq/be-secure-openssl.c | 53 +++
 src/include/libpq/libpq-be.h  |  1 +
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index ad33122e0e..945419b76a 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -551,22 +551,25 @@ aloop:
 	/* Get client certificate, if available. */
 	port->peer = SSL_get_peer_certificate(port->ssl);
 
-	/* and extract the Common Name from it. */
+	/* and extract the Common Name / Distinguished Name from it. */
 	

Re: non-HOT update not looking at FSM for large tuple update

2021-03-19 Thread John Naylor
On Thu, Mar 18, 2021 at 5:30 PM Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:
>
> > + * The minimum space on a page for it to be considered "empty" regardless
> > + * of fillfactor. We base this on MaxHeapTupleSize, minus a small amount
> > + * of slack. We allow slack equal to 1/8 the maximum space that could be
> > + * taken by line pointers, which is somewhat arbitrary.
>
> > + * We want to allow inserting a large tuple into an empty page even if
> > + * that would violate the fillfactor. Otherwise, we would unnecessarily
> > + * extend the relation. Instead, ask the FSM for maxPaddedFsmRequest
> > + * bytes. This will allow it to return a page that is not quite empty
> > + * because of unused line pointers
>
> How about
>
> +* Because pages that have no items left can still have space allocated
> +* to item pointers, we consider pages "empty" for FSM requests when they
> +* have at most 1/8 of their MaxHeapTuplesPerPage item pointers' space
> +* allocated. This is a somewhat arbitrary number, but should prevent
> +* most unnecessary relation extensions due to not finding "empty" pages
> +* while inserting combinations of large tuples with low fillfactors.
>
> +* When the free space to be requested from the FSM is greater than
> +* maxPaddedFsmRequest, this is considered equivalent to requesting
> +* insertion on an "empty" page, so instead of failing to find a page
> +* with more empty space than an "empty" page and extend the relation,
> +* we try to find a page which is considered "empty".
>
> This is slightly more verbose, but I think this clarifies the
> reasoning why we need this a bit better. Feel free to reject or adapt
> as needed.

I like this in general, but it still has some rough edges. I've made another
attempt in v5 incorporating your suggestions. Let me know what you think.

--
John Naylor
EDB: http://www.enterprisedb.com


v5-0001-Fix-bug-in-heap-space-management-that-was-overly-.patch
Description: Binary data


Re: [HACKERS] Custom compression methods

2021-03-19 Thread Robert Haas
On Fri, Mar 19, 2021 at 1:44 PM Justin Pryzby  wrote:
> Working with one of Andrey's patches on another thread, he reported offlist
> getting this message, resolved by this patch.  Do you see this warning during
> ./configure ?  The latest CI is of a single patch without the LZ4 stuff, so I
> can't check its log.
>
> configure: WARNING: lz4.h: accepted by the compiler, rejected by the 
> preprocessor!
> configure: WARNING: lz4.h: proceeding with the compiler's result

No, I don't see this. I wonder whether this could possibly be an
installation issue on Andrey's machine? If not, it must be
version-dependent or installation-dependent in some way.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Do we work with LLVM 12 on s390x?

2021-03-19 Thread Tom Lane
The Red Hat folk are seeing a problem with that combination:

https://bugzilla.redhat.com/show_bug.cgi?id=1940964

which boils down to

> Build fails with this error:
> ERROR:  failed to JIT module: Added modules have incompatible data layouts: 
> E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64 (module) vs 
> E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64 (jit)

(By "build", I imagine the reporter means "regression tests")

So I was wondering if we'd tested it yet.

regards, tom lane




Re: cleanup temporary files after crash

2021-03-19 Thread Euler Taveira
On Fri, Mar 19, 2021, at 2:27 PM, Tomas Vondra wrote:
> I've pushed this version, with the plpgsql block. I've tested it on all
> the machines I have here, hopefully it'll make buildfarm happy too.
Thanks for taking care of this issue.

> AFAICS I was mistaken about what the pump() functions do - it clearly
> does not run the command repeatedly, it just waits for the right output
> to appear. So a busy loop in plpgsql seems like a reasonable solution.
> Perhaps there's a better way to do this in TAP, not sure.
I took 3 times longer to do that fragile test than the code itself. :-/

> My brain hurts from reading too much Perl today ...
I feel the same when I have to deal with Perl code.

It seems the animals are happy with this fix.

--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: [HACKERS] Custom compression methods

2021-03-19 Thread Justin Pryzby
On Fri, Mar 19, 2021 at 01:24:39PM -0400, Robert Haas wrote:
> On Fri, Mar 19, 2021 at 12:35 PM Justin Pryzby  wrote:
> > I sent offlist a couple of times but notice that the latest patch is missing
> > this bit around AC_CHECK_HEADERS, which apparently can sometimes cause
> > warnings on mac.
> >
> > ac_save_CPPFLAGS=$CPPFLAGS
> > CPPFLAGS="$LZ4_CFLAGS $CPPFLAGS"
> > AC_CHECK_HEADERS(lz4/lz4.h, [],
> >  [AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is 
> > required for LZ4])])])
> > CPPFLAGS=$ac_save_CPPFLAGS
> 
> Hmm, it's working for me on macOS Catalina without this. Why do we
> need it? Can you provide a patch that inserts it in the exact place
> you think it needs to go?

Working with one of Andrey's patches on another thread, he reported offlist
getting this message, resolved by this patch.  Do you see this warning during
./configure ?  The latest CI is of a single patch without the LZ4 stuff, so I
can't check its log.

configure: WARNING: lz4.h: accepted by the compiler, rejected by the preprocessor!
configure: WARNING: lz4.h: proceeding with the compiler's result

-- 
Justin




Re: Columns correlation and adaptive query optimization

2021-03-19 Thread Zhihong Yu
Hi,
In AddMultiColumnStatisticsForQual(),

+   /* Loop until we considered all vars */
+   while (vars != NULL)
...
+   /* Contruct list of unique vars */
+   foreach (cell, vars)

What if some cell / node gets into the else block:

+   else
+   {
+   continue;

and is left in vars? Is there a chance of an infinite loop?
It seems there should be a bool variable indicating whether any cell gets
to the following:

+   vars = foreach_delete_current(vars, cell);

If no cell gets removed in the current iteration, the outer while loop
should exit.
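
For illustration, a sketch of the suggested guard (not the patch's actual
code; belongs_to_current_relation() is a stand-in for whatever test decides
that a var is consumed in the current pass):

    while (vars != NIL)
    {
        ListCell   *cell;
        bool        removed_any = false;

        foreach(cell, vars)
        {
            Node       *var = (Node *) lfirst(cell);

            if (belongs_to_current_relation(var))
            {
                vars = foreach_delete_current(vars, cell);
                removed_any = true;
            }
            else
                continue;       /* leave it for a later pass */
        }

        /*
         * If nothing was removed in this pass, later passes cannot make
         * progress either, so bail out instead of looping forever.
         */
        if (!removed_any)
            break;
    }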

Cheers

On Fri, Mar 19, 2021 at 9:58 AM Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
>
> On 19.03.2021 12:17, Yugo NAGATA wrote:
> > On Wed, 10 Mar 2021 03:00:25 +0100
> > Tomas Vondra  wrote:
> >
> >> What is being proposed here - an extension suggesting which statistics
> >> to create (and possibly creating them automatically) is certainly
> >> useful, but I'm not sure I'd call it "adaptive query optimization". I
> >> think "adaptive" means the extension directly modifies the estimates
> >> based on past executions. So I propose calling it maybe "statistics
> >> advisor" or something like that.
> > I am also agree with the idea to implement this feature as a new
> > extension for statistics advisor.
> >
> >> BTW Why is "qual" in
> >>
> >>static void
> >>AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
> >>
> >> declared as "void *"? Shouldn't that be "List *"?
> > When I tested this extension using TPC-H queries, it raised segmentation
> > fault in this function. I think the cause would be around this argument.
> >
> > Regards,
> > Yugo Nagata
> >
> Attached please find new version of the patch with
> AddMultiColumnStatisticsForQual parameter type fix and one more fix
> related with handling synthetic attributes.
> I can not reproduce the crash on TPC-H queries, so if the problem
> persists, can you please send me stack trace and may be some other
> information helping to understand the reason of SIGSEGV?
>
> Thanks in advance,
> Konstantin
>
>


Bringing some sanity to RestoreGUCState()

2021-03-19 Thread Tom Lane
In the thread about valgrind leak detection [1], we noticed that
RestoreGUCState(), which is intended to load the leader process's
GUC settings into a parallel worker, was causing visible memory
leaks by invoking InitializeOneGUCOption() on already-set-up GUCs.
I noted that simply removing that call made the leaks go away with
no obvious ill effects, but didn't stop to look closer.

I've now looked closer, and I see that the reason that removing
that call has no ill effects is in fact that it's a complete no-op.
Every GUC that this chooses to target:

if (!can_skip_gucvar(guc_variables[i]))
InitializeOneGUCOption(guc_variables[i]);

is one that the leader backend will send a value for, so that the
reinitialized value will certainly be overwritten in the next loop.
(Actually, the set of forcibly-reinited GUCs is a strict subset of
those that the leader will send, since GUCs that have source
PGC_S_DEFAULT in the newly-started worker might have other sources
in the leader.)

I wonder whether the intent was to do the negation of this test,
ie reset GUCs that the leader *isn't* going to send.  But that's
very obviously the wrong thing, because it would lose the values
of (at least) PGC_POSTMASTER variables.

So we can remove the code that does this, and I intend to go do so.
However, given the unmistakable evidence of sloppy thinking here,
I looked closer at exactly what can_skip_gucvar() is doing, and
I think we're either sending too much or too little.

The argument for "sending too little" comes from the race condition
that's described in the function's comments: a variable that has
source PGC_S_DEFAULT (ie, has never moved off its compiled-in default)
in the leader could have just been updated in the postmaster, due to
re-reading postgresql.conf after SIGHUP.  In that case, when the
postmaster forks the worker it will inherit the new setting from
postgresql.conf, and will run with that because the leader didn't send
its value.  So we risk having a situation where parallel workers are
using a setting that the leader won't adopt until it next goes idle.

Now, this shouldn't cause any really fundamental problems (if it
could, the variable shouldn't have been marked as safe to change
at SIGHUP).  But you could imagine some minor query weirdness
being traceable to that.  I think that the authors of this code
judged that the cost of sending default GUC values was more than
preventing such weirdness is worth, and I can see the point.
Neglecting the PGC_POSTMASTER and PGC_INTERNAL variables, which
seem safe to not send, I see this in a regression test install:

=# select source,count(*) from pg_settings where context not in ('postmaster', 
'internal') group by 1;
        source        | count 
----------------------+-------
 client               |     2
 environment variable |     1
 configuration file   |     6
 default              |   246
 database             |     6
 override             |     3
 command line         |     1
(7 rows)

Sending 265 values to a new parallel worker instead of 19 would be
a pretty large cost to avoid a race condition that probably wouldn't
have significant ill effects anyway.

However, if you are willing to accept that tradeoff, then this code
is leaving a lot on the table, because there is no more reason for
it to send values with any of these sources than there is to send
PGC_S_DEFAULT ones:

PGC_S_DYNAMIC_DEFAULT,  /* default computed during initialization */
PGC_S_ENV_VAR,  /* postmaster environment variable */
PGC_S_FILE, /* postgresql.conf */
PGC_S_ARGV, /* postmaster command line */
PGC_S_GLOBAL,   /* global in-database setting */
PGC_S_DATABASE, /* per-database setting */
PGC_S_USER, /* per-user setting */
PGC_S_DATABASE_USER,/* per-user-and-database setting */

The new worker will have absorbed all such values already during its
regular InitPostgres() call.

I suppose there's an argument to be made that skipping such values
widens the scope of the race hazard a bit, since a GUC that the DBA
has already chosen to move off of default (or set via ALTER USER/
DATABASE) might be one she's more likely to change later.  But that
argument seems pretty tissue-thin to me.

In short, I think we really only need to transmit GUCs with sources
of PGC_S_CLIENT and higher.  In the regression environment that
would cut us down from sending 19 values to sending 5.  In production
the win would likely be substantially more, since it's more likely
that the DBA would have tweaked more things in postgresql.conf;
which are variables we are sending today and don't have to.
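
For concreteness, the tightened test could look roughly like this (a sketch
against the current struct config_generic fields, not committed code):

static bool
can_skip_gucvar(struct config_generic *gconf)
{
    /*
     * Skip variables the worker already absorbed through its own startup
     * sequence (anything below PGC_S_CLIENT), plus variables that cannot
     * change after postmaster start anyway.
     */
    return gconf->context == PGC_POSTMASTER ||
        gconf->context == PGC_INTERNAL ||
        gconf->source < PGC_S_CLIENT;
}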

Thoughts?

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/3471359.1615937770%40sss.pgh.pa.us




Re: cleanup temporary files after crash

2021-03-19 Thread Tomas Vondra
On 3/19/21 5:26 PM, Tomas Vondra wrote:
> On 3/19/21 5:23 PM, Tomas Vondra wrote:
>> ...
>>
>> If I replace this with a wait loop in a plpgsql block, that works
>> perfectly fine (no infinite loops). Tested both on x86_64 and rpi.
>>
> 
> For the record, here's the version with plpgsql block, which seems to be
> working just fine.
> 

I've pushed this version, with the plpgsql block. I've tested it on all
the machines I have here, hopefully it'll make buildfarm happy too.

AFAICS I was mistaken about what the pump() functions do - it clearly
does not run the command repeatedly, it just waits for the right output
to appear. So a busy loop in plpgsql seems like a reasonable solution.
Perhaps there's a better way to do this in TAP, not sure.

My brain hurts from reading too much Perl today ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Failed assertion on standby while shutdown

2021-03-19 Thread Maxim Orlov

Hi, hackers!

Recently, I was doing some experiments with primary/standby instance 
interaction. Under certain conditions I hit, and was able to reproduce, 
a crash on a failed assertion.


The scenario is the following:
1. start primary server
2. start standby server by pg_basebackup -P -R -X stream -c fast -p5432 
-D data
3. apply some load to the primary server by pgbench -p5432 -i -s 150 
postgres

4. kill primary server (with kill -9) and keep it down
5. stop standby server by pg_ctl
6. run standby server

Then any standby server termination will result in a failed assertion.

The log with a backtrace is following:

2021-03-19 18:54:25.352 MSK [3508443] LOG:  received fast shutdown 
request
2021-03-19 18:54:25.379 MSK [3508443] LOG:  aborting any active 
transactions
TRAP: FailedAssertion("SHMQueueEmpty(&(MyProc->myProcLocks[i]))", File: 
"/home/ziva/projects/pgpro/build-secondary/../postgrespro/src/backend/storage/lmgr/proc.c", 
Line: 592, PID: 3508452)

postgres: walreceiver (ExceptionalCondition+0xd0)[0x55d0526f]
postgres: walreceiver (InitAuxiliaryProcess+0x31c)[0x55b43e31]
postgres: walreceiver (AuxiliaryProcessMain+0x54f)[0x5574ae32]
postgres: walreceiver (+0x530bff)[0x55a84bff]
postgres: walreceiver (+0x531044)[0x55a85044]
postgres: walreceiver (+0x530959)[0x55a84959]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x77a303c0]
/lib/x86_64-linux-gnu/libc.so.6(__select+0x1a)[0x772a40da]
postgres: walreceiver (+0x52bea4)[0x55a7fea4]
postgres: walreceiver (PostmasterMain+0x129f)[0x55a7f7c1]
postgres: walreceiver (+0x41ff1f)[0x55973f1f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x771b30b3]
postgres: walreceiver (_start+0x2e)[0x5561abfe]

After a brief investigation I found out that I can get this assert with 
100% probability if I insert a sleep for about 5 sec into 
InitAuxiliaryProcess(void) in src/backend/storage/lmgr/proc.c:


diff --git a/src/backend/storage/lmgr/proc.c 
b/src/backend/storage/lmgr/proc.c

index 897045ee272..b5f365f426d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -525,7 +525,7 @@ InitAuxiliaryProcess(void)

if (MyProc != NULL)
elog(ERROR, "you already exist");
-
+   pg_usleep(500L);
/*
 * We use the ProcStructLock to protect assignment and releasing 
of

 * AuxiliaryProcs entries.

Maybe this kind of behaviour would appear if a computer hosting the 
instances is under significant side load, which causes delays in starting 
the database instances under heavy load.


Configuration for a primary server is default with "wal_level = logical"

Configuration for a standby server is default with "wal_level = logical" 
and "primary_conninfo = 'port=5432'"


I'm puzzled by this behavior. I'm pretty sure it is not what should 
happen. Any ideas how this can be fixed?


---
Best regards,
Maxim Orlov.




Re: [HACKERS] Custom compression methods

2021-03-19 Thread Robert Haas
On Fri, Mar 19, 2021 at 12:35 PM Justin Pryzby  wrote:
> I sent offlist a couple of times but notice that the latest patch is missing
> this bit around AC_CHECK_HEADERS, which apparently can sometimes cause
> warnings on mac.
>
> ac_save_CPPFLAGS=$CPPFLAGS
> CPPFLAGS="$LZ4_CFLAGS $CPPFLAGS"
> AC_CHECK_HEADERS(lz4/lz4.h, [],
>  [AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is 
> required for LZ4])])])
> CPPFLAGS=$ac_save_CPPFLAGS

Hmm, it's working for me on macOS Catalina without this. Why do we
need it? Can you provide a patch that inserts it in the exact place
you think it needs to go?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: support for MERGE

2021-03-19 Thread Bruce Momjian
On Fri, Mar 19, 2021 at 10:53:53AM -0400, David Steele wrote:
> On 3/19/21 10:44 AM, Alvaro Herrera wrote:
> > On 2021-Mar-19, David Steele wrote:
> > 
> > > Since it does not appear this is being worked on for PG14 perhaps we 
> > > should
> > > close it in this CF and then reopen it when a patch is available?
> > 
> > No, please move it to the next CF.  Thanks
> 
> Ugh. I clicked wrong and moved it to the 2021-09 CF. I'm not sure if there's
> a way to fix it without creating a new entry.

Let's get someone to go into the "database" and adjust it.  ;-)

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Columns correlation and adaptive query optimization

2021-03-19 Thread Konstantin Knizhnik



On 19.03.2021 12:17, Yugo NAGATA wrote:

On Wed, 10 Mar 2021 03:00:25 +0100
Tomas Vondra  wrote:


What is being proposed here - an extension suggesting which statistics
to create (and possibly creating them automatically) is certainly
useful, but I'm not sure I'd call it "adaptive query optimization". I
think "adaptive" means the extension directly modifies the estimates
based on past executions. So I propose calling it maybe "statistics
advisor" or something like that.

I am also agree with the idea to implement this feature as a new
extension for statistics advisor.


BTW Why is "qual" in

   static void
   AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)

declared as "void *"? Shouldn't that be "List *"?

When I tested this extension using TPC-H queries, it raised segmentation
fault in this function. I think the cause would be around this argument.

Regards,
Yugo Nagata

Attached please find a new version of the patch with the 
AddMultiColumnStatisticsForQual parameter type fix and one more fix 
related to handling synthetic attributes.
I cannot reproduce the crash on TPC-H queries, so if the problem 
persists, can you please send me a stack trace and maybe some other 
information helping to understand the reason for the SIGSEGV?


Thanks in advance,
Konstantin

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index faa6231d87..3fefc47e0b 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
 #include 
+#include 
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -34,7 +54,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_add_statistics_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -85,6 +107,7 @@ static void explain_ExecutorRun(QueryDesc *queryDesc,
 static void explain_ExecutorFinish(QueryDesc *queryDesc);
 static void explain_ExecutorEnd(QueryDesc *queryDesc);
 
+static void AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
 
 /*
  * Module load callback
@@ -230,6 +253,30 @@ _PG_init(void)
 			 NULL,
 			 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+			 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+			 "Zero disables implicit creation of multicolumn statistic.",
+			 &auto_explain_add_statistics_threshold,
+			 0.0,
+			 0.0,
+			 INT_MAX,
+			 PGC_SUSET,
+			 0,
+			 NULL,
+			 NULL,
+			 NULL);
+
+	DefineCustomBoolVariable("auto_explain.add_statistics_suggest_only",
+			 "Do not create statistic but just record in WAL suggested create statistics statement.",
+			 NULL,
+			 &auto_explain_add_statistics_suggest_only,
+			 false,
+			 PGC_SUSET,
+			 0,
+			 NULL,
+			 NULL,
+			 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -363,6 +410,283 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+/**
+ * Try to add multicolumn statistics for specified subplans.
+ */
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for plan subnodes.
+ */
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+	   ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+/**
+ * Comparator used to sort Vars by name
+ */
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef 

Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-19 Thread Bruce Momjian
On Fri, Mar 19, 2021 at 10:27:51PM +0800, Julien Rouhaud wrote:
> On Fri, Mar 19, 2021 at 09:29:06AM -0400, Bruce Momjian wrote:
> > OK, that makes perfect sense.  I think the best solution is to document
> > that compute_query_id just controls the built-in computation of the
> > query id, and that extensions can also compute it if this is off, and
> > pg_stat_activity and log_line_prefix will display built-in or extension
> > computed query ids.
> 
> So the last version of the patch should implement that behavior right?  It's
> just missing some explicit guidance that third-party extensions should only
> calculate a queryid if compute_query_id is off

Yes, I think we are now down to just how the extensions should be told
to behave, and how we document this --- see the email I just sent.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Proposal: Save user's original authenticated identity for logging

2021-03-19 Thread Jacob Champion
On Fri, 2021-03-19 at 17:21 +0900, Michael Paquier wrote:
> On Thu, Mar 18, 2021 at 05:14:24PM +0900, Michael Paquier wrote:
> > Looking at 0001, I am not much a fan of relying on the position of the
> > matching pattern in the log file.  Instead of relying on the logging
> > collector and one single file, why not just changing the generation of
> > the logfile and rely on the output of stderr by restarting the server?

For getting rid of the logging collector logic, this is definitely an
improvement. It was briefly discussed in [1] but I never got around to
trying it; thanks!

One additional improvement I would suggest, now that the rotation logic
is simpler than it was in my original patch, is to rotate the logfile
regardless of whether the test is checking the logs or not. (Similarly,
we can manually rotate after the block of test_query() calls.) That way
it's harder to match the last test's output.

> While looking at 0003, I have noticed that the new kerberos tests
> actually switch from a logic where one message pattern matches, to a
> logic where multiple message patterns match, but I don't see a problem
> with what I sent previously, as long as one consume once a log file
> and matches all the patterns once, say like the following in
> test_access():

The tradeoff is that if you need to check for log message order, or for
multiple instances of overlapping patterns, you still need some sort of
search-forward functionality. But looking over the tests, I don't see
any that truly *need* that yet. It's nice that the current patchset
enforces an "authenticated" line before an "authorized" line, but I
think it's nicer to not have the extra code.

I'll incorporate this approach into the patchset. Thanks!

--Jacob

[1] 
https://www.postgresql.org/message-id/f1fd9ccaf7ffb2327bf3c06120afeadd50c1db97.camel%40vmware.com


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-19 Thread Bruce Momjian
On Fri, Mar 19, 2021 at 10:35:21PM +0800, Julien Rouhaud wrote:
> On Fri, Mar 19, 2021 at 02:54:16PM +0100, Hannu Krosing wrote:
> > On Fri, Mar 19, 2021 at 2:29 PM Bruce Momjian  wrote:
> > The log-spam could be mitigated by logging it just once per connection
> > the first time it is overridden
> 
> Yes, but it might still generate a significant amount of additional lines.
> 
> If extension authors follow the recommendations and only calculate a queryid
> when compute_query_id is off, it should be easy to check that you have
> everything set up properly.

Seems extensions that want to generate their own query id should just
error out with a message to the log file if compute_query_id is set ---
that should fix the entire issue --- but see below.

> > Also, we could ask the extensions to expose the "method name" in a 
> > read-only GUC
> > 
> > so one can do
> > 
> > SHOW compute_query_id_method;
> > 
> > and get the name of method use
> > 
> > compute_query_id_method
> > 
> > builtin
> > 
> > And it may even dynamically change to indicate the overriding of builtin
> > 
> > compute_query_id_method
> > ---
> > fancy_compute_query_id (overrides builtin)
> 
> This could be nice, but I'm not sure that it would work well if someone
> installs multiple extensions that calculate a queryid (which would be silly but
> still), or load another one at runtime.

Well, given we don't really want to support multiple query id types
being generated or displayed, the "error out" above should fix it. 

Let's do this --- tell extensions to error out if the query id is
already set, either by compute_query_id or another extension.  If an
extension wants to generate its own query id and store it internally in
the extension, that is fine, but the server-displayed query id should be
generated once and never overwritten by an extension.
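
As a sketch of that rule from the extension side (names here are made up:
my_compute_query_id() is hypothetical, and the hook signature shown is the
current two-argument one):

static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;

static void
myqueryid_post_parse_analyze(ParseState *pstate, Query *query)
{
    if (prev_post_parse_analyze_hook)
        prev_post_parse_analyze_hook(pstate, query);

    /* refuse to overwrite a queryid computed by core or another extension */
    if (query->queryId != UINT64CONST(0))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("a query identifier has already been computed"),
                 errhint("Disable compute_query_id and any other query-id "
                         "extension before loading this one.")));

    query->queryId = my_compute_query_id(pstate, query);
}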

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: [HACKERS] Custom compression methods

2021-03-19 Thread Justin Pryzby
I sent offlist a couple of times but notice that the latest patch is missing
this bit around AC_CHECK_HEADERS, which apparently can sometimes cause
warnings on mac.

ac_save_CPPFLAGS=$CPPFLAGS
CPPFLAGS="$LZ4_CFLAGS $CPPFLAGS"
AC_CHECK_HEADERS(lz4/lz4.h, [],
 [AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is required 
for LZ4])])])
CPPFLAGS=$ac_save_CPPFLAGS  

   

> diff --git a/configure.ac b/configure.ac
> index 2f1585a..54efbb2 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -1410,6 +1425,11 @@ failure.  It is possible the compiler isn't looking in 
> the proper directory.
>  Use --without-zlib to disable zlib support.])])
>  fi
>  
> +if test "$with_lz4" = yes; then
> +  AC_CHECK_HEADERS(lz4/lz4.h, [],
> +   [AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is 
> required for LZ4])])])
> +fi
> +
>  if test "$with_gssapi" = yes ; then
>AC_CHECK_HEADERS(gssapi/gssapi.h, [],
>   [AC_CHECK_HEADERS(gssapi.h, [], [AC_MSG_ERROR([gssapi.h header file is 
> required for GSSAPI])])])




Re: cleanup temporary files after crash

2021-03-19 Thread Tomas Vondra
On 3/19/21 5:23 PM, Tomas Vondra wrote:
> ...
>
> If I replace this with a wait loop in a plpgsql block, that works
> perfectly fine (no infinite loops). Tested both on x86_64 and rpi.
>

For the record, here's the version with plpgsql block, which seems to be
working just fine.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 8044849b73..c5624fe864 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -79,8 +79,8 @@ my $killme2 = IPC::Run::start(
 # Insert one tuple and leave the transaction open
 $killme_stdin2 .= q[
 BEGIN;
-SELECT $$insert-tuple-to-lock-next-insert$$;
 INSERT INTO tab_crash (a) VALUES(1);
+SELECT $$insert-tuple-to-lock-next-insert$$;
 ];
 pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
 $killme_stdout2 = '';
@@ -100,6 +100,26 @@ ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
 $killme_stdout = '';
 $killme_stderr = '';
 
+# Wait until the batch insert gets stuck on the lock.
+$killme_stdin2 .= q[
+DO $c$
+DECLARE
+  c INT;
+BEGIN
+  LOOP
+SELECT COUNT(*) INTO c FROM pg_locks WHERE pid = ] . $pid . q[ AND NOT granted;
+IF c > 0 THEN
+  EXIT;
+END IF;
+  END LOOP;
+END; $c$;
+SELECT $$insert-tuple-lock-waiting$$;
+];
+
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-lock-waiting/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
 # Kill with SIGKILL
 my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
 is($ret, 0, 'killed process with KILL');
@@ -147,8 +167,8 @@ $killme2->run();
 # Insert one tuple and leave the transaction open
 $killme_stdin2 .= q[
 BEGIN;
-SELECT $$insert-tuple-to-lock-next-insert$$;
 INSERT INTO tab_crash (a) VALUES(1);
+SELECT $$insert-tuple-to-lock-next-insert$$;
 ];
 pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
 $killme_stdout2 = '';
@@ -168,6 +188,26 @@ ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
 $killme_stdout = '';
 $killme_stderr = '';
 
+# Wait until the batch insert gets stuck on the lock.
+$killme_stdin2 .= q[
+DO $c$
+DECLARE
+  c INT;
+BEGIN
+  LOOP
+SELECT COUNT(*) INTO c FROM pg_locks WHERE pid = ] . $pid . q[ AND NOT granted;
+IF c > 0 THEN
+  EXIT;
+END IF;
+  END LOOP;
+END; $c$;
+SELECT $$insert-tuple-lock-waiting$$;
+];
+
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-lock-waiting/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
 # Kill with SIGKILL
 $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
 is($ret, 0, 'killed process with KILL');




Re: cleanup temporary files after crash

2021-03-19 Thread Tomas Vondra


On 3/19/21 3:17 PM, Euler Taveira wrote:
> On Fri, Mar 19, 2021, at 12:23 AM, Tom Lane wrote:
>> [ reads code... ]
>> ... no, I think the problem is the test is still full of race conditions.
>>
>> In the first place, waiting till you see the output of a SELECT that's
>> before the useful query is not enough to guarantee that the useful query
>> is done, or even started.  That's broken on both sessions.
> That's an ugly and fragile mechanism to workaround the fact that pump_until
> reacts after you have the query return.
> 
>> In the second place, even if the second INSERT has started, you don't know
>> that it's reached the point of blocking on the tuple conflict yet.
>> Which in turn means that it might not've filled its tuplestore yet.
>>
>> In short, this script is designed to ensure that the test query can't
>> finish too soon, but that proves nothing about whether the test query
>> has even started.  And since you also haven't really guaranteed that the
>> intended-to-be-blocking query is done, I don't think that the first
>> condition really holds either.
> In order to avoid the race condition between filling the tuplestore and
> killing the backend, we could use a poll_query_until() before SIGKILL to
> wait for the temporary file to be created. Do you think this modification
> will make this test more stable?
> 

Wow, I thought I understood the Perl code, but clearly I was wrong.

I think the solution is not particularly difficult.

For the initial insert, it's fine to switch the order of the SQL
commands, so that the insert happens first, and only then emit the string.

For the second insert, we can't do that, because it's expected to block
on the lock. So we have to verify that by looking at the lock directly.

However, I don't understand what the pump_until function does. AFAICS it
should run the query in a loop, and bail out once it matches the
expected output. But that does not seem to happen, so when the query
gets executed before the lock is there, it results in an infinite loop.

In the attached patch I've simulated this by random() < 0.5.

If I replace this with a wait loop in a plpgsql block, that works
perfectly fine (no infinite loops). Tested both on x86_64 and rpi.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 8044849b73..88478d44c1 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -79,8 +79,8 @@ my $killme2 = IPC::Run::start(
 # Insert one tuple and leave the transaction open
 $killme_stdin2 .= q[
 BEGIN;
-SELECT $$insert-tuple-to-lock-next-insert$$;
 INSERT INTO tab_crash (a) VALUES(1);
+SELECT $$insert-tuple-to-lock-next-insert$$;
 ];
 pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
 $killme_stdout2 = '';
@@ -100,6 +100,15 @@ ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
 $killme_stdout = '';
 $killme_stderr = '';
 
+# Wait until the batch insert gets stuck on the lock.
+$killme_stdin2 .= q[
+SELECT $$insert-tuple-lock-waiting$$ FROM pg_locks WHERE pid = ] . $pid . q[ AND NOT granted AND random() < 0.5;
+];
+
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-lock-waiting/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
 # Kill with SIGKILL
 my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
 is($ret, 0, 'killed process with KILL');
@@ -147,8 +156,8 @@ $killme2->run();
 # Insert one tuple and leave the transaction open
 $killme_stdin2 .= q[
 BEGIN;
-SELECT $$insert-tuple-to-lock-next-insert$$;
 INSERT INTO tab_crash (a) VALUES(1);
+SELECT $$insert-tuple-to-lock-next-insert$$;
 ];
 pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
 $killme_stdout2 = '';
@@ -168,6 +177,15 @@ ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
 $killme_stdout = '';
 $killme_stderr = '';
 
+# Wait until the batch insert gets stuck on the lock.
+$killme_stdin2 .= q[
+SELECT $$insert-tuple-lock-waiting$$ FROM pg_locks WHERE pid = ] . $pid . q[ AND NOT granted AND random() < 0.5;
+];
+
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-lock-waiting/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
 # Kill with SIGKILL
 $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
 is($ret, 0, 'killed process with KILL');


Re: Change default of checkpoint_completion_target

2021-03-19 Thread David Steele

On 1/19/21 2:47 PM, Stephen Frost wrote:


* Tom Lane (t...@sss.pgh.pa.us) wrote:

Stephen Frost  writes:

Any further comments or thoughts on this one?


This:

+total time between checkpoints. The default is 0.9, which spreads the
+checkpoint across the entire checkpoint timeout period of time,

is confusing because 0.9 is obviously not 1.0; people will wonder
whether the scale is something strange or the text is just wrong.
They will also wonder why not use 1.0 instead.  So perhaps more like

... The default is 0.9, which spreads the checkpoint across almost
all the available interval, providing fairly consistent I/O load
while also leaving some slop for checkpoint completion overhead.

The other chunk of text seems accurate, but there's no reason to let
this one be misleading.


Good point, updated along those lines.


I had a look at the patch and the change and new documentation seem 
sensible to me.


I think this phrase may be a bit too idiomatic:

+consistent I/O load while also leaving some slop for checkpoint

Perhaps just:

+consistent I/O load while also leaving some time for checkpoint

It seems to me that the discussion about changing the wording for GUCs 
not changeable after server start should be saved for another patch as long as 
this patch follows the current convention.


Regards,
--
-David
da...@pgmasters.net




Re: create table like: ACCESS METHOD

2021-03-19 Thread David Steele

On 1/19/21 4:03 PM, Justin Pryzby wrote:

On Wed, Dec 30, 2020 at 12:33:56PM +, Simon Riggs wrote:

There are no tests for the new functionality, please could you add some?


Did you look at the most recent patch?

+CREATE ACCESS METHOD heapdup TYPE TABLE HANDLER heap_tableam_handler;
+CREATE TABLE likeam() USING heapdup;
+CREATE TABLE likeamlike(LIKE likeam INCLUDING ALL);

Also, I just realized that Dilip's toast compression patch adds "INCLUDING
COMPRESSION", which is stored in pg_am.  That's an implementation detail of
that patch, but it's not intuitive that "including access method" wouldn't
include the compression stored there.  So I think this should use "INCLUDING
TABLE ACCESS METHOD" not just ACCESS METHOD.


Simon, do you know when you'll have a chance to review the updated patch 
in [1]?


Regards,
--
-David
da...@pgmasters.net

[1] 
https://www.postgresql.org/message-id/20210119210331.GN8560%40telsasoft.com





Re: Logical Replication vs. 2PC

2021-03-19 Thread Markus Wanner

On 18.03.21 10:45, Amit Kapila wrote:

While reviewing/testing subscriber-side work for $SUBJECT [1], I
noticed a problem that seems to need a broader discussion, so started
this thread. We can get prepare for the same GID more than once for
the cases where we have defined multiple subscriptions for
publications on the same server and prepared transaction has
operations on tables subscribed to those subscriptions. For such
cases, one of the prepare will be successful and others will fail in
which case the server will send them again. Once the commit prepared
is done for the first one, the next prepare will be successful. Now,
this is not ideal but will work.


That's assuming you're using the same gid on the subscriber, which does 
not apply to all use cases.  It clearly depends on what you try to 
achieve by decoding in two phases, obviously.


We clearly don't have this issue in BDR, because we're using xids 
(together with a node id) to globally identify transactions and 
construct local (per-node) gids that don't clash.
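
For illustration only (this is not BDR's actual scheme), a per-node gid that
cannot collide across subscriptions could be derived from the origin's
identity and xid, e.g.:

static void
build_subscriber_gid(char *gid, Size szgid,
                     Oid origin_id, TransactionId origin_xid)
{
    /* e.g. "pg_gid_16384_734" -- unique per origin node and transaction */
    snprintf(gid, szgid, "pg_gid_%u_%u", origin_id, origin_xid);
}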


(Things get even more interesting if you take into account that users 
may reuse the same gid for different transactions.  Lag between 
subscriptions could then lead to blocking between different origin 
transactions...)


Regards

Markus




Re: psql \df choose functions by their arguments

2021-03-19 Thread David Steele

On 1/19/21 11:58 AM, Greg Sabino Mullane wrote:
Ha ha ha, my bad, I am not sure why I left those out. Here is a new 
patch with int2, int4, and int8. Thanks for the email.


Ian, does the new patch look good to you?

Also, not sure why the target version for this patch is stable so I have 
updated it to PG14.


Regards,
--
-David
da...@pgmasters.net




Re: [PATCH] ProcessInterrupts_hook

2021-03-19 Thread David Steele

On 1/19/21 1:42 AM, Craig Ringer wrote:
On Tue, 19 Jan 2021 at 12:44, Craig Ringer <craig.rin...@enterprisedb.com> wrote:


 > We're about halfway there already, see 7e784d1dc.  I didn't
do the
 > other half because it wasn't necessary to the problem, but
exposing
 > the shutdown state more fully seems reasonable.

Excellent, I'll take a look. Thanks.

That looks very handy already.

Extending it to be set before SIGTERM too would be handy.

My suggestion, which I'm happy to post in patch form if you think it's 
reasonable 


Tom, Robert, any thoughts on the proposals in [1]?

Craig, based on the state of this proposal (i.e. likely not a candidate 
for PG14) I think it makes sense to move it to the next CF.


Regards,
--
-David
da...@pgmasters.net

[1] 
https://www.postgresql.org/message-id/CAGRY4nyNfscmQiZBCNT7cBYnQxJLAAVCGz%2BGZAQDAco1Fbb01w%40mail.gmail.com





Re: Disable WAL logging to speed up data loading

2021-03-19 Thread Stephen Frost
Greetings,

* tsunakawa.ta...@fujitsu.com (tsunakawa.ta...@fujitsu.com) wrote:
> From: David Steele 
> > After reading through the thread (but not reading the patch) I am -1 on
> > this proposal.
> > 
> > The feature seems ripe for abuse and misunderstanding, and as has been
> > noted in the thread, there are a variety of alternatives that can
> > provide a similar effect.
> > 
> > It doesn't help that at several points along the way new WAL records
> > have been found that still need to be included even when wal_level =
> > none. It's not clear to me how we know when we have found them all.
> > 
> > The patch is marked Ready for Committer but as far as I can see there
> > are no committers in favor of it and quite a few who are not.
> 
> I can understand that people are worried about not having WAL.  But as far as 
> I remember, I'm afraid those concerns were emotional, not logical, i.e., 
> something like "something may happen.".  Regarding concrete concerns that 
> Stephen-san, Magnus-san, Horiguchi-san, Sawada-san and others raised, 
> Osumi-san addressed them based on their advice and review, both in this 
> thread and other threads.
> 
> I also understand we want to value people's emotion for worry-free 
> PostgreSQL.  At the same time, I'd like the emotion understood that we want 
> Postgres to have this convenient, easy-to-use feature.  MySQL recently 
> introduced this feature.  Why can't Postgres do it?

I don't see there being anything particularly 'emotional' around
suggesting that we already have multiple ways to avoid WAL writes when
populating tables and suggesting that we work to improve those instead
of adding yet another WAL level (note: we've been removing them of late,
for the good reason that having more is confusing and not helpful).

> > Perhaps it would be better to look at some of the more targeted
> > approaches mentioned in the thread and see if any of them can be
> > used/improved to achieve the desired result?
> 
> Other methods are not as easy-to-use, and more complex to implement.

The argument here seems to stem from the idea that issuing a 'TRUNCATE'
inside the transaction before starting the 'COPY' command is 'too hard'.
That's clearly a subjective question and perhaps it's my emotional
response to that which is at issue, but I don't think I'm alone in the
feeling that just isn't enough of a justification for adding a new WAL
level.

> What kind of destiny does this type of feature end up in?

I could be wrong and perhaps opinions will change in the future, but it
really doesn't seem like there's much support for adding a new WAL
level just to avoid doing the TRUNCATE.  Appealing to what other
database systems support can be helpful in some cases but that's usually
when we're looking at a new capability which multiple other systems have
implemented.  This isn't actually a new capability at all- the WAL level
that you're looking for already exists and already gives the
optimization you're looking for, as long as you issue a TRUNCATE at the
start of the transaction.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: default result formats setting

2021-03-19 Thread Emre Hasegeli
I applied the patch, tried running the test and got the following:

rm -rf 
'/Users/hasegeli/Developer/postgres/src/test/modules/libpq_extended'/tmp_check
/bin/sh ../../../../config/install-sh -c -d 
'/Users/hasegeli/Developer/postgres/src/test/modules/libpq_extended'/tmp_check
cd . && 
TESTDIR='/Users/hasegeli/Developer/postgres/src/test/modules/libpq_extended' 
PATH="/Users/hasegeli/Developer/postgres/tmp_install/Users/hasegeli/.local/pgsql/bin:$PATH"
 
DYLD_LIBRARY_PATH="/Users/hasegeli/Developer/postgres/tmp_install/Users/hasegeli/.local/pgsql/lib"
  PGPORT='65432' 
PG_REGRESS='/Users/hasegeli/Developer/postgres/src/test/modules/libpq_extended/../../../../src/test/regress/pg_regress'
 REGRESS_SHLIB='/Users/hasegeli/Developer/postgres/src/test/regress/regress.so' 
/usr/bin/prove -I ../../../../src/test/perl/ -I .  t/*.pl
t/001_result_format.pl .. # Looks like your test exited with 2 before it could 
output anything.
t/001_result_format.pl .. Dubious, test returned 2 (wstat 512, 0x200)
Failed 4/4 subtests

Test Summary Report
---
t/001_result_format.pl (Wstat: 512 Tests: 0 Failed: 0)
  Non-zero exit status: 2
  Parse errors: Bad plan.  You planned 4 tests but ran 0.

Re: truncating timestamps on arbitrary intervals

2021-03-19 Thread David Steele

On 1/18/21 3:54 PM, John Naylor wrote:
On Mon, Nov 23, 2020 at 1:44 PM John Naylor <john.nay...@enterprisedb.com> wrote:

 >
 > On Thu, Nov 12, 2020 at 9:56 AM Peter Eisentraut 
> wrote:

 > > - After reading the discussion a few times, I'm not so sure anymore
 > > whether making this a cousin of date_trunc is the right way to go.  As
 > > you mentioned, there are some behaviors specific to date_trunc that
 > > don't really make sense in date_trunc_interval, and maybe we'll have
 > > more of those.

For v10, I simplified the behavior by decoupling the behavior from 
date_trunc() and putting in some restrictions as discussed earlier. It's 
much simpler now. It could be argued that it goes too far in that 
direction, but it's easy to reason about and we can put back some 
features as we see fit.


Peter, thoughts on the new patch?

Regards,
--
-David
da...@pgmasters.net




Re: support for MERGE

2021-03-19 Thread David Steele

On 3/19/21 10:44 AM, Alvaro Herrera wrote:

On 2021-Mar-19, David Steele wrote:


Since it does not appear this is being worked on for PG14 perhaps we should
close it in this CF and then reopen it when a patch is available?


No, please move it to the next CF.  Thanks


Ugh. I clicked wrong and moved it to the 2021-09 CF. I'm not sure if 
there's a way to fix it without creating a new entry.


Regards,
--
-David
da...@pgmasters.net




Re: support for MERGE

2021-03-19 Thread Alvaro Herrera
On 2021-Mar-19, David Steele wrote:

> Since it does not appear this is being worked on for PG14 perhaps we should
> close it in this CF and then reopen it when a patch is available?

No, please move it to the next CF.  Thanks

-- 
Álvaro Herrera   Valdivia, Chile




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-19 Thread James Coleman
On Tue, Mar 9, 2021 at 9:27 AM Fujii Masao  wrote:
>
>
>
> On 2021/03/09 23:19, James Coleman wrote:
> > On Tue, Mar 9, 2021 at 9:17 AM Alvaro Herrera  
> > wrote:
> >>
> >> On 2021-Mar-09, James Coleman wrote:
> >>
> >>> Yes, I think they both agreed on the "DETAIL:  Hot standby mode is
> >>> disabled." message, but that alternative meant not needing to add any
> >>> new signals and pm states, correct?
> >>
> >> Ah, I see!  I was thinking that you still needed the state and signal in
> >> order to print the correct message in hot-standby mode, but that's
> >> (obviously!) wrong.  So you're right that no signal/state are needed.
> >
> > Cool. And yes, I'm planning to update the patch soon.
>
> +1. Thanks!

Here's an updated patch; I think I've gotten what Alvaro suggested.

Thanks,
James


v4-0001-Improve-standby-connection-denied-error-message.patch
Description: Binary data


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-19 Thread Julien Rouhaud
On Fri, Mar 19, 2021 at 02:54:16PM +0100, Hannu Krosing wrote:
> On Fri, Mar 19, 2021 at 2:29 PM Bruce Momjian  wrote:
> >
> > OK, that makes perfect sense.  I think the best solution is to document
> > that compute_query_id just controls the built-in computation of the
> > query id, and that extensions can also compute it if this is off, and
> > pg_stat_activity and log_line_prefix will display built-in or extension
> > computed query ids.
> >
> > It might be interesting someday to check if the hook changed a
> > pre-computed query id and warn the user in the logs, but that could
> > cause more log-spam problems than help.
> 
> The log-spam could be mitigated by logging it just once per connection
> the first time it is overridden

Yes, but it might still generate a significant amount of additional lines.

If extension authors follow the recommendations and only calculate a queryid
when compute_query_id is off, it should be easy to check that you have
everything set up properly.

> Also, we could ask the extensions to expose the "method name" in a read-only 
> GUC
> 
> so one can do
> 
> SHOW compute_query_id_method;
> 
> and get the name of method use
> 
> compute_query_id_method
> 
> builtin
> 
> And it may even dynamically change to indicate the overriding of builtin
> 
> compute_query_id_method
> ---
> fancy_compute_query_id (overrides builtin)

This could be nice, but I'm not sure that it would work well if someone
installs multiple extensions that calculate a queryid (which would be silly but
still), or load another one at runtime.
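
As a rough illustration of the read-only GUC idea quoted above, an extension
could register something like the following at load time.  This is only a
sketch: the fancy_queryid.method name and its value are invented for the
example, while DefineCustomStringVariable() and PGC_INTERNAL are the existing
server facilities assumed to carry it.

#include "postgres.h"
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

static char *queryid_method = NULL;

void
_PG_init(void)
{
	/*
	 * Read-only (PGC_INTERNAL) custom GUC: "SHOW fancy_queryid.method"
	 * reports which queryid computation this module provides, but the
	 * value cannot be changed by users.
	 */
	DefineCustomStringVariable("fancy_queryid.method",
							   "Queryid computation method provided by this module.",
							   NULL,
							   &queryid_method,
							   "fancy_compute_query_id",
							   PGC_INTERNAL,
							   0,
							   NULL, NULL, NULL);
}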




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-19 Thread Julien Rouhaud
On Fri, Mar 19, 2021 at 09:29:06AM -0400, Bruce Momjian wrote:
> On Fri, Mar 19, 2021 at 11:16:50AM +0800, Julien Rouhaud wrote:
> > Now that I'm back on the code I remember why I did it this way.  It's
> > unfortunately not really possible to make things work this way.
> > 
> > pg_stat_statements' post_parse_analyze_hook relies on a queryid already 
> > being
> > computed, as it's where we know where the constants are recorded.  It means:
> > 
> > - we have to call post_parse_analyze_hook *after* doing core queryid
> >   calculation
> > - if users want to use a third party module to calculate a queryid, they'll
> >   have to make sure that the module's post_parse_analyze_hook is called
> >   *before* pg_stat_statements' one.
> > - even if they do so, they'll still have to pay the price of core queryid
> >   calculation
> 
> OK, that makes perfect sense.  I think the best solution is to document
> that compute_query_id just controls the built-in computation of the
> query id, and that extensions can also compute it if this is off, and
> pg_stat_activity and log_line_prefix will display built-in or extension
> computed query ids.

So the last version of the patch should implement that behavior, right?  It's
just missing some explicit guidance that third-party extensions should only
calculate a queryid if compute_query_id is off.

> It might be interesting someday to check if the hook changed a
> pre-computed query id and warn the user in the logs, but that could
> cause more log-spam problems than help.  I am a little worried that
> someone might have compute_query_id enabled and then install an
> extension that overwrites it, but we will just have to document this
> issue.  Hopefully extensions will be clear that they are computing their
> own query id.

I agree.  And hopefully they will split the queryid calculation from the rest
of the extension so that users can use the combination they want.
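
For illustration only, here is a minimal sketch (not taken from the patch
series) of how such a split-out queryid module could follow that guidance:
hook post_parse_analyze_hook and fill in a queryid only when nothing has set
one already, which, since core computes its queryid before the hook runs, is
equivalent to "only when the built-in computation is off and no other module
got there first".  my_compute_queryid() is an invented stand-in for whatever
fingerprinting the module would really do, and the two-argument hook signature
assumed here is the pre-existing one, which may change as this work evolves.

#include "postgres.h"
#include "fmgr.h"
#include "parser/analyze.h"
#include "common/hashfn.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;

/* Trivial stand-in: hash the query text; a real module would jumble the tree. */
static uint64
my_compute_queryid(ParseState *pstate, Query *query)
{
	const char *sql = pstate->p_sourcetext ? pstate->p_sourcetext : "";

	return hash_bytes_extended((const unsigned char *) sql, strlen(sql), 0);
}

static void
my_post_parse_analyze(ParseState *pstate, Query *query)
{
	if (prev_post_parse_analyze_hook)
		prev_post_parse_analyze_hook(pstate, query);

	/* Leave any queryid computed by core (or another extension) alone. */
	if (query->queryId == UINT64CONST(0))
		query->queryId = my_compute_queryid(pstate, query);
}

void
_PG_init(void)
{
	prev_post_parse_analyze_hook = post_parse_analyze_hook;
	post_parse_analyze_hook = my_post_parse_analyze;
}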




Re: cleanup temporary files after crash

2021-03-19 Thread Euler Taveira
On Fri, Mar 19, 2021, at 12:23 AM, Tom Lane wrote:
> [ reads code... ]
> ... no, I think the problem is the test is still full of race conditions.
> 
> In the first place, waiting till you see the output of a SELECT that's
> before the useful query is not enough to guarantee that the useful query
> is done, or even started.  That's broken on both sessions.
That's an ugly and fragile mechanism to work around the fact that pump_until
only reacts after the query returns.

> In the second place, even if the second INSERT has started, you don't know
> that it's reached the point of blocking on the tuple conflict yet.
> Which in turn means that it might not've filled its tuplestore yet.
> 
> In short, this script is designed to ensure that the test query can't
> finish too soon, but that proves nothing about whether the test query
> has even started.  And since you also haven't really guaranteed that the
> intended-to-be-blocking query is done, I don't think that the first
> condition really holds either.
In order to avoid the race condition between filling the tuplestore and killing
the backend, we could use a poll_query_until() before SIGKILL to wait for the
temporary file to be created. Do you think this modification will make the
test more stable?


--
Euler Taveira
EDB   https://www.enterprisedb.com/
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 8044849b73..41a91ebd06 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -100,6 +100,11 @@ ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
 $killme_stdout = '';
 $killme_stderr = '';
 
+# Wait till a temporary file is created
+$node->poll_query_until(
+	'postgres',
+	'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)', '1');
+
 # Kill with SIGKILL
 my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
 is($ret, 0, 'killed process with KILL');
@@ -168,6 +173,11 @@ ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
 $killme_stdout = '';
 $killme_stderr = '';
 
+# Wait till a temporary file is created
+$node->poll_query_until(
+	'postgres',
+	'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)', '1');
+
 # Kill with SIGKILL
 $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
 is($ret, 0, 'killed process with KILL');


Re: [PATCH] pg_stat_statements dealloc field ignores manual deallocation

2021-03-19 Thread Julien Rouhaud
On Fri, Mar 19, 2021 at 05:08:45PM +0300, Андрей Зубков wrote:
> 
> Since 2e0fedf, a view pg_stat_statements_info is available in the
> pg_stat_statements extension. It has a dealloc field that should be a
> counter of deallocation events.
> Right now it accounts only for automatic deallocation events, which happen
> when we need a place for a new statement,

Yes, and that behavior is documented:

dealloc bigint

Total number of times pg_stat_statements entries about the least-executed
statements were deallocated because more distinct statements than
pg_stat_statements.max were observed

> but manual deallocation events
> caused by the pg_stat_statements_reset() function for some subset of
> collected statements are not accounted for.
> My opinion is that a manual deallocation is a deallocation too, and it
> must be accounted for in the dealloc field of the pg_stat_statements_info view.

I disagree.  The point of that field is to help users configure
pg_stat_statements.max, as evictions have a huge overhead in many workloads.

If users remove entries for some reason, we shouldn't give the impression
that pg_stat_statements.max is too low and that it should be increased,
especially since changing it requires a restart.




Re: support for MERGE

2021-03-19 Thread David Steele

On 1/18/21 11:48 AM, Alvaro Herrera wrote:

On 2021-Jan-18, Robert Haas wrote:


On Thu, Dec 31, 2020 at 8:48 AM Alvaro Herrera  wrote:

Here's a rebase of Simon/Pavan's MERGE patch to current sources.  I
cleaned up some minor things in it, but aside from rebasing, it's pretty
much their work (even the commit message is Simon's).


It's my impression that the previous discussion concluded that their
version needed pretty substantial design changes. Is there a plan to
work on that? Or was some of that stuff done already?


I think some of the issues were handled as I adapted the patch to
current sources.  However, the extensive refactoring that had been
recommended in the old threads has not been done.  I have these two
comments mainly:

https://postgr.es/m/CABOikdOeraX0RKojBs0jZxY5SmbQF3GJBP0+Rat=2g+vqsa...@mail.gmail.com
https://postgr.es/m/7168.1547584...@sss.pgh.pa.us

I'll try to get to those, but I have some other patches that I need to
handle first.


Since it does not appear this is being worked on for PG14 perhaps we 
should close it in this CF and then reopen it when a patch is available?


Regards,
--
-David
da...@pgmasters.net



