Re: Parallel vacuum workers prevent the oldest xmin from advancing

2021-10-06 Thread Amit Kapila
On Wed, Oct 6, 2021 at 12:41 PM Masahiko Sawada  wrote:
>
> Hi all,
>
> A customer reported that during parallel index vacuum, the oldest xmin
> doesn't advance. Normally, the calculation of oldest xmin
> (ComputeXidHorizons()) ignores xmin/xid of processes having
> PROC_IN_VACUUM flag in MyProc->statusFlags. But since parallel vacuum
> workers don’t set their statusFlags, the xmin of the parallel vacuum
> worker is considered to calculate the oldest xmin. This issue happens
> from PG13 where the parallel vacuum was introduced. I think it's a
> bug.
>

I agree. Your patch seems to be going in the right direction, but I
haven't tested it yet. Feel free to register it in the next CF so it
doesn't get forgotten.
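
For reference, a minimal sketch of the kind of fix being discussed
(assuming PG14-style statusFlags that are mirrored into the ProcGlobal
array; this is not the actual patch): the parallel vacuum worker would
advertise PROC_IN_VACUUM the same way the leader does, so that
ComputeXidHorizons() ignores its xmin.

    /* In the parallel vacuum worker's initialization (hypothetical): */
    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    MyProc->statusFlags |= PROC_IN_VACUUM;
    ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
    LWLockRelease(ProcArrayLock);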

-- 
With Regards,
Amit Kapila.




Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-10-06 Thread Bharath Rupireddy
On Mon, Sep 27, 2021 at 11:45 PM Mark Dilger
 wrote:
>
> Thanks for looking!  I have pulled together a new patch set which applies 
> cleanly against current master.

Hi Mark, thanks for this work. I'm late to this thread, and please
note that I didn't go through the entire thread, as it is quite long.

I have a question: it looks like the view pg_backend_memory_contexts
and the function pg_log_backend_memory_contexts are superuser-only.
Wouldn't it be a good idea to allow users with pg_monitor or some
other similar role to use these as well? This question may be
unrelated here, but I'm curious to know whether your patch set has a
solution.

Regards,
Bharath Rupireddy.




Re: pgsql: Document XLOG_INCLUDE_XID a little better

2021-10-06 Thread Amit Kapila
On Sat, Oct 2, 2021 at 4:16 PM Dilip Kumar  wrote:
>
> On Fri, Oct 1, 2021 at 6:24 PM Alvaro Herrera  wrote:
> >
> > On 2021-Oct-01, Amit Kapila wrote:
>
> > I think a straight standalone variable (probably a static boolean in
> > xloginsert.c) might be less confusing.
>
> I have written two patches: Approach1 uses a static boolean as you
> described, and Approach2 uses a local variable in XLogAssembleRecord
> as described by Amit. I have attached both of them for your reference.
> IMHO, either of these approaches looks cleaner.
>

I have tried to improve some comments and a variable name in the
Approach-2 (local variable) patch, and also reverted one of the
comments introduced by commit ade24dab97. I am fine if we decide to go
with Approach-1 as well, but personally I would prefer to keep the
code consistent with the nearby code.

Let me know what you think of the attached.

With Regards,
Amit Kapila.


v2-0001-Replace-XLOG_INCLUDE_XID-flag-with-a-more-localiz.patch
Description: Binary data


Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2021-10-06 Thread Bharath Rupireddy
On Wed, Oct 6, 2021 at 10:26 PM Jeremy Schneider
 wrote:
>
> On 10/5/21 17:43, Bruce Momjian wrote:
> > On Tue, Oct  5, 2021 at 03:30:07PM -0700, Jeremy Schneider wrote:
> >> Specifically exposing pg_waldump functionality in SQL could speed up
> >> finding bugs in the PG logical replication code. We found and fixed a
> >> few over this past year, but there are more lurking out there.
> >
> > Uh, why is pg_waldump more important than other command line tool
> > information?
>
> Going down the list of all other utilities in src/bin:
>
> * pg_amcheck, pg_config, pg_controldata: already available in SQL
> * psql, pgbench, pg_dump: already available as client-side apps
> * initdb, pg_archivecleanup, pg_basebackup, pg_checksums, pg_ctl,
> pg_resetwal, pg_rewind, pg_upgrade, pg_verifybackup: can't think of any
> possible use case outside server OS access, most of these are too low
> level and don't even make sense in SQL
> * pg_test_fsync, pg_test_timing: marginally interesting ideas in SQL,
> don't feel any deep interest myself
>
> Speaking selfishly, there are a few reasons I would be specifically
> interested in pg_waldump (the only remaining one on the list).

Thanks, Jeremy, for the analysis.

> First, to better support existing features around logical replication
> and decoding.
>
> In particular, it seems inconsistent to me that all the replication
> management SQL functions take LSNs as arguments - and yet there's no
> SQL-based way to find the LSNs that you are supposed to pass into these
> functions.
>
> https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-REPLICATION
>
> Over the past few years I've been pulled in to help several large PG
> users who ran into these bugs, and it's very painful - because the only
> real remediation is to drop and recreate the replication slot, which
> means either re-copying all the data to the downstream system or
> figuring out a way to resync it. Some PG users have 3rd party tools like
> HVR which can do row-by-row resync (IIUC), but no matter how you slice
> it, we're talking about a lot of pain for people replicating large data
> sets between multiple systems. In most cases, the only/best option even
> with very large tables is to just make a fresh copy of the data - which
> can translate to a business outage of hours or even days.
>
> My favorite example is the SQL function pg_replication_slot_advance() -
> this could really help PG users find less painful solutions to broken
> decoding, however it's not really possible to /know/ an LSN to advance
> to without inspecting WAL. ISTM there's a strong use case here for a SQL
> interface on WAL inspection.
>
> Second: debugging and troubleshooting logical replication and decoding bugs.
>
> I helped track down a few logical replication bugs and get fixed into
> code at postgresql.org this past year. (But I give credit to others who
> are much better at C than I am, and who did a lot more work than I did
> on these bugs!)
>
> Logical decoding bugs are some of the hardest to fix - because all you
> have is a WAL stream, but you don't know the SQL or workload patterns or
> PG code paths which created that WAL stream.
>
> Dumping the WAL - knowing which objects and which types of operations
> are involved and stats like number of changes or number of
> subtransactions - this helps identify which transaction and what SQL in
> the application triggered the failure, which can help find short-term
> workarounds. Businesses need those short-term workarounds so they can
> keep going while we work on finding and fixing bugs, which can take some
> time. This is another place where I think a SQL interface to WAL would
> be helpful to PG users. Especially the ability to filter and trace a
> single transaction through a large number of WAL files in the directory.
>
> Third: statistics on WAL - especially full page writes. Giving users the
> full power of SQL allows much more sophisticated analysis of the WAL
> records. Personally, I'd probably find myself importing all the WAL
> stats into the DB anyway because SQL is my preferred data manipulation
> language.

Just to add to the above points, with the new extension pg_walinspect
we will have the following advantages:
1) Usability - SQL-callable functions will be easier to use for
users/admins/developers.
2) Access control - we can provide better access control for the WAL data/stats.
3) Emitting the actual WAL data (as bytea) and stats via SQL-callable
functions will help to analyze and answer questions like how much WAL
data is being generated in the system, what kind of WAL data it is,
how many FPWs are happening, and so on. Jeremy has already given more
realistic use cases.
4) I came across this - there's a similar capability in SQL Server:
https://www.mssqltips.com/sqlservertip/3076/how-to-read-the-sql-server-database-transaction-log/

> >> Having the extension in core is actually the only way to avoid
> >> duplicated effort, by having shared code which 

Re: strange case of "if ((a & b))"

2021-10-06 Thread Michael Paquier
On Thu, Oct 07, 2021 at 01:27:34PM +0900, Masahiko Sawada wrote:
> On Thu, Oct 7, 2021 at 11:44 AM Justin Pryzby  wrote:
>> I'm not quite sure how you managed to search for it - well done ;)
> 
> I could not find the recent thread, though :)

Hm.  It looks like there are more occurrences of "false : true" that
could be cleaned up, like in nodeResult.c or tablecmds.c.
--
Michael




RE: Transactions involving multiple postgres foreign servers, take 2

2021-10-06 Thread k.jami...@fujitsu.com
Hi Fujii-san and Sawada-san,

Thank you very much for your replies.

> >> I noticed that this thread and its set of patches have been marked with
> "Returned with Feedback" by yourself.
> >> I find the feature (atomic commit for foreign transactions) very
> >> useful and it will pave the road for having a distributed transaction
> management in Postgres.
> >> Although we have not arrived at consensus at which approach is best,
> >> there were significant reviews and major patch changes in the past 2 years.
> >> By any chance, do you have any plans to continue this from where you left 
> >> off?
> >
> > As I could not reply to the review comments from Fujii-san for almost
> > three months, I don't have enough time to move this project forward at
> > least for now. That's why I marked this patch as RWF. I’d like to
> > continue working on this project in my spare time but I know this is
> > not a project that can be completed by using only my spare time. If
> > someone wants to work on this project, I’d appreciate it and am happy
> > to help.
> 
> Probably it's time to rethink the approach. The patch introduces a foreign
> transaction manager into PostgreSQL core, but as far as I've reviewed the
> patch, its changes look like overkill and too complicated.
> This seems to be one of the reasons why we have not yet been able to commit
> the feature even after several years.
> 
> Another concern about the approach of the patch is that it needs to change a
> backend so that it additionally waits for replication during the commit phase
> before executing PREPARE TRANSACTION on foreign servers, which would further
> decrease performance during the commit phase.
> 
> So I wonder if it's worth revisiting the original approach, i.e., add the 
> atomic
> commit into postgres_fdw. One disadvantage of this is that it supports atomic
> commit only between foreign PostgreSQL servers, not other various data
> resources like MySQL.
> But I'm not sure if we really want to do atomic commit between various FDWs.
> Maybe supporting only postgres_fdw is enough for most users. Thoughts?

The intention of Sawada-san's patch is grand, and it would be very
helpful because it accommodates possible future support of atomic
commit for various types of FDWs. However, it's been difficult to
reach agreement, as other reviewers have also pointed out concerns
about commit performance. Another point is how it should work when we
also implement atomic visibility (which is another topic for
distributed transactions, but worth considering).
That said, if we're going to initially support it in postgres_fdw,
which is simpler than the latest patches, we need to ensure that
abnormalities and errors are properly handled, and prove that commit
performance can be improved, e.g. by committing not serially but in
parallel where possible.
And if possible, although not necessary during the first step, it may
put the other reviewers at ease if we can also sketch how to implement
atomic visibility on postgres_fdw.
Thoughts?

Regards,
Kirk Jamison


Re: strange case of "if ((a & b))"

2021-10-06 Thread Masahiko Sawada
On Thu, Oct 7, 2021 at 11:44 AM Justin Pryzby  wrote:
>
> On Thu, Oct 07, 2021 at 11:18:24AM +0900, Masahiko Sawada wrote:
> > On Sat, Sep 11, 2021 at 2:44 PM Michael Paquier  wrote:
> > >
> > > On Thu, Sep 09, 2021 at 02:14:50PM +0900, Kyotaro Horiguchi wrote:
> > > > Maybe I'm missing something, but I can see several instances of the
> > > > "eval-bool ? true : false" pattern after fd0625c7a9 that are not in
> > > > the latest 0002.
> > >
> > > Yep.  There are more of these, and I have just looked at some of them
> > > as of the patches proposed.  What was sent looked clean enough to
> > > progress a bit and be done with it.
> >
> > While reading the decode.c I found the extra parentheses and arrived
> > at this thread.
>
> I'm not quite sure how you managed to search for it - well done ;)

I could not find the recent thread, though :)

>
> > The discussion seems to get inactive now but one (0001
> > patch) out of two patches Justin proposed [1] is not committed yet and
> > there seems no CF entry for this item (0002 patch already got
> > committed, fd0625c7a9). 0001 patch can be cleanly applied and looks
> > good to me.
>
> Note that I also included it here:
> https://www.postgresql.org/message-id/20210924215827.gs...@telsasoft.com

Good. Thank you for the information!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Lost logs with csvlog redirected to stderr under WIN32 service

2021-10-06 Thread Michael Paquier
On Wed, Oct 06, 2021 at 09:33:24PM -0500, Chris Bandy wrote:
> I don't have a Windows machine to test, but this refactor looks good to me.

Thanks for the review!  I did test this on Windows, though only with MSVC builds.

>> +/* Write to CSV log, if enabled */
>> +if ((Log_destination & LOG_DESTINATION_CSVLOG) != 0)
> 
> This was originally "if (Log_destination & LOG_DESTINATION_CSVLOG)" and
> other conditions nearby still lack the "!= 0". Whatever the preferred
> style, the lines touched by this patch should probably do this consistently.

Yeah.  It looks like using a boolean expression here is easier for my
brain.
--
Michael




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2021-10-06 Thread Masahiko Sawada
On Thu, Sep 23, 2021 at 5:44 PM Drouvot, Bertrand  wrote:
>
> Hi,
>
> On 7/29/21 10:25 AM, Masahiko Sawada wrote:
> > Thank you for reporting the issue.
> >
> > I could reproduce this issue by the steps you shared.
>
> Thanks for looking at it!
>
> >
> >> Currently, the system relies on processing Heap2::NEW_CID to keep track of 
> >> catalog modifying (sub)transactions.
> >>
> >> This context is lost if the logical decoding has to restart from a 
> >> Standby::RUNNING_XACTS that is written between the NEW_CID record and its 
> >> parent txn commit.
> >>
> >> If the logical stream restarts from this restart_lsn, then it doesn't have 
> >> the xid responsible for modifying the catalog.
> >>
> > I agree with your analysis. Since we don’t use the commit WAL record to
> > track transactions that have modified system catalogs, if we decode
> > only the commit record of such a transaction, we cannot know that it
> > has modified system catalogs, resulting in subsequent transactions
> > scanning the system catalog with the wrong snapshot.
> Right.
> >
> > With the patch, if the commit WAL record has a XACT_XINFO_HAS_INVALS
> > flag and its xid is included in RUNNING_XACT record written at
> > restart_lsn,  we forcibly add the top XID and its sub XIDs as a
> > committed transaction that has modified system catalogs to the
> > snapshot. I might be missing something about your patch but I have
> > some comments on this approach:
> >
> > 1. Commit WAL record may not have invalidation message for system
> > catalogs  (e.g., when commit record has only invalidation message for
> > relcache) even if it has XACT_XINFO_HAS_INVALS flag.
>
> Right, good point (create policy for example would lead to an
> invalidation for relcache only).
>
> > In this case, the
> > transaction wrongly is added to the snapshot, is that okay?
> This transaction is a committed one, and IIUC the snapshot would be used
> only for catalog visibility, so i don't see any issue to add it in the
> snapshot, what do you think?

It seems to me that it's no problem, since we always mark a
transaction as catalog-changed when decoding XLOG_XACT_INVALIDATIONS
records.

> >
> > 2. We might add a subtransaction XID as a committed transaction that
> > has modified system catalogs even if it actually didn't.
>
> Right, like when needs_timetravel is true.
>
> > As the
> > comment in SnapBuildBuildSnapshot() describes, we track only the
> > transactions that have modified the system catalog and store in the
> > snapshot (in the ‘xip' array). The patch could break that assumption.
> Right. It looks to me that breaking this assumption is not an issue.
>
> IIUC currently the committed ones that are not modifying the catalog are
> not stored "just" because we don't need them.
> > However, I’m really not sure how to deal with this point. We cannot
> > know which subtransaction has actually modified system catalogs by
> > using only the commit WAL record.
> Right, unless we rewrite this patch so that a commit WAL record will
> produce this information.
> >
> > 3. The patch covers only the case where the restart_lsn exactly
> > matches the LSN of RUNNING_XACT.
> Right.
> >   I wonder if there could be a case
> > where the decoding starts at a WAL record other than RUNNING_XACT but
> > the next WAL record is RUNNING_XACT.
>
> Not sure, but could a restart_lsn not be a RUNNING_XACTS?

I guess the decoding always starts from RUNNING_XACTS.
After more thought, I think the basic approach of the proposed patch
is probably a good idea: we add xids whose commit records have
XACT_XINFO_HAS_INVALS to the snapshot. The problem as I see it is that
while decoding a COMMIT record we cannot know which transactions (the
top transaction or its subtransactions) actually did catalog changes.
But given that we always mark a transaction as catalog-changed even if
XLOG_XACT_INVALIDATION has only a relcache invalidation message, that
seems to be no problem. Therefore, in the reported cases, we can
probably add both the top transaction xid and its subtransaction xids
to the snapshot.

Regarding the patch details, I have two comments:

---
+	if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) && last_running)
+	{
+		/* make last_running->xids bsearch()able */
+		qsort(last_running->xids,
+			  last_running->subxcnt + last_running->xcnt,
+			  sizeof(TransactionId), xidComparator);

The patch does qsort() every time the commit record has
XACT_XINFO_HAS_INVALS. IIUC the xids we need to remember are only the
xids recorded in the first replayed XLOG_RUNNING_XACTS, right? If so,
we need to do qsort() only once, can remove an xid from the array once
it gets committed, and can eventually make last_running empty so that
we can skip even TransactionIdInArray().
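
For illustration, a rough sketch of that idea (hypothetical code, not
from the patch; it assumes a last_running_sorted flag and reuses
xidComparator):

    /* Sort once, when the first XLOG_RUNNING_XACTS is replayed. */
    if (!last_running_sorted)
    {
        qsort(last_running->xids,
              last_running->subxcnt + last_running->xcnt,
              sizeof(TransactionId), xidComparator);
        last_running_sorted = true;
    }

    /* Afterwards, membership tests are just a bsearch(). */
    if (bsearch(&xid, last_running->xids,
                last_running->subxcnt + last_running->xcnt,
                sizeof(TransactionId), xidComparator) != NULL)
    {
        /* xid was running at restart_lsn: treat it as catalog-modifying. */
    }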

---
last_running is allocated by malloc() but isn't freed even after
logical decoding finishes.


Another idea to fix this problem would be that before calling
SnapBuildCommitTxn() we create transaction entries in 

Question about client_connection_check_interval

2021-10-06 Thread kuroda.hay...@fujitsu.com
Dear Hackers,

While reading the source code around timeouts and GUCs, I found
strange behavior in client_connection_check_interval.

Currently we do not have an assign_hook for client_connection_check_interval,
which means the timeout will not turn on immediately if a user changes the GUC
from zero to an arbitrary positive integer.
In my understanding, the timeout is enabled only:

* before starting a transaction
* after the CLIENT_CONNECTION_CHECK_TIMEOUT timeout fires

Hence I thought of the following inconvenient scenario:

1. set client_connection_check_interval = 0 in postgresql.conf
2. start a tx
3. SET LOCAL client_connection_check_interval to a non-zero value,
  in order to check the client until the end of the tx
4. the user expects the timeout to fire, but it does not work,
  because enable_timeout_after() will never be executed in the tx

Is this expected behavior?  If so, I think this should be documented.
If not, I think an assign_hook is needed to resolve the problem; a
sketch follows.
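
A minimal sketch of such an assign_hook (hypothetical and untested; it
assumes the hook gets registered for this GUC in guc.c) that re-arms
the timeout when the value changes inside a transaction:

    static void
    assign_client_connection_check_interval(int newval, void *extra)
    {
        /* Arm the timeout immediately if it isn't already running. */
        if (newval > 0 && IsTransactionState() &&
            !get_timeout_active(CLIENT_CONNECTION_CHECK_TIMEOUT))
            enable_timeout_after(CLIENT_CONNECTION_CHECK_TIMEOUT, newval);
    }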

What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: a comment in joinrel.c: compute_partition_bounds()

2021-10-06 Thread Amit Langote
Fujita-san,

On Wed, Oct 6, 2021 at 5:41 PM Etsuro Fujita  wrote:
> On Fri, Sep 24, 2021 at 4:20 PM Etsuro Fujita  wrote:
> > On Fri, Sep 24, 2021 at 3:34 PM Amit Langote  
> > wrote:
> > > I think there's a word missing in the following comment:
> > >
> > > /*
> > >  * See if the partition bounds for inputs are exactly the same, in
> > >  * which case we don't need to work hard: the join rel have the 
> > > same
> > >  * partition bounds as inputs, and the partitions with the same
> > >  * cardinal positions form the pairs.
> > >
> > > ": the join rel have the same..." seems to be missing a "will".
> > >
> > > Attached a patch to fix.
> >
> > Good catch!  Will fix.
>
> Rereading the comment, I think it would be better to add “will” to the
> second part “the partitions with the same cardinal positions form the
> pairs” as well.  Updated patch attached.

No objection from my side.

Thank you.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




RE: Allow escape in application_name

2021-10-06 Thread kuroda.hay...@fujitsu.com
Dear Fujii-san,

Thank you for reviewing!

> + else if (*p == '-' ? isdigit(p[1]) : isdigit(p[0]))
> + {
> + char *endptr;
> + padding = strtol(p, &endptr, 10);
> 
> Maybe isdigit() and strtol() work differently depending on locale setting?
> If so, I'm afraid that the behavior of this new code would be different from
> process_log_prefix_padding().

(Maybe strtol() should be strtoint(). This is my fault... but anyway:)
Are you afraid that these functions may return a non-zero value if the
locale is not "C" and the input contains non-ASCII characters?
I read the POSIX specifications (isdigit: [1], strtol: [2]) and found
that they suggest such functions are affected by locale settings.

For isdigit():
> The isdigit() and isdigit_l() functions shall test whether c is a character 
> of class digit in the current locale, 
> or in the locale represented by locale, respectively; see XBD Locale.

For strtol()
> In other than the C or POSIX locale, additional locale-specific subject 
> sequence forms may be accepted.

I looked for other examples, but they do not seem to consider locale
settings. For example, functions in numutils.c use such functions
without any locale consideration.

I also considered executing setlocale(LC_CTYPE, "C"), but I found the
following comment in pg_locale.c and did not want to violate that policy.

> * Here is how the locale stuff is handled: LC_COLLATE and LC_CTYPE
> * are fixed at CREATE DATABASE time, stored in pg_database, and cannot
> * be changed. Thus, the effects of strcoll(), strxfrm(), isupper(),
> * toupper(), etc. are always in the same fixed locale.

But isdigit() and strtol() cannot accept multi-byte characters, so I
also thought we don't have to worry about some unexpected character
being put into the LC_CTYPE digit category.

So now we can choose from the following:

* ignore such differences and use isdigit() and strtol()
* give up on using them and implement two static support functions (a
  sketch follows)
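
A sketch of what the second option might look like (hypothetical
helper names; ASCII-only, so it behaves the same in every locale):

    static bool
    ascii_isdigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    static int
    ascii_atoi(const char *p, const char **endptr)
    {
        int         val = 0;

        while (ascii_isdigit(*p))
            val = val * 10 + (*p++ - '0');
        *endptr = p;
        return val;
    }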

What do you think?  Does anyone have other knowledge about locales?

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/isdigit.html
[2]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/strtol.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: strange case of "if ((a & b))"

2021-10-06 Thread Justin Pryzby
On Thu, Oct 07, 2021 at 11:18:24AM +0900, Masahiko Sawada wrote:
> On Sat, Sep 11, 2021 at 2:44 PM Michael Paquier  wrote:
> >
> > On Thu, Sep 09, 2021 at 02:14:50PM +0900, Kyotaro Horiguchi wrote:
> > > Maybe I'm missing something, but I can see several instances of the
> > > "eval-bool ? true : false" pattern after fd0625c7a9 that are not in
> > > the latest 0002.
> >
> > Yep.  There are more of these, and I have just looked at some of them
> > as of the patches proposed.  What was sent looked clean enough to
> > progress a bit and be done with it.
> 
> While reading the decode.c I found the extra parentheses and arrived
> at this thread.

I'm not quite sure how you managed to search for it - well done ;)

> The discussion seems to get inactive now but one (0001
> patch) out of two patches Justin proposed [1] is not committed yet and
> there seems no CF entry for this item (0002 patch already got
> committed, fd0625c7a9). 0001 patch can be cleanly applied and looks
> good to me.

Note that I also included it here:
https://www.postgresql.org/message-id/20210924215827.gs...@telsasoft.com

Michael seems to prefer writing (... != 0) in more cases than other
people, so he didn't care for that patch.
https://www.postgresql.org/message-id/577206.1620628...@sss.pgh.pa.us

-- 
Justin




Re: Lost logs with csvlog redirected to stderr under WIN32 service

2021-10-06 Thread Chris Bandy
On 10/6/21 12:10 AM, Michael Paquier wrote:
> I have thought about various ways to
> fix that, and finished with a solution where we handle csvlog first,
> and fallback to stderr after so as there is only one code path for
> stderr, as of the attached.  This reduces a bit the confusion around
> the handling of the stderr data that gets free()'d in more code paths
> than really needed.

I don't have a Windows machine to test, but this refactor looks good to me.

> + /* Write to CSV log, if enabled */
> + if ((Log_destination & LOG_DESTINATION_CSVLOG) != 0)

This was originally "if (Log_destination & LOG_DESTINATION_CSVLOG)" and
other conditions nearby still lack the "!= 0". Whatever the preferred
style, the lines touched by this patch should probably do this consistently.

-- Chris




Re: strange case of "if ((a & b))"

2021-10-06 Thread Masahiko Sawada
On Sat, Sep 11, 2021 at 2:44 PM Michael Paquier  wrote:
>
> On Thu, Sep 09, 2021 at 02:14:50PM +0900, Kyotaro Horiguchi wrote:
> > Maybe I'm missing something, but I can see several instances of the
> > "eval-bool ? true : false" pattern after fd0625c7a9 that are not in
> > the latest 0002.
>
> Yep.  There are more of these, and I have just looked at some of them
> as of the patches proposed.  What was sent looked clean enough to
> progress a bit and be done with it.

While reading decode.c I found the extra parentheses and arrived at
this thread. The discussion seems to have gone inactive now, but one
(the 0001 patch) of the two patches Justin proposed [1] is not
committed yet, and there seems to be no CF entry for this item (the
0002 patch already got committed as fd0625c7a9). The 0001 patch
applies cleanly and looks good to me.

Also, regarding the "x ? true : false" pattern where x is guaranteed
to yield a boolean, I found examples other than the ones Horiguchi-san
mentioned [2]. I've attached a patch to remove them.
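
For illustration, the kind of simplification involved (a made-up
example, not taken from the patch):

    /* before: redundant ternary on an already-boolean expression */
    bool        found = ((flags & MY_FLAG) != 0) ? true : false;

    /* after */
    bool        found = (flags & MY_FLAG) != 0;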

Regards,

[1] https://www.postgresql.org/message-id/20210906001110.GF26465%40telsasoft.com
[2] 
https://www.postgresql.org/message-id/20210909.141450.11969674682374713.horikyota.ntt%40gmail.com

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


remove_unnecessary_ternary_operations.patch
Description: Binary data


Re: Running tests under valgrind is getting slower at an alarming pace

2021-10-06 Thread Andres Freund
Hi,

On 2021-10-06 09:47:54 -0700, Andres Freund wrote:
> I'll also try to figure out how to print a bit more detail about timing for each tap
> test. Looks like I need to figure out how to pass PROVE_FLAGS='--timer' through
> the buildfarm. Shouldn't be too hard.

Turns out that the buildfarm already adds --timer. I added -j4 to allow for
some concurrency in tap tests, but unfortunately my animals fell over after
that (thanks Michael for noticing).

Looks like the buildfarm client code isn't careful enough about quoting PROVE_FLAGS?

my $pflags = "PROVE_FLAGS=--timer";
if (exists $ENV{PROVE_FLAGS})
{
$pflags =
  $ENV{PROVE_FLAGS}
  ? "PROVE_FLAGS=$ENV{PROVE_FLAGS}"
  : "";
}

@makeout =
  run_log("cd $dir && $make NO_LOCALE=1 $pflags $instflags $taptarget");

Which doesn't work if pflags ends up as '-j4 --timer' or such...

Greetings,

Andres Freund




Re: wrapping CF 2021-09

2021-10-06 Thread Michael Paquier
On Wed, Oct 06, 2021 at 10:55:53AM +0200, Daniel Gustafsson wrote:
> Indeed, thanks for all your work this CF!

+1.  Thanks, Jaime!
--
Michael




Re: ALTER INDEX .. RENAME allows to rename tables/views as well

2021-10-06 Thread Michael Paquier
On Wed, Oct 06, 2021 at 06:43:25PM -0400, Tom Lane wrote:
> ... as, no doubt, are a lot of applications that this will gratuitously
> break.  We've long had a policy that ALTER TABLE will work on relations
> that aren't tables, so long as the requested operation is sensible.

Yeah, that was my first thought after seeing this thread.  There is a
risk in breaking something that was working previously.  Perhaps it
was just working by accident, but that could be surprising if an
application relied on the existing behavior.
--
Michael




Re: More business with $Test::Builder::Level in the TAP tests

2021-10-06 Thread Michael Paquier
On Wed, Oct 06, 2021 at 07:33:22AM -0400, Andrew Dunstan wrote:
> We should probably state a requirement for this somewhere. Maybe in
> src/test/perl/README. AIUI, the general rule is that any subroutine that
> directly or indirectly calls ok() and friends should increase the level.
> Such subroutines that don't increase it should probably contain a
> comment stating why, so we can know in future that it's not just an
> oversight.

That makes sense.  How about something like this after the part about
Test::More::like and qr// in the section about writing tests?  Here it
is:
+Test::Builder::Level controls how far up in the call stack a test will look
+at when reporting a failure.  This should be incremented by any subroutine
+calling test routines from Test::More, like ok() or is():
+
+local $Test::Builder::Level = $Test::Builder::Level + 1;
--
Michael
diff --git a/src/bin/pg_archivecleanup/t/010_pg_archivecleanup.pl b/src/bin/pg_archivecleanup/t/010_pg_archivecleanup.pl
index 8134c2a62e..8d689b9601 100644
--- a/src/bin/pg_archivecleanup/t/010_pg_archivecleanup.pl
+++ b/src/bin/pg_archivecleanup/t/010_pg_archivecleanup.pl
@@ -72,6 +72,8 @@ command_fails_like(
 
 sub run_check
 {
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
 	my ($suffix, $test_name) = @_;
 
 	create_files();
diff --git a/src/bin/pg_ctl/t/004_logrotate.pl b/src/bin/pg_ctl/t/004_logrotate.pl
index aa0d64a4f7..13e91f3bc9 100644
--- a/src/bin/pg_ctl/t/004_logrotate.pl
+++ b/src/bin/pg_ctl/t/004_logrotate.pl
@@ -31,6 +31,8 @@ sub fetch_file_name
 # Check for a pattern in the logs associated to one format.
 sub check_log_pattern
 {
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
 	my $format   = shift;
 	my $logfiles = shift;
 	my $pattern  = shift;
diff --git a/src/bin/pg_verifybackup/t/005_bad_manifest.pl b/src/bin/pg_verifybackup/t/005_bad_manifest.pl
index 4f5b8f5a49..1420cfb352 100644
--- a/src/bin/pg_verifybackup/t/005_bad_manifest.pl
+++ b/src/bin/pg_verifybackup/t/005_bad_manifest.pl
@@ -176,6 +176,8 @@ EOM
 
 sub test_parse_error
 {
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
 	my ($test_name, $manifest_contents) = @_;
 
 	test_bad_manifest($test_name,
@@ -186,6 +188,8 @@ sub test_parse_error
 
 sub test_fatal_error
 {
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
 	my ($test_name, $manifest_contents) = @_;
 
 	test_bad_manifest($test_name, qr/fatal: $test_name/, $manifest_contents);
@@ -194,6 +198,8 @@ sub test_fatal_error
 
 sub test_bad_manifest
 {
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
 	my ($test_name, $regexp, $manifest_contents) = @_;
 
 	open(my $fh, '>', "$tempdir/backup_manifest") || die "open: $!";
diff --git a/src/bin/psql/t/010_tab_completion.pl b/src/bin/psql/t/010_tab_completion.pl
index 8695d22545..dbca56afad 100644
--- a/src/bin/psql/t/010_tab_completion.pl
+++ b/src/bin/psql/t/010_tab_completion.pl
@@ -127,6 +127,8 @@ sub check_completion
 # (won't work if we are inside a string literal!)
 sub clear_query
 {
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
 	check_completion("\\r\n", qr/postgres=# /, "\\r works");
 	return;
 }
@@ -136,6 +138,8 @@ sub clear_query
 # than clear_query because we lose evidence in the history file)
 sub clear_line
 {
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
 	check_completion("\025\n", qr/postgres=# /, "control-U works");
 	return;
 }
diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index c484237d07..968be3952f 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -221,6 +221,8 @@ sub test_access
 # As above, but test for an arbitrary query result.
 sub test_query
 {
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
 	my ($node, $role, $query, $expected, $gssencmode, $test_name) = @_;
 
 	# need to connect over TCP/IP for Kerberos
diff --git a/src/test/perl/README b/src/test/perl/README
index f04b2a2ea4..45de7661ff 100644
--- a/src/test/perl/README
+++ b/src/test/perl/README
@@ -61,9 +61,16 @@ Test::More::like entails use of the qr// operator.  Avoid Perl 5.8.8 bug
 #39185 by not using the "$" regular expression metacharacter in qr// when also
 using the "/m" modifier.  Instead of "$", use "\n" or "(?=\n|\z)".
 
-Read the Test::More documentation for more on how to write tests:
+Test::Builder::Level controls how far up in the call stack a test will look
+at when reporting a failure.  This should be incremented by any subroutine
+calling test routines from Test::More, like ok() or is():
+
+local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+Read the documentation for more on how to write tests:
 
 perldoc Test::More
+perldoc Test::Builder
 
 For available PostgreSQL-specific test methods and some example tests read the
 perldoc for the test modules, e.g.:
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index ac581c1c07..9916a36012 100644
--- 

Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Peter Geoghegan
On Wed, Oct 6, 2021 at 2:28 PM Pavel Borisov  wrote:
>> It is the most consistent with the general design of the system, for
>> reasons that are pretty deeply baked into the system. I'm reminded of
>> the fact that REINDEX CONCURRENTLY's completion became blocked due to
>> similar trepidations. Understandably so.
>
>
> I may be mistaken, but I recall that all index builds started during
> some other (long) index build do not finish with indexes usable for selects
> until that long index is built. This may or may not be a source of the
> amcheck misbehavior. Just a note on what could possibly be considered.

I may have been unclear. I meant that work on the REINDEX CONCURRENTLY
feature (several years ago) was very difficult. It seemed to challenge
what "Postgres relation" really means.

Various community members had concerns about the definition at the
time. Remember, plain REINDEX actually gets a full AccessExclusiveLock
on the target index relation. This is practically as bad as getting
the same lock on the table itself for most users -- which is very
disruptive indeed. It's much more disruptive than plain CREATE INDEX
-- CREATE INDEX generally only blocks write DML. Whereas REINDEX tends
to block both writes and reads (in practice, barring some narrow cases
with prepared statements that are too confusing to users to be worth
discussing). Which is surprising in itself to users. Why should plain
REINDEX be so different to plain CREATE INDEX?

The weird (but also helpful) thing about the implementation of REINDEX
CONCURRENTLY is that we can have *two* pg_class entries for what the
user thinks of as one index/relation. Having two pg_class entries is
also why plain REINDEX had problems that plain CREATE INDEX never had
-- having only one pg_class entry was actually the true underlying
problem, all along.

Sometimes we have to make a difficult choice between "weird rules but
nice behavior" (as with REINDEX CONCURRENTLY), and "nice rules but
weird behavior" (as with plain REINDEX).

-- 
Peter Geoghegan




Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings

2021-10-06 Thread Fujii Masao




On 2021/10/07 4:58, Jelte Fennema wrote:

Ugh forgot to attach the patch. Here it is.


Thanks for working on this patch!


@@ -4546,10 +4684,21 @@ PQrequestCancel(PGconn *conn)
 
 		return false;
 	}
-
-	r = internal_cancel(&conn->raddr, conn->be_pid, conn->be_key,

Since PQrequestCancel() is marked as deprecated, I don't think that
we need to add the feature to it.


+	if (cancel->pgtcp_user_timeout >= 0) {
+		if (setsockopt(tmpsock, IPPROTO_TCP, TCP_USER_TIMEOUT,
+					   (char *) &cancel->pgtcp_user_timeout,
+					   sizeof(cancel->pgtcp_user_timeout)) < 0) {
+			goto cancel_errReturn;
+		}
+	}

libpq already has setKeepalivesXXX() functions that do almost the same
thing. Isn't it better to modify and reuse them instead of adding
almost the same code, to simplify things?
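
For instance, a shared helper could take a socket rather than a PGconn
(a sketch under that assumption; the name and placement are guesses,
not the actual libpq code):

    static int
    setTCPUserTimeout(pgsocket sock, int timeout)
    {
    #ifdef TCP_USER_TIMEOUT
        if (timeout >= 0 &&
            setsockopt(sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
                       (char *) &timeout, sizeof(timeout)) < 0)
            return 0;           /* caller reports the failure */
    #endif
        return 1;
    }

so that both the connection path and the cancel path can share it.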

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-06 Thread Jaime Casanova
On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote:
> On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
>  wrote:
> >
> > I tried to apply the patch on the master branch head and it's failing
> > with conflicts.
> >
> 
> Thanks, Rushabh, for the quick check. I have attached a rebased version
> for the latest master head commit # f6b5d05ba9a.
> 

Hi,

I got this error while executing "make check" on src/test/recovery:

"""
t/026_overwrite_contrecord.pl  1/3 # poll_query_until timed out executing this query:
# SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Looks like your test exited with 29 just after 1.
t/026_overwrite_contrecord.pl  Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 2/3 subtests 

Test Summary Report
---
t/026_overwrite_contrecord.pl  (Wstat: 7424 Tests: 1 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 3 tests but ran 1.
Files=26, Tests=279, 400 wallclock secs ( 0.27 usr  0.10 sys + 73.78 cusr 59.66 csys = 133.81 CPU)
Result: FAIL
make: *** [Makefile:23: check] Error 1
"""



-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL




Re: ALTER INDEX .. RENAME allows to rename tables/views as well

2021-10-06 Thread Alvaro Herrera
On 2021-Oct-06, Bossart, Nathan wrote:

> On 10/6/21, 3:44 PM, "Tom Lane"  wrote:

> > The situation for "ALTER some-other-relation-kind" is a bit more
> > confused, because some cases throw errors and some don't; but I really
> > doubt that tightening things up here will earn you anything but
> > brickbats.  I *definitely* don't agree with discarding the policy
> > about ALTER TABLE, especially if it's only done for RENAME.
> 
> I think we should at least consider adding this check for ALTER INDEX
> since we choose a different lock level in that case.

I agree -- letting ALTER INDEX process relations that aren't indexes is
dangerous, with its current coding that uses a reduced lock level.  But
maybe erroring out is not necessary; can we instead loop, locking the
object with ShareUpdateExclusive first, assuming it *is* an index, and
if it isn't then we release and restart using the stronger lock this
time?
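
A rough sketch of that loop (hypothetical; it glosses over cache
invalidation and the RangeVarGetRelidExtended() callbacks that a real
patch would need):

    relid = RangeVarGetRelid(rv, ShareUpdateExclusiveLock, false);
    if (get_rel_relkind(relid) != RELKIND_INDEX)
    {
        /* Not an index after all: retry with the stronger lock. */
        UnlockRelationOid(relid, ShareUpdateExclusiveLock);
        relid = RangeVarGetRelid(rv, AccessExclusiveLock, false);
    }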

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.




Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Peter Geoghegan
On Wed, Oct 6, 2021 at 3:47 PM Mark Dilger  wrote:
> I totally agree that the amcheck functions cannot prove the absence of 
> corruption.
>
> I prefer not to even use language about proving the presence of corruption 
> when discussing pg_amcheck.

I agree that it doesn't usually help. But sometimes it is important.

> > This seems to come up a lot because at various points you seem to be
> > concerned about introducing specific imperfections. But it's not like
> > your starting point was ever perfection, or ever could be.

> From the point of view of detecting corruptions, I agree that it never could 
> be.  But I'm not talking about that.  I'm talking about whether pg_amcheck 
> issues all the commands that it is supposed to issue.  If I work for Daddy 
> Warbucks and he gives me 30 classic cars to take to 10 different mechanics, I 
> can do that job perfectly even if the mechanics do less than perfect work.  
> If I leave three cars in the driveway, that's on me.  Likewise, it's not on 
> pg_amcheck if the checking functions can't do perfect work, but it is on 
> pg_amcheck if it doesn't issue all the expected commands.  But later on in 
> this email, it appears we don't have any remaining disagreements about that.  
> Read on

When you say "expected commands", I am entitled to ask: expected by
whom, based on what underlying principle? Similarly, when you suggest
that amcheck should directly deal with !indisvalid indexes itself, it
naturally leads to a tricky discussion of the precise definition of a
relation (in particular in the presence of REINDEX CONCURRENTLY), and
the limits of what is possible with amcheck. That's just where the
discussion has to go.

You cannot say that amcheck must (say) "directly deal with !indisvalid
indexes" without at least saying why. pg_amcheck works by querying
pg_class to find relations to verify. There is no way for that to work
that allows pg_amcheck to completely sidestep these awkward questions
-- just like with pg_dump. There is no safe, neutral starting point
for a program like that.

> > I can
> > always describe a scenario in which amcheck misses real corruption --
> > a scenario which may be very contrived. So the mere fact that some new
> > theoretical possibility of corruption is introduced by some action
> > does not in itself mean much. We're dealing with that constantly, and
> > always will be.
>
> I wish we could stop discussing this.  I really don't think this ticket has 
> anything to do with how well or how poorly or how completely the amcheck 
> functions work.

It's related to !indisvalid indexes. At one point you were concerned
about not having coverage of them in certain scenarios. Which is fine.
But the inevitable direction of that conversation is towards
fundamental definitional questions.

Quite happy to drop all of this now, though.

> Ok, excellent, that was probably the only thing that had me really hung up.  
> I thought you were still asking for pg_amcheck to filter out the 
> --parent-check option when in recovery, but if you're not asking for that, 
> then I might have enough to go on now.

Sorry about that. I realized my mistake (not specifically addressing
pg_is_in_recovery()) after I hit "send", and should have corrected the
record sooner.

> I was using "downgrading" to mean downgrading from bt_index_parent_check() to 
> bt_index_check() when pg_is_in_recovery() is true, but you've clarified that 
> you're not requesting that downgrade, so I think we've now gotten past the 
> last sticking point about that whole issue.

Right. I never meant anything like making a would-be
bt_index_parent_check() call into a bt_index_check() call, just
because of the state of the system (e.g., it's in recovery). That
seems awful, in fact.

> will complain if no tables match, giving the user the opportunity to notice 
> that they spelled "accounting" wrong.  If there happens to be a table named 
> "xyzacountngo", and that matches, too bad.  There isn't any way pg_amcheck 
> can be responsible for that.  But if there is a temporary table named 
> "xyzacountngo" and that gets skipped because it's a temp table, I don't know 
> what feedback the user should get.  That's a thorny user interfaces question, 
> not a corruption checking question, and I don't think you need to weigh in 
> unless you want to.  I'll most likely go with whatever is the simplest to 
> code and/or most similar to what is currently in the tree, because I don't 
> see any knock-down arguments one way or the other.

I agree with you that this is a UI thing, since in any case the temp
table is pretty much "not visible to pg_amcheck". I have no particular
feelings about it.

Thanks
-- 
Peter Geoghegan




Re: Role Self-Administration

2021-10-06 Thread Mark Dilger



> On Oct 6, 2021, at 1:48 PM, Stephen Frost  wrote:
> 
> This specific syntax, including the CASCADE bit, has, at minimum, at least 
> been contemplate by the SQL folks sufficiently to be described in one 
> specific way.  I don’t have a copy of 2016 handy, unfortunately, and so I’m 
> not sure if it’s described that way in a “stable” version of the standard or 
> not (it isn’t defined in the 2006 draft I’ve seen), but ultimately I don’t 
> think we are really talking about entirely net-new syntax here…
> 
> If we were, that would be different and perhaps we would just be guessing at 
> what the standard might do in the future, but I don’t think it’s an open 
> ended question at this point..
> 
> (Even if it was, I have to say that the direction that they’re going in 
> certainly seems consistent to me, anyway, with what’s been done in the past 
> and I think it’d be bad of us to go in a different direction from that since 
> it’d be difficult for us to change it later when the new spec comes out and 
> contradicts what we decided to do..)

Assuming no concept of role ownership exists, but that DROP ROLE bob CASCADE is
implemented in a spec-compliant way: if there is a role "bob" who owns various
objects, what happens when DROP ROLE bob CASCADE is performed?  Do bob's
objects get dropped, do they get orphaned, or do they get assigned to some
other owner?  I would expect that they get dropped, but I'd like to know what
the spec says about it before going any further with this discussion.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: plperl: update ppport.h and fix configure version check

2021-10-06 Thread Andrew Dunstan


On 10/5/21 10:30 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> Seems OK. Note that the Msys DTK perl currawong uses to build with is
>> ancient (5.6.1). That's going to stay as it is until it goes completely
>> out of scope in about 13 months. The perl it builds plperl against is
>> much more modern - 5.16.3.
> That brings up something I was intending to ask you about -- any special
> tips about running the buildfarm script with a different Perl version
> than is used in the PG build itself?  I'm trying to modernize a couple
> of my buildfarm animals to use non-stone-age SSL, but I don't really
> want to move the goalposts on what they're testing.
>
>   


Mostly, if you set the perl you're building against in the path ahead of
the perl you're running with, things just work. A notable exception is TAP
tests, where you have to set PROVE in the config_env to point to the
prove script you're going to use.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: ALTER INDEX .. RENAME allows to rename tables/views as well

2021-10-06 Thread Bossart, Nathan
On 10/6/21, 3:44 PM, "Tom Lane"  wrote:
> "Bossart, Nathan"  writes:
>> Here's a patch that ERRORs if the object type and statement type do
>> not match.  Interestingly, some of the regression tests were relying
>> on this behavior.
>
> ... as, no doubt, are a lot of applications that this will gratuitously
> break.  We've long had a policy that ALTER TABLE will work on relations
> that aren't tables, so long as the requested operation is sensible.

Right.

> The situation for "ALTER some-other-relation-kind" is a bit more
> confused, because some cases throw errors and some don't; but I really
> doubt that tightening things up here will earn you anything but
> brickbats.  I *definitely* don't agree with discarding the policy
> about ALTER TABLE, especially if it's only done for RENAME.

I think we should at least consider adding this check for ALTER INDEX
since we choose a different lock level in that case.

Nathan



Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Mark Dilger



> On Oct 6, 2021, at 3:20 PM, Peter Geoghegan  wrote:
> 
>> I think the disagreements are about something else.
> 
> Informally speaking, you could say that pg_amcheck and amcheck verify
> relations. More formally speaking, amcheck (whether called by
> pg_amcheck or something else) can only prove the presence of
> corruption. They cannot prove its absence. (The amcheck docs have
> always said almost these exact words.)

I totally agree that the amcheck functions cannot prove the absence of 
corruption.

I prefer not to even use language about proving the presence of corruption when 
discussing pg_amcheck.  I have let that slide upthread as a convenient 
short-hand, but I think it doesn't help.  For pg_amcheck to take any view 
whatsoever on whether a btree index is corrupt, it would have to introspect the 
error message that it gets back from bt_index_check().  It doesn't do that, nor 
do I think that it should.  It just prints the contents of the error for the 
user and records that fact and eventually exits with a non-zero exit code.  The 
error might have been something about the command exiting due to the crash of 
another backend, or to do with a deadlock against some other process, or 
whatever, and pg_amcheck has no opinion about whether any of that is to do with 
corruption or not.

> This seems to come up a lot because at various points you seem to be
> concerned about introducing specific imperfections. But it's not like
> your starting point was ever perfection, or ever could be.

From the point of view of detecting corruptions, I agree that it never could 
be.  But I'm not talking about that.  I'm talking about whether pg_amcheck 
issues all the commands that it is supposed to issue.  If I work for Daddy 
Warbucks and he gives me 30 classic cars to take to 10 different mechanics, I 
can do that job perfectly even if the mechanics do less than perfect work.  If 
I leave three cars in the driveway, that's on me.  Likewise, it's not on 
pg_amcheck if the checking functions can't do perfect work, but it is on 
pg_amcheck if it doesn't issue all the expected commands.  But later on in this 
email, it appears we don't have any remaining disagreements about that.  Read 
on

> I can
> always describe a scenario in which amcheck misses real corruption --
> a scenario which may be very contrived. So the mere fact that some new
> theoretical possibility of corruption is introduced by some action
> does not in itself mean much. We're dealing with that constantly, and
> always will be.

I wish we could stop discussing this.  I really don't think this ticket has 
anything to do with how well or how poorly or how completely the amcheck 
functions work.

> Let's suppose I was to "directly fix amcheck + !indisvalid indexes". I
> don't even know what that means -- I honestly don't have a clue.
> You're focussing on one small piece of code in verify_nbtree.c, that
> seems to punt responsibility, but the fact is that there are deeply
> baked-in reasons why it does so. That's a reflection of how many
> things about the system work, in general. Attributing blame to any one
> small snippet of code (code in verify_nbtree.c, or wherever) just
> isn't helpful.

I think we have agreed that pg_amcheck can filter out invalid indexes.  I don't 
have a problem with that.  I admit that I did have a problem with that 
upthread, but it's been a while since I conceded that point, so I'd rather not
have to argue it again.

>> In truth, all the pg_amcheck frontend client can take a view on is whether 
>> it was able to issue all the commands to the backend that it was asked to 
>> issue, and whether any of those commands responded with an error.
> 
> AFAICT all pg_amcheck has to do is follow the amcheck user docs, by
> generalizing from the example SQL query for the B-Tree stuff. And, it
> should separately filter non-temp relations for the heap stuff, for
> the same reasons (exactly the same situation there).

I think we have agreed on that one, too, without me having ever argued it.  I 
posted a patch to filter out the temporary tables already.

>> pg_amcheck -d db1 -d db2 -d db3 --table=mytable
>> 
>> In this case, mytable is a regular table on db1, a temporary table on db2, 
>> and an unlogged table on db3, and db3 is in recovery.
> 
> I don't think that pg_amcheck needs to care about being in recovery,
> at all. I agreed with you about using pg_is_in_recovery() at one
> point. That was a mistake on my part.

Ok, excellent, that was probably the only thing that had me really hung up.  I 
thought you were still asking for pg_amcheck to filter out the --parent-check 
option when in recovery, but if you're not asking for that, then I might have 
enough to go on now.

>> I thought that we were headed toward a decision where (despite my 
>> discomfort) pg_amcheck would downgrade options as necessary, but now it 
>> sounds like that's not so.  So what should it do
> 
> Downgrade is how you refer to it. I just 

Re: ALTER INDEX .. RENAME allows to rename tables/views as well

2021-10-06 Thread Tom Lane
"Bossart, Nathan"  writes:
> On 10/6/21, 1:52 PM, "Bruce Momjian"  wrote:
>> I can confirm this bug in git head, and I think it should be fixed.

> Here's a patch that ERRORs if the object type and statement type do
> not match.  Interestingly, some of the regression tests were relying
> on this behavior.

... as, no doubt, are a lot of applications that this will gratuitously
break.  We've long had a policy that ALTER TABLE will work on relations
that aren't tables, so long as the requested operation is sensible.

The situation for "ALTER some-other-relation-kind" is a bit more
confused, because some cases throw errors and some don't; but I really
doubt that tightening things up here will earn you anything but
brickbats.  I *definitely* don't agree with discarding the policy
about ALTER TABLE, especially if it's only done for RENAME.

In short: no, I do not agree that this is a bug to be fixed.  Perhaps
we should have done things differently years ago, but it's too late to
redefine it.

regards, tom lane




Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Peter Geoghegan
On Wed, Oct 6, 2021 at 2:45 PM Mark Dilger  wrote:
> I think the disagreements are about something else.

Informally speaking, you could say that pg_amcheck and amcheck verify
relations. More formally speaking, amcheck (whether called by
pg_amcheck or something else) can only prove the presence of
corruption. They cannot prove its absence. (The amcheck docs have
always said almost these exact words.)

This seems to come up a lot because at various points you seem to be
concerned about introducing specific imperfections. But it's not like
your starting point was ever perfection, or ever could be. I can
always describe a scenario in which amcheck misses real corruption --
a scenario which may be very contrived. So the mere fact that some new
theoretical possibility of corruption is introduced by some action
does not in itself mean much. We're dealing with that constantly, and
always will be.

Let's suppose I was to "directly fix amcheck + !indisvalid indexes". I
don't even know what that means -- I honestly don't have a clue.
You're focussing on one small piece of code in verify_nbtree.c, that
seems to punt responsibility, but the fact is that there are deeply
baked-in reasons why it does so. That's a reflection of how many
things about the system work, in general. Attributing blame to any one
small snippet of code (code in verify_nbtree.c, or wherever) just
isn't helpful.

> In truth, all the pg_amcheck frontend client can take a view on is whether it 
> was able to issue all the commands to the backend that it was asked to issue, 
> and whether any of those commands responded with an error.

AFAICT all pg_amcheck has to do is follow the amcheck user docs, by
generalizing from the example SQL query for the B-Tree stuff. And, it
should separately filter non-temp relations for the heap stuff, for
the same reasons (exactly the same situation there).

> pg_amcheck -d db1 -d db2 -d db3 --table=mytable
>
> In this case, mytable is a regular table on db1, a temporary table on db2, 
> and an unlogged table on db3, and db3 is in recovery.

I don't think that pg_amcheck needs to care about being in recovery,
at all. I agreed with you about using pg_is_in_recovery() at one
point. That was a mistake on my part.

> I thought that we were headed toward a decision where (despite my discomfort) 
> pg_amcheck would downgrade options as necessary, but now it sounds like 
> that's not so.  So what should it do?

Downgrade is how you refer to it. I just think of it as making sure
that pg_amcheck only asks amcheck to verify relations that are
basically capable of being verified (e.g., not a temp table).

-- 
Peter Geoghegan




Re: using extended statistics to improve join estimates

2021-10-06 Thread Tomas Vondra

On 10/6/21 23:03, Zhihong Yu wrote:

Hi,

+       conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+       /* if the new statistics covers more conditions, use it */
+       if (list_length(conditions2) > list_length(conditions1))
+       {
+           mcv = stat;

It seems conditions2 is calculated using mcv, so I wonder why mcv is 
replaced by stat (when conditions1 is the shorter list)?




Yeah, that's wrong - it should be the other way around, i.e.

if (list_length(conditions1) > list_length(conditions2))

There's no test with multiple candidate statistics yet, so this went 
unnoticed :-/



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Mark Dilger



> On Oct 6, 2021, at 2:45 PM, Mark Dilger  wrote:
> 
> and db3 is in recovery.



> they're scattered across different databases, some in recovery, some not.

What I mean here is that, since pg_amcheck might run for many hours, a 
database may start in recovery but then exit recovery, or may be restarted and 
go into recovery while we're not connected to it.  The tool may therefore see 
differences when processing a pattern against one database at one point in time 
and the same or different patterns against the same or different databases at 
some other point in time.  We don't get the luxury of assuming that nothing 
changes out from under us.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Mark Dilger



> On Oct 6, 2021, at 1:49 PM, Peter Geoghegan  wrote:
> 
>> The analogy here is: are we trying to verify that the relations are
>> valid? Or are we just trying to verify that they are as valid as we
>> can expect them to be?
> 
> I think that we do the latter (or something much closer to the latter
> than to the former). It's actually a very Karl Popper thing. Absence
> of evidence isn't evidence of absence -- period. We can get into a
> conversation about degrees of confidence, but that doesn't seem like
> it'll ever affect how we go about designing these things.
> 
> A lot of my disagreements around this stuff (especially with Mark)
> seem to stem from this basic understanding of things, in one way or
> another.

I think the disagreements are about something else.

Talking about pg_amcheck "checking" a database, or "checking" a relation, is 
actually short-hand for saying that pg_amcheck handed off the objects to 
amcheck's functions.  The pg_amcheck client application itself isn't checking 
anything.  This short-hand leads to misunderstandings that make it really hard 
for me to understand what people mean in this thread. Your comments suggest 
that I (or pg_amcheck) take some view on whether the database is corrupt, or 
whether we've proven that it is corrupt, or whether we've proven that it is not 
corrupt.  In truth, all the pg_amcheck frontend client can take a view on is 
whether it was able to issue all the commands to the backend that it was asked 
to issue, and whether any of those commands responded with an error.

Talking about pg_amcheck "failing" is also confusing.  I don't understand what 
people mean by this.  The example towards the top of this thread from Alexander 
was about pg_amcheck || echo "fail", but that suggests that failure is just a 
question of whether pg_amcheck exited with non-zero exit code.  In other parts 
of the thread, talking about pg_amcheck "failing" seems to be used to mean 
"pg_amcheck has diagnosed corruption".  This all gets muddled together.

Upthread, I decided to just make the changes to pg_amcheck that you seemed to 
want, but now I don't know what you want.  Can you opine on each of the 
following?  I need to know what they should print, and whether they should 
return with a non-zero exit status.  I genuinely can't post a patch until I 
know what these are supposed to do, because I need to update the regression 
tests accordingly: 


pg_amcheck -d db1 -d db2 -d db3 --table=mytable

In this case, mytable is a regular table on db1, a temporary table on db2, and 
an unlogged table on db3, and db3 is in recovery.


pg_amcheck --all --index="*accounting*" --parent-check 
--table="*human_resources*" --table="*peter*"  --relation="*alexander*"

Assume a multitude of databases, some primary, some standby, some indexes 
logged, some unlogged, some temporary.  Some of the human resources tables are 
unlogged, some not, and they're scattered across different databases, some in 
recovery, some not.  There is exactly one table per database that matches the 
pattern /*peter*/, but it's circumstances are different from one database to 
the next, and likewise for the pattern /*alexander*/ except that in some 
databases it matches an index and in others it matches a table.


I thought that we were headed toward a decision where (despite my discomfort) 
pg_amcheck would downgrade options as necessary, but now it sounds like that's 
not so.  So what should it do

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Pavel Borisov
>
>
> It is the most consistent with the general design of the system, for
> reasons that are pretty deeply baked into the system. I'm reminded of
> the fact that REINDEX CONCURRENTLY's completion became blocked due to
> similar trepidations. Understandably so.


I may be mistaken, but I recall that all index builds started during
some other (long) index build do not finish with indexes usable for selects
until that long index build is complete. This may or may not be a source of
amcheck misbehavior. Just a note on what could possibly be considered.

Best regards,
Pavel Borisov


Re: using extended statistics to improve join estimates

2021-10-06 Thread Zhihong Yu
On Wed, Oct 6, 2021 at 12:33 PM Tomas Vondra 
wrote:

> Hi,
>
> attached is an improved version of this patch, addressing some of the
> points mentioned in my last message:
>
> 1) Adds a couple regression tests, testing various join cases with
> expressions, additional conditions, etc.
>
> 2) Adds support for expressions, so the join clauses don't need to
> reference just simple columns. So e.g. this can benefit from extended
> statistics, when defined on the expressions:
>
>  -- CREATE STATISTICS s1 ON (a+1), b FROM t1;
>  -- CREATE STATISTICS s2 ON (a+1), b FROM t2;
>
>  SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);
>
> 3) Can combine extended statistics and regular (per-column) statistics.
> The previous version required extended statistics MCV on both sides of
> the join, but adding extended statistics on both sides may be impractical
> (e.g. if one side does not have correlated columns it's silly to have to
> add it just to make this patch happy).
>
> For example you may have extended stats on the dimension table but not
> the fact table, and the patch still can combine those two. Of course, if
> there's no MCV on either side, we can't do much.
>
> So this patch works when both sides have extended statistics MCV, or
> when one side has extended MCV and the other side regular MCV. It might
> seem silly, but the extended MCV allows considering additional baserel
> conditions (if there are any).
>
>
> examples
> ========
>
> The table / data is very simple, but hopefully good enough for some
> simple examples.
>
>   create table t1 (a int, b int, c int);
>   create table t2 (a int, b int, c int);
>
>   insert into t1 select mod(i,50), mod(i,50), mod(i,50)
> from generate_series(1,1000) s(i);
>
>   insert into t2 select mod(i,50), mod(i,50), mod(i,50)
> from generate_series(1,1000) s(i);
>
>   analyze t1, t2;
>
> First, without extended stats (just the first line of explain analyze,
> to keep the message short):
>
> explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
>
> QUERY PLAN
> ------------------------------------------------------------------------
>  Hash Join  (cost=31.00..106.00 rows=400 width=24) (actual time=5.426..22.678 rows=2 loops=1)
>
>
> explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;
>
> QUERY PLAN
> ------------------------------------------------------------------------
>  Hash Join  (cost=28.50..160.75 rows=1 width=24) (actual time=5.325..19.760 rows=1 loops=1)
>
>
> explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
> 25 and t2.c > 25;
>
> QUERY PLAN
> ------------------------------------------------------------------------
>  Hash Join  (cost=24.50..104.75 rows=4800 width=24) (actual time=5.618..5.639 rows=0 loops=1)
>
>
> Now, let's create statistics:
>
>   create statistics s1 on a, b, c from t1 ;
>   create statistics s2 on a, b, c from t2 ;
>   analyze t1, t2;
>
> and now the same queries again:
>
> explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
>
> QUERY PLAN
> ------------------------------------------------------------------------
>  Hash Join  (cost=31.00..106.00 rows=2 width=24) (actual time=5.448..22.713 rows=2 loops=1)
>
> explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;
>
> QUERY PLAN
> ------------------------------------------------------------------------
>  Hash Join  (cost=28.50..160.75 rows=1 width=24) (actual time=5.317..16.680 rows=1 loops=1)
>
>
> explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
> 25 and t2.c > 25;
>
> QUERY PLAN
> ------------------------------------------------------------------------
>  Hash Join  (cost=24.50..104.75 rows=1 width=24) (actual time=5.647..5.667 rows=0 loops=1)
>
>
> Those examples are a bit simplistic, but the improvements are fairly
> clear I think.
>
>
> limitations & open issues
> =
>
> Let's talk about the main general restrictions and open issues in the
> current patch that I can think of at the moment.
>
> 1) statistics covering all join clauses
>
> The patch requires the statistics to cover all the join clauses, mostly
> because it simplifies the implementation. This means that to use the
> per-column statistics, there has to be just a single join clause.
>
> AFAICS this could be relaxed and we could use multiple statistics to
> estimate the clauses. But it'd make selection of statistics much more
> complicated, because we have to pick "matching" statistics on both sides
> of the join. So it seems like overkill, and most joins have very few
> conditions anyway.
>
>
> 2) only equality join clauses
>
> The other restriction is that at the moment 

Re: ALTER INDEX .. RENAME allows to rename tables/views as well

2021-10-06 Thread Bruce Momjian


I can confirm this bug in git head, and I think it should be fixed.

---

On Mon, Oct  4, 2021 at 10:23:23AM +, Onder Kalaci wrote:
> Hi hackers,
> 
> I realized a subtle behavior with ALTER INDEX .. RENAME. It seems like a bug
> to me, please see the steps below.
> 
>  
> 
> Test 1: Rename table via RENAME .. INDEX
> 
> CREATE TABLE test_table (a int);
> 
> SELECT 'test_table'::regclass::oid;
>   oid
> -------
>  34470
> (1 row)
> 
> -- rename table using ALTER INDEX ..
> ALTER INDEX test_table RENAME TO test_table_2;
> 
> -- see that the table is renamed
> SELECT 34470::regclass;
>    regclass
> --------------
>  test_table_2
> (1 row)
> 
> 
> Test 2: Rename view via RENAME .. INDEX
> 
> CREATE VIEW test_view AS SELECT * FROM pg_class;
> 
> SELECT 'test_view'::regclass::oid;
>   oid
> -------
>  34473
> (1 row)
> 
> ALTER INDEX test_view RENAME TO test_view_2;
> 
> SELECT 34473::regclass;
>   regclass
> -------------
>  test_view_2
> (1 row)
> 
> It seems like an oversight in ExecRenameStmt(), and probably applies to
> sequences, mat. views and foreign tables as well.
> 
>  
> 
> I can reproduce this on both 13.2 and 14.0. Though haven’t checked earlier
> versions.
> 
>  
> 
> Thanks,
> 
> Onder
> 

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Peter Geoghegan
On Wed, Oct 6, 2021 at 1:15 PM Robert Haas  wrote:
> > Where I might go further than you or Mark (not sure) is on this: I
> > also think that it's the caller's job to not call the functions with
> > temp relations, or (in the case of the index verification stuff) with
> > !indisready or !indisvalid relations. I believe that these ought to
> > also be treated as basic questions about the relation, just like in my
> > GIN example. But that's as far as I go here.
>
> I am on board with this, with slight trepidation.

It may not be a great design, or even a good one. My argument is just
that it's the least worst design overall.

It is the most consistent with the general design of the system, for
reasons that are pretty deeply baked into the system. I'm reminded of
the fact that REINDEX CONCURRENTLY's completion became blocked due to
similar trepidations. Understandably so.

> > I agree that nbtree and heapam verification ought to agree here. But
> > my point was just that this behavior just makes sense: what we have is
> > something just like an empty relation.
>
> I am not confident that this behavior is optimal. It's pretty
> arbitrary. It's like saying "well, you asked me to check that everyone
> in the car was wearing seatbelts, and the car has no seatbelts, so
> we're good!"

I prefer to think of it as "there is nobody in the car, so we're all good!".

> The analogy here is: are we trying to verify that the relations are
> valid? Or are we just trying to verify that they are as valid as we
> can expect them to be?

I think that we do the latter (or something much closer to the latter
than to the former). It's actually a very Karl Popper thing. Absence
of evidence isn't evidence of absence -- period. We can get into a
conversation about degrees of confidence, but that doesn't seem like
it'll ever affect how we go about designing these things.

A lot of my disagreements around this stuff (especially with Mark)
seem to stem from this basic understanding of things, in one way or
another.

> No, that's existing code from btree_index_mainfork_expected. I thought
> you were saying that verify_heapam.c should adopt the same approach,
> and I was agreeing, not because I think it's necessarily the perfect
> approach, but for consistency.

Sorry, I somehow read that code as having an ERROR, not a NOTICE.
(Even though I wrote the code myself.)

-- 
Peter Geoghegan




Re: Role Self-Administration

2021-10-06 Thread Stephen Frost
Greetings,

On Wed, Oct 6, 2021 at 16:28 Robert Haas  wrote:

> On Wed, Oct 6, 2021 at 3:29 PM Stephen Frost  wrote:
> > Does that mean that we also get to change what a specific set of
> > commands, which are all well-defined in the standard, do even when that
> > goes against what an SQL compliant implementation would do?  I really
> > don't think so.  If this was *new* syntax to go along with some new
> > feature or extension in PG, sure, we can define what that syntax does
> > because the standard doesn't.  In this case we're talking entirely about
> > objects and statements which the standard does define.
>
> Well, I think what we're talking about is saying something like:
>
> CREATE USER mybigcustomer CREATEROLE;
>
> And then having the mybigcustomer role be able to create other roles,
> which would be automatically dropped if I later said:
>
> DROP USER mybigcustomer CASCADE;
>
> Since AFAIK CREATEROLE is not in the specification, I think we're
> perfectly free to say that it alters the behavior of the subsequent
> DROP USER command in any way that we judge reasonable. I agree that we
> need to have SQL-standard syntax do SQL-standard things, but it
> doesn't have to be the case that the whole command goes unmentioned by
> the specification. Options that we add to CREATE USER or CREATE TABLE
> or any other command can modify the behavior of those objects, and the
> spec has nothing to say about it.
>
> Now that doesn't intrinsically mean that it's a good idea. I think
> what I hear you saying is that you find it pretty terrifying that
> "DROP USER mybigcustomer CASCADE;" could blow away a lot of users and
> a lot of tables and that could be scary. And I agree, but that's a
> design question, not a spec question. Today, there is not, in
> PostgreSQL, a DROP USER .. CASCADE variant. If there are objects that
> depend on the user, DROP USER fails. So we could for example decide
> that DROP USER .. CASCADE will cascade to other users, but not to
> regular objects. Or maybe that's too inconsistent, and we should do
> something like DROP ROLES OWNED BY [role]. Or maybe having both DROP
> OWNED BY and DROP ROLES OWNED BY is too weird, and the existing DROP
> OWNED BY [role] command should also cascade to roles. Those kinds of
> things seem worth discussing to me, to come up with the behavior that
> will work best for people. But I do disagree with the idea that we're
> not free to innovate here. We make up new SQL syntax and new
> configuration variables and all kinds of new things all the time, and
> I don't think this is any different.


This specific syntax, including the CASCADE bit, has, at minimum, been
contemplated by the SQL folks sufficiently to be described in one
specific way.  I don’t have a copy of 2016 handy, unfortunately, and so I’m
not sure if it’s described that way in a “stable” version of the standard
or not (it isn’t defined in the 2006 draft I’ve seen), but ultimately I
don’t think we are really talking about entirely net-new syntax here…

If we were, that would be different and perhaps we would just be guessing
at what the standard might do in the future, but I don’t think it’s an open
ended question at this point..

(Even if it was, I have to say that the direction that they’re going in
certainly seems consistent to me, anyway, with what’s been done in the past
and I think it’d be bad of us to go in a different direction from that
since it’d be difficult for us to change it later when the new spec comes
out and contradicts what we decided to do..)

Thanks,

Stephen



Re: Role Self-Administration

2021-10-06 Thread Stephen Frost
Greetings,

On Wed, Oct 6, 2021 at 16:01 Mark Dilger 
wrote:

> > On Oct 6, 2021, at 11:48 AM, Stephen Frost  wrote:
> >
> > In the spec, under , there is a 'General Rules'
> > section (as there is with most statements) and in that section it says
> > that for every authorization identifier (that is, some privilege, be it
> > a GRANT of SELECT rights on an object, or GRANT of role membership in
> > some role) which references the role being dropped, the command:
> >
> > REVOKE R FROM A DB
> >
> > is effectively executed (without further access rule checking).
>
> I think you are saying that "DROP ROLE bob" implies revoking "bob" from
> anybody who has membership in role bob.  I agree with that entirely, and my
> proposal does not change that.  (Roles owned by "bob" are not typically
> members of role "bob" to begin with.)


Yes and no….   Specifically the spec says that “DROP ROLE bob CASCADE”
implies revoking memberships that bob is in. The other drop behavior is
“RESTRICT”, which, as you might expect, implies throwing an error instead.

> What I'm saying above is that the command explicitly listed there
> > *isn't* 'DROP ROLE A DB', even though that is something which the spec
> > *could* have done, had they wished to.
>
> Clearly the spec could have said that "DROP ROLE bob" implies "and drop
> all roles which are members of bob" and did not.  I fully agree with that
> decision, and I'm not trying to change it one iota.


I’m not talking about what the spec says for just “DROP ROLE bob”, but
rather what the spec says for “DROP ROLE bob CASCADE”. The latest versions
add the drop behavior syntax to the end of DROP ROLE and it can be either
CASCADE or RESTRICT, and if it’s CASCADE then the rule is to run the
REVOKEs that I’ve been talking about.
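
To illustrate with a sketch of my own (noting that current PostgreSQL does
not even accept CASCADE in the DROP ROLE grammar yet):

    CREATE ROLE r;
    GRANT r TO alice;
    -- per the spec, DROP ROLE r CASCADE effectively executes
    --   REVOKE r FROM alice CASCADE
    -- before dropping r: it revokes the membership, it does not drop alice.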

>  Given that they didn't, it seems
> > very clear that making such a change would very much be a deviation and
> > violation of the spec.
>
> Sure, and I'm not proposing any such change.


But.. you are, because what I’ve been talking about has specifically been
the spec-defined “CASCADE” case, not bare DROP ROLE.

> That we invented some behind-the-scenes concept
> > of role ownership where we track who actually created what role and then
> > use that info to transform a REVOKE into a DROP doesn't make such a
> > transformation OK.
>
> I think I understand why you say this.  You seem to be conflating the idea
> of having privileges on role "bob" with being owned by role "bob".  That's
> not the case.  Maybe you are not conflating them, but I can't interpret
> what you are saying otherwise.


I’m talking specifically about what happens when someone runs a DROP ROLE
with CASCADE.

> Consider that with what you're proposing, a user could execute the
> > following series of entirely SQL-spec compliant statements, and get
> > very different results depending on if we have this 'ownership' concept
> > or not:
> >
> > SET ROLE postgres;
> > CREATE ROLE r1;
> >
> > SET ROLE r1;
> > CREATE ROLE r2;
> >
> > SET ROLE postgres;
> > DROP ROLE r1 CASCADE;
> >
> > With what you're suggesting, the end result would be that r2 no longer
> > exists, whereas with the spec-defined behvaior, r2 *would* still exist.
>
> If you try this on postgres 14, you get a syntax error because CASCADE is
> not supported in the grammar for DROP ROLE:
>
> mark.dilger=# drop role bob cascade;
> ERROR:  syntax error at or near "cascade"
>
> I don't know if those statements are "entirely SQL-spec compliant" because
> I have yet to find a reference to the spec saying what DROP ROLE ...
> CASCADE is supposed to do.  I found some Vertica docs that say what Vertica
> does.  I found some Enterprise DB docs about what Advanced Server does (or
> course, since I work here.)  I don't see much else.


They’re valid commands in the version I’m looking at, though I now think
that version is a pre-release draft, as 2016 is apparently the latest
published standard (I had thought there was something more recent). I’m not
sure if the 2016 version included the CASCADE option for DROP ROLE or not.
Even if it’s only a preview, it sure looks like this is the direction
they’re going in, and it seems consistent, at least to me, with other
things they’ve done in this area…

You have quoted me parts of the spec about what REVOKE is supposed to do,
> and I have responded about why I don't see the connection to DROP
> ROLE...CASCADE.


The bits from REVOKE that I quoted were only at the very start of this
thread…. This entire sub thread has only been about the DROP ROLE
statement..

Are there any other references to either the spec or how other common
> databases handle this?


I’m trying to get some more insight into the version of the spec I’m
looking at, and to figure out a way for you to see what I’m talking
about directly.

Thanks,

Stephen



Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)

2021-10-06 Thread Robert Haas
On Mon, Oct 4, 2021 at 12:44 PM Shruthi Gowda  wrote:
> Thanks for the inputs, Robert. In the v4 patch, an unused OID (i.e., 4)
> is fixed for the template0 and the same is removed from unused oid
> list.
>
> In addition to the review comment fixes, I have removed some code that
> is no longer needed/doesn't make sense since we preserve the OIDs.

This is not a full review, but I'm wondering about this bit of code:

-   if (!RELKIND_HAS_STORAGE(relkind) || OidIsValid(relfilenode))
+   if (!RELKIND_HAS_STORAGE(relkind) ||
+       (OidIsValid(relfilenode) && !create_storage))
        create_storage = false;
    else
    {
        create_storage = true;
-       relfilenode = relid;
+
+       /*
+        * Create the storage with oid same as relid if relfilenode is
+        * unspecified by the caller
+        */
+       if (!OidIsValid(relfilenode))
+           relfilenode = relid;
    }

This seems hard to understand, and I wonder if perhaps it can be
simplified. If !RELKIND_HAS_STORAGE(relkind), then we're going to set
create_storage to false if it was previously true, and otherwise just
do nothing. Otherwise, if !create_storage, we'll enter the
create_storage = false branch which effectively does nothing.
Otherwise, if !OidIsValid(relfilenode), we'll set relfilenode = relid.
So couldn't we express that like this?

if (!RELKIND_HAS_STORAGE(relkind))
    create_storage = false;
else if (create_storage && !OidIsValid(relfilenode))
    relfilenode = relid;

If so, that seems more clear.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: parallelizing the archiver

2021-10-06 Thread Magnus Hagander
On Tue, Oct 5, 2021 at 5:32 AM Bossart, Nathan  wrote:

> On 10/4/21, 8:19 PM, "Stephen Frost"  wrote:
> > It's also been discussed, at least around the water cooler (as it were
> > in pandemic times- aka our internal slack channels..) that the existing
> > archive command might be reimplemented as an extension using these.  Not
> > sure if that's really necessary but it was a thought.  In any case,
> > thanks for working on this!
>
> Interesting.  I like the idea of having one code path for everything
> instead of branching for the hook and non-hook paths.  Thanks for
> sharing your thoughts.
>

I remember having had this discussion a few times, I think mainly with
Stephen and David as well (but not on their internal slack channels :P).

I definitely think that's the way to go. It gives a single path for
everything which makes it simpler in the most critical parts. And once you
have picked an implementation other than it, you're now completely rid of
the old implementation.  And of course there's the good old idea that having an
extension already using the API is a good way to show that the API is in a
good place.

As much as I dislike our current interface in archive_command, and would
like to see it go away completely, I do believe we need to ship something
that has it - if nothing else then for backwards compatibility. But an
extension like this would also make it easier to eventually, down the road,
deprecate this solution.

Oh, and please put said implementation in a better place than contrib :)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Role Self-Administration

2021-10-06 Thread Robert Haas
On Wed, Oct 6, 2021 at 3:29 PM Stephen Frost  wrote:
> Does that mean that we also get to change what a specific set of
> commands, which are all well-defined in the standard, do even when that
> goes against what an SQL compliant implementation would do?  I really
> don't think so.  If this was *new* syntax to go along with some new
> feature or extension in PG, sure, we can define what that syntax does
> because the standard doesn't.  In this case we're talking entirely about
> objects and statements which the standard does define.

Well, I think what we're talking about is saying something like:

CREATE USER mybigcustomer CREATEROLE;

And then having the mybigcustomer role be able to create other roles,
which would be automatically dropped if I later said:

DROP USER mybigcustomer CASCADE;

Since AFAIK CREATEROLE is not in the specification, I think we're
perfectly free to say that it alters the behavior of the subsequent
DROP USER command in any way that we judge reasonable. I agree that we
need to have SQL-standard syntax do SQL-standard things, but it
doesn't have to be the case that the whole command goes unmentioned by
the specification. Options that we add to CREATE USER or CREATE TABLE
or any other command can modify the behavior of those objects, and the
spec has nothing to say about it.

Now that doesn't intrinsically mean that it's a good idea. I think
what I hear you saying is that you find it pretty terrifying that
"DROP USER mybigcustomer CASCADE;" could blow away a lot of users and
a lot of tables and that could be scary. And I agree, but that's a
design question, not a spec question. Today, there is not, in
PostgreSQL, a DROP USER .. CASCADE variant. If there are objects that
depend on the user, DROP USER fails. So we could for example decide
that DROP USER .. CASCADE will cascade to other users, but not to
regular objects. Or maybe that's too inconsistent, and we should do
something like DROP ROLES OWNED BY [role]. Or maybe having both DROP
OWNED BY and DROP ROLES OWNED BY is too weird, and the existing DROP
OWNED BY [role] command should also cascade to roles. Those kinds of
things seem worth discussing to me, to come up with the behavior that
will work best for people. But I do disagree with the idea that we're
not free to innovate here. We make up new SQL syntax and new
configuration variables and all kinds of new things all the time, and
I don't think this is any different.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Robert Haas
On Wed, Oct 6, 2021 at 3:56 PM Peter Geoghegan  wrote:
> I agree, with the stipulation that the caller (in this case
> pg_amcheck) is required to know certain basic things about the
> relation in order to get useful behavior. For example, if you use
> bt_index_check() with a GIN index, you're going to get an error. That
> much we can all agree on, I'm sure.

Yes.

> Where I might go further than you or Mark (not sure) is on this: I
> also think that it's the caller's job to not call the functions with
> temp relations, or (in the case of the index verification stuff) with
> !indisready or !indisvalid relations. I believe that these ought to
> also be treated as basic questions about the relation, just like in my
> GIN example. But that's as far as I go here.

I am on board with this, with slight trepidation.

> > > --parent-check does present us with the question of what to do in Hot
> > > Standby mode, where it will surely fail (because it requires a
> > > relation level ShareLock, etc). But I actually don't think it's
> > > complicated: we must throw an error, because it's fundamentally not
> > > something that will ever work (with any index). Whether the error
> > > comes from pg_amcheck or amcheck proper doesn't seem important to me.
> >
> > That detail, to me, is actually very important.
>
> I believe that you actually reached the same conclusion, though: we
> should let it just fail. That makes this question easy.

Great.

> > > I think it's pretty clear that verify_heapam.c (from amcheck proper)
> > > should just follow verify_nbtree.c when directly invoked against an
> > > unlogged index in Hot Standby. That is, it should assume that the
> > > relation has no storage, but still "verify" it conceptually. Just show
> > > a NOTICE about it. Assume no storage to verify.
> >
> > I haven't checked the code, but that sounds right. I interpret this to
> > mean that the different sub-parts of amcheck don't handle this case in
> > ways that are consistent with each other, and that seems wrong. We
> > should make them consistent.
>
> I agree that nbtree and heapam verification ought to agree here. But
> my point was just that this behavior just makes sense: what we have is
> something just like an empty relation.

I am not confident that this behavior is optimal. It's pretty
arbitrary. It's like saying "well, you asked me to check that everyone
in the car was wearing seatbelts, and the car has no seatbelts, so
we're good!"

To which I respond: maybe. Were we trying to verify that people are
complying with safety regulations as well as may be possible under the
circumstances, or that people are actually safe?

The analogy here is: are we trying to verify that the relations are
valid? Or are we just trying to verify that they are as valid as we
can expect them to be?

For me, the deciding point is that verify_nbtree.c was here first, and
it set a precedent. Unless there is a compelling reason to do
otherwise, we should make later things conform to that precedent.
Whether that's actually best, I'm not certain. It might be, but I'm
not sure that it is.

> > > Finally, there is the question of what happens inside pg_amcheck (not
> > > amcheck proper) deals with unlogged relations in Hot Standby mode.
> > > There are two reasonable options: it can either "verify" the indexes
> > > (actually just show those NOTICE messages), or skip them entirely. I
> > > lean towards the former option, on the grounds that I don't think it
> > > should be special-cased. But I don't feel very strongly about it.
> >
> > I like having it do this:
> >
> > ereport(NOTICE,
> >         (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
> >          errmsg("cannot verify unlogged index \"%s\" during recovery, skipping",
> >                 RelationGetRelationName(rel))));
> >
> > I think the fewer decisions the command-line tool makes, the better.
> > We should put the policy decisions in amcheck itself.
>
> Wait, so you're arguing that we should change amcheck (both nbtree and
> heapam verification) to simply reject unlogged indexes during
> recovery?

No, that's existing code from btree_index_mainfork_expected. I thought
you were saying that verify_heapam.c should adopt the same approach,
and I was agreeing, not because I think it's necessarily the perfect
approach, but for consistency.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Peter Geoghegan
On Wed, Oct 6, 2021 at 12:36 PM Mark Dilger
 wrote:
> The user may not know that the system has changed.
>
> For example, if I see errors in the logs suggesting corruption in a relation 
> named "mark" and run pg_amcheck --relation=mark, I expect that to check the 
> relation.  If that relation is a temporary table, I'd like to know that it's 
> not going to be checked, not just have pg_amcheck report that everything is 
> ok.

This is just a detail to me. I agree that it's reasonable to say "I
can't do that specific thing you asked for with the temp relation",
instead of "no such verifiable relation" -- but only because it's more
specific and user friendly. Providing a slightly friendlier error
message like this does not actually conflict with the idea of
generally treating temp relations as "not visible to pg_amcheck".
Ditto for the similar !indisready/!indisvalid B-Tree case.

> As another example, if I change my environment variables to connect to the 
> standby rather than the primary, and forget that I did so, and then run 
> pg_amcheck --relation=unlogged_relation, I'd rather get a complaint that I 
> can't check an unlogged relation on a standby than get nothing.  Sure, what I 
> did doesn't make sense, but why should the application paper over that 
> mistake?

I think that it shouldn't get an error at all -- this should be
treated like an empty relation, per the verify_nbtree.c precedent.
pg_amcheck doesn't need to concern itself with this at all.

-- 
Peter Geoghegan




Re: storing an explicit nonce

2021-10-06 Thread Bruce Momjian
On Wed, Oct  6, 2021 at 03:17:00PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (br...@momjian.us) wrote:
> > On Tue, Oct  5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote:
> > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian  wrote:
> > > > We are still working on our TDE patch. Right now the focus is on
> > > > refactoring temporary file access to make the TDE patch itself smaller.
> > > > Reconsidering encryption mode choices given concerns expressed is next.
> > > > Currently a viable option seems to be AES-XTS with LSN added into the
> > > > IV. XTS doesn't have an issue with predictable IV and isn't totally
> > > > broken in case of IV reuse.
> > > 
> > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> > > 16-byte blocks affect later blocks, meaning that hint bit changes would
> > > also affect later blocks.  I think this means we would need to write WAL
> > > full page images for hint bit changes to avoid torn pages.  Right now
> > > hint bit (single bit) changes can be lost without causing torn pages. 
> > > This was another of the advantages of using a stream cipher like CTR.
> > 
> > Another problem caused by block mode ciphers is that to use the LSN as
> > part of the nonce, the LSN must not be encrypted, but you then have to
> > find a 16-byte block in the page that you don't need to encrypt.
> 
> With AES-XTS, we don't need to use the LSN as part of the nonce though,
> so I don't think this argument is actually valid..?  As discussed
> previously regarding AES-XTS, the general idea was to use the path to
> the file and the filename itself plus the block number as the IV, and
> that works fine for XTS because it's ok to reuse it (unlike with CTR).

Yes, I would prefer we don't use the LSN.  I only mentioned it since
Ants Aasma mentioned LSN use above.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Role Self-Administration

2021-10-06 Thread Mark Dilger



> On Oct 6, 2021, at 11:48 AM, Stephen Frost  wrote:
> 
> In the spec, under , there is a 'General Rules'
> section (as there is with most statements) and in that section it says
> that for every authorization identifier (that is, some privilege, be it
> a GRANT of SELECT rights on an object, or GRANT of role membership in
> some role) which references the role being dropped, the command:
> 
> REVOKE R FROM A DB
> 
> is effectively executed (without further access rule checking).

I think you are saying that "DROP ROLE bob" implies revoking "bob" from anybody 
who has membership in role bob.  I agree with that entirely, and my proposal 
does not change that.  (Roles owned by "bob" are not typically members of role 
"bob" to begin with.)

> What I'm saying above is that the command explicitly listed there
> *isn't* 'DROP ROLE A DB', even though that is something which the spec
> *could* have done, had they wished to.

Clearly the spec could have said that "DROP ROLE bob" implies "and drop all 
roles which are members of bob" and did not.  I fully agree with that 
decision, and I'm not trying to change it one iota.

>  Given that they didn't, it seems
> very clear that making such a change would very much be a deviation and
> violation of the spec.  

Sure, and I'm not proposing any such change.

> That we invented some behind-the-scenes concept
> of role ownership where we track who actually created what role and then
> use that info to transform a REVOKE into a DROP doesn't make such a
> transformation OK.

I think I understand why you say this.  You seem to be conflating the idea of 
having privileges on role "bob" with being owned by role "bob".  That's not the 
case.  Maybe you are not conflating them, but I can't interpret what you are 
saying otherwise.

> Consider that with what you're proposing, a user could execute the
> following series of entirely SQL-spec compliant statements, and get
> very different results depending on if we have this 'ownership' concept
> or not:
> 
> SET ROLE postgres;
> CREATE ROLE r1;
> 
> SET ROLE r1;
> CREATE ROLE r2;
> 
> SET ROLE postgres;
> DROP ROLE r1 CASCADE;
> 
> With what you're suggesting, the end result would be that r2 no longer
> exists, whereas with the spec-defined behvaior, r2 *would* still exist.

If you try this on postgres 14, you get a syntax error because CASCADE is not 
supported in the grammar for DROP ROLE:

mark.dilger=# drop role bob cascade;
ERROR:  syntax error at or near "cascade"

I don't know if those statements are "entirely SQL-spec compliant" because I 
have yet to find a reference to the spec saying what DROP ROLE ... CASCADE is 
supposed to do.  I found some Vertica docs that say what Vertica does.  I found 
some Enterprise DB docs about what Advanced Server does (of course, since I 
work here.)  I don't see much else.

You have quoted me parts of the spec about what REVOKE is supposed to do, and I 
have responded about why I don't see the connection to DROP ROLE...CASCADE.

Are there any other references to either the spec or how other common databases 
handle this?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings

2021-10-06 Thread Jelte Fennema
Ugh forgot to attach the patch. Here it is.

From: Jelte Fennema 
Sent: Wednesday, October 6, 2021 21:56
To: Zhihong Yu 
Cc: pgsql-hack...@postgresql.org 
Subject: Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, 
connect_timeout and keepalive settings

We actually ran into an issue caused by this in production, where a PQcancel 
connection was open on the client for 2+ days because the server had 
restarted at the wrong moment in the cancel handshake. The client was now 
indefinitely waiting for the server to send an EOF back, and because keepalives 
were not enabled on this socket it was never closed.

I attached an updated patch which also uses the keepalive settings in PQ. The 
connect_timeout is a bit harder to get to work. As far as I can tell it 
would require something like this: https://stackoverflow.com/a/2597774/2570866

> The other field names are quite short. How about naming the field tcp_timeout?

I kept the same names as in the pg_conn struct for consistency's sake.


0001-Use-tcp_user_timeout-and-keepalives-in-PQcancel.patch
Description:  0001-Use-tcp_user_timeout-and-keepalives-in-PQcancel.patch


Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings

2021-10-06 Thread Jelte Fennema
We actually ran into an issue caused by this in production, where a PQcancel 
connection was open on the client for 2+ days because the server had 
restarted at the wrong moment in the cancel handshake. The client was now 
indefinitely waiting for the server to send an EOF back, and because keepalives 
were not enabled on this socket it was never closed.

I attached an updated patch which also uses the keepalive settings in PQ. The 
connect_timeout is a bit harder to get to work. As far as I can tell it 
would require something like this: https://stackoverflow.com/a/2597774/2570866
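
For reference, the technique in that answer is roughly the following (an
illustrative sketch only, not code from the attached patch; names are made
up):

#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>

static int
connect_with_timeout(int sock, const struct sockaddr *addr,
                     socklen_t addrlen, int timeout_secs)
{
    fd_set      wfds;
    struct timeval tv;
    int         flags;
    int         err = 0;
    socklen_t   errlen = sizeof(err);

    /* make the socket non-blocking so connect() returns immediately */
    flags = fcntl(sock, F_GETFL, 0);
    fcntl(sock, F_SETFL, flags | O_NONBLOCK);

    if (connect(sock, addr, addrlen) == 0)
        return 0;               /* connected immediately */
    if (errno != EINPROGRESS)
        return -1;              /* hard failure */

    /* the socket becomes writable once the handshake finishes (or fails) */
    FD_ZERO(&wfds);
    FD_SET(sock, &wfds);
    tv.tv_sec = timeout_secs;
    tv.tv_usec = 0;
    if (select(sock + 1, NULL, &wfds, NULL, &tv) <= 0)
        return -1;              /* timed out, or select() error */

    /* check whether the deferred connect() actually succeeded */
    if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 || err != 0)
        return -1;

    return 0;
}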

> The other field names are quite short. How about naming the field tcp_timeout?

I kept the same names as in the pg_conn struct for consistency's sake.




Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Peter Geoghegan
On Wed, Oct 6, 2021 at 12:33 PM Robert Haas  wrote:
> To me, it doesn't matter which specific option we're talking about. If
> I tell pg_amcheck to pass a certain flag to the underlying functions,
> then it should do that. If the behavior needs to be changed, it should
> be changed in those underlying functions, not in pg_amcheck.

I agree, with the stipulation that the caller (in this case
pg_amcheck) is required to know certain basic things about the
relation in order to get useful behavior. For example, if you use
bt_index_check() with a GIN index, you're going to get an error. That
much we can all agree on, I'm sure.
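
Concretely, a made-up session sketch (object names invented, error text
approximate):

    CREATE INDEX docs_body_gin ON docs USING gin (body);
    SELECT bt_index_check('docs_body_gin'::regclass);
    ERROR:  only B-Tree indexes are supported as targets for verification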

Where I might go further than you or Mark (not sure) is on this: I
also think that it's the caller's job to not call the functions with
temp relations, or (in the case of the index verification stuff) with
!indisready or !indisvalid relations. I believe that these ought to
also be treated as basic questions about the relation, just like in my
GIN example. But that's as far as I go here.

> If we
> start putting some of the intelligence into amcheck itself, and some
> of it into pg_amcheck, I think it's going to become confusing and in
> fact I think it's going to become unreliable, at least from the user
> point of view. People will get confused if they run pg_amcheck and get
> some result (either pass or fail) and then they do the same thing with
> amcheck and get a different result.

Agreed on all that.

> > --parent-check does present us with the question of what to do in Hot
> > Standby mode, where it will surely fail (because it requires a
> > relation level ShareLock, etc). But I actually don't think it's
> > complicated: we must throw an error, because it's fundamentally not
> > something that will ever work (with any index). Whether the error
> > comes from pg_amcheck or amcheck proper doesn't seem important to me.
>
> That detail, to me, is actually very important.

I believe that you actually reached the same conclusion, though: we
should let it just fail. That makes this question easy.

> > I think it's pretty clear that verify_heapam.c (from amcheck proper)
> > should just follow verify_nbtree.c when directly invoked against an
> > unlogged index in Hot Standby. That is, it should assume that the
> > relation has no storage, but still "verify" it conceptually. Just show
> > a NOTICE about it. Assume no storage to verify.
>
> I haven't checked the code, but that sounds right. I interpret this to
> mean that the different sub-parts of amcheck don't handle this case in
> ways that are consistent with each other, and that seems wrong. We
> should make them consistent.

I agree that nbtree and heapam verification ought to agree here. But
my point was just that this behavior just makes sense: what we have is
something just like an empty relation.

> > Finally, there is the question of what happens inside pg_amcheck (not
> > amcheck proper) deals with unlogged relations in Hot Standby mode.
> > There are two reasonable options: it can either "verify" the indexes
> > (actually just show those NOTICE messages), or skip them entirely. I
> > lean towards the former option, on the grounds that I don't think it
> > should be special-cased. But I don't feel very strongly about it.
>
> I like having it do this:
>
> ereport(NOTICE,
>         (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
>          errmsg("cannot verify unlogged index \"%s\" during recovery, skipping",
>                 RelationGetRelationName(rel))));
>
> I think the fewer decisions the command-line tool makes, the better.
> We should put the policy decisions in amcheck itself.

Wait, so you're arguing that we should change amcheck (both nbtree and
heapam verification) to simply reject unlogged indexes during
recovery?

That doesn't seem like very friendly or self-consistent behavior. At
first (in hot standby) it fails. As soon as the DB is promoted, we'll
then also have no on-disk storage for the same unlogged relation, but
now suddenly it's okay, just because of that. I find it far more
logical to just assume that there is no relfilenode storage to check
when in hot standby.

This isn't the same as the --parent-check thing at all, because that's
about an implementation restriction of Hot Standby. Whereas this is
about the physical index structure itself.

-- 
Peter Geoghegan




Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Mark Dilger



> On Oct 6, 2021, at 12:28 PM, Peter Geoghegan  wrote:
> 
> I think that what I've said boils down to this:
> 
> * pg_amcheck shouldn't attempt to verify temp relations, on the
> grounds that this is fundamentally not useful, and not something that
> could ever be sensibly interpreted as "just doing what the user asked
> for".

Right.  I don't think there has been any disagreement on this.  There is a bug 
in pg_amcheck with respect to this issue, and we all agree on that.

> * pg_amcheck calls to bt_index_check()/bt_index_parent_check() must
> only be made with "i.indisready AND i.indisvalid" indexes, just like
> the old query from the docs. (Actually, the same query also filters
> out temp relations -- which is why I view this issue as almost
> identical to the first.)
> 
> Why would the user ask for something that fundamentally doesn't make
> any sense?

The user may not know that the system has changed.

For example, if I see errors in the logs suggesting corruption in a relation 
named "mark" and run pg_amcheck --relation=mark, I expect that to check the 
relation.  If that relation is a temporary table, I'd like to know that it's 
not going to be checked, not just have pg_amcheck report that everything is ok.

As another example, if I change my environment variables to connect to the 
standby rather than the primary, and forget that I did so, and then run 
pg_amcheck --relation=unlogged_relation, I'd rather get a complaint that I 
can't check an unlogged relation on a standby than get nothing.  Sure, what I 
did doesn't make sense, but why should the application paper over that mistake?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Robert Haas
On Wed, Oct 6, 2021 at 2:56 PM Peter Geoghegan  wrote:
> --heapallindexed doesn't complicate things for us at all. It changes
> nothing about the locking considerations. It's just an additive thing,
> some extra checks with the same basic underlying requirements. Maybe
> you meant to say --parent-check, not --heapallindexed?

To me, it doesn't matter which specific option we're talking about. If
I tell pg_amcheck to pass a certain flag to the underlying functions,
then it should do that. If the behavior needs to be changed, it should
be changed in those underlying functions, not in pg_amcheck. If we
start putting some of the intelligence into amcheck itself, and some
of it into pg_amcheck, I think it's going to become confusing and in
fact I think it's going to become unreliable, at least from the user
point of view. People will get confused if they run pg_amcheck and get
some result (either pass or fail) and then they do the same thing with
amcheck and get a different result.

> --parent-check does present us with the question of what to do in Hot
> Standby mode, where it will surely fail (because it requires a
> relation level ShareLock, etc). But I actually don't think it's
> complicated: we must throw an error, because it's fundamentally not
> something that will ever work (with any index). Whether the error
> comes from pg_amcheck or amcheck proper doesn't seem important to me.

That detail, to me, is actually very important.

> I think it's pretty clear that verify_heapam.c (from amcheck proper)
> should just follow verify_nbtree.c when directly invoked against an
> unlogged index in Hot Standby. That is, it should assume that the
> relation has no storage, but still "verify" it conceptually. Just show
> a NOTICE about it. Assume no storage to verify.

I haven't checked the code, but that sounds right. I interpret this to
mean that the different sub-parts of amcheck don't handle this case in
ways that are consistent with each other, and that seems wrong. We
should make them consistent.

> Finally, there is the question of what happens inside pg_amcheck (not
> amcheck proper) deals with unlogged relations in Hot Standby mode.
> There are two reasonable options: it can either "verify" the indexes
> (actually just show those NOTICE messages), or skip them entirely. I
> lean towards the former option, on the grounds that I don't think it
> should be special-cased. But I don't feel very strongly about it.

I like having it do this:

ereport(NOTICE,
        (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
         errmsg("cannot verify unlogged index \"%s\" during recovery, skipping",
                RelationGetRelationName(rel))));

I think the fewer decisions the command-line tool makes, the better.
We should put the policy decisions in amcheck itself.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: using extended statistics to improve join estimates

2021-10-06 Thread Tomas Vondra
Hi,

attached is an improved version of this patch, addressing some of the
points mentioned in my last message:

1) Adds a couple regression tests, testing various join cases with
expressions, additional conditions, etc.

2) Adds support for expressions, so the join clauses don't need to
reference just simple columns. So e.g. this can benefit from extended
statistics, when defined on the expressions:

 -- CREATE STATISTICS s1 ON (a+1), b FROM t1;
 -- CREATE STATISTICS s2 ON (a+1), b FROM t2;

 SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);

3) Can combine extended statistics and regular (per-column) statistics.
The previous version required extended statistics MCV on both sides of
the join, but adding extended statistics on both sides may be impractical
(e.g. if one side does not have correlated columns it's silly to have to
add it just to make this patch happy).

For example you may have extended stats on the dimension table but not
the fact table, and the patch still can combine those two. Of course, if
there's no MCV on either side, we can't do much.

So this patch works when both sides have extended statistics MCV, or
when one side has extended MCV and the other side regular MCV. It might
seem silly, but the extended MCV allows considering additional baserel
conditions (if there are any).


examples
========

The table / data is very simple, but hopefully good enough for some
simple examples.

  create table t1 (a int, b int, c int);
  create table t2 (a int, b int, c int);

  insert into t1 select mod(i,50), mod(i,50), mod(i,50)
from generate_series(1,1000) s(i);

  insert into t2 select mod(i,50), mod(i,50), mod(i,50)
from generate_series(1,1000) s(i);

  analyze t1, t2;

First, without extended stats (just the first line of explain analyze,
to keep the message short):

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=31.00..106.00 rows=400 width=24) (actual time=5.426..22.678 rows=2 loops=1)


explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=28.50..160.75 rows=1 width=24) (actual time=5.325..19.760 rows=1 loops=1)


explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=24.50..104.75 rows=4800 width=24) (actual time=5.618..5.639 rows=0 loops=1)


Now, let's create statistics:

  create statistics s1 on a, b, c from t1 ;
  create statistics s2 on a, b, c from t2 ;
  analyze t1, t2;

and now the same queries again:

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=31.00..106.00 rows=2 width=24) (actual time=5.448..22.713 rows=2 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=28.50..160.75 rows=1 width=24) (actual time=5.317..16.680 rows=1 loops=1)


explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=24.50..104.75 rows=1 width=24) (actual time=5.647..5.667 rows=0 loops=1)


Those examples are a bit simplistic, but the improvements are fairly
clear I think.


limitations & open issues
=

Let's talk about the main general restrictions and open issues in the
current patch that I can think of at the moment.

1) statistics covering all join clauses

The patch requires the statistics to cover all the join clauses, mostly
because it simplifies the implementation. This means that to use the
per-column statistics, there has to be just a single join clause.

AFAICS this could be relaxed and we could use multiple statistics to
estimate the clauses. But it'd make selection of statistics much more
complicated, because we have to pick "matching" statistics on both sides
of the join. So it seems like overkill, and most joins have very few
conditions anyway.


2) only equality join clauses

The other restriction is that at the moment this only supports simple
equality clauses, combined with AND. So for example this is supported

   t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1))

while these are not:

   t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1))

   t1 JOIN t2 ON ((t1.a - t2.a = 0) AND 

Re: Role Self-Administration

2021-10-06 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Oct 6, 2021 at 2:48 PM Stephen Frost  wrote:
> > What I'm saying above is that the command explicitly listed there
> > *isn't* 'DROP ROLE A DB', even though that is something which the spec
> > *could* have done, had they wished to.  Given that they didn't, it seems
> > very clear that making such a change would very much be a deviation and
> > violation of the spec.  That we invented some behind-the-scenes concept
> > of role ownership where we track who actually created what role and then
> > use that info to transform a REVOKE into a DROP doesn't make such a
> > transformation OK.
> 
> If PostgreSQL implements extensions to the SQL specification, then we
> get to decide how those features interact with the features that are
> specified.

Does that mean that we also get to change what a specific set of
commands, which are all well-defined in the standard, do even when that
goes against what an SQL compliant implementation would do?  I really
don't think so.  If this was *new* syntax to go along with some new
feature or extension in PG, sure, we can define what that syntax does
because the standard doesn't.  In this case we're talking entirely about
objects and statements which the standard does define.

> For example, I presume the spec doesn't say that you can drop a
> function by dropping the extension that contains it, but that's just
> because extensions as we have them in PostgreSQL are not part of the
> SQL standard. It would be silly to have rejected that feature on those
> grounds, because nobody is forced to use extensions, and if you don't,
> then they do not cause any deviation from spec-mandated behavior.

The prior example that I used didn't include *any* non-SQL standard
statements, so I don't view this argument as applicable.

> In the same way, nobody would be forced to make a role own another
> role, and if you don't, then you'll never notice any deviation from
> spec-mandated behavior on account of that feature.

So you're suggesting that roles created by other roles wouldn't
*automatically* by owned by the creating role and that, instead, someone
would have to explicitly say something like:

ALTER ROLE x OWNED BY y;

after the role is created, and only then would a DROP ROLE y CASCADE;
turn into DROP ROLE x CASCADE; DROP ROLE y CASCADE; and that, absent
that happening, a DROP ROLE y CASCADE; would do what the standard says,
and not actually DROP all the associated objects but only run the REVOKE
statements?

I'll accept that, in such a case, we could argue that we're no longer
following the spec because the user has started to use some PG extension
to the spec, but, I've got a really hard time seeing how such a massive
difference in what DROP ROLE x CASCADE; does would be acceptable or at
all reasonable.

One could lead to hundreds of tables being dropped out of the database
and a massive outage while the other would just mean some role
memberships get cleaned up as part of a role being dropped.  Having one
command that does two vastly different things like that is a massive,
loaded, foot-pointed gun.
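
To sketch that flow (all of this is hypothetical syntax, to be clear):

  CREATE ROLE x;
  ALTER ROLE x OWNED BY y;  -- assumed new syntax, explicit opt-in
  DROP ROLE y CASCADE;      -- under the proposal, this would also drop
                            -- x; without the ALTER, only the grants
                            -- involving y are revoked, per the spec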

> If the SQL specification says that roles can own other roles, but that
> DROP has to have some special behavior in regards to that feature,
> then we should probably try to do what the spec says. But if the spec
> doesn't think that the concept of roles owning other roles even
> exists, but we choose to invent such a concept, then I think we can
> make it work however we like without worrying about
> spec-compatibility. We've already invented lots of other things like
> that, and the project is the better for it.

The SQL spec doesn't say that roles can own other roles.  I don't think
that means we get to rewrite what DROP ROLE ... CASCADE does.  Extend
DROP ROLE with other parameters which are relevant to our extension of
the spec?  Sure, perhaps, but not entirely redefine what the base
command does to be different from what the SQL spec says it does.

Thanks,

Stephen




Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Peter Geoghegan
On Wed, Oct 6, 2021 at 11:55 AM Peter Geoghegan  wrote:
> I am pretty sure that I agree with you about all these details. We
> need to tease them apart some more.

I think that what I've said boils down to this:

* pg_amcheck shouldn't attempt to verify temp relations, on the
grounds that this is fundamentally not useful, and not something that
could ever be sensibly interpreted as "just doing what the user asked
for".

* pg_amcheck calls to bt_index_check()/bt_index_parent_check() must
only be made with "i.indisready AND i.indisvalid" indexes, just like
the old query from the docs. (Actually, the same query also filters
out temp relations -- which is why I view this issue as almost
identical to the first.)

Why would the user ask for something that fundamentally doesn't make
any sense? The argument "that's just what they asked for" has it
backwards, because *not* asking for it is very difficult, while asking
for it (which, remember, fundamentally makes no sense) is very easy.
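
(For reference, the old query from the docs went roughly like this,
from memory rather than verbatim:

  SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique)
  FROM pg_index i
  JOIN pg_opclass op ON i.indclass[0] = op.oid
  JOIN pg_am am ON op.opcmethod = am.oid
  JOIN pg_class c ON i.indexrelid = c.oid
  JOIN pg_namespace n ON c.relnamespace = n.oid
  WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
  -- Don't check temp tables, which may be from another session:
  AND c.relpersistence != 't'
  -- Function may throw an error when this is omitted:
  AND c.relkind = 'i' AND i.indisready AND i.indisvalid
  ORDER BY c.relpages DESC LIMIT 10;
)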

* --parent-check can and should fail in hot standby mode.

The argument "that's just what the user asked for" works perfectly here.

-- 
Peter Geoghegan




Re: storing an explicit nonce

2021-10-06 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Tue, Oct  5, 2021 at 04:29:25PM -0400, Bruce Momjian wrote:
> > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian  wrote:
> > > We are still working on our TDE patch. Right now the focus is on 
> > > refactoring
> > > temporary file access to make the TDE patch itself smaller. Reconsidering
> > > encryption mode choices given concerns expressed is next. Currently a 
> > > viable
> > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
> > > issue with predictable IV and isn't totally broken in case of IV reuse.
> > 
> > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> > 16-byte blocks affect later blocks, meaning that hint bit changes would
> > also affect later blocks.  I think this means we would need to write WAL
> > full page images for hint bit changes to avoid torn pages.  Right now
> > hint bit (single bit) changes can be lost without causing torn pages. 
> > This was another of the advantages of using a stream cipher like CTR.
> 
> Another problem caused by block mode ciphers is that to use the LSN as
> part of the nonce, the LSN must not be encrypted, but you then have to
> find a 16-byte block in the page that you don't need to encrypt.

With AES-XTS, we don't need to use the LSN as part of the nonce though,
so I don't think this argument is actually valid..?  As discussed
previously regarding AES-XTS, the general idea was to use the path to
the file and the filename itself plus the block number as the IV, and
that works fine for XTS because it's ok to reuse it (unlike with CTR).

Thanks,

Stephen




Re: Role Self-Administration

2021-10-06 Thread Robert Haas
On Wed, Oct 6, 2021 at 2:48 PM Stephen Frost  wrote:
> What I'm saying above is that the command explicitly listed there
> *isn't* 'DROP ROLE A DB', even though that is something which the spec
> *could* have done, had they wished to.  Given that they didn't, it seems
> very clear that making such a change would very much be a deviation and
> violation of the spec.  That we invented some behind-the-scenes concept
> of role ownership where we track who actually created what role and then
> use that info to transform a REVOKE into a DROP doesn't make such a
> transformation OK.

If PostgreSQL implements extensions to the SQL specification, then we
get to decide how those features interact with the features that are
specified.

For example, I presume the spec doesn't say that you can drop a
function by dropping the extension that contains it, but that's just
because extensions as we have them in PostgreSQL are not part of the
SQL standard. It would be silly to have rejected that feature on those
grounds, because nobody is forced to use extensions, and if you don't,
then they do not cause any deviation from spec-mandated behavior.

In the same way, nobody would be forced to make a role own another
role, and if you don't, then you'll never notice any deviation from
spec-mandated behavior on account of that feature.

If the SQL specification says that roles can own other roles, but that
DROP has to have some special behavior in regards to that feature,
then we should probably try to do what the spec says. But if the spec
doesn't think that the concept of roles owning other roles even
exists, but we choose to invent such a concept, then I think we can
make it work however we like without worrying about
spec-compatibility. We've already invented lots of other things like
that, and the project is the better for it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Peter Geoghegan
On Wed, Oct 6, 2021 at 11:32 AM Robert Haas  wrote:
> All of the decisions we're talking about here really have to do with
> determining the user's intent. I think that if the user says
> pg_amcheck --all, there's a good argument that they don't want us to
> check unlogged relations on a standby which will never be valid, or
> failed index builds which need not be valid. But even that is not
> necessarily true. If the user typed pg_amcheck -i
> some_index_that_failed_to_build, there is a pretty strong argument
> that they want us to check that index and maybe fail, not skip
> checking that index and report success without doing anything. I think
> it's reasonable to accept that unfortunate deviation from the user's
> intent in order to get the benefit of not failing for silly reasons
> when, as will normally be the case, somebody just tries to check the
> entire database, or some subset of tables and their corresponding
> indexes. In those cases the user pretty clearly only wants to check
> the valid things. So I agree, with some reservations, that excluding
> unlogged relations while in recovery and invalid indexes is probably
> the thing which is most likely to give the users what they want.
>
> But how can we possibly say that a user who specifies --heapallindexed
> doesn't really mean what they said?

I am pretty sure that I agree with you about all these details. We
need to tease them apart some more.

--heapallindexed doesn't complicate things for us at all. It changes
nothing about the locking considerations. It's just an additive thing,
some extra checks with the same basic underlying requirements. Maybe
you meant to say --parent-check, not --heapallindexed?

--parent-check does present us with the question of what to do in Hot
Standby mode, where it will surely fail (because it requires a
relation level ShareLock, etc). But I actually don't think it's
complicated: we must throw an error, because it's fundamentally not
something that will ever work (with any index). Whether the error
comes from pg_amcheck or amcheck proper doesn't seem important to me.

I think it's pretty clear that verify_heapam.c (from amcheck proper)
should just follow verify_nbtree.c when directly invoked against an
unlogged relation in Hot Standby. That is, it should assume that the
relation has no storage, but still "verify" it conceptually. Just show
a NOTICE about it. Assume no storage to verify.
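
In SQL terms, the nbtree behavior I mean is roughly this (index name
made up, message text approximate):

  SELECT bt_index_check('some_unlogged_idx'::regclass);
  -- NOTICE:  cannot verify unlogged index "some_unlogged_idx" during
  -- recovery, skipping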

Finally, there is the question of how pg_amcheck (not amcheck proper)
deals with unlogged relations in Hot Standby mode.
There are two reasonable options: it can either "verify" the indexes
(actually just show those NOTICE messages), or skip them entirely. I
lean towards the former option, on the grounds that I don't think it
should be special-cased. But I don't feel very strongly about it.

-- 
Peter Geoghegan




Re: Running tests under valgrind is getting slower at an alarming pace

2021-10-06 Thread Andres Freund
Hi,

On 2021-10-06 12:58:34 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > One thing I think would really help is having the total time for each run
> > visible in an animals run history. That way we could pinpoint regressions
> > reasonably efficiently, right now that's not easily possible without writing
> > nontrivial queries to the buildfarm database...
>
> +1.  I've lost count of how often I've had to drill down to an individual
> run just because I wanted to see how long it took.  If we could fit that
> into the branch history pages like

I queried this in the DB for skink using

select snapshot::date, substring(git_head_ref, 1, 12) as git_rev,
       (select sum(stage_duration)
          from build_status_log_raw bslr
         where bslr.sysname = bsr.sysname
           and bslr.snapshot = bsr.snapshot)
  from build_status_raw bsr
 where branch = 'HEAD'
   and sysname = 'skink'
   and stage = 'OK'
   and snapshot > '2021-01-01'
 order by snapshot desc;

  snapshot  |   git_rev    |   sum
------------+--------------+----------
 2021-10-06 | ec2133a44731 | 12:09:17
 2021-10-05 | 0266e98c6b86 | 10:55:10
 2021-10-03 | 2903f1404df3 | 10:24:11
 2021-09-30 | 20f8671ef69b | 10:31:43
...
 2021-06-14 | 2d689babe3cb | 10:29:07
 2021-06-12 | f452aaf7d4a9 | 10:26:12
 2021-06-11 | d08237b5b494 | 10:50:53
 2021-06-09 | 845cad4d51cb | 10:58:31
 2021-06-08 | eab81953682d | 09:06:35
 2021-06-06 | a2dee328bbe5 | 09:02:36
 2021-06-05 | e6159885b78e | 08:59:14
 2021-06-03 | 187682c32173 | 09:39:07
 2021-06-02 | df466d30c6ca | 09:03:05
 2021-06-03 | 187682c32173 | 09:39:07
 2021-06-02 | df466d30c6ca | 09:03:05
 2021-05-31 | 7c544ecdad81 | 09:09:42
 2021-05-30 | ba356a397de5 | 08:54:29
 2021-05-28 | d69fcb9caef1 | 09:00:36
 2021-05-27 | 388e75ad3348 | 09:39:14
 2021-05-25 | e30e3fdea873 | 08:51:04
 2021-05-24 | 99c5852e20a0 | 08:57:08
...
 2021-03-23 | 1e3e8b51bda8 | 09:19:40
 2021-03-21 | 96ae658e6238 | 08:29:05
 2021-03-20 | 61752afb2640 | 08:15:47
 2021-03-18 | da18d829c281 | 08:34:02
 2021-03-17 | 6b67d72b604c | 09:11:46
 2021-03-15 | 146cb3889c3c | 08:20:21
 2021-03-14 | 58f57490facd | 08:06:07
 2021-03-12 | d60e61de4fb4 | 08:41:12
 2021-03-11 | 3f0daeb02f8d | 08:04:44
 2021-03-08 | 8a812e5106c5 | 08:46:01
 2021-03-07 | f9a0392e1cf3 | 08:01:47
 2021-03-05 | 0ce4cd04da55 | 08:01:32
 2021-03-04 | 040af779382e | 07:56:31
 2021-03-02 | 5b2f2af3d9d5 | 08:20:50
 2021-03-01 | f5a5773a9dc4 | 07:59:14
...
 2021-01-02 | 4d3f03f42227 | 08:14:41
 2021-01-01 | 32d6287d2eef | 07:31:56

It's not too surprising that 2021-10-06 is slower; yesterday I changed things
so that more valgrind runs are done in parallel (increasing individual test
times, but still getting results faster than testing 1-by-1).


I don't see anything immediately suspicious for the slowdowns around
eab81953682d. Perhaps there was a system update at that time causing
changes. Unfortunately I don't have logs from back then anymore. OTOH, I don't
see a clear slowdown in 13, 12 around that time.

Greetings,

Andres Freund




Re: Role Self-Administration

2021-10-06 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Oct 6, 2021, at 11:09 AM, Stephen Frost  wrote:
> > After all, it says
> > "REOKVE R FROM A DB", not "DROP ROLE A CASCADE". 
> 
> Wait, are you arguing what DROP ROLE A CASCADE should do based on what the 
> spec says REVOKE R FROM A DB should do?  If so, I'd say that's irrelevant.  
> I'm not proposing to change what REVOKE does.  If not, could you clarify?  
> Did I misunderstand?

No, that's not what I'm saying.

In the spec, under , there is a 'General Rules'
section (as there is with most statements) and in that section it says
that for every authorization identifier (that is, some privilege, be it
a GRANT of SELECT rights on an object, or GRANT of role membership in
some role) which references the role being dropped, the command:

REVOKE R FROM A DB

is effectively executed (without further access rule checking).

What I'm saying above is that the command explicitly listed there
*isn't* 'DROP ROLE A DB', even though that is something which the spec
*could* have done, had they wished to.  Given that they didn't, it seems
very clear that making such a change would very much be a deviation and
violation of the spec.  That we invented some behind-the-scenes concept
of role ownership where we track who actually created what role and then
use that info to transform a REVOKE into a DROP doesn't make such a
transformation OK.

Consider that with what you're proposing, a user could execute the
following series of entirely SQL-spec compliant statements, and get
very different results depending on whether we have this 'ownership'
concept or not:

SET ROLE postgres;
CREATE ROLE r1;

SET ROLE r1;
CREATE ROLE r2;

SET ROLE postgres;
DROP ROLE r1 CASCADE;

With what you're suggesting, the end result would be that r2 no longer
exists, whereas with the spec-defined behavior, r2 *would* still exist.
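
Spelled out, the two outcomes look something like this (a sketch):

  -- spec-defined CASCADE: grants referencing r1 are implicitly revoked
  -- (effectively REVOKE r1 FROM <grantee> for each one), and then:
  DROP ROLE r1;  -- r2 still exists afterwards

  -- ownership-based proposal: the same command would also imply:
  -- DROP ROLE r2;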

If that doesn't make it clear enough then I'm afraid you'll just need to
either acquire a copy of the spec and point out what I'm
misunderstanding in it (or get someone who has access to it to do so), or
accept that we need to use some other syntax for this capability.  I
don't think it's unreasonable to have different syntax for this,
particularly as it's a concept that doesn't even exist in the standard
(as far as I can tell, anyway).  Adopting SQL-defined syntax to use with
a concept that the standard doesn't have sure seems like a violation of
the POLA.

If you feel really strongly that this must be part of DROP ROLE then
maybe we could do something like:

DROP ROLE r1 CASCADE OWNED ROLES;

or come up with something else, but just changing what DROP ROLE ..
CASCADE is defined by the spec to do isn't the right approach, imv.

Thanks,

Stephen




Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Robert Haas
On Wed, Oct 6, 2021 at 1:57 PM Mark Dilger  wrote:
> > To me #2 sounds like a tautology. It could almost be phrased as
> > "pg_amcheck does not check the things that it cannot check".
>
> I totally disagree.  It is uncomfortable to me that `pg_amcheck 
> --parent-check` will now silently not perform the parent check that was 
> explicitly requested.  That reported an error before, and now it just 
> downgrades the check.  This is hardly tautological.  I'm only willing to post 
> a patch with that change because I can see a practical argument that somebody 
> might run that as a cron job and they don't want the cron job failing when 
> the database happens to go into recovery.  But again, not at all tautological.

Yeah, I don't think that's OK. -1 from me on making any such change.
If I say pg_amcheck --heapallindexed, I expect it to pass
heapallindexed = true to bt_index_check(). I don't expect it to make a
decision internally whether I really meant it when I said I wanted
--heapallindexed checking.

All of the decisions we're talking about here really have to do with
determining the user's intent. I think that if the user says
pg_amcheck --all, there's a good argument that they don't want us to
check unlogged relations on a standby which will never be valid, or
failed index builds which need not be valid. But even that is not
necessarily true. If the user typed pg_amcheck -i
some_index_that_failed_to_build, there is a pretty strong argument
that they want us to check that index and maybe fail, not skip
checking that index and report success without doing anything. I think
it's reasonable to accept that unfortunate deviation from the user's
intent in order to get the benefit of not failing for silly reasons
when, as will normally be the case, somebody just tries to check the
entire database, or some subset of tables and their corresponding
indexes. In those cases the user pretty clearly only wants to check
the valid things. So I agree, with some reservations, that excluding
unlogged relations while in recovery and invalid indexes is probably
the thing which is most likely to give the users what they want.

But how can we possibly say that a user who specifies --heapallindexed
doesn't really mean what they said?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Role Self-Administration

2021-10-06 Thread Mark Dilger



> On Oct 6, 2021, at 11:09 AM, Stephen Frost  wrote:
> 
> After all, it says
> "REOKVE R FROM A DB", not "DROP ROLE A CASCADE". 

Wait, are you arguing what DROP ROLE A CASCADE should do based on what the spec 
says REVOKE R FROM A DB should do?  If so, I'd say that's irrelevant.  I'm not 
proposing to change what REVOKE does.  If not, could you clarify?  Did I 
misunderstand?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Peter Geoghegan
On Wed, Oct 6, 2021 at 10:57 AM Mark Dilger
 wrote:
> > Clearly pg_amcheck never checked all relations, because it never
> > checked indexes that are not B-Tree indexes. I'm pretty sure that I
> > can poke big holes in almost any positivist statement like that with
> > little effort.
>
> There is a distinction here that you are (intentionally?) failing to 
> acknowledge. On the one hand, there are relation types that cannot be checked 
> because no checking functions for them exist.  (Hash, gin, gist, etc.)  On 
> the other hand, there are relations which could be checked but for the current 
> state of the system, or could be checked in some particular way but for the 
> current state of the system.  One of those has to do with code that doesn't 
> exist, and the other has to do with the state of the system.  I'm only 
> talking about the second.

I specifically acknowledge and reject that distinction. That's my whole point.

Your words were: '--all no longer means "all relations" but rather
"all checkable relations"'. But somehow the original clean definition
of "--all" was made no less clean by not including GiST indexes and so
on from the start. You're asking me to believe that it was really
implied all along that "all checkable relations" didn't include the
relations that obviously weren't checkable. You're probably going to
have to keep making post-hoc amendments to your original statement
like this.

Obviously the gap in functionality from non-standard index AMs is far
more important than the totally theoretical issue with failed
CONCURRENTLY indexes. But even if they were equally important, your
emphasis on the latter would still be arbitrary.

> I totally disagree.  It is uncomfortable to me that `pg_amcheck 
> --parent-check` will now silently not perform the parent check that was 
> explicitly requested.

But the whole definition of "check that was explicitly requested"
relies on your existing understanding of what pg_amcheck is supposed
to do. That's not actually essential. I don't see it that way, for
example.

-- 
Peter Geoghegan




Re: Role Self-Administration

2021-10-06 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Oct 6, 2021, at 10:20 AM, Stephen Frost  wrote:
> > 
> > Consistency is not having dangling pointers around to things which no
> > longer exist- FK reference kind of things.  Object management is about
> > actual *removal* of full blown objects like roles, tables, etc.  DROP
> > TABLE ... CASCADE doesn't drop tables which have an FK dependency on
> > the dropped table, the FK is just removed.
> 
> Right, but DROP SCHEMA ... CASCADE does remove the tables within, no?  I 
> would see alice being a member of role bob as being analogous to the foreign 
> key example, and charlie being owned by bob as being more like the table 
> within a schema.

Objects aren't able to live outside of a schema, so it doesn't seem to
be quite the same case there.  Further, DROP SCHEMA is defined in the
standard as saying:

DROP (TABLE, VIEW, DOMAIN, etc) T CASCADE

> I'm fine with using a different syntax for this if what I'm proposing 
> violates the spec.  I'm just trying to wrap my head around how to interpret 
> the spec (of which I have no copy, mind you.)  I'm trying to distinguish 
> between statements like X SHALL DO Y and X SHALL DO NOTHING BUT Y.  I don't 
> know if the spec contains a concept of roles owning other roles, and if not, 
> does it forbid that concept?  I should think that if that concept is a 
> postgres extension not present in the spec, then we can make it do anything 
> we want.

I do think what you're suggesting is pretty clearly not what the SQL
committee imagined DROP ROLE ... CASCADE to do.  After all, it says
"REOKVE R FROM A DB", not "DROP ROLE A CASCADE".  Unfortunately, more
recent versions of the spec don't seem to be available very easily and
the older draft that I've seen around doesn't have CASCADE on DROP ROLE.
Working with roles, which are defined in the spec, it seems pretty
important to have access to the spec though to see these things.

As far as I can tell, no, there isn't a concept of role 'ownership' in
the spec.  If there was then perhaps things would be different ... but
that's not the case.  I disagree quite strongly that adding such an
extension would allow us to seriously deviate from what the spec says
should happen regarding DROP ROLE ... CASCADE though.  If that argument
held water, we could ignore what the spec says about just about anything
because PG has features that aren't in the spec.

Thanks,

Stephen




Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Mark Dilger



> On Oct 6, 2021, at 10:39 AM, Peter Geoghegan  wrote:
> 
>> The differences in the upcoming version are
>> 
>> 1) --all no longer means "all relations" but rather "all checkable relations"
> 
> Clearly pg_amcheck never checked all relations, because it never
> checked indexes that are not B-Tree indexes. I'm pretty sure that I
> can poke big holes in almost any positivist statement like that with
> little effort.

There is a distinction here that you are (intentionally?) failing to 
acknowledge.  On the one hand, there are relation types that cannot be checked 
because no checking functions for them exist.  (Hash, gin, gist, etc.)  On the 
other hand, there are relations which could be checked but for the current state 
of the system, or could be checked in some particular way but for the current 
state of the system.  One of those has to do with code that doesn't exist, and 
the other has to do with the state of the system.  I'm only talking about the 
second.

> 
>> 2) checking options should be automatically downgraded under circumstances 
>> where they cannot be applied
>> 3) unlogged relations during replication are by definition not corrupt
>> 
>> I think #1 and #3 are unsurprising enough that they need no documentation 
>> update.  #2 is slightly surprising (at least to me) so I updated the docs 
>> for it.
> 
> To me #2 sounds like a tautology. It could almost be phrased as
> "pg_amcheck does not check the things that it cannot check".

I totally disagree.  It is uncomfortable to me that `pg_amcheck --parent-check` 
will now silently not perform the parent check that was explicitly requested.  
That reported an error before, and now it just downgrades the check.  This is 
hardly tautological.  I'm only willing to post a patch with that change because 
I can see a practical argument that somebody might run that as a cron job and 
they don't want the cron job failing when the database happens to go into 
recovery.  But again, not at all tautological.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Peter Geoghegan
On Wed, Oct 6, 2021 at 10:19 AM Mark Dilger
 wrote:
> > A return value of 0 cannot be said to indicate that the database is
> > not corrupt;
>
> Nor can a non-zero value be said to indicate that the database is corrupt.

I never said otherwise. I think it's perfectly fine that there are
multiple non-zero return values. It's totally unrelated.

> I'm not sure how the idea that pg_amcheck should never give back a failure 
> code unless there is corruption got inserted into this thread, but I'm not on 
> board with that as an invariant statement.

I agree; I'm also not on board with it as an invariant statement.

> The differences in the upcoming version are
>
> 1) --all no longer means "all relations" but rather "all checkable relations"

Clearly pg_amcheck never checked all relations, because it never
checked indexes that are not B-Tree indexes. I'm pretty sure that I
can poke big holes in almost any positivist statement like that with
little effort.

> 2) checking options should be automatically downgraded under circumstances 
> where they cannot be applied
> 3) unlogged relations during replication are by definition not corrupt
>
> I think #1 and #3 are unsurprising enough that they need no documentation 
> update.  #2 is slightly surprising (at least to me) so I updated the docs for 
> it.

To me #2 sounds like a tautology. It could almost be phrased as
"pg_amcheck does not check the things that it cannot check".

-- 
Peter Geoghegan




Re: Role Self-Administration

2021-10-06 Thread Mark Dilger



> On Oct 6, 2021, at 10:20 AM, Stephen Frost  wrote:
> 
> Consistency is not having dangling pointers around to things which no
> longer exist- FK reference kind of things.  Object management is about
> actual *removal* of full blown objects like roles, tables, etc.  DROP
> TABLE ... CASCADE doesn't drop tables which have an FK dependency on
> the dropped table, the FK is just removed.

Right, but DROP SCHEMA ... CASCADE does remove the tables within, no?  I would 
see alice being a member of role bob as being analogous to the foreign key 
example, and charlie being owned by bob as being more like the table within a 
schema.

I'm fine with using a different syntax for this if what I'm proposing violates 
the spec.  I'm just trying to wrap my head around how to interpret the spec (of 
which I have no copy, mind you.)  I'm trying to distinguish between statements 
like X SHALL DO Y and X SHALL DO NOTHING BUT Y.  I don't know if the spec 
contains a concept of roles owning other roles, and if not, does it forbid that 
concept?  I should think that if that concept is a postgres extension not 
present in the spec, then we can make it do anything we want.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2021-10-06 Thread Alvaro Herrera
On 2021-Oct-06, Jeremy Schneider wrote:

> Well this whole conversation is just theoretical anyway until the code
> is shared.  :)  But if Bharath is writing functions to decode WAL, then
> wouldn't we just have pg_waldump use these same functions in order to
> avoid duplicating code?

Actually, a lot of the code is already shared, since the rmgrdesc
routines are in src/backend.  Keep in mind that it was there before
pg_xlogdump existed, to support WAL_DEBUG.  When pg_xlogdump was added,
what we did was allow that backend-only code to be compiled in a frontend
environment.  Also, we already have xlogreader.

So pg_waldump itself is mostly scaffolding to let the frontend
environment get argument values to pass to backend-enabled code.  The
only really interesting, novel thing is the --stats mode ... and I bet
you can write that with some SQL-level aggregation of the raw data, no
need for any C code.
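
For example, assuming the extension exposes a set-returning function
along the lines of pg_get_wal_records_info(start_lsn, end_lsn) (the
name and columns here are just my guess), --stats becomes something
like:

  SELECT resource_manager,
         count(*)           AS n_records,
         sum(record_length) AS record_bytes,
         sum(fpi_length)    AS fpi_bytes
  FROM pg_get_wal_records_info('0/1000000', '0/2000000')
  GROUP BY resource_manager
  ORDER BY record_bytes DESC;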

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: Role Self-Administration

2021-10-06 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Oct 6, 2021, at 9:01 AM, Stephen Frost  wrote:
> > I can see how what you describe as the behavior you'd like to see of
> > DROP ROLE ... CASCADE could be useful...  However, at least in the
> > latest version of the standard that I'm looking at, when a
> > DROP ROLE ...  CASCADE is executed, what happens for all authorization
> > identifiers is:
> > 
> > REVOKE R FROM A DB
> > 
> > Where R is the role being dropped and A is the authorization identifier.
> 
> I'm not proposing that all roles with membership in bob be dropped when role 
> bob is dropped.  I'm proposing that all roles *owned by* role bob also be 
> dropped.  Postgres doesn't currently have a concept of roles owning other 
> roles, but I'm proposing that we add such a concept.  Of course, any role 
> with membership in role bob would no longer have that membership, and any 
> role managed by bob would no longer be managed by bob.  The CASCADE would 
> not drop those other roles merely due to membership or management 
> relationships.

I get all of that ... but you're also talking about changing the
behavior of something which is defined pretty clearly in the standard to
be something that's very different from what the standard says.

> > In other words, the SQL committee seems to disagree with you when it
> > comes to what CASCADE on DROP ROLE means (though I can't say I'm too
> > surprised- generally speaking, CASCADE is about getting rid of the
> > dependency so the system stays consistent, not as a method of object
> > management...).
> 
> I'm not sure I understand how what they are saying disagrees with what I am 
> saying, unless they are saying that REVOKE R FROM A DB is the one and only 
> thing that DROP ROLE .. CASCADE can do.  If they are excluding it from doing 
> anything else, then yes, that would be an incompatibility.

That is exactly what DROP ROLE ... CASCADE is defined in the standard to
do.  That definition covers not just permissions on objects but also
permissions on roles.  To take that and turn it into a DROP ROLE for
roles looks like a *very* clear and serious deviation from the standard.

If we were to go down this road, I'd suggest we have some *other* syntax
that isn't defined by the standard to do something else.  eg:

DROP ROLES OWNED BY R;

or something along those lines.  I'm not saying that your idea is
without merit or that it wouldn't be useful, I'm just trying to make it
clear that the standard already says what DROP ROLE .. CASCADE means and
we should be loath to deviate very far from that.

> As far as keeping the system consistent, I think that's what this does.  As 
> soon as a role is defined as owning other stuff, then dropping the role 
> cascade means dropping the other stuff.
> 
> Could you elaborate more on the difference between object management and 
> consistency as it applies to this issue?

Consistency is not having dangling pointers around to things which no
longer exist- FK reference kind of things.  Object management is about
actual *removal* of full blown objects like roles, tables, etc.  DROP
TABLE ... CASCADE doesn't drop tables which have an FK dependency on
the dropped table; the FK is just removed.

Thanks,

Stephen




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2021-10-06 Thread Bruce Momjian
On Wed, Oct  6, 2021 at 09:56:34AM -0700, Jeremy Schneider wrote:
> On 10/5/21 17:43, Bruce Momjian wrote:
> > On Tue, Oct  5, 2021 at 03:30:07PM -0700, Jeremy Schneider wrote:
> >> Specifically exposing pg_waldump functionality in SQL could speed up
> >> finding bugs in the PG logical replication code. We found and fixed a
> >> few over this past year, but there are more lurking out there.
> > 
> > Uh, why is pg_waldump more important than other command line tool
> > information?
> 
> Going down the list of all other utilities in src/bin:
> 
> * pg_amcheck, pg_config, pg_controldata: already available in SQL
> * psql, pgbench, pg_dump: already available as client-side apps
> * initdb, pg_archivecleanup, pg_basebackup, pg_checksums, pg_ctl,
> pg_resetwal, pg_rewind, pg_upgrade, pg_verifybackup: can't think of any
> possible use case outside server OS access, most of these are too low
> level and don't even make sense in SQL
> * pg_test_fsync, pg_test_timing: marginally interesting ideas in SQL,
> don't feel any deep interest myself
> 
> Speaking selfishly, there are a few reasons I would be specifically
> interested in pg_waldump (the only remaining one on the list).

This is the analysis I was looking for to understand if copying the
features of command-line tools in extensions was a wise direction.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Mark Dilger



> On Oct 6, 2021, at 10:16 AM, Peter Geoghegan  wrote:
> 
> A return value of 0 cannot be said to indicate that the database is
> not corrupt;

Nor can a non-zero value be said to indicate that the database is corrupt.

These invocations will still return a non-zero exit status:

pg_amcheck -D no_privs_database
pg_amcheck --index="no_such_index"
pg_amcheck --table="somebody_elses_temp_table"
pg_amcheck --index="somebody_elses_temp_index"

but these have been modified to no longer do so:

pg_amcheck -D my_database_in_recovery --parent-check
pg_amcheck -D my_database_in_recovery --heapallindexed
pg_amcheck --all

Please compare to:

find /private || echo "FAIL"
rm /not/my/file || echo "FAIL"

I'm not sure how the idea that pg_amcheck should never give back a failure code 
unless there is corruption got inserted into this thread, but I'm not on board 
with that as an invariant statement.  The differences in the upcoming version 
are

1) --all no longer means "all relations" but rather "all checkable relations"
2) checking options should be automatically downgraded under circumstances 
where they cannot be applied
3) unlogged relations during replication are by definition not corrupt

I think #1 and #3 are unsurprising enough that they need no documentation 
update.  #2 is slightly surprising (at least to me) so I updated the docs for 
it.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Peter Geoghegan
On Wed, Oct 6, 2021 at 9:25 AM Mark Dilger  wrote:
> Thanks for reviewing!
>
> I expect to post a new version shortly.

Not sure how much it matters, but I have some thoughts on the return
value of pg_amcheck. (I'm mostly going into this now because it seems
related to how we discuss these issues generally.)

A return value of 0 cannot be said to indicate that the database is
not corrupt; strictly speaking the verification process doesn't
actually verify anything. The null hypothesis is that the database
isn't corrupt. pg_amcheck looks for disconfirmatory evidence (evidence
of corruption) on a best-effort basis. This seems fundamental.

If this philosophy of science stuff seems too abstract, then I can be
more concrete: pg_amcheck doesn't even attempt to verify indexes that
aren't B-Tree indexes. Clearly we cannot be sure that the database
contains no corruption when there happens to be even one such index.
And yet the return value from pg_amcheck is still 0 (barring problems
elsewhere). I think that it'll always be possible to make *some*
argument like that, even in a world where pg_amcheck + amcheck are
very feature complete. As I said, it seems fundamental.

-- 
Peter Geoghegan




Re: Running tests under valgrind is getting slower at an alarming pace

2021-10-06 Thread Robert Haas
On Wed, Oct 6, 2021 at 12:47 PM Andres Freund  wrote:
> There's probably some of that.
>
> The fact that the tap test infrastructure does all communication with the
> server via psql invocations that each execute only a single query is a
> problem - connection startup is expensive.

Agreed. safe_psql() is a poor-quality interface. I've been surprised in
the past that we were relying on something so primitive.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: storing an explicit nonce

2021-10-06 Thread Bruce Momjian
On Wed, Oct  6, 2021 at 11:17:59AM -0400, Robert Haas wrote:
> If you enable checksums or set wal_log_hints=on, then you might incur
> some write-ahead log records that would otherwise be avoided, and
> those records will include full page images. This can happen once per
> page per checkpoint cycle. However, if the first modification to a
> particular page within a given checkpoint cycle is a regular
> WAL-logged operation rather than a hint bit change, then the extra WAL
> record and full-page image are not needed so the overhead is zero.
> Also, if the first modification is a hint bit change, and then the
> page is evicted, prompting a full page write, but a regular WAL-logged
> operation occurs later within the same checkpoint, the later operation
> no longer needs a full page write. So you still paid the cost of an
> extra WAL record, but you didn't pay the cost of an extra full page
> write. In other words, enabling checksums or turning wal_log_hints=on
> has a relatively low cost except when you have pages that incur only
> hint-type changes, and no regular changes, within the course of a
> single checkpoint cycle.
> 
> On the other hand, in order to avoid IV reuse, your patch needed to
> bump the page LSN for every change, or at least for every eviction.
> That means you could potentially incur the overhead of an extra full
> page write multiple times per checkpoint cycle, and even if there were
> non-hint changes to that page in the same checkpoint cycle. Now you
> could say, well, let's not bump the page LSN for every hint-type
> change, and then your patch would have lower overhead than an approach
> based on XTS, but I think that also loses a ton of security, because
> now you're reusing IVs with an encryption system that is documented
> not to tolerate the reuse of IVs.
> 
> I'm not here to try to pretend that encryption is going to be cheap. I
> just don't believe this particular argument about why AES-XTS should
> be more expensive.

OK, good to know.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: storing an explicit nonce

2021-10-06 Thread Bruce Momjian
On Wed, Oct  6, 2021 at 12:54:49PM -0400, Bruce Momjian wrote:
> On Wed, Oct  6, 2021 at 11:01:25AM -0400, Robert Haas wrote:
> > On Tue, Oct 5, 2021 at 4:29 PM Bruce Momjian  wrote:
> > > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian  wrote:
> > > > We are still working on our TDE patch. Right now the focus is on 
> > > > refactoring
> > > > temporary file access to make the TDE patch itself smaller. 
> > > > Reconsidering
> > > > encryption mode choices given concerns expressed is next. Currently a 
> > > > viable
> > > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have 
> > > > an
> > > > issue with predictable IV and isn't totally broken in case of IV reuse.
> > >
> > > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> > > 16-byte blocks affect later blocks, meaning that hint bit changes would
> > > also affect later blocks.  I think this means we would need to write WAL
> > > full page images for hint bit changes to avoid torn pages.  Right now
> > > hint bit (single bit) changes can be lost without causing torn pages.
> > > This was another of the advantages of using a stream cipher like CTR.
> > 
> > This seems wrong to me. CTR requires that you not reuse the IV. If you
> > re-encrypt the page with a different IV, torn pages are a problem. If
> > you re-encrypt it with the same IV, then it's not secure any more.
> 
> We were not changing the IV for hint bit changes, meaning the hint bit
> changes were visible if you compared the blocks.

Oops, I was wrong above, and my patch docs prove it:

Hint Bits
- - - - -

For hint bit changes, the LSN normally doesn't change, which is a
problem.  By enabling wal_log_hints, you get full page writes to the WAL
after the first hint bit change of the checkpoint.  This is useful for
two reasons.  First, it generates a new LSN, which is needed for the IV
to be secure.  Second, full page images protect against torn pages,
which is an even bigger requirement for encryption because the new LSN
is re-encrypting the entire page, not just the hint bit changes.  You
can safely lose the hint bit changes, but you need to use the same LSN
to decrypt the entire page, so a torn page with an LSN change cannot be
decrypted.  To prevent this, wal_log_hints guarantees that the
pre-hint-bit version (and previous LSN version) of the page is restored.

However, if a hint-bit-modified page is written to the file system
during a checkpoint, and there is a later hint bit change switching the
same page from clean to dirty during the same checkpoint, we need a new
LSN, and wal_log_hints doesn't give us a new LSN here.  The fix for this
is to update the page LSN by writing a dummy WAL record via
xloginsert.c::LSNForEncryption() in such cases.

Seems my performance concerns were unfounded.
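
For anyone wanting to observe this, a quick sketch (wal_log_hints is a
restart-only GUC, and pg_stat_wal is new in PG14):

  ALTER SYSTEM SET wal_log_hints = on;  -- requires a server restart
  -- after the restart, wal_fpi counts the extra full page images:
  SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;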

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Running tests under valgrind is getting slower at an alarming pace

2021-10-06 Thread Tom Lane
Andres Freund  writes:
> One thing I think would really help is having the total time for each run
> visible in an animals run history. That way we could pinpoint regressions
> reasonably efficiently, right now that's not easily possible without writing
> nontrivial queries to the buildfarm database...

+1.  I've lost count of how often I've had to drill down to an individual
run just because I wanted to see how long it took.  If we could fit that
into the branch history pages like

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=jay&br=HEAD

it'd be really useful IMO.

Perhaps we could replace "OK" with the total time, so as to avoid making
these tables bigger?  (This presumes that the time for a failed run isn't
so interesting.)

regards, tom lane




Re: [PATCH] Added TRANSFORM FOR for COMMENT tab completion

2021-10-06 Thread Suraj Khamkar
Hello,
I reviewed your patch. At a first glance, I have the below comments.

   1. The below change crosses the 80-character limit in a line. Please
      keep it within the limit:

      -   "LARGE OBJECT", "TABLESPACE", "TEXT SEARCH", "ROLE");
      +   "LARGE OBJECT", "TABLESPACE", "TEXT SEARCH", "TRANSFORM FOR",
          "ROLE");

   2. There are trailing whitespaces after
      COMPLETE_WITH_QUERY(Query_for_list_of_languages);.
      Remove these extra whitespaces:

      surajkhamkar@localhost:tab_comment$ git apply fix_tab_complete_comment.patch
      fix_tab_complete_comment.patch:38: trailing whitespace.
      COMPLETE_WITH_QUERY(Query_for_list_of_languages);
      warning: 1 line adds whitespace errors.


Regards,
Suraj Khamkar

On Wed, Sep 29, 2021 at 2:04 PM katouknl  wrote:

> Hi,
>
> I created a patch for COMMENT tab completion.
> It was missing TRANSFORM FOR where it's supposed to be.
>
> Best wishes,
> Ken Kato


Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2021-10-06 Thread Jeremy Schneider
On 10/5/21 17:43, Bruce Momjian wrote:
> On Tue, Oct  5, 2021 at 03:30:07PM -0700, Jeremy Schneider wrote:
>> Specifically exposing pg_waldump functionality in SQL could speed up
>> finding bugs in the PG logical replication code. We found and fixed a
>> few over this past year, but there are more lurking out there.
> 
> Uh, why is pg_waldump more important than other command line tool
> information?

Going down the list of all other utilities in src/bin:

* pg_amcheck, pg_config, pg_controldata: already available in SQL
* psql, pgbench, pg_dump: already available as client-side apps
* initdb, pg_archivecleanup, pg_basebackup, pg_checksums, pg_ctl,
pg_resetwal, pg_rewind, pg_upgrade, pg_verifybackup: can't think of any
possible use case outside server OS access, most of these are too low
level and don't even make sense in SQL
* pg_test_fsync, pg_test_timing: marginally interesting ideas in SQL,
don't feel any deep interest myself

Speaking selfishly, there are a few reasons I would be specifically
interested in pg_waldump (the only remaining one on the list).

.

First, to better support existing features around logical replication
and decoding.

In particular, it seems inconsistent to me that all the replication
management SQL functions take LSNs as arguments - and yet there's no
SQL-based way to find the LSNs that you are supposed to pass into these
functions.

https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-REPLICATION

Over the past few years I've been pulled in to help several large PG
users who ran into these bugs, and it's very painful - because the only
real remediation is to drop and recreate the replication slot, which
means either re-copying all the data to the downstream system or
figuring out a way to resync it. Some PG users have 3rd party tools like
HVR which can do row-by-row resync (IIUC), but no matter how you slice
it, we're talking about a lot of pain for people replicating large data
sets between multiple systems. In most cases, the only/best option even
with very large tables is to just make a fresh copy of the data - which
can translate to a business outage of hours or even days.

My favorite example is the SQL function pg_replication_slot_advance() -
this could really help PG users find less painful solutions to broken
decoding, however it's not really possible to /know/ an LSN to advance
to without inspecting WAL. ISTM there's a strong use case here for a SQL
interface on WAL inspection.
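
As a concrete sketch of the gap (the function is real, but the slot
name and LSN are made up):

  SELECT pg_replication_slot_advance('my_slot', '0/5D0F4D8'::pg_lsn);
  -- finding a *safe* value for that second argument currently means
  -- running pg_waldump on the server's WAL directory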

.

Second: debugging and troubleshooting logical replication and decoding bugs.

I helped track down a few logical replication bugs and get fixed into
code at postgresql.org this past year. (But I give credit to others who
are much better at C than I am, and who did a lot more work than I did
on these bugs!)

Logical decoding bugs are some of the hardest to fix - because all you
have is a WAL stream, but you don't know the SQL or workload patterns or
PG code paths which created that WAL stream.

Dumping the WAL - knowing which objects and which types of operations
are involved and stats like number of changes or number of
subtransactions - this helps identify which transaction and what SQL in
the application triggered the failure, which can help find short-term
workarounds. Businesses need those short-term workarounds so they can
keep going while we work on finding and fixing bugs, which can take some
time. This is another place where I think a SQL interface to WAL would
be helpful to PG users. Especially the ability to filter and trace a
single transaction through a large number of WAL files in the directory.

.

Third: statistics on WAL - especially full page writes. Giving users the
full power of SQL allows much more sophisticated analysis of the WAL
records. Personally, I'd probably find myself importing all the WAL
stats into the DB anyway because SQL is my preferred data manipulation
language.


>> Having the extension in core is actually the only way to avoid
>> duplicated effort, by having shared code which the pg_dump binary and
>> the extension both wrap or call.
> 
> Uh, aren't you duplicating code by having pg_waldump as a command-line
> tool and an extension?

Well this whole conversation is just theoretical anyway until the code
is shared.  :)  But if Bharath is writing functions to decode WAL, then
wouldn't we just have pg_waldump use these same functions in order to
avoid duplicating code?

Bharath - was some code already posted and I just missed it?

Looking at the proposed API from the initial email, I like that there's
both stats functionality and WAL record inspection functionality
(similar to pg_waldump). I like that there's the ability to pull the
records as raw bytea data; however, I think we're also going to want the
ability in v1 of the patch to decode it (similar to pageinspect
heap_page_item_attrs, etc).

Another feature that might be interesting down the road would be the
ability to provide filtering of WAL records for security purposes. For

Re: storing an explicit nonce

2021-10-06 Thread Bruce Momjian
On Wed, Oct  6, 2021 at 11:01:25AM -0400, Robert Haas wrote:
> On Tue, Oct 5, 2021 at 4:29 PM Bruce Momjian  wrote:
> > On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian  wrote:
> > > We are still working on our TDE patch. Right now the focus is on 
> > > refactoring
> > > temporary file access to make the TDE patch itself smaller. Reconsidering
> > > encryption mode choices given concerns expressed is next. Currently a 
> > > viable
> > > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
> > > issue with predictable IV and isn't totally broken in case of IV reuse.
> >
> > Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> > 16-byte blocks affect later blocks, meaning that hint bit changes would
> > also affect later blocks.  I think this means we would need to write WAL
> > full page images for hint bit changes to avoid torn pages.  Right now
> > hint bit (single bit) changes can be lost without causing torn pages.
> > This was another of the advantages of using a stream cipher like CTR.
> 
> This seems wrong to me. CTR requires that you not reuse the IV. If you
> re-encrypt the page with a different IV, torn pages are a problem. If
> you re-encrypt it with the same IV, then it's not secure any more.

We were not changing the IV for hint bit changes, meaning the hint bit
changes were visible if you compared the blocks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Running tests under valgrind is getting slower at an alarming pace

2021-10-06 Thread Andres Freund
Hi,

On 2021-10-06 12:09:36 -0400, Robert Haas wrote:
> Is the problem here that we're adding a lot of new new test cases? Or
> is the problem that valgrind runs are getting slower for the same
> number of test cases?

I don't know precisely. It's probably a combination of several factors. I do
think we regressed somewhere around valgrind specifically - the leak origin
tracking in older branches seems to work better than in newer branches. But I
don't know if that affects performance.


> If it's taking longer because we have more test cases, I'm honestly
> not sure that's really something we should try to fix.

I'm not arguing for having fewer tests. But clearly, executing them serially
is problematic, when the times are going up like this. Skink is hosted on a
machine with a CPU clocking around ~3.9GHz for most of the test - getting a
faster machine won't help that much. But most of the time only a few cores are
active.

This isn't just a problem with valgrind, the reporting times for other animals
also aren't getting shorter...

It takes my workstation 2min20s to execute check-world in parallel, but > 16min
sequentially. The BF executes tap tests sequentially...


> I mean, I'm sure we have some bad test cases here and there, but overall I
> think we still have too little test coverage, not too much. The recent
> discovery that recovery_end_command had zero test coverage is one fine
> example of that.
> 
> But if we've done something that increases the relative cost of
> valgrind, maybe we can fix that in a centralized way.

There's probably some of that.

The fact that the tap test infrastructure does all communication with the
server via psql invocations that each execute only a single query is a problem -
connection startup is expensive. I think this is particularly a problem for
things like PostgresNode::poll_query_until(), which is also used by
::wait_for_catchup(), because a) process creation is more expensive on
valgrind b) things take longer on valgrind, so we pay that price many more
times.

At the same time increasing the timeout for the poll loop also makes the tests
slower, because all the waits for things that already finished do add up.

I'd guess that at the very least driving individual poll_query_until() via a
psql that's running across queries would reduce this substantially, and
perhaps allow us to reduce the polling time. But there's probably some
nontrivial challenges around recognizing query boundaries :/


Briefly looking at a profile of valgrind, it looks like a lot of the cpu time
is spent doing syscalls related to logging. So far skink had
log_statement=all, log_connections=on, log_disconnections=on - I've turned
them off for the next runs. We'll see if that helps.


I'll also try to figure out how to print a bit more detail about timing for
each tap test; looks like I need to figure out how to pass
PROVE_FLAGS='--timer' through the buildfarm. Shouldn't be too hard.


One thing I think would really help is having the total time for each run
visible in an animals run history. That way we could pinpoint regressions
reasonably efficiently, right now that's not easily possible without writing
nontrivial queries to the buildfarm database...

Greetings,

Andres Freund




Re: Parallel vacuum workers prevent the oldest xmin from advancing

2021-10-06 Thread Bossart, Nathan
On 10/6/21, 12:13 AM, "Masahiko Sawada"  wrote:
> A customer reported that during parallel index vacuum, the oldest xmin
> doesn't advance. Normally, the calculation of oldest xmin
> (ComputeXidHorizons()) ignores xmin/xid of processes having
> PROC_IN_VACUUM flag in MyProc->statusFlags. But since parallel vacuum
> workers don’t set their statusFlags, the xmin of the parallel vacuum
> worker is considered to calculate the oldest xmin. This issue happens
> from PG13 where the parallel vacuum was introduced. I think it's a
> bug.

+1

> To fix it, I thought that we change the create index code and the
> vacuum code so that the individual parallel worker sets its status
> flags according to the leader’s one. But ISTM it’d be better to copy
> the leader’s status flags to workers in ParallelWorkerMain(). I've
> attached a patch for HEAD.

The patch seems reasonable to me.
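
For readers following along, the core of the proposal presumably amounts to
something like this in ParallelWorkerMain() (a minimal sketch, not the actual
patch; the locking details, and the idea that the leader's flags arrive via
the shared parallel state "fps", are assumptions based on the description
above):

    /*
     * Have the parallel worker inherit the leader's status flags, so
     * that e.g. PROC_IN_VACUUM is honored by ComputeXidHorizons().
     */
    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    MyProc->statusFlags = fps->statusFlags;
    ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
    LWLockRelease(ProcArrayLock);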

Nathan



Re: Pre-allocating WAL files

2021-10-06 Thread Bossart, Nathan
On 10/6/21, 5:20 AM, "Maxim Orlov"  wrote:
> We've looked through the code and everything looks good except a few minor
> things:
> 1). Using a dedicated bg worker seems not optimal; it introduces a lot of
> redundant code for a single small action.
> We'd join the initial proposal of Andres to implement it as an extra
> function of the checkpointer.

Thanks for taking a look!

I have been thinking about the right place to put this logic.  My
first thought is that it sounds like something that ought to go in the
WAL writer process, but as Andres noted upthread, that is undesirable
because it'll add latency for asynchronous commits.  The checkpointer
process seems to be another candidate, but I'm not totally sure how
it'll fit in.  My patch works by maintaining a small pool of pre-
allocated segments that is quickly replenished whenever one is used.
If the checkpointer is spending most of its time checkpointing, this
small pool could remain empty for long periods of time.  To keep pre-
allocating WAL while we're checkpointing, perhaps we could add another
function like CheckpointWriteDelay() that is called periodically.
There still might be several operations in CheckPointGuts() that take
a while and leave the segment pool empty, but maybe that's okay for
now.
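
To make that concrete, the shape I have in mind is roughly this (a hedged
sketch only; every name here is invented for illustration):

    /*
     * Called periodically by the checkpointer, both from its idle loop
     * and from within long-running checkpoint phases, much like
     * CheckpointWriteDelay() is today.
     */
    static void
    PreallocWalSegments(void)
    {
        while (NumPreallocatedWalSegments() < wal_prealloc_pool_size)
        {
            if (!PreallocOneWalSegment())
                break;      /* e.g. ENOSPC; retry on the next call */
        }
    }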

Nathan



Re: [Proposal] Global temporary tables

2021-10-06 Thread Andrew Bille
On master with the v54 patches applied the following script leads to a crash:
export
ASAN_OPTIONS=detect_leaks=0:abort_on_error=1:disable_coredump=0:strict_string_checks=1:check_initialization_order=1:strict_init_order=1
initdb -D data
pg_ctl -w -t 5 -D data -l server.log start
psql -c "create global temp table tmp_table_test_statistics(a int); insert
into temp_table_test_statistics values(generate_series(1,10));" &
sleep 1
pg_ctl -w -t 5 -D data -l server.log stop

and I got the error
=
==1022892==ERROR: AddressSanitizer: heap-use-after-free on address
0x6254c640 at pc 0x562435348750 bp 0x7ffee8487e60 sp 0x7ffee8487e50
READ of size 8 at 0x6254c640 thread T0
---

with backtrace:

Core was generated by `postgres: andrew regression [local] INSERT
 '.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7fa8fd008859 in __GI_abort () at abort.c:79
#2  0x56243471eae2 in __sanitizer::Abort() ()
#3  0x56243472968c in __sanitizer::Die() ()
#4  0x56243470ad1c in
__asan::ScopedInErrorReport::~ScopedInErrorReport() ()
#5  0x56243470a793 in __asan::ReportGenericError(unsigned long,
unsigned long, unsigned long, unsigned long, bool, unsigned long, unsigned
int, bool) ()
#6  0x56243470b5db in __asan_report_load8 ()
#7  0x562435348750 in DropRelFileNodesAllBuffers
(smgr_reln=smgr_reln@entry=0x6254c640, nnodes=nnodes@entry=1) at
bufmgr.c:3211
#8  0x5624353ec8a8 in smgrdounlinkall (rels=rels@entry=0x6254c640,
nrels=nrels@entry=1, isRedo=isRedo@entry=false) at smgr.c:397
#9  0x562434aa76e1 in gtt_storage_removeall (code=,
arg=) at storage_gtt.c:726
#10 0x562435371962 in shmem_exit (code=code@entry=1) at ipc.c:236
#11 0x562435371d4f in proc_exit_prepare (code=code@entry=1) at ipc.c:194
#12 0x562435371f74 in proc_exit (code=code@entry=1) at ipc.c:107
#13 0x56243581e35c in errfinish (filename=,
filename@entry=0x562435b800e0 "postgres.c", lineno=lineno@entry=3191,
funcname=funcname@entry=0x562435b836a0 <__func__.26025>
"ProcessInterrupts") at elog.c:666
#14 0x5624353f5f86 in ProcessInterrupts () at postgres.c:3191
#15 0x562434eb26d6 in ExecProjectSet (pstate=0x6253f150) at
nodeProjectSet.c:51
#16 0x562434eaae8e in ExecProcNode (node=0x6253f150) at
../../../src/include/executor/executor.h:257
#17 ExecModifyTable (pstate=0x6253ec98) at nodeModifyTable.c:2429
#18 0x562434df5755 in ExecProcNodeFirst (node=0x6253ec98) at
execProcnode.c:463
#19 0x562434dd678a in ExecProcNode (node=0x6253ec98) at
../../../src/include/executor/executor.h:257
#20 ExecutePlan (estate=estate@entry=0x6253ea20,
planstate=0x6253ec98, use_parallel_mode=,
use_parallel_mode@entry=false, operation=operation@entry=CMD_INSERT,
sendTuples=false, numberTuples=numberTuples@entry=0,
direction=ForwardScanDirection,
dest=0x62545550, execute_once=true) at execMain.c:1555
#21 0x562434dd9867 in standard_ExecutorRun (queryDesc=0x619015a0,
direction=ForwardScanDirection, count=0, execute_once=execute_once@entry=true)
at execMain.c:361
#22 0x562434dd9a83 in ExecutorRun
(queryDesc=queryDesc@entry=0x619015a0,
direction=direction@entry=ForwardScanDirection, count=count@entry=0,
execute_once=execute_once@entry=true) at execMain.c:305
#23 0x562435401be6 in ProcessQuery (plan=plan@entry=0x62545480,
sourceText=0x62505220 "insert into temp_table_test_statistics
values(generate_series(1,10));", params=0x0, queryEnv=0x0,
dest=dest@entry=0x62545550, qc=qc@entry=0x7ffee84886d0)
at pquery.c:160
#24 0x562435404a32 in PortalRunMulti (portal=portal@entry=0x62520a20,
isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x62545550, altdest=altdest@entry=0x62545550,
qc=qc@entry=0x7ffee84886d0)
at pquery.c:1274
#25 0x56243540598d in PortalRun (portal=portal@entry=0x62520a20,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x62545550,
altdest=altdest@entry=0x62545550, qc=)
at pquery.c:788
#26 0x5624353fa917 in exec_simple_query
(query_string=query_string@entry=0x62505220
"insert into temp_table_test_statistics
values(generate_series(1,10));") at postgres.c:1214
#27 0x5624353ff61d in PostgresMain (dbname=dbname@entry=0x62911278
"regression", username=username@entry=0x62911258 "andrew") at
postgres.c:4497
#28 0x5624351f65c7 in BackendRun (port=port@entry=0x61502d80) at
postmaster.c:4560
#29 0x5624351ff1c5 in BackendStartup (port=port@entry=0x61502d80)
at postmaster.c:4288
#30 0x5624351ff970 in ServerLoop () at 

Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Mark Dilger



> On Oct 6, 2021, at 8:14 AM, Pavel Borisov  wrote:
> 
> We've looked through the initial patch, and the exclusion of temporary
> tables from pg_amcheck seems like the right thing. It also doesn't seem to
> be a matter anyone here disagrees on, so we propose to commit it alone.

Thanks for reviewing!

I expect to post a new version shortly. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Speed up transaction completion faster after many relations are accessed in a transaction

2021-10-06 Thread Yura Sokolov
I've made some remarks in a related thread:
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1FB976EF@G01JPEXMBYT05

The new status of this patch is: Waiting on Author


Re: Role Self-Administration

2021-10-06 Thread Mark Dilger



> On Oct 6, 2021, at 9:01 AM, Stephen Frost  wrote:
> 
> I can see how what you describe as the behavior you'd like to see of
> DROP ROLE ... CASCADE could be useful...  However, at least in the
> latest version of the standard that I'm looking at, when a
> DROP ROLE ...  CASCADE is executed, what happens for all authorization
> identifiers is:
> 
> REVOKE R FROM A DB
> 
> Where R is the role being dropped and A is the authorization identifier.

I'm not proposing that all roles with membership in bob be dropped when role 
bob is dropped.  I'm proposing that all roles *owned by* role bob also be 
dropped.  Postgres doesn't currently have a concept of roles owning other 
roles, but I'm proposing that we add such a concept.  Of course, any role with 
membership in role bob would no longer have that membership, and any role 
managed by bob would no longer be managed by bob.  The CASCADE would not 
drop those other roles merely on account of membership or management 
relationships.

> In other words, the SQL committee seems to disagree with you when it
> comes to what CASCADE on DROP ROLE means (though I can't say I'm too
> surprised- generally speaking, CASCADE is about getting rid of the
> dependency so the system stays consistent, not as a method of object
> management...).

I'm not sure I understand how what they are saying disagrees with what I am 
saying, unless they are saying that REVOKE R FROM A DB is the one and only 
thing that DROP ROLE .. CASCADE can do.  If they are saying that it must not 
do anything else, then yes, that would be an incompatibility.

As far as keeping the system consistent goes, I think that's what this does.  
As soon as a role is defined as owning other stuff, then dropping the role 
with CASCADE means dropping that other stuff.

Could you elaborate more on the difference between object management and 
consistency as it applies to this issue?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Use simplehash.h instead of dynahash in SMgr

2021-10-06 Thread Yura Sokolov
Good day, David and all.

On Tue, 05/10/2021 at 11:07 +1300, David Rowley wrote:
> On Mon, 4 Oct 2021 at 20:37, Jaime Casanova
>  wrote:
> > Based on your comments I will mark this patch as withdrawn at midday
> > of
> > my monday unless someone objects to that.
> 
> I really think we need a hash table implementation that's faster than
> dynahash and supports stable pointers to elements (simplehash does not
> have stable pointers). I think withdrawing this won't help us move
> towards getting that.

I agree with you. I believe densehash could replace both dynahash and
simplehash. The shared-memory usages of dynahash should be reworked onto
some other, less dynamic hash structure, so there would be densehash for
local hashes and a statichash for static shared memory.

densehash's slight slowness compared to simplehash in some operations isn't
worth keeping simplehash around beside densehash.

> Thomas voiced his concerns here about having an extra hash table
> implementation and then also concerns that I've coded the hash table
> code to be fast to iterate over the hashed items.  To be honest, I
> think both Andres and Thomas must be misunderstanding the bitmap part.
> I get the impression that they both think the bitmap is solely there
> to make interations faster, but in reality it's primarily there as a
> compact freelist and can also be used to make iterations over sparsely
> populated tables fast. For the freelist we look for 0-bits, and we
> look for 1-bits during iteration.

I think this part is overengineered. More below.

> I think I'd much rather talk about the concerns here than just
> withdraw this. Even if what I have today just serves as something to
> aid discussion.
> 
> It would also be good to get the points Andres raised with me off-list
> on this thread.  I think his primary concern was that bitmaps are
> slow, but I don't really think maintaining full pointers into freed
> items is going to improve the performance of this.
> 
> David

First on "quirks" in the patch I was able to see:

DH_NEXT_ZEROBIT:

   DH_BITMAP_WORD mask = (~(DH_BITMAP_WORD) 0) << DH_BITNUM(prevbit);
   DH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */

really should be

   DH_BITMAP_WORD mask = (~(DH_BITMAP_WORD) 0) << DH_BITNUM(prevbit);
   DH_BITMAP_WORD word = (~words[wordnum]) & mask; /* flip bits */

But it does no harm, because DH_NEXT_ZEROBIT is always called with
`prevbit = -1`, which is incremented to `0`. Therefore `mask` is always
`0x...ff`.

DH_INDEX_TO_ELEMENT

   /* ensure this segment is marked as used */
should be
   /* ensure this item is marked as used in the segment */

DH_GET_NEXT_UNUSED_ENTRY

   /* find the first segment with an unused item */
   while (seg != NULL && seg->nitems == DH_ITEMS_PER_SEGMENT)
   seg = tb->segments[++segidx];

There is no protection for `++segidx <= tb->nsegments`. I understand it
cannot happen, because `grow_threshold` is always less than
`nsegments * DH_ITEMS_PER_SEGMENT`. But at least a comment should be left
explaining why the absence of the check is legal.

Now architecture notes:

I don't believe there is a need for a configurable DH_ITEMS_PER_SEGMENT. I
don't even believe it should be anything other than 16 (or 8). Then a segment
needs only one `used_items` word, which simplifies the code a lot.
There is not much difference in overhead between 1/16 and 1/256.

And then, I believe, a segment doesn't need both `nitems` and `used_items`:
the condition "segment is full" becomes `used_items == 0xFFFF`.

Next, I think it is better to maintain a real free list instead of looping in
search of one, i.e. add a `uint32 DH_SEGMENT->next` field and maintain a list
starting from `first_free_segment`.
If the concern is "allocate from lower-numbered segments first", then a
min-heap could be used instead. It is possible to build a very efficient
non-balanced "binary heap" with just two fields (`uint32 left, right`).
An algorithmic PoC in the Ruby language is attached; a C sketch of one such
heap follows below.
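
For example, a skew heap fits this description: each node needs only its two
child links, and a merge just swaps children as it descends. A sketch (the
node array and the INVALID_SEGMENT sentinel are illustrative only):

    typedef struct SegHeapNode
    {
        uint32  left;
        uint32  right;
    } SegHeapNode;

    /* Merge two heaps ordered by segment index; returns the new root. */
    static uint32
    segheap_merge(SegHeapNode *nodes, uint32 a, uint32 b)
    {
        uint32  tmp;

        if (a == INVALID_SEGMENT)
            return b;
        if (b == INVALID_SEGMENT)
            return a;
        if (b < a)              /* keep the smaller segment index on top */
        {
            tmp = a;
            a = b;
            b = tmp;
        }
        tmp = nodes[a].left;
        nodes[a].left = segheap_merge(nodes, nodes[a].right, b);
        nodes[a].right = tmp;
        return a;
    }

Insertion and pop-min are then just merges, so allocation always hands back
the lowest-numbered free segment.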

There is also an allocation concern: AllocSet tends to allocate in power-of-2
sizes. Using power-of-2 segments with a header (nitems/used_items) will
certainly lead to ~2x wasted space on every segment if the element size is
also a power of 2, and a bit less for other element sizes.
There could be two workarounds:
- make the segment slightly less capacious (15 elements instead of 16)
- move the header from the segment itself into the `DH_TYPE->segments`
  array (see the sketch below).
I think the second option is preferable:
- `DH_TYPE->segments[x]` is inevitably accessed on every operation, so why
  not store some info there?
- if nitems/used_items live in `DH_TYPE->segments[x]`, then hashtable
  iteration doesn't need a bitmap at all - there is no need for the
  `DH_TYPE->used_segments` bitmap. The absence of this bitmap reduces the
  overhead of ordinary operations (insert/delete) as well.
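
A sketch of that second option (names illustrative; DHSegment here would hold
only the items array, its header having moved into the directory entry):

    typedef struct DHSegmentRef
    {
        uint16      used_items;     /* header moved out of the segment */
        uint32      next;           /* free-list link, as suggested above */
        DHSegment  *seg;            /* power-of-2 sized, header-free payload */
    } DHSegmentRef;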

I hope this was useful.

regards

Yura Sokolov
y.soko...@postgrespro.ru
funny.fal...@gmail.com


heap.rb
Description: application/ruby


Re: Running tests under valgrind is getting slower at an alarming pace

2021-10-06 Thread Robert Haas
On Wed, Oct 6, 2021 at 1:57 AM Andres Freund  wrote:
> After a recent migration of the skink and a few other animals (sorry for the
> false reports on BF, I forgot to adjust a path), I looked at the time it takes
> to complete a valgrind run:
>
> 9.6: Consumed 4h 53min 18.518s CPU time
> 10: Consumed 5h 32min 50.839s CPU time
> 11: Consumed 7h 7min 17.455s CPU time
> 14: still going at 11h 51min 57.951s
> HEAD: 14h 32min 29.571s CPU time
>
> I changed it so that HEAD with be built in parallel separately from the other
> branches, so that HEAD gets results within a useful timeframe. But even with
> that, the test times are increasing at a rate we're not going to be able to
> keep up.

Is the problem here that we're adding a lot of new test cases? Or
is the problem that valgrind runs are getting slower for the same
number of test cases?

If it's taking longer because we have more test cases, I'm honestly
not sure that's really something we should try to fix. I mean, I'm
sure we have some bad test cases here and there, but overall I think
we still have too little test coverage, not too much. The recent
discovery that recovery_end_command had zero test coverage is one fine
example of that.

But if we've done something that increases the relative cost of
valgrind, maybe we can fix that in a centralized way.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Role Self-Administration

2021-10-06 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Oct 5, 2021, at 10:20 AM, Stephen Frost  wrote:
> > On Tue, Oct 5, 2021 at 13:17 Mark Dilger  
> > wrote:
> > > On Oct 5, 2021, at 10:14 AM, Stephen Frost  wrote:
> > > 
> > > What does the “ownership” concept actually buy us then?
> > 
> > DROP ... CASCADE
> > 
> > I’m not convinced that we need to invent the concept of ownership in order 
> > to find a sensible way to make this work- though it would be helpful to 
> > first get everyone’s idea of just what *would* this command do if run on a 
> > role who “owns” or has “admin rights” of another role?
> 
> Ok, I'll start.  Here is how I envision it:
> 
> If roles have owners, then DROP ROLE bob CASCADE drops bob, bob's objects, 
> roles owned by bob, their objects and any roles they own, recursively.  Roles 
> which bob merely has admin rights on are unaffected, excepting that they are 
> administered by one fewer roles once bob is gone.  
> 
> This design allows you to delegate to a new role some task, and you don't 
> have to worry what network of other roles and objects they create, because in 
> the end you just drop the one role cascade and all that other stuff is 
> guaranteed to be cleaned up without any leaks.
> 
> If roles do not have owners, then DROP ROLE bob CASCADE drops role bob plus 
> all objects that bob owns.  It doesn't cascade to other roles because the 
> concept of "roles that bob owns" doesn't exist.  If bob created other roles, 
> those will be left around.  Objects that bob created and then transferred to 
> these other roles are also left around.

I can see how what you describe as the behavior you'd like to see of
DROP ROLE ... CASCADE could be useful...  However, at least in the
latest version of the standard that I'm looking at, when a
DROP ROLE ...  CASCADE is executed, what happens for all authorization
identifiers is:

REVOKE R FROM A DB

Where R is the role being dropped and A is the authorization identifier.

In other words, the SQL committee seems to disagree with you when it
comes to what CASCADE on DROP ROLE means (though I can't say I'm too
surprised- generally speaking, CASCADE is about getting rid of the
dependency so the system stays consistent, not as a method of object
management...).

I'm not against having something that would do what you want, but it
seems like we'd have to at least call it something else and maybe we
should worry about that later, once we've addressed the bigger issue of
making the system handle GRANTORs correctly.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Running tests under valgrind is getting slower at an alarming pace

2021-10-06 Thread Tom Lane
Andres Freund  writes:
> After a recent migration of the skink and a few other animals (sorry for the
> false reports on BF, I forgot to adjust a path), I looked at the time it takes
> to complete a valgrind run:

> 9.6: Consumed 4h 53min 18.518s CPU time
> 10: Consumed 5h 32min 50.839s CPU time
> 11: Consumed 7h 7min 17.455s CPU time
> 14: still going at 11h 51min 57.951s
> HEAD: 14h 32min 29.571s CPU time

I have observed similar slowdowns across versions on just-plain-slow
animals, too.  Awhile ago (last December, I think), I tried enabling
--enable-tap-tests across the board on prairiedog, and observed
these buildfarm cycle times:

9.5 01:50:24
9.6 02:06:32
10  02:26:34
11  02:54:44
12  03:41:11
13  04:46:31
HEAD    04:49:04

I went back to not running TAP tests in the back branches :-(

prairiedog's latest HEAD run consumed 5:30, so it's gotten way
worse since December.

In the same comparison, my other animal longfin had gone from 14 to 18
minutes (and it's now up to 22 minutes).  It's not clear to me whether
greater available parallelism (12 CPUs vs 1) is alone enough to
explain why the more modern machine isn't suffering so badly.  As you
say, the TAP tests are not well parallelized at all, so that doesn't
seem to fit the facts.

In any case, it seems like we do need to be paying more attention to
how long it takes to do the TAP tests.  We could try to shave more
cycles off the overhead, and we should think a little harder about
the long-term value of every test case we add.

regards, tom lane




Re: storing an explicit nonce

2021-10-06 Thread Robert Haas
On Wed, Oct 6, 2021 at 11:22 AM Stephen Frost  wrote:
> Seems like it'd be easier to achieve that by having something that looks
> very close to how write() looks, but just happens to have the option to
> run the data through a stream cipher and maybe does better error
> handling for us.  Making that layer also do block-based access to the
> files underneath seems like a much larger effort that, sure, may make
> some things better too but if we could do that with the same API then it
> could also be done later if someone's interested in that.

Yeah, it's possible that is the best option, but I'm not really
convinced. I think the places that are doing I/O in small chunks are
pretty questionable. Like look at this code from pgstat.c, with block
comments omitted for brevity:

rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
(void) rc;  /* we'll check for error with ferror */
rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
(void) rc;  /* we'll check for error with ferror */
rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
(void) rc;  /* we'll check for error with ferror */
rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
(void) rc;  /* we'll check for error with ferror */
rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
(void) rc;  /* we'll check for error with ferror */

I don't know exactly what the best way to write this code is, but I'm
fairly sure this isn't it. I suppose that whoever wrote this code
chose to use fwrite() rather than write() to get buffering, but that
had the effect of delaying the error checking by an amount that I
would consider unacceptable in new code -- we do all the fwrite()
calls to generate the entire file and then only check ferror() once at
the very end! If we did our own buffering, we could do this a lot
better. And if we used that same buffering layer everywhere, it might
not be too hard to make it use a block cipher rather than a stream
cipher. Now I don't intrinsically have strong feeling about whether
block ciphers or stream ciphers are better, but I think it's going to
be easier to analyze the security of the system and to maintain it
across future developments in cryptography if we can use the same kind
of cipher everywhere. If we use block ciphers for some things and
stream ciphers for other things, it is more complicated.
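
A minimal sketch of the sort of buffering layer I mean (everything here is
invented for illustration, not a worked-out proposal):

    typedef struct BufferedWrite
    {
        int     fd;
        char    buf[8192];
        size_t  used;
        bool    failed;     /* sticky error flag, visible immediately */
    } BufferedWrite;

    static void
    bw_flush(BufferedWrite *bw)
    {
        /* a block cipher could encrypt bw->buf here, block by block */
        if (bw->used > 0 && !bw->failed &&
            write(bw->fd, bw->buf, bw->used) != (ssize_t) bw->used)
            bw->failed = true;
        bw->used = 0;
    }

    static void
    bw_write(BufferedWrite *bw, const void *data, size_t len)
    {
        const char *src = data;

        while (len > 0)
        {
            size_t  n = sizeof(bw->buf) - bw->used;

            if (n > len)
                n = len;
            memcpy(bw->buf + bw->used, src, n);
            bw->used += n;
            src += n;
            len -= n;
            if (bw->used == sizeof(bw->buf))
                bw_flush(bw);
        }
    }

Callers would check the failed flag per call (or at natural checkpoints)
instead of deferring everything to a final ferror().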

Perhaps that is unavoidable and I just shouldn't worry about it. It
may work out that we'll end up needing to do that anyway for one
reason or another. But all things being equal, I think it's nice if we
make all the places where we do I/O look more like each other, not
specifically because of TDE but because that's just better in general.
For example Andres is working on async I/O. Maybe this particular
piece of code is moot in terms of that project because I think Andres
is hoping to get the shared memory stats collector patch committed.
But, say that doesn't happen. The more all of the I/O looks the same,
the easier it will be to make all of it use whatever async I/O
infrastructure he creates. The more every module does things in its
own way, the harder it is. And batching work together into
reasonable-sized blocks is probably necessary for async I/O too.

I just can't look at code like that shown above and think anything
other than "blech".

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Role Self-Administration

2021-10-06 Thread Robert Haas
On Tue, Oct 5, 2021 at 3:41 PM Mark Dilger  wrote:
> If roles have owners, then DROP ROLE bob CASCADE drops bob, bob's objects, 
> roles owned by bob, their objects and any roles they own, recursively.  Roles 
> which bob merely has admin rights on are unaffected, excepting that they are 
> administered by one fewer roles once bob is gone.
>
> This design allows you to delegate to a new role some task, and you don't 
> have to worry what network of other roles and objects they create, because in 
> the end you just drop the one role cascade and all that other stuff is 
> guaranteed to be cleaned up without any leaks.
>
> If roles do not have owners, then DROP ROLE bob CASCADE drops role bob plus 
> all objects that bob owns.  It doesn't cascade to other roles because the 
> concept of "roles that bob owns" doesn't exist.  If bob created other roles, 
> those will be left around.  Objects that bob created and then transferred to 
> these other roles are also left around.

I'm not sure that I'm totally on board with the role ownership
concept, but I do think it has some potential advantages. For
instance, suppose the dba creates a bunch of "master tenant" roles
which correspond to particular customers: say, amazon, google, and
facebook. Now each of those master tenant roles creates roles under it
which represent the various people or applications from those
companies that will be accessing the server: e.g. sbrin and lpage.
Now, if Google runs out of money and stops paying the hosting bill, we
can just "DROP ROLE google CASCADE" and sbrin and lpage get nuked too.
That's kind of cool. What happens if we don't have that? Then we'll
need to work harder to make sure all traces of Google are expunged
from the system.

In fact, how do we do that, exactly? In this particular instance it
should be straightforward: if we see that google can administer sbrin
and lpage and nobody else can, then it probably follows that those
roles should be nuked when the google role is nuked. But what if we
have separate users apple and nextstep either of whom can administer
the role sjobs? If nextstep goes away, we had better not remove sjobs
because he's still able to be administered by apple, but if apple also
goes away, then we'll want to remove sjobs then. That's doable, but
complicated, and all the logic that figures this out now lives outside
the database. With role ownership, we can enforce that the roles form
a tree, and subtrees can be easily lopped off without the user needing
to do anything complicated.

Without role ownership, we've just got a directed graph of who can
administer whom, and it need not be connected or acyclic. Now we may be
able to cope with that, or we may be able to set things up so that
users can cope with that using logic external to the database without
anything getting too complicated. But I certainly see the appeal of a
system where the lines of control form a DAG rather than a general
directed graph. It seems to make it a whole lot easier to reason about
what operations should and should not be permitted and how the whole
thing should actually work. It's a fairly big change from the status
quo, though, and maybe it has disadvantages that make it a suboptimal
choice.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: storing an explicit nonce

2021-10-06 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Oct 5, 2021 at 1:24 PM Robert Haas  wrote:
> > On Mon, Oct 4, 2021 at 10:00 PM Stephen Frost  wrote:
> > > I do want to point out, as I think I did when we discussed this but want
> > > to be sure it's also captured here- I don't think that temporary file
> > > access should be forced to be block-oriented when it's naturally (in
> > > very many cases) sequential.  To that point, I'm thinking that we need a
> > > temp file access API through which various systems work that's
> > > sequential and therefore relatively similar to the existing glibc, et
> > > al, APIs, but by going through our own internal API (which more
> > > consistently works with the glibc APIs and provides better error
> > > reporting in the event of issues, etc) we can then extend it to work as
> > > an encrypted stream instead.
> >
> > Regarding this, would it use block-oriented access on the backend?
> >
> > I agree that we need a better API layer through which all filesystem
> > access is routed. One of the notable weaknesses of the Cybertec patch
> > is that it has too large a code footprint,
> 
> (sent too soon)
> 
> ...precisely because PostgreSQL doesn't have such a layer.
> 
> But I think ultimately we do want to encrypt and decrypt in blocks, so
> if we create such a layer, it should expose byte-oriented APIs but
> combine the actual I/Os somehow. That's also good for cutting down the
> number of system calls, which is a benefit unto itself.

I have to say that this seems to be moving the goalposts quite far down
the road from just developing a layer for sequential reading and writing
of files that lets us get away from bare write() calls. While I agree that
we want to encrypt/decrypt in blocks when we're working with our own blocks,
I don't know that it's really necessary to do so for these kinds of use
cases.  I appreciate the goal of reducing the number of syscalls, though.

Part of my concern here is that a patch which changes all of our
existing sequential access using write() and friends to work in a block
manner instead ends up probably being just as big and invasive as those
parts of the TDE patch which did the same, and it isn't actually
necessary as there are stream ciphers which we could certainly use for,
well, stream-based access patterns.  No, that doesn't improve the
situation around the number of syscalls, but it also doesn't make that
any worse than it is today.

Perhaps this is all too meta and we need to work through some specific
ideas around just what this would look like.  In particular, thinking
about what this API would look like and how it would be used by
reorderbuffer.c, which builds up changes in memory and then does a bare
write() call, seems like a main use-case to consider.  The gist there
being "can we come up with an API to do all these things that doesn't
require entirely rewriting ReorderBufferSerializeChange()?"

Seems like it'd be easier to achieve that by having something that looks
very close to how write() looks, but just happens to have the option to
run the data through a stream cipher and maybe does better error
handling for us.  Making that layer also do block-based access to the
files underneath seems like a much larger effort that, sure, may make
some things better too but if we could do that with the same API then it
could also be done later if someone's interested in that.
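
A minimal sketch of the shape I'm imagining (all names here are invented;
keystream_xor() stands in for a real stream cipher such as AES-CTR):

    typedef struct EncFile
    {
        int     fd;
        uint64  stream_pos;     /* byte offset into the cipher stream */
    } EncFile;

    static bool
    enc_write(EncFile *f, const void *data, size_t len)
    {
        unsigned char   tmp[8192];
        const unsigned char *src = data;

        while (len > 0)
        {
            size_t  n = len > sizeof(tmp) ? sizeof(tmp) : len;

            /* encrypt into tmp: XOR with keystream at f->stream_pos */
            keystream_xor(tmp, src, n, f->stream_pos);
            if (write(f->fd, tmp, n) != (ssize_t) n)
                return false;   /* errors reported right away */
            f->stream_pos += n;
            src += n;
            len -= n;
        }
        return true;
    }

A caller like ReorderBufferSerializeChange() would then swap write() for
enc_write() and otherwise stay as it is.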

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: storing an explicit nonce

2021-10-06 Thread Robert Haas
On Wed, Oct 6, 2021 at 9:35 AM Bruce Momjian  wrote:
> The above text isn't very clear.  What I am saying is that currently
> torn pages can be tolerated by hint bit writes because only a single
> byte is changing.  If we use a block cipher like AES-XTS, later 16-byte
> encrypted blocks would be changed by hint bit changes, meaning torn
> pages could not be tolerated.  This means we would have to use full page
> writes for hint bit changes, perhaps making this feature have
> unacceptable performance overhead.

Actually, I think this would have *less* performance overhead than your patch.

If you enable checksums or set wal_log_hints=on, then you might incur
some write-ahead log records that would otherwise be avoided, and
those records will include full page images. This can happen once per
page per checkpoint cycle. However, if the first modification to a
particular page within a given checkpoint cycle is a regular
WAL-logged operation rather than a hint bit change, then the extra WAL
record and full-page image are not needed so the overhead is zero.
Also, if the first modification is a hint bit change, and then the
page is evicted, prompting a full page write, but a regular WAL-logged
operation occurs later within the same checkpoint, the later operation
no longer needs a full page write. So you still paid the cost of an
extra WAL record, but you didn't pay the cost of an extra full page
write. In other words, enabling checksums or turning wal_log_hints=on
has a relatively low cost except when you have pages that incur only
hint-type changes, and no regular changes, within the course of a
single checkpoint cycle.

On the other hand, in order to avoid IV reuse, your patch needed to
bump the page LSN for every change, or at least for every eviction.
That means you could potentially incur the overhead of an extra full
page write multiple times per checkpoint cycle, and even if there were
non-hint changes to that page in the same checkpoint cycle. Now you
could say, well, let's not bump the page LSN for every hint-type
change, and then your patch would have lower overhead than an approach
based on XTS, but I think that also loses a ton of security, because
now you're reusing IVs with an encryption system that is documented
not to tolerate the reuse of IVs.

I'm not here to try to pretend that encryption is going to be cheap. I
just don't believe this particular argument about why AES-XTS should
be more expensive.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: BUG #17212: pg_amcheck fails on checking temporary relations

2021-10-06 Thread Pavel Borisov
Hi, hackers!

We've looked through the initial patch, and the exclusion of temporary
tables from pg_amcheck seems like the right thing. It also doesn't seem to
be a matter anyone here disagrees on, so we propose to commit it alone.
Supplementary things/features can be left for further discussion, but
skipping checks of temporary tables is the only sensible option IMO.

The patch applies cleanly, tests succeed. I'd propose to set it as RFC.
--
Best regards,
Pavel Borisov, Maxim Orlov

Postgres Professional: http://postgrespro.com 


Re[2]: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2021-10-06 Thread Мельников Антон Андреевич

Hi, Andrey!
 
I’ve tried to apply your patch v2-0001 on current master, but I failed.
There were git apply errors at:
pg_stat_statements.out:941
pg_stat_statements.sql:385
 
Best Regards,

Anton Melnikov
Postgres Professional:  http://www.postgrespro.com
The Russian Postgres Company
 

Re: Compressing temporary files

2021-10-06 Thread Tomas Vondra
Hi,

On 9/11/21 2:31 PM, Andrey Borodin wrote:
> Hi hackers!
> 
> There's a lot of compression discussions nowadays. And that's cool! 
> Recently Naresh Chainani in private discussion shared with me the
> idea to compress temporary files on disk. And I was thrilled to find
> no evidence of implementation of this interesting idea.
> 
> I've prototyped Random Access Compressed File for fun[0]. The code is
> very dirty proof-of-concept. I compress Buffile by one block at a
> time. There are directory pages to store information about the size
> of each compressed block. If any byte of the block is changed - whole
> block is recompressed. Wasted space is never reused. If compressed
> block is more than BLCKSZ - unknown bad things will happen :)
> 

Might be an interesting feature, and the approach seems reasonable too
(of course, it's a PoC, so it has rough edges that'd need to be solved).

Not sure if compressing it at the 8kB block granularity is good or bad.
Presumably larger compression blocks would give better compression, but
that's a detail we would investigate later.

> Here are some my observations.
> 
> 0. The idea seems feasible. API of fd.c used by buffile.c can easily
> be abstracted for compressed temporary files. Seeks are necessary,
> but they are not very frequent. It's easy to make temp file
> compression GUC-controlled.
> 

Hmm. How much more expensive are the seeks, actually? If we compress the
files block by block, then a seek costs decompressing 8kB of data. Of course,
that's not free, but if you compare it to doing less I/O, it may easily
be a significant win.

> 1. Temp file footprint can be easily reduced. For example query 
> create unlogged table y as select random()::text t from
> generate_series(0,999) g; uses for toast index build 140000000
> bytes of temp file. With patch this value is reduced to 40841704
> (x3.42 smaller).
> 

That seems a bit optimistic, really. The problem is that while random()
is random, it means we're only dealing with 10 distinct characters in the
text value. That's pretty redundant, and the compression benefits from that.

But then again, data produced by queries (which we may need to sort,
which generates temp files) is probably redundant too.

> 2. I have not found any evidence of performance improvement. I've
> only benchmarked patch on my laptop. And RAM (page cache) diminished
> any difference between writing compressed block and uncompressed
> block.
> 

I expect the performance improvement to be less direct, showing up mainly
under contention for resources (memory and I/O bandwidth). If you have
multiple sessions and memory pressure, that'll force temporary files out of
the page cache to disk. The compression will reduce the memory pressure
(because less data is written to the page cache), possibly even eliminating
the need to write dirty pages to disk. And if we still have to write
data to disk, this reduces the amount we have to write.

Of course, it may also reduce the disk space required for temp files,
which is also nice.

> How do you think: does it worth to pursue the idea? OLTP systems
> rarely rely on data spilled to disk. Are there any known good random
> access compressed file libs? So we could avoid reinventing the
> wheel. Maybe someone tried this approach before?
> 

I'd say it's worth investigating further.

Not sure about existing solutions / libraries for this problem, but my
guess is the overall approach is roughly what you implemented.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: storing an explicit nonce

2021-10-06 Thread Robert Haas
On Tue, Oct 5, 2021 at 4:29 PM Bruce Momjian  wrote:
> On Tue, Sep 28, 2021 at 12:30:02PM +0300, Ants Aasma wrote:
> > On Mon, 27 Sept 2021 at 23:34, Bruce Momjian  wrote:
> > We are still working on our TDE patch. Right now the focus is on refactoring
> > temporary file access to make the TDE patch itself smaller. Reconsidering
> > encryption mode choices given concerns expressed is next. Currently a viable
> > option seems to be AES-XTS with LSN added into the IV. XTS doesn't have an
> > issue with predictable IV and isn't totally broken in case of IV reuse.
>
> Uh, yes, AES-XTS has benefits, but since it is a block cipher, previous
> 16-byte blocks affect later blocks, meaning that hint bit changes would
> also affect later blocks.  I think this means we would need to write WAL
> full page images for hint bit changes to avoid torn pages.  Right now
> hint bit (single bit) changes can be lost without causing torn pages.
> This was another of the advantages of using a stream cipher like CTR.

This seems wrong to me. CTR requires that you not reuse the IV. If you
re-encrypt the page with a different IV, torn pages are a problem. If
you re-encrypt it with the same IV, then it's not secure any more.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: storing an explicit nonce

2021-10-06 Thread Robert Haas
On Tue, Oct 5, 2021 at 1:55 PM Antonin Houska  wrote:
> I'm just trying to make our changes to buffile.c less invasive. Or do you mean
> that this module should be reworked regardless the encryption?

I wasn't thinking of buffile.c specifically. I think improving that
might be a really good idea, although I'm not 100% sure I know what
that would look like. I was thinking that it's unfortunate that there
are so many different ways that I/O happens overall. Like, there are
direct write() and pg_pwrite() calls in various places, for example.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Compressing temporary files

2021-10-06 Thread Robert Haas
On Sat, Sep 11, 2021 at 8:31 AM Andrey Borodin  wrote:
> I've prototyped Random Access Compressed File for fun[0]. The code is very 
> dirty proof-of-concept.
> I compress Buffile by one block at a time. There are directory pages to store 
> information about the size of each compressed block. If any byte of the block 
> is changed - whole block is recompressed. Wasted space is never reused. If 
compressed block is more than BLCKSZ - unknown bad things will happen :)

Just reading this description, I suppose it's also Bad if the block is
recompressed and the new compressed size is larger than the previous
compressed size. Or do you have some way to handle that?

I think it's probably quite tricky to make this work if the temporary
files can be modified after the data is first written. If you have a
temporary file that's never changed after the fact, then you could
compress all the blocks and maintain, on the side, an index that says
where the compressed version of each block starts. That could work
whether or not the blocks expand when you try to compress them, and
you could even skip compression for blocks that get bigger when
"compressed" or which don't compress nicely, just by including a
boolean flag in your index saying whether that particular block is
compressed or not. But as soon as you have a case where the blocks can
get modified after they are created, then I don't see how to make it
work nicely. You can't necessarily fit the new version of the block in
the space allocated for the old version of the block, and putting it
elsewhere could turn sequential I/O into random I/O.
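
For the write-once case, the side index could be as simple as this (purely
illustrative; all names are invented):

    typedef struct TempBlockIndex
    {
        off_t  *offsets;        /* offsets[i] = start of block i; one extra
                                 * sentinel entry marks the end of data */
        bool   *is_compressed;  /* false: block i stored raw, because it
                                 * grew or compressed poorly */
        int     nblocks;
    } TempBlockIndex;

Reading block i is then a pread() of offsets[i + 1] - offsets[i] bytes,
followed by decompression only when is_compressed[i] is set.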

Leaving all that aside, I think this feature has *some* potential,
because I/O is expensive and compression could let us do less of it.
The problem is that a lot of the I/O that PostgreSQL thinks it does
isn't real I/O. Everybody is pretty much forced to set work_mem
conservatively to avoid OOM, which means a large proportion of
operations that exceed work_mem and thus spill to files don't actually
result in real I/O. They end up fitting in memory after all; it's only
that the memory in question belongs to the OS rather than to
PostgreSQL. And for operations of that type, which I believe to be
very common, compression is strictly a loss. You're doing extra CPU
work to avoid I/O that isn't actually happening.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Extension relocation vs. schema qualification

2021-10-06 Thread Robert Haas
On Tue, Oct 5, 2021 at 8:26 AM Tom Lane  wrote:
> I spent some time awhile ago on fixing this via new-style SQL functions
> [1].  That didn't get finished for reasons explained in the thread,
> but I think that's probably a more productive direction to go in.

Hmm, if that'll solve the problem, +1 from me. I'm not a big fan of
giving up on the ability to relocate extensions. I know it's a pain to
support and I'm not sure that we have all the ideas we need to make it
work cleanly, but I don't feel like it should be intrinsically
impossible to get working, and I think users want it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



