Re: refactoring - share str2*int64 functions

2019-08-26 Thread Fabien COELHO


Bonjour Michaël,


The error returning stuff is simple enough, but I'm unclear about what to do
with pg_uint64, which has a totally different signature. Should it be
aligned?


I am not sure what you mean with aligned here.


I meant same signature.

If you mean consistent, getting into a state where we have all functions 
for all three sizes, signed and unsigned, would be nice.


Ok, I'll look into it.

--
Fabien.

Re: Statement timeout in pg_rewind

2019-08-26 Thread Michael Paquier
On Mon, Aug 26, 2019 at 03:42:46PM +0200, Alexander Kukushkin wrote:
> Well, I was thinking about it and came to the conclusion that we are
> neither taking heavy locks nor explicitly opening a transaction and
> therefore we can avoid changing them.
> But maybe you are right, having them set to the safe value shouldn't
> hurt.

I'd rather be on the safe side, and we are looking at this area anyway.
Who knows if this logic is going to change in the future, and how it
will change.

> I don't think we can use the same wrapper for run_simple_query() and
> for places where we call a SET, because PQresultStatus() returns
> PGRES_TUPLES_OK and PGRES_COMMAND_OK respectively.
> Passing expected ExecStatusType to the wrapper for comparison is
> looking a bit ugly to me.

Oops, I misread this part.  What about a simple wrapper,
run_simple_command, which checks for PGRES_COMMAND_OK and then frees the
result?  This could be used for the temporary table creation and when
setting synchronous_commit.
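
Something along these lines, perhaps (only a rough sketch, assuming
pg_rewind's existing pg_fatal() and open connection "conn"; not meant as
the final patch):

/*
 * Run a command on the source server that is only expected to return
 * PGRES_COMMAND_OK, e.g. SET or CREATE TEMPORARY TABLE, and discard the
 * result.
 */
static void
run_simple_command(const char *sql)
{
    PGresult   *res;

    res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        pg_fatal("error running query (%s) in source server: %s",
                 sql, PQresultErrorMessage(res));

    PQclear(res);
}

Callers would then simply do run_simple_command("SET synchronous_commit = off")
and the like.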
--
Michael




Re: fix "Success" error messages

2019-08-26 Thread Michael Paquier
On Mon, Aug 26, 2019 at 09:40:23PM +0200, Peter Eisentraut wrote:
> Here is an updated patch set that rearranges this a bit according to
> your suggestions, and also fixes some similar issues in pg_checksums.

Thanks for the new patch, and you are right that pg_checksums has been
slacking here.  There is the same issue with pg_verify_checksums in
11.  Not sure that's worth a back-patch though.  Those parts could
find their way to v12 easily.

> -ereport(ERROR,
> -        (errcode_for_file_access(),
> -         errmsg("could not access status of transaction %u", xid),
> -         errdetail("Could not read from file \"%s\" at offset %u: %m.",
> -                   path, offset)));
> +if (errno)
> +    ereport(ERROR,
> +            (errcode_for_file_access(),
> +             errmsg("could not access status of transaction %u", xid),
> +             errdetail("Could not read from file \"%s\" at offset %u: %m.",
> +                       path, offset)));
> +else
> +    ereport(ERROR,
> +            (errmsg("could not access status of transaction %u", xid),
> +             errdetail("Could not read from file \"%s\" at offset %u: read too few bytes.", path, offset)));

Last time I worked on that, the following suggestion was made for
error messages with shorter reads or writes:
could not read file \"%s\": read %d of %zu
Still, this is clearly an improvement, and I think that's not worth the
extra complication, so +1 for this way of doing things.

>  if (r == 0)
>  break;
> -if (r != BLCKSZ)
> +else if (r < 0)
> +{
> +pg_log_error("could not read block %u in file \"%s\": %m",
> + blockno, fn);
> +exit(1);
> +}
> +else if (r != BLCKSZ)
>  {
>  pg_log_error("could not read block %u in file \"%s\": read %d of 
> %d",
>   blockno, fn, r, BLCKSZ);

Other code areas (xlog.c, pg_waldump.c, etc.) prefer doing it this
way, after checking the size read:
if (r != BLCKSZ)
{
if (r < 0)
pg_log_error("could not read blah: %m");
else
pg_log_error("could not read blah: read %d of %d")
}
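
Applied to the pg_checksums read path, that shape would look roughly like
this (illustrative only, using the variables from the hunk above; the
committed code may differ in detail):

r = read(f, buf.data, BLCKSZ);
if (r == 0)
    break;
if (r != BLCKSZ)
{
    if (r < 0)
        pg_log_error("could not read block %u in file \"%s\": %m",
                     blockno, fn);
    else
        pg_log_error("could not read block %u in file \"%s\": read %d of %d",
                     blockno, fn, r, BLCKSZ);
    exit(1);
}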

>  /* Write block with checksum */
> -if (write(f, buf.data, BLCKSZ) != BLCKSZ)
> +w = write(f, buf.data, BLCKSZ);
> +if (w != BLCKSZ)
>  {
> -pg_log_error("could not write block %u in file \"%s\": %m",
> - blockno, fn);
> +if (w < 0)
> +    pg_log_error("could not write block %u in file \"%s\": %m",
> +                 blockno, fn);
> +else
> +    pg_log_error("could not write block %u in file \"%s\": wrote %d of %d",
> +                 blockno, fn, w, BLCKSZ);
>  exit(1);
>  }
>  }

This is consistent.
--
Michael




Re: Re: Email to hackers for test coverage

2019-08-26 Thread Michael Paquier
On Mon, Aug 26, 2019 at 05:10:59PM +0800, movead...@highgo.ca wrote:
> Thanks for the reminder. I have modified the patch (now attached as
> 'regression_20190826.patch'), and I have done a successful test on Cygwin.

There is a section in float4.sql which deals with overflow and
underflow, so wouldn't it be better to move the tests there?  You
could just trigger the failures with that:
=# insert into float4_tbl values ('-10e-70'::float8);
ERROR:  22003: value out of range: underflow
LOCATION:  check_float4_val, float.h:145
=# insert into float4_tbl values ('-10e70'::float8);
ERROR:  22003: value out of range: overflow
LOCATION:  check_float4_val, float.h:140

I would also test all four patterns: 10e70, 10e-70, -10e70, -10e-70.

For the numeric part, this improves the case of
ApplySortAbbrevFullComparator() where both values are not NULL.  Could
things be done so that the other code paths are fully covered?  A single
INSERT is fine, by the way, to add the extra coverage.
--
Michael




RE: Resume vacuum and autovacuum from interruption and cancellation

2019-08-26 Thread Jamison, Kirk
On Monday, August 19, 2019 10:39 AM (GMT+9), Masahiko Sawada wrote:
> Fixed.
> 
> Attached the updated version patch.

Hi Sawada-san,

I haven't tested it with heavily updated large tables, but I think the patch
is reasonable, as it helps to shorten the execution time of vacuum by
skipping redundant vacuuming and prioritizing reclaiming the garbage instead.
I'm not sure how commonly repeated vacuuming of already-vacuumed blocks is
reported as a problem, but I think it's a reasonable improvement.

I skimmed the patch and have a few comments. If they seem fit, feel free to
follow them, but it's also OK if you don't.
1.
>+ Block number to resume vacuuming from
Perhaps you could drop "from".

2.
>+  . This behavior is helpful
>+  when to resume vacuuming from interruption and cancellation.The default
when resuming a vacuum run after interruption or cancellation.
There should also be a space between the period and "The".

3.
>+  set to true. This option is ignored if either the FULL,
>+  the FREEZE or DISABLE_PAGE_SKIPPING
>+  option is used.
..if either of the FULL, FREEZE, or DISABLE_PAGE_SKIPPING options is used.

4.
>+  next_fsm_block_to_vacuum,
>+  next_block_to_resume;
Would "next_block_to_resume_vacuum" be a clearer name?
You may disregard this if that's too long.

5.
>+  Assert(start_blkno <= nblocks); /* both are the same iif it's empty */
iif -> if there are no blocks / if nblocks is 0

6.
>+   * If not found a valid saved block number, resume from the
>+   * first block.
>+   */
>+  if (!found ||
>+      tabentry->vacuum_resume_block >= RelationGetNumberOfBlocks(onerel))
This describes the case when vacuum_resume_block >= RelationGetNumberOfBlocks...,
doesn't it?  Perhaps a better framing would be
"If the saved block number is found to be invalid, ...".

7.
>+  bool        vacuum_resume;  /* enables vacuum to resuming from last
>+                               * vacuumed block. */
resuming --> resume


Regards,
Kirk Jamison


Re: old_snapshot_threshold vs indexes

2019-08-26 Thread Thomas Munro
On Tue, Aug 27, 2019 at 1:54 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > OK I started writing a patch and realised there were a few ugly
> > problems that I was about to report here... but now I wonder if, based
> > on the comment for RelationHasUnloggedIndex(), we shouldn't just nuke
> > all this code.  We don't actually support unlogged indexes on
> > permanent tables (there is no syntax to create them, and
> > RelationHasUnloggedIndex() will never return true in practice because
> > RelationNeedsWAL() will always return false first).
>
> Oh!  That explains why the code coverage report shows clearly that
> RelationHasUnloggedIndex never returns true ;-)
>
> > This is a locking
> > protocol violation and a performance pessimisation in support of a
> > feature we don't have.  If we add support for that in some future
> > release, we can figure out how to do it properly then, no?
>
> +1.  That fix is also back-patchable, which adding fields to relcache
> entries would not be.

There is a fly in the ointment: REL9_6_STABLE's copy of
RelationHasUnloggedIndex() is hardcoded to return true for hash
indexes (see commit 2cc41acd8).

However, I now see that there isn't a buffer content lock deadlock
risk here after all, because we don't reach RelationHasUnloggedIndex()
if IsCatalogRelation(rel).  That reminds me of commit 4fd05bb55b4.  It
still doesn't seem like a great idea to be doing catalog access while
holding the buffer content lock, though.

So I think we need to leave 9.6 as is, and discuss how far back to
back-patch the attached.  It could go back to 10, but perhaps we
should be cautious and push it to master only for now, if you agree
with my analysis of the deadlock thing.

> It's not really apparent to me that unlogged indexes on logged tables
> would ever be a useful combination, so I'm certainly willing to nuke
> poorly-thought-out code that putatively supports it.  But perhaps
> we should add some comments to remind us that this area would need
> work if anyone ever wanted to support that.  Not sure where.

It might make sense for some kind of in-memory index that is rebuilt
from the heap at startup, but then I doubt such a thing would have an
index relation with a relpersistence to check anyway.

-- 
Thomas Munro
https://enterprisedb.com


0001-Don-t-perform-catalog-lookups-in-TestForOldSnapshot.patch
Description: Binary data


Re: refactoring - share str2*int64 functions

2019-08-26 Thread Michael Paquier
On Mon, Aug 26, 2019 at 11:05:55AM +0200, Fabien COELHO wrote:
> I have started to do something, and I can spend some time on that this week,
> but I'm pretty unclear about what exactly should be done.

Thanks.

> The error returning stuff is simple enough, but I'm unclear about what to do
> with pg_uint64, which has a totally different signature. Should it be
> aligned?

I am not sure what you mean with aligned here.  If you mean
consistent, getting into a state where we have all functions for all
three sizes, signed and unsigned, would be nice.
--
Michael




Re: understand the pg locks in a simple case

2019-08-26 Thread Alex
On Tue, Aug 20, 2019 at 10:52 PM Alex  wrote:

>
>
> On Tue, Aug 20, 2019 at 4:59 PM Heikki Linnakangas 
> wrote:
>
>> On 20/08/2019 10:23, Alex wrote:
>> > I have troubles to understand the pg lock in the following simple
>> > situation.
>> >
>> >
>> > Session 1:
>> >
>> >
>> > begin; update t set a = 1 where a = 10;
>> >
>> >
>> > Session 2:
>> >
>> >
>> > begin; update t set a = 2 where a = 10;
>> >
>> >
>> > They update the same row and session 2 is blocked by session 1 without
>> > surprise.
>> >
>> >
>> > The straightforward implementation would be:
>> >
>> > Session 1 locks the *tuple (ExclusiveLock)*.
>> >
>> > When session 2 tries to lock it in exclusive mode, it is blocked.
>> >
>> >
>> > But when I check pg_locks for session 1, I can see *no tuple lock*
>> > there; when I check session 2, I can see a *tuple (ExclusiveLock) is
>> > granted*, but it is waiting for a transactionid.
>> >
>> >
>> > Since every tuple has txn information, it is not hard to implement it
>> > this way, but are there any benefits over the straightforward way?
>> > With the current implementation, what is the point of the
>> > tuple (ExclusiveLock) for session 2?
>>
>> The reason that tuple locking works with XIDs, rather than directly
>> acquiring a lock on the tuple, is that the memory allocated for the lock
>> manager is limited. One transaction can lock millions of tuples, and if
>> it had to hold a normal lock on every tuple, you would run out of memory
>> very quickly.
>>
>
> Thank you!
>
> So can I understand it as: we don't need a lock on every tuple we update,
> since
> 1) the number of locks may be huge, and holding one per tuple would consume
> a lot of memory, and
> 2) the tuple header, which includes xid info, is unavoidable anyway due to
> MVCC requirements, and it can be used here, so we save the individual locks?
>
> And in my example above, when session 2 is waiting for a xid lock, it is
> *granted* a tuple lock in ExclusiveLock mode; what is the purpose of this
> lock?
>

I will try to answer this question myself.  The purpose of the tuple lock
(with ExclusiveLock mode) is to ensure that no more than one client adds a
transaction lock on the same tuple at the same time.  Once the txn lock is
added, the tuple lock can be released.


>> So it may seem that we don't need heavy-weight locks on individual
>> tuples at all. But we still need them to establish the order in which
>> backends are waiting. The locking protocol is:
>>
>> 1. Check if a tuple's xmax is set.
>> 2. If it's set, obtain a lock on the tuple's TID.
>> 3. Wait on the transaction to finish, by trying to acquire lock on the
>> XID.
>> 4. Update the tuple, release the lock on the XID, and on the TID.
>>
>> It gets more complicated if there are multixids, or you update a row you
>> have earlier locked in a weaker mode, but that's the gist of it.
>>
>> We could skip the lock on the tuple's TID, but then if you have multiple
>> backends trying to update or lock a row, it would not be deterministic
>> who gets the lock first. For example:
>>
>> Session A: BEGIN; UPDATE foo SET col='a' WHERE id = 123;
>> Session B: UPDATE foo SET col='b' WHERE id = 123; 
>> Session C: UPDATE foo SET col='c' WHERE id = 123; 
>> Session A: ROLLBACK;
>>
>> Without the lock on the TID, it would be indeterministic, whether
>> session B or C gets to update the tuple, when A rolls back. With the
>> above locking protocol, B will go first. B will acquire the lock on the
>> TID, and block on the XID lock, while C will block on the TID lock held
>> by B. If there were more backends trying to do the same, they would
>> queue for the TID lock, too.
>>
>> - Heikki
>>
>


Re: gharial segfaulting on REL_12_STABLE only

2019-08-26 Thread Thomas Munro
On Tue, Aug 27, 2019 at 1:48 PM Tom Lane  wrote:
> A stack trace would likely be really useful right about now.

Yeah.  Looking into how to get that.


--
Thomas Munro
https://enterprisedb.com




RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-26 Thread Smith, Peter
Greetings, 

(Apologies for any naïve thoughts below. Please correct my misunderstandings)

I am trying to understand the background for the ideas proposed and/or already 
decided, but it is increasingly difficult to follow.

I’ve been watching the TDE list for several months, and over that time there 
have been dozens of different ideas floated. Each of them has its own good 
points; some are conflicting.

IMO any TDE implementation will be a trade-off between a number of factors:
* Design – e.g. Simple v Complex solution
* Secureness – e.g. Acceptance that a simpler solution may not handle every 
possible threat
* Cost/Feasibility – e.g. How hard will TDE be to implement/maintain. 
* User expectations - e.g. What is the “threat model” the end user actually 
wants to protect against
* User expectations – e.g. Comparison with other products
* Completeness – e.g. Acknowledgement that first implementation may not meet 
the end-goal.
* Future proof – e.g. ability to evolve in future TDE versions (with minimal 
re-write of what came before)
* Usability – e.g. Online/offline considerations
* Usability – e.g. Will a more complex solution end up being too difficult to 
actually use/administer
* etc…

New TDE ideas keep popping up all the time. The discussion sometimes has become 
mired in technical details; I’m losing sight of the bigger picture.

Would it be possible to share a *tabulation* of all the TDE components? Each 
component may have a number of design choices (options), with brief lists of 
pros/cons for each option, so that each can be concisely summarised on its 
respective merits.

I think this would be of a great help to understand how we got to where we are 
now, as well as helping to focus on how to proceed.

For example,

=
Component: TDKEY
* Option: use derived keys; List of Pros/Cons
* Option: use random keys; List of Pros/Cons
* Option: use keys from some external source and encrypted by MDKEY; List of 
Pros/Cons
* Option: use same TKEY for all tables/tablespaces; List of Pros/Cons
* Option: … 
* Option: …
* => Decision (i.e. the least-worst compromise/combination of the possible 
options)
=

~

Postscript: 

After writing this, I recalled recently reading a mail from Antonin 
https://www.postgresql.org/message-id/44057.1565977657%40antos which says 
pretty much the same thing!

Also, I recognise that there is an offline shared Googledoc which already 
includes some of this information, but I think it would be valuable if it could 
be formatted as Pros/Cons summary table and shared on the Wiki page for 
everybody to see.


Kind Regards,
---
Peter Smith
Fujitsu Australia


Re: old_snapshot_threshold vs indexes

2019-08-26 Thread Tom Lane
Thomas Munro  writes:
> OK I started writing a patch and realised there were a few ugly
> problems that I was about to report here... but now I wonder if, based
> on the comment for RelationHasUnloggedIndex(), we shouldn't just nuke
> all this code.  We don't actually support unlogged indexes on
> permanent tables (there is no syntax to create them, and
> RelationHasUnloggedIndex() will never return true in practice because
> RelationNeedsWAL() will always return false first).

Oh!  That explains why the code coverage report shows clearly that
RelationHasUnloggedIndex never returns true ;-)

> This is a locking
> protocol violation and a performance pessimisation in support of a
> feature we don't have.  If we add support for that in some future
> release, we can figure out how to do it properly then, no?

+1.  That fix is also back-patchable, which adding fields to relcache
entries would not be.

It's not really apparent to me that unlogged indexes on logged tables
would ever be a useful combination, so I'm certainly willing to nuke
poorly-thought-out code that putatively supports it.  But perhaps
we should add some comments to remind us that this area would need
work if anyone ever wanted to support that.  Not sure where.

regards, tom lane




Re: gharial segfaulting on REL_12_STABLE only

2019-08-26 Thread Tom Lane
Thomas Munro  writes:
> This is apparently an EDB-owned machine but I have no access to it
> currently (I could ask if necessary).  For some reason it's been
> failing for a week, but only on REL_12_STABLE, with this in the log:

Yeah, I've been puzzling over that to little avail.

> It's hard to see how cdc8d371e2, the only non-doc commit listed on the
> first failure, could have anything to do with that.

Exactly :-(.  It seems completely reproducible since then, but how
could that have triggered a failure over here?  And why only in this
branch?  The identical patch went into HEAD.

> 2019-08-20 04:31:48.886 MDT [13421:4] LOG:  server process (PID 13871)
> was terminated by signal 11: unrecognized signal
> 2019-08-20 04:31:48.886 MDT [13421:5] DETAIL:  Failed process was
> running: SET default_table_access_method = '';

> Apparently HPUX's sys_siglist doesn't recognise that most popular of
> signals, 11, but by googling I see that it has its traditional meaning
> there.

HPUX hasn't *got* sys_siglist, nor strsignal() which is what we're
actually relying on these days (cf. pgstrsignal.c).  I was puzzled
by that too to start with, though.  I wonder if we shouldn't rearrange
pg_strsignal so that the message in the !HAVE_STRSIGNAL case is
something like "signal names not available on this platform" rather
than something that looks like we should've recognized it and didn't.
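
Something like the following, perhaps (the wording is just an illustration
of the suggestion, not a committed change to pgstrsignal.c):

const char *
pg_strsignal(int signum)
{
    const char *result;

#ifdef HAVE_STRSIGNAL
    result = strsignal(signum);
    if (result == NULL)
        result = "unrecognized signal";
#else
    /*
     * Be explicit that the platform lacks signal name support, rather than
     * implying that we failed to recognize an otherwise-known signal.
     */
    result = "signal names not available on this platform";
#endif

    return result;
}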

> 2019-08-20 04:31:22.422 MDT [13871:34] pg_regress/create_am LOG:
> statement: SET default_table_access_method = '';

> Perhaps it was really running the next statement.

Hard to see how, because this should have reported

ERROR:  invalid value for parameter "default_table_access_method": ""
DETAIL:  default_table_access_method cannot be empty.

but it didn't get that far.  It seems like it must have died either
in the (utterly trivial) check that leads to the above-quoted
complaint, or somewhere in the ereport mechanism.  Neither theory
seems very credible.

The seeming action-at-a-distance nature of the failure has me
speculating about compiler or linker bugs, but I dislike
jumping to that type of conclusion without hard evidence.

A stack trace would likely be really useful right about now.

regards, tom lane




Re: IoT/sensor data and B-Tree page splits

2019-08-26 Thread Peter Geoghegan
On Mon, Aug 26, 2019 at 4:29 PM Arcadiy Ivanov  wrote:
> This problem is not limited to IoT but to RT financial transaction
> ingestion as well.

Not surprising, since the TPC-E benchmark models a financial trading
application. Perhaps it exhibits this behavior because it is actually
representative of a real trading application.

Note that pg_stats.correlation is 1.0 for the leading indexed column
(in the trade_history PK index), indicating *perfect* correlation.
It's not perfectly correlated when you look at it under a microscope,
though.

> I found BRIN indices to work exceptionally well for that, while B-tree
> taking enormous amounts of space with no performance difference or win
> going to BRIN.

That won't work with the TPC-E example, though, since it's a primary key index.

> The situation gets even worse when B-tree index is subjected to
> identical tuples which often happens when you have an avalanche of
> timestamps that are within less than 1ms of each other (frequent TS
> rounding resolution).

The good news there is that that will almost certainly be a lot better
in Postgres 12. TPC-E also has a number of very low cardinality
indexes, despite being an OLTP benchmark. Some of these indexes are
also listed in the 2012 problem report I linked to. Those same indexes
will be a lot smaller on Postgres 12. It should also generate a lot
less WAL compared to previous versions. (Plus we may get dynamic
B-Tree deduplication in Postgres 13, which would improve matters
further with low cardinality indexes.)

-- 
Peter Geoghegan




Re: A problem about partitionwise join

2019-08-26 Thread Amit Langote
Hi Richard,

On Mon, Aug 26, 2019 at 6:33 PM Richard Guo  wrote:
>
> Hi All,
>
> To generate partitionwise join, we need to make sure there exists an
> equi-join condition for each pair of partition keys, which is performed
> by have_partkey_equi_join(). This makes sense and works well.
>
> But if, let's say, one certain pair of partition keys (foo.k = bar.k)
> has formed an equivalence class containing consts, no join clause would
> be generated for it, since we have already generated 'foo.k = const' and
> 'bar.k = const' and pushed them into the proper restrictions earlier.
>
> This will make partitionwise join fail to be planned if there are
> multiple partition keys and the pushed-down restrictions 'xxx = const'
> fail to prune away any partitions.
>
> Consider the examples below:
>
> create table p (k1 int, k2 int, val int) partition by range(k1,k2);
> create table p_1 partition of p for values from (1,1) to (10,100);
> create table p_2 partition of p for values from (10,100) to (20,200);
>
> If we are joining on each pair of partition keys, we can generate
> partitionwise join:
>
> # explain (costs off)
> select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = bar.k2;
>   QUERY PLAN
> --
>  Append
>->  Hash Join
>  Hash Cond: ((foo.k1 = bar.k1) AND (foo.k2 = bar.k2))
>  ->  Seq Scan on p_1 foo
>  ->  Hash
>->  Seq Scan on p_1 bar
>->  Hash Join
>  Hash Cond: ((foo_1.k1 = bar_1.k1) AND (foo_1.k2 = bar_1.k2))
>  ->  Seq Scan on p_2 foo_1
>  ->  Hash
>->  Seq Scan on p_2 bar_1
> (11 rows)
>
> But if we add another qual 'foo.k2 = const', we will be unable to
> generate partitionwise join any more, because have_partkey_equi_join()
> thinks not every partition key has an equi-join condition.
>
> # explain (costs off)
> select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = bar.k2 
> and foo.k2 = 16;
>QUERY PLAN
> -
>  Hash Join
>Hash Cond: (foo.k1 = bar.k1)
>->  Append
>  ->  Seq Scan on p_1 foo
>Filter: (k2 = 16)
>  ->  Seq Scan on p_2 foo_1
>Filter: (k2 = 16)
>->  Hash
>  ->  Append
>->  Seq Scan on p_1 bar
>  Filter: (k2 = 16)
>->  Seq Scan on p_2 bar_1
>  Filter: (k2 = 16)
> (13 rows)
>
> Is this a problem?

Perhaps.  Maybe it has to do with the way have_partkey_equi_join() has
been coded.  If it was coded such that it figured out on its own that
the equivalence (foo.k2, bar.k2, ...) does exist, then that would
allow partitionwise join to occur, which I think would be OK to do.
But maybe I'm missing something.
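
A rough sketch of that idea, assuming have_partkey_equi_join() were given
access to the PlannerInfo (it does not take root today) and with
illustrative variable names:

/*
 * If no join clause explicitly equates this pair of partition key
 * expressions, ask the equivalence-class machinery whether they are
 * known equal anyway (e.g. both equated to the same constant).
 */
if (!pk_has_clause[ipk] &&
    exprs_known_equal(root, (Node *) partkey_expr1, (Node *) partkey_expr2))
    pk_has_clause[ipk] = true;

Whether that is safe in all cases (collations, operator families, etc.)
would of course need checking.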

Thanks,
Amit




gharial segfaulting on REL_12_STABLE only

2019-08-26 Thread Thomas Munro
Hi,

This is apparently an EDB-owned machine but I have no access to it
currently (I could ask if necessary).  For some reason it's been
failing for a week, but only on REL_12_STABLE, with this in the log:

2019-08-20 04:31:48.886 MDT [13421:4] LOG:  server process (PID 13871)
was terminated by signal 11: unrecognized signal
2019-08-20 04:31:48.886 MDT [13421:5] DETAIL:  Failed process was
running: SET default_table_access_method = '';

Apparently HPUX's sys_siglist doesn't recognise that most popular of
signals, 11, but by googling I see that it has its traditional meaning
there.  That's clearly in the create_am test:

019-08-20 04:31:22.404 MDT [13871:31] pg_regress/create_am HINT:  Use
DROP ... CASCADE to drop the dependent objects too.
2019-08-20 04:31:22.404 MDT [13871:32] pg_regress/create_am STATEMENT:
 DROP ACCESS METHOD gist2;
2019-08-20 04:31:22.405 MDT [13871:33] pg_regress/create_am LOG:
statement: DROP ACCESS METHOD gist2 CASCADE;
2019-08-20 04:31:22.422 MDT [13871:34] pg_regress/create_am LOG:
statement: SET default_table_access_method = '';

Perhaps it was really running the next statement.

It's hard to see how cdc8d371e2, the only non-doc commit listed on the
first failure, could have anything to do with that.

-- 
Thomas Munro
https://enterprisedb.com




Re: Zedstore - compressed in-core columnar storage

2019-08-26 Thread Ashwin Agrawal
On Mon, Aug 26, 2019 at 5:36 AM Ashutosh Sharma 
wrote:

> Thanks Ashwin and Heikki for your responses. I've one more query here,
>
> If BTree index is created on a zedstore table, the t_tid field of
> Index tuple contains the physical tid that is not actually pointing to
> the data block instead it contains something from which the logical
> tid can be derived. So, when IndexScan is performed on a zedstore
> table, it fetches the physical tid from the index page and derives the
> logical tid out of it and then retrieves the data corresponding to
> this logical tid from the zedstore table. For that, it kind of
> performs SeqScan on the zedstore table for the given tid.


Nope, it won't perform a seqscan.  Zedstore is laid out as a btree itself,
with the logical TID as its key, so it can quickly find which page the
logical TID belongs to and access only that page; it doesn't need to perform
a seqscan for that.  That's one of the rationales for laying things out in
btree fashion: to easily connect the logical to the physical world without
keeping any external mapping.

AFAIU, the following user level query on zedstore table
>
> select * from zed_tab where a = 3;
>
> gets internally converted to
>
> select * from zed_tab where tid = 3; -- assuming that index is created
> on column 'a' and the logical tid associated with a = 3 is 3.
>

So, for this it will first access only the TID btree and find the leaf page
with tid=3.  It performs the visibility checks for the tuple, and only if the
tuple is visible does it fetch all the columns for that TID, again using the
per-column btrees to fetch only the leaf page for that logical tid.

Hope that helps to clarify the confusion.


Re: old_snapshot_threshold vs indexes

2019-08-26 Thread Thomas Munro
On Tue, Aug 27, 2019 at 10:59 AM Tom Lane  wrote:
> > At first glance it seems
> > like we need to capture PageGetLSN(page) while we have the lock, and
> > then later pass that into TestForOldSnapshot() instead of the page.
> > I'll look into that and write a patch, probably in a day or two.
>
> Hm, but surely we need to do other things to the page besides
> TestForOldSnapshot?  I was imagining that we'd collect the
> RelationHasUnloggedIndex flag (or perhaps better, the
> RelationAllowsEarlyPruning result) before attempting to lock
> the page, and then pass it to TestForOldSnapshot.

OK I started writing a patch and realised there were a few ugly
problems that I was about to report here... but now I wonder if, based
on the comment for RelationHasUnloggedIndex(), we shouldn't just nuke
all this code.  We don't actually support unlogged indexes on
permanent tables (there is no syntax to create them, and
RelationHasUnloggedIndex() will never return true in practice because
RelationNeedsWAL() will always return false first).  This is a locking
protocol violation and a performance pessimisation in support of a
feature we don't have.  If we add support for that in some future
release, we can figure out how to do it properly then, no?

-- 
Thomas Munro
https://enterprisedb.com




Re: IoT/sensor data and B-Tree page splits

2019-08-26 Thread Peter Geoghegan
On Mon, Aug 26, 2019 at 4:59 PM Arcadiy Ivanov  wrote:
> But apart from TPC-E and having to perform to it, is there any practical
> real world usefulness in trying to have a B-tree index on TS-based data
> just to have a PK on it, as opposed to having a BRIN on a TS field and
> calling it a day?

The index in question isn't an index on a timestamp. Its leading
column is an integer that is almost monotonically increasing, but not
quite.

If a BRIN index works for you, then that's probably what you should
use. I don't think that that's relevant, though. BRIN indexes have
clear disadvantages, including the fact that you have to know that
your data is amenable to BRIN indexing in order to use a BRIN index.

-- 
Peter Geoghegan




Re: IoT/sensor data and B-Tree page splits

2019-08-26 Thread Arcadiy Ivanov

On 8/26/19 6:48 PM, Peter Geoghegan wrote:

Such data often consists of timestamps from a large number
of low cost devices -- event data that arrives *approximately* in
order. This is more or less the problem that the TimescaleDB extension
targets, so it seems likely that a fair number of users care about
getting it right, even if they don't know it.


This problem is not limited to IoT; it applies to RT financial transaction 
ingestion as well.
I found BRIN indices to work exceptionally well for that, while the B-tree 
takes enormous amounts of space with no performance difference, or with the 
win going to BRIN.
The situation gets even worse when a B-tree index is subjected to identical 
tuples, which often happens when you have an avalanche of timestamps that 
are within less than 1ms of each other (frequent TS rounding resolution).


--
Arcadiy Ivanov
arca...@gmail.com | @arcivanov | https://ivanov.biz
https://github.com/arcivanov





Re: IoT/sensor data and B-Tree page splits

2019-08-26 Thread Arcadiy Ivanov

On 8/26/19 7:49 PM, Peter Geoghegan wrote:

Not surprising, since the TPC-E benchmark models a financial trading
application.



The good news there is that that will almost certainly be a lot better
in Postgres 12. TPC-E also has a number of very low cardinality
indexes, despite being an OLTP benchmark.



But apart from TPC-E and having to perform to it, is there any practical 
real world usefulness in trying to have a B-tree index on TS-based data 
just to have a PK on it, as opposed to having a BRIN on a TS field and 
calling it a day?


--
Arcadiy Ivanov
arca...@gmail.com | @arcivanov | https://ivanov.biz
https://github.com/arcivanov





IoT/sensor data and B-Tree page splits

2019-08-26 Thread Peter Geoghegan
The well known rightmost page split optimization (where we apply leaf
page fill factor) usually does a great job of maximizing space
utilization with indexes on a single auto-incrementing column or
timestamp column, by packing the left half of the rightmost page full.
Splitting the rightmost page predicts further splits of the new
rightmost page. If we are inserting ever-increasing values into the
index, then we win by packing the left half full -- we get very high
space utilization. If we're wrong to assume that another rightmost
page split is sure to follow soon, then we lose on space utilization,
though only by a tiny amount. We may occasionally lose out on space
utilization because our assumption that the rightmost page is special
was wrong. But, since it isn't special, we only inappropriately share
space unevenly once in a blue moon, which doesn't matter at all.

This seems to me to be fundamentally based on the assumption that you
either have random insertions or sequential insertions, without any
gray area between the two. Commit f21668f328c more or less generalized
this long established optimization to work with cases where it makes
sense to split at the rightmost page of a *grouping* within the index,
rather than the *entire* index, but that is almost unrelated to what I
want to bring up now. My concern right now is sensor data and IoT
data, which seem to call the "random or sequential" assumption into
question. Such data often consists of timestamps from a large number
of low cost devices -- event data that arrives *approximately* in
order. This is more or less the problem that the TimescaleDB extension
targets, so it seems likely that a fair number of users care about
getting it right, even if they don't know it.

I happened to notice that the largest TPC-E index has this exact
problem (it was the largest back when VMware ran TPC-E on Postgres
[1], which they specifically complained about at the time). The
"trade_history" table's primary key has tuples inserted
almost-but-not-quite in order. Sometimes packing the left half of the
rightmost page 90% full works out, because the remaining space is
enough to absorb all future insertions that arrive "slightly late"
(though barely). More often it turns out that that isn't enough space,
and the almost-rightmost page is split a second time, which is very
inefficient in lots of ways, even compared to the case where we reduce
fill factor to 50.

I have found that reducing trade_history_pkey's fill factor to about
70 increases leaf page space utilization, leaving it at about 85% (it
was under 50% with a fill factor of 90). I tend to doubt that this is
a practical tuning strategy in the real world, though, since varying
TPC-E scale alters which exact setting is effective. Ideally,
nbtsplitloc.c could intelligently adapt to slightly out of order
insertions in order to maximize space utilization -- it could
*dynamically* tune fill factor in response to ongoing observations
about page splits. Importantly, this would allow nbtree to avoid
unnecessarily splitting the same page twice in a row, having packed it
almost full on the first split -- this situation seems truly
pathological to me (think about the WAL overhead).

ISTM that adding such an optimization would require maintaining
dedicated book keeping metadata, which doesn't seem particularly
appealing. It might have acceptable overhead within a single backend.
I'm thinking of something like the RelationGetTargetBlock() stuff. Has
anybody else thought about this? What would a robust algorithm look
like?
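
Purely as a strawman for the kind of book keeping I mean (names invented,
per-backend only, not a worked-out design), something in the spirit of
RelationGetTargetBlock() might look like:

typedef struct LastSplitHint
{
    Oid         relid;          /* index we split most recently */
    BlockNumber lastsplitblk;   /* leaf block that was split */
    int         effective_ff;   /* fill factor applied to that split */
} LastSplitHint;

static LastSplitHint last_split_hint = {InvalidOid, InvalidBlockNumber, 0};

static int
choose_leaf_fillfactor(Relation rel, BlockNumber origpage, int leaffillfactor)
{
    if (last_split_hint.relid == RelationGetRelid(rel) &&
        last_split_hint.lastsplitblk == origpage)
    {
        /* Splitting the page we just split: back off toward a 50/50 split. */
        last_split_hint.effective_ff = Max(50, last_split_hint.effective_ff - 10);
    }
    else
    {
        last_split_hint.relid = RelationGetRelid(rel);
        last_split_hint.lastsplitblk = origpage;
        last_split_hint.effective_ff = leaffillfactor;
    }
    return last_split_hint.effective_ff;
}

That obviously says nothing about how to detect the "slightly late" insertion
pattern robustly, which is the hard part.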

Perhaps this is a problem that isn't worth solving right now, but it
is definitely a real problem.

[1] 
https://www.postgresql.org/message-id/66ce997fb523c04e9749452273184c6c137cb88...@exch-mbx-113.vmware.com
--
Peter Geoghegan




Re: old_snapshot_threshold vs indexes

2019-08-26 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Aug 27, 2019 at 9:28 AM Tom Lane  wrote:
>> It is hard to express what a bad idea it is to be asking for complex
>> catalog searches while holding a buffer lock.  We could easily get
>> into undetectable deadlocks that way, for example.  We need to refactor
>> these call sites to arrange that the catalog lookup happens outside
>> the low-level page access.

> Hmm.  Right.  Perhaps the theory was that it was OK because it's
> shared (rather than exclusive), or perhaps the catalog lookup was
> sufficiently well hidden and was forgotten.

I strongly suspect the latter.  Also, it may well be that the
unlogged-index check was not in the original design but was
added later with insufficient thought about where it'd be called
from.

> At first glance it seems
> like we need to capture PageGetLSN(page) while we have the lock, and
> then later pass that into TestForOldSnapshot() instead of the page.
> I'll look into that and write a patch, probably in a day or two.

Hm, but surely we need to do other things to the page besides
TestForOldSnapshot?  I was imagining that we'd collect the
RelationHasUnloggedIndex flag (or perhaps better, the
RelationAllowsEarlyPruning result) before attempting to lock
the page, and then pass it to TestForOldSnapshot.

regards, tom lane




Re: old_snapshot_threshold vs indexes

2019-08-26 Thread Thomas Munro
On Tue, Aug 27, 2019 at 9:28 AM Tom Lane  wrote:
> I agree that this code is absolutely horrid as it stands.  However,
> it doesn't look to me like caching RelationHasUnloggedIndex is quite
> enough to fix it.  The other problem is that the calls in question
> seem to be mostly in TestForOldSnapshot, which is called in places
> like heapgetpage:
>
> LockBuffer(buffer, BUFFER_LOCK_SHARE);
>
> dp = BufferGetPage(buffer);
> TestForOldSnapshot(snapshot, scan->rs_base.rs_rd, dp);
> lines = PageGetMaxOffsetNumber(dp);
> ntup = 0;
>
> It is hard to express what a bad idea it is to be asking for complex
> catalog searches while holding a buffer lock.  We could easily get
> into undetectable deadlocks that way, for example.  We need to refactor
> these call sites to arrange that the catalog lookup happens outside
> the low-level page access.

Hmm.  Right.  Perhaps the theory was that it was OK because it's
shared (rather than exclusive), or perhaps the catalog lookup was
sufficiently well hidden and was forgotten.  At first glance it seems
like we need to capture PageGetLSN(page) while we have the lock, and
then later pass that into TestForOldSnapshot() instead of the page.
I'll look into that and write a patch, probably in a day or two.
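
In heapgetpage() terms, the shape I have in mind is roughly this (with
TestForOldSnapshotLSN() being a hypothetical variant taking the captured
LSN rather than the page -- no such function exists today):

XLogRecPtr  page_lsn;

LockBuffer(buffer, BUFFER_LOCK_SHARE);

dp = BufferGetPage(buffer);
page_lsn = PageGetLSN(dp);      /* capture while the content lock is held */
lines = PageGetMaxOffsetNumber(dp);
ntup = 0;
/* ... gather whatever else we need from the page while still locked ... */

LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

/* Catalog lookups (RelationHasUnloggedIndex() etc.) are now safe. */
TestForOldSnapshotLSN(snapshot, scan->rs_base.rs_rd, page_lsn);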

> Your 0001 patch looks reasonable for the purpose of caching the
> result, though.

Thanks for the review.  I'll wait until we figure out what to do about
the other problem and what needs to be back-patched.

-- 
Thomas Munro
https://enterprisedb.com




Re: Misleading comment in tuplesort_set_bound

2019-08-26 Thread Tom Lane
James Coleman  writes:
> While digging into the incremental sort patch, I noticed in
> tuplesort.c at the beginning of the function in $SUBJECT we have this
> comment and assertion:

> tuplesort_set_bound(Tuplesortstate *state, int64 bound)
> {
> /* Assert we're called before loading any tuples */
> Assert(state->status == TSS_INITIAL);

> But AFAICT from reading the code in puttuple_common the state remains
> TSS_INITIAL while tuples are inserted (unless we reach a point where
> we decide to transition it to TSS_BOUNDED or TSS_BUILDRUNS).

You missed the relevance of the next line:

Assert(state->memtupcount == 0);

I think the comment is fine as-is.  Perhaps the code would be clearer
though, if we merged those two asserts into one?

/* Assert we're called before loading any tuples */
Assert(state->status == TSS_INITIAL &&
   state->memtupcount == 0);

I'm not totally sure about the usefulness/relevance of the two
assertions following these, but they could likely do with their
own comment(s), because this one surely isn't covering them.

regards, tom lane




Re: old_snapshot_threshold vs indexes

2019-08-26 Thread Tom Lane
Thomas Munro  writes:
> On my laptop, all prewarmed, no concurrency, the mere existence of 10
> brin indexes causes a sequential scan to take ~5% longer and an
> uncorrelated index scan to take ~45% longer (correlated index scans
> don't suffer).  Here's a draft patch for v13 that fixes that problem
> by caching the result of RelationHasUnloggedIndex().

I agree that this code is absolutely horrid as it stands.  However,
it doesn't look to me like caching RelationHasUnloggedIndex is quite
enough to fix it.  The other problem is that the calls in question
seem to be mostly in TestForOldSnapshot, which is called in places
like heapgetpage:

LockBuffer(buffer, BUFFER_LOCK_SHARE);

dp = BufferGetPage(buffer);
TestForOldSnapshot(snapshot, scan->rs_base.rs_rd, dp);
lines = PageGetMaxOffsetNumber(dp);
ntup = 0;

It is hard to express what a bad idea it is to be asking for complex
catalog searches while holding a buffer lock.  We could easily get
into undetectable deadlocks that way, for example.  We need to refactor
these call sites to arrange that the catalog lookup happens outside
the low-level page access.

Your 0001 patch looks reasonable for the purpose of caching the
result, though.

regards, tom lane




Re: Check the number of potential synchronous standbys

2019-08-26 Thread Tom Lane
"=?gb18030?B?1cXOxL3c?=" <757634...@qq.com> writes:
> When the number of potential synchronous standbys is smaller than num_sync, 
> such as 'FIRST 3 (1,2)', 'ANY 4 (1,2,3)' in the synchronous_standby_names, 
> the processes will wait for synchronous replication forever. 
> Obviously, that's not expected. I think returning false and an error message
> may be better. Attached is a patch that implements the simple check.

Well, it's not *that* simple; this patch rejects cases like "ANY 2(*)"
which need to be accepted.  That causes the src/test/recovery tests
to fail (you should have tried check-world).

I also observe that there's a test case in 007_sync_rep.pl which is
actually exercising the case you want to reject:

# Check that sync_state of each standby is determined correctly
# when num_sync exceeds the number of names of potential sync standbys
# specified in synchronous_standby_names.
test_sync_state(
$node_master, qq(standby1|0|async
standby2|4|sync
standby3|3|sync
standby4|1|sync),
'num_sync exceeds the num of potential sync standbys',
'6(standby4,standby0,standby3,standby2)');

So it can't be said that nobody thought about this at all.

Now, I'm not convinced that this represents a useful use-case as-is.
However, because we can't know how many standbys may match "*",
it's clear that the code has to do something other than just
abort when the situation happens.  Conceivably we could fail at
runtime (not GUC parse time) if the number of required standbys
exceeds the number available, rather than waiting indefinitely.
However, if standbys can come online dynamically, a wait involving
"*" might be satisfiable after awhile even if it isn't immediately.

On the whole, given the fuzziness around "*", I'm not sure that
it's easy to make this much better.

regards, tom lane




Re: subscriptionCheck failures on nightjar

2019-08-26 Thread Tomas Vondra

On Mon, Aug 26, 2019 at 11:01:20AM -0400, Tom Lane wrote:

Tomas Vondra  writes:

I'm willing to take a stab at it, but to do that I need a way to
reproduce it. Tom, you mentioned you've managed to reproduce it in a
qemu instance, but that it took some fiddling with qemu parmeters or
something. Can you share what exactly was necessary?


I don't recall exactly what I did anymore, and it was pretty fiddly
anyway.  Upthread I suggested


Now that we know where the problem is, you could probably make it highly
reproducible by inserting a sleep of a few msec between the rename and the
second fsync.


so why not try that first?



Ah, right. I'll give that a try.
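
For the record, that reproduction hack would amount to something like this,
dropped into the rename-then-fsync sequence in question (the sleep is a
debugging aid only, obviously not for commit):

if (rename(tmppath, path) != 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not rename file \"%s\" to \"%s\": %m",
                    tmppath, path)));

pg_usleep(5000L);               /* 5 ms, to make the failure window easy to hit */

fsync_fname(path, false);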

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Optimize single tuple fetch from nbtree index

2019-08-26 Thread Floris Van Nee

> It seems that it contradicts the very idea of your patch, so probably we
> should look for other ways to optimize this use-case.
> Maybe this restriction can be relaxed for write only tables, that never
> have to reread the page because of visibility, or something like that.
> Also we probably can add to IndexScanDescData info about the expected number
> of tuples, to allow the index to work more optimally and avoid the overhead
> for other loads.

The idea of the patch is exactly to relax this limitation. I forgot to update 
that README file though. The current implementation of the patch should be 
correct like this - that's why I added the look-back code on the page if the 
tuple couldn't be found anymore on the same location on the page. Similarly, 
it'll look on the page to the right if it detected a page split. These two 
measures combined should give a correct implementation of the 'it's possible 
that a scan stops in the middle of a page' relaxation. However, as Peter and 
Tom pointed out earlier, they feel that the performance advantage that this 
approach gives, does not outweigh the extra complexity at this time. I'd be 
open to other suggestions though.

> That's true. It took me quite some time to understand that the existing code
> is correct.
> There is a comment for the structure's field that claims that
> BufferIsValid is the same as BufferIsPinned in the ScanPos context.
> Attached patch contains some comments' updates. Any suggestions on how
> to improve them are welcome.

I'll have a look tomorrow. Thanks a lot for writing this up!

-Floris




Re: assertion at postmaster start

2019-08-26 Thread Tom Lane
I wrote:
> I propose the attached.  I'm inclined to think that the risk/benefit
> of back-patching this is not very good, so I just want to stick it in
> HEAD, unless somebody can explain why dead_end children are likely to
> crash in the field.

Pushed at ee3278239.

I'm still curious as to the explanation for a dead_end child exiting
with code 15, but I have no way to pursue the point.

regards, tom lane




Re: fix "Success" error messages

2019-08-26 Thread Peter Eisentraut
On 2019-06-19 04:51, Michael Paquier wrote:
> On Tue, Jun 18, 2019 at 09:13:19AM -0700, Shawn Debnath wrote:
>>> case SLRU_WRITE_FAILED:
>>>     ereport(ERROR,
>>>             (errcode_for_file_access(),
>>>              errmsg("could not access status of transaction %u", xid),
>>> -            errdetail("Could not write to file \"%s\" at offset %u: %m.",
>>> -                      path, offset)));
>>> +            errno == 0
>>> +            ? errdetail("Short write to file \"%s\" at offset %u.", path, offset)
>>> +            : errdetail("Could not write to file \"%s\" at offset %u: %m.",
>>> +                        path, offset)));
> 
> There is no point to call errcode_for_file_access() if we know that
> errno is 0.  Not a big deal, still..  The last time I looked at that,
> the best way I could think of would be to replace slru_errcause with a
> proper error string generated at error time.  Perhaps the partial
> read/write case does not justify this extra facility.  I don't know.
> At least that would be more flexible.

Here is an updated patch set that rearranges this a bit according to
your suggestions, and also fixes some similar issues in pg_checksums.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From d1d306f80c53a59b67823ae044bf85a2718ea94e Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 26 Aug 2019 21:37:10 +0200
Subject: [PATCH v2 1/2] Better error messages for short reads/writes in SLRU

This avoids getting a

Could not read from file ...: Success.

for a short read or write (since errno is not set in that case).
Instead, report a more specific error messages.
---
 src/backend/access/transam/slru.c | 31 +--
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 0fbcb4e6fe..e38f9199dd 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -920,18 +920,29 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid)
                                path, offset)));
             break;
         case SLRU_READ_FAILED:
-            ereport(ERROR,
-                    (errcode_for_file_access(),
-                     errmsg("could not access status of transaction %u", xid),
-                     errdetail("Could not read from file \"%s\" at offset %u: %m.",
-                               path, offset)));
+            if (errno)
+                ereport(ERROR,
+                        (errcode_for_file_access(),
+                         errmsg("could not access status of transaction %u", xid),
+                         errdetail("Could not read from file \"%s\" at offset %u: %m.",
+                                   path, offset)));
+            else
+                ereport(ERROR,
+                        (errmsg("could not access status of transaction %u", xid),
+                         errdetail("Could not read from file \"%s\" at offset %u: read too few bytes.", path, offset)));
             break;
         case SLRU_WRITE_FAILED:
-            ereport(ERROR,
-                    (errcode_for_file_access(),
-                     errmsg("could not access status of transaction %u", xid),
-                     errdetail("Could not write to file \"%s\" at offset %u: %m.",
-                               path, offset)));
+            if (errno)
+                ereport(ERROR,
+                        (errcode_for_file_access(),
+                         errmsg("could not access status of transaction %u", xid),
+                         errdetail("Could not write to file \"%s\" at offset %u: %m.",
+                                   path, offset)));
+            else
+                ereport(ERROR,
+                        (errmsg("could not access status of transaction %u", xid),
+                         errdetail("Could not write to file \"%s\" at offset %u: wrote too few bytes.",
+

Re: "ago" times on buildfarm status page

2019-08-26 Thread Andrew Dunstan


On 8/26/19 2:55 PM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 8/21/19 4:16 PM, Tom Lane wrote:
>>> Andrew Dunstan  writes:
 Still, if we simply added the skew to the snapshot time that might be
 enough to achieve what you want. That would be a one line change, I think.
>>> +1
>> Done. It's only happening prospectively, so we'll need to wait a few
>> days to see it flow through.
> Hm, doesn't seem to have done the trick.  The current dashboard page shows
> (in the v12 branch)
>
> mule... 01:17 ago  OK   [97205d0] Config
> loach   ... 01:32 ago  OK   [97205d0] Config
> dangomushi  ... 02:11 ago  OK   [97205d0] Config
> bowerbird   ... 02:23 ago  scriptsCheck [97205d0] Details
> snapper ... 02:48 ago  OK   [63fc3b1] Config
> caiman  ... 03:04 ago  OK   [97205d0] Config
> nightjar... 03:17 ago  recoveryCheck [97205d0] Details
> chub... 03:29 ago  OK   [97205d0] Config
> clam... 03:34 ago  OK   [97205d0] Config
> demoiselle  ... 03:45 ago  OK   [97205d0] Config
>
> snapper is clearly out of line here: the commit it claims
> to have fetched 2:48 ago was obsoleted around seven hours ago.
>
> (Snapper is one of the machines that is typically inconsistent
> in this way.  I've been assuming that's because its system clock
> is a few hours off ... but maybe there's something else going on?)
>
>   




I think this is the problem:

 'scmrepo' => '/home/pgbf/pgmirror.git',

Probably this isn't updated often enough. It probably has little to do with the 
clock settings.

This is the kind of old-fashioned way of doing things. These days 
"git_keep_mirror => 1" along with the community repo as the base would avoid 
these problems.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Online checksums patch - once again

2019-08-26 Thread Magnus Hagander
OK, let's try this again :)

This is work mainly based on the first version of the online checksums
patch, built on top of Andres' WIP patchset for global barriers (
https://www.postgresql.org/message-id/20181030051643.elbxjww5jjgnjaxg%40alap3.anarazel.de
)

Andres patch has been enhanced with wait events per
https://www.postgresql.org/message-id/CABUevEwy4LUFqePC5YzanwtzyDDpYvgrj6R5WNznwrO5ouVg1w%40mail.gmail.com
.

I'm including both those as attachments here to hopefully trick the cfbot
into being able to build things :)

Other than that, we believe that the list of objections from the first one
should be covered by now, but it's been quite some time and many emails, so
it's possible we missed some. So if you find them, point them out!

Documentation needs another go-over, in particular based on changes since
then, but the basic principles of how it works should not have changed.

//Magnus & Daniel
From 14b20affc98ecb893531a683d83a3bef03fcff62 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 29 Oct 2018 10:14:15 -0700
Subject: [PATCH 1/2] WIP: global barriers

This is a squash of three patches from Andres:
* Use procsignal_sigusr1_handler for all shmem connected bgworkers.
* Use  procsignal_sigusr1_handler in all auxiliary processes.
* WIP: global barriers.

And one from Magnus:
* Wait event for global barriers
---
 src/backend/postmaster/autovacuum.c   |   3 +-
 src/backend/postmaster/bgworker.c |  31 +---
 src/backend/postmaster/bgwriter.c |  24 ++
 src/backend/postmaster/checkpointer.c |  19 ++---
 src/backend/postmaster/pgstat.c   |   3 +
 src/backend/postmaster/startup.c  |  18 ++---
 src/backend/postmaster/walwriter.c|  17 +---
 src/backend/replication/walreceiver.c |  20 +
 src/backend/storage/buffer/bufmgr.c   |   4 +
 src/backend/storage/ipc/procsignal.c  | 141 ++
 src/backend/storage/lmgr/proc.c   |  20 +
 src/backend/tcop/postgres.c   |   7 ++
 src/include/pgstat.h  |   1 +
 src/include/storage/proc.h|   9 +++
 src/include/storage/procsignal.h  |  23 +-
 15 files changed, 255 insertions(+), 85 deletions(-)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 073f313337..24e28dd3a3 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -649,8 +649,9 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		ResetLatch(MyLatch);
 
-		/* Process sinval catchup interrupts that happened while sleeping */
+		/* Process pending interrupts. */
 		ProcessCatchupInterrupt();
+		ProcessGlobalBarrierIntterupt();
 
 		/* the normal shutdown case */
 		if (got_SIGTERM)
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index b66b517aca..f300f9285b 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -734,23 +734,32 @@ StartBackgroundWorker(void)
 	/*
 	 * Set up signal handlers.
 	 */
+
+
+	/*
+	 * SIGINT is used to signal canceling the current action for processes
+	 * able to run queries.
+	 */
 	if (worker->bgw_flags & BGWORKER_BACKEND_DATABASE_CONNECTION)
-	{
-		/*
-		 * SIGINT is used to signal canceling the current action
-		 */
 		pqsignal(SIGINT, StatementCancelHandler);
-		pqsignal(SIGUSR1, procsignal_sigusr1_handler);
-		pqsignal(SIGFPE, FloatExceptionHandler);
-
-		/* XXX Any other handlers needed here? */
-	}
 	else
-	{
 		pqsignal(SIGINT, SIG_IGN);
+
+	/*
+	 * Everything with a PGPROC should be able to receive procsignal.h style
+	 * signals.
+	 */
+	if (worker->bgw_flags & (BGWORKER_BACKEND_DATABASE_CONNECTION |
+			 BGWORKER_SHMEM_ACCESS))
+		pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+	else
 		pqsignal(SIGUSR1, bgworker_sigusr1_handler);
+
+	if (worker->bgw_flags & BGWORKER_BACKEND_DATABASE_CONNECTION)
+		pqsignal(SIGFPE, FloatExceptionHandler);
+	else
 		pqsignal(SIGFPE, SIG_IGN);
-	}
+
 	pqsignal(SIGTERM, bgworker_die);
 	pqsignal(SIGHUP, SIG_IGN);
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 8ec16a3fb8..80a8e3cf4b 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -51,6 +51,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/proc.h"
+#include "storage/procsignal.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
@@ -97,7 +98,6 @@ static volatile sig_atomic_t shutdown_requested = false;
 static void bg_quickdie(SIGNAL_ARGS);
 static void BgSigHupHandler(SIGNAL_ARGS);
 static void ReqShutdownHandler(SIGNAL_ARGS);
-static void bgwriter_sigusr1_handler(SIGNAL_ARGS);
 
 
 /*
@@ -115,10 +115,7 @@ BackgroundWriterMain(void)
 	WritebackContext wb_context;
 
 	/*
-	 * Properly accept or ignore signals the postmaster might send us.
-	 *
-	 * bgwriter doesn't participate in ProcSignal signalling, but a SIGUSR1
-	 * handler is still needed for latch wakeups.
+	 * Properly accept or ignore s

Re: "ago" times on buildfarm status page

2019-08-26 Thread Tom Lane
Andrew Dunstan  writes:
> On 8/21/19 4:16 PM, Tom Lane wrote:
>> Andrew Dunstan  writes:
>>> Still, if we simply added the skew to the snapshot time that might be
>>> enough to achieve what you want. That would be a one line change, I think.

>> +1

> Done. It's only happening prospectively, so we'll need to wait a few
> days to see it flow through.

Hm, doesn't seem to have done the trick.  The current dashboard page shows
(in the v12 branch)

mule... 01:17 ago  OK   [97205d0] Config
loach   ... 01:32 ago  OK   [97205d0] Config
dangomushi  ... 02:11 ago  OK   [97205d0] Config
bowerbird   ... 02:23 ago  scriptsCheck [97205d0] Details
snapper ... 02:48 ago  OK   [63fc3b1] Config
caiman  ... 03:04 ago  OK   [97205d0] Config
nightjar... 03:17 ago  recoveryCheck [97205d0] Details
chub... 03:29 ago  OK   [97205d0] Config
clam... 03:34 ago  OK   [97205d0] Config
demoiselle  ... 03:45 ago  OK   [97205d0] Config

snapper is clearly out of line here: the commit it claims
to have fetched 2:48 ago was obsoleted around seven hours ago.

(Snapper is one of the machines that is typically inconsistent
in this way.  I've been assuming that's because its system clock
is a few hours off ... but maybe there's something else going on?)

regards, tom lane




Re: pg_upgrade: Error out on too many command-line arguments

2019-08-26 Thread Ibrar Ahmed
On Mon, Aug 26, 2019 at 9:46 AM Michael Paquier  wrote:

> On Sun, Aug 25, 2019 at 05:10:47PM +0200, Julien Rouhaud wrote:
> > I did some searching, and oid2name.c is also missing this.
>
> And pgbench, no?
>

Yes, the patch is slightly different.


> --
> Michael
>


-- 
Ibrar Ahmed


0001-pgbench-Error-out-on-too-many-command-line-argume.patch
Description: Binary data


Re: Does TupleQueueReaderNext() really need to copy its result?

2019-08-26 Thread Andres Freund
Hi,

On 2019-08-26 14:09:45 -0400, Robert Haas wrote:
> On Fri, Aug 23, 2019 at 10:22 PM Thomas Munro  wrote:
> > Couldn't resist trying this, and it seems to work.  Based on the
> > comment "the buffer size is a multiple of MAXIMUM_ALIGNOF, and each
> > read and write is as well", it should always work (though I wish
> > shm_mq_receive_bytes()'s documentation would discuss message alignment
> > explicitly if that's true).  On the other hand, I doubt it makes a
> > difference, so this is more of a question: is this the way it was
> > supposed to work?
> 
> There's a comment in htup.h which says:
> 
>  * * Separately allocated tuple: t_data points to a palloc'd chunk that
>  *   is not adjacent to the HeapTupleData.  (This case is deprecated since
>  *   it's difficult to tell apart from case #1.  It should be used only in
>  *   limited contexts where the code knows that case #1 will never apply.)
> 
> I got scared and ran away.

Perhaps this could be sidestepped by funneling through MinimalTuples
instead of HeapTuples. Afaict that should always be sufficient, because
all system column accesses ought to happen below (including being
projected into a separate column, if needed above). With the added
benefit of needing less space, of course.

Greetings,

Andres Freund




Re: range_agg

2019-08-26 Thread Jeff Davis
On Wed, 2019-08-21 at 21:54 -0700, Paul A Jungwirth wrote:
> Sometimes I think about having a maybe type instead of null/not
> null. SQL nulls are already very "monadic" I think but nullability
> doesn't travel.

Yeah, that would be a great direction, but there is some additional
complexity that we'd need to deal with that a "normal" compiler does
not:

  * handling both existing global types (like int) as well as on-the-
fly types like Maybe>
  * types need to do more things, like have serialized representations,
interface with indexing strategies, and present the optimizer with
choices that may influence which indexes can be used or not
  * at some point needs to work with normal SQL types and NULL
  * there are a lot of times we care not just whether a type is
sortable, but we actually care about the way it's sorted (e.g.
localization). typeclasses+newtype would probably be unacceptable for
trying to match SQL behavior here.

I'm all in favor of pursuing this, but it's not going to bear fruit
very soon.

> Btw I have working multirange_{send,recv,in,out} now, and I
> automatically create a multirange type and its array type when
> someone
> creates a new range type. I have a decent start on passing tests and
> no compiler warnings. I also have a start on anymultirange and
> anyrangearray. (I think I need the latter to support a range-like
> constructor function, so you can say `int4multirange(int4range(1,4),
> int4range(8,10))`.) I want to get the any* types done and improve the
> test coverage, and then I'll probably be ready to share a patch.

Awesome!

> Here are a couple other questions:
> 
> - Does anyone have advice for the typanalyze function? I feel pretty
> out of my depth there (although I haven't looked into typanalyze
> stuff
> very deeply yet). I can probably get some inspiration from
> range_typanalyze and array_typanalyze, but those are both quite
> detailed (their statistics functions that is).

I think Alexander Korotkov did a lot of the heavy lifting here, perhaps
he has a comment? I'd keep it simple for now if you can, and we can try
to improve it later.

> - What should a multirange do if you give it an empty range? I'm
> thinking it should just ignore it, but then `'{}'::int4multirange =
> '{empty}'::int4multirange`. Does that seem okay? (It does to me
> actually, if we think of `empty` as the additive identity. Similarly
> mr + empty = mr.

I agree. Multiranges are more than just an array of ranges, so they
coalesce into some canonical form.

> - What should a multirange do if you give it a null, like
> `int4multirange(int4range(1,4), null)`. I'm thinking it should be
> null, just like mr + null = null. Right?

Yes. NULL is for the overall multirange datum (that is, a multirange
column can be NULL), but I don't think individual parts of a datatype
make much sense as NULL. So, I agree that mr + null = null. (Note that
arrays and records can have NULL parts, but I don't see a reason we
should follow those examples for multiranges.)

Regards,
Jeff Davis






Re: Procedure support improvements

2019-08-26 Thread Dave Cramer
On Mon, 26 Aug 2019 at 14:14, Tom Lane  wrote:

> Laurenz Albe  writes:
> > Dave Cramer wrote:
> > test=> BEGIN;
> > BEGIN
> > test=> CALL testproc();
> > ERROR:  invalid transaction termination
> > CONTEXT:  PL/pgSQL function testproc() line 1 at COMMIT
>
> > What is the rationale for this?
>
> A procedure shouldn't be able to force commit of the surrounding
> transaction.
>
> As Dave noted, what would be nicer is for procedures to be able
> to start and commit autonomous transactions, without affecting
> the state of the outer transaction.  We haven't got that though,
> and it looks like a lot of work to get there.
>

I'm less than motivated to hack the driver to make something work here
until we finish the server feature.

Who knows what that might bring ?


Dave Cramer

da...@postgresintl.com
www.postgresintl.com


>
>


Re: Does TupleQueueReaderNext() really need to copy its result?

2019-08-26 Thread Robert Haas
On Fri, Aug 23, 2019 at 10:22 PM Thomas Munro  wrote:
> Couldn't resist trying this, and it seems to work.  Based on the
> comment "the buffer size is a multiple of MAXIMUM_ALIGNOF, and each
> read and write is as well", it should always work (though I wish
> shm_mq_receive_bytes()'s documentation would discuss message alignment
> explicitly if that's true).  On the other hand, I doubt it makes a
> difference, so this is more of a question: is this the way it was
> supposed to work?

There's a comment in htup.h which says:

 * * Separately allocated tuple: t_data points to a palloc'd chunk that
 *   is not adjacent to the HeapTupleData.  (This case is deprecated since
 *   it's difficult to tell apart from case #1.  It should be used only in
 *   limited contexts where the code knows that case #1 will never apply.)

I got scared and ran away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Procedure support improvements

2019-08-26 Thread Tom Lane
Laurenz Albe  writes:
> Dave Cramer wrote:
> test=> BEGIN;
> BEGIN
> test=> CALL testproc();
> ERROR:  invalid transaction termination
> CONTEXT:  PL/pgSQL function testproc() line 1 at COMMIT

> What is the rationale for this?

A procedure shouldn't be able to force commit of the surrounding
transaction.

As Dave noted, what would be nicer is for procedures to be able
to start and commit autonomous transactions, without affecting
the state of the outer transaction.  We haven't got that though,
and it looks like a lot of work to get there.

regards, tom lane




Re: Procedure support improvements

2019-08-26 Thread Laurenz Albe
[CC to -hackers]
Dave Cramer wrote:
> On Mon, 26 Aug 2019 at 13:43, Laurenz Albe  wrote:
> > Dave Cramer wrote:
> > > As I said, I'd entertain a connection parameter that switched the
> > > CALL to call procedures but ideally you'd complain to the server
> > > folks to make Procedures useful.
> >
> > Apart from the obvious problem that procedures make life hard for
> > the JDBC driver, because it does not know if it shall render a call
> > as SELECT or CALL:
> > What is missing in PostgreSQL procedures to make them useful?
>
> being able to use transactions inside a procedure inside a
> transaction.

test=> CREATE OR REPLACE PROCEDURE testproc() LANGUAGE plpgsql AS
   $$BEGIN PERFORM 42; COMMIT; PERFORM 'x'; END;$$;
CREATE PROCEDURE
test=> CALL testproc();
CALL
test=> BEGIN;
BEGIN
test=> CALL testproc();
ERROR:  invalid transaction termination
CONTEXT:  PL/pgSQL function testproc() line 1 at COMMIT

Oops.
I find that indeed surprising.

What is the rationale for this?

Yours,
Laurenz Albe





Re: mingw32 floating point diff

2019-08-26 Thread Tom Lane
I wrote:
>> I'm very hesitant to apply a volatile-qualification approach to
>> eliminate those issues, for fear of pessimizing performance-critical
>> code on more modern platforms.  I wonder whether there is a reasonable
>> way to tell at compile time if we have a platform with 80-bit math.

> Hmmm ... I find that dromedary's compiler predefines __FLT_EVAL_METHOD__
> as 2 not 0 when -mfpmath=387 is given.  This seems to be something
> that was standardized in C99 (without the double underscores), so
> maybe we could do something like
> #if __FLT_EVAL_METHOD__ > 0 || FLT_EVAL_METHOD > 0

After further poking around, it seems that testing FLT_EVAL_METHOD
should be sufficient --- <float.h> appears to define that correctly
even in very old C99 installations.

However, I'm losing interest in the problem after finding that I can't
reproduce it anywhere except on dromedary (with "-mfpmath=387" added).

For instance, I have a 32-bit FreeBSD system with what claims to be
the same compiler (gcc 4.2.1), but it passes regression just fine,
with or without -mfpmath=387.  Earlier and later gcc versions also
don't show a problem.  I suspect that Apple bollixed something with
local mods to their version of 4.2.1, or possibly they are allowing
inlining of isinf() in a way that nobody else does.

Also, using that compiler with "-mfpmath=387", I see that every
supported PG version back to 9.4 fails regression due to not
detecting float8 multiply overflow.  So this isn't a problem that
we introduced with v12's changes, as I'd first suspected; and it's
not a problem that anyone is hitting in the field, or we'd have
heard complaints.

So, barring somebody showing that we have an issue on some platform
that people actually care about currently, I'm inclined not to do
anything more here.

regards, tom lane




Re: mingw32 floating point diff

2019-08-26 Thread Andrew Dunstan


On 8/25/19 4:23 PM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 8/20/19 8:59 AM, Peter Eisentraut wrote:
>>> Running the regression tests on mingw32, I get the following diff in
>>> circle.out:
>>> -  | <(3,5),0>  | <(1,2),3>  | 0.60555127546399
>>> +  | <(3,5),0>  | <(1,2),3>  | 0.605551275463989
>> I complained about this a year ago:
>> 
>> +1 for fixing it by any reasonable means.
> Now that that fix is in, could we get a buildfarm member running on
> such a platform?  It seems to behave differently from anything else.
>


I'm pretty much tapped out for Windows resources, I have one physical
and one virtual machine which do nothing but run my 6 Windows based animals.


I don't know if the community has spare resources available from those
that Amazon donate to us. There is already one animal I manage running
there, so maybe another would be feasible.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Proposal: Better generation of values in GENERATED columns.

2019-08-26 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-08-26 13:28, Daniel Migowski wrote:
>> I would like to implement a fallback solution 
>> that detects such errors and automatically updates the nextvalue of the 
>> sequence when the nextvalue is already used on insert.

> ISTM that such a system would likely have many of the same problems as
> the often-proposed ignore-errors mode for COPY, namely that you can't
> catch errors and do something else other than rethrowing the error.

In principle you could probably use the same infrastructure used by
ON CONFLICT to detect the unique-key violation.  But ON CONFLICT is
mighty complicated, and not very cheap either.  I don't for one second
believe Daniel's assertion that this could be done without a
significant COPY performance hit.

I'm also dubious that the right response to a duplicate key would be
as simple as "try the next nextval() result".  I for one wouldn't be
satisfied with waiting for COPY to grind through a few thousand/million
sequential nextval values before finding one that doesn't conflict with
the existing table entries.

The actually-sound advice for loading data that might have conflicting
serial values is to do the equivalent of

setval('sequence', max(existing keys) + 1)

before you start loading.  I wonder whether there's a way to make that
simpler.
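
For illustration, with an actual serial or identity column that advice
might look like this (table and column names are made up):

    SELECT setval(pg_get_serial_sequence('mytable', 'id'),
                  COALESCE(max(id), 0) + 1,
                  false)
    FROM mytable;

pg_get_serial_sequence() finds the sequence backing the column, and the
third argument (false) makes the next nextval() return exactly the value
given, i.e. max(id) + 1.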

regards, tom lane




Re: Creating partitions automatically at least on HASH?

2019-08-26 Thread Fabien COELHO


Hello Rafia,


   CREATE TABLE Stuff (...)
 PARTITION BY [HASH | RANGE | LIST] (…)
   DO NONE -- this is the default
   DO [IMMEDIATE|DEFERRED] USING (…)

Where the USING part would be generic keword value pairs, eg:

For HASH: (MODULUS 8) and/or (NPARTS 10)

For RANGE: (START '1970-01-01', STOP '2020-01-01', INCREMENT '1 year')
 and/or (START 1970, STOP 2020, NPARTS 50)

And possibly for LIST: (IN (…), IN (…), …), or possibly some other
keyword.

The "DEFERRED" could be used as an open syntax for dynamic partitioning,
if later someone would feel like doing it.


ISTM that "USING" is better than "WITH" because WITH is already used

specifically for HASH and other optional stuff in CREATE TABLE.

The text constant would be interpreted depending on the partitioning
expression/column type.

Any opinion about the overall approach?


I happen to start a similar discussion [1] being unaware of this one 
and there Ashutosh Sharma talked about interval partitioning in Oracle. 
Looking

closely it looks like we can have this automatic partitioning more
convenient by having something similar. Basically, it is creating
partitions on demand or lazy partitioning.


Yep, the "what" of dynamic partitioning is more or less straightforward, 
along the line you are describing.


For me there are really two questions:

 - having a extendable syntax, hence the mail I sent, which would cover
   both automatic static & dynamic partitioning and their parameters,
   given that we already have manual static, automatic static should
   be pretty easy.

 - implementing the stuff, with limited performance impact if possible
   for the dynamic case, which is non trivial.

To explain a bit more, let's take range partition for example, first 
parent table is created and it's interval and start and end values are 
specified and it creates only the parent table just like it works today.


Now, if there comes a insertion that does not belong to the existing (or 
any, in the case of first insertion) partition(s), then the 
corresponding partition is created,


Yep. Now, you also have to deal with race condition issues, i.e. two
parallel sessions inserting tuples that must create the same partition, and
probably you would like to avoid a deadlock.


I think it is extensible to other partitioning schemes as well. Also it 
is likely to have a positive impact on the queries, because there will 
be required partitions only and would not require to educate 
planner/executor about many empty partitions.


Yep, but it creates other problems to solve…

--
Fabien.

Re: pgbench - allow to create partitioned tables

2019-08-26 Thread Fabien COELHO


Just one suggestion: the --partition-method option should be made dependent
on --partitions, because it has no use unless used with --partitions.
What do you think?


Why not. V4 attached.
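
For reference, with the options as they stand in the attached patch, an
initialization run would be invoked along these lines (illustrative only):

    pgbench -i -s 100 --partitions=4 --partition-method=hash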

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 816f9cc4c7..38f4ac1557 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -306,6 +306,31 @@ pgbench  options  d
   
  
 
+ 
+  --partitions=NUM
+  
+   
+Create a partitioned pgbench_accounts table with
+NUM partitions of nearly equal size for
+the scaled number of accounts.
+Default is 0, meaning no partitioning.
+   
+  
+ 
+
+ 
+  --partition-method=NAME
+  
+   
+Create a partitioned pgbench_accounts table with
+NAME method.
+Expected values are range or hash.
+This option requires that --partitions is set to non-zero.
+If unspecified, default is range.
+   
+  
+ 
+
  
   --tablespace=tablespace
   
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 570cf3306a..6d8476af5c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -186,6 +186,11 @@ int64		latency_limit = 0;
 char	   *tablespace = NULL;
 char	   *index_tablespace = NULL;
 
+/* partitioning for pgbench_accounts table, 0 for no partitioning */
+int 		partitions = 0;
+enum { PART_NONE, PART_RANGE, PART_HASH }
+			partition_method = PART_NONE;
+
 /* random seed used to initialize base_random_sequence */
 int64		random_seed = -1;
 
@@ -617,6 +622,9 @@ usage(void)
 		   "  --foreign-keys   create foreign key constraints between tables\n"
 		   "  --index-tablespace=TABLESPACE\n"
 		   "   create indexes in the specified tablespace\n"
+		   "  --partitions=NUM partition account table in NUM parts (defaults: 0)\n"
+		   "  --partition-method=(range|hash)\n"
+		   "   partition account table with this method (default: range)\n"
 		   "  --tablespace=TABLESPACE  create tables in the specified tablespace\n"
 		   "  --unlogged-tablescreate tables as unlogged tables\n"
 		   "\nOptions to select what to run:\n"
@@ -3601,6 +3609,17 @@ initDropTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
+/*
+ * add fillfactor percent option if not 100.
+ */
+static void
+append_fillfactor(char *opts, int len)
+{
+	if (fillfactor < 100)
+		snprintf(opts + strlen(opts), len - strlen(opts),
+ " with (fillfactor=%d)", fillfactor);
+}
+
 /*
  * Create pgbench's standard tables
  */
@@ -3625,6 +3644,7 @@ initCreateTables(PGconn *con)
 		const char *bigcols;	/* column decls if accountIDs are 64 bits */
 		int			declare_fillfactor;
 	};
+
 	static const struct ddlinfo DDLs[] = {
 		{
 			"pgbench_history",
@@ -3651,11 +3671,10 @@ initCreateTables(PGconn *con)
 			1
 		}
 	};
-	int			i;
 
 	fprintf(stderr, "creating tables...\n");
 
-	for (i = 0; i < lengthof(DDLs); i++)
+	for (int i = 0; i < lengthof(DDLs); i++)
 	{
 		char		opts[256];
 		char		buffer[256];
@@ -3664,9 +3683,17 @@ initCreateTables(PGconn *con)
 
 		/* Construct new create table statement. */
 		opts[0] = '\0';
-		if (ddl->declare_fillfactor)
+
+		/* Partition pgbench_accounts table */
+		if (partitions >= 1 && strcmp(ddl->table, "pgbench_accounts") == 0)
+		{
 			snprintf(opts + strlen(opts), sizeof(opts) - strlen(opts),
-	 " with (fillfactor=%d)", fillfactor);
+	 " partition by %s (aid)",
+	 partition_method == PART_RANGE ? "range" : "hash");
+		}
+		else if (ddl->declare_fillfactor)
+			append_fillfactor(opts, sizeof(opts));
+
 		if (tablespace != NULL)
 		{
 			char	   *escape_tablespace;
@@ -3686,6 +3713,54 @@ initCreateTables(PGconn *con)
 
 		executeStatement(con, buffer);
 	}
+
+	if (partitions >= 1)
+	{
+		int64		part_size = (naccounts * (int64) scale + partitions - 1) / partitions;
+		char		ff[64];
+		ff[0] = '\0';
+		append_fillfactor(ff, sizeof(ff));
+
+		fprintf(stderr, "creating %d partitions...\n", partitions);
+
+		for (int p = 1; p <= partitions; p++)
+		{
+			char		query[256];
+
+			if (partition_method == PART_RANGE)
+			{
+char		minvalue[32], maxvalue[32];
+
+if (p == 1)
+	sprintf(minvalue, "minvalue");
+else
+	sprintf(minvalue, INT64_FORMAT, (p-1) * part_size + 1);
+
+if (p < partitions)
+	sprintf(maxvalue, INT64_FORMAT, p * part_size + 1);
+else
+	sprintf(maxvalue, "maxvalue");
+
+snprintf(query, sizeof(query),
+		 "create%s table pgbench_accounts_%d\n"
+		 "  partition of pgbench_accounts\n"
+		 "  for values from (%s) to (%s)%s\n",
+		 unlogged_tables ? " unlogged" : "", p,
+		 minvalue, maxvalue, ff);
+			}
+			else if (partition_method == PART_HASH)
+snprintf(query, sizeof(query),
+		 "create%s table pgbench_accounts_%d\n"
+		 "  partition of pgbench_accounts\n"
+		 "  for values with (modulus %d, remainder %d)%s\n",
+		 unlo

Re: Proposal: Better generation of values in GENERATED columns.

2019-08-26 Thread Peter Eisentraut
On 2019-08-26 13:28, Daniel Migowski wrote:
> I would like to implement a fallback solution 
> that detects such errors and automatically updates the nextvalue of the 
> sequence when the nextvalue is already used on insert.

ISTM that such a system would likely have many of the same problems as
the often-proposed ignore-errors mode for COPY, namely that you can't
catch errors and do something else other than rethrowing the error.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)

2019-08-26 Thread Tom Lane
Justin Pryzby  writes:
> On Mon, Aug 26, 2019 at 12:45:24PM -0400, Tom Lane wrote:
>> However ... there is some pretty interesting info at
>> https://bugzilla.redhat.com/show_bug.cgi?id=1338673
>> suggesting that compiling with a late-model gcc against older RHEL6
>> headers could result in bad code.  I wonder whether the reporters'
>> servers were built using such a configuration.  (Although the linkage,
>> if any, to this report still wouldn't be very clear.)

> I can tell it was compiled using
> version | PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 
> 20120313 (Red Hat 4.4.7-23), 64-bit

Ah, that appears to be the default compiler for RHEL6, so that theory
is out the window.  It's still interesting that we're only seeing this
reported from RHEL6 ... maybe there's something specific to the code
that this gcc version generates?

regards, tom lane




Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)

2019-08-26 Thread Justin Pryzby
On Mon, Aug 26, 2019 at 12:45:24PM -0400, Tom Lane wrote:
> However ... there is some pretty interesting info at
> https://bugzilla.redhat.com/show_bug.cgi?id=1338673
> suggesting that compiling with a late-model gcc against older RHEL6
> headers could result in bad code.  I wonder whether the reporters'
> servers were built using such a configuration.  (Although the linkage,
> if any, to this report still wouldn't be very clear.)

That's a question for Devrim ?

Can you tell which glibc headers were installed when compiling this package?
postgresql11-server-11.5-1PGDG.rhel6.x86_64

I can tell it was compiled using
version | PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 
20120313 (Red Hat 4.4.7-23), 64-bit

Maybe it doesn't matter, since, for better or worse, it is not a "late model" cc.

https://www.gnu.org/software/gcc/gcc-4.4/
GCC 4.4.7
March 13, 2012 (changes) 
This release series is no longer maintained.

Justin




Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)

2019-08-26 Thread Tom Lane
Thomas Munro  writes:
> Of course it's entirely possible that we have a bug here and I'm very
> keen to find it, but I can't help noticing the common factor here is
> that they're all running ancient RHEL 6.x releases, except Merlin who
> didn't say.  Merlin?

Hm, Justin said he had
glibc-2.12-1.192.el6.x86_64
which is a little bit behind the times but not *that* ancient.

For reference, attached is the rpm --changelog output on an up-to-date
RHEL6 box for glibc running back to 2.12-1.192.  Unfortunately a lot
of the referenced bugzilla entries aren't public, but the one-liner
descriptions don't seem to suggest that they found any heap overruns
that would be interesting for us.

However ... there is some pretty interesting info at
https://bugzilla.redhat.com/show_bug.cgi?id=1338673
suggesting that compiling with a late-model gcc against older RHEL6
headers could result in bad code.  I wonder whether the reporters'
servers were built using such a configuration.  (Although the linkage,
if any, to this report still wouldn't be very clear.)

regards, tom lane


* Mon Apr 01 2019 Florian Weimer  - 2.12-1.212.3
- Remove abort() warning in manual (#1577437)

* Mon Apr 01 2019 Florian Weimer  - 2.12-1.212.2
- ja_JP: Add new Japanese Era name (#1555930)

* Mon Apr 01 2019 Florian Weimer  - 2.12-1.212.1
- regex: Fix false match in trailing characters (#1668169)

* Fri Nov 17 2017 Patsy Franklin  - 2.12-1.212
- CVE-2017-15670: glob: Fix one-byte overflow with GLOB_TILDE (#1504810)
- CVE-2017-15804: glob: Fix buffer overflow in GLOB_TILDE unescaping (#1504810)

* Mon Jun 19 2017 Florian Weimer  - 2.12-1.211
- Avoid large allocas in the dynamic linker (#1452717)

* Wed Mar 29 2017 Carlos O'Donell  - 2.12-1.210
- Fix thread cancellation issues for setmntent() and others (#1437147).

* Wed Jan 25 2017 Florian Weimer  - 2.12-1.209
- Fix AF_INET6 getaddrinfo with nscd (#1416496)

* Tue Oct 18 2016 Carlos O'Donell  - 2.12-1.208
- Update tests for struct sockaddr_storage changes (#1338673)

* Mon Oct 17 2016 Martin Sebor  - 2.12-1.207
- Use FL_CLOEXEC in internal calls to fopen (#1012343).

* Mon Oct 17 2016 Carlos O'Donell  - 2.12-1.206
- Fix CVE-2015-8779 glibc: Unbounded stack allocation in catopen function
  (#1358015).

* Mon Oct 17 2016 DJ Delorie  - 2.12-1.205
- Make padding in struct sockaddr_storage explicit (#1338673)

* Thu Oct 13 2016 Carlos O'Donell  - 2.12-1.204
- Fix detection of Intel FMA hardware (#1384281).

* Tue Oct 11 2016 Carlos O'Donell  - 2.12-1.203
- Add support for el_GR@euro, ur_IN, and wal_ET locales (#1101858).

* Tue Oct 11 2016 Patsy Franklin  - 2.12-1.202
- Change malloc/tst-malloc-thread-exit.c to use fewer threads and 
  avoid timeout (#1318380).

* Tue Oct 11 2016 Patsy Franklin  - 2.12-1.201
- df can fail on some systems (#1307029).

* Wed Sep 21 2016 DJ Delorie  - 2.12-1.200
- Log uname, cpuinfo, meminfo during build (#1307029).

* Mon Sep 12 2016 DJ Delorie  - 2.12-1.199
- Draw graphs for heap and stack only if MAXSIZE_HEAP and MAXSIZE_STACK
  are non-zero (#1331304).

* Mon Sep 12 2016 DJ Delorie  - 2.12-1.198
- Avoid unneeded calls to __check_pf in getadddrinfo (#1270950)

* Mon Sep 12 2016 Martin Sebor  - 2.12-1.197
- Fix CVE-2015-8778 glibc: Integer overflow in hcreate and hcreate_r
  (#1358013).

* Mon Sep 12 2016 Martin Sebor  - 2.12-1.196
- Fix CVE-2015-8776 glibc: Segmentation fault caused by passing
  out-of-range data to strftime() (#1358011).

* Mon Sep 12 2016 Florian Weimer  - 2.12-1.195
- tzdata-update: Ignore umask setting (#1373646)

* Thu Sep 08 2016 Florian Weimer  - 2.12-1.194
- CVE-2014-9761: Fix unbounded stack allocation in nan* (#1358014)

* Thu Feb 04 2016 Florian Weimer  - 2.12-1.193
- Avoid using uninitialized data in getaddrinfo (#1223095)

* Thu Jan 28 2016 Carlos O'Donell  - 2.12-1.192
- Update fix for CVE-2015-7547 (#1296029).




Re: Optimize single tuple fetch from nbtree index

2019-08-26 Thread Anastasia Lubennikova

25.08.2019 0:59, Floris Van Nee wrote:

Though, I got interested in the comment inconsistency you have found.
I added debug message into this code branch of the patch and was able to
see it in regression.diffs after 'make check':
Speaking of your patch, it seems that the buffer was unpinned and pinned
again between two reads,
and the condition of holding it continuously has not been met.

May I ask what makes you conclude that the condition of holding the pin 
continuously has not been met?
Your reply encouraged me to dig a little bit more into this today. First, I 
wanted to check if indeed the pin was continuously held by the backend or not. 
I added some debug info to ReleaseBuffer for this: it turned out that the pin 
on the buffer was definitely never released by the backend between the calls to 
_bt_first and _bt_next. So the buffer got compacted while the backend held a 
pin on it.
After some more searching I found the following code: _bt_vacuum_one_page in 
nbtinsert.c
This function compacts one single page without taking a super-exclusive lock. 
It is used during inserts to make room on a page. I verified that if I comment 
out the calls to this function, the compacting never happens while I have a pin 
on the buffer.
So I guess that answers my own question: cleaning up garbage during inserts is 
one of the cases where compacting may happen even while other backends hold a 
pin to the buffer. Perhaps this should also be more clearly phrased in the 
comments in eg. _bt_killitems? Because currently those comments make it look 
like this case never occurs.


You're right, the pin was not released between page reads.
I also added debug to UnpinBuffer, but now I see that I had interpreted 
it wrongly.


As far as I understand, the issue with your patch is that it breaks the 
*scan stops "between" pages* assumption

and thus it unsafely interacts with _bt_vacuum_one_page() cleanup.

See README:
>Page read locks are held only for as long as a scan is examining a page.
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage.  The heap tuple IDs are then processed while
not holding any page lock within the index.  We do continue to hold a pin
on the leaf page in some circumstances, to protect against concurrent
deletions (see below).  In this state the scan is effectively stopped
"between" pages, either before or after the page it has pinned. This is
safe in the presence of concurrent insertions and even page splits, because
items are never moved across pre-existing page boundaries --- so the scan
cannot miss any items it should have seen, nor accidentally return the same
item twice.

and

>Once an index tuple has been marked LP_DEAD it can actually be removed
from the index immediately; since index scans only stop "between" pages,
no scan can lose its place from such a deletion.

It seems that it contradicts the very idea of your patch, so probably we 
should look for other ways to optimize this use-case.
Maybe this restriction can be relaxed for write only tables, that never 
have to reread the page because of visibility, or something like that.
Also we probably can add to IndexScanDescData info about the expected
number of tuples, to allow the index to work more optimally and avoid
the overhead for other loads.


While reading the code to answer your question, I noticed that
BTScanPosIsPinned macro name is misleading.
It calls BufferIsValid(), not BufferIsPinned() as one could expect.
And BufferIsValid in bufmgr.h comment explicitly states that it
shouldn't be confused with BufferIsPinned.
The same goes for BTScanPosUnpinIfPinned().

I agree the name is misleading. It clearly does something else than how it's 
named. However, I don't believe this introduces problems in these particular 
pieces of code, as long as the macro's are always used. BTScanPosIsPinned 
actually checks whether it's valid and not necessarily whether it's pinned, as 
you mentioned. However, any time the buffer gets unpinned using the macro 
BTScanPosUnpin, the buffer gets set to Invalid by the macro as well. Therefore, 
any consecutive call to BTScanPosIsPinned should indeed return false. It'd 
definitely be nice if this gets clarified in comments though.


That's true. It took me quite some time to understand that existing code 
is correct.
There is a comment for the structure's field that claims that 
BufferIsValid is the same that BufferIsPinned in ScanPos context.
Attached patch contains some comments' updates. Any suggestions on how 
to improve them are welcome.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1..316a42d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1751,6 +1751,8 @@ _bt_killitems(IndexScanDesc s

Re: Can't we give better table bloat stats easily?

2019-08-26 Thread Jeff Janes
On Fri, Aug 16, 2019 at 8:39 PM Greg Stark  wrote:

> Everywhere I've worked I've seen people struggle with table bloat. It's
> hard to even measure how much of it you have or where, let alone actually
> fix it.
>
> If you search online you'll find dozens of different queries estimating
> how much empty space are in your tables and indexes based on pg_stats
> statistics and suppositions about header lengths and padding and plugging
> them into formulas of varying credibility.
>

There is not much we can do to suppress bad advice that people post on
their own blogs.  If wiki.postgresql.org is hosting bad advice, by all
means we should fix that.


> But isn't this all just silliiness these days? We could actually sum up
> the space recorded in the fsm and get a much more trustworthy number in
> milliseconds.
>

If you have bloat problems, then you probably have vacuuming problems.  If
you have vacuuming problems, how much can you trust fsm anyway?
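
For what it's worth, the kind of FSM summation Greg describes can already
be approximated today with the contrib module pg_freespacemap -- with the
same caveat that the numbers are only as trustworthy as the FSM itself.
A rough sketch for a single relation:

    CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

    SELECT count(*) AS pages, sum(avail) AS approx_free_bytes
    FROM pg_freespace('some_table');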

Cheers,

Jeff


Re: subscriptionCheck failures on nightjar

2019-08-26 Thread Tom Lane
Tomas Vondra  writes:
> I'm willing to take a stab at it, but to do that I need a way to
> reproduce it. Tom, you mentioned you've managed to reproduce it in a
> qemu instance, but that it took some fiddling with qemu parmeters or
> something. Can you share what exactly was necessary?

I don't recall exactly what I did anymore, and it was pretty fiddly
anyway.  Upthread I suggested

>> Now that we know where the problem is, you could probably make it highly
>> reproducible by inserting a sleep of a few msec between the rename and the
>> second fsync.

so why not try that first?

regards, tom lane




Re: PostgreSQL and TLS 1.2

2019-08-26 Thread Chapman Flack
On 8/26/19 10:10 AM, ROS Didier wrote:
> Hi
> 
> I would like to check that postgresql is compatible with TLS 1.2.
> what test could I do to check this compatibility?

Hi,

I just now pointed this command at our PG 9.5 server at $work:

openssl s_client -connect dbhost:5432 -starttls postgres

and got the following response (excerpted for the relevant parts):

SSL handshake has read 5465 bytes and written 737 bytes
Verification: OK
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384


Then I tried this version:

openssl s_client -connect dbhost:5432 -starttls postgres -tls1_2

and got this result:

SSL handshake has read 5258 bytes and written 343 bytes
Verification: OK
---
New, TLSv1.2, Cipher is ECDHE-RSA-AES256-GCM-SHA384
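
A complementary check from inside the database is pg_stat_ssl (available
since 9.5), e.g. over a connection opened with sslmode=require:

    SELECT ssl, version, cipher
    FROM pg_stat_ssl
    WHERE pid = pg_backend_pid();

The "version" column should then report TLSv1.2 (or whatever was
negotiated).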


Regards,
-Chap




PostgreSQL and TLS 1.2

2019-08-26 Thread ROS Didier
Hi

I would like to check that postgresql is compatible with TLS 1.2.
what test could I do to check this compatibility?

Thanks in advance

Best Regards
Didier ROS
EDF









This message and any attachments (the 'Message') are intended solely for the 
addressees. The information contained in this Message is confidential. Any use 
of information contained in this Message not in accord with its purpose, any 
dissemination or disclosure, either whole or partial, is prohibited except 
formal approval.

If you are not the addressee, you may not copy, forward, disclose or use any 
part of it. If you have received this message in error, please delete it and 
all copies from your system and notify the sender immediately by return message.

E-mail communication cannot be guaranteed to be timely secure, error or 
virus-free.


Re: Statement timeout in pg_rewind

2019-08-26 Thread Alexander Kukushkin
Hi,

On Mon, 26 Aug 2019 at 06:28, Michael Paquier  wrote:

> Alexander, it seems to me that we should also consider lock_timeout
> and idle_in_transaction_session_timeout (new as of 9.6), no?  We could

Well, I was thinking about it and came to the conclusion that we are
neither taking heavy locks nor explicitly opening a transaction and
therefore we can avoid changing them.
But maybe you are right, having them set to the safe value shouldn't hurt.

> also group the PQexec/PQresultStatus into a simple wrapper which gets
> also called by run_simple_query().

I don't think we can use the same wrapper for run_simple_query() and
for places where we call a SET, because PQresultStatus() returns
PGRES_TUPLES_OK and PGRES_COMMAND_OK respectively.
Passing expected ExecStatusType to the wrapper for comparison is
looking a bit ugly to me.
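
For what it's worth, a wrapper that only covers the PGRES_COMMAND_OK case
could stay quite small; a rough sketch, assuming libpq-fe.h and pg_rewind's
existing pg_fatal() helper (function name and error wording are only
illustrative, not taken from the actual sources):

    /* Run a command that is expected to return no tuples. */
    static void
    run_simple_command(PGconn *conn, const char *sql)
    {
        PGresult   *res;

        res = PQexec(conn, sql);

        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            pg_fatal("error running query (%s) in source server: %s",
                     sql, PQresultErrorMessage(res));

        PQclear(res);
    }

That keeps run_simple_query() untouched for the PGRES_TUPLES_OK cases.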

Regards,
--
Alexander Kukushkin




Re: subscriptionCheck failures on nightjar

2019-08-26 Thread Tomas Vondra

On Tue, Aug 13, 2019 at 05:04:35PM +0900, Michael Paquier wrote:

On Wed, Feb 13, 2019 at 01:51:47PM -0800, Andres Freund wrote:

I'm not yet sure that that's actually something that's supposed to
happen, I got to spend some time analysing how this actually
happens. Normally the contents of the slot should actually prevent it
from being removed (as they're newer than
ReplicationSlotsComputeLogicalRestartLSN()). I kind of wonder if that's
a bug in the drop logic in newer releases.


In the same context, could it be a consequence of 9915de6c which has
introduced a conditional variable to control slot operations?  This
could have exposed more easily a pre-existing race condition.
--


This is one of the remaining open items, and we don't seem to be moving
forward with it :-(

I'm willing to take a stab at it, but to do that I need a way to
reproduce it. Tom, you mentioned you've managed to reproduce it in a
qemu instance, but that it took some fiddling with qemu parmeters or
something. Can you share what exactly was necessary?

An observation about the issue - while we started to notice this after
Decemeber, that's mostly because the PANIC patch went it shortly before.
We've however seen the issue before, as Thomas Munro mentioned in [1].

Those reports are from August, so it's quite possible something in the
first CF upset the code. And there's only a single commit in 2018-07
that seems related to logical decoding / snapshots [2], i.e. f49a80c:

commit f49a80c481f74fa81407dce8e51dea6956cb64f8
Author: Alvaro Herrera 
Date:   Tue Jun 26 16:38:34 2018 -0400

   Fix "base" snapshot handling in logical decoding

   ...

The other reason to suspect this is related is that the fix also made it
to REL_11_STABLE at that time, and if you check the buildfarm data [3],
you'll see 11 fails on nightjar too, from time to time.

This means it's not a 12+ only issue, it's a live issue on 11. I don't
know if f49a80c is the culprit, or if it simply uncovered a pre-existing
bug (e.g. due to timing).


[1] 
https://www.postgresql.org/message-id/CAEepm%3D0wB7vgztC5sg2nmJ-H3bnrBT5GQfhUzP%2BFfq-WT3g8VA%40mail.gmail.com

[2] https://commitfest.postgresql.org/18/1650/

[3] 
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=nightjar&br=REL_11_STABLE

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Zedstore - compressed in-core columnar storage

2019-08-26 Thread Ashutosh Sharma
Thanks Ashwin and Heikki for your responses. I've one more query here,

If BTree index is created on a zedstore table, the t_tid field of
Index tuple contains the physical tid that is not actually pointing to
the data block instead it contains something from which the logical
tid can be derived. So, when IndexScan is performed on a zedstore
table, it fetches the physical tid from the index page and derives the
logical tid out of it and then retrieves the data corresponding to
this logical tid from the zedstore table. For that, it kind of
performs SeqScan on the zedstore table for the given tid. From this it
appears to me as if the Index Scan is as good as SeqScan for zedstore
table. If that is true, will we be able to get the benefit of
IndexScan on zedstore tables? Please let me know if i am missing
something here.

AFAIU, the following user level query on zedstore table

select * from zed_tab where a = 3;

gets internally converted to

select * from zed_tab where tid = 3; -- assuming that index is created
on column 'a' and the logical tid associated with a = 3 is 3.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Aug 15, 2019 at 3:08 PM Heikki Linnakangas  wrote:
>
> On 14/08/2019 20:32, Ashwin Agrawal wrote:
> > On Wed, Aug 14, 2019 at 2:51 AM Ashutosh Sharma wrote:
> >> 2) Is there a chance that IndexOnlyScan would ever be required for
> >>zedstore tables considering the design approach taken for it?
> >
> > We have not given much thought to IndexOnlyScans so far. But I think
> > IndexOnlyScan definitely would be beneficial for zedstore as
> > well. Even for normal index scans as well, fetching as many columns
> > possible from Index itself and only getting rest of required columns
> > from the table would be good for zedstore. It would help to further
> > cut down IO. Ideally, for visibility checking only TidTree needs to be
> > scanned and visibility checked with the same, so the cost of checking
> > is much lower compared to heap (if VM can't be consulted) but still is
> > a cost. Also, with vacuum, if UNDO log gets trimmed, the visibility
> > checks are pretty cheap. Still given all that, having VM type thing to
> > optimize the same further would help.
>
> Hmm, yeah. An index-only scan on a zedstore table could perform the "VM
> checks" by checking the TID tree in the zedstore. It's not as compact as
> the 2 bits per TID in the heapam's visibility map, but it's pretty good.
>
> >> Further, I tried creating a zedstore table with btree index on one of
> >> it's column and loaded around 50 lacs record into the table. When the
> >> indexed column was scanned (with enable_seqscan flag set to off), it
> >> went for IndexOnlyScan and that took around 15-20 times more than it
> >> would take for IndexOnly Scan on heap table just because IndexOnlyScan
> >> in zedstore always goes to heap as the visibility check fails.
>
> Currently, an index-only scan on zedstore should be pretty much the same
> speed as a regular index scan. All the visibility checks will fail, and
> you end up fetching every row from the table, just like a regular index
> scan. So I think what you're seeing is that the index fetches on a
> zedstore table is much slower than on heap.
>
> Ideally, on a column store the index fetches would only fetch the needed
> columns, but I don't think that's been implemented yet, so all the
> columns are fetched. That can make a big difference, if you have a wide
> table with lots of columns, but only actually need a few of them. Was
> your test case something like that?
>
> We haven't spent much effort on optimizing index fetches yet, so I hope
> there's many other little tweaks there as well, that we can do to make
> it faster.
>
> - Heikki




Re: pgbench - allow to create partitioned tables

2019-08-26 Thread Asif Rehman
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Hi,

The patch looks good to me. Just one suggestion: the --partition-method option
should be made dependent on --partitions, because it has no use unless used
with --partitions. What do you think?
 
Regards,
Asif

The new status of this patch is: Waiting on Author


Re: Creating partitions automatically at least on HASH?

2019-08-26 Thread Rafia Sabih
On Sun, 18 Aug 2019 at 11:33, Fabien COELHO  wrote:

>
> Hello Robert & Robert,
>
> > - no partitions are created immediately (current case)
> >   but will have to be created manually later
> >
> > - static partitions are created automatically, based on provided
> >   parameters
> >
> > - dynamic partitions will be created later, when needed, based
> >   on provided parameters again.
> >
> > Even if all that is not implemented immediately.
> >
> >> We need something that will let you specify just a modulus for hash
> >> partitions, a start, end, and interval for range partitions, and a list
> of
> >> bounds for list partitions.  If we're willing to create a new keyword,
> we
> >> could make PARTITIONS a keyword. Then:
> >>
> >> PARTITION BY HASH (whatever) PARTITIONS 8
> >
> > I think that it should reuse already existing keywords, i.e. MODULUS
> should
> > appear somewhere.
> >
> > Maybe:
> >
> >  ... PARTITION BY HASH (whatever)
> >  [ CREATE [IMMEDIATE | DEFERRED] PARTITIONS (MODULUS 8) |
> >NOCREATE or maybe NO CREATE ];
>
> I have given a small go at the parser part of that.
>
> There are 3 types of partitions with 3 dedicated syntax structures to
> handle their associated parameters (WITH …, FROM … TO …, IN …). ISTM that
> it is a "looks good from far away" idea, but when trying to extend that it
> is starting to be a pain. If a 4th partition type is added, should it be
> yet another syntax? So I'm looking for an generic and extensible syntax
> that could accomodate all cases for automatic creation of partitions.
>
> Second problem, adding a "CREATE" after "PARTITION BY … (…)" create
> shift-reduce conflicts with potential other CREATE TABLE option
> specification syntax. Not sure which one, but anyway. So the current
> generic syntax I'm considering is using "DO" as a trigger to start the
> optional automatic partition creation stuff:
>
>CREATE TABLE Stuff (...)
>  PARTITION BY [HASH | RANGE | LIST] (…)
>DO NONE -- this is the default
>DO [IMMEDIATE|DEFERRED] USING (…)
>
> Where the USING part would be generic keword value pairs, eg:
>
> For HASH: (MODULUS 8) and/or (NPARTS 10)
>
> For RANGE: (START '1970-01-01', STOP '2020-01-01', INCREMENT '1 year')
>  and/or (START 1970, STOP 2020, NPARTS 50)
>
> And possibly for LIST: (IN (…), IN (…), …), or possibly some other
> keyword.
>
> The "DEFERRED" could be used as an open syntax for dynamic partitioning,
> if later someone would feel like doing it.
>
ISTM that "USING" is better than "WITH" because WITH is already used
> specifically for HASH and other optional stuff in CREATE TABLE.
>
> The text constant would be interpreted depending on the partitioning
> expression/column type.
>
> Any opinion about the overall approach?
>
>
> I happen to start a similar discussion [1] being unaware of this one and
there Ashutosh Sharma talked about interval partitioning in Oracle. Looking
closely it looks like we can have this automatic partitioning more
convenient by having something similar. Basically, it is creating
partitions on demand or lazy partitioning. To explain a bit more, let's
take range partition for example, first parent table is created and it's
interval and start and end values are specified and it creates only the
parent table just like it works today. Now,  if there comes a insertion
that does not belong to the existing (or any, in the case of first
insertion) partition(s), then the corresponding partition is created, I
think it is extensible to other partitioning schemes as well. Also it is
likely to have a positive impact on the queries, because there will be
required partitions only and would not require to educate planner/executor
about many empty partitions.

[1]
https://www.postgresql.org/message-id/flat/20190820205005.GA25823%40alvherre.pgsql#c67245b98e2cfc9c3bd261f134d05368

-- 
Regards,
Rafia Sabih


Proposal: Better generation of values in GENERATED columns.

2019-08-26 Thread Daniel Migowski

Hello,

one of the most frustrating things when I started with PostgreSQL was
that IDENTITY columns are based on sequences that are completely
disconnected from the table contents, so manually imported data will
lead to errors like 'duplicate key value violates unique constraint
"xyz_pkey"'.


I had to fight this initially with an insert trigger that always updates 
the sequences on each insert, or with client side code that updates the 
sequence when such an error occurs and then retries the insert.
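
Roughly, such a trigger-based band-aid can look like this (table, column
and sequence names are made up for illustration; it is a workaround, not
a recommendation):

    CREATE FUNCTION keep_xyz_seq_in_sync() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        -- never move the sequence backwards, only forwards past NEW.id
        PERFORM setval(pg_get_serial_sequence('xyz', 'id'),
                       GREATEST((SELECT last_value FROM xyz_id_seq), NEW.id));
        RETURN NEW;
    END;
    $$;

    CREATE TRIGGER xyz_keep_seq_in_sync
        AFTER INSERT ON xyz
        FOR EACH ROW EXECUTE PROCEDURE keep_xyz_seq_in_sync();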


Even Microsoft Access did a better job at autogenerated primary keys,
and while I love the elegant design of PostgreSQL in many ways, I believe
we can do better here. I would like to implement a fallback solution
that detects such errors and automatically updates the nextvalue of the
sequence when the nextvalue is already used on insert.


I believe this can be implemented without affecting performance 
negatively when one just does extra stuff in the error case, so I 
wouldn't do table scans when creating the insert initially.


Any reasons why this isn't a good idea to try?

Regards,
Daniel Migowski





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-26 Thread Joe Conway
On 8/26/19 2:53 AM, Masahiko Sawada wrote:
> I guess that this depends on the number of encryption keys we use. If
> we have encryption keys per tablespace or database the number of keys
> would be at most several dozen or several hundred. It's enough to have
> them in flat-file format on the disk and to load them to the hash
> table on the shared memory. We would not need a complex mechanism.
> OTOH if we have keys per tables, we would need to consider indexes and
> buffering as they might not fit in the memory.

Master key(s) need to be kept in memory, but derived keys (using KDF)
would be calculated at time of use, I would think.

>> Some kind of flat-file based approach with a temp file and renaming of
>> files using durable_rename(), like what we used to do with
>> pg_shadow/authid, and now do with replorigin_checkpoint and such?
> 
> The PoC patch I created does that for the keyring file. When key
> rotation, the correspond WAL contains all re-encrypted keys with the
> master key identifier, and the recovery restores the keyring file. One
> good point of this approach is that external tools and startup process
> read it easier. It doesn't require backend codes such as system cache
> and heap functions.

That sounds like a good approach.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-26 Thread Masahiko Sawada
On Mon, Aug 26, 2019 at 7:49 PM Joe Conway  wrote:
>
> On 8/26/19 2:53 AM, Masahiko Sawada wrote:
> > I guess that this depends on the number of encryption keys we use. If
> > we have encryption keys per tablespace or database the number of keys
> > would be at most several dozen or several hundred. It's enough to have
> > them in flat-file format on the disk and to load them to the hash
> > table on the shared memory. We would not need a complex mechanism.
> > OTOH if we have keys per tables, we would need to consider indexes and
> > buffering as they might not fit in the memory.
>
> Master key(s) need to be kept in memory, but derived keys (using KDF)
> would be calculated at time of use, I would think.

Yes, we can do that and the PoC patch does so. I'm rather concerned
about the salt and info used to derive keys. We would need at least the
info, which could be an OID perhaps, for each key. Also these data need
to be accessible by both the frontend tool and the startup process. If
the info is very small data, say a 4-byte OID, we could have all of them
in memory even if we have keys per table.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

2019-08-26 Thread Antonin Houska
Peter Geoghegan  wrote:

> Consumers of this new infrastructure probably won't be limited to the
> deduplication feature;

It'd also solve an open problem of the aggregate push-down patch [1]; in
particular see the mention of pg_opclass in [2]: the partial aggregate
node below the final join must not put multiple opclass-equal values
which are not byte-wise equal into the same group, because some
information needed by a WHERE or JOIN/ON condition may be lost this
way. The scale of the numeric type is the most obvious example.

> I would like to:
> 
> * Get some buy-in on whether or not the precise distinctions I would
> like to make are correct for deduplication in particular, and as
> useful as possible for other cases that we may need to add later on.
> 
> * Figure out the exact interface through which opclass/opfamily
> authors can represent that their notion of equality is compatible with
> deduplication/compression.

It's not entirely clear to me whether opclass or opfamily should carry
this information. opclass probably makes more sense for index-related
problems, and the aggregate push-down patch can live with that. I don't
see a particular reason to add any flag to opfamily. (The planner uses
both the pg_opclass and pg_opfamily catalogs.)

I think the fact that the aggregate push-down would benefit from this
enhancement should affect the choice of the new catalog attribute name,
i.e. it should not mention words as concrete as "deduplication" or
"compression".

> (I think that the use of nondeterministic collations should disable
> deduplication without explicit action from the operator class -- that
> should just be baked in.)

(I think the aggregate push-down needs to consider the nondeterministic
collations too, I missed that so far.)

[1] https://commitfest.postgresql.org/24/1247/

[2] https://www.postgresql.org/message-id/10529.1547561178%40localhost

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Yet another fast GiST build

2019-08-26 Thread Heikki Linnakangas

On 26/08/2019 10:59, Andrey Borodin wrote:

Hi!

In many cases GiST index can be build fast using z-order sorting.

I've looked into proof of concept by Nikita Glukhov [0] and it looks very 
interesting.
So, I've implemented yet another version of B-tree-like GiST build.


Cool!


My biggest concern is that passing function to relation option seems
a bit hacky. You can pass there any function matching sort support
signature. Embedding this function into opclass makes no sense: it
does not affect scan anyhow.


I think it should be a new, optional, GiST "AM support function", in 
pg_amproc. That's how the sort support functions are defined for B-tree 
opfamilies, too.
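
For comparison, a B-tree opfamily exposes its sort support routine as
pg_amproc support function number 2; the built-in integer_ops family, for
instance, is effectively registered like this (shown only to illustrate
the syntax a GiST analogue could mirror -- the entry already exists in a
stock installation):

    ALTER OPERATOR FAMILY integer_ops USING btree
        ADD FUNCTION 2 (int4, int4) btint4sortsupport(internal);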


- Heikki




Re: Yet another fast GiST build

2019-08-26 Thread Komяpa
Hello,

This is very interesting. In my pipeline currently GiST index rebuild is
the biggest time consuming step.

I believe introducing optional concept of order in the GiST opclass will be
beneficial not only for fast build, but for other tasks later:
 - CLUSTER can order the table using that notion, in parallel way.
 - btree_gist can be even closer to btree by getting the tuples sorted
inside page.
 - tree descend on insertion in future can traverse the list in more
opportunistic way, calculating penalty for siblings-by-order first.

I believe that everywhere the idea of ordering is needed, it's provided by
giving a btree opclass.

How about giving a link to btree opclass inside a gist opclass?


On Mon, Aug 26, 2019 at 10:59 AM Andrey Borodin 
wrote:

> Hi!
>
> In many cases a GiST index can be built fast using z-order sorting.
>
> I've looked into proof of concept by Nikita Glukhov [0] and it looks very
> interesting.
> So, I've implemented yet another version of B-tree-like GiST build.
> Its main use case and benefits can be summarized with a small example:
>
> postgres=# create table x as select point (random(),random()) from
> generate_series(1,300,1);
> SELECT 300
> Time: 5061,967 ms (00:05,062)
> postgres=# create index ON x using gist (point ) with
> (fast_build_sort_function=gist_point_sortsupport);
> CREATE INDEX
> Time: 6140,227 ms (00:06,140)
> postgres=# create index ON x using gist (point );
> CREATE INDEX
> Time: 32061,200 ms (00:32,061)
>
> As you can see, the Z-order build is an order of magnitude faster. Select
> performance is roughly the same. Also, the index is significantly smaller.
>
> Nikita's PoC is faster because it uses parallel build, but it intervenes in
> the B-tree code a lot (for reuse). This patchset is isolated to GiST.
> My biggest concern is that passing a function via a relation option seems a
> bit hacky. You can pass any function there that matches the sort support
> signature. Embedding this function into the opclass makes no sense: it does
> not affect scans anyhow.
>
> In the current version, docs and tests are not implemented. I want to
> discuss the overall design. Do we really want yet another GiST build
> method, given that it is 3-10 times faster?
>
> Thanks!
>
> Best regards, Andrey Borodin.
>
> [0]
> https://github.com/postgres/postgres/compare/master...glukhovn:gist_btree_build
>
>

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


A problem about partitionwise join

2019-08-26 Thread Richard Guo
Hi All,

To generate a partitionwise join, we need to make sure there exists an
equi-join condition for each pair of partition keys; this check is
performed by have_partkey_equi_join(). This makes sense and works well.

But if, let's say, a certain pair of partition keys (foo.k = bar.k)
has formed an equivalence class containing consts, no join clause would
be generated for it, since we have already generated 'foo.k = const' and
'bar.k = const' and pushed them into the proper restrictions earlier.

This will make partitionwise join fail to be planned if there are
multiple partition keys and the pushed-down restrictions 'xxx = const'
fail to prune away any partitions.

Consider the examples below:

create table p (k1 int, k2 int, val int) partition by range(k1,k2);
create table p_1 partition of p for values from (1,1) to (10,100);
create table p_2 partition of p for values from (10,100) to (20,200);
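
Note that these examples assume partitionwise join has been enabled, since
it defaults to off:

set enable_partitionwise_join = on;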

If we are joining on each pair of partition keys, we can generate a
partitionwise join:

# explain (costs off)
select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = bar.k2;
  QUERY PLAN
--
 Append
   ->  Hash Join
 Hash Cond: ((foo.k1 = bar.k1) AND (foo.k2 = bar.k2))
 ->  Seq Scan on p_1 foo
 ->  Hash
   ->  Seq Scan on p_1 bar
   ->  Hash Join
 Hash Cond: ((foo_1.k1 = bar_1.k1) AND (foo_1.k2 = bar_1.k2))
 ->  Seq Scan on p_2 foo_1
 ->  Hash
   ->  Seq Scan on p_2 bar_1
(11 rows)

But if we add another qual 'foo.k2 = const', we will be unable to
generate a partitionwise join any more, because have_partkey_equi_join()
thinks that not every partition key has an equi-join condition.

# explain (costs off)
select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = bar.k2
and foo.k2 = 16;
   QUERY PLAN
-
 Hash Join
   Hash Cond: (foo.k1 = bar.k1)
   ->  Append
 ->  Seq Scan on p_1 foo
   Filter: (k2 = 16)
 ->  Seq Scan on p_2 foo_1
   Filter: (k2 = 16)
   ->  Hash
 ->  Append
   ->  Seq Scan on p_1 bar
 Filter: (k2 = 16)
   ->  Seq Scan on p_2 bar_1
 Filter: (k2 = 16)
(13 rows)

Is this a problem?

Thanks
Richard


Re: Re: Email to hackers for test coverage

2019-08-26 Thread movead...@highgo.ca
  On Mon, 26 Aug 2019 12:48:40 +0800 mich...@paquier.xyz wrote
>Your patch has forgotten to update the alternate output in
>float4-misrounded-input.out.

Thanks for the reminder. I have modified the patch; it is now attached as
'regression_20190826.patch', and I have run a successful test on Cygwin.

--
Movead


regression_20190826.patch
Description: Binary data


Re: refactoring - share str2*int64 functions

2019-08-26 Thread Fabien COELHO


Bonjour Michaël,


So I'll set out to write and use "pg_strtou?int64", i.e. 2 functions, and
then possibly generalize to lower sizes, 32, 16, depending on what is
actually needed.


I am interested in this patch, and the next commit fest is close by.
Are you working on an updated version?  If not, do you mind if I work
on it and post a new version by the beginning of next week based on
all the feedback gathered?


I have started to do something, and I can spend some time on that this 
week, but I'm pretty unclear about what exactly should be done.


The error returning stuff is simple enough, but I'm unclear about what to 
do with pg_uint64, which has a totally different signature. Should it be 
aligned?


--
Fabien Coelho - CRI, MINES ParisTech

Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)

2019-08-26 Thread Tomas Vondra

On Mon, Aug 26, 2019 at 02:34:31PM +1200, Thomas Munro wrote:

On Mon, Aug 26, 2019 at 1:44 PM Justin Pryzby  wrote:

On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote:
> On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan  wrote:
> > I was reminded of this issue from last year, which also appeared to
> > involve BufFileClose() and a double-free:
> >
> > https://postgr.es/m/87y3hmee19@news-spur.riddles.org.uk
> >
> > That was a BufFile that was under the control of a tuplestore, so it
> > was similar to but different from your case. I suspect it's related.
>
> Hmm.  tuplestore.c follows the same coding pattern as nodeHashjoin.c:
> it always nukes its pointer after calling BufFileFlush(), so it
> shouldn't be capable of calling it twice for the same pointer, unless
> we have two copies of that pointer somehow.
>
> Merlin's reported a double-free apparently in ExecHashJoin(), not
> ExecHashJoinNewBatch() like this report.  Unfortunately that tells us
> very little.


Here's another one:

https://www.postgresql.org/message-id/flat/20170601081104.1500.56202%40wrigleys.postgresql.org

Hmm.  Also on RHEL/CentOS 6, and also involving sorting, hashing,
BufFileClose() but this time the glibc double free error is in
repalloc().

And another one (repeatedly happening):

https://www.postgresql.org/message-id/flat/3976998C-8D3B-4825-9B10-69ECB70A597A%40appnexus.com

Also on RHEL/CentOS 6, this time a sort in once case and a hash join
in another case.

Of course it's entirely possible that we have a bug here and I'm very
keen to find it, but I can't help noticing the common factor here is
that they're all running ancient RHEL 6.x releases, except Merlin who
didn't say.  Merlin?



It'd be interesting to know the exact glibc version for those machines.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Yet another fast GiST build

2019-08-26 Thread Andrey Borodin
Hi!

In many cases a GiST index can be built fast using z-order sorting.

I've looked into proof of concept by Nikita Glukhov [0] and it looks very 
interesting.
So, I've implemented yet another version of B-tree-like GiST build.
Its main use case and benefits can be summarized with a small example:

postgres=# create table x as select point (random(),random()) from 
generate_series(1,300,1);
SELECT 300
Time: 5061,967 ms (00:05,062)
postgres=# create index ON x using gist (point ) with 
(fast_build_sort_function=gist_point_sortsupport);
CREATE INDEX
Time: 6140,227 ms (00:06,140)
postgres=# create index ON x using gist (point );
CREATE INDEX
Time: 32061,200 ms (00:32,061)

As you can see, the Z-order build is an order of magnitude faster. Select
performance is roughly the same. Also, the index is significantly smaller.

Nikita's PoC is faster because it uses parallel build, but it intervenes in
the B-tree code a lot (for reuse). This patchset is isolated to GiST.
My biggest concern is that passing a function via a relation option seems a
bit hacky. You can pass any function there that matches the sort support
signature. Embedding this function into the opclass makes no sense: it does
not affect scans anyhow.

In the current version, docs and tests are not implemented. I want to
discuss the overall design. Do we really want yet another GiST build
method, given that it is 3-10 times faster?

Thanks!

Best regards, Andrey Borodin.

[0] 
https://github.com/postgres/postgres/compare/master...glukhovn:gist_btree_build



0001-function-relopt-for-gist-build.patch
Description: Binary data


0003-Implement-GiST-build-using-sort-support.patch
Description: Binary data


0002-Add-sort-support-for-point-gist_point_sortsupport.patch
Description: Binary data


Re: Can't we give better table bloat stats easily?

2019-08-26 Thread Masahiko Sawada
On Sat, Aug 17, 2019 at 9:59 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-16 20:39:21 -0400, Greg Stark wrote:
> > But isn't this all just silliness these days? We could actually sum up the
> > space recorded in the fsm and get a much more trustworthy number in
> > milliseconds.
>
> You mean like pgstattuple_approx()?
>
> https://www.postgresql.org/docs/current/pgstattuple.html
>
> Or something different?
>
> > I rigged up a quick proof of concept and the code seems super simple and
> > quick. There are one or two tables where the number is a bit suspect, and
> > there's no FSM if vacuum hasn't run, but that seems pretty small potatoes
> > for such a huge help in reducing user pain.
>
> Hard to comment on what you propose, without more details. But note that
> you can't just look at the FSM, because in a lot of workloads it is
> often hugely out of date. And fairly obviously it doesn't provide you
> with information about how much space is currently occupied by dead
> tuples.  What pgstattuple_approx does is to use the FSM for blocks that
> are all-visible, and look at the page otherwise.
>

It's just an idea, but we could have pgstattuple_approx use a sample scan
to estimate table bloat even faster.
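
For reference, a minimal call of the existing function looks like this
(assuming the pgstattuple extension is installed; 'some_table' stands in
for a real table):

create extension if not exists pgstattuple;
select * from pgstattuple_approx('some_table');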

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: refactoring - share str2*int64 functions

2019-08-26 Thread Michael Paquier
Hi Fabien,

On Thu, Aug 01, 2019 at 04:48:35PM +0200, Fabien COELHO wrote:
> Ok, so there is an agreement on reworking the unsigned function. I missed
> this bit.
> 
> So I'll set out to write and use "pg_strtou?int64", i.e. 2 functions, and
> then possibly generalize to lower sizes, 32, 16, depending on what is
> actually needed.

I am interested in this patch, and the next commit fest is close by.
Are you working on an updated version?  If not, do you mind if I work
on it and post a new version by the beginning of next week based on
all the feedback gathered?
--
Michael


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-26 Thread Masahiko Sawada
On Fri, Aug 23, 2019 at 11:35 PM Stephen Frost  wrote:
>
> Greetings,
>
> * Bruce Momjian (br...@momjian.us) wrote:
> > On Fri, Aug 23, 2019 at 07:45:22AM -0400, Stephen Frost wrote:
> > > Having listed out the feature set of each of the other major databases
> > > when it comes to TDE is exactly how we objectively look at what is being
> > > done in the industry, and that then gives us an understanding of what
> > > users (and auditors) coming from other platforms will expect.
> > >
> > > I entirely agree that we shouldn't just copy N feature from X other
> > > database system unless we feel that's the best approach, but when every
> > > other database system out there has capability Y for the general feature
> > > X that we're thinking about implementing, we should be questioning an
> > > approach which doesn't include that.
> >
> > Agreed.  The features of other databases are a clear source for what we
> > should consider and run through the useful/reasonable filter.
>
> Following on from that- when other databases don't have something that
> we're thinking about implementing, maybe we should be contemplating if
> it really makes sense as a requirement for us.
>
> Specifically in this case- I went back and tried to figure out what
> other database systems have an "encrypt EVERYTHING" option.  I didn't
> have much luck finding one though.  So I think we need to ask ourselves-
> the "check box" that we're trying to check off with TDE, do the other
> database systems check that box?  If so, then it looks like the "check
> box" isn't actually "encrypt EVERYTHING", it's more along the lines of
> "make sure all regular user data is encrypted automatically" or some
> such, and that's a very different requirement, which seems to be
> answered by the other systems by having a KMS + tablespace/database
> level encryption.  We certainly shouldn't be putting a lot of effort
> into building something that is either overkill or won't be interesting
> to users due to limitations like "have to take the entire cluster
> offline to re-key it".
>
> Now, that KMS has to be encrypted using a master key, of course, and we
> have to make sure that it is able to survive across a crash, and it'd
> sure be nice if it was indexed.  One option for such a KMS would be
> something entirely external (which could potentially just be another PG
> database or something) but it'd be nice if we had something built-in.
> We might also want it to be replicated (or maybe we don't, as was
> discussed on the call, to allow for a replica to use an independent set
> of keys- of course that leads to issues with pg_rewind and such though).

I think most users would expect the physical standby server to use the
same key as the primary server, at least for the master key. Otherwise
they would need to switch keys on every failover. The same goes for WAL
encryption keys: since it's common to fetch archived WAL files produced
by the primary server via restore_command (for example using scp), the
standby server needs to use the same keys, or at least know them. For
logical replication, I think that since we would send unencrypted data
and encrypt it on the subscriber, which is initialized as a separate
database cluster, we can use different keys on both sides.

> Anything built-in does seem like it'd be a fair bit of work to get it to
> address those requirements, but that does seem to be what the other
> database systems have done.  Unfortunately, their documentation doesn't
> seem to really say exactly what they've done to address that.

I guess this depends on the number of encryption keys we use. If we have
encryption keys per tablespace or per database, the number of keys would
be at most several dozen or several hundred. It would be enough to keep
them in a flat file on disk and load them into a hash table in shared
memory; we would not need a complex mechanism. OTOH, if we have keys per
table, we would need to consider indexing and buffering, as they might
not fit in memory.

> A couple random ideas that probably won't work, but I'll put them out
> there for others to shoot down-
>
> Some kind of 2-phase WAL pass, where we do WAL replay for the
> non-encrypted bits first (which would include the KMS) and then go back
> and WAL replay the encrypted stuff.  Seems terrible.
>
> An independent WAL for the KMS only.  Ugh, do we need another walwriter
> then?  and buffers, and lots of other stuff.
>
> Some kind of flat-file based approach with a temp file and renaming of
> files using durable_rename(), like what we used to do with
> pg_shadow/authid, and now do with replorigin_checkpoint and such?

The PoC patch I created does that for the keyring file. On key rotation,
the corresponding WAL record contains all re-encrypted keys along with
the master key identifier, and recovery restores the keyring file. One
good point of this approach is that external tools and the startup
process can read it more easily. It doesn't require backend code such as
the system cache and heap

Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-08-26 Thread Masahiko Sawada
On Fri, Aug 23, 2019 at 8:44 PM Michael Paquier  wrote:
>
> On Fri, Aug 23, 2019 at 03:35:01PM +0900, Masahiko Sawada wrote:
> > Good idea. I've updated the doc update patch.
>
> Thanks.  I have removed the output part as I am not sure that it is
> that helpful for the reader, and applied it down to v10 where the
> sections for function types have been introduced (see b5e3942).  It
> felt also more natural to move the description of the output after
> giving the query.

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center