Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Amit Kapila
On Sat, Jul 20, 2019 at 4:17 AM Peter Geoghegan  wrote:
>
> On Fri, Jul 19, 2019 at 12:52 PM Robert Haas  wrote:
> > Right, that's definitely a big part of the concern here, but I don't
> > really believe that retaining locks is absolutely required, or even
> > necessarily desirable.  For instance, suppose that I create a table,
> > bulk-load a whole lotta data into it, and then abort.  Further suppose
> > that by the time we start trying to process the undo in the
> > background, we can't get the lock. Well, that probably means somebody
> > is performing DDL on the table.
>
> I believe that the primary reason why certain other database systems
> retain locks until rollback completes (or release their locks in
> reverse order, as UNDO processing progresses) is that application code
> will often repeat exactly the same actions on receiving a transient
> error, until the action finally completes successfully. Just like with
> serialization failures, or with manually implemented UPSERT loops that
> must sometimes retry. This is why UNDO is often (or always) processed
> synchronously, blocking progress of the client connection as its xact
> rolls back.
>
> Obviously these other systems could easily hand off the work of
> rolling back the transaction to an asynchronous worker process, and
> return success to the client that encounters an error (or asks to
> abort/roll back) almost immediately. I have to imagine that they
> haven't implemented this straightforward optimization because it makes
> sense that the cost of rolling back the transaction is primarily borne
> by the client that actually rolls back.
>

It is also possible that there are other disadvantages or technical
challenges in those systems that led them not to implement such a
mechanism.  I think one such database prepares a consistent copy of
pages during read operations based on an SCN or something like that.  It
might not be as easy for such a system to check whether there is pending
undo that needs to be consulted.  I am not saying that there is no way
to overcome such things, but it might have incurred much more cost or
had other disadvantages.  I am not sure it is straightforward to guess
why some other system does things in a particular way unless there is
explicit documentation about it.

Having said that, I agree that there are a good number of advantages to
performing the actions in the client that actually rolls back, and we
should do that in cases where transferring the work to background
workers is not a good idea, such as for short transactions.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Amit Kapila
On Fri, Jul 19, 2019 at 10:58 PM Robert Haas  wrote:
>
> One other thing that seems worth noting is that we have to consider
> what happens after a restart.  After a crash, and depending on exactly
> how we design it perhaps also after a non-crash restart, we won't
> immediately know how many outstanding transactions need undo; we'll
> have to grovel through the undo logs to find out. If we've got a hard
> cap, we can't allow new undo-using transactions to start until we
> finish that work.  It's possible that, at the moment of the crash, the
> maximum number of items had already been pushed into the background,
> and every foreground session was busy trying to undo an abort as well.
> If so, we're already up against the limit.  We'll have to scan through
> all of the undo logs and examine each transaction to get a count on
> how many transactions are already in a needs-undo-work state; only
> once we have that value do we know whether it's OK to admit new
> transactions to using the undo machinery, and how many we can admit.
> In typical cases, that won't take long at all, because there won't be
> any pending undo work, or not much, and we'll very quickly read the
> handful of transaction headers that we need to consult and away we go.
> However, if the hard limit is pretty big, and we're pretty close to
> it, counting might take a long time. It seems bothersome to have this
> interval between when we start accepting transactions and when we can
> accept transactions that use undo. Instead of throwing an ERROR, we
> can probably just teach the system to wait for the background process
> to finish doing the counting; that's what Amit's patch does currently.
>

Yeah; however, we wait for a certain threshold period of time (one
minute) for the counting to finish and then error out.  We could wait
until the counting is finished, but I am not sure that is a good idea
because the user can always try again after some time.
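To make the shape of that concrete, here is a rough sketch of the
wait-with-timeout behavior (the helper function and the error wording
are illustrative assumptions, not the actual patch code):

/*
 * Illustrative only: wait up to one minute for the discard worker to
 * finish counting transactions with pending undo, then give up so that
 * the user can retry later.  UndoCountingFinished() is a hypothetical
 * helper standing in for whatever the patch actually checks.
 */
#define UNDO_COUNT_WAIT_TIMEOUT_MS  (60 * 1000)

int         waited_ms = 0;

while (!UndoCountingFinished())
{
    if (waited_ms >= UNDO_COUNT_WAIT_TIMEOUT_MS)
        ereport(ERROR,
                (errmsg("undo state is still being rebuilt, retry the transaction later")));

    pg_usleep(10 * 1000L);      /* sleep 10ms between checks */
    waited_ms += 10;
}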

> Or, we could not even open for connections until the counting has been
> completed.
>
> When I first thought about this, I was really concerned about the idea
> of a hard limit, but the more I think about it the less problematic it
> seems. I think in the end it boils down to a question of: when things
> break, what behavior would users prefer?
>

One minor thing I would like to add here is that we are providing some
knobs so that systems with a high number of rollbacks can be configured
with a much higher hard limit, such that the limit is never hit on those
systems.  I know it is not always easy to find the right value, but I
guess users can learn from the observed behavior and then change it to
avoid hitting the limit in the future.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-07-19 Thread Dilip Kumar
On Fri, Jul 19, 2019 at 6:35 PM Robert Haas  wrote:
>
> On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila  wrote:
> > We are doing exactly what you have written in the last line of the
> > next paragraph "stop the transaction from writing undo when the hash
> > table is already too full.".  So we will
> > never face the problems related to repeated crash recovery.  The
> > definition of too full is that we stop allowing the new transactions
> > that can write undo when the hash table already has entries
> > equivalent to (UndoRollbackHashTableSize() - MaxBackends).  Does this
> > make sense?
>
> Oops, I was looking in the wrong place.  Yes, that makes sense, but:
>
> 1. It looks like the test you added to PrepareUndoInsert has no
> locking, and I don't see how that can be right.
>

+if (ProcGlobal->xactsHavingPendingUndo >
+(UndoRollbackHashTableSize() - MaxBackends))

The actual hard limit is UndoRollbackHashTableSize(), but we only allow
a new backend to prepare an undo record if there are at least
MaxBackends empty slots in the hash table.  This guarantees that there
is always a slot in the hash table for our current prepare, even if
every backend currently running a transaction aborts and inserts an
entry into the hash table.

I think the problem with this check is that for any backend to prepare
an undo record there must be MaxBackends empty slots in the hash table
available for concurrent backends to insert their requests, and that
seems too restrictive.

Having said that, I think we must actually ensure MaxBackends*2 empty
slots in the hash table, because each backend can enter two requests in
the hash table.
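To make the locking point concrete, here is a rough sketch of how the
check could be done under a lock, with MaxBackends*2 headroom (the lock
name and the error text are assumptions for illustration, not what the
patch currently does):

/*
 * Illustrative only: read the shared counter while holding the lock that
 * protects the rollback hash table, so that the check cannot race with
 * concurrent backends inserting their own requests.
 */
LWLockAcquire(RollbackHTLock, LW_SHARED);       /* hypothetical lock name */

if (ProcGlobal->xactsHavingPendingUndo >
    UndoRollbackHashTableSize() - 2 * MaxBackends)
{
    LWLockRelease(RollbackHTLock);
    ereport(ERROR,
            (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
             errmsg("too many transactions with pending undo work")));
}

LWLockRelease(RollbackHTLock);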

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-07-19 Thread Dilip Kumar
On Fri, Jul 19, 2019 at 6:37 PM Robert Haas  wrote:
>
> On Fri, Jul 19, 2019 at 7:54 AM Amit Khandekar  wrote:
> > + * We just want to mask the cid in the undo record header.  So
> > + * only if the partial record in the current page include the undo
> > + * record header then we need to mask the cid bytes in this page.
> > + * Otherwise, directly jump to the next record.
> > Here, I think you mean : "So only if the partial record in the current
> > page includes the *cid* bytes", rather than "includes the undo record
> > header"
> > May be we can say :
> > We just want to mask the cid. So do the partial record masking only if
> > the current page includes the cid bytes from the partial record
> > header.
>
> Hmm, but why is it correct to mask the CID at all?  Shouldn't that match?
>
We don't write the CID in the WAL, because in hot standby or after
recovery we don't need the actual CID for visibility.  So during REDO,
while generating the undo record, we set the CID to 'FirstCommandId',
which is different from what it was at DO time.  That's the reason we
mask it.
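Roughly the kind of masking being described, as a sketch (the condition
and offset variables are hypothetical; the real code has to deal with a
partial undo record spanning pages):

/*
 * Illustrative only: before the master and standby page images are
 * compared, blank out the CommandId stored in the undo record header,
 * because REDO regenerates it as FirstCommandId rather than the value
 * that was used at DO time.
 */
if (partial_record_includes_cid)            /* hypothetical condition */
    memset(page + cid_offset_in_page,       /* hypothetical offset */
           MASK_MARKER, sizeof(CommandId));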
-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-19 Thread Peter Geoghegan
On Fri, Jul 19, 2019 at 12:32 PM Peter Geoghegan  wrote:
> On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova
>  wrote:
> > Patch 0002 (must be applied on top of 0001) implements preserving of
> > correct TID order
> > inside posting list when inserting new tuples.
> > This version passes all regression tests including amcheck test.
> > I also used following script to test insertion into the posting list:
>
> Nice!

Hmm. So, the attached test case fails amcheck verification for me with
the latest version of the patch:

$ psql -f amcheck-compress-test.sql
DROP TABLE
CREATE TABLE
CREATE INDEX
CREATE EXTENSION
INSERT 0 2001
psql:amcheck-compress-test.sql:6: ERROR:  down-link lower bound
invariant violated for index "idx_desc_nl"
DETAIL:  Parent block=3 child index tid=(2,2) parent page lsn=10/F87A3438.

Note that this test only has an INSERT statement. You have to use
bt_index_parent_check() to see the problem -- bt_index_check() will
not detect the problem.

-- 
Peter Geoghegan


amcheck-compress-test.sql
Description: Binary data


Re: Compile from source using latest Microsoft Windows SDK

2019-07-19 Thread Michael Paquier
On Fri, Jul 19, 2019 at 08:30:38AM -0400, Andrew Dunstan wrote:
> My tests of the VS2017 stuff used this install mechanism on a fresh
> Windows instance:
> 
> choco install -y visualstudio2017-workload-vctools --package-parameters
> "--includeOptional"
> 
> This installed Windows Kits 8.1 and 10, among many other things.

So you have bypassed the problem by installing the v8.1 SDK.  And if
you don't do that and only include the v10 SDK, then you see the
problem.  Functionally, it also means that with a VS2017 compilation
the SDK version is forcibly downgraded; isn't that a bad idea anyway?
--
Michael




Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )

2019-07-19 Thread Michael Paquier
On Fri, Jul 19, 2019 at 10:40:42PM +0900, Ian Barwick wrote:
> Good point, it does actually fail with an error if an impossible slot name
> is provided, so the escaping is superfluous anyway.

FWIW, ReplicationSlotValidateName() gives the reason behind that
restriction:
Slot names may consist out of [a-z0-9_]{1,NAMEDATALEN-1} which should allow
the name to be used as a directory name on every supported OS.
  
> I'll take another look at it later as it's not exactly critical, just stuck
> out when I was passing through the code.

This restriction is unlikely to ever be removed; still, I would rather
keep the escaping logic in pg_basebackup.  This is the usual,
recommended coding pattern, and there is a risk that folks refer to
this code block for their own fancy stuff, spreading the problem.  The
intention behind the code is to use an escaped name as well.  For
those reasons your patch is fine by me.
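For reference, the pattern I have in mind looks roughly like this (a
sketch only, with illustrative variable names, not the exact
pg_basebackup code):

/*
 * Illustrative only: always run the value through the string-literal
 * escaping helper before embedding it in the generated recovery
 * configuration, even though today's slot-name rules make unsafe
 * characters impossible.
 */
appendPQExpBufferStr(recoveryconfcontents, "primary_slot_name = ");
appendStringLiteralConn(recoveryconfcontents, replication_slot, conn);
appendPQExpBufferChar(recoveryconfcontents, '\n');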
--
Michael




Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Peter Geoghegan
On Fri, Jul 19, 2019 at 4:14 PM Robert Haas  wrote:
> I don't think this matters here at all. As long as there's only DML
> involved, there won't be any lock conflicts anyway - everybody's
> taking RowExclusiveLock or less, and it's all fine. If you update a
> row in zheap, abort, and then try to update again before the rollback
> happens, we'll do a page-at-a-time rollback in the foreground, and
> proceed with the update; when we get around to applying the undo,
> we'll notice that page has already been handled and skip the undo
> records that pertain to it.  To get the kinds of problems I'm on about
> here, somebody's got to be taking some more serious locks.

If I'm not mistaken, you're tacitly assuming that you'll always be
using zheap, or something sufficiently similar to zheap. It'll
probably never be possible to UNDO changes to something like a GIN
index on a zheap table, because you can never do that with sensible
concurrency/deadlock behavior.

I don't necessarily have a problem with that. I don't pretend to
understand how much of a problem it is. Obviously it partially depends
on what your ambitions are for this infrastructure. Still, assuming
that I have it right, ISTM that UNDO/zheap/whatever should explicitly
own this restriction.

-- 
Peter Geoghegan




Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Robert Haas
On Fri, Jul 19, 2019 at 6:47 PM Peter Geoghegan  wrote:
> I believe that the primary reason why certain other database systems
> retain locks until rollback completes (or release their locks in
> reverse order, as UNDO processing progresses) is that application code
> will often repeat exactly the same actions on receiving a transient
> error, until the action finally completes successfully. Just like with
> serialization failures, or with manually implemented UPSERT loops that
> must sometimes retry. This is why UNDO is often (or always) processed
> synchronously, blocking progress of the client connection as its xact
> rolls back.

I don't think this matters here at all. As long as there's only DML
involved, there won't be any lock conflicts anyway - everybody's
taking RowExclusiveLock or less, and it's all fine. If you update a
row in zheap, abort, and then try to update again before the rollback
happens, we'll do a page-at-a-time rollback in the foreground, and
proceed with the update; when we get around to applying the undo,
we'll notice that page has already been handled and skip the undo
records that pertain to it.  To get the kinds of problems I'm on about
here, somebody's got to be taking some more serious locks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Multivariate MCV list vs. statistics target

2019-07-19 Thread Tomas Vondra

On Fri, Jul 19, 2019 at 06:12:20AM +, Jamison, Kirk wrote:

On Tuesday, July 9, 2019, Tomas Vondra wrote:

>apparently v1 of the ALTER STATISTICS patch was a bit confused about
>length of the VacAttrStats array passed to statext_compute_stattarget,
>causing segfaults. Attached v2 patch fixes that, and it also makes sure
>we print warnings about ignored statistics only once.
>

v3 of the patch, adding pg_dump support - it works just like when you tweak
statistics target for a column, for example. When the value stored in the
catalog is not -1, pg_dump emits a separate ALTER STATISTICS command setting
it (for the already created statistics object).



Hi Tomas, I stumbled upon your patch.

According to the CF bot, your patch applies cleanly, builds successfully, and
passes make world. Meaning, the pg_dump tap test passed, but there was no
test for the new SET STATISTICS yet. So you might want to add a regression
test for that and integrate it in the existing alter_generic file.

Upon quick read-through, the syntax and docs are correct because it's similar
to the format of ALTER TABLE/INDEX... SET STATISTICS... :
   ALTER [ COLUMN ] column_name SET STATISTICS integer

+   /* XXX What if the target is set to 0? Reset the statistic? */

This also makes me wonder. I haven't looked deeply into the code, but since 0 is
a valid value, I believe it should reset the stats.


I agree resetting the stats after setting the target to 0 seems quite
reasonable. But that's not what we do for attribute stats, because in
that case we simply skip the attribute during the future ANALYZE runs -
we don't reset the stats, we keep the existing stats. So I've done the
same thing here, and I've removed the XXX comment.

If we want to change that, I'd do it in a separate patch for both the
regular and extended stats.


After looking it up, though, this is how it's tested for ALTER TABLE:
/test/regress/sql/stats_ext.sql: -- Ensure things work sanely with SET STATISTICS 0



Well, yeah. But that tests that we skip building the extended statistic
(because we excluded the column from the ANALYZE run).


I've considered making it part of CREATE STATISTICS itself, but it seems a
bit cumbersome and we don't do it for columns either. I'm not against adding
it in the future, but at this point I don't see a need.


I agree. Perhaps that's for another patch should you decide to add it in the 
future.



Right.

Attached is v4 of the patch, with a couple more improvements:

1) I've renamed the if_not_exists flag to missing_ok, because that's
more consistent with the "IF EXISTS" clause in the grammar (the old flag
was kinda the exact opposite), and I've added a NOTICE about the skip.

2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows, because
that's what the function was doing anyway (computing sample size).

3) I've added a couple of regression tests to stats_ext.sql

Aside from that, I've cleaned up a couple of places and improved a bunch
of comments. Nothing huge.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/doc/src/sgml/ref/alter_statistics.sgml 
b/doc/src/sgml/ref/alter_statistics.sgml
index 58c7ed020d..987f9dbc6f 100644
--- a/doc/src/sgml/ref/alter_statistics.sgml
+++ b/doc/src/sgml/ref/alter_statistics.sgml
@@ -26,6 +26,7 @@ PostgreSQL documentation
 ALTER STATISTICS name OWNER TO { 
new_owner | CURRENT_USER | 
SESSION_USER }
 ALTER STATISTICS name RENAME TO 
new_name
 ALTER STATISTICS name SET SCHEMA 
new_schema
+ALTER STATISTICS name SET 
STATISTICS new_target
 
  
 
@@ -93,6 +94,22 @@ ALTER STATISTICS name SET SCHEMA 
  
 
+ 
+  new_target
+  
+   
+The statistic-gathering target for this statistic object for subsequent
+ operations.
+The target can be set in the range 0 to 1; alternatively, set it
+to -1 to revert to using the system default statistics
+target ().
+For more information on the use of statistics by the
+PostgreSQL query planner, refer to
+.
+   
+  
+ 
+
 

   
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8d633f2585..edbd24f5e9 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -307,7 +307,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
VacAttrStats **vacattrstats;
AnlIndexData *indexdata;
int targrows,
-   numrows;
+   numrows,
+   minrows;
double  totalrows,
totaldeadrows;
HeapTuple  *rows;
@@ -491,6 +492,16 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
}
}
 
+   /*
+* Look at extended statistic objects too, as those may define custom
+* statistic target. So we may need to 

Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Peter Geoghegan
On Fri, Jul 19, 2019 at 12:52 PM Robert Haas  wrote:
> Right, that's definitely a big part of the concern here, but I don't
> really believe that retaining locks is absolutely required, or even
> necessarily desirable.  For instance, suppose that I create a table,
> bulk-load a whole lotta data into it, and then abort.  Further suppose
> that by the time we start trying to process the undo in the
> background, we can't get the lock. Well, that probably means somebody
> is performing DDL on the table.

I believe that the primary reason why certain other database systems
retain locks until rollback completes (or release their locks in
reverse order, as UNDO processing progresses) is that application code
will often repeat exactly the same actions on receiving a transient
error, until the action finally completes successfully. Just like with
serialization failures, or with manually implemented UPSERT loops that
must sometimes retry. This is why UNDO is often (or always) processed
synchronously, blocking progress of the client connection as its xact
rolls back.

Obviously these other systems could easily hand off the work of
rolling back the transaction to an asynchronous worker process, and
return success to the client that encounters an error (or asks to
abort/roll back) almost immediately. I have to imagine that they
haven't implemented this straightforward optimization because it makes
sense that the cost of rolling back the transaction is primarily borne
by the client that actually rolls back. And, as I said, because a lot
of application code will immediately retry on failure, which needs to
not deadlock with an asynchronous roll back process.

> If they just did LOCK TABLE or ALTER
> TABLE SET STATISTICS, we are going to need to execute that same undo
> once the DDL is complete. However, if the DDL is DROP TABLE, we're
> going to find that once we can get the lock, the undo is obsolete, and
> we don't need to worry about it any more. Had we made it 100% certain
> that the DROP TABLE couldn't go through until the undo was performed,
> we could avoid having to worry about the undo having become obsolete
> ... but that's hardly a win.  We're better off allowing the drop and
> then just chucking the undo.

I'm sure that there are cases like that. And, I'm pretty sure that at
least one of the other database systems that I'm thinking of isn't as
naive as I suggest, without being sure of the specifics. The classic
approach is to retain the locks, even though that sucks in some cases.
That doesn't mean that you have to do it that way, but it's probably a
good idea to present your design in a way that compares and contrasts
with the classic approach.

I'm pretty sure that this is related to the way in which other systems
retain coarse-grained locks when bitmap indexes are used, even though
that makes them totally unusable with OLTP apps. It seems like it
would help users a lot if their bitmap indexes didn't come with that
problem, but it's a price that they continue to have to pay.

> Point being - there's at least some chance that the operations which
> block forward progress also represent progress of another sort.

That's good, provided that there isn't observable lock starvation. I
don't think that you need to eliminate the theoretical risk of lock
starvation. It deserves careful, ongoing consideration, though. It's
difficult to codify exactly what I have in mind, but I can give you an
informal definition now: It's probably okay if there is the occasional
implementation-level deadlock because the user got unlucky once.
However, it's not okay for there to be *continual* deadlocks because
the user got unlucky just once. Even if the user had *extraordinarily*
bad luck that one time. In short, my sense is that it's never okay for
the system as a whole to "get stuck" in a deadlock or livelock loop.
Actually, it might even be okay if somebody had a test case that
exhibits "getting stuck" behavior, provided the test case is very
delicate, and looks truly adversarial (i.e. it goes beyond being
extraordinarily unlucky).

I know that this is all pretty hand-wavy, and I don't expect you to
have a definitive response. These are some high level concerns that I
have, that may or may not apply to what you're trying to do.

-- 
Peter Geoghegan




Re: partition routing layering in nodeModifyTable.c

2019-07-19 Thread Andres Freund
Hi,

On 2019-07-19 17:11:10 -0400, Alvaro Herrera wrote:
> On 2019-Jul-19, Andres Freund wrote:
> > > - slot = ExecDelete(node, tupleid, oldtuple, 
> > > planSlot,
> > > -   
> > > >mt_epqstate, estate,
> > > + slot = ExecDelete(node, resultRelInfo, tupleid, 
> > > oldtuple,
> > > +   planSlot, 
> > > >mt_epqstate, estate,
> > > true, 
> > > node->canSetTag,
> > > false /* 
> > > changingPart */ , NULL, NULL);
> > >   break;
> > 
> > This reminds me of another complaint: ExecDelete and ExecInsert() have
> > gotten more boolean parameters for partition moving, but only one of
> > them is explained with a comment (/* changingPart */) - think we should
> > do that for all.
> 
> Maybe change the API to use a flags bitmask?
> 
> (IMO the placement of the comment inside the function call, making the
> comma appear preceded with a space, looks ugly.  If we want to add
> comments, let's put each param on its own line with the comment beyond
> the comma.  That's what we do in other places where this pattern is
> used.)

Well, that's the pre-existing style, so I'd just have gone with
that. I'm not sure I buy there's much point in going for a bitmask, as
this is file-private code, not code where changing the signature
requires modifying multiple files.


> > >   /* Initialize the executor state. */
> > > - estate = create_estate_for_relation(rel);
> > > + estate = create_estate_for_relation(rel, );
> > 
> > Hm. It kinda seems cleaner if we were to instead return the relevant
> > index, rather than the entire ResultRelInfo, as an output from
> > create_estate_for_relation().  Makes it clearer that it's still in the
> > EState.
> 
> Yeah.
> 
> > Or perhaps we ought to compute it in a separate step? Then that'd be
> > more amenable to support replicating into partition roots.
> 
> I'm not quite seeing the shape that you're imagining this would take.
> I vote not to mess with that for this patch; I bet that we'll have to
> change a few other things in this code when we add better support for
> partitioning in logical replication.

Yea, I think it's fine to do that separately.  If we wanted to support
partition roots as replication targets, we'd obviously need to do
something pretty similar to what ExecInsert()/ExecUpdate() already
do.  And there we can't just reference an index in EState, as partition
children aren't in there.

I kind of was wondering whether, if we had a separate function for
getting the targeted ResultRelInfo, we'd be able to just extend that
function to support replication.  But now that I think about it a bit
more, that's very much just scratching the surface...

We really ought to have the replication "sink" code share more code with
nodeModifyTable.c.

Greetings,

Andres Freund




Re: partition routing layering in nodeModifyTable.c

2019-07-19 Thread Alvaro Herrera
On 2019-Jul-19, Andres Freund wrote:

> On 2019-07-19 17:52:20 +0900, Amit Langote wrote:
> > The first one (0001) deals with reducing the core executor's reliance
> > on es_result_relation_info to access the currently active result
> > relation, in favor of receiving it from the caller as a function
> > argument.  So no piece of core code relies on it being correctly set
> > anymore.  It still needs to be set correctly for the third-party code
> > such as FDWs.
> 
> I'm inclined to just remove it. There's not much code out there relying
> on it, as far as I can tell. Most FDWs don't support the direct modify
> API, and that's afaict the case where we one needs to use
> es_result_relation_info?

Yeah, I too agree with removing it; since our code doesn't use it, it
seems very likely that it will become slightly out of sync with reality
and we'd not notice until some FDW misbehaves weirdly.

> > -   slot = ExecDelete(node, tupleid, oldtuple, 
> > planSlot,
> > - 
> > >mt_epqstate, estate,
> > +   slot = ExecDelete(node, resultRelInfo, tupleid, 
> > oldtuple,
> > + planSlot, 
> > >mt_epqstate, estate,
> >   true, 
> > node->canSetTag,
> >   false /* 
> > changingPart */ , NULL, NULL);
> > break;
> 
> This reminds me of another complaint: ExecDelete and ExecInsert() have
> gotten more boolean parameters for partition moving, but only one of
> them is explained with a comment (/* changingPart */) - think we should
> do that for all.

Maybe change the API to use a flags bitmask?

(IMO the placement of the comment inside the function call, making the
comma appear preceded with a space, looks ugly.  If we want to add
comments, let's put each param on its own line with the comment beyond
the comma.  That's what we do in other places where this pattern is
used.)
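To illustrate, something along these lines (the flag names are
hypothetical, just to show the shape of the API):

/* Hypothetical flag bits replacing ExecDelete()'s boolean parameters. */
#define EXEC_DELETE_PROCESS_RETURNING   0x0001      /* processReturning */
#define EXEC_DELETE_CAN_SET_TAG         0x0002      /* canSetTag */
#define EXEC_DELETE_CHANGING_PART       0x0004      /* changingPart */

slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple, planSlot,
                  &node->mt_epqstate, estate,
                  EXEC_DELETE_PROCESS_RETURNING |
                  (node->canSetTag ? EXEC_DELETE_CAN_SET_TAG : 0),
                  NULL, NULL);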

> > /* Initialize the executor state. */
> > -   estate = create_estate_for_relation(rel);
> > +   estate = create_estate_for_relation(rel, );
> 
> Hm. It kinda seems cleaner if we were to instead return the relevant
> index, rather than the entire ResultRelInfo, as an output from
> create_estate_for_relation().  Makes it clearer that it's still in the
> EState.

Yeah.

> Or perhaps we ought to compute it in a separate step? Then that'd be
> more amenable to support replicating into partition roots.

I'm not quite seeing the shape that you're imagining this would take.
I vote not to mess with that for this patch; I bet that we'll have to
change a few other things in this code when we add better support for
partitioning in logical replication.

> > +/*
> > + * ExecMove
> > + * Move an updated tuple from the input result relation to the
> > + * new partition of its root parent table
> > + *
> > + * This works by first deleting the tuple from the input result relation
> > + * followed by inserting it into the root parent table, that is,
> > + * mtstate->rootResultRelInfo.
> > + *
> > + * Returns true if it's detected that the tuple we're trying to move has
> > + * been concurrently updated.
> > + */
> > +static bool
> > +ExecMove(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo,
> > +ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot 
> > *planSlot,
> > +EPQState *epqstate, bool canSetTag, TupleTableSlot **slot,
> > +TupleTableSlot **inserted_tuple)
> > +{
>
> I know that it was one of the names I proposed, but now that I'm
> thinking about it again, it sounds too generic. Perhaps
> ExecCrossPartitionUpdate() wouldn't be a quite so generic name? Since
> there's only one reference the longer name wouldn't be painful.

That name sounds good.  Isn't the return convention backwards?  Sounds
like "true" should mean that it succeeded.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Built-in connection pooler

2019-07-19 Thread Konstantin Knizhnik



On 19.07.2019 6:36, Ryan Lambert wrote:
Here's what I found tonight in your latest patch (9).  The output from 
git apply is better, fewer whitespace errors, but still getting the 
following.  Ubuntu 18.04 if that helps.


git apply -p1 < builtin_connection_proxy-9.patch
:79: tab in indent.
                  Each proxy launches its own subset of backends.
:634: indent with spaces.
    union {
:635: indent with spaces.
       struct sockaddr_in inaddr;
:636: indent with spaces.
       struct sockaddr addr;
:637: indent with spaces.
    } a;
warning: squelched 54 whitespace errors
warning: 59 lines add whitespace errors.


A few more minor edits.  In config.sgml:

"If the max_sessions limit is reached new 
connection are not accepted."

Should be "connections".

"The default value is 10, so up to 10 backends will server each database,"
"sever" should be "serve" and the sentence should end with a period 
instead of a comma.



In postmaster.c:

/* The socket number we are listening for poolled connections on */
"poolled" --> "pooled"


"(errmsg("could not create listen socket for locahost")));"

"locahost" -> "localhost".


" * so to support order balancing we should do dome smart work here."

"dome" should be "some"?

I don't see any tests covering this new functionality.  It seems that 
this is significant enough functionality to warrant some sort of 
tests, but I don't know exactly what those would/should be.




Thank you once again for these fixes.
Updated patch is attached.

Concerning testing: I do not think that the connection pooler needs any
special kind of tests.  The idea of the built-in connection pooler is
that it should be able to handle all requests that normal Postgres can.

I have added an extra pass to the regression tests with connection
proxies enabled.  Unfortunately, pg_regress alters some session
variables, so the backend becomes tainted and pooling is not actually
used (but communication through the proxy is tested).

It is also possible to run pgbench with different numbers of connections
through the connection pooler.




Thanks,
Ryan


diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..f8b93f16ed 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -719,6 +719,123 @@ include_dir 'conf.d'
   
  
 
+ 
+  max_sessions (integer)
+  
+   max_sessions configuration parameter
+  
+  
+  
+
+  The maximum number of client sessions that can be handled by
+  one connection proxy when session pooling is switched on.
+  This parameter does not add any memory or CPU overhead, so
+  specifying a large max_sessions value
+  does not affect performance.
+  If the max_sessions limit is reached new connections are not accepted.
+
+
+  The default value is 1000. This parameter can only be set at server start.
+   
+  
+ 
+
+ 
+  session_pool_size (integer)
+  
+   session_pool_size configuration parameter
+  
+  
+  
+
+  Enables session pooling and defines the maximum number of
+  backends that can be used by client sessions for each database/user combination.
+  Launched non-tainted backends are never terminated even if there are no active sessions.
+  Backend is considered as tainted if client updates GUCs, creates temporary table or prepared statements.
+  Tainted backend can serve only one client.
+
+
+  The default value is 10, so up to 10 backends will serve each database.
+
+  
+ 
+
+ 
+  proxy_port (integer)
+  
+   proxy_port configuration parameter
+  
+  
+  
+
+  Sets the TCP port for the connection pooler.
+  Clients connected to main "port" will be assigned dedicated backends,
+  while client connected to proxy port will be connected to backends through proxy which
+  performs transaction level scheduling. 
+   
+
+  The default value is 6543.
+
+  
+ 
+
+ 
+  connection_proxies (integer)
+  
+   connection_proxies configuration parameter
+  
+  
+  
+
+  Sets number of connection proxies.
+  Postmaster spawns separate worker process for each proxy. Postmaster scatters connections between proxies using one of scheduling policies (round-robin, random, load-balancing).
+  Each proxy launches its own subset of backends.
+  So maximal number of non-tainted backends is  session_pool_size*connection_proxies*databases*roles.
+   
+
+  The default value is 0, so session pooling is disabled.
+
+  
+ 
+
+ 
+  session_schedule (enum)
+  
+   session_schedule configuration parameter
+  
+  
+  
+
+  Specifies scheduling policy for assigning session to proxies in case of

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-07-19 Thread James Coleman
On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
 wrote:
> Now, consider this example:
>
>   create table t (a int, b int, c int);
>   insert into t select mod(i,100),mod(i,100),i from 
> generate_series(1,1000) s(i);
>   create index on t (a);
>   analyze t;
>   explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;
>
> With 0001+0002+0003 pathes, I get a plan like this:
>
>  QUERY PLAN
> 
>  Limit  (cost=10375.39..10594.72 rows=1 width=16)
>->  Incremental Sort  (cost=10375.39..2203675.71 rows=1 width=16)
>  Sort Key: a, b, (sum(c))
>  Presorted Key: a, b
>  ->  GroupAggregate  (cost=10156.07..2203225.71 rows=1 width=16)
>Group Key: a, b
>->  Gather Merge  (cost=10156.07..2128124.39 rows=1175 
> width=12)
>  Workers Planned: 2
>  ->  Incremental Sort  (cost=9156.04..972856.05 
> rows=4166740 width=12)
>Sort Key: a, b
>Presorted Key: a
>->  Parallel Index Scan using t_a_idx on t  
> (cost=0.43..417690.30 rows=4166740 width=12)
> (12 rows)
>
>
> and with 0004, I get this:
>
>   QUERY PLAN
> --
>  Limit  (cost=20443.84..20665.32 rows=1 width=16)
>->  Incremental Sort  (cost=20443.84..2235250.05 rows=1 width=16)
>  Sort Key: a, b, (sum(c))
>  Presorted Key: a, b
>  ->  GroupAggregate  (cost=20222.37..2234800.05 rows=1 width=16)
>Group Key: a, b
>->  Incremental Sort  (cost=20222.37..2159698.74 rows=1175 
> width=12)
>  Sort Key: a, b
>  Presorted Key: a
>  ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 
> rows=1175 width=12)
> (10 rows)
>
> Notice that cost of the second plan is almost double the first one. That
> means 0004 does not even generate the first plan, i.e. there are cases
> where we don't try to add the explicit sort before passing the path to
> generate_gather_paths().
>
> And I think I know why is that - while gather_grouping_paths() tries to
> add explicit sort below the gather merge, there are other places that
> call generate_gather_paths() that don't do that. In this case it's
> probably apply_scanjoin_target_to_paths() which simply builds
>
>parallel (seq|index) scan + gather merge
>
> and that's it. The problem is likely the same - the code does not know
> which pathkeys are "interesting" at that point. We probably need to
> teach planner to do this.

I've been working on figuring out sample queries for each of the
places we're looking at adding create_increment_sort() (starting with
the cases enabled by gather-merge nodes). The
generate_useful_gather_paths() call in
apply_scanjoin_target_to_paths() is required to generate the above
preferred plan.

But I found that if I set enable_sort=off (with or without the
_useful_ variant of generate_gather_paths()) I get a very similar plan
that's actually lower cost yet:

QUERY PLAN
--
 Limit  (cost=10255.98..10355.77 rows=1 width=16)
   ->  Incremental Sort  (cost=10255.98..1008228.67 rows=1 width=16)
 Sort Key: a, b, (sum(c))
 Presorted Key: a, b
 ->  Finalize GroupAggregate  (cost=10156.20..1007778.67
rows=1 width=16)
   Group Key: a, b
   ->  Gather Merge  (cost=10156.20..1007528.67 rows=2 width=16)
 Workers Planned: 2
 ->  Partial GroupAggregate
(cost=9156.18..1004220.15 rows=1 width=16)
   Group Key: a, b
   ->  Incremental Sort
(cost=9156.18..972869.60 rows=4166740 width=12)
 Sort Key: a, b
 Presorted Key: a
 ->  Parallel Index Scan using t_a_idx
on t  (cost=0.43..417703.85 rows=4166740 width=12)

Is that something we should consider a bug at this stage? It's also
not clear to me (costing aside) which plan is intuitively
preferable.

James




Re: minimizing pg_stat_statements performance overhead

2019-07-19 Thread Jeff Davis
On Wed, 2019-04-03 at 23:20 +, Raymond Martin wrote:
> Hi Christoph, 
> 
> > you sent the patch as UTF-16, could you re-send it as plain ascii?
> 
> Apologies. I re-attached the plain ascii version. 

Committed. Thanks!

Regards,
Jeff Davis






Re: sepgsql seems rather thoroughly broken on Fedora 30

2019-07-19 Thread Mike Palmiotto
On Fri, Jul 19, 2019 at 4:29 PM Tom Lane  wrote:
>
> Mike Palmiotto  writes:
> > We probably need to polish this a bit more, but what do you think
> > about something similar to the attached patches? They should hopefully
> > reduce some of the complexity of running these regression tests.
>
> I can confirm that the 0001 patch fixes things on my Fedora 30 box.
> So that's good, though I don't know enough to evaluate it for style
> or anything like that.

I think the policy is in need of review/rewriting anyway. The proper
thing to do would be to create a common template for all of the
SELinux regtest user domains and create more of a hierarchical policy
to reduce redundancy. If you want to wait for more formal policy
updates, I can do that in my spare time. Otherwise, the patch I posted
should work with the general style of this policy module.

>
> I don't think I like the 0002 patch very much, because of its putting
> all the sudo actions into the script.  I'd rather not give a script
> root permissions, thanks.  Maybe I'm in the minority on that.

Definitely not. I cringed a little bit as I was making those
additions, but figured it was fine since it's just a test script (and
we have to run `sudo` for various other installation items as well).

> Also, since the documentation explicitly says that the
> /usr/share/selinux/devel/Makefile path is not to be relied on,
> why would we hard-wire it into the script?
>
> A bigger-picture issue is that right now, configuring a cluster for
> sepgsql is a very manual process (cf. section F.35.2).  I think there's
> some advantage in forcing the user to run through that before running
> the regression test, namely that they'll get the bugs out of any
> misunderstandings or needed local changes.  If we had that a bit more
> automated then maybe having the test script do-it-for-you would be
> sensible.  (IOW, the fact that the test process is more like "make
> installcheck" than "make check" seems like a feature not a bug.)

Makes sense to me. Thanks for the feedback!

-- 
Mike Palmiotto
Software Engineer
Crunchy Data Solutions
https://crunchydata.com




Re: sepgsql seems rather thoroughly broken on Fedora 30

2019-07-19 Thread Tom Lane
Mike Palmiotto  writes:
> We probably need to polish this a bit more, but what do you think
> about something similar to the attached patches? They should hopefully
> reduce some of the complexity of running these regression tests.

I can confirm that the 0001 patch fixes things on my Fedora 30 box.
So that's good, though I don't know enough to evaluate it for style
or anything like that.

I don't think I like the 0002 patch very much, because of its putting
all the sudo actions into the script.  I'd rather not give a script
root permissions, thanks.  Maybe I'm in the minority on that.
Also, since the documentation explicitly says that the 
/usr/share/selinux/devel/Makefile path is not to be relied on,
why would we hard-wire it into the script?

A bigger-picture issue is that right now, configuring a cluster for
sepgsql is a very manual process (cf. section F.35.2).  I think there's
some advantage in forcing the user to run through that before running
the regression test, namely that they'll get the bugs out of any
misunderstandings or needed local changes.  If we had that a bit more
automated then maybe having the test script do-it-for-you would be
sensible.  (IOW, the fact that the test process is more like "make
installcheck" than "make check" seems like a feature not a bug.)

regards, tom lane




Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Andres Freund
On 2019-07-19 15:57:45 -0400, Robert Haas wrote:
> On Fri, Jul 19, 2019 at 3:12 PM Andres Freund  wrote:
> > Isn't that pretty inherently required? How are we otherwise ever going to
> > be able to roll back a transaction that holds an AEL on a relation it
> > also modifies?  I might be standing on my own head here, though.
> 
> I think you are.  If a transaction holds an AEL on a relation it also
> modifies, we still only need something like RowExclusiveLock to roll
> it back.  If we retain the transaction's locks until undo is complete,
> we will not deadlock, but we'll also hold AccessExclusiveLock for a
> long time.  If we release the transaction's locks, we can perform the
> undo in the background with only RowExclusiveLock, which is full of
> win.  Even if you insist that the undo task should acquire the same lock
> the relation has, which seems entirely excessive to me, that hardly
> prevents undo from being applied.  Once the original transaction has
> released its locks, those locks are released, and the undo system can
> acquire those locks the next time the relation isn't busy (or when it
> gets to the head of the lock queue).

Good morning, Mr Freund. Not sure what you were thinking there.




Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Robert Haas
On Fri, Jul 19, 2019 at 3:12 PM Andres Freund  wrote:
> On 2019-07-19 14:50:22 -0400, Robert Haas wrote:
> > On Fri, Jul 19, 2019 at 2:04 PM Andres Freund  wrote:
> > > It doesn't seem that hard - and kind of required for robustness
> > > independent of the decision around "completeness" - to find a way to use
> > > the locks already held by the prepared transaction.
> >
> > I'm not wild about finding more subtasks to put on the must-do list,
> > but I agree it's doable.
>
> Isn't that pretty inherently required? How are we otherwise ever going to
> be able to roll back a transaction that holds an AEL on a relation it
> also modifies?  I might be standing on my own head here, though.

I think you are.  If a transaction holds an AEL on a relation it also
modifies, we still only need something like RowExclusiveLock to roll
it back.  If we retain the transaction's locks until undo is complete,
we will not deadlock, but we'll also hold AccessExclusiveLock for a
long time.  If we release the transaction's locks, we can perform the
undo in the background with only RowExclusiveLock, which is full of
win.  Even if you insist that the undo task should acquire the same lock
the relation has, which seems entirely excessive to me, that hardly
prevents undo from being applied.  Once the original transaction has
released its locks, those locks are released, and the undo system can
acquire those locks the next time the relation isn't busy (or when it
gets to the head of the lock queue).
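To make that concrete, all the background worker needs is something
like the following sketch (the relation OID variable is illustrative):

/*
 * Illustrative only: applying undo in the background needs no more than
 * RowExclusiveLock, and the worker can simply wait its turn in the lock
 * queue like any other backend.
 */
Relation    rel = table_open(undo_record_reloid, RowExclusiveLock);

/* ... apply the remaining undo records for this relation ... */

table_close(rel, RowExclusiveLock);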

As far as I can see, the only reason why you would care about this is
to make the back-pressure system effective against prepared
transactions.  Different people may want that more or less, but I have
a little trouble with the idea that it is a hard requirement.

> Well, then perhaps that admin ought not to constantly terminate
> connections...  I was thinking that new connections wouldn't be forced
> to do that if there were still a lot of headroom regarding
> #transactions-to-be-rolled-back.  And if undo workers kept up, you'd
> also not hit this.

Sure, but cascading failure scenarios suck.

> > Hmm, that's not a bad idea. So the transactions would have to "count"
> > the moment they insert their first undo record, which is exactly the
> > right thing anyway.
> >
> > Hmm, but what about transactions that are only touching unlogged tables?
>
> Wouldn't we throw all that UNDO away in a crash restart? There's no
> underlying table data anymore, after all.
>
> And for proper shutdown checkpoints they could just be included.

On thirty seconds thought, that sounds like it would work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sepgsql seems rather thoroughly broken on Fedora 30

2019-07-19 Thread Mike Palmiotto
On Fri, Jul 19, 2019 at 11:19 AM Tom Lane  wrote:
>
> I got around to trying this, and lookee here:
>
> $ sudo sesearch -A -s sepgsql_regtest_user_t -t passwd_file_t
> allow domain file_type:blk_file map; [ domain_can_mmap_files ]:True
> allow domain file_type:chr_file map; [ domain_can_mmap_files ]:True
> allow domain file_type:file map; [ domain_can_mmap_files ]:True
> allow domain file_type:lnk_file map; [ domain_can_mmap_files ]:True
>
> Nothing about passwd_file_t.  So *something* is different about the
> way the policy is being expanded.

Okay, I was finally able to replicate the issue (and fix it). It looks
like perhaps the userdom_base_user_template changed and no longer
allows reading of passwd_file_t? At any rate, I added some policy to
ensure that we have the proper permissions.

I also beefed up the test script a bit so it now:
- installs the SELinux policy module
- spins up a temporary cluster to muddy postgresql.conf and run the
setup sql in an isolated environment

We probably need to polish this a bit more, but what do you think
about something similar to the attached patches? They should hopefully
reduce some of the complexity of running these regression tests.







--
Mike Palmiotto
Software Engineer
Crunchy Data Solutions
https://crunchydata.com
From 7f22a9d40ab5ad5334351932bd9010f538ba222a Mon Sep 17 00:00:00 2001
From: Mike Palmiotto 
Date: Fri, 19 Jul 2019 14:15:23 -0400
Subject: [PATCH 1/2] Make sepgsql-regtest policy module less error-prone

---
 contrib/sepgsql/sepgsql-regtest.te | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/contrib/sepgsql/sepgsql-regtest.te b/contrib/sepgsql/sepgsql-regtest.te
index e5d65243e6..5d9af1a0dd 100644
--- a/contrib/sepgsql/sepgsql-regtest.te
+++ b/contrib/sepgsql/sepgsql-regtest.te
@@ -31,6 +31,9 @@ userdom_base_user_template(sepgsql_regtest_superuser)
 userdom_manage_home_role(sepgsql_regtest_superuser_r, sepgsql_regtest_superuser_t)
 userdom_exec_user_home_content_files(sepgsql_regtest_superuser_t)
 userdom_write_user_tmp_sockets(sepgsql_regtest_superuser_t)
+
+auth_read_passwd(sepgsql_regtest_superuser_t)
+
 optional_policy(`
 	postgresql_stream_connect(sepgsql_regtest_superuser_t)
 	postgresql_unconfined(sepgsql_regtest_superuser_t)
@@ -60,6 +63,9 @@ userdom_base_user_template(sepgsql_regtest_dba)
 userdom_manage_home_role(sepgsql_regtest_dba_r, sepgsql_regtest_dba_t)
 userdom_exec_user_home_content_files(sepgsql_regtest_dba_t)
 userdom_write_user_tmp_sockets(sepgsql_regtest_user_t)
+
+auth_read_passwd(sepgsql_regtest_dba_t)
+
 optional_policy(`
 	postgresql_admin(sepgsql_regtest_dba_t, sepgsql_regtest_dba_r)
 	postgresql_stream_connect(sepgsql_regtest_dba_t)
@@ -98,6 +104,9 @@ userdom_base_user_template(sepgsql_regtest_user)
 userdom_manage_home_role(sepgsql_regtest_user_r, sepgsql_regtest_user_t)
 userdom_exec_user_home_content_files(sepgsql_regtest_user_t)
 userdom_write_user_tmp_sockets(sepgsql_regtest_user_t)
+
+auth_read_passwd(sepgsql_regtest_user_t)
+
 optional_policy(`
 	postgresql_role(sepgsql_regtest_user_r, sepgsql_regtest_user_t)
 	postgresql_stream_connect(sepgsql_regtest_user_t)
@@ -126,6 +135,8 @@ userdom_manage_home_role(sepgsql_regtest_pool_r, sepgsql_regtest_pool_t)
 userdom_exec_user_home_content_files(sepgsql_regtest_pool_t)
 userdom_write_user_tmp_sockets(sepgsql_regtest_pool_t)
 
+auth_read_passwd(sepgsql_regtest_pool_t)
+
 type sepgsql_regtest_foo_t;
 type sepgsql_regtest_var_t;
 type sepgsql_regtest_foo_table_t;
-- 
2.21.0

From 53fb4c662347311ef6e21ca4c82b64c358f2096b Mon Sep 17 00:00:00 2001
From: Mike Palmiotto 
Date: Fri, 19 Jul 2019 14:29:22 -0400
Subject: [PATCH 2/2] Add sandboxed cluster for sepgsql regression tests

---
 contrib/sepgsql/test_sepgsql | 47 
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/contrib/sepgsql/test_sepgsql b/contrib/sepgsql/test_sepgsql
index 7530363d2c..d14a1c28bc 100755
--- a/contrib/sepgsql/test_sepgsql
+++ b/contrib/sepgsql/test_sepgsql
@@ -15,7 +15,43 @@
 PG_BINDIR=`pg_config --bindir`
 
 # we must move to contrib/sepgsql directory to run pg_regress correctly
-cd `dirname $0`
+cd `dirname $0` || exit 1
+
+# Shut down existing test cluster and delete tmp directory if we have it
+if [ -d tmp/ ]; then
+# Make sure we don't have a lingering regression test cluster installed
+$PG_BINDIR/pg_ctl -D tmp -o "-p 15432" stop
+
+sudo rm -rf tmp/
+fi
+
+# Initialize our test environment
+if ! $PG_BINDIR/pg_ctl initdb -D tmp; then
+echo "test cluster initdb error"
+exit 1
+fi
+
+echo "shared_preload_libraries = 'sepgsql'" >> tmp/postgresql.conf
+
+for DBNAME in template0 template1 postgres;
+do
+$PG_BINDIR/postgres --single -F -c exit_on_error=true -p 15432 -D tmp/ $DBNAME \
+< sepgsql.sql > /dev/null
+done
+
+# Reload the policy module
+if ! sudo make -f /usr/share/selinux/devel/Makefile reload; then
+echo "policy reload error"
+echo ""
+echo "Unable to build sepgsql-regtest policy 

Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Robert Haas
On Fri, Jul 19, 2019 at 2:57 PM Peter Geoghegan  wrote:
> On Fri, Jul 19, 2019 at 10:28 AM Robert Haas  wrote:
> > In scenario #2, the undo work is going to have to be retried in the
> > background, and perforce that means reacquiring locks that have been
> > released, and so there is a chance of long lock waits and/or deadlock
> > that cannot really be avoided.
>
> I haven't studied the UNDO or zheap stuff in any detail, but I am
> concerned about rollbacks that deadlock. I'd feel a lot better about
> it if forward progress was guaranteed, somehow. That seems to imply
> that locks are retained, which is probably massively inconvenient to
> ensure. Not least because it probably requires cooperation from
> underlying access methods.

Right, that's definitely a big part of the concern here, but I don't
really believe that retaining locks is absolutely required, or even
necessarily desirable.  For instance, suppose that I create a table,
bulk-load a whole lotta data into it, and then abort.  Further suppose
that by the time we start trying to process the undo in the
background, we can't get the lock. Well, that probably means somebody
is performing DDL on the table.  If they just did LOCK TABLE or ALTER
TABLE SET STATISTICS, we are going to need to execute that same undo
once the DDL is complete. However, if the DDL is DROP TABLE, we're
going to find that once we can get the lock, the undo is obsolete, and
we don't need to worry about it any more. Had we made it 100% certain
that the DROP TABLE couldn't go through until the undo was performed,
we could avoid having to worry about the undo having become obsolete
... but that's hardly a win.  We're better off allowing the drop and
then just chucking the undo.

Likely, something like CLUSTER or VACUUM FULL would take care of
removing any rows created by aborted transactions along the way, so
the undo could be thrown away afterwards without processing it.

Point being - there's at least some chance that the operations which
block forward progress also represent progress of another sort.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-19 Thread Peter Geoghegan
On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova
 wrote:
> Patch 0002 (must be applied on top of 0001) implements preserving of
> correct TID order
> inside posting list when inserting new tuples.
> This version passes all regression tests including amcheck test.
> I also used following script to test insertion into the posting list:

Nice!

> I suppose it is not the final version of the patch yet,
> so I left some debug messages and TODO comments to ease review.

I'm fine with leaving them in. I have sometimes distributed a separate
patch with debug messages, but now that I think about it, that
probably wasn't a good use of time.

You will probably want to remove at least some of the debug messages
during performance testing. I'm thinking of code that appears in very
tight inner loops, such as the _bt_compare() code.
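One cheap option is to keep such messages but hide them behind a
compile-time symbol, so they cost nothing in hot paths (the macro and
variable names here are just examples):

/*
 * Illustrative only: tracing in a hot path such as _bt_compare() can be
 * compiled out by default and enabled only in debugging builds.
 */
#ifdef BTREE_DEDUP_DEBUG
    elog(DEBUG4, "posting list insert at heap TID (%u,%u)",
         ItemPointerGetBlockNumber(&htid),
         ItemPointerGetOffsetNumber(&htid));
#endif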

> Please, in your review, pay particular attention to usage of
> BTreeTupleGetHeapTID.
> For posting tuples it returns the first tid from posting list like
> BTreeTupleGetMinTID,
> but maybe some callers are not ready for that and want
> BTreeTupleGetMaxTID instead.
> Incorrect usage of these macros may cause some subtle bugs,
> which are probably not covered by tests. So, please double-check it.

One testing strategy that I plan to use for the patch is to
deliberately corrupt a compressed index in a subtle way using
pg_hexedit, and then see if amcheck detects the problem. For example,
I may swap the order of two TIDs in the middle of a posting list,
which is something that is unlikely to produce wrong answers to
queries, and won't even be detected by the "heapallindexed" check, but
is still wrong. If we can detect very subtle, adversarial corruption
like this, then we can detect any real-world problem.

Once we have confidence in amcheck's ability to detect problems with
posting lists in general, we can use it in many different contexts
without much thought. For example, we'll probably need to do long
running benchmarks to validate the performance of the patch. It's easy
to add amcheck testing at the end of each run. Every benchmark is now
also a correctness/stress test, for free.

> Next week I'm going to check performance and try to find specific
> scenarios where this
> feature can lead to degradation and measure it, to understand if we need
> to make this deduplication optional.

Sounds good, though I think it might be a bit too early to decide
whether or not it needs to be enabled by default. For one thing, the
approach to WAL-logging within _bt_compress_one_page() is probably
fairly inefficient, which may be a problem for certain workloads. It's
okay to leave it that way for now, because it is not relevant to the
core design of the patch. I'm sure that _bt_compress_one_page() can be
carefully optimized when the time comes.

My current focus is not on the raw performance itself. For now, I am
focussed on making sure that the compression works well, and that the
resulting indexes "look nice" in general. FWIW, the first few versions
of my v12 work on nbtree didn't actually make *anything* go faster. It
took a couple of months to fix the more important regressions, and a
few more months to fix all of them. I think that the work on this
patch may develop in a similar way. I am willing to accept regressions
in the unoptimized code during development because it seems likely
that you have the right idea about the data structure itself, which is
the one thing that I *really* care about. Once you get that right, the
remaining problems are very likely to either be fixable with further
work on optimizing specific code, or a price that users will mostly be
happy to pay to get the benefits.

--
Peter Geoghegan




Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Andres Freund
Hi,

On 2019-07-19 14:50:22 -0400, Robert Haas wrote:
> On Fri, Jul 19, 2019 at 2:04 PM Andres Freund  wrote:
> > It doesn't seem that hard - and kind of required for robustness
> > independent of the decision around "completeness" - to find a way to use
> > the locks already held by the prepared transaction.
> 
> I'm not wild about finding more subtasks to put on the must-do list,
> but I agree it's doable.

Isn't that pretty inherently required? How are we otherwise ever going to
be able to roll back a transaction that holds an AEL on a relation it
also modifies?  I might be standing on my own head here, though.


> > You could force new connections to complete the rollback processing of
> > the terminated connection, if there's too much pending UNDO. That'd be a
> > way of providing back-pressure against such crazy scenarios.  Seems
> > again that it'd be good to have that pressure, independent of the
> > decision on completeness.
> 
> That would definitely provide a whole lot of back-pressure, but it
> would also make the system unusable if the undo handler finds a way to
> FATAL, or just hangs for some stupid reason (stuck I/O?). It would be
> a shame if the administrative action needed to fix the problem were
> prevented by the back-pressure mechanism.

Well, then perhaps that admin ought not to constantly terminate
connections...  I was thinking that new connections wouldn't be forced
to do that if there were still a lot of headroom regarding
#transactions-to-be-rolled-back.  And if undo workers kept up, you'd
also not hit this.


> > Couldn't we record the outstanding transactions in the checkpoint, and
> > then recompute the changes to that record during WAL replay?
> 
> Hmm, that's not a bad idea. So the transactions would have to "count"
> the moment they insert their first undo record, which is exactly the
> right thing anyway.
> 
> Hmm, but what about transactions that are only touching unlogged tables?

Wouldn't we throw all that UNDO away in a crash restart? There's no
underlying table data anymore, after all.

And for proper shutdown checkpoints they could just be included.

Greetings,

Andres Freund




Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Peter Geoghegan
On Fri, Jul 19, 2019 at 10:28 AM Robert Haas  wrote:
> In scenario #2, the undo work is going to have to be retried in the
> background, and perforce that means reacquiring locks that have been
> released, and so there is a chance of long lock waits and/or deadlock
> that cannot really be avoided.

I haven't studied the UNDO or zheap stuff in any detail, but I am
concerned about rollbacks that deadlock. I'd feel a lot better about
it if forward progress was guaranteed, somehow. That seems to imply
that locks are retained, which is probably massively inconvenient to
ensure. Not least because it probably requires cooperation from
underlying access methods.

--
Peter Geoghegan




Re: pgsql: Sync our copy of the timezone library with IANA release tzcode20

2019-07-19 Thread Tom Lane
I wrote:
> Michael Paquier  writes:
>> This is causing a compilation warning on Windows:
> ...so I think your compiler has a point.  I shall complain to upstream.

The IANA folk want to fix it like this:

diff --git a/zic.c b/zic.c
index 8bf5628..a84703a 100644
--- a/zic.c
+++ b/zic.c
@@ -2145,7 +2145,7 @@ writezone(const char *const name, const char *const string, char version,
}
if (pass == 1 && !want_bloat()) {
  utcnt = stdcnt = thisleapcnt = 0;
- thistimecnt = - locut - hicut;
+ thistimecnt = - (locut + hicut);
  thistypecnt = thischarcnt = 1;
  thistimelim = thistimei;
}

I'm not quite convinced whether that will silence the warning, but
at least it's a bit less unreadable.

regards, tom lane




Re: SQL/JSON path issues/questions

2019-07-19 Thread Steven Pousty
I would like to help review this documentation. Can you please point me in
the right direction?
Thanks
Steve

On Fri, Jul 19, 2019 at 2:02 AM Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Jul 18, 2019 at 5:08 PM Thom Brown  wrote:
> > On Tue, 16 Jul 2019 at 19:44, Alexander Korotkov
> >  wrote:
> > >
> > > On Tue, Jul 16, 2019 at 9:22 PM Thom Brown  wrote:
> > > > Now I'm looking at the @? and @@ operators, and getting a bit
> > > > confused.  This following query returns true, but I can't determine
> > > > why:
> > > >
> > > > # SELECT '{"a":[1,2,3,4,5]}'::jsonb @? '$.b == "hello"'::jsonpath;
> > > >  ?column?
> > > > --
> > > >  t
> > > > (1 row)
> > > >
> > > > "b" is not a valid item, so there should be no match.  Perhaps it's
> my
> > > > misunderstanding of how these operators are supposed to work, but the
> > > > documentation is quite terse on the behaviour.
> > >
> > > So, the result of jsonpath evaluation is single value "false".
> > >
> > > # SELECT jsonb_path_query_array('{"a":[1,2,3,4,5]}'::jsonb, '$.b ==
> "hello"');
> > >  jsonb_path_query_array
> > > 
> > >  [false]
> > > (1 row)
> > >
> > > @@ operator checks that result is "true".  This is why it returns
> "false".
> > >
> > > @? operator checks if result is not empty.  So, it's single "false"
> > > value, not empty list.  This is why it returns "true".
> > >
> > > Perhaps, we need to clarify this in docs providing more explanation.
> >
> > Understood.  Thanks.
> >
> > Also, is there a reason why jsonb_path_query doesn't have an operator
> analog?
>
> The point of existing operator analogues is index support.  We
> introduced operators for searches we can accelerate using GIN indexes.
>
> jsonb_path_query() doesn't return bool.  So, even if we have an
> operator for that, it wouldn't get index support.
>
> However, we can discuss introduction of operator analogues for other
> functions as syntax sugar.
>
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>
>
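
(To spell the difference out with a couple of queries -- this is just my
reading of the explanation above, using the same example data:

# SELECT '{"a":[1,2,3,4,5]}'::jsonb @@ '$.b == "hello"';  -- false: the single result item is "false"
# SELECT '{"a":[1,2,3,4,5]}'::jsonb @? '$.b == "hello"';  -- true: the result is non-empty
# SELECT '{"a":[1,2,3,4,5]}'::jsonb @? '$.b';             -- false: no such key, so the result is empty

Something along those lines might be a useful starting point for the docs.)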


Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Robert Haas
On Fri, Jul 19, 2019 at 2:04 PM Andres Freund  wrote:
> It doesn't seem that hard - and kind of required for robustness
> independent of the decision around "completeness" - to find a way to use
> the locks already held by the prepared transaction.

I'm not wild about finding more subtasks to put on the must-do list,
but I agree it's doable.

> I'm not following, unfortunately.
>
> I don't understand what exactly the scenario is you refer to. You say
> "session N+1 can just stay in the same transaction", but then you also
> reference something taking "probably a five digit number of
> transactions". Are those transactions the prepared ones?

So you begin a bunch of transactions.  All but one of them begin a
transaction, insert data into a table, and then prepare.  The last one
begins a transaction and locks the table.  Now you roll back all the
prepared transactions.  Those sessions now begin new transactions,
insert data into a second table, and prepare the second set of
transactions.  The last session, which still has the first table
locked, now locks the second table in addition.  Now you again roll
back all the prepared transactions.  At this point you have 2 *
max_prepared_transactions that are waiting for undo, all blocked on
that last session that holds locks on both tables.  So now you go have
all of those sessions begin a third transaction, and they all insert
into a third table, and prepare.  The last session now attempts AEL on
that third table, and once it's waiting, you roll back all the
prepared transactions, after which that last session successfully
picks up its third table lock.

You can keep repeating this, locking a new table each time, until you
run out of lock table space, by which time you will have roughly
max_prepared_transactions * size_of_lock_table transactions waiting
for undo processing.
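
(In rough SQL terms, one round of that cycle looks something like the
following -- table names and GIDs are made up, and each session would
use its own GID:

-- in each of the max_prepared_transactions sessions:
BEGIN;
INSERT INTO t1 SELECT generate_series(1, 1000000);
PREPARE TRANSACTION 'p1';

-- in the extra session, which never commits; this waits until the
-- prepared transactions are gone:
BEGIN;
LOCK TABLE t1 IN ACCESS EXCLUSIVE MODE;

-- from any session, for each GID:
ROLLBACK PREPARED 'p1';
-- the undo for t1 now has to run in the background, but it can't get
-- the lock, because the extra session has just been granted it

The next round does the same thing with t2, and so on, while the extra
session keeps holding the locks on all the earlier tables.)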

> You could force new connections to complete the rollback processing of
> the terminated connection, if there's too much pending UNDO. That'd be a
> way of providing back-pressure against such crazy scenarios.  Seems
> again that it'd be good to have that pressure, independent of the
> decision on completeness.

That would definitely provide a whole lot of back-pressure, but it
would also make the system unusable if the undo handler finds a way to
FATAL, or just hangs for some stupid reason (stuck I/O?). It would be
a shame if the administrative action needed to fix the problem were
prevented by the back-pressure mechanism.

One thing I've thought about, which I think would be helpful for a
variety of scenarios, is to have a facility that forces a computed
delay at the start of each write transaction (when it first writes WAL, or when
an XID is assigned), or we could adapt that to this case and say the
beginning of each undo-using transaction. So for example if you are
about to run out of space in pg_wal, you can slow things down to let
the checkpoint complete, or if you are about to run out of XIDs, you
can slow things down to let autovacuum complete, or if you are about
to run out of undo slots, you can slow things down to let some undo to
complete.  The trick is to make sure that you only wait when it's
likely to do some good; if you wait because you're running out of XIDs
and the reason you're running out of XIDs is because somebody left a
replication slot or a prepared transaction around, the back-pressure
is useless.

> Couldn't we record the outstanding transactions in the checkpoint, and
> then recompute the changes to that record during WAL replay?

Hmm, that's not a bad idea. So the transactions would have to "count"
the moment they insert their first undo record, which is exactly the
right thing anyway.

Hmm, but what about transactions that are only touching unlogged tables?

> Yea, I think that's what it boils down to... Would be good to have a few
> more opinions on this.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [sqlsmith] Crash in mcv_get_match_bitmap

2019-07-19 Thread Tom Lane
Tomas Vondra  writes:
> [ mcv fixes ]

These patches look OK to me.

regards, tom lane




Re: Tid scan improvements

2019-07-19 Thread Andres Freund
Hi,

On 2019-07-19 13:54:59 +1200, David Rowley wrote:
> On Thu, 18 Jul 2019 at 14:30, Andres Freund  wrote:
> > I think the AM part of this patch might be the wrong approach - it won't
> > do anything meaningful for an AM that doesn't directly map the ctid to a
> > specific location in a block (e.g. zedstore).  To me it seems the
> > callback ought to be to get a range of tids, and the tidrange scan
> > shouldn't do anything but determine the range of tids the AM should
> > return.
> 
> Sounds like that's going to require adding some new fields to
> HeapScanDescData, then some callback similar to heap_setscanlimits to
> set those fields.
> 
> Then, we'd either need to:
> 
> 1. Make the table_scan_getnextslot() implementations check the tuple
> falls within the range, or
> 2. add another callback that pays attention to the set TID range.

> The problem with #1 is that would add overhead to normal seqscans,
> which seems like a bad idea.
> 
> Did you imagined two additional callbacks, 1 to set the TID range,
> then one to scan it?  Duplicating the logic in heapgettup_pagemode()
> and heapgettup() looks pretty horrible, but I guess we could add a
> wrapper around it that loops until it gets the first tuple and bails
> once it scans beyond the final tuple.
> 
> Is that what you had in mind?

Yea, I was thinking of something like 2.  We already have a few extra
types of scan nodes (bitmap heap and sample scan), it'd not be bad to
add another. And as you say, they can just share most of the guts: For
heap I'd just implement pagemode, and perhaps split heapgettup_pagemode
into two parts (one to do the page processing, the other to determine
the relevant page).

You say that we'd need new fields in HeapScanDescData - not so sure
about that, it seems feasible to just provide the boundaries in the
call? But I think it'd also just be fine to have the additional fields.

Greetings,

Andres Freund




Re: Broken defenses against dropping a partitioning column

2019-07-19 Thread Tom Lane
I wrote:
> So I think we're probably stuck with the approach of adding new internal
> dependencies.  If we go that route, then our options for the released
> branches are (1) do nothing, or (2) back-patch the code that adds such
> dependencies, but without a catversion bump.  That would mean that only
> tables created after the next minor releases would have protection against
> this problem.  That's not ideal but maybe it's okay, considering that we
> haven't seen actual field reports of trouble of this kind.

Here's a proposed patch for that.  It's mostly pretty straightforward,
except I had to add some recursion defenses in findDependentObjects that
weren't there before.  But those seem like a good idea anyway to prevent
infinite recursion in case of bogus entries in pg_depend.

I also took the liberty of improving some related error messages that
I thought were unnecessarily vague and not up to project standards.

Per above, I'm envisioning applying this to HEAD and v12 with a catversion
bump, and to v11 and v10 with no catversion bump.

regards, tom lane

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 6315fc4..3356461 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -543,6 +543,7 @@ findDependentObjects(const ObjectAddress *object,
 ObjectIdGetDatum(object->objectId));
 	if (object->objectSubId != 0)
 	{
+		/* Consider only dependencies of this sub-object */
		ScanKeyInit(&key[2],
 	Anum_pg_depend_objsubid,
 	BTEqualStrategyNumber, F_INT4EQ,
@@ -550,7 +551,10 @@ findDependentObjects(const ObjectAddress *object,
 		nkeys = 3;
 	}
 	else
+	{
+		/* Consider dependencies of this object and any sub-objects it has */
 		nkeys = 2;
+	}
 
 	scan = systable_beginscan(*depRel, DependDependerIndexId, true,
 			  NULL, nkeys, key);
@@ -567,6 +571,18 @@ findDependentObjects(const ObjectAddress *object,
 		otherObject.objectId = foundDep->refobjid;
 		otherObject.objectSubId = foundDep->refobjsubid;
 
+		/*
+		 * When scanning dependencies of a whole object, we may find rows
+		 * linking sub-objects of the object to the object itself.  (Normally,
+		 * such a dependency is implicit, but we must make explicit ones in
+		 * some cases involving partitioning.)  We must ignore such rows to
+		 * avoid infinite recursion.
+		 */
+		if (otherObject.classId == object->classId &&
+			otherObject.objectId == object->objectId &&
+			object->objectSubId == 0)
+			continue;
+
 		switch (foundDep->deptype)
 		{
 			case DEPENDENCY_NORMAL:
@@ -855,6 +871,16 @@ findDependentObjects(const ObjectAddress *object,
 		otherObject.objectSubId = foundDep->objsubid;
 
 		/*
+		 * If what we found is a sub-object of the current object, just ignore
+		 * it.  (Normally, such a dependency is implicit, but we must make
+		 * explicit ones in some cases involving partitioning.)
+		 */
+		if (otherObject.classId == object->classId &&
+			otherObject.objectId == object->objectId &&
+			object->objectSubId == 0)
+			continue;
+
+		/*
 		 * Must lock the dependent object before recursing to it.
 		 */
		AcquireDeletionLock(&otherObject, 0);
@@ -1588,8 +1614,10 @@ recordDependencyOnExpr(const ObjectAddress *depender,
  * As above, but only one relation is expected to be referenced (with
  * varno = 1 and varlevelsup = 0).  Pass the relation OID instead of a
  * range table.  An additional frammish is that dependencies on that
- * relation (or its component columns) will be marked with 'self_behavior',
- * whereas 'behavior' is used for everything else.
+ * relation's component columns will be marked with 'self_behavior',
+ * whereas 'behavior' is used for everything else; also, if reverse_self
+ * is true, those dependencies are reversed so that the columns are made
+ * to depend on the table not vice versa.
  *
  * NOTE: the caller should ensure that a whole-table dependency on the
  * specified relation is created separately, if one is needed.  In particular,
@@ -1602,7 +1630,7 @@ recordDependencyOnSingleRelExpr(const ObjectAddress *depender,
 Node *expr, Oid relId,
 DependencyType behavior,
 DependencyType self_behavior,
-bool ignore_self)
+bool reverse_self)
 {
 	find_expr_references_context context;
 	RangeTblEntry rte;
@@ -1626,7 +1654,8 @@ recordDependencyOnSingleRelExpr(const ObjectAddress *depender,
 	eliminate_duplicate_dependencies(context.addrs);
 
 	/* Separate self-dependencies if necessary */
-	if (behavior != self_behavior && context.addrs->numrefs > 0)
+	if ((behavior != self_behavior || reverse_self) &&
+		context.addrs->numrefs > 0)
 	{
 		ObjectAddresses *self_addrs;
 		ObjectAddress *outobj;
@@ -1657,11 +1686,23 @@ recordDependencyOnSingleRelExpr(const ObjectAddress *depender,
 		}
 		context.addrs->numrefs = outrefs;
 
-		/* Record the self-dependencies */
-		if (!ignore_self)
+		/* Record the self-dependencies with the appropriate direction */
+		if (!reverse_self)
 			

Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Andres Freund
Hi,

On 2019-07-19 13:28:14 -0400, Robert Haas wrote:
> I want to consider three specific scenarios that could cause undo
> application to fail, and then offer some observations about them.
> 
> Scenario #1:
> 
> 1. Sessions 1..N each begin a transaction and write a bunch of data to
> a table (at least enough that we'll try to perform undo in the
> background).
> 2. Session N+1 begins a transaction and tries to lock the same table.
> It blocks.
> 3. Sessions 1..N abort, successfully pushing the undo work into the 
> background.
> 4. Session N+1 now acquires the lock and sits on it.
> 5. Optionally, repeat steps 1-4 K times, each time for a different table.
> 
> Scenario #2:
> 
> 1. Any number of sessions begin a transaction, write a bunch of data,
> and then abort.
> 2. They all try to perform undo in the foreground.
> 3. They get killed using pg_terminate_backend().
> 
> Scenario #3:
> 
> 1. A transaction begins, does some work, and then aborts.
> 2. When undo processing occurs, 1% of such transactions fail during
> undo apply because of a bug in the table AM.
> 3. When undo processing retries after a failure, it fails again
> because the bug is triggered by something about the contents of the
> undo record, rather than by, say, concurrency.


> However, if prepared transactions are in use, we could have a variant
> of scenario #1 in which each transaction is first prepared, and then
> the prepared transaction is rolled back.  Unlike the ordinary case,
> this can lead to a nearly-unbounded growth in the number of
> transactions that are pending undo, because we don't have a way to
> transfer the locks held by PGPROC used for the prepare to some running
> session that could perform the undo.

It doesn't seem that hard - and kind of required for robustness
independent of the decision around "completeness" - to find a way to use
the locks already held by the prepared transaction.


> It's not necessary to have a
> large value for max_prepared_transactions; it only has to be greater
> than 0, because we can keep reusing the same slots with different
> tables.  That is, let N = max_prepared_xacts, and let K be anything at
> all; session N+1 can just stay in the same transaction and keep on
> taking new locks one at a time until the lock table fills up; not sure
> exactly how long that will take, but it's probably a five digit number
> of transactions, or maybe six. In this case, we can't force undo into
> the foreground, so we can exceed the number of transactions that are
> supposed to be backgrounded.

I'm not following, unfortunately.

I don't understand what exactly the scenario is you refer to. You say
"session N+1 can just stay in the same transaction", but then you also
reference something taking "probably a five digit number of
transactions". Are those transactions the prepared ones?

Also, if somebody fills up the entire lock table, then the system is
effectively down - independent of UNDO, and no meaningful amount of UNDO
is going to be written. Perhaps we need some better resource control,
but that's really independent of UNDO.

Perhaps you can just explain the scenario in a few more words? My
comments regarding it probably make no sense, given how little I
understand what the scenario is.


> In scenario #2, the undo work is going to have to be retried in the
> background, and perforce that means reacquiring locks that have been
> released, and so there is a chance of long lock waits and/or deadlock
> that cannot really be avoided. I think there is basically no way at
> all to avoid an unbounded accumulation of transactions requiring undo
> in this case, just as in the similar case where the cluster is
> repeatedly shut down or repeatedly crashes. Eventually, if you have a
> hard cap on the number of transactions requiring undo, you're going to
> hit it, and have to start refusing new undo-using transactions. As
> Thomas pointed out, that might still be better than some other systems
> which use undo, where the system doesn't open for any transactions at
> all after a restart until all undo is retired, and/or where undo is
> never processed in the background. But it's a possible concern. On the
> other hand, if you don't have a hard cap, the system may just get
> further and further behind until it eventually melts, and that's also
> a possible concern.

You could force new connections to complete the rollback processing of
the terminated connection, if there's too much pending UNDO. That'd be a
way of providing back-pressure against such crazy scenarios.  Seems
again that it'd be good to have that pressure, independent of the
decision on completeness.



> One other thing that seems worth noting is that we have to consider
> what happens after a restart.  After a crash, and depending on exactly
> how we design it perhaps also after a non-crash restart, we won't
> immediately know how many outstanding transactions need undo; we'll
> have to grovel through the undo logs to find out. If we've got a 

Re: pg_receivewal documentation

2019-07-19 Thread Robert Haas
On Tue, Jul 16, 2019 at 9:38 PM Michael Paquier  wrote:
> I think we should really document the caveat with priority-based sets
> of standbys as much as quorum-based sets.  For example if a user sets
> synchronous_commit = remote_apply in postgresql.conf, and then sets
> s_s_names to '2(pg_receivewal, my_connected_standby)' to get a
> priority-based set, then you have the same problem, and pg_receivewal
> is not the only synchronous standby in this configuration.  The patch
> does not cover that case properly.

I don't agree with this approach. It seems to me that the original was
too precise already, and making it more precise only exacerbates the
situation.  The point is that synchronous_commit = remote_apply is
*categorically* a bad idea for sessions running pg_receivewal.  The
reason why you're adding all this complexity is to try to distinguish
between the case where it's merely a bad idea and the case where it
will also completely fail to work. But why is it important to describe
the scenarios under which it will altogether fail to work?

You could just say something like:

Since pg_receivewal does not apply WAL, you should not allow it to
become a synchronous standby when synchronous_commit = remote_apply.
If it does, it will appear to be a standby which never catches up,
which may cause commits to block.  To avoid this, you should either
configure an appropriate value for synchronous_standby_names, or
specify an application_name for pg_receivewal that does not match it,
or change the value of synchronous_commit to something other than
remote_apply.

I think that'd be a lot more useful than enumerating the total-failure
scenarios.
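
(For the first of those options, a minimal sketch -- assuming the real
standby connects with application_name = 'standby1' and pg_receivewal
keeps its default application_name of "pg_receivewal":

ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
SELECT pg_reload_conf();

With that, pg_receivewal is never eligible to become the synchronous
standby, whatever synchronous_commit is set to.)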

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-19 Thread Anastasia Lubennikova

17.07.2019 19:36, Anastasia Lubennikova:


There is one major issue left - preserving TID order in posting lists.
For a start, I added a sort into BTreeFormPostingTuple function.
It turned out to be not very helpful, because we cannot check this 
invariant lazily.


Now I work on patching _bt_binsrch_insert() and _bt_insertonpg() to 
implement
insertion into the middle of the posting list. I'll send a new version 
this week.


Patch 0002 (must be applied on top of 0001) implements preserving of 
correct TID order

inside posting list when inserting new tuples.
This version passes all regression tests including amcheck test.
I also used following script to test insertion into the posting list:

set client_min_messages to debug4;
drop table tbl;
create table tbl (i1 int, i2 int);
insert into tbl select 1, i from generate_series(0,1000) as i;
insert into tbl select 1, i from generate_series(0,1000) as i;
create index idx on tbl (i1);
delete from tbl where i2 <500;
vacuum tbl ;
insert into tbl select 1, i from generate_series(1001, 1500) as i;

The last insert triggers several insertions that can be seen in debug 
messages.

I suppose it is not the final version of the patch yet,
so I left some debug messages and TODO comments to ease review.

Please, in your review, pay particular attention to usage of 
BTreeTupleGetHeapTID.
For posting tuples it returns the first tid from posting list like 
BTreeTupleGetMinTID,
but maybe some callers are not ready for that and want 
BTreeTupleGetMaxTID instead.

Incorrect usage of these macros may cause some subtle bugs,
which are probably not covered by tests. So, please double-check it.

Next week I'm going to check performance and try to find specific 
scenarios where this
feature can lead to degradation and measure it, to understand if we need 
to make this deduplication optional.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9126c18..2b05b1e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1033,12 +1033,34 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-			  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+IndexTuple	onetup;
+int i;
+
+/* Fingerprint all elements of posting tuple one by one */
+for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+{
+	onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+	norm = bt_normalize_tuple(state, onetup);
+	bloom_add_element(state->filter, (unsigned char *) norm,
+	  IndexTupleSize(norm));
+	/* Be tidy */
+	if (norm != onetup)
+		pfree(norm);
+	pfree(onetup);
+}
+			}
+			else
+			{
+norm = bt_normalize_tuple(state, itup);
+bloom_add_element(state->filter, (unsigned char *) norm,
+  IndexTupleSize(norm));
+/* Be tidy */
+if (norm != itup)
+	pfree(norm);
+			}
 		}
 
 		/*
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..26ddf32 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 		 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
 	}
 	else
 	{
@@ -806,6 +815,11 @@ _bt_findinsertloc(Relation rel,
 			}
 
 			/*
+			 * Before considering moving right, try to compress the page
+			 */
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
+
+			/*
 			 * Nope, so check conditions (b) and (c) enumerated above
 			 *
 			 * The earlier _bt_check_unique() call may well have established a
@@ -2286,3 +2300,241 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, 

should there be a hard-limit on the number of transactions pending undo?

2019-07-19 Thread Robert Haas
On Tue, Jul 16, 2019 at 6:23 PM Andres Freund  wrote:
> Yea, that seems like a question independent of the "completeness"
> requirement. If desirable, it seems trivial to either have
> RollbackHashEntry have per-persistence level status (for one entry per
> xid), or not (for per-persistence entries).

I want to talk more about the "completeness" issue, which is basically
a question of whether we should (a) put a hard limit on the number of
transactions that have unprocessed undo and that are either aborted or
in progress (and thus capable of aborting) as proposed by Andres or
(b) not have such a hard limit, as originally proposed by me.  I think
everyone who is not working on this code has installed an automatic
filter rule to send the original thread to /dev/null, so I'm changing
the subject line in the hopes of getting some of those people to pay
attention.  If it doesn't work, at least the concerns will be
memorialized in case it comes up later.

I originally proposed (b) because if undo application is failing for
some reason, it seemed better not to bring the whole system to a
screeching halt, but rather just cause incremental performance
degradation or something. However, Andres has pointed out this may
postpone remedial action and let bugs go undetected, so it might not
actually be better. Also, some of the things we need to do, like
computing the oldest XID whose undo has not been retired, are
tricky/expensive if you don't have complete data in memory, and you
can't be sure of having complete data in shared memory without a hard
limit.  No matter which way we go, failures to apply undo had better
be really rare, or we're going to be in really serious trouble, so
we're only concerned here with how to handle what is hopefully a very
rare scenario, not the common case.

I want to consider three specific scenarios that could cause undo
application to fail, and then offer some observations about them.

Scenario #1:

1. Sessions 1..N each begin a transaction and write a bunch of data to
a table (at least enough that we'll try to perform undo in the
background).
2. Session N+1 begins a transaction and tries to lock the same table.
It blocks.
3. Sessions 1..N abort, successfully pushing the undo work into the background.
4. Session N+1 now acquires the lock and sits on it.
5. Optionally, repeat steps 1-4 K times, each time for a different table.

Scenario #2:

1. Any number of sessions begin a transaction, write a bunch of data,
and then abort.
2. They all try to perform undo in the foreground.
3. They get killed using pg_terminate_backend().

Scenario #3:

1. A transaction begins, does some work, and then aborts.
2. When undo processing occurs, 1% of such transactions fail during
undo apply because of a bug in the table AM.
3. When undo processing retries after a failure, it fails again
because the bug is triggered by something about the contents of the
undo record, rather than by, say, concurrency.

In scenario one, the problem is mostly self-correcting. When we decide
that we've got too many things queued up for background processing,
and start to force undo processing to happen in the foreground, it
will start succeeding, because the foreground process will have
retained the lock that it took before writing any data and can
therefore undo those writes without having to wait for the lock.
However, this will do nothing to help the requests that are already in
the background, which will just sit there until they can get the lock.
I think there is a good argument that they should not actually wait
for the lock, or only for a certain time period, and then give up and
put the transaction on the error queue for reprocessing at a later
time. Otherwise, we're pinning down undo workers, which could easily
lead to starvation, just as it does for autovacuum. On the whole, this
doesn't sound too bad. We shouldn't be able to fill up the queue with
small transactions, because of the restriction that we only push undo
work into the background when the transaction is big enough, and if we
fill it up with big transactions, then (1) back-pressure will prevent
the problem from crippling the system and (2) eventually the problem
will be self-correcting, because when the transaction in session N+1
ends, the undo will all go through and everything will be fine. The
only real user impact of this scenario is that unrelated work on the
system might notice that large rollbacks are now happening in the
foreground rather than the background, and if that causes a problem,
the DBA can fix it by terminating session N+1. Even if she doesn't,
you shouldn't ever hit the hard cap.

However, if prepared transactions are in use, we could have a variant
of scenario #1 in which each transaction is first prepared, and then
the prepared transaction is rolled back.  Unlike the ordinary case,
this can lead to a nearly-unbounded growth in the number of
transactions that are pending undo, because we don't have a way to
transfer the locks held by 

Re: global / super barriers (for checksums)

2019-07-19 Thread Andres Freund
Hi,

On 2019-07-10 15:31:11 +0200, Magnus Hagander wrote:
> In re-reading this, I notice there are a lot of references to Intterrupt
> (with two t). I'm guessing this is just a spelling error, and not something
> that actually conveys some meaning?

Just a spelling error. I think I wrote the patch in a night after
pgconf.eu, to allow you to quickly make progress :P


> Can you elaborate on what you mean with:
> +   /* XXX: need a more principled approach here */

> Is that the thing you refer to above about "checksum internals"?

I think I didn't actually mean "checksum" but instead "checkpoint".  It
does bother me that we have an operation as long-running as BufferSync()
commonly is, without a proper way to accept events.  There's a hack for
doing something similar-ish in CheckpointWriteDelay(), for absorbing
fsync requests, but it doesn't trigger for checkpoints not done in
checkpointer, nor is it really extensible.


> Also in checking we figured it'd be nice to have a wait event for this,
> since a process can potentially get stuck in an infinite loop waiting for
> some other process if it's misbehaving. Kind of like the attached?

Yea, that makes sense.

Greetings,

Andres Freund




Re: [RFC] Removing "magic" oids

2019-07-19 Thread Andres Freund
Hi,

On 2019-07-07 10:00:35 -0700, Noah Misch wrote:
> On Tue, Nov 20, 2018 at 01:20:04AM -0800, Andres Freund wrote:
> > On 2018-11-14 21:02:41 -0800, Andres Freund wrote:
> > > > The point of the test is to exercise OidGenLock by issuing many parallel
> > > > GetNewOidWithIndex() and verifying absence of duplicates.  There's 
> > > > nothing
> > > > special about OidGenLock, but it is important to use an operation that 
> > > > takes a
> > > > particular LWLock many times, quickly.  If the test query spends too 
> > > > much time
> > > > on things other than taking locks, it will catch locking races too 
> > > > rarely.
> > > 
> > > Sequences ought to do that, too. And if it's borked, we'd hopefully see
> > > unique violations. But it's definitely not a 1:1 replacement.
> 
> > I've tested this on ppc.  Neither the old version nor the new version
> > stress test spinlocks sufficiently to error out with weakened spinlocks
> > (not that surprising, there are no spinlocks in any hot path of either
> > workload). Both versions very reliably trigger on weakened lwlocks. So I
> > think we're comparatively good on that front.
> 
> I tested this on xlc, the compiler that motivated the OID test, and the v12+
> version of the test didn't catch the bug[1] with xlc 13.1.3.  CREATE TYPE
> ... AS ENUM generates an OID for each label, so the attached patch makes the
> v12+ test have locking behavior similar to its v11 ancestor.

Interesting.  It's not obvious to me why that works meaningfully better.


> [1] 
> https://postgr.es/m/flat/a72cfcb0-37d0-de2f-b3ec-f38ad8d6a...@postgrespro.ru

> diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl 
> b/src/bin/pgbench/t/001_pgbench_with_server.pl
> index dc2c72f..3b097a9 100644
> --- a/src/bin/pgbench/t/001_pgbench_with_server.pl
> +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
> @@ -58,27 +58,20 @@ sub pgbench
>   return;
>  }
>  
> -# Test concurrent insertion into table with serial column.  This
> -# indirectly exercises LWLock and spinlock concurrency.  This test
> -# makes a 5-MiB table.
> -
> -$node->safe_psql('postgres',
> - 'CREATE UNLOGGED TABLE insert_tbl (id serial primary key); ');
> -
> +# Test concurrent OID generation via pg_enum_oid_index.  This indirectly
> +# exercises LWLock and spinlock concurrency.
> +my $labels = join ',', map { "'l$_'" } 1 .. 1000;
>  pgbench(
>   '--no-vacuum --client=5 --protocol=prepared --transactions=25',
>   0,
>   [qr{processed: 125/125}],
>   [qr{^$}],
> - 'concurrent insert workload',
> + 'concurrent OID generation',
>   {
>   '001_pgbench_concurrent_insert' =>
> -   'INSERT INTO insert_tbl SELECT FROM generate_series(1,1000);'
> +   "CREATE TYPE pg_temp.e AS ENUM ($labels); DROP TYPE 
> pg_temp.e;"
>   });

Hm, perhaps we should just do something stupid like an insert into a catalog
table, determining the oid to insert with pg_nextoid? That ought to be a
lot faster and thus more "stress testing" than going through a full
blown DDL statement?  But perhaps that's just too ugly.

Greetings,

Andres Freund




Re: pg_receivewal documentation

2019-07-19 Thread Jesper Pedersen

Hi,

On 7/18/19 9:09 PM, Michael Paquier wrote:

pg_receivewal -D /tmp/wal -S replica1 --synchronous -h localhost -p 5432 -U
repluser -W
psql -c 'SELECT * FROM pg_stat_replication;' postgres
psql -c 'SELECT * FROM pg_replication_slots;' postgres
psql -c 'CREATE DATABASE test' postgres

In what scenarios do you see 'on' working ?


Because the code says so, "on" is an alias for "remote_flush" (which
is not user-visible by the way):
src/include/access/xact.h:#define SYNCHRONOUS_COMMIT_ON
SYNCHRONOUS_COMMIT_REMOTE_FLUSH

And if you do that it works fine (pg_receivewal --synchronous runs in
the background and I created a dummy table):
=# SELECT application_name, sync_state, flush_lsn, replay_lsn FROM
pg_stat_replication;
  application_name | sync_state | flush_lsn | replay_lsn
 ------------------+------------+-----------+------------
  pg_receivewal    | sync       | 0/15E1F88 | null
(1 row)
=# set synchronous_commit to on ;
SET
=# insert into aa values (2);
INSERT 0 1



I forgot to use pg_receivewal -d with application_name instead of -h -p -U.

Maybe we should have an explicit option for that, but that is a separate 
thread.


Best regards,
 Jesper




Re: pg_receivewal documentation

2019-07-19 Thread Jesper Pedersen

Hi,

On 7/18/19 9:27 PM, Michael Paquier wrote:

The location of the warning is
also harder to catch for the reader, so instead let's move it to the
top where we have an extra description for --synchronous.  I am
finishing with the attached that I would be fine to commit and
back-patch as needed.  Does that sound fine?


LGTM.

Best regards,
 Jesper






Re: partition routing layering in nodeModifyTable.c

2019-07-19 Thread Andres Freund
Hi,

On 2019-07-19 17:52:20 +0900, Amit Langote wrote:
> Attached are two patches.

Awesome.


> The first one (0001) deals with reducing the core executor's reliance
> on es_result_relation_info to access the currently active result
> relation, in favor of receiving it from the caller as a function
> argument.  So no piece of core code relies on it being correctly set
> anymore.  It still needs to be set correctly for the third-party code
> such as FDWs.

I'm inclined to just remove it. There's not much code out there relying
on it, as far as I can tell. Most FDWs don't support the direct modify
API, and that's afaict the case where one needs to use
es_result_relation_info?

In fact, I searched through all FDWs listed on 
https://wiki.postgresql.org/wiki/Foreign_data_wrappers
that are on github and in first few categories (up and including to
"file wrappers"), and there was only one reference to
es_result_relation_info, and that just in comments in a test:
https://github.com/pgspider/griddb_fdw/search?utf8=%E2%9C%93=es_result_relation_info=
which I think was just copied from our source code.

IOW, we should just change the direct modify calls to get the relevant
ResultRelationInfo or something in that vein (perhaps just the relevant
RT index?).

pglogical also references it, but just because it creates its own
EState afaict.



> @@ -334,32 +335,50 @@ ExecComputeStoredGenerated(EState *estate, 
> TupleTableSlot *slot)
>   *   ExecInsert
>   *
>   *   For INSERT, we have to insert the tuple into the target relation
> - *   and insert appropriate tuples into the index relations.
> + *   (or partition thereof) and insert appropriate tuples into the 
> index
> + *   relations.
>   *
>   *   Returns RETURNING result if any, otherwise NULL.
> + *
> + *   This may change the currently active tuple conversion map in
> + *   mtstate->mt_transition_capture, so the callers must take care to
> + *   save the previous value to avoid losing track of it.
>   * 
>   */
>  static TupleTableSlot *
>  ExecInsert(ModifyTableState *mtstate,
> +ResultRelInfo *resultRelInfo,
>  TupleTableSlot *slot,
>  TupleTableSlot *planSlot,
>  EState *estate,
>  bool canSetTag)
>  {
> - ResultRelInfo *resultRelInfo;
>   RelationresultRelationDesc;
>   List   *recheckIndexes = NIL;
>   TupleTableSlot *result = NULL;
>   TransitionCaptureState *ar_insert_trig_tcs;
>   ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
>   OnConflictAction onconflict = node->onConflictAction;
> + PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
> +
> + /*
> +  * If the input result relation is a partitioned table, find the leaf
> +  * partition to insert the tuple into.
> +  */
> + if (proute)
> + {
> + ResultRelInfo *partRelInfo;
> +
+ slot = ExecPrepareTupleRouting(mtstate, estate, proute,
+  resultRelInfo, slot, &partRelInfo);
+ resultRelInfo = partRelInfo;
+ /* Result relation has changed, so update EState reference too. */
> + estate->es_result_relation_info = resultRelInfo;
> + }

I think by removing es_result_relation entirely, this would look
cleaner.


> @@ -1271,18 +1274,18 @@ lreplace:;
>   mtstate->mt_root_tuple_slot);
>  
>   /*
> -  * Prepare for tuple routing, making it look like we're inserting
> -  * into the root.
> +  * ExecInsert() may scribble on mtstate->mt_transition_capture,
> +  * so save the currently active map.
>*/
> + if (mtstate->mt_transition_capture)
> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

Wonder if we could remove the need for this somehow, it's still pretty
darn ugly.  Thomas, perhaps you have some insights?

To me the need to modify these ModifyTable wide state on a per-subplan
and even per-partition basis indicates that the datastructures are in
the wrong place.



> @@ -2212,23 +2207,17 @@ ExecModifyTable(PlanState *pstate)
>   switch (operation)
>   {
>   case CMD_INSERT:
> - /* Prepare for tuple routing if needed. */
> - if (proute)
> - slot = ExecPrepareTupleRouting(node, 
> estate, proute,
> - 
> 

Re: [sqlsmith] Crash in mcv_get_match_bitmap

2019-07-19 Thread Tomas Vondra

On Thu, Jul 18, 2019 at 11:16:08AM -0400, Tom Lane wrote:

Tomas Vondra  writes:

I've pushed the fixes listed in the previous message, with the exception
of the collation part, because I had some doubts about that.


Sorry for being slow on this.


Now, for the collation part - after some more thought and looking at code
I think the fix I shared before is OK. It uses per-column collations
consistently both when building the stats and estimating clauses, which
makes it consistent. I was not quite sure if using Var->varcollid is the
same thing as pg_attribute.attcollation, but it seems it is - at least for
Vars pointing to regular columns (which for extended stats should always
be the case).


I think you are right, but it could use some comments in the code.
The important point here is that if we coerce a Var's collation to
something else, that will be represented as a separate CollateExpr
(which will be a RelabelType by the time it gets here, I believe).
We don't just replace varcollid, the way eval_const_expressions will
do to a Const.



OK, thanks. I've added a comment about that into mcv_get_match_bitmap (not
all the details about RelabelType, because that gets stripped while
examining the opexpr, but generally about the collations).



While I'm looking at the code --- I don't find this at all convincing:

   /*
* We don't care about isgt in equality, because
* it does not matter whether it's (var op const)
* or (const op var).
*/
   match = DatumGetBool(FunctionCall2Coll(&opproc,
  DEFAULT_COLLATION_OID,
  cst->constvalue,
  item->values[idx]));

It *will* matter if the operator is cross-type.  I think there is no
good reason to have different code paths for the equality and inequality
cases --- just look at isgt and swap or don't swap the arguments.

BTW, "isgt" seems like a completely misleading name for that variable.
AFAICS, what that is actually telling is whether the var is on the left
or right side of the OpExpr.  Everywhere else in the planner with a
need for that uses "bool varonleft", and I think you should do likewise
here (though note that that isgt, as coded, is more like "varonright").



Yes, you're right in both cases. I've fixed the first issue with equality
by simply using the same code as for all operators (which means we don't
need to check the oprrest at all in this code, making it simpler).

And I've reworked the examine_opclause_expression() to use varonleft
naming - I agree it's a much better name. This was one of the remnants of
the code I initially copied from somewhere in selfuncs.c and massaged
it until it did what I needed.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

>From 14dc49ad7eefd6e67b751e74ce7253cdc85ca5ad Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 19 Jul 2019 16:28:28 +0200
Subject: [PATCH 1/2] Rework examine_opclause_expression to use varonleft

The examine_opclause_expression function needs to return information on
which side of the operator we found the Var, but the variable was called
"isgt" which is rather misleading (it assumes the operator is either
less-than or greater-than, but it may be equality or something else).
Other places in the planner use a variable called "varonleft" for this
purpose, so just adopt the same convention here.

The code also assumed we don't care about this flag for equality, as
(Var = Const) and (Const = Var) should be the same thing. But that does
not work for cross-type operators, in which case we need to pass the
parameters to the procedure in the right order. So just use the same
code for all types of expressions.

This means we don't need to care about the selectivity estimation
function anymore, at least not in this code. We should only get the
supported cases here (thanks to statext_is_compatible_clause).

Discussion: https://postgr.es/m/8736jdhbhc.fsf%40ansel.ydns.eu
Backpatch-to: 12
---
 src/backend/statistics/extended_stats.c   | 16 ++---
 src/backend/statistics/mcv.c  | 62 +--
 .../statistics/extended_stats_internal.h  |  2 +-
 3 files changed, 26 insertions(+), 54 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c 
b/src/backend/statistics/extended_stats.c
index d2346cbe02..23c74f7d90 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1196,15 +1196,15 @@ statext_clauselist_selectivity(PlannerInfo *root, List 
*clauses, int varRelid,
  * returns true, otherwise returns false.
  *
  * Optionally returns pointers to the extracted Var/Const nodes, when passed
- * 

Re: sepgsql seems rather thoroughly broken on Fedora 30

2019-07-19 Thread Tom Lane
Mike Palmiotto  writes:
> The sepgsql_regtest_user_t domain should be allowed to read any file
> labeled "passwd_file_t". We can check that with the `sesearch` tool,
> provided by the "setools-console" package on F30:

> % sudo sesearch -A -s sepgsql_regtest_user_t -t passwd_file_t
> allow domain file_type:blk_file map; [ domain_can_mmap_files ]:True
> allow domain file_type:chr_file map; [ domain_can_mmap_files ]:True
> allow domain file_type:file map; [ domain_can_mmap_files ]:True
> allow nsswitch_domain passwd_file_t:file { getattr ioctl lock map open read };

I got around to trying this, and lookee here:

$ sudo sesearch -A -s sepgsql_regtest_user_t -t passwd_file_t
allow domain file_type:blk_file map; [ domain_can_mmap_files ]:True
allow domain file_type:chr_file map; [ domain_can_mmap_files ]:True
allow domain file_type:file map; [ domain_can_mmap_files ]:True
allow domain file_type:lnk_file map; [ domain_can_mmap_files ]:True

Nothing about passwd_file_t.  So *something* is different about the
way the policy is being expanded.

regards, tom lane




Re: sepgsql seems rather thoroughly broken on Fedora 30

2019-07-19 Thread Tom Lane
Mike Palmiotto  writes:
> On Thu, Jul 18, 2019 at 11:06 PM Tom Lane  wrote:
>>> $ runcon -t sepgsql_regtest_user_t psql --help
>>> psql: fatal: could not look up effective user ID 1000: user does not exist

> You can rule out SELinux for this piece by running `sudo setenforce
> 0`. If the `runcon ... psql` command works in Permissive we should
> look at your audit log to determine what is being denied. audit2allow
> will provide a summary of the SELinux denials and is generally a good
> starting point:

> # grep denied /var/log/audit/audit.log | audit2allow

It's definitely SELinux.  The grep finds entries like

type=AVC msg=audit(1563547268.044:465): avc:  denied  { read } for  pid=10940 
comm="psql" name="passwd" dev="sda6" ino=4721184 
scontext=unconfined_u:unconfined_r:sepgsql_regtest_user_t:s0-s0:c0.c1023 
tcontext=system_u:object_r:passwd_file_t:s0 tclass=file permissive=0

which audit2allow turns into

#= sepgsql_regtest_user_t ==
allow sepgsql_regtest_user_t passwd_file_t:file read;

So somehow, my system's interpretation of the test policy file does
not include that permission.

I tried:

* restorecon / ... no effect, which is unsurprising given that /etc/passwd
was OK already.

* removing container-selinux ... this made the compile warnings go away,
as you predicted, but no change in the test results.

> FWIW, this appears to be working on my recently-installed F30 VM:

> % runcon -t sepgsql_regtest_user_t psql --help &> /dev/null
> % echo $?
> 0

Well, that's just weird.  I've not done anything to the SELinux state
on this installation either, so what's different?

I am wondering whether maybe the different behavior is a result of some
RPM that's present on my system but not yours, or vice versa.  As
a first stab at that, I see:

$ rpm -qa | grep selinux | sort
cockpit-selinux-198-1.fc30.noarch
container-selinux-2.107-1.git453b816.fc30.noarch
flatpak-selinux-1.4.2-2.fc30.x86_64
libselinux-2.9-1.fc30.x86_64
libselinux-devel-2.9-1.fc30.x86_64
libselinux-utils-2.9-1.fc30.x86_64
python3-libselinux-2.9-1.fc30.x86_64
rpm-plugin-selinux-4.14.2.1-4.fc30.1.x86_64
selinux-policy-3.14.3-40.fc30.noarch
selinux-policy-devel-3.14.3-40.fc30.noarch
selinux-policy-targeted-3.14.3-40.fc30.noarch
tpm2-abrmd-selinux-2.0.0-4.fc30.noarch

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-19 Thread Antonin Houska
Tomas Vondra  wrote:

> On Fri, Jul 19, 2019 at 12:04:36PM +0200, Antonin Houska wrote:
> >Tomas Vondra  wrote:
> >
> >> On Mon, Jul 15, 2019 at 03:42:39PM -0400, Bruce Momjian wrote:
> >> >On Sat, Jul 13, 2019 at 11:58:02PM +0200, Tomas Vondra wrote:
> >> >> One extra thing we should consider is authenticated encryption. We can't
> >> >> just encrypt the pages (no matter which AES mode is used - XTS/CBC/...),
> >> >> as that does not provide integrity protection (i.e. can't detect when
> >> >> the ciphertext was corrupted due to disk failure or intentionally). And
> >> >> we can't quite rely on checksums, because that checksums the plaintext
> >> >> and is stored encrypted.
> >> >
> >> >Uh, if someone modifies a few bytes of the page, we will decrypt it, but
> >> >the checksum (per-page or WAL) will not match our decrypted output.  How
> >> >would they make it match the checksum without already knowing the key.
> >> >I read [1] but could not see that explained.
> >> >
> >>
> >> Our checksum is only 16 bits, so perhaps one way would be to just
> >> generate 64k of randomly modified pages and hope one of them happens to
> >> hit the right checksum value. Not sure how practical such attack is, but
> >> it does require just filesystem access.
> >
> >I don't think you can easily generate 64k of different checksums this way. If
> >the data is random, I suppose that each set of 2^(128 - 16) blocks will
> >contain the the same checksum after decryption. Thus even you generate 64k of
> >different ciphertext blocks that contain the checksum, some (many?)  
> >checksums
> >will be duplicate. Unfortunately the math to describe this problem does not
> >seem to be trivial.
> >
> 
> I'm not sure what's your point, or why you care about the 128 bits, but I
> don't think the math is very complicated (and it's exactly the same with
> or without encryption). The probability of checksum collision for randomly
> modified page is 1/64k, so p=~0.00153%. So probability of *not* getting a
> collision is (1-p)=99.9985%. So with N pages, the probability of no
> collisions is pow((1-p),N) which behaves like this:
> 
>       N      pow((1-p),N)
>
>    10000        85%
>    20000        73%
>    30000        63%
>    46000        49%
>   200000         4%
> 
> So with 1.6GB relation you have about 96% chance of a checksum collision.
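
(For what it's worth, those percentages are easy to reproduce directly
in SQL, e.g.:

SELECT n, trunc(100 * power(1 - 1.0/65536, n)) AS pct_no_collision
  FROM (VALUES (10000), (20000), (30000), (46000), (200000)) v(n);

which gives the same ballpark numbers.)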

I thought your attack proposal was to find a valid (encrypted) checksum for a
given encrypted page. Instead it seems that you were only trying to say that
it's not too hard to generate page with a valid checksum in general. Thus the
attacker can try to modify the ciphertext again and again in a way that is not
quite random, but the chance to pass the checksum verification may still be
relatively high.

> >Also note that if you try to generate ciphertext, decryption of which will
> >result in particular value of checksum, you can hardly control the other 14
> >bytes of the block, which in turn are used to verify the checksum.
> >
> 
> Now, I'm not saying this attack is particularly practical - it would
> generate a fair number of checkpoint failures before getting the first
> collision. So it'd trigger quite a few alerts, I guess.

You probably mean "checksum failures". I agree. And even if the checksum
passed the verification, page or tuple headers would probably be incorrect and
cause other errors.

> >> FWIW our CRC algorithm is not quite HMAC, because it's neither keyed nor
> >> a cryptographic hash algorithm. Now, maybe we don't want authenticated
> >> encryption (e.g. XTS is not authenticated, unlike GCM/CCM).
> >
> >I'm also not sure if we should try to guarantee data authenticity /
> >integrity. As someone already mentioned elsewhere, page MAC does not help if
> >the whole page is replaced. (An extreme case is that old filesystem snapshot
> >containing the whole data directory is restored, although that will probably
> >make the database crash soon.)
> >
> >We can guarantee integrity and authenticity of backup, but that's a separate
> >feature: someone may need this although it's o.k. for him to run the cluster
> >unencrypted.
> >
> 
> Yes, I do agree with that. I think attempts to guarantee data authenticity
> and/or integrity at the page level is mostly futile (replay attacks are an
> example of why). IMHO we should consider that to be outside the threat
> model TDE is expected to address.

When writing my previous email I forgot that, besides improving data
integrity, authenticated encryption also tries to detect attempts to extract
the encryption key via a chosen-ciphertext attack (CCA). The fact that pages
are encrypted / decrypted independently of each other should not be a problem
here. We just need to consider whether this kind of CCA is a threat we try to
protect against.

> IMO a better way to handle authenticity/integrity would be based on WAL,
> which is essentially an authoritative log of operations. We should be able
> to parse WAL, deduce expected state (min LSN, checksums) for each page,
> and validate the cluster state based on that.

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-07-19 Thread Robert Haas
On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita  wrote:
> I.e., partition_bounds_merge() is performed for each pair of input
> partitioned relations for a join relation in try_partitionwise_join().
> Since partition_bounds_merge() would need a lot of CPU cycles, I don't
> think this is acceptable; ISTM that some redesign is needed to avoid
> this.  I'm wondering that once we successfully merged partition bounds
> from a pair of input partitioned relations for the join relation, by
> using the merged partition bounds, we could get the lists of matching
> to-be-joined partitions for subsequent pairs of input partitioned
> relations for the join relation in a more efficient way than by
> performing partition_bounds_merge() as proposed in the patch.

I don't know whether partition_bounds_merge() is well-implemented; I
haven't looked. But in general I don't see an alternative to doing
some kind of merging on each pair of input relations. That's just how
planning works, and I don't see why it should need to be prohibitively
expensive.  I might be missing something, though; do you have an idea?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )

2019-07-19 Thread Ian Barwick

On 7/19/19 7:45 PM, Sergei Kornilov wrote:

Hi

Oh. Replication slot names can currently contain only a-z0-9_ characters, so
we cannot actually write such a recovery.conf; pg_basebackup will stop before
that. But performing escape_quotes on a string and not using the result is an
error anyway.


Good point, it does actually fail with an error if an impossible slot name
is provided, so the escaping is superfluous anyway.

I'll take another look at it later as it's not exactly critical; it just
stuck out when I was passing through the code.


Regards

Ian Barwick


--
 Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: sepgsql seems rather thoroughly broken on Fedora 30

2019-07-19 Thread Mike Palmiotto
On Thu, Jul 18, 2019 at 11:06 PM Tom Lane  wrote:
>
> Mike Palmiotto  writes:
> > On Wed, Jul 17, 2019 at 12:32 PM Tom Lane  wrote:
> >> $ runcon -t sepgsql_regtest_user_t psql --help
> >> psql: fatal: could not look up effective user ID 1000: user does not exist

You can rule out SELinux for this piece by running `sudo setenforce
0`. If the `runcon ... psql` command works in Permissive we should
look at your audit log to determine what is being denied. audit2allow
will provide a summary of the SELinux denials and is generally a good
starting point:

# grep denied /var/log/audit/audit.log | audit2allow

If SELinux is indeed the issue here and you want to avoid doing all of
this detective work, it may be a good idea to just run a system-wide
restorecon (assuming you didn't already do that before) to make sure
your labels are in a decent state.

FWIW, this appears to be working on my recently-installed F30 VM:

% runcon -t sepgsql_regtest_user_t psql --help &> /dev/null
% echo $?
0

Hopefully a system-wide `restorecon` just magically fixes this for
you. Otherwise, we can start digging into denials.

-- 
Mike Palmiotto
Software Engineer
Crunchy Data Solutions
https://crunchydata.com




Re: POC: Cleaning up orphaned files using undo logs

2019-07-19 Thread Robert Haas
On Fri, Jul 19, 2019 at 7:54 AM Amit Khandekar  wrote:
> + * We just want to mask the cid in the undo record header.  So
> + * only if the partial record in the current page include the undo
> + * record header then we need to mask the cid bytes in this page.
> + * Otherwise, directly jump to the next record.
> Here, I think you mean : "So only if the partial record in the current
> page includes the *cid* bytes", rather than "includes the undo record
> header"
> May be we can say :
> We just want to mask the cid. So do the partial record masking only if
> the current page includes the cid bytes from the partial record
> header.

Hmm, but why is it correct to mask the CID at all?  Shouldn't that match?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-07-19 Thread Robert Haas
On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila  wrote:
> We are doing exactly what you have written in the last line of the
> next paragraph "stop the transaction from writing undo when the hash
> table is already too full.".  So we will
> never face the problems related to repeated crash recovery.  The
> definition of "too full" is that we stop allowing new transactions
> that can write undo when the hash table already has entries
> equivalent to (UndoRollbackHashTableSize() - MaxBackends).  Does this
> make sense?

Oops, I was looking in the wrong place.  Yes, that makes sense, but:

1. It looks like the test you added to PrepareUndoInsert has no
locking, and I don't see how that can be right.

2. It seems like this would result in testing for each new undo
insertion that gets prepared, whereas surely we would want to only
test when first attaching to an undo log.  If you've already attached
to the undo log, there's no reason not to continue inserting into it,
because doing so doesn't increase the number of transactions (or
transaction-persistence level combinations) that need undo.

3. I don't think the test itself is correct. It can fire even when
there's no problem. It is correct (or would be if it said 2 *
MaxBackends) if every other backend in the system is already attached
to an undo log (or two). But if they are not, it will block
transactions from being started for no reason. For instance, suppose
max_connections = 100 and there are another 100 slots for background
rollbacks. Now suppose that the system crashes when 101 slots are in
use -- 100 pushed into the background plus 1 that was aborted by the
crash. On recovery, this test will refuse to let any new transaction
start. Actually it is OK for up to 99 transactions to write undo, just
not 100.  Or, given that you have a slot per persistence level, it's
OK to have up to 199 transaction-persistence-level combinations in
flight, just not 200. And that is the difference between the system
being unusable after the crash until a rollback succeeds and being
almost fully usable immediately.

> > I don't follow this.  If you have a hash table where the key is XID,
> > there is no need to delete and reinsert anything just because you
> > discover that the XID has not only permanent undo but also unlogged
> > undo, or something of that sort.
>
> The size of the total undo to be processed will be changed if we find
> anyone (permanent or unlogged) later.  Based on the size, the entry
> location should be changed in size queue.

OK, true. But that's not a significant cost, either in runtime or code
complexity.

I still don't really see any good reason for the hash table key to be
anything other than XID, or really, FXID. I mean, sure, the data
structure manipulations are a little different, but not in any way
that really matters. And it seems to me that there are some benefits,
the biggest of which is that the system becomes easier for users to
understand.  We can simply say that there is a limit on the number of
transactions that either (1) are in progress and have written undo or
(2) have aborted and not all of the undo has been processed. If the
key is XID + persistence level, then it's a limit on the number of
transaction-and-persistence-level combinations, which I feel is not so
easy to understand. In most but not all scenarios, it means that the
limit is about double what you think the limit is, and as the mistake
in the current version of the patch makes clear, even the people
writing the code can forget about that factor of two.

It affects a few other things, too.  If you made the key XID and fixed
problems (2) and (3) from above, then you'd have a situation where a
transaction could fail at only one point: either it bombs the first
time it tries to write undo, or it works.  As it is, there is a second
failure scenario: you do a bunch of work on permanent (or unlogged)
tables and then try to write to an unlogged (or permanent) table and
it fails because there are not enough slots.  Is that the end of the
world? No, certainly not. The situation should be rare. But if we have
to fail transactions, it's best to fail them before they've started
doing any work, because that minimizes the amount of work we waste by
having to retry. Of course, a transaction that fails midway through
when it tries to write at a second persistence level is also consuming
an undo slot in a situation where we're short of undo slots.

Another thing which Andres pointed out to me off-list is that we might
want to have a function that takes a transaction ID as an argument and
tells you the status of that transaction from the point of view of the
undo machinery: does it have any undo, and if so how much? As you have
it now, such a function would require searching the whole hash table,
because the user won't be able to provide an UndoRecPtr to go with the
XID.  If the hash table key were in fact <XID, persistence level>
rather than <XID, undo record pointer>, then you could do it with two
lookups; if it were XID alone, you could do it with just one.

Re: Compiler warnings with MinGW

2019-07-19 Thread Andrew Dunstan


On 7/19/19 1:08 AM, Michael Paquier wrote:
> Hi all,
>
> Just browsing through the logs of the buildfarm, I have noticed that
> some buildfarm animals complain with warnings (jacana uses MinGW):
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=jacana=2019-07-19%2001%3A45%3A28=make
>
> There are two of them:
> c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/port/win32/mingwcompat.c:60:1:
> warning: 'RegisterWaitForSingleObject' redeclared without dllimport
> attribute: previous dllimport ignored [-Wattributes]
>
> c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/bin/pg_basebackup/pg_basebackup.c:1448:8:
> warning: variable 'filemode' set but not used
> [-Wunused-but-set-variable]
> Jul 18 21:59:49 int   filemode;
>
> The first one has been discussed already some time ago and is a cause
> of 811be893, but nothing got actually fixed (protagonists in CC):
> https://www.postgresql.org/message-id/cabueveyezfuvaymunop3nyrvvrh2up2tstk8sxvapderf8p...@mail.gmail.com


To answer Magnus' question in that thread, the Mingw headers on jacana
declare the function with WINBASEAPI which in turn is defined as
DECLSPEC_IMPORT, as long as _KERNEL32_ isn't defined, and we don't do
that (and I don't think anything else does either).


So the fix Peter proposed looks like it should be correct.



>
> The second one is rather obvious to fix, because we don't care about
> the file mode on Windows, so the attached should do the work.  I am
> actually surprised that the Visual Studio compilers don't complain
> about that, but let's fix it.
>
> Thoughts?


+1.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Compile from source using latest Microsoft Windows SDK

2019-07-19 Thread Andrew Dunstan


On 7/19/19 5:51 AM, Michael Paquier wrote:
>
>> I'm also very curious on how hamerkop and bowerbird build postgres with
>> VS2017.  Looks like hamerkop and bowerbird both exist before VS2017
>> and maybe they get SDK v8.1 from previous VS installations. I will
>> contact admin of hamerkop and bowerbird and see if that's the case.
>> As of now I can still encounter the same issue with fresh installed
>> Windows Server 2016 and VS2017, on both azure and google cloud. So
>> better to patch the build system anyway.
> I guess so but I cannot confirm myself.  I am adding Andrew and Hari
> in CC to double-check if they have seen this problem or not.  Hari has
> also worked on porting VS 2017 and 2019 in the MSVC scripts in the
> tree.



My tests of the VS2017 stuff used this install mechanism on a fresh
Windows instance:


choco install -y visualstudio2017-workload-vctools --package-parameters
"--includeOptional"


This installed Windows Kits 8.1 and 10, among many other things.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Further hacking on SPITupleTable struct

2019-07-19 Thread Daniel Gustafsson
> On 17 Jul 2019, at 22:35, Tom Lane  wrote:
> 
> Thinking more about the public/private field distinction we just
> specified --- it's always annoyed me that SPITupleTable doesn't
> provide a number-of-valid-rows field, so that callers have to
> look at the entirely separate SPI_processed variable in order
> to make sense of SPI_tuptable.

Sorry for being slow to return to this, I see you have already committed it.
FWIW, I do agree that this makes a lot more sense.  Retroactively +1’ing it.

Regarding the core code I agree that no callers directly benefit without some
refactoring, but contrib/xml2/xpath.c has one case which seems applicable as
per the attached.  Now, since contrib/xml2 has been deprecated for a long time
it’s probably not worth bothering, but it was the one case I found so I figured
I’d record it in this thread.

cheers ./daniel



xpath_numvals.diff
Description: Binary data


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-19 Thread Tomas Vondra

On Fri, Jul 19, 2019 at 01:32:01PM +0200, Antonin Houska wrote:

Tomas Vondra  wrote:


On Mon, Jul 15, 2019 at 06:11:41PM -0400, Bruce Momjian wrote:
>On Mon, Jul 15, 2019 at 11:05:30PM +0200, Tomas Vondra wrote:
>> On Mon, Jul 15, 2019 at 03:42:39PM -0400, Bruce Momjian wrote:
>> > On Sat, Jul 13, 2019 at 11:58:02PM +0200, Tomas Vondra wrote:
>> > > One extra thing we should consider is authenticated encryption. We can't
>> > > just encrypt the pages (no matter which AES mode is used - XTS/CBC/...),
>> > > as that does not provide integrity protection (i.e. can't detect when
>> > > the ciphertext was corrupted due to disk failure or intentionally). And
>> > > we can't quite rely on checksums, because that checksums the plaintext
>> > > and is stored encrypted.
>> >
>> > Uh, if someone modifies a few bytes of the page, we will decrypt it, but
>> > the checksum (per-page or WAL) will not match our decrypted output.  How
>> > would they make it match the checksum without already knowing the key.
>> > I read [1] but could not see that explained.
>> >
>>
>> Our checksum is only 16 bits, so perhaps one way would be to just
>> generate 64k of randomly modified pages and hope one of them happens to
>> hit the right checksum value. Not sure how practical such attack is, but
>> it does require just filesystem access.
>
>Yes, that would work, and opens the question of whether our checksum is
>big enough for this, and if it is not, we need to find space for it,
>probably with a custom encrypted page format.  :-(   And that makes
>adding encryption offline almost impossible because you potentially have
>to move tuples around.  Yuck!
>

Right. We've been working on allowing to disable checksum online, and it
would be useful to allow something like that for encryption too I guess.
And without some sort of page-level flag that won't be possible, which
would be rather annoying.

Not sure it needs to be in the page itself, though - that's pretty much
why I proposed to store metadata (IV, key ID, ...) for encryption in a
new fork. That would be a bit more flexible than storing it in the page
itself (e.g. different encryption schemes might easily store different
amounts of metadata).

Maybe a new fork is way too complex solution, not sure.


One problem with this new fork would be that the contents of its buffers (the
MAC values) are not determined until the corresponding buffers of the MAIN
fork get encrypted. However, encryption is performed by the storage layer
(md.c), which is not expected to lock other buffers (such as those of the
"MAC fork"), read their pages from disk, or insert their WAL records.

This is different from the FSM or VM forks whose buffers are only updated
above the storage layer.



Yes, that seems like a valid issue :-(


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-19 Thread Tomas Vondra

On Fri, Jul 19, 2019 at 12:04:36PM +0200, Antonin Houska wrote:

Tomas Vondra  wrote:


On Mon, Jul 15, 2019 at 03:42:39PM -0400, Bruce Momjian wrote:
>On Sat, Jul 13, 2019 at 11:58:02PM +0200, Tomas Vondra wrote:
>> One extra thing we should consider is authenticated encryption. We can't
>> just encrypt the pages (no matter which AES mode is used - XTS/CBC/...),
>> as that does not provide integrity protection (i.e. can't detect when
>> the ciphertext was corrupted due to disk failure or intentionally). And
>> we can't quite rely on checksums, because that checksums the plaintext
>> and is stored encrypted.
>
>Uh, if someone modifies a few bytes of the page, we will decrypt it, but
>the checksum (per-page or WAL) will not match our decrypted output.  How
>would they make it match the checksum without already knowing the key.
>I read [1] but could not see that explained.
>

Our checksum is only 16 bits, so perhaps one way would be to just
generate 64k of randomly modified pages and hope one of them happens to
hit the right checksum value. Not sure how practical such attack is, but
it does require just filesystem access.


I don't think you can easily generate 64k of different checksums this way. If
the data is random, I suppose that each set of 2^(128 - 16) blocks will
contain the same checksum after decryption. Thus even if you generate 64k of
different ciphertext blocks that contain the checksum, some (many?) checksums
will be duplicates. Unfortunately the math to describe this problem does not
seem to be trivial.



I'm not sure what your point is, or why you care about the 128 bits, but I
don't think the math is very complicated (and it's exactly the same with
or without encryption). The probability of a checksum collision for a randomly
modified page is 1/64k, so p=~0.00153%. So the probability of *not* getting a
collision is (1-p)=99.9985%. So with N pages, the probability of no
collisions is pow((1-p),N), which behaves like this:

        N    pow((1-p),N)
   ------------------------
    10000    85%
    20000    73%
    30000    63%
    46000    49%
   200000     4%

So with a 1.6GB relation you have about a 96% chance of a checksum collision.
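
For anyone who wants to reproduce those figures, here is a trivial stand-alone
sketch (plain C, no PostgreSQL headers) evaluating the same formula; the
integer percentages in the table above are just these values truncated:

#include <stdio.h>
#include <math.h>

/* chance that a randomly modified page happens to pass a 16-bit checksum */
#define P_COLLISION (1.0 / 65536.0)

int
main(void)
{
    const long npages[] = {10000, 20000, 30000, 46000, 200000};

    for (size_t i = 0; i < sizeof(npages) / sizeof(npages[0]); i++)
    {
        /* probability of *no* false checksum match after npages attempts */
        double no_collision = pow(1.0 - P_COLLISION, npages[i]);

        printf("%8ld pages: P(no collision) = %.1f%%\n",
               npages[i], no_collision * 100.0);
    }
    return 0;                   /* build with: cc demo.c -lm */
}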


Also note that if you try to generate ciphertext, decryption of which will
result in particular value of checksum, you can hardly control the other 14
bytes of the block, which in turn are used to verify the checksum.



But we don't care about the 14 bytes. In fact, we want the page header
(which includes both the checksum and the other 14B in the block) to
remain unchanged - the attack only needs to modify the remaining parts of
the 8kB page in a way that generates the same checksum on the plaintext.

And that's not that hard to do, IMHO, because the header is stored at the
beginning of the page. So we can just randomly modify the last AES block
(last 16B on the page) to minimize the corruption to the last block.

Now, I'm not saying this attack is particularly practical - it would
generate a fair number of checkpoint failures before getting the first
collision. So it'd trigger quite a few alerts, I guess.


FWIW our CRC algorithm is not quite HMAC, because it's neither keyed nor
a cryptographic hash algorithm. Now, maybe we don't want authenticated
encryption (e.g. XTS is not authenticated, unlike GCM/CCM).


I'm also not sure if we should try to guarantee data authenticity /
integrity. As someone already mentioned elsewhere, page MAC does not help if
the whole page is replaced. (An extreme case is that old filesystem snapshot
containing the whole data directory is restored, although that will probably
make the database crash soon.)

We can guarantee integrity and authenticity of backup, but that's a separate
feature: someone may need this although it's o.k. for him to run the cluster
unencrypted.



Yes, I do agree with that. I think attempts to guarantee data authenticity
and/or integrity at the page level is mostly futile (replay attacks are an
example of why). IMHO we should consider that to be outside the threat
model TDE is expected to address.

IMO a better way to handle authenticity/integrity would be based on WAL,
which is essentially an authoritative log of operations. We should be able
to parse WAL, deduce expected state (min LSN, checksums) for each page,
and validate the cluster state based on that.

I still think having to decrypt the page in order to verify a checksum
(because the header is part of the encrypted page, and is computed from
the plaintext version) is not great.

regards


--
Antonin Houska
Web: https://www.cybertec-postgresql.com


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: POC: Cleaning up orphaned files using undo logs

2019-07-19 Thread Amit Khandekar
On Thu, 9 May 2019 at 12:04, Dilip Kumar  wrote:

> Patches can be applied on top of undo branch [1] commit:
> (cb777466d008e656f03771cf16ec7ef9d6f2778b)
>
> [1] https://github.com/EnterpriseDB/zheap/tree/undo

Below are some review points for 0009-undo-page-consistency-checker.patch :


+ /* Calculate the size of the partial record. */
+ partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) +
+ phdr->tuple_len + phdr->payload_len -
+ phdr->record_offset;

There is already an UndoPagePartialRecSize() function which calculates
the size of a partial record and seems to do the same as above. If
it is the same, you can omit the above code, and instead, down below
where you increment next_record, you can do "next_record +=
UndoPagePartialRecSize()".

Also, I see an extra  sizeof(uint16) added in
UndoPagePartialRecSize(). Not sure which one is correct, and which one
wrong, unless I am wrong in assuming that the above calculation and
the function definition do the same thing.

--

+ * We just want to mask the cid in the undo record header.  So
+ * only if the partial record in the current page include the undo
+ * record header then we need to mask the cid bytes in this page.
+ * Otherwise, directly jump to the next record.
Here, I think you mean : "So only if the partial record in the current
page includes the *cid* bytes", rather than "includes the undo record
header"
May be we can say :
We just want to mask the cid. So do the partial record masking only if
the current page includes the cid bytes from the partial record
header.



+ if (phdr->record_offset < (cid_offset + sizeof(CommandId)))
+ {
+char*cid_data;
+Size mask_size;
+
+mask_size = Min(cid_offset - phdr->record_offset,
+sizeof(CommandId));
+
+cid_data = next_record + cid_offset - phdr->record_offset;
+memset(_data, MASK_MARKER, mask_size);
+
Here, if record_offset lies *between* cid start and cid end, then
cid_offset - phdr->record_offset will be negative, and so will be
mask_size. Probably abs() should do the work.

Also, an Assert(cid_data + mask_size <= page_end) would be nice. I
know the cid position of a partial record cannot go beyond the page
boundary, but it's better to have this Assert as a sanity check.
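
To make the two points above concrete, here is one way the masking could be
written so that neither the start nor the length can go negative; this is
only a sketch reusing the names from the quoted hunk (phdr, cid_offset,
next_record, MASK_MARKER, page_end), not the patch's actual code:

if (phdr->record_offset < cid_offset + sizeof(CommandId))
{
    /* cid bytes already emitted on the previous page, if any */
    Size    skipped = (phdr->record_offset > cid_offset) ?
                       phdr->record_offset - cid_offset : 0;
    Size    mask_size = sizeof(CommandId) - skipped;
    char   *cid_data = next_record +
                       ((cid_offset > phdr->record_offset) ?
                        cid_offset - phdr->record_offset : 0);

    /* the sanity check suggested above */
    Assert(cid_data + mask_size <= page_end);
    memset(cid_data, MASK_MARKER, mask_size);
}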

+ * Process the undo record of the page and mask their cid filed.
filed => field

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-19 Thread Antonin Houska
Tomas Vondra  wrote:

> On Mon, Jul 15, 2019 at 06:11:41PM -0400, Bruce Momjian wrote:
> >On Mon, Jul 15, 2019 at 11:05:30PM +0200, Tomas Vondra wrote:
> >> On Mon, Jul 15, 2019 at 03:42:39PM -0400, Bruce Momjian wrote:
> >> > On Sat, Jul 13, 2019 at 11:58:02PM +0200, Tomas Vondra wrote:
> >> > > One extra thing we should consider is authenticated encryption. We 
> >> > > can't
> >> > > just encrypt the pages (no matter which AES mode is used - 
> >> > > XTS/CBC/...),
> >> > > as that does not provide integrity protection (i.e. can't detect when
> >> > > the ciphertext was corrupted due to disk failure or intentionally). And
> >> > > we can't quite rely on checksums, because that checksums the plaintext
> >> > > and is stored encrypted.
> >> >
> >> > Uh, if someone modifies a few bytes of the page, we will decrypt it, but
> >> > the checksum (per-page or WAL) will not match our decrypted output.  How
> >> > would they make it match the checksum without already knowing the key.
> >> > I read [1] but could not see that explained.
> >> >
> >>
> >> Our checksum is only 16 bits, so perhaps one way would be to just
> >> generate 64k of randomly modified pages and hope one of them happens to
> >> hit the right checksum value. Not sure how practical such attack is, but
> >> it does require just filesystem access.
> >
> >Yes, that would work, and opens the question of whether our checksum is
> >big enough for this, and if it is not, we need to find space for it,
> >probably with a custom encrypted page format.  :-(   And that makes
> >adding encryption offline almost impossible because you potentially have
> >to move tuples around.  Yuck!
> >
> 
> Right. We've been working on allowing to disable checksum online, and it
> would be useful to allow something like that for encryption too I guess.
> And without some sort of page-level flag that won't be possible, which
> would be rather annoying.
> 
> Not sure it needs to be in the page itself, though - that's pretty much
> why I proposed to store metadata (IV, key ID, ...) for encryption in a
> new fork. That would be a bit more flexible than storing it in the page
> itself (e.g. different encryption schemes might easily store different
> amounts of metadata).
> 
> Maybe a new fork is way too complex solution, not sure.

One problem with this new fork would be that the contents of its buffers (the
MAC values) are not determined until the corresponding buffers of the MAIN
fork get encrypted. However, encryption is performed by the storage layer
(md.c), which is not expected to lock other buffers (such as those of the
"MAC fork"), read their pages from disk, or insert their WAL records.

This is different from the FSM or VM forks whose buffers are only updated
above the storage layer.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Support for CALL statement in ecpg

2019-07-19 Thread Ashutosh Sharma
Hi,

Thanks for the review. Please find my comments in-line.

On Fri, Jul 19, 2019 at 8:33 AM Kyotaro Horiguchi
 wrote:
>
> Hello.
>
>
> +ECPG: CallStmtCALLfunc_application
>
>  Even though it is the default behavior, as a written rule
>  this needs the postfix "block".
>

Done.

> +$$ = cat_str(2,mm_strdup("call"),$2);
>
> Let's have proper spacing.
>
> + * Copy input arguments to the result arguments list so that all the
> + * host variables gets treated as INOUT params.
>

I've removed above comments so this is no more valid.

> This fails for the following usage:
>
> -- define procedure
> create procedure ptest2 (in x int, inout y int) language plpgsql as $$
> begin
>  y := y + x;
> end;
> $$;
>
> -- in .pgc
> 14: a = 3;
> 15: r = 5;
> 16: EXEC SQL call ptest2(:a, :r);
> 21: printf("ret = %d, %d\n", a, r);
>
>
> This complains like this:
>
> > SQL error: too many arguments on line 16
> > ret = 8, 5;
>
> The result should be "3, 8". This is because the patch requests
> two return but actually retruned just one.
>
> I'm not sure how to know that previously on ecpg. Might need to
> let users append INTO  clause explicitly.
>

As the ecpg connector is not aware of the parameter types of the
procedure it is calling, it becomes the responsibility of end users to
ensure that ecpg creates exactly as many output variables as there are
fields in the tuple returned by the server; for that, as you rightly
said, they must use the INTO clause with the CALL statement in ecpg.
Considering this approach, now with the attached v2 patch, the
CALL statement in an ecpg application would look like this:

EXEC SQL CALL(:hv1, :hv2) INTO :ret1, :ret2;

EXEC SQL CALL(:hv1, :hv2) INTO :ret1 :ind1, :ret2 :ind2;

If the INTO clause is not used with the CALL statement, the ecpg
compiler fails with a parse error: "INTO clause is required with CALL
statement".
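
For illustration, a minimal .pgc program using the proposed syntax might look
like the sketch below. It assumes the v2 patch is applied, reuses the ptest2
procedure from Kyotaro's example above, and uses a placeholder connection
target (testdb); since ptest2 has a single INOUT parameter, the result tuple
has one field, so INTO lists one host variable:

#include <stdio.h>

EXEC SQL INCLUDE sqlca;

int
main(void)
{
    EXEC SQL BEGIN DECLARE SECTION;
    int a = 3;
    int r = 5;
    EXEC SQL END DECLARE SECTION;

    EXEC SQL CONNECT TO testdb;          /* placeholder database name */

    /* ptest2(in x, inout y): the returned tuple contains only y */
    EXEC SQL CALL ptest2(:a, :r) INTO :r;

    printf("ret = %d, %d\n", a, r);      /* expected: ret = 3, 8 */

    EXEC SQL DISCONNECT ALL;
    return 0;
}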

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
diff --git a/src/interfaces/ecpg/preproc/ecpg.addons b/src/interfaces/ecpg/preproc/ecpg.addons
index cbffd50..932b7f9 100644
--- a/src/interfaces/ecpg/preproc/ecpg.addons
+++ b/src/interfaces/ecpg/preproc/ecpg.addons
@@ -555,3 +555,12 @@ ECPG: limit_clauseLIMITselect_limit_value','select_offset_value block
 	}
 ECPG: SignedIconstIconst rule
 	| civar	{ $$ = $1; }
+ECPG: CallStmtCALLfunc_application block
+	{
+		mmerror(PARSE_ERROR, ET_ERROR, "INTO clause is required with CALL statement");
+		$$ = EMPTY;
+	}
+	| CALL func_application ecpg_into
+	{
+		$$ = cat_str(2,mm_strdup("call"),$2);
+	}
diff --git a/src/interfaces/ecpg/test/ecpg_schedule b/src/interfaces/ecpg/test/ecpg_schedule
index e034c5a..7e6b5c7 100644
--- a/src/interfaces/ecpg/test/ecpg_schedule
+++ b/src/interfaces/ecpg/test/ecpg_schedule
@@ -33,6 +33,7 @@ test: preproc/whenever_do_continue
 test: sql/array
 test: sql/binary
 test: sql/bytea
+test: sql/call
 test: sql/code100
 test: sql/copystdout
 test: sql/createtableas
diff --git a/src/interfaces/ecpg/test/expected/sql-call.c b/src/interfaces/ecpg/test/expected/sql-call.c
new file mode 100644
index 000..813b3b7
--- /dev/null
+++ b/src/interfaces/ecpg/test/expected/sql-call.c
@@ -0,0 +1,264 @@
+/* Processed by ecpg (regression mode) */
+/* These include files are added by the preprocessor */
+#include 
+#include 
+#include 
+/* End of automatic include section */
+#define ECPGdebug(X,Y) ECPGdebug((X)+100,(Y))
+
+#line 1 "call.pgc"
+#include 
+#include 
+
+
+#line 1 "sqlca.h"
+#ifndef POSTGRES_SQLCA_H
+#define POSTGRES_SQLCA_H
+
+#ifndef PGDLLIMPORT
+#if  defined(WIN32) || defined(__CYGWIN__)
+#define PGDLLIMPORT __declspec (dllimport)
+#else
+#define PGDLLIMPORT
+#endif			/* __CYGWIN__ */
+#endif			/* PGDLLIMPORT */
+
+#define SQLERRMC_LEN	150
+
+#ifdef __cplusplus
+extern "C"
+{
+#endif
+
+struct sqlca_t
+{
+	char		sqlcaid[8];
+	long		sqlabc;
+	long		sqlcode;
+	struct
+	{
+		int			sqlerrml;
+		char		sqlerrmc[SQLERRMC_LEN];
+	}			sqlerrm;
+	char		sqlerrp[8];
+	long		sqlerrd[6];
+	/* Element 0: empty		*/
+	/* 1: OID of processed tuple if applicable			*/
+	/* 2: number of rows processed*/
+	/* after an INSERT, UPDATE or*/
+	/* DELETE statement	*/
+	/* 3: empty		*/
+	/* 4: empty		*/
+	/* 5: empty		*/
+	char		sqlwarn[8];
+	/* Element 0: set to 'W' if at least one other is 'W'	*/
+	/* 1: if 'W' at least one character string		*/
+	/* value was truncated when it was			*/
+	/* stored into a host variable. */
+
+	/*
+	 * 2: if 'W' a (hopefully) non-fatal notice occurred
+	 */	/* 3: empty */
+	/* 4: empty		*/
+	/* 5: empty		*/
+	/* 6: empty		*/
+	/* 7: empty		*/
+
+	char		sqlstate[5];
+};
+
+struct sqlca_t *ECPGget_sqlca(void);
+
+#ifndef POSTGRES_ECPG_INTERNAL
+#define sqlca (*ECPGget_sqlca())
+#endif
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
+
+#line 4 "call.pgc"
+
+
+#line 1 "regression.h"
+
+
+
+
+
+
+#line 5 "call.pgc"
+
+
+int
+main(void)
+{
+   /* exec sql begin declare section 

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-07-19 Thread amul sul
On Mon, Jul 8, 2019 at 5:03 PM Etsuro Fujita 
wrote:

> On Wed, Jul 3, 2019 at 3:44 PM Etsuro Fujita 
> wrote:
> > On Tue, Jul 2, 2019 at 1:47 PM amul sul  wrote:
> > > Attached version is rebase atop of the latest master head(c74d49d41c),
> thanks.
> >
> > Thanks!  Will review.
>
> I started reviewing this.  Here is my initial review comments:
>
> * 0001-Hash-partition-bound-equality-refactoring-v22.patch
>
> First of all, I agree with your view on hash partitioning:
>
> "3. For hash partitioned tables however, we support partition-wise join
> only when the bounds exactly match. For hash partitioning it's unusual
> to have missing partitions and hence generic partition matching is not
> required."
>
> which is cited from the commit message for the main patch
> "0002-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi-v22.patch".
> (I think it would be better if we can extend the partition matching to
> the hash-partitioning case where there are missing partitions in
> future, though.)  However, I don't think it's a good idea to do this
> refactoring, because that would lead to duplicating the code to check
> whether two given hash bound collections are equal in
> partition_bounds_equal() and partition_hash_bounds_merge() that will
> be added by the main patch, after all.  To avoid that, how about
> calling partition_bounds_equal() from partition_hash_bounds_merge() in
> the main patch, like the attached?


Agree, your changes look good to me, thanks for working on it.

Regards,
Amul


Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )

2019-07-19 Thread Sergei Kornilov
Hi

Oh. Replication slot names can currently contain only a-z0-9_ characters, so
we cannot actually write such a recovery.conf; pg_basebackup will stop before
that. But performing escape_quotes on a string and not using the result is an
error anyway.

regards, Sergei




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-19 Thread Antonin Houska
Tomas Vondra  wrote:

> On Mon, Jul 15, 2019 at 03:42:39PM -0400, Bruce Momjian wrote:
> >On Sat, Jul 13, 2019 at 11:58:02PM +0200, Tomas Vondra wrote:
> >> One extra thing we should consider is authenticated encryption. We can't
> >> just encrypt the pages (no matter which AES mode is used - XTS/CBC/...),
> >> as that does not provide integrity protection (i.e. can't detect when
> >> the ciphertext was corrupted due to disk failure or intentionally). And
> >> we can't quite rely on checksums, because that checksums the plaintext
> >> and is stored encrypted.
> >
> >Uh, if someone modifies a few bytes of the page, we will decrypt it, but
> >the checksum (per-page or WAL) will not match our decrypted output.  How
> >would they make it match the checksum without already knowing the key.
> >I read [1] but could not see that explained.
> >
> 
> Our checksum is only 16 bits, so perhaps one way would be to just
> generate 64k of randomly modified pages and hope one of them happens to
> hit the right checksum value. Not sure how practical such attack is, but
> it does require just filesystem access.

I don't think you can easily generate 64k of different checksums this way. If
the data is random, I suppose that each set of 2^(128 - 16) blocks will
contain the same checksum after decryption. Thus even if you generate 64k of
different ciphertext blocks that contain the checksum, some (many?) checksums
will be duplicates. Unfortunately the math to describe this problem does not
seem to be trivial.

Also note that if you try to generate ciphertext, decryption of which will
result in particular value of checksum, you can hardly control the other 14
bytes of the block, which in turn are used to verify the checksum.

> FWIW our CRC algorithm is not quite HMAC, because it's neither keyed nor
> a cryptographic hash algorithm. Now, maybe we don't want authenticated
> encryption (e.g. XTS is not authenticated, unlike GCM/CCM).

I'm also not sure if we should try to guarantee data authenticity /
integrity. As someone already mentioned elsewhere, page MAC does not help if
the whole page is replaced. (An extreme case is that old filesystem snapshot
containing the whole data directory is restored, although that will probably
make the database crash soon.)

We can guarantee integrity and authenticity of backup, but that's a separate
feature: someone may need this although it's o.k. for him to run the cluster
unencrypted.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Compile from source using latest Microsoft Windows SDK

2019-07-19 Thread Michael Paquier
On Fri, Jul 19, 2019 at 03:39:49PM +0800, Peifeng Qiu wrote:
> I updated the patch to only include the WindowsTargetPlatformVersion node
> if WindowsSDKVersion is present.  I can confirm that this issue no
> longer exists for VS2019. So only VS2017 is problematic.

(Could you please avoid to top-post?)
Ugh.  That's one I don't have at hand.

> I'm also very curious on how hamerkop and bowerbird build postgres with
> VS2017.  Looks like hamerkop and bowerbird both exist before VS2017
> and maybe they get SDK v8.1 from previous VS installations. I will
> contact admin of hamerkop and bowerbird and see if that's the case.
> As of now I can still encounter the same issue with fresh installed
> Windows Server 2016 and VS2017, on both azure and google cloud. So
> better to patch the build system anyway.

I guess so but I cannot confirm myself.  I am adding Andrew and Hari
in CC to double-check if they have seen this problem or not.  Hari has
also worked on porting VS 2017 and 2019 in the MSVC scripts in the
tree.
--
Michael


signature.asc
Description: PGP signature


Re: SQL/JSON path issues/questions

2019-07-19 Thread Alexander Korotkov
On Thu, Jul 18, 2019 at 5:08 PM Thom Brown  wrote:
> On Tue, 16 Jul 2019 at 19:44, Alexander Korotkov
>  wrote:
> >
> > On Tue, Jul 16, 2019 at 9:22 PM Thom Brown  wrote:
> > > Now I'm looking at the @? and @@ operators, and getting a bit
> > > confused.  This following query returns true, but I can't determine
> > > why:
> > >
> > > # SELECT '{"a":[1,2,3,4,5]}'::jsonb @? '$.b == "hello"'::jsonpath;
> > >  ?column?
> > > --
> > >  t
> > > (1 row)
> > >
> > > "b" is not a valid item, so there should be no match.  Perhaps it's my
> > > misunderstanding of how these operators are supposed to work, but the
> > > documentation is quite terse on the behaviour.
> >
> > So, the result of jsonpath evaluation is single value "false".
> >
> > # SELECT jsonb_path_query_array('{"a":[1,2,3,4,5]}'::jsonb, '$.b == 
> > "hello"');
> >  jsonb_path_query_array
> > 
> >  [false]
> > (1 row)
> >
> > @@ operator checks that result is "true".  This is why it returns "false".
> >
> > @? operator checks if result is not empty.  So, it's single "false"
> > value, not empty list.  This is why it returns "true".
> >
> > Perhaps, we need to clarify this in docs providing more explanation.
>
> Understood.  Thanks.
>
> Also, is there a reason why jsonb_path_query doesn't have an operator analog?

The point of existing operator analogues is index support.  We
introduced operators for searches we can accelerate using GIN indexes.

jsonb_path_query() doesn't return bool.  So, even if we have an
operator for that, it wouldn't get index support.

However, we can discuss introduction of operator analogues for other
functions as syntax sugar.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: POC: Cleaning up orphaned files using undo logs

2019-07-19 Thread Amit Kapila
On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar  wrote:
>
> On Thu, Jul 11, 2019 at 12:38 AM Robert Haas  wrote:
> >
> > I don't like the fact that undoaccess.c has a new global,
> > undo_compression_info.  I haven't read the code thoroughly, but do we
> > really need that?  I think it's never modified (so it could just be
> > declared const),
>
> Actually, this will get modified otherwise across undo record
> insertion how we will know what was the values of the common fields in
> the first record of the page.  Another option could be that every time
> we insert the record, read the value from the first complete undo
> record on the page but that will be costly because for every new
> insertion we need to read the first undo record of the page.
>

This information won't be shared across transactions, so can't we keep
it in the top transaction's state?  It seems to me that would be better
than maintaining it as global state.

Few more comments on this patch:
1.
PrepareUndoInsert()
{
..
+ if (logswitched)
+ {
..
+ }
+ else
+ {
..
+ resize = true;
..
+ }
+
..
+
+ do
+ {
+ bufidx = UndoGetBufferSlot(context, rnode, cur_blk, rbm);
..
+ rbm = RBM_ZERO;
+ cur_blk++;
+ } while (cur_size < size);
+
+ /*
+ * Set/overwrite compression info if required and also exclude the common
+ * fields from the undo record if possible.
+ */
+ if (UndoSetCommonInfo(compression_info, urec, urecptr,
+   context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf))
+ resize = true;
+
+ if (resize)
+ size = UndoRecordExpectedSize(urec);

I see that some of the cases where a resize is possible are checked
before buffer allocation and some after.  Isn't it better to do all
these checks before buffer allocation?  Also, isn't it better to even
compute the changed size before buffer allocation, as that might
sometimes help in allocating fewer buffers?

Can you find a better way to write
:context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf)?
 It makes the line too long and difficult to understand.  Check for
similar instances in the patch and if possible, change them as well.

2.
+InsertPreparedUndo(UndoRecordInsertContext *context)
{
..
/*
+ * Try to insert the record into the current page. If it
+ * doesn't succeed then recall the routine with the next page.
+ */
+ InsertUndoData(&ucontext, page, starting_byte);
+ if (ucontext.stage == UNDO_PACK_STAGE_DONE)
+ {
+ MarkBufferDirty(buffer);
+ break;
+ }
+ MarkBufferDirty(buffer);
..
}

Can't we call MarkBufferDirty(buffer) just before 'if' check?  That
will avoid calling it twice.
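
In other words, something along these lines (a sketch reusing the names from
the quoted hunk, not the patch's actual code):

InsertUndoData(&ucontext, page, starting_byte);
MarkBufferDirty(buffer);        /* dirty the buffer exactly once */
if (ucontext.stage == UNDO_PACK_STAGE_DONE)
    break;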

3.
+ * Later, during insert phase we will write actual records into thse buffers.
+ */
+struct PreparedUndoBuffer

/thse/these

4.
+ /*
+ * If we are writing first undo record for the page the we can set the
+ * compression so that subsequent records from the same transaction can
+ * avoid including common information in the undo records.
+ */
+ if (first_complete_undo)

/page the we/page then we

5.
PrepareUndoInsert()
{
..
After
+ * allocation We'll only advance by as many bytes as we turn out to need.
+ */
+ UndoRecordSetInfo(urec);

Change the beginning of comment as: "After allocation, we'll .."

6.
PrepareUndoInsert()
{
..
* TODO:  instead of storing this in the transaction header we can
+ * have separate undo log switch header and store it there.
+ */
+ prevlogurp =
+ MakeUndoRecPtr(UndoRecPtrGetLogNo(prevlog_insert_urp),
+(UndoRecPtrGetOffset(prevlog_insert_urp) - prevlen));
+

I don't think this TODO is valid anymore because now the patch has a
separate log-switch header.

7.
/*
+ * If undo log is switched then set the logswitch flag and also reset the
+ * compression info because we can use same compression info for the new
+ * undo log.
+ */
+ if (UndoRecPtrIsValid(prevlog_xact_start))

/can/can't

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: partition routing layering in nodeModifyTable.c

2019-07-19 Thread Amit Langote
On Thu, Jul 18, 2019 at 4:50 PM Amit Langote  wrote:
> On Thu, Jul 18, 2019 at 2:53 PM Andres Freund  wrote:
> > On 2019-07-18 14:24:29 +0900, Amit Langote wrote:
> > > On Thu, Jul 18, 2019 at 10:09 AM Andres Freund  wrote:
> > Or perhaps the actually correct fix is to remove es_result_relation_info
> > alltogether, and just pass it down the places that need it - we've a lot
> > more code setting it than using the value.  And it'd not be hard to
> > actually pass it to the places that read it.  Given all the
> > setting/resetting of it it's pretty obvious that a query-global resource
> > isn't the right place for it.
>>
> > > Would you like me to write a patch for some or all items?
> >
> > Yes, that would be awesome.
>
> OK, I will try to post a patch soon.

Attached are two patches.

The first one (0001) deals with reducing the core executor's reliance
on es_result_relation_info to access the currently active result
relation, in favor of receiving it from the caller as a function
argument.  So no piece of core code relies on it being correctly set
anymore.  It still needs to be set correctly for the third-party code
such as FDWs.  Also, because the partition routing related suggestions
upthread are closely tied into this, especially those around
ExecInsert(), I've included them in the same patch.  I chose to keep
the function ExecPrepareTupleRouting, even though it's now only called
from ExecInsert(), to preserve the readability of the latter.

The second patch (0002) implements some rearrangement of the UPDATE
tuple movement code, addressing point 2 in the first email of
this thread.  Mainly, the block of code in ExecUpdate() that implements
row movement proper has been moved into a function called ExecMove().
It also contains the cosmetic improvements suggested in point 4.

Thanks,
Amit


v1-0001-Reduce-es_result_relation_info-usage.patch
Description: Binary data


v1-0002-Rearrange-partition-update-row-movement-code-a-bi.patch
Description: Binary data


[PATCH] minor bugfix for pg_basebackup (9.6 ~ )

2019-07-19 Thread Ian Barwick
Hi

In pg_basebackup's GenerateRecoveryConf() function, the value for
"primary_slot_name" is escaped, but the original, non-escaped value
is written. See attached patch.

This has been present since the code was added in 9.6 (commit 0dc848b0314).

Regards

Ian Barwick

-- 
  Ian Barwick   https://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


pg_basebackup-generate-recovery-conf.v1.patch
Description: Binary data


Re: Compile from source using latest Microsoft Windows SDK

2019-07-19 Thread Peifeng Qiu
Hi Michael. Thanks for your review.
I updated the patch to only include the WindowsTargetPlatformVersion node
if WindowsSDKVersion is present.
I can confirm that this issue no longer exists for VS2019. So only VS2017
is problematic.

I'm also very curious on how hamerkop and bowerbird build postgres with
VS2017.
Looks like hamerkop and bowerbird both exist before VS2017 and maybe they
get SDK v8.1 from previous
VS installations. I will contact admin of hamerkop and bowerbird and see if
that's the case.
As of now I can still encounter the same issue with fresh installed Windows
Server 2016 and
VS2017, on both azure and google cloud. So better to patch the build system
anyway.

Peifeng Qiu
Best regards,

On Thu, Jul 18, 2019 at 4:09 PM Michael Paquier  wrote:

> Hi Peifeng,
>
> On Fri, Mar 29, 2019 at 12:01:26AM +0900, Peifeng Qiu wrote:
> > The current Windows build system supports compiling with Windows SDK up
> to
> > v8.1. When building with the latest Windows SDK v10 which is the default
> > for Visual Studio 2017, we will get the following error:
> >
> > error MSB8036: The Windows SDK version 8.1 was not found.
>
> Actually up to 10, no?  Sorry for the delay, I have just noticed this
> patch registered in the commit fest.  And now is review time.
>
> > When the build system generates projects files for MSBuild to consume, it
> > doesn't include a SDK version number. Then MSBuild will assume v8.1 as
> > default.
> > But if we only install the latest v10 but not v8.1, MSBuild will error
> out.
>
> So...  This actually boils down to that behavior:
>
> https://developercommunity.visualstudio.com/content/problem/140294/windowstargetplatformversion-makes-it-impossible-t.html
>
> While WindowsSDKVersion seems to be present all the time.  I think
> that we should be more defensive if the variable is not defined, and
> instead rely on the default provided by the system, whatever it may
> be.  In short it seems to me that the tag WindowsTargetPlatformVersion
> should be added only if the variable exists, and your patch always
> sets it.
>
> For anything with Postgres on Windows, I have been using Visual Studio
> 2015 and 2019 lately to compile Postgres mainly with the Native Tools
> command prompt so I have never actually faced this failure even with
> the most recent VS 2019.  Using just a command prompt causes a failure
> when finding out nmake for example as that's not in the default PATH.
> Our buildfarm members don't complain either, and there are two animals
> using VS 2017: hamerkop and bowerbird.
> --
> Michael
>


compile-latest-win-sdk-v2.patch
Description: Binary data


Re: pg_receivewal documentation

2019-07-19 Thread Laurenz Albe
On Fri, 2019-07-19 at 10:27 +0900, Michael Paquier wrote:
> On Thu, Jul 18, 2019 at 08:40:36AM -0400, Jesper Pedersen wrote:
> > On 7/18/19 1:29 AM, Michael Paquier wrote:
> > > Or more simply like that?
> > > "Note that while WAL will be flushed with this setting,
> > > pg_receivewal never applies it, so synchronous_commit must not be set
> > > to remote_apply if pg_receivewal is a synchronous standby, be it a
> > > member of a priority-based (FIRST) or a quorum-based (ANY) synchronous
> > > replication setup."
> > 
> > Yeah, better.
> 
> I was looking into committing that, and the part about
> synchronous_commit = on is not right.  The location of the warning is
> also harder to catch for the reader, so instead let's move it to the
> top where we have an extra description for --synchronous.  I am
> finishing with the attached that I would be fine to commit and
> back-patch as needed.  Does that sound fine?

It was my first reaction too that this had better be at the top.

I'm happy with the patch as it is.

Yours,
Laurenz Albe





Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-19 Thread Julien Rouhaud
On Fri, Jul 19, 2019 at 2:35 AM Michael Paquier  wrote:
>
> On Thu, Jul 18, 2019 at 09:45:14AM +0900, Michael Paquier wrote:
> > On Wed, Jul 17, 2019 at 07:46:10PM +0200, Julien Rouhaud wrote:
> >> Is it ok to call pg_free(slots) and let caller have a pointer pointing
> >  to freed memory?
> >
> > The interface has a Setup call which initializes the whole thing, and
> > Terminate is the logical end point, so having the free logic within
> > the termination looks more consistent to me.  We could now have actual
> > Init() and Free() but I am not sure that this justifies the move as
> > this complicates the scripts using it.
>
> I have reconsidered this point, moved the pg_free() call out of the
> termination logic, and committed the first patch after an extra lookup
> and more polishing.

Thanks!

> For the second patch, could you send a rebase with a fix for the
> connection slot when processing the reindex commands?

Attached, I also hopefully removed all the now unneeded progname usage.


reindex_parallel_v5.diff
Description: Binary data


RE: Multivariate MCV list vs. statistics target

2019-07-19 Thread Jamison, Kirk
On Tuesday, July 9, 2019, Tomas Vondra wrote:
> >apparently v1 of the ALTER STATISTICS patch was a bit confused about
> >length of the VacAttrStats array passed to statext_compute_stattarget,
> >causing segfaults. Attached v2 patch fixes that, and it also makes sure
> >we print warnings about ignored statistics only once.
> >
> 
> v3 of the patch, adding pg_dump support - it works just like when you tweak
> statistics target for a column, for example. When the value stored in the
> catalog is not -1, pg_dump emits a separate ALTER STATISTICS command setting
> it (for the already created statistics object).
> 

Hi Tomas, I stumbled upon your patch.

According to the CF bot, your patch applies cleanly, builds successfully, and
passes make world. Meaning, the pg_dump tap test passed, but there was no
test for the new SET STATISTICS yet. So you might want to add a regression
test for that and integrate it in the existing alter_generic file.

Upon quick read-through, the syntax and docs are correct because it's similar
to the format of ALTER TABLE/INDEX... SET STATISTICS... :
ALTER [ COLUMN ] column_name SET STATISTICS integer

+   /* XXX What if the target is set to 0? Reset the statistic? */

This also makes me wonder. I haven't looked deeply into the code, but since 0 is
a valid value, I believe it should reset the stats.
Looking it up, though, this is how it's tested for ALTER TABLE:
/test/regress/sql/stats_ext.sql:-- Ensure things work sanely with SET STATISTICS 0

> I've considered making it part of CREATE STATISTICS itself, but it seems a
> bit cumbersome and we don't do it for columns either. I'm not against adding
> it in the future, but at this point I don't see a need.

I agree. Perhaps that's for another patch should you decide to add it in the 
future.

Regards,
Kirk Jamison