Re: tab completion of IMPORT FOREIGN SCHEMA

2020-09-14 Thread Michael Paquier
On Mon, Aug 17, 2020 at 02:15:34PM +0900, Michael Paquier wrote:
> Sounds fine to me as well.  The LIMIT TO and EXCEPT clauses are
> optional, so using TailMatches() looks fine.
> 
> +   else if (TailMatches("FROM", "SERVER", MatchAny, "INTO", MatchAny))
> +   COMPLETE_WITH("OPTIONS")
> Shouldn't you complete with "OPTIONS (" here?
> 
> It would be good to complete with "FROM SERVER" after specifying
> EXCEPT or LIMIT TO; you can just use "(*)" to include the list of
> tables in the list of elements checked.

I have completed the patch with those parts as per the attached.  If
there are any objections or extra opinions, please feel free to share them.
--
Michael
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index f41785f11c..9c6f5ecb6a 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -3293,6 +3293,17 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("FOREIGN SCHEMA");
 	else if (Matches("IMPORT", "FOREIGN"))
 		COMPLETE_WITH("SCHEMA");
+	else if (Matches("IMPORT", "FOREIGN", "SCHEMA", MatchAny))
+		COMPLETE_WITH("EXCEPT (", "FROM SERVER", "LIMIT TO (");
+	else if (TailMatches("LIMIT", "TO", "(*)") ||
+			 TailMatches("EXCEPT", "(*)"))
+		COMPLETE_WITH("FROM SERVER");
+	else if (TailMatches("FROM", "SERVER", MatchAny))
+		COMPLETE_WITH("INTO");
+	else if (TailMatches("FROM", "SERVER", MatchAny, "INTO"))
+		COMPLETE_WITH_QUERY(Query_for_list_of_schemas);
+	else if (TailMatches("FROM", "SERVER", MatchAny, "INTO", MatchAny))
+		COMPLETE_WITH("OPTIONS (");
 
 /* INSERT --- can be inside EXPLAIN, RULE, etc */
 	/* Complete INSERT with "INTO" */




Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-09-14 Thread Andres Freund
On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote:
> What problem do you see if we set xmax to InvalidTransactionId and set
> the HEAP_XMAX_INVALID flag in the infomask?

1) It'll make a dead tuple appear live. You cannot do this for tuples
   with an xid below the horizon.
2) it'll break HOT chain following / indexes.




Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-09-14 Thread Dilip Kumar
On Tue, Sep 15, 2020 at 2:35 AM Andres Freund  wrote:
>
> Hi,
>
> On 2020-09-14 17:00:48 -0400, Robert Haas wrote:
> > On Mon, Sep 14, 2020 at 4:13 PM Andres Freund  wrote:
> > > My understanding of the case we're discussing is that it's corruption
> > > (e.g. relfrozenxid being different than table contents) affecting a HOT
> > > chain. I.e. by definition all within a single page.  We won't have
> > > modified part of it independent of B < A, because freezing is
> > > all-or-nothing.  Just breaking the HOT chain into two or something like
> > > that will just make things worse, because indexes won't find tuples, and
> > > because reindexing might then get confused e.g. by HOT chains without a
> > > valid start, or by having two visible tuples for the same PK.
> >
> > If we adopt the proposal made by Dilip, we will not do that. We must
> > have a.xmax = b.xmin, and that value is either less than relfrozenxid
> > or it is not. If we skip an entire tuple because one XID is bad, then
> > we could break the HOT chain when a.xmin is bad and the remaining
> > values are OK. But if we decide separately for xmin and xmax then we
> > should be alright.
>
> I thought I precisely addressed this case:
>
> > What exactly are you going to put into xmin/xmax here? And how would
> > anything you put into the first tuple not break index lookups? There's
> > no such thing as a frozen xmax (so far), so what are you going to put
> > in there? A random different xid?  FrozenTransactionId?
> > HEAP_XMAX_INVALID?
>
> What am I missing?

What problem do you see if we set xmax to InvalidTransactionId and set
the HEAP_XMAX_INVALID flag in the infomask?  I mean, even now, if the xmax
is older than the cutoff xid then we do the same thing, i.e.
if (freeze_xmax)
{
..
frz->xmax = InvalidTransactionId;
..
frz->t_infomask &= ~HEAP_XMAX_BITS;
frz->t_infomask |= HEAP_XMAX_INVALID;
frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
changed = true;
}

So if we do that, it will not be part of the HOT chain anymore.  I
might be missing something, but I could not see how it can be more broken
than it already is without our change.  I agree that in the case of a
corrupted xmin it can now mark the tuple with HEAP_XMAX_INVALID without
freezing the xmin, but that is a valid status for a tuple anyway.

However, if we think it can still cause some issues then I feel that
we can skip the whole page, as Robert suggested.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Gripes about walsender command processing

2020-09-14 Thread Michael Paquier
On Mon, Sep 14, 2020 at 11:34:44PM -0400, Tom Lane wrote:
> (I don't quite follow your comment about repslot drop, btw.)

There is already a command tag equivalent to DROP_REPLICATION_SLOT:
$ git grep REPLICATION -- */cmdtaglist.h
src/include/tcop/cmdtaglist.h:PG_CMDTAG(CMDTAG_DROP_REPLICATION_SLOT,
"DROP REPLICATION SLOT", false, false, false)
--
Michael




Re: PG 13 release notes, first draft

2020-09-14 Thread Peter Eisentraut

On 2020-09-09 22:57, Jonathan S. Katz wrote:

+
+ 
+  Parallelized vacuuming of B-tree indexes
+ 
+


I don't think B-tree indexes are relevant here.  AFAICT, this feature 
applies to all indexes.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_restore causing deadlocks on partitioned tables

2020-09-14 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Sep-14, Tom Lane wrote:
>> AFAICS, it is utterly silly for InitResultRelInfo to be forcing
>> a partition qual to be computed when we might not need it.
>> We could flush ResultRelInfo.ri_PartitionCheck altogether and
>> have anything that was reading it instead do
>> RelationGetPartitionQual(ResultRelInfo.ri_RelationDesc).

> Hmm, but I presume we don't want to compute it every time.  I suggest we
> would still have it, but we'd only computed it when first used.

RelationGetPartitionQual already does that caching.

regards, tom lane




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-09-14 Thread Amit Kapila
On Tue, Sep 15, 2020 at 8:38 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Mon, Sep 14, 2020 at 9:42 PM Tom Lane  wrote:
> >> BTW, unless someone has changed the behavior of dynahash when I
> >> wasn't looking, those MemoryContextSwitchTos shown above are useless.
>
> > As far as I can see they are useless in this case but I think they
> > might be required in case the user provides its own allocator function
> > (using HASH_ALLOC). So, we can probably remove those from here?
>
> You could imagine writing a HASH_ALLOC allocator whose behavior
> varies depending on CurrentMemoryContext, but it seems like a
> pretty foolish/fragile way to do it.  In any case I can think of,
> the hash table lives in one specific context and you really
> really do not want parts of it spread across other contexts.
> dynahash.c is not going to look kindly on pieces of what it
> is managing disappearing from under it.
>

I agree that doesn't make sense.  I have addressed all the comments
discussed above in the attached patch.

-- 
With Regards,
Amit Kapila.


v2-0001-Fix-initialization-of-RelationSyncEntry-for-strea.patch
Description: Binary data


Re: I'd like to discuss scaleout at PGCon

2020-09-14 Thread MauMau
Hello all,

# I'm resending because some error occurred

I've overhauled the scaleout design wiki I presented at PGCon 2018
developer unconference and assembled the research of other DBMSs'
scale-out features.

Scaleout Design
https://wiki.postgresql.org/wiki/Scaleout_Design

I have intentionally drawn few conclusions about our specification and
design.  I'd like you to look at recent distributed databases, and
then think about and discuss together what we want to aim for.  I feel
it's better to have a separate thread per topic or group of topics.

I'm sorry, but I'm not confident about the readability at all, because
I cannot draw figures due to my visual impairment, and the page is
full of text only.

What workload do you think we should focus on first, OLTP or
analytics?  I think OLTP, because open source Postgres has so far
probably gained popularity mostly for OLTP.  Also, I don't expect many
people to build their analytics databases on public IaaS or on-premises
rather than use existing popular data warehousing SaaS like Amazon
Redshift, Azure Synapse, Google BigQuery and Snowflake.


Regards
MauMau





Re: Parallelize stream replication process

2020-09-14 Thread Bharath Rupireddy
On Tue, Sep 15, 2020 at 9:27 AM Li Japin  wrote:
>
> For now, postgres uses a single process to send, receive and replay the WAL
> when we use streaming replication.  Is there any point in parallelizing this
> process?  If so, how do we start?
>
> Any thoughts?
>

I think we must ask a few questions:

1. What's the major gain we get out of this? Is it that the time to
stream gets reduced, or something else?
If the answer to the above point is something solid, then:
2. How do we distribute the work to multiple processes?
3. Do we need all of the workers to maintain the order in which they
read the WAL files (on the publisher) and apply the changes (on the
subscriber)?
4. Do we want to map the sender/publisher workers to
receiver/subscriber workers on a one-to-one basis? If not, how do we
do it?
5. How do the sender and receiver workers communicate?
6. What if we have multiple subscribers/receivers?

I'm no expert in replication, so I may be wrong as well. Others may have
better thoughts.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com





Re: pg_restore causing deadlocks on partitioned tables

2020-09-14 Thread Alvaro Herrera
On 2020-Sep-14, Tom Lane wrote:

> > (1) TRUNCATE on a partition tries to get AccessShareLock on the parent;
> 
> The reason for this is that
> 
> (a) ExecuteTruncateGuts calls InitResultRelInfo, because it might
> need that to fire TRUNCATE triggers for the child relation.

Hmm, this seems legitimate, but of course we don't need the partition
qual.  So the reported bug would be solved with just the change to avoid
loading ri_PartitionExpr until needed.

> AFAICS, it is utterly silly for InitResultRelInfo to be forcing
> a partition qual to be computed when we might not need it.
> We could flush ResultRelInfo.ri_PartitionCheck altogether and
> have anything that was reading it instead do
> RelationGetPartitionQual(ResultRelInfo.ri_RelationDesc).

Hmm, but I presume we don't want to compute it every time.  I suggest we
would still have it, but we'd only computed it when first used.

> Actually it looks like most of the places reading it are
> just interested in non-nullness; can't those be nuked from
> orbit in favor of testing rel->rd_rel->relispartition?
> There's no such thing as a partition with an empty partition
> qual is there?  (Or even if it's possible, do we care about
> optimizing the case?)

Actually, there is one such case -- when the default partition is the
only partition, its constraint is empty.  This has caused at least one
bug.  Maybe it'd be better if we used something like constant true
instead ... since we're not likely to care much about the performance of
that case.  But I don't think that would change this patch any.

> > (2) ALTER TABLE ONLY ... ADD CONSTRAINT on a partition root tries to get
> > AccessExclusiveLock on all child partitions, despite the ONLY.
> 
> The cause of this seems to be that ATPrepSetNotNull is too dumb to
> avoid recursing to all the child tables when the parent is already
> attnotnull.  Or is there a reason we have to recurse anyway?

Hmm, looking at ATExecSetNotNull, we invoke the PostAlter hook even when
there's no change, so if we suppressed the recursion early, that behavior
would change.  But I doubt we actually care.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Parallelize stream replication process

2020-09-14 Thread Li Japin
Hi, hackers

For now, postgres uses a single process to send, receive and replay the WAL
when we use streaming replication.  Is there any point in parallelizing this
process?  If so, how do we start?

Any thoughts?

Best regards

Japin Li



Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Peter Geoghegan
On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis  wrote:
> Sure. Will backporting either patch into REL_13_STABLE now interfere
> with RC1 release in any way?

The RMT will discuss this.

It would help if there was a patch ready to go.

Thanks
-- 
Peter Geoghegan




Re: Gripes about walsender command processing

2020-09-14 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Sep 14, 2020 at 12:51:39PM -0400, Tom Lane wrote:
>> (Is there a better way to get the name of a replication command?
>> I didn't look very hard for one.)

> Wouldn't that just be in cmdtaglist.h, but extended for the nodes used for
> replication commands?  Then you could just call CreateCommandTag() to
> get the command string to print, as postgres.c does.  There is already
> one for repslot drop, in some way.  This would have the advantage of
> needing only one set_ps_display() call before the switch over command types.

Mmm, not sure whether having CreateCommandTag know about replication
commands is a good thing or not.  We certainly could do it like that,
since there's only one namespace of NodeTag values, but conceptually
it feels a bit weird to me.  There's a lot of other infrastructure
for SQL command nodes that we're surely never going to build out for
replication commands, so should we do it in CreateCommandTag?

Anybody else have an opinion about it?

(I don't quite follow your comment about repslot drop, btw.)

regards, tom lane




Re: Fix for parallel BTree initialization bug

2020-09-14 Thread Tom Lane
Amit Kapila  writes:
> On Mon, Sep 14, 2020 at 11:26 PM Peter Geoghegan  wrote:
>> The fix seems sensible to me.

> Thanks.  I think it is better to wait for a day or two, as we stamped 13
> only yesterday and we need to backpatch this.

Right, please avoid pushing anything non-critical to REL_13_STABLE
until you see the git tag appear.  I doubt we will need to re-wrap
the tarballs, but you never know.

regards, tom lane




Re: Gripes about walsender command processing

2020-09-14 Thread Michael Paquier
On Mon, Sep 14, 2020 at 12:51:39PM -0400, Tom Lane wrote:
> I looked into the other point about ps status always claiming the
> walsender is "idle".  This turns out to be something PostgresMain
> does automatically, so unless we want to uglify that logic, we have
> to make walsenders set the field honestly.  So I propose the attached.
> (Is there a better way to get the name of a replication command?
> I didn't look very hard for one.)

Wouldn't that just be in cmdtaglist.h, but extended for the nodes used for
replication commands?  Then you could just call CreateCommandTag() to
get the command string to print, as postgres.c does.  There is already
one for repslot drop, in some way.  This would have the advantage of
needing only one set_ps_display() call before the switch over command types.
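
A rough sketch of what I mean, untested, and assuming CreateCommandTag()
were taught about the replication command nodes (variable names are only
illustrative):

CommandTag	cmdtag = CreateCommandTag(cmd_node);

/* report the replication command once, before the switch over command types */
set_ps_display(GetCommandTagName(cmdtag));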
--
Michael




Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

2020-09-14 Thread Michael Paquier
On Mon, Sep 14, 2020 at 10:56:01PM -0400, Tom Lane wrote:
> (Note that we disallow sub-queries in CHECK constraints, and also
> disclaim responsibility for what happens if you cheat by hiding
> the subquery in a function.  So while it's certainly possible to
> build CHECK constraints that only work if table X is loaded before
> table Y, pg_dump already doesn't guarantee that'll work, --data-only
> or otherwise.)

Yep, exactly what I was thinking upthread: cheating with a schema
that has cross-table references in a check constraint.
--
Michael




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-09-14 Thread Tom Lane
Amit Kapila  writes:
> On Mon, Sep 14, 2020 at 9:42 PM Tom Lane  wrote:
>> BTW, unless someone has changed the behavior of dynahash when I
>> wasn't looking, those MemoryContextSwitchTos shown above are useless.

> As far as I can see they are useless in this case but I think they
> might be required in case the user provides its own allocator function
> (using HASH_ALLOC). So, we can probably remove those from here?

You could imagine writing a HASH_ALLOC allocator whose behavior
varies depending on CurrentMemoryContext, but it seems like a
pretty foolish/fragile way to do it.  In any case I can think of,
the hash table lives in one specific context and you really
really do not want parts of it spread across other contexts.
dynahash.c is not going to look kindly on pieces of what it
is managing disappearing from under it.

(To be clear, objects that the hash entries contain pointers to
are a different question.  But the hash entries themselves have
to have exactly the same lifespan as the hash table.)
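
For reference, the conventional way to pin a dynahash and its entries to
one specific context is to pass HASH_CONTEXT at creation time, roughly like
this (a from-memory sketch, not a quote of the actual pgoutput.c code):

HASHCTL		ctl;

MemSet(&ctl, 0, sizeof(ctl));
ctl.keysize = sizeof(Oid);
ctl.entrysize = sizeof(RelationSyncEntry);
ctl.hcxt = CacheMemoryContext;	/* table and entries live here */

RelationSyncCache = hash_create("logical replication output relation cache",
								128, &ctl,
								HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

With that, the allocator never needs to consult CurrentMemoryContext at all.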

regards, tom lane




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Jeff Davis
On Mon, 2020-09-14 at 19:29 -0700, Peter Geoghegan wrote:
> Let's assume that we'll teach LogicalTapeSetBlocks() to use
> nBlocksWritten in place of nBlocksAllocated in all cases, as now
> seems
> likely. Rather than asserting "nBlocksWritten == nBlocksAllocated"
> inside LogicalTapeSetBlocks() (as I suggested earlier at one point),
> we could instead teach LogicalTapeSetBlocks() to iterate through each
> tape from the tapeset and make sure each tape has no writes buffered
> (so everything must be flushed). We could add a loop that would only
> be used on assert-enabled builds.

Sounds reasonable.

> You suggested this yourself, Jeff (my suggestion about the assertion
> is just an expansion on your suggestion from earlier). This all seems
> like a good idea to me. Can you write a patch that adjusts
> LogicalTapeSetBlocks() along these lines? Hopefully the assertion
> loop
> thing won't reveal some other problem with this plan.

Sure. Will backporting either patch into REL_13_STABLE now interfere
with RC1 release in any way?

Regards,
Jeff Davis






Re: Improving connection scalability: GetSnapshotData()

2020-09-14 Thread Andres Freund
Hi,

On 2020-09-15 11:56:24 +0900, Michael Paquier wrote:
> On Mon, Sep 14, 2020 at 05:42:51PM -0700, Andres Freund wrote:
> > My test uses IPC::Run - although I'm indirectly 'use'ing, which I guess
> > isn't pretty. Just as 013_crash_restart.pl already did (even before
> > psql/t/010_tab_completion.pl). I am mostly wondering whether we could
> > avoid copying the utility functions into multiple test files...
> > 
> > Does IO::Pty work on windows? Given that currently the test doesn't use
> > a pty and that there's no benefit I can see in requiring one, I'm a bit
> > hesitant to go there?
> 
> Per https://metacpan.org/pod/IO::Tty:
> "Windows is now supported, but ONLY under the Cygwin environment, see
> http://sources.redhat.com/cygwin/."
> 
> So I would suggest making this a soft dependency (as Tom is
> hinting?), and not worrying about Windows specifically.  It is not as if
> what we are dealing with here is specific to Windows anyway, so you
> would already have sufficient coverage.  I would not mind if any
> refactoring is done later, once we know that the proposed test is
> stable in the buildfarm, as we would then have a better picture of which
> parts of the facility overlap across multiple tests.

I'm confused - the test as posted should work on Windows, and we already
do this in an existing test (src/test/recovery/t/013_crash_restart.pl).  What's
the point in adding a platform-specific dependency here?

Greetings,

Andres Freund




Re: Fix for parallel BTree initialization bug

2020-09-14 Thread Amit Kapila
On Mon, Sep 14, 2020 at 11:26 PM Peter Geoghegan  wrote:
>
> On Mon, Sep 14, 2020 at 5:37 AM Amit Kapila  wrote:
> > I am planning to push this tomorrow after doing testing on
> > back-branches. Let me know if you have any comments.
>
> The fix seems sensible to me.
>

Thanks.  I think it is better to wait for a day or two, as we stamped 13
only yesterday and we need to backpatch this.

-- 
With Regards,
Amit Kapila.




Re: Improving connection scalability: GetSnapshotData()

2020-09-14 Thread Michael Paquier
On Mon, Sep 14, 2020 at 05:42:51PM -0700, Andres Freund wrote:
> My test uses IPC::Run - although I'm indirectly 'use'ing, which I guess
> isn't pretty. Just as 013_crash_restart.pl already did (even before
> psql/t/010_tab_completion.pl). I am mostly wondering whether we could
> avoid copying the utility functions into multiple test files...
> 
> Does IO::Pty work on windows? Given that currently the test doesn't use
> a pty and that there's no benefit I can see in requiring one, I'm a bit
> hesitant to go there?

Per https://metacpan.org/pod/IO::Tty:
"Windows is now supported, but ONLY under the Cygwin environment, see
http://sources.redhat.com/cygwin/."

So I would suggest making this a soft dependency (as Tom is
hinting?), and not worrying about Windows specifically.  It is not as if
what we are dealing with here is specific to Windows anyway, so you
would already have sufficient coverage.  I would not mind if any
refactoring is done later, once we know that the proposed test is
stable in the buildfarm, as we would then have a better picture of which
parts of the facility overlap across multiple tests.
--
Michael




Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

2020-09-14 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Jul 14, 2020 at 11:14:50AM +0200, Julien Rouhaud wrote:
>> Note that those extraneous queries were found while trying to dump
>> data out of a corrupted database.  The issue wasn't an excessive
>> runtime but corrupted catalog entries, so bypassing this code (since I
>> was only interested in the data anyway) was simpler than trying to
>> recover yet other corrupted rows.

> Yeah, I actually don't see why this argument should prevent us from doing
> a micro-optimization if it proves to work correctly.

The main thing I'm wondering about is whether not fetching these objects
could lead to failing to detect an important dependency chain.  IIRC,
pg_dump simply ignores pg_depend entries that mention objects it has not
loaded, so there is a possible mechanism for that.  However, it's hard to
see how a --data-only dump could end up choosing an invalid dump order on
that basis.  It doesn't seem like safe load orders for the table data
objects could depend on what is referenced by defaults or CHECK
constraints.

But ... I've only spent a few minutes thinking about it, so maybe
I'm missing something.

(Note that we disallow sub-queries in CHECK constraints, and also
disclaim responsibility for what happens if you cheat by hiding
the subquery in a function.  So while it's certainly possible to
build CHECK constraints that only work if table X is loaded before
table Y, pg_dump already doesn't guarantee that'll work, --data-only
or otherwise.)

regards, tom lane




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-09-14 Thread Amit Kapila
On Mon, Sep 14, 2020 at 9:42 PM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > The attached patch will fix the issue. What do you think?
>
> I think it'd be cleaner to separate the initialization of a new entry from
> validation altogether, along the lines of
>
> /* Find cached function info, creating if not found */
> oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
>   (void *) &relid,
>   HASH_ENTER, &found);
> MemoryContextSwitchTo(oldctx);
> Assert(entry != NULL);
>
> if (!found)
> {
> /* immediately make a new entry valid enough to satisfy callbacks */
> entry->schema_sent = false;
> entry->streamed_txns = NIL;
> entry->replicate_valid = false;
> /* are there any other fields we should clear here for safety??? */
> }
>

If we want to separate out the validation then we need to initialize other
fields like 'pubactions' and 'publish_as_relid' as well.  I think it
will be better to arrange it the way you are suggesting, so I will
change it along with the other fields that require initialization.
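
Something like this, perhaps (a rough sketch only; field names are from my
memory of pgoutput.c's RelationSyncEntry and need to be checked against the
actual struct):

if (!found)
{
	/* make the new entry valid enough for any callback that might see it */
	entry->schema_sent = false;
	entry->streamed_txns = NIL;
	entry->replicate_valid = false;
	entry->pubactions.pubinsert = false;
	entry->pubactions.pubupdate = false;
	entry->pubactions.pubdelete = false;
	entry->pubactions.pubtruncate = false;
	entry->publish_as_relid = InvalidOid;
}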

> /* Fill it in if not valid */
> if (!entry->replicate_valid)
> {
> List   *pubids = GetRelationPublications(relid);
> ...
>
> BTW, unless someone has changed the behavior of dynahash when I
> wasn't looking, those MemoryContextSwitchTos shown above are useless.
>

As far as I can see they are useless in this case but I think they
might be required in case the user provides its own allocator function
(using HASH_ALLOC). So, we can probably remove those from here?

> Also, why does the comment refer to a "function" entry?
>

It should be "relation" instead. I'll take care of changing this as well.

-- 
With Regards,
Amit Kapila.




Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

2020-09-14 Thread Michael Paquier
On Thu, Sep 10, 2020 at 02:31:32PM +0200, Daniel Gustafsson wrote:
> Given how unintrusive this optimization is, +1 from me to go ahead with this
> patch.  pg_dump tests pass.  Personally I would've updated the nearby comments
> to reflect why the check for dataOnly is there, but MMV there.  I'm moving this
> patch to Ready for Committer.

We need two comments here.  I would suggest something like:
"Skip default/check for a data-only dump, as this is only needed for
dumps of the table schema."

Better wording is of course welcome.

> I'm fairly sure there is a lot more we can do to improve the performance of
> data-only dumps, but this nicely chips away at the problem.

I was looking at that, and wondered about cases like the following,
artistic, thing:
CREATE FUNCTION check_data_zz() RETURNS boolean
LANGUAGE sql STABLE STRICT
AS $$select count(a) > 0 from zz$$;
CREATE TABLE public.yy (
    a integer,
    CONSTRAINT yy_check CHECK (check_data_zz())
);
CREATE TABLE zz (a integer);
INSERT INTO zz VALUES (1);
INSERT INTO yy VALUES (1);

Even on HEAD, this causes the data load to fail because yy's data is
inserted before zz's.  Keeping track of the CHECK dependency could
have made sense for --data-only if we did a better job of detecting the
dependency between the two tables and made sure that zz's data appears
before yy's, but it is not as if this would happen easily in pg_dump,
and we document it this way (see the warning about dump/reload in
ddl.sgml for check constraints).  In short, I think that this patch
looks like a good thing to have.
--
Michael




Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

2020-09-14 Thread Michael Paquier
On Tue, Jul 14, 2020 at 11:14:50AM +0200, Julien Rouhaud wrote:
> Note that those extraneous queries were found while trying to dump
> data out of a corrupted database.  The issue wasn't an excessive
> runtime but corrupted catalog entries, so bypassing this code (since I
> was only interested in the data anyway) was simpler than trying to
> recover yet other corrupted rows.

Yeah, I actually don't see why this argument should prevent us from doing
a micro-optimization if it proves to work correctly.
--
Michael




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Peter Geoghegan
On Mon, Sep 14, 2020 at 7:09 PM Peter Geoghegan  wrote:
> Oh, wait. Of course the point was that we wouldn't even have to use
> nBlocksAllocated in LogicalTapeSetBlocks() anymore -- we would make
> the assumption that nBlocksWritten could be used for all callers in
> all cases. Which is a reasonable assumption once you enforce that
> there are no active write buffers. Which is evidently a good idea
> anyway, since it saves on temp file disk space in
> HashAggs-that-spill/prealloc cases with very little work_mem.

Let's assume that we'll teach LogicalTapeSetBlocks() to use
nBlocksWritten in place of nBlocksAllocated in all cases, as now seems
likely. Rather than asserting "nBlocksWritten == nBlocksAllocated"
inside LogicalTapeSetBlocks() (as I suggested earlier at one point),
we could instead teach LogicalTapeSetBlocks() to iterate through each
tape from the tapeset and make sure each tape has no writes buffered
(so everything must be flushed). We could add a loop that would only
be used on assert-enabled builds.
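
Something like this, perhaps (a rough sketch, untested; the exact field
names and the "no write buffer is active" condition may need adjusting
against logtape.c, and any nHoleBlocks accounting is omitted here):

long
LogicalTapeSetBlocks(LogicalTapeSet *lts)
{
#ifdef USE_ASSERT_CHECKING
	for (int i = 0; i < lts->nTapes; i++)
	{
		LogicalTape *lt = &lts->tapes[i];

		/* every tape must have flushed (or never allocated) its write buffer */
		Assert(!lt->writing || lt->buffer == NULL);
	}
#endif
	return (long) lts->nBlocksWritten;
}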

This looping-through-tapes-to-assert code would justify relying on
nBlocksWritten in LogicalTapeSetBlocks(), and would make sure that we
don't let any bugs like this slip in in the future. It would
necessitate that we commit Jeff's hashagg-release-write-buffers.patch
patch from earlier, I think, but that seems like a good idea anyway.

You suggested this yourself, Jeff (my suggestion about the assertion
is just an expansion on your suggestion from earlier). This all seems
like a good idea to me. Can you write a patch that adjusts
LogicalTapeSetBlocks() along these lines? Hopefully the assertion loop
thing won't reveal some other problem with this plan.

Thanks
-- 
Peter Geoghegan




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Peter Geoghegan
On Mon, Sep 14, 2020 at 6:56 PM Peter Geoghegan  wrote:
> I'm not sure what I was talking about earlier when I connected this
> with the main/instrumentation issue, since preallocation used by
> logtape.c to help HashAggs-that-spill necessarily reserves blocks
> without writing them out for a while (the fires in California have
> made it difficult to be productive). You might write blocks out as
> zero blocks first, and then only write the real data later
> (overwriting the zero blocks). But no matter how the writes among
> tapes are interlaced, the fact is that nBlocksAllocated can exceed
> nBlocksWritten by at least one block per active tape.

Oh, wait. Of course the point was that we wouldn't even have to use
nBlocksAllocated in LogicalTapeSetBlocks() anymore -- we would make
the assumption that nBlocksWritten could be used for all callers in
all cases. Which is a reasonable assumption once you enforce that
there are no active write buffers. Which is evidently a good idea
anyway, since it saves on temp file disk space in
HashAggs-that-spill/prealloc cases with very little work_mem.


-- 
Peter Geoghegan




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Peter Geoghegan
On Mon, Sep 14, 2020 at 5:50 PM Jeff Davis  wrote:
> Yes, it was apparently an oversight. Patch attached.

This is closer to how logical tapes are used within tuplesort.c. I
notice that this leads to about a 50% reduction in temp file usage for
a test case involving very little work_mem (work_mem is set to 64).
But it doesn't seem to make as much difference with more work_mem. It
probably has something to do with recursion during spilling.

> RC1 was just stamped, are we in a sensitive time or is it still
> possible to backport this to REL_13_STABLE?

Testing indicates that this still doesn't make "nBlocksWritten ==
nBlocksAllocated" when the instrumentation is used for
HashAggs-that-spill.

I'm not sure what I was talking about earlier when I connected this
with the main/instrumentation issue, since preallocation used by
logtape.c to help HashAggs-that-spill necessarily reserves blocks
without writing them out for a while (the fires in California have
made it difficult to be productive). You might write blocks out as
zero blocks first, and then only write the real data later
(overwriting the zero blocks). But no matter how the writes among
tapes are interlaced, the fact is that nBlocksAllocated can exceed
nBlocksWritten by at least one block per active tape.

If we really wanted to ensure "nBlocksWritten == nBlocksAllocated",
wouldn't it be necessary for LogicalTapeSetBlocks() to go through the
remaining preallocated blocks from each tape and count the number of
blocks "logically preallocated" (by ltsGetPreallocBlock()) but not yet
"physically preallocated" (by being written out as zero blocks within
ltsWriteBlock())? That count would have to be subtracted, because
nBlocksAllocated includes logically preallocated blocks, without
regard for whether they've been physically preallocated. But we only
know the difference by checking against nBlocksWritten, so we might as
well just use my patch from earlier. (I'm not arguing that we should,
I'm just pointing out the logical though perhaps absurd conclusion.)

-- 
Peter Geoghegan




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-14 Thread k.jami...@fujitsu.com
On Tuesday, September 8, 2020 1:02 PM, Amit Kapila wrote:
Hello,
> On Mon, Sep 7, 2020 at 1:33 PM k.jami...@fujitsu.com
>  wrote:
> >
> > On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote:
> > > On Wed, Sep 2, 2020 at 9:17 AM Tom Lane  wrote:
> > > >
> > > > Amit Kapila  writes:
> > > > > Even if the relation is locked, background processes like
> > > > > checkpointer can still touch the relation which might cause
> > > > > problems. Consider a case where we extend the relation but
> > > > > didn't flush the newly added pages. Now during truncate
> > > > > operation, checkpointer can still flush those pages which can
> > > > > cause trouble for truncate. But, I think in the recovery path
> > > > > such cases won't cause a
> > > problem.
> > > >
> > > > I wouldn't count on that staying true ...
> > > >
> > > >
> > >
> > > > https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbyvtpptk2a9yy...@mail.gmail.com
> > > >
> > >
> > > I don't think that proposal will matter after commit c5315f4f44
> > > because we are caching the size/blocks for recovery while doing
> > > extend (smgrextend). In the above scenario, we would have cached the
> > > blocks which will be used at later point of time.
> > >
> >
> > I'm guessing we can still pursue this idea of improving the recovery path
> first.
> >
> 
> I think so.

Alright, so I've updated the patch which passes the regression and TAP tests.
It compiles and builds as intended.

> > I'm working on an updated patch version, because the CFBot's telling
> > that postgres fails to build (one of the recovery TAP tests fails).
> > I'm still working on refactoring my patch, but have yet to find a proper
> solution at the moment.
> > So I'm going to continue my investigation.
> >
> > Attached is an updated WIP patch.
> > I'd appreciate if you could take a look at the patch as well.
> >
> 
> So, I see the below log as one of the problems:
> 2020-09-07 06:20:33.918 UTC [10914] LOG:  redo starts at 0/15FFEC0
> 2020-09-07 06:20:33.919 UTC [10914] FATAL:  unexpected data beyond EOF
> in block 1 of relation base/13743/24581
> 
> This indicates that we missed invalidating some buffer which should have
> been invalidated. If you are able to reproduce this locally then I suggest to 
> first
> write a simple patch without the check of the threshold, basically in recovery
> always try to use the new way to invalidate the buffer. That will reduce the
> scope of the code that can create a problem. Let us know if the problem still
> exists and share the logs. BTW, I think I see one problem in the code:
> 
> if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> + bufHdr->tag.forkNum == forkNum[j] &&
> + bufHdr->tag.blockNum >= firstDelBlock[j])
> 
> Here, I think you need to use 'i' not 'j' for forkNum and 
> firstDelBlock as those are arrays w.r.t forks. That might fix the 
> problem but I am not sure as I haven't tried to reproduce it.

Thanks for the advice. Right, that seems to be the cause of the error,
and fixing that (using the fork index) solved the case.
I also followed Tsunakawa-san's advice of using more meaningful iterator
names instead of "i" & "j", for readability.

I also added a new function, DropRelFileNodeBuffersOfFork(), which is used
when a relation fork is bigger than the threshold, i.e. when
(nblocks > BUF_DROP_FULLSCAN_THRESHOLD). Perhaps there's a better name for
that function. However, as expected from the previous discussions, this is
a bit slower than the standard buffer invalidation process, because the
whole of shared buffers is scanned nfork times.
Currently, I set the threshold to (NBuffers / 500).

Feedback on the patch/testing are very much welcome.

Best regards,
Kirk Jamison



v12-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v12-Speedup-dropping-of-relation-buffers-during-recovery.patch


Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Jeff Davis
On Mon, 2020-09-14 at 15:54 -0700, Peter Geoghegan wrote:
> Maybe the LogicalTapeRewindForRead() inconsistency you mention could
> be fixed, which would enable the simplification you suggested. What
> do
> you think?

Yes, it was apparently an oversight. Patch attached.

RC1 was just stamped, are we in a sensitive time or is it still
possible to backport this to REL_13_STABLE?

If not, that's fine, I'll just commit it to master. It's a little less
important after 9878b643, which reduced the overpartitioning issue.

Regards,
Jeff Davis

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index f74d4841f17..28802e6588d 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2639,8 +2639,6 @@ agg_refill_hash_table(AggState *aggstate)
 	 */
 	hashagg_recompile_expressions(aggstate, true, true);
 
-	LogicalTapeRewindForRead(tapeinfo->tapeset, batch->input_tapenum,
-			 HASHAGG_READ_BUFFER_SIZE);
 	for (;;)
 	{
 		TupleTableSlot *spillslot = aggstate->hash_spill_rslot;
@@ -2923,6 +2921,7 @@ hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *partitions,
 static void
 hashagg_tapeinfo_release(HashTapeInfo *tapeinfo, int tapenum)
 {
+	/* rewinding frees the buffer while not in use */
 	LogicalTapeRewindForWrite(tapeinfo->tapeset, tapenum);
 	if (tapeinfo->freetapes_alloc == tapeinfo->nfreetapes)
 	{
@@ -3152,6 +3151,7 @@ hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
 
 	for (i = 0; i < spill->npartitions; i++)
 	{
+		LogicalTapeSet	*tapeset = aggstate->hash_tapeinfo->tapeset;
 		int tapenum = spill->partitions[i];
 		HashAggBatch	*new_batch;
 		double			 cardinality;
@@ -3163,9 +3163,13 @@ hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
 		cardinality = estimateHyperLogLog(&spill->hll_card[i]);
 		freeHyperLogLog(&spill->hll_card[i]);
 
-		new_batch = hashagg_batch_new(aggstate->hash_tapeinfo->tapeset,
-	  tapenum, setno, spill->ntuples[i],
-	  cardinality, used_bits);
+		/* rewinding frees the buffer while not in use */
+		LogicalTapeRewindForRead(tapeset, tapenum,
+ HASHAGG_READ_BUFFER_SIZE);
+
+		new_batch = hashagg_batch_new(tapeset, tapenum, setno,
+	  spill->ntuples[i], cardinality,
+	  used_bits);
 		aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
 		aggstate->hash_batches_used++;
 	}


Re: Improving connection scalability: GetSnapshotData()

2020-09-14 Thread Andres Freund
Hi,

On 2020-09-14 20:14:48 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I think the approach of having a long running psql session is really
> > useful, and probably would speed up some tests. Does anybody have a good
> > idea for how to best, and without undue effort, to integrate this into
> > PostgresNode.pm?  I don't really have a great idea, so I think I'd leave
> > it with a local helper in the new test?
> 
> You could use the interactive_psql infrastructure that already exists
> for psql/t/010_tab_completion.pl.  That does rely on IO::Pty, but
> I think I'd prefer to accept that dependency for such tests over rolling
> our own IPC::Run, which is more or less what you've done here.

My test uses IPC::Run - although I'm indirectly 'use'ing, which I guess
isn't pretty. Just as 013_crash_restart.pl already did (even before
psql/t/010_tab_completion.pl). I am mostly wondering whether we could
avoid copying the utility functions into multiple test files...

Does IO::Pty work on windows? Given that currently the test doesn't use
a pty and that there's no benefit I can see in requiring one, I'm a bit
hesitant to go there?

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-09-14 Thread Tom Lane
Andres Freund  writes:
> I think the approach of having a long running psql session is really
> useful, and probably would speed up some tests. Does anybody have a good
> idea for how to best, and without undue effort, to integrate this into
> PostgresNode.pm?  I don't really have a great idea, so I think I'd leave
> it with a local helper in the new test?

You could use the interactive_psql infrastructure that already exists
for psql/t/010_tab_completion.pl.  That does rely on IO::Pty, but
I think I'd prefer to accept that dependency for such tests over rolling
our own IPC::Run, which is more or less what you've done here.

regards, tom lane




Re: pg_restore causing deadlocks on partitioned tables

2020-09-14 Thread Tom Lane
I wrote:
>> (2) ALTER TABLE ONLY ... ADD CONSTRAINT on a partition root tries to get
>> AccessExclusiveLock on all child partitions, despite the ONLY.

> The cause of this seems to be that ATPrepSetNotNull is too dumb to
> avoid recursing to all the child tables when the parent is already
> attnotnull.  Or is there a reason we have to recurse anyway?

I wrote a quick patch for this part.  It seems pretty safe and probably
could be back-patched without fear.  (I also noticed that
ATSimpleRecursion is being unnecessarily stupid: instead of the
demonstrably not-future-proof relkind check, it could test relhassubclass,
which is not only simpler and less likely to need future changes, but
is able to save a scan of pg_inherits in a lot more cases.)

As far as I can tell in some quick testing, this fix is sufficient to
resolve the complained-of deadlock.  It'd still be a good idea to fix the
TRUNCATE side of things as well.  But that would be hard to back-patch
because removing ri_PartitionCheck, or even just failing to fill it,
seems like a potential ABI break for extensions.  So my proposal is
to back-patch this, but address the ResultRelInfo change only in HEAD.

regards, tom lane

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c21a309f04..4a51d79d09 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5681,14 +5681,10 @@ ATSimpleRecursion(List **wqueue, Relation rel,
   AlterTableUtilityContext *context)
 {
 	/*
-	 * Propagate to children if desired.  Only plain tables, foreign tables
-	 * and partitioned tables have children, so no need to search for other
-	 * relkinds.
+	 * Propagate to children, if desired and if there are (or might be) any
+	 * children.
 	 */
-	if (recurse &&
-		(rel->rd_rel->relkind == RELKIND_RELATION ||
-		 rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
-		 rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
+	if (recurse && rel->rd_rel->relhassubclass)
 	{
 		Oid			relid = RelationGetRelid(rel);
 		ListCell   *child;
@@ -6698,6 +6694,35 @@ ATPrepSetNotNull(List **wqueue, Relation rel,
 	if (recursing)
 		return;
 
+	/*
+	 * If the target column is already marked NOT NULL, we can skip recursing
+	 * to children, because their columns should already be marked NOT NULL as
+	 * well.  But there's no point in checking here unless the relation has
+	 * some children; else we can just wait till execution to check.  (If it
+	 * does have children, however, this can save taking per-child locks
+	 * unnecessarily.  This greatly improves concurrency in some parallel
+	 * restore scenarios.)
+	 */
+	if (rel->rd_rel->relhassubclass)
+	{
+		HeapTuple	tuple;
+		bool		attnotnull;
+
+		tuple = SearchSysCacheAttName(RelationGetRelid(rel), cmd->name);
+
+		/* Might as well throw the error now, if name is bad */
+		if (!HeapTupleIsValid(tuple))
+			ereport(ERROR,
+	(errcode(ERRCODE_UNDEFINED_COLUMN),
+	 errmsg("column \"%s\" of relation \"%s\" does not exist",
+			cmd->name, RelationGetRelationName(rel))));
+
+		attnotnull = ((Form_pg_attribute) GETSTRUCT(tuple))->attnotnull;
+		ReleaseSysCache(tuple);
+		if (attnotnull)
+			return;
+	}
+
 	/*
 	 * If we have ALTER TABLE ONLY ... SET NOT NULL on a partitioned table,
 	 * apply ALTER TABLE ... CHECK NOT NULL to every child.  Otherwise, use


Re: Adding Support for Copy callback functionality on COPY TO api

2020-09-14 Thread Soumyadeep Chakraborty
Hi Bilva,

Thank you for registering this patch!

I had a few suggestions:

1. Please run pg_indent[1] on your code. Make sure you add
copy_data_destination_cb to src/tools/pgindent/typedefs.list. Please
run pg_indent on only the files you changed (it will take files as
command line args)

2. For features such as this, it is often helpful to find a use case
within backend/utility/extension code that demonstrates the callback and
to include the code to exercise it with the patch. Refer to how
copy_read_data() is used as copy_data_source_cb to copy the data from
the query results from the WAL receiver (see copy_table()). Finding
a similar use case in the source tree will make a stronger case
for this patch.

3. Wouldn't we want to return the number of bytes written from
copy_data_destination_cb (similar to how copy_data_source_cb returns the
number of bytes read)? We should also think about how to represent
failure. Looking at CopySendEndOfRow(), we should error out, as we do for
the other copy_dest values, after checking the callback's return value
(see the sketch after this list).

4.
> bool pipe = (cstate->filename == NULL) && (cstate->data_destination_cb == NULL);

I think a similar change should also be applied to BeginCopyFrom() and
CopyFrom(). Or even better, such code could be refactored to have a
separate destination type COPY_PIPE. That, of course, would be a separate
patch. I think the above line is okay for this patch.
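
To illustrate point 3, here is a rough sketch of what the callback type and
its use could look like (the names mirror the patch, but the int-returning
signature and the error handling are my assumptions, not the patch's actual
code):

/* hypothetical callback type: returns the number of bytes it consumed */
typedef int (*copy_data_destination_cb) (void *data, int len);

/*
 * Hypothetical helper: invoke the destination callback and error out on a
 * short write, mirroring how the other copy_dest targets report failures.
 */
static void
CopySendToCallback(CopyState cstate, void *data, int len)
{
	if (cstate->data_destination_cb(data, len) != len)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not write COPY data to destination callback")));
}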

Regards,
Soumyadeep




Re: Use incremental sort paths for window functions

2020-09-14 Thread David Rowley
On Tue, 15 Sep 2020 at 00:02, Daniel Gustafsson  wrote:
>
> > On 8 Jul 2020, at 06:57, David Rowley  wrote:
> >
> > Over on [1] someone was asking about chained window paths making use
> > of already partially sorted input.  (The thread is on -general, so I
> > guessed they're not using PG13.)
>
> The [1] reference wasn't qualified, do you remember which thread it was?

That was sloppy of me.  It's
https://www.postgresql.org/message-id/CADd42iFZWwYNsXjEM_3HWK3QnfiCrMNmpOkZqyBQCabnVxOPtw%40mail.gmail.com

> > However, On checking PG13 to see if incremental sort would help their
> > case, I saw it didn't. Looking at the code I saw that
> > create_window_paths() and create_one_window_path() don't make any use
> > of incremental sort paths.
>
> Commit 728202b63cdcd7f counteracts this optimization in part since it orders
> the windows such that the longest common prefix is executed first to allow
> subsequent windows to skip sorting entirely.

This would have been clearer if I'd remembered to include the link to
the thread.  The thread talks about sorting requirements like c1, c3
then c1, c4. So it can make use of the common prefix and do
incremental sorts.

It sounds like you're talking about cases like wfunc() over (order by
a), wfunc2() over (order by a,b), where we can just sort on a,b and
have that order work for the first wfunc(). That's a good optimisation,
but it does not work for the above case.

> That being said, it's only in part and when the stars don't align with sub-
> sequently shorter common prefixes then incremental sort can help.  A synthetic
> unscientific test with three windows over 10M rows, where no common prefix
> exists, shows consistent speedups (for worst cases) well past what can be
> attributed to background noise.
>
> > I quickly put together the attached. It's only about 15 mins of work,
> > but it seems worth looking at a bit more for some future commitfest.
> > Yeah, I'll need to add some tests as I see nothing failed by changing
> > this.
>
> A few comments on the patch: there is no check for enable_incremental_sort, and
> it lacks tests (as already mentioned) for the resulting plan.

Yeah, it should certainly be checking that enable_incremental_sort is on.
I've attached another version with a few tests added too.

David


incremental_sort_for_windowpaths_v2.patch
Description: Binary data


Re: Improving connection scalability: GetSnapshotData()

2020-09-14 Thread Andres Freund
Hi,

On 2020-09-09 17:02:58 +0900, Ian Barwick wrote:
> Attached, though bear in mind I'm not very familiar with parts of this,
> particularly 2PC stuff, so consider it educated guesswork.

Thanks for this, and the test case!

Your commit fixes the issues, but not quite correctly. Multixacts
shouldn't matter, so we don't need to do anything there. And for the
increases, I think they should be inside the already existing
ProcArrayLock acquisition, as in the attached.


I've also included a quite heavily revised version of your test. Instead
of using dblink I went for having a long-running psql that I feed over
stdin. The main reason for not liking the previous version is that it
seems fragile, with the sleep and everything.  I expanded it to cover
2PC as well.

The test probably needs a bit of cleanup, wrapping some of the
redundancy around the pump_until calls.

I think the approach of having a long running psql session is really
useful, and probably would speed up some tests. Does anybody have a good
idea for how to best, and without undue effort, to integrate this into
PostgresNode.pm?  I don't really have a great idea, so I think I'd leave
it with a local helper in the new test?

Regards,

Andres
>From a637b65fc53b208857e0d3d17141d8ed3609036f Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 14 Sep 2020 16:08:25 -0700
Subject: [PATCH v2] WIP: fix and test snapshot behaviour on standby.

Reported-By: Ian Barwick 
Author: Andres Freund 
Author: Ian Barwick 
Discussion: https://postgr.es/m/61291ffe-d611-f889-68b5-c298da9fb...@2ndquadrant.com
---
 src/backend/storage/ipc/procarray.c   |   3 +
 src/test/recovery/t/021_row_visibility.pl | 227 ++
 2 files changed, 230 insertions(+)
 create mode 100644 src/test/recovery/t/021_row_visibility.pl

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 802b119c490..fffa5f7a93e 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4280,6 +4280,9 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
+	/* ... and xactCompletionCount */
+	ShmemVariableCache->xactCompletionCount++;
+
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/test/recovery/t/021_row_visibility.pl b/src/test/recovery/t/021_row_visibility.pl
new file mode 100644
index 000..08713fa2686
--- /dev/null
+++ b/src/test/recovery/t/021_row_visibility.pl
@@ -0,0 +1,227 @@
+# Checks that a standby session can see all expected rows
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 10;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_primary->start;
+
+# Initialize with empty test table
+$node_primary->safe_psql('postgres',
+	'CREATE TABLE public.test_visibility (data text not null)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_standby->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(5);
+
+
+# One psql to primary for all queries. That allows to check
+# uncommitted changes being replicated and such.
+my ($psql_primary_stdin, $psql_primary_stdout, $psql_primary_stderr) = ('', '', '');
+my $psql_primary = IPC::Run::start(
+	[
+		'psql', '-X', '-qAe', '-f', '-', '-d',
+		$node_primary->connstr('postgres')
+	],
+	'<',
+	\$psql_primary_stdin,
+	'>',
+	\$psql_primary_stdout,
+	'2>',
+	\$psql_primary_stderr,
+	$psql_timeout);
+
+# One psql to standby for all queries. That allows to reuse the same
+# session for multiple queries, which is important to detect some
+# types of errors.
+my ($psql_standby_stdin, $psql_standby_stdout, $psql_standby_stderr) = ('', '', '');
+my $psql_standby = IPC::Run::start(
+	[
+		'psql', '-X', '-qAe', '-f', '-', '-d',
+		$node_standby->connstr('postgres')
+	],
+	'<',
+	\$psql_standby_stdin,
+	'>',
+	\$psql_standby_stdout,
+	'2>',
+	\$psql_standby_stderr,
+	$psql_timeout);
+
+#
+# 1. Check initial data is the same
+#
+$psql_standby_stdin .= q[
+SELECT * FROM test_visibility ORDER BY data;
+  ];
+ok(pump_until($psql_standby, \$psql_standby_stdout, qr/0 rows/m),
+   'data not visible');
+$psql_standby_stdout = '';
+$psql_standby_stderr = '';
+
+
+#
+# 2. Check if an INSERT is replayed and visible
+#

Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Peter Geoghegan
On Mon, Sep 14, 2020 at 3:54 PM Peter Geoghegan  wrote:
> If I add the assertion described above and run the regression tests,
> it fails within "select_distinct" (and at other points). This is the
> specific code:

This variant of the same assertion works fine:

+Assert(lts->enable_prealloc || lts->nBlocksWritten ==
lts->nBlocksAllocated);

(This is hardly surprising, though.)

-- 
Peter Geoghegan




Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

2020-09-14 Thread Ranier Vilela
On Fri, Sep 11, 2020 at 5:44 PM, Tom Lane  wrote:

> Ranier Vilela  writes:
> > New version, with support to read Virtual File (pipe, FIFO and socket).
> > With assert, in case, erroneous, of trying to read a pipe, with offset.
>
> I do agree that it might be worth skipping the fseeko call in the
> probably-common case where seek_offset is zero.  Otherwise I don't
> see much reason to change what's there.
>
Tom, if you think it's worth it, I agree with only avoiding the fseeko syscall.
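
Just to be explicit, the change would be only this (a sketch against the
existing read_binary_file() code, untested):

/* seek only when actually needed; an offset of 0 from SEEK_SET is a no-op */
if (seek_offset != 0 &&
	fseeko(file, (off_t) seek_offset,
		   (seek_offset >= 0) ? SEEK_SET : SEEK_END) != 0)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not seek in file \"%s\": %m", filename)));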

regards,
Ranier Vilela


v1-0001-avoid_syscall_fseeko_read_binary_file.patch
Description: Binary data


Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Peter Geoghegan
On Mon, Sep 14, 2020 at 3:39 PM Peter Geoghegan  wrote:
> I think that they are an exact match in practice (i.e. nBlocksWritten
> == nBlocksAllocated), given when and how we call
> LogicalTapeSetBlocks().

Just to be clear: this is only true for external sorts. The
preallocation stuff can make nBlocksAllocated quite a lot higher.
That's probably why adding a new "Assert(lts->nBlocksWritten ==
lts->nBlocksAllocated)" assertion fails during the regression tests,
though there might be other reasons as well (I'm thinking of the
LogicalTapeRewindForRead() inconsistency Jeff mentioned).

-- 
Peter Geoghegan




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Peter Geoghegan
On Mon, Sep 14, 2020 at 3:20 PM Jeff Davis  wrote:
> In the comment in the patch, you say:
>
> "In practice this probably doesn't matter because we'll be called after
> the flush anyway, but be tidy."
>
> By which I assume you mean that LogicalTapeRewindForRead() will be
> called before LogicalTapeSetBlocks().

Yeah, I'm pretty sure that that's an equivalent way of expressing the
same idea. It appears that this assumption holds, though only when
we're not using preallocation (i.e. it doesn't necessarily hold for
the HashAggs-that-spill case, as I go into below).

> If that's the intention of LogicalTapeSetBlocks(), should we just make
> it a requirement that there are no open write buffers for any tapes
> when it's called? Then we could just use nBlocksWritten in both cases,
> right?

That does seem appealing. Perhaps it could be enforced by an assertion.
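Roughly speaking, the function would then collapse to something like this
(just a sketch; the enable_prealloc escape is needed because the plain
equality assertion fails in the regression tests, as shown below):

long
LogicalTapeSetBlocks(LogicalTapeSet *lts)
{
	/* absent preallocation, every allocated block should have been written */
	Assert(lts->enable_prealloc ||
		   lts->nBlocksWritten == lts->nBlocksAllocated);
	return lts->nBlocksWritten;
}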

> (Aside: HashAgg calls it before LogicalTapeRewindForRead(). That might
> be a mistake in HashAgg where it will keep the write buffers around
> longer than necessary. If I recall correctly, it was my intention to
> rewind for reading immediately after the batch was finished, which is
> why I made the read buffer lazily-allocated.)

If I add the assertion described above and run the regression tests,
it fails within "select_distinct" (and at other points). This is the
specific code:

--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -1284,6 +1284,7 @@ LogicalTapeSetBlocks(LogicalTapeSet *lts)
  * (In practice this probably doesn't matter because we'll be called after
  * the flush anyway, but be tidy.)
  */
+Assert(lts->nBlocksWritten == lts->nBlocksAllocated);
 if (lts->enable_prealloc)
 return lts->nBlocksWritten;

Maybe the LogicalTapeRewindForRead() inconsistency you mention could
be fixed, which would enable the simplification you suggested. What do
you think?

-- 
Peter Geoghegan




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Peter Geoghegan
On Mon, Sep 14, 2020 at 3:24 PM Alvaro Herrera  wrote:
> I don't understand this patch.  Or maybe I should say I don't understand
> the code you're patching.  Why isn't the correct answer *always*
> nBlocksWritten?  The comment in LogicalTapeSet says:

I think that they are an exact match in practice (i.e. nBlocksWritten
== nBlocksAllocated), given when and how we call
LogicalTapeSetBlocks().

-- 
Peter Geoghegan




Re: pg_restore causing deadlocks on partitioned tables

2020-09-14 Thread Tom Lane
I wrote:
> However, the deadlock report suggests, and manual experimentation
> confirms, that

> (1) TRUNCATE on a partition tries to get AccessShareLock on the parent;

The reason for this is that

(a) ExecuteTruncateGuts calls InitResultRelInfo, because it might
need that to fire TRUNCATE triggers for the child relation.

(b) InitResultRelInfo calls RelationGetPartitionQual, which
of course(?) must access the parent table.

AFAICS, it is utterly silly for InitResultRelInfo to be forcing
a partition qual to be computed when we might not need it.
We could flush ResultRelInfo.ri_PartitionCheck altogether and
have anything that was reading it instead do
RelationGetPartitionQual(ResultRelInfo.ri_RelationDesc).

Actually it looks like most of the places reading it are
just interested in non-nullness; can't those be nuked from
orbit in favor of testing rel->rd_rel->relispartition?
There's no such thing as a partition with an empty partition
qual is there?  (Or even if it's possible, do we care about
optimizing the case?)
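Concretely, the substitution would be about this simple (a sketch only;
callers would use whatever ResultRelInfo they already have at hand):

	/* today: test the precomputed qual list for non-emptiness */
	if (resultRelInfo->ri_PartitionCheck != NIL)
		...

	/* instead: test the catalog flag on the relation */
	if (resultRelInfo->ri_RelationDesc->rd_rel->relispartition)
		...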

> (2) ALTER TABLE ONLY ... ADD CONSTRAINT on a partition root tries to get
> AccessExclusiveLock on all child partitions, despite the ONLY.

The cause of this seems to be that ATPrepSetNotNull is too dumb to
avoid recursing to all the child tables when the parent is already
attnotnull.  Or is there a reason we have to recurse anyway?
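If not, the early exit would presumably amount to something like this in
ATPrepSetNotNull (pseudocode sketch only, not a worked-out patch; the
column-name variable stands in for whatever the ALTER command carries):

	HeapTuple	tuple = SearchSysCacheAttName(RelationGetRelid(rel), colName);

	if (HeapTupleIsValid(tuple))
	{
		if (((Form_pg_attribute) GETSTRUCT(tuple))->attnotnull)
			recurse = false;	/* children must already be NOT NULL */
		ReleaseSysCache(tuple);
	}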

regards, tom lane




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Alvaro Herrera
On 2020-Sep-14, Peter Geoghegan wrote:

> On Fri, Sep 11, 2020 at 6:37 PM Peter Geoghegan  wrote:
> > It would be awkward if we just used nBlocksWritten within
> > LogicalTapeSetBlocks() in the case where we didn't preallocate (or in
> > all cases). Not entirely sure what to do about that just yet.
> 
> I guess that that's the logical thing to do, as in the attached patch.

I don't understand this patch.  Or maybe I should say I don't understand
the code you're patching.  Why isn't the correct answer *always*
nBlocksWritten?  The comment in LogicalTapeSet says:

"nBlocksWritten is the size of the underlying file, in BLCKSZ blocks."

so if LogicalTapeSetBlocks wants to do what its comment says, that is,

"Obtain total disk space currently used by a LogicalTapeSet, in blocks."

then it seems like they're an exact match.  Either that, or more than
zero of those comments are lying.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Jeff Davis
On Mon, 2020-09-14 at 14:24 -0700, Peter Geoghegan wrote:
> On Fri, Sep 11, 2020 at 6:37 PM Peter Geoghegan  wrote:
> > It would be awkward if we just used nBlocksWritten within
> > LogicalTapeSetBlocks() in the case where we didn't preallocate (or
> > in
> > all cases). Not entirely sure what to do about that just yet.
> 
> I guess that that's the logical thing to do, as in the attached
> patch.

Hi Peter,

In the comment in the patch, you say:

"In practice this probably doesn't matter because we'll be called after
the flush anyway, but be tidy."

By which I assume you mean that LogicalTapeRewindForRead() will be
called before LogicalTapeSetBlocks().

If that's the intention of LogicalTapeSetBlocks(), should we just make
it a requirement that there are no open write buffers for any tapes
when it's called? Then we could just use nBlocksWritten in both cases,
right?

(Aside: HashAgg calls it before LogicalTapeRewindForRead(). That might
be a mistake in HashAgg where it will keep the write buffers around
longer than necessary. If I recall correctly, it was my intention to
rewind for reading immediately after the batch was finished, which is
why I made the read buffer lazily-allocated.)

Regards,
Jeff Davis






Re: Division in dynahash.c due to HASH_FFACTOR

2020-09-14 Thread Thomas Munro
On Mon, Sep 14, 2020 at 11:35 PM David Rowley  wrote:
> I just did some benchmarking with this patch using the same recovery
> benchmark that I used in [1] and also the two patches that I posted in
> [2]. Additionally, I added a PANIC at the end of recovery so that I
> could repeat the recovery over and over again with the same WAL.
>
> [data]

    N           Min           Max        Median           Avg        Stddev
x  10         62.15         67.06         64.86        64.132     1.6188528
+  10          59.6         63.81         63.13        62.233     1.4983031
Difference at 95.0% confidence
-1.899 +/- 1.46553
-2.96108% +/- 2.28517%
(Student's t, pooled s = 1.55974)

Thanks!  Hmm, small but apparently significant and in line with
Jakub's report, and I suppose the effect will be greater with other
nearby recovery performance patches applied that halve the times.
Annoyingly, I can't reproduce this speedup on my local i9-9900; maybe
it requires a different CPU...

> I looked over the patch and the only thing I saw was that we might
> also want to remove the following line:
>
> #define DEF_FFACTOR1 /* default fill factor */

Right, thanks.  Fixed in the attached.

> The 2nd most costly call to hash_search_with_hash_value() came in via
> hash_search() via smgropen(). That does use HASH_ENTER, which could
> have triggered the divide code. The main caller of smgropen() was
> XLogReadBufferExtended().
>
> So, it looks unlikely that any gains we are seeing are from improved
> buffer lookups. It's more likely they're coming from more optimal
> XLogReadBufferExtended()

I think we call smgropen() twice for every buffer referenced in the
WAL: XLogReadBufferExtended() and again in
ReadBufferWithoutRelcache().  We could reduce it to once with some
refactoring, but I am looking into whether I can reduce it to zero as
a side-effect of another change, more soon...
From efecf68b159a3c65517e91076009cb4e5cc6f157 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 10 Sep 2020 12:27:25 +1200
Subject: [PATCH v2] Remove custom fill factor support from dynahash.c.

Since ancient times we have had support for a fill factor (maximum load
factor) to be set for a dynahash hash table, but:

1.  It had to be an integer value >= 1, whereas for in memory hash
tables interesting load factor targets are probably somewhere near the
0.75-1.0 range.

2.  It was implemented in a way that performed an expensive division
operation that regularly showed up in profiles.

3.  We are not aware of anyone ever having used a non-default value.

Therefore, remove support, making a fill factor of 1 the implicit value.

Author: Jakub Wartak 
Reviewed-by: Alvaro Herrera 
Reviewed-by: Tomas Vondra 
Reviewed-by: Thomas Munro 
Reviewed-by: David Rowley 
Discussion: https://postgr.es/m/VI1PR0701MB696044FC35013A96FECC7AC8F62D0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
---
 src/backend/utils/hash/dynahash.c | 20 ++--
 src/include/utils/hsearch.h   |  2 --
 2 files changed, 6 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index f4fbccdd7e..1122e2e5e5 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -122,7 +122,6 @@
 #define DEF_SEGSIZE			   256
 #define DEF_SEGSIZE_SHIFT	   8	/* must be log2(DEF_SEGSIZE) */
 #define DEF_DIRSIZE			   256
-#define DEF_FFACTOR			   1	/* default fill factor */
 
 /* Number of freelists to be used for a partitioned hash table. */
 #define NUM_FREELISTS			32
@@ -191,7 +190,6 @@ struct HASHHDR
 	Size		keysize;		/* hash key length in bytes */
 	Size		entrysize;		/* total user element size in bytes */
 	long		num_partitions; /* # partitions (must be power of 2), or 0 */
-	long		ffactor;		/* target fill factor */
 	long		max_dsize;		/* 'dsize' limit if directory is fixed size */
 	long		ssize;			/* segment size --- must be power of 2 */
 	int			sshift;			/* segment shift = log2(ssize) */
@@ -497,8 +495,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 		/* ssize had better be a power of 2 */
 		Assert(hctl->ssize == (1L << hctl->sshift));
 	}
-	if (flags & HASH_FFACTOR)
-		hctl->ffactor = info->ffactor;
 
 	/*
 	 * SHM hash tables have fixed directory size passed by the caller.
@@ -603,8 +599,6 @@ hdefault(HTAB *hashp)
 
 	hctl->num_partitions = 0;	/* not partitioned */
 
-	hctl->ffactor = DEF_FFACTOR;
-
 	/* table has no fixed maximum size */
 	hctl->max_dsize = NO_MAX_DSIZE;
 
@@ -670,11 +664,10 @@ init_htab(HTAB *hashp, long nelem)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
 	/*
-	 * Divide number of elements by the fill factor to determine a desired
-	 * number of buckets.  Allocate space for the next greater power of two
-	 * number of buckets
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
 	 */
-	nbuckets = next_pow2_int((nelem - 1) / hctl->ffactor + 1);
+	nbuckets = 

Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-14 Thread Peter Geoghegan
On Fri, Sep 11, 2020 at 6:37 PM Peter Geoghegan  wrote:
> It would be awkward if we just used nBlocksWritten within
> LogicalTapeSetBlocks() in the case where we didn't preallocate (or in
> all cases). Not entirely sure what to do about that just yet.

I guess that that's the logical thing to do, as in the attached patch.

What do you think, Jeff?

-- 
Peter Geoghegan


0001-LogicalTapeSetBlocks-issue-Tentative-fix.patch
Description: Binary data


Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-09-14 Thread Andres Freund
Hi,

On 2020-09-14 17:00:48 -0400, Robert Haas wrote:
> On Mon, Sep 14, 2020 at 4:13 PM Andres Freund  wrote:
> > My understanding of the case we're discussing is that it's corruption
> > (e.g. relfrozenxid being different than table contents) affecting a HOT
> > chain. I.e. by definition all within a single page.  We won't have
> > modified part of it independent of B < A, because freezing is
> > all-or-nothing.  Just breaking the HOT chain into two or something like
> > that will just make things worse, because indexes won't find tuples, and
> > because reindexing might then get confused e.g. by HOT chains without a
> > valid start, or by having two visible tuples for the same PK.
> 
> If we adopt the proposal made by Dilip, we will not do that. We must
> have a.xmax = b.xmin, and that value is either less than relfrozenxid
> or it is not. If we skip an entire tuple because one XID is bad, then
> we could break the HOT chain when a.xmin is bad and the remaining
> values are OK. But if we decide separately for xmin and xmax then we
> should be alright.

I thought I precisely addressed this case:

> What exactly are you going to put into xmin/xmax here? And how would
> anything you put into the first tuple not break index lookups? There's
> no such thing as a frozen xmax (so far), so what are you going to put
> in there? A random different xid?  FrozenTransactionId?
> HEAP_XMAX_INVALID?

What am I missing?

Greetings,

Andres Freund




Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-09-14 Thread Robert Haas
On Mon, Sep 14, 2020 at 4:13 PM Andres Freund  wrote:
> My understanding of the case we're discussing is that it's corruption
> (e.g. relfrozenxid being different than table contents) affecting a HOT
> chain. I.e. by definition all within a single page.  We won't have
> modified part of it independent of B < A, because freezing is
> all-or-nothing.  Just breaking the HOT chain into two or something like
> that will just make things worse, because indexes won't find tuples, and
> because reindexing might then get confused e.g. by HOT chains without a
> valid start, or by having two visible tuples for the same PK.

If we adopt the proposal made by Dilip, we will not do that. We must
have a.xmax = b.xmin, and that value is either less than relfrozenxid
or it is not. If we skip an entire tuple because one XID is bad, then
we could break the HOT chain when a.xmin is bad and the remaining
values are OK. But if we decide separately for xmin and xmax then we
should be alright. Alternately, if we're only concerned about HOT
chains, we could skip the entire page if any tuple on the page shows
evidence of damage.

> I don't think that's quite the calculation. You're suggesting to make
> already really complicated and failure prone code even more complicated
> by adding heuristic error recovery to it. That has substantial cost,
> even if we were to get it perfectly right (which I don't believe we
> will).

That's a legitimate concern, but I think it would make more sense to
first make the design as good as we can and then decide whether it's
adequate than to decide ab initio that there's no way to make it good
enough.

> > However, I disagree with the idea that a typical user who has a 2TB table
> > with one corrupted tuple on page 0 probably wants VACUUM to fail over
> > and over again, letting the table bloat like crazy, instead of
> > bleating loudly but still vacuuming the other 99.99% of the
> > table. I mean, somebody probably wants that, and that's fine. But I
> > have a hard time imagining it as a typical view. Am I just lacking in
> > imagination?
>
> I know that that kind of user exists, but yea, I disagree extremely
> strongly that that's a reasonable thing that the majority of users
> want. And I don't think that that's something we should encourage. Those
> cases indicate that either postgres has a bug, or their storage / memory
> / procedures have an issue. Reacting by making it harder to diagnose is
> just a bad idea.

Well, the people I tend to deal with are not going to let me conduct a
lengthy investigation almost no matter what, and the more severe the
operational consequences of the problem are, the less likely it is
that I'm going to have time to figure anything out. Being able to
create some kind of breathing room is pretty valuable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




"Unified logging system" breaks access to pg_dump debug outputs

2020-09-14 Thread Tom Lane
pg_dump et al have some low-level debug log messages that commit
cc8d41511 converted to pg_log_debug() calls, replacing the previous
one-off logging verbosity system that was there.  However, these
calls are dead code as things stand, because there is no way to set
__pg_log_level high enough to get them to print.

I propose the attached minimal patch to restore the previous
functionality.

Alternatively, we might consider inventing an additional logging.c
function pg_logging_increase_level() with the obvious semantics, and
make the various programs just call that when they see a -v switch.
That would be a slightly bigger patch, but it would more easily support
programs with a range of useful verbosities, so maybe that's a better
idea.
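
Such a function could be as simple as this (sketch only; it relies on the
level enum in src/include/common/logging.h being ordered with PG_LOG_DEBUG
as the numerically lowest, i.e. most verbose, value):

void
pg_logging_increase_level(void)
{
	/* lower numeric values are more verbose; stop at PG_LOG_DEBUG */
	if (__pg_log_level > PG_LOG_DEBUG)
		__pg_log_level--;
}

Each occurrence of -v in a program's option loop would then just call it.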

In a quick look around, I could not find any other unreachable
pg_log_debug calls.

(Note: it seems possible that the theoretical multiple verbosity
levels in pg_dump were already broken before cc8d41511, because
right offhand I do not see any code that that removed that would
have allowed invoking the higher levels either.  Nonetheless, there
is no point in carrying dead code --- and these messages *are*
of some interest.  I discovered this problem while trying to
debug parallel pg_restore behavior just now, and wondering
why "-v -v" didn't produce the messages I saw in the source code.)

regards, tom lane

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 784bceaec3..08a2976a6b 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -511,8 +511,11 @@ main(int argc, char **argv)
 break;
 
 			case 'v':			/* verbose */
+if (g_verbose)
+	pg_logging_set_level(PG_LOG_DEBUG); /* -v -v */
+else
+	pg_logging_set_level(PG_LOG_INFO);
 g_verbose = true;
-pg_logging_set_level(PG_LOG_INFO);
 break;
 
 			case 'w':
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
index 97d2b8dac1..11eeb36aa1 100644
--- a/src/bin/pg_dump/pg_dumpall.c
+++ b/src/bin/pg_dump/pg_dumpall.c
@@ -282,8 +282,11 @@ main(int argc, char *argv[])
 break;
 
 			case 'v':
+if (verbose)
+	pg_logging_set_level(PG_LOG_DEBUG); /* -v -v */
+else
+	pg_logging_set_level(PG_LOG_INFO);
 verbose = true;
-pg_logging_set_level(PG_LOG_INFO);
 appendPQExpBufferStr(pgdumpopts, " -v");
 break;
 
diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
index 544ae3bc5c..af04aa0787 100644
--- a/src/bin/pg_dump/pg_restore.c
+++ b/src/bin/pg_dump/pg_restore.c
@@ -244,8 +244,11 @@ main(int argc, char **argv)
 break;
 
 			case 'v':			/* verbose */
+if (opts->verbose)
+	pg_logging_set_level(PG_LOG_DEBUG); /* -v -v */
+else
+	pg_logging_set_level(PG_LOG_INFO);
 opts->verbose = 1;
-pg_logging_set_level(PG_LOG_INFO);
 break;
 
 			case 'w':


Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-09-14 Thread Andres Freund
Hi,

On 2020-09-14 15:50:49 -0400, Robert Haas wrote:
> On Mon, Sep 14, 2020 at 3:00 PM Alvaro Herrera  
> wrote:
> > FWIW I agree with Andres' stance on this.  The current system is *very*
> > complicated and bugs are obscure already.  If we hide them, what we'll
> > be getting is a system where data can become corrupted for no apparent
> > reason.
> 
> I think I might have to give up on this proposal given the level of
> opposition to it, but the nature of the opposition doesn't make any
> sense to me on a technical level. Suppose a tuple with tid A has been
> updated, producing a new version at tid B. The argument that is now
> being offered is that if A has been found to be corrupt then we'd
> better stop vacuuming the table altogether lest we reach B and vacuum
> it too, further corrupting the table and destroying forensic evidence.
> But even ignoring the fact that many users want to get the database
> running again more than they want to do forensics, it's entirely
> possible that B < A, in which case the damage has already been done.

My understanding of the case we're discussing is that it's corruption
(e.g. relfrozenxid being different than table contents) affecting a HOT
chain. I.e. by definition all within a single page.  We won't have
modified part of it independent of B < A, because freezing is
all-or-nothing.  Just breaking the HOT chain into two or something like
that will just make things worse, because indexes won't find tuples, and
because reindexing might then get confused e.g. by HOT chains without a
valid start, or by having two visible tuples for the same PK.


> But even ignoring the fact that many users want to get the database
> running again more than they want to do forensics

The user isn't always right. And I am not objecting against providing a
tool to get things running. I'm objecting to VACUUM doing so, especially
when it's a system wide config option triggering that behaviour.


> Therefore, I can't see any argument that this patch creates any
> scenario that can't happen already. It seems entirely reasonable to me
> to say, as a review comment, hey, you haven't sufficiently considered
> this particular scenario, that still needs work. But the argument here
> is much more about whether this is a reasonable thing to do in general
> and under any circumstances, and it feels to me like you guys are
> saying "no" without offering any really convincing evidence that there
> are unfixable problems here.

I don't think that's quite the calculation. You're suggesting to make
already really complicated and failure prone code even more complicated
by adding heuristic error recovery to it. That has substantial cost,
even if we were to get it perfectly right (which I don't believe we
will).


> The big picture here is that people have terabyte-scale tables, 1 or 2
> tuples get corrupted, and right now the only real fix is to dump and
> restore the whole table, which leads to prolonged downtime. The
> pg_surgery stuff should help with that, and the work to make VACUUM
> report the exact TID will also help, and if we can get the heapcheck
> stuff Mark Dilger is working on committed, that will provide an
> alternative and probably better way of finding this kind of
> corruption, which is all to the good.

Agreed.


> However, I disagree with the idea that a typical user who has a 2TB table
> with one corrupted tuple on page 0 probably wants VACUUM to fail over
> and over again, letting the table bloat like crazy, instead of
> bleating loudly but still vacuuming the other 99.99% of the
> table. I mean, somebody probably wants that, and that's fine. But I
> have a hard time imagining it as a typical view. Am I just lacking in
> imagination?

I know that that kind of user exists, but yea, I disagree extremely
strongly that that's a reasonable thing that the majority of users
want. And I don't think that that's something we should encourage. Those
cases indicate that either postgres has a bug, or their storage / memory
/ procedures have an issue. Reacting by making it harder to diagnose is
just a bad idea.

Greetings,

Andres Freund




Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-09-14 Thread Robert Haas
On Mon, Sep 14, 2020 at 3:00 PM Alvaro Herrera  wrote:
> FWIW I agree with Andres' stance on this.  The current system is *very*
> complicated and bugs are obscure already.  If we hide them, what we'll
> be getting is a system where data can become corrupted for no apparent
> reason.

I think I might have to give up on this proposal given the level of
opposition to it, but the nature of the opposition doesn't make any
sense to me on a technical level. Suppose a tuple with tid A has been
updated, producing a new version at tid B. The argument that is now
being offered is that if A has been found to be corrupt then we'd
better stop vacuuming the table altogether lest we reach B and vacuum
it too, further corrupting the table and destroying forensic evidence.
But even ignoring the fact that many users want to get the database
running again more than they want to do forensics, it's entirely
possible that B < A, in which case the damage has already been done.
Therefore, I can't see any argument that this patch creates any
scenario that can't happen already. It seems entirely reasonable to me
to say, as a review comment, hey, you haven't sufficiently considered
this particular scenario, that still needs work. But the argument here
is much more about whether this is a reasonable thing to do in general
and under any circumstances, and it feels to me like you guys are
saying "no" without offering any really convincing evidence that there
are unfixable problems here. IOW, I agree that having a GUC
corrupt_my_tables_more=true is not a reasonable thing, but I disagree
that the proposal on the table is tantamount to that.

The big picture here is that people have terabyte-scale tables, 1 or 2
tuples get corrupted, and right now the only real fix is to dump and
restore the whole table, which leads to prolonged downtime. The
pg_surgery stuff should help with that, and the work to make VACUUM
report the exact TID will also help, and if we can get the heapcheck
stuff Mark Dilger is working on committed, that will provide an
alternative and probably better way of finding this kind of
corruption, which is all to the good. However, I disagree with the
idea that a typical user who has a 2TB table with one corrupted tuple on
page 0 probably wants VACUUM to fail over and over again, letting the
table bloat like crazy, instead of bleating loudly but still vacuuming
the other 99.99% of the table. I mean, somebody probably wants
that, and that's fine. But I have a hard time imagining it as a
typical view. Am I just lacking in imagination?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_restore causing deadlocks on partitioned tables

2020-09-14 Thread Tom Lane
I wrote:
> If memory serves, which it may not given my undercaffeinated state,
> we would not expect there to be a direct dependency link between the
> constraint and the table data "object".  What there should be is
> dependencies forcing the data to be restored before the post-data
> boundary pseudo-object, and the constraint after the boundary.

No, that's wrong: the boundary objects only exist inside pg_dump.

Looking more closely, we have a deadlock between data restore
for a partition:

Process 15858: TRUNCATE TABLE ONLY myschema."myTable:2020-09-01";

and adding a PK to what I assume is its parent partitioned table:

Process 15861: ALTER TABLE ONLY myschema."myTable" ADD CONSTRAINT "pk_myTable" 
PRIMARY KEY ("ID", date);

Since that's an ALTER TABLE ONLY, it shouldn't be trying to touch the
child partitions at all; while the TRUNCATE should only be trying to touch
the child partition.  At least, that's what pg_dump is expecting.

However, the deadlock report suggests, and manual experimentation
confirms, that

(1) TRUNCATE on a partition tries to get AccessShareLock on the parent;

(2) ALTER TABLE ONLY ... ADD CONSTRAINT on a partition root tries to get
AccessExclusiveLock on all child partitions, despite the ONLY.

Each of these facts violates pg_dump's expectations about what can be
done in parallel with what.  There's no obvious reason why we need such
concurrency-killing locks for these operations, either.  So I think
what we have here are two distinct backend bugs, not a pg_dump bug.

regards, tom lane




Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-09-14 Thread Alvaro Herrera
On 2020-Sep-14, Andres Freund wrote:

> It seems pretty dangerous to me. What exactly are you going to put into
> xmin/xmax here? And how would anything you put into the first tuple not
> break index lookups? There's no such thing as a frozen xmax (so far), so
> what are you going to put in there? A random different xid?
> FrozenTransactionId? HEAP_XMAX_INVALID?
> 
> This whole approach just seems likely to exacerbate corruption while
> also making it impossible to debug. That's ok enough if it's an explicit
> user action, but doing it based on a config variable setting seems
> absurdly dangerous to me.

FWIW I agree with Andres' stance on this.  The current system is *very*
complicated and bugs are obscure already.  If we hide them, what we'll
be getting is a system where data can become corrupted for no apparent
reason.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2020-09-14 Thread Ranier Vilela
On Mon, Sep 14, 2020 at 3:33 PM Tom Lane  wrote:

> Ranier Vilela  writes:
> > msvc 2019 (64 bits) is worried about it:
> >   C:\dll\postgres\src\backend\utils\adt\dbsize.c(630,20): warning C4334:
> > '<<': resultado de 32-bit shift convertido implicitamente para 64 bits
> > (64-bit shift era pretendid
> > o?) [C:\dll\postgres\postgres.vcxproj]
>
> Yeah, most/all of the MSVC buildfarm members are showing this warning too.
> The previous coding was
>
> Int64GetDatum((int64) (1 << count))
>
> which apparently is enough to silence MSVC, though it makes no difference
> as to the actual overflow risk involved.
>
> I'm a bit inclined to make the new code be
>
> NumericGetDatum(int64_to_numeric(INT64CONST(1) << count));
>
+1
msvc 2019 (64 bits) is happy with INT64CONST(1)

regards,
Ranier Vilela


Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-09-14 Thread Andres Freund
Hi,

On 2020-09-14 13:26:27 -0400, Robert Haas wrote:
> On Sat, Aug 29, 2020 at 4:36 AM Dilip Kumar  wrote:
> > One example is,  suppose during vacuum, there are 2 tuples in the hot
> > chain,  and the xmin of the first tuple is corrupted (i.e. smaller
> > than relfrozenxid).  And the xmax of this tuple (which is same as the
> > xmin of the second tuple) is smaller than the cutoff_xid while trying
> > to freeze the tuple.  As a result, it will freeze the second tuple but
> > the first tuple will be left untouched.
> >
> > Now,  if we come for the heap_hot_search_buffer, then the xmax of the
> > first tuple will not match the xmin of the second tuple as we have
> > frozen the second tuple.  But, I feel this is easily fixable right? I
> > mean instead of not doing anything to the corrupted tuple we can
> > partially freeze it?  I mean we can just leave the corrupted xid alone
> > but mark the other xid as frozen if that is smaller than cutoff_xid.
> 
> That seems reasonable to me. Andres, what do you think?

It seems pretty dangerous to me. What exactly are you going to put into
xmin/xmax here? And how would anything you put into the first tuple not
break index lookups? There's no such thing as a frozen xmax (so far), so
what are you going to put in there? A random different xid?
FrozenTransactionId? HEAP_XMAX_INVALID?

This whole approach just seems likely to exacerbate corruption while
also making it impossible to debug. That's ok enough if it's an explicit
user action, but doing it based on a config variable setting seems
absurdly dangerous to me.

Greetings,

Andres Freund




Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2020-09-14 Thread Tom Lane
Ranier Vilela  writes:
> msvc 2019 (64 bits) is worried about it:
>   C:\dll\postgres\src\backend\utils\adt\dbsize.c(630,20): warning C4334:
> '<<': resultado de 32-bit shift convertido implicitamente para 64 bits
> (64-bit shift era pretendid
> o?) [C:\dll\postgres\postgres.vcxproj]

Yeah, most/all of the MSVC buildfarm members are showing this warning too.
The previous coding was

Int64GetDatum((int64) (1 << count))

which apparently is enough to silence MSVC, though it makes no difference
as to the actual overflow risk involved.

I'm a bit inclined to make the new code be

NumericGetDatum(int64_to_numeric(INT64CONST(1) << count));

ie do the shift in 64 bits.  That's either free or next door to free, and
it allows larger count values to be accepted.  The existing callers don't
care of course, but someday somebody might try to expose this function
more widely.

regards, tom lane




Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2020-09-14 Thread Ranier Vilela
 Peter Eisentraut  writes:
> ok done
msvc 2019 (64 bits) is worried about it:
https://github.com/postgres/postgres/commit/0aa8f764088ea0f36620ae2955fa6c54ec736c46

"C:\dll\postgres\pgsql.sln" (default target) (1) ->
"C:\dll\postgres\cyrillic_and_mic.vcxproj" (default target) (37) ->
"C:\dll\postgres\postgres.vcxproj" (default target) (38) ->
(ClCompile target) ->
  C:\dll\postgres\src\backend\utils\adt\dbsize.c(630,20): warning C4334:
'<<': resultado de 32-bit shift convertido implicitamente para 64 bits
(64-bit shift era pretendid
o?) [C:\dll\postgres\postgres.vcxproj]

1 Warning(s)
0 Error(s)

"result of 32-bit shift implicitly converted to 64 bits (was 64-bit shift
intended?)"

regards,
Ranier Vilela


Re: Fix overflow at return wchar2char (src/backend/utils/adt/pg_locale.c)

2020-09-14 Thread Tom Lane
Ranier Vilela  writes:
> On Mon, Sep 14, 2020 at 10:53 AM Daniel Gustafsson  wrote:
>> If the objection is that an unsigned var is tested with <= 0, then
>> changing the
>> semantics of the function seems a rather drastic solution:

> But if wchar2char must be follow wcstombs_l API.
> wchar2char all calls must be:

> result = wchar2char();
> if (result == 0 || result == (size_t)-1) {

> See at lowerstr_with_len (src/backend/tsearch/ts_locale.c):

Actually, lowerstr_with_len is perfectly fine.  It's coercing the
size_t result to int, so (size_t) -1 becomes integer -1 and its
error check for wlen < 0 is correct.  It might have a problem if
the coercion to int could overflow, but that cannot happen because
of our restrictions on the size of a palloc'd chunk.

There are some other call sites that are failing to check at all,
which is not so good.  But changing the function's API to be both
nonstandard and ambiguous (because a zero result is a perfectly valid
case) doesn't help fix that.
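
For reference, a checking call site under the existing wcstombs-style API
looks roughly like this (sketch; the variable names and error message are
placeholders, not code from any particular caller):

	size_t		result;

	result = wchar2char(to, from, tolen, mylocale);
	if (result == (size_t) -1)
		elog(ERROR, "could not convert wide-character string");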

I concur with Daniel that none of the changes shown here are
worthwhile improvements.  It's not illegal to test an unsigned
variable for "x <= 0".

regards, tom lane




Re: Fix for parallel BTree initialization bug

2020-09-14 Thread Peter Geoghegan
On Mon, Sep 14, 2020 at 5:37 AM Amit Kapila  wrote:
> I am planning to push this tomorrow after doing testing on
> back-branches. Let me know if you have any comments.

The fix seems sensible to me.

-- 
Peter Geoghegan




Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-09-14 Thread Robert Haas
On Sat, Aug 29, 2020 at 4:36 AM Dilip Kumar  wrote:
> One example is,  suppose during vacuum, there are 2 tuples in the hot
> chain,  and the xmin of the first tuple is corrupted (i.e. smaller
> than relfrozenxid).  And the xmax of this tuple (which is same as the
> xmin of the second tuple) is smaller than the cutoff_xid while trying
> to freeze the tuple.  As a result, it will freeze the second tuple but
> the first tuple will be left untouched.
>
> Now,  if we come for the heap_hot_search_buffer, then the xmax of the
> first tuple will not match the xmin of the second tuple as we have
> frozen the second tuple.  But, I feel this is easily fixable right? I
> mean instead of not doing anything to the corrupted tuple we can
> partially freeze it?  I mean we can just leave the corrupted xid alone
> but mark the other xid as frozen if that is smaller than cutoff_xid.

That seems reasonable to me. Andres, what do you think?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use incremental sort paths for window functions

2020-09-14 Thread Tomas Vondra

On Wed, Jul 08, 2020 at 04:57:21PM +1200, David Rowley wrote:

Over on [1] someone was asking about chained window paths making use
of already partially sorted input.  (The thread is on -general, so I
guessed they're not using PG13.)

However, On checking PG13 to see if incremental sort would help their
case, I saw it didn't. Looking at the code I saw that
create_window_paths() and create_one_window_path() don't make any use
of incremental sort paths.

I quickly put together the attached. It's only about 15 mins of work,
but it seems worth looking at a bit more for some future commitfest.
Yeah, I'll need to add some tests as I see nothing failed by changing
this.



Yeah, I'm sure there are a couple other places that might benefit from
incremental sort but were not included in the PG13 commit. The patch
seems correct - did it help in the reported thread? How much?

I suppose this might benefit from an optimization similar to the GROUP
BY reordering discussed in [1]. For example, with

   max(a) over (partition by b,c)

I think we could use index on (c) and consider incremental sort by c,b,
i.e. with the inverted pathkeys. But that's a completely independent
topic, I believe.

[1] 
https://www.postgresql.org/message-id/7c79e6a5-8597-74e8-0671-1c39d124c9d6%40sigaev.ru


I'll just park this here until then so I don't forget.



OK, thanks for looking into this!

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Gripes about walsender command processing

2020-09-14 Thread Tom Lane
Michael Paquier  writes:
> On Sun, Sep 13, 2020 at 03:47:51PM -0400, Tom Lane wrote:
>> While trying to fix this, I also observed that exec_replication_command
>> fails to clean up the temp context it made for parsing the command string,
>> if that turns out to be a SQL command.  This very accidentally fails to
>> lead to a long-term memory leak, because that context will be a child of
>> MessageContext so it'll be cleaned out in the next iteration of
>> PostgresMain's main loop.  But it's still unacceptably sloppy coding
>> IMHO, and it's closely related to a lot of other randomness in the
>> function; it'd be better to have a separate early-exit path for SQL
>> commands.

> Looks fine seen from here.  +1.

Pushed, thanks for looking.

>> What I'd really like to do is adjust pgstat_report_activity so that
>> callers are *required* to provide some kind of command string when
>> transitioning into STATE_RUNNING state, ie something like
>> Assert(state == STATE_RUNNING ? cmd_str != NULL : cmd_str == NULL);

> For this one, I don't actually agree.  For now the state and the query
> string, actually the activity string, are completely independent.  So
> external modules like bgworkers can mix the state enum and the
> activity string as they wish.  I think that this flexibility is useful
> to keep.

Meh ... but I'm not excited enough about the point to argue.

I looked into the other point about ps status always claiming the
walsender is "idle".  This turns out to be something PostgresMain
does automatically, so unless we want to uglify that logic, we have
to make walsenders set the field honestly.  So I propose the attached.
(Is there a better way to get the name of a replication command?
I didn't look very hard for one.)

regards, tom lane

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 67093383e6..a34c5745de 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1619,19 +1619,23 @@ exec_replication_command(const char *cmd_string)
 	switch (cmd_node->type)
 	{
 		case T_IdentifySystemCmd:
+			set_ps_display("IDENTIFY_SYSTEM");
 			IdentifySystem();
 			break;
 
 		case T_BaseBackupCmd:
+			set_ps_display("BASE_BACKUP");
 			PreventInTransactionBlock(true, "BASE_BACKUP");
 			SendBaseBackup((BaseBackupCmd *) cmd_node);
 			break;
 
 		case T_CreateReplicationSlotCmd:
+			set_ps_display("CREATE_REPLICATION_SLOT");
 			CreateReplicationSlot((CreateReplicationSlotCmd *) cmd_node);
 			break;
 
 		case T_DropReplicationSlotCmd:
+			set_ps_display("DROP_REPLICATION_SLOT");
 			DropReplicationSlot((DropReplicationSlotCmd *) cmd_node);
 			break;
 
@@ -1639,6 +1643,7 @@ exec_replication_command(const char *cmd_string)
 			{
 StartReplicationCmd *cmd = (StartReplicationCmd *) cmd_node;
 
+set_ps_display("START_REPLICATION");
 PreventInTransactionBlock(true, "START_REPLICATION");
 
 if (cmd->kind == REPLICATION_KIND_PHYSICAL)
@@ -1651,6 +1656,7 @@ exec_replication_command(const char *cmd_string)
 			}
 
 		case T_TimeLineHistoryCmd:
+			set_ps_display("TIMELINE_HISTORY");
 			PreventInTransactionBlock(true, "TIMELINE_HISTORY");
 			SendTimeLineHistory((TimeLineHistoryCmd *) cmd_node);
 			break;
@@ -1660,6 +1666,7 @@ exec_replication_command(const char *cmd_string)
 DestReceiver *dest = CreateDestReceiver(DestRemoteSimple);
 VariableShowStmt *n = (VariableShowStmt *) cmd_node;
 
+set_ps_display("SHOW");
 /* syscache access needs a transaction environment */
 StartTransactionCommand();
 GetPGVariable(n->name, dest);
@@ -1680,8 +1687,11 @@ exec_replication_command(const char *cmd_string)
 	SetQueryCompletion(&qc, CMDTAG_SELECT, 0);
 	EndCommand(&qc, DestRemote, true);
 
-	/* Report to pgstat that this process is now idle */
-	pgstat_report_activity(STATE_IDLE, NULL);
+	/*
+	 * We need not update ps display or pg_stat_activity, because PostgresMain
+	 * will reset those to "idle".  But we must reset debug_query_string to
+	 * ensure it doesn't become a dangling pointer.
+	 */
 	debug_query_string = NULL;
 
 	return true;


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-09-14 Thread Tom Lane
Amit Kapila  writes:
> The attached patch will fix the issue. What do you think?

I think it'd be cleaner to separate the initialization of a new entry from
validation altogether, along the lines of

/* Find cached function info, creating if not found */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
  (void *) &relid,
  HASH_ENTER, &found);
MemoryContextSwitchTo(oldctx);
Assert(entry != NULL);

if (!found)
{
/* immediately make a new entry valid enough to satisfy callbacks */
entry->schema_sent = false;
entry->streamed_txns = NIL;
entry->replicate_valid = false;
/* are there any other fields we should clear here for safety??? */
}

/* Fill it in if not valid */
if (!entry->replicate_valid)
{
List   *pubids = GetRelationPublications(relid);
...

BTW, unless someone has changed the behavior of dynahash when I
wasn't looking, those MemoryContextSwitchTos shown above are useless.
Also, why does the comment refer to a "function" entry?

regards, tom lane




Re: On login trigger: take three

2020-09-14 Thread Pavel Stehule
On Mon, Sep 14, 2020 at 5:53 PM Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
>
> On 14.09.2020 17:34, Pavel Stehule wrote:
>
> If we introduce a built-in session trigger, we should define what a session
> is. Your design is much more related to the process than to the session. So
> the correct name should be a "process_start" trigger, or something should be
> different. I think there are two different events - process_start and
> session_start - and there should be two different event triggers. Maybe the
> name "session_start" is just ambiguous and a different name should be used.
>
>
> I agree.
> I can rename trigger to backend_start or process_start or whatever else.
>

Creating a good name can be hard - it is not called for any process - so
maybe "user_backend_start" ?


>
>
>>
>> 5. I do not quite understand your concern. If I define  trigger
>> procedure which is  blocked (for example as in your example), then I can
>> use pg_cancel_backend to interrupt execution of login trigger and
>> superuser can login. What should be changed here?
>>
>
> You cannot run pg_cancel_backend, because you cannot open a new session.
> There is a cycle.
>
>
> It is always possible to login by disabling startup triggers using
> disable_session_start_trigger GUC:
>
> psql "dbname=postgres options='-c disable_session_start_trigger=true'"
>

Sure, I know. But this behavior can be a very unpleasant surprise, and my
question is whether it can be fixed.  Passing custom libpq options can be a
showstopper for people that use pgAdmin.


Re: Fix overflow at return wchar2char (src/backend/utils/adt/pg_locale.c)

2020-09-14 Thread Ranier Vilela
On Mon, Sep 14, 2020 at 10:53 AM Daniel Gustafsson  wrote:

> > On 14 Sep 2020, at 14:41, Ranier Vilela  wrote:
>
> > 1. wchar2char has a mistake when checking the return of
> WideCharToMultiByte call.
> > result variable is unsigned, therefore, cannot be less than zero,
> returning -1 is not an option.
>
> If the objection is that an unsigned var is tested with <= 0, then
> changing the
> semantics of the function seems a rather drastic solution:
>
> /* A zero return is failure */
> -   if (result <= 0)
> -   result = -1;
> +   if (result == 0)
> +   return 0;
>
> The comment for wchar2char explicitly state "This has the same API as the
> standard wcstombs_l() function;", and man wcstombs_l shows:
>
> RETURN VALUES
>  The wcstombs() function returns the number of bytes converted (not
>  including any terminating null), if successful; otherwise, it
> returns
>  (size_t)-1.
>
I don't agree that it must follow the wcstombs_l API.
And there is already a precedent for returning zero.

But if wchar2char must follow the wcstombs_l API,
then all wchar2char calls must be:

result = wchar2char();
if (result == 0 || result == (size_t)-1) {

See at lowerstr_with_len (src/backend/tsearch/ts_locale.c):

wlen = char2wchar(wstr, len + 1, str, len, mylocale);  // fail with -1
Assert((void) 0);  // Release mode, Assert(wlen <= len);
wlen = 18446744073709551615;
len = pg_database_encoding_max_length() *  18446744073709551615 + 1;  //
len is int, Windows LLP64 is 32 bits.
out = (char *) palloc(len);


> > 2. strftime or strftime_win32, return cannot be less than zero.
> >
> > 3. If strftime or strftime_win32, fails, why not abort the loop?
>
> This recently changed in 7ad1cd31bfc, and the commit message along with the
> comment above the code implies that an error is unlikely:,
>
> * MAX_L10N_DATA is sufficient buffer space for every known locale, and
> * POSIX defines no strftime() errors.  (Buffer space exhaustion is not
> an
> * error.)
>
> ..so it's probably a case of not optimizing for never-happens-scenarios:
> The
> fact that strftimefail will trigger elog and not ereport is an additional
> clue
> that an error is unlikely.
>
The cost of the test has already been paid, so the break is free.
And testing whether an unsigned value is less than zero is useless;
it just gets in the way of the compiler.

regards,
Ranier Vilela


Re: On login trigger: take three

2020-09-14 Thread Konstantin Knizhnik



On 14.09.2020 17:34, Pavel Stehule wrote:
If we introduce a built-in session trigger, we should define what a 
session is. Your design is much more related to the process than to 
the session. So the correct name should be a "process_start" trigger, 
or something should be different. I think there are two different 
events - process_start and session_start - and there should be two 
different event triggers. Maybe the name "session_start" is just 
ambiguous and a different name should be used.


I agree.
I can rename trigger to backend_start or process_start or whatever else.



5. I do not quite understand your concern. If I define trigger
procedure which is  blocked (for example as in your example), then
I can
use pg_cancel_backend to interrupt execution of login trigger and
superuser can login. What should be changed here?


You cannot run pg_cancel_backend, because you cannot open a new 
session. There is a cycle.


It is always possible to login by disabling startup triggers using 
disable_session_start_trigger GUC:


psql "dbname=postgres options='-c disable_session_start_trigger=true'"




Re: pg_restore causing deadlocks on partitioned tables

2020-09-14 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Sep-14, Tom Lane wrote:
>> Hm, this seems related to 2ba5b2db7, but not the same thing.
>> Alvaro, any thoughts?

> So apparently when we go to restore the table data for the partition,
> the TRUNCATE deadlocks with the PK addition ... that's pretty odd;
> shouldn't the constraint restore have waited until the data had been
> fully loaded?

Yeah, that's certainly the design expectation.  Missing dependency?

If memory serves, which it may not given my undercaffeinated state,
we would not expect there to be a direct dependency link between the
constraint and the table data "object".  What there should be is
dependencies forcing the data to be restored before the post-data
boundary pseudo-object, and the constraint after the boundary.
I'm half guessing that that's being mucked up for partitioned tables.

regards, tom lane




Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

2020-09-14 Thread Tom Lane
In connection with a nearby thread, I tried to run the subscription
test suite in a CLOBBER_CACHE_ALWAYS build.  I soon found that I had
to increase wal_receiver_timeout, but after doing this:

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bff..5fe6810 100644
*** a/src/test/perl/PostgresNode.pm
--- b/src/test/perl/PostgresNode.pm
*** sub init
*** 447,452 
--- 447,453 
print $conf "log_statement = all\n";
print $conf "log_replication_commands = on\n";
print $conf "wal_retrieve_retry_interval = '500ms'\n";
+   print $conf "wal_receiver_timeout = '10min'\n";
  
# If a setting tends to affect whether tests pass or fail, print it after
# TEMP_CONFIG.  Otherwise, print it before TEMP_CONFIG, thereby permitting

I let it run overnight, and came back to find that it was stuck at

[03:02:15] t/013_partition.pl . 19/51

and had been for circa eight hours, where extrapolation from other tests
said it shouldn't take much over half an hour.  Investigation found that
the subscriber was repeatedly failing like this:

2020-09-14 11:05:26.483 EDT [1030506] LOG:  logical replication apply worker 
for subscription "sub1" has started
2020-09-14 11:05:27.139 EDT [1030506] ERROR:  cache lookup failed for relation 0
2020-09-14 11:05:27.140 EDT [947156] LOG:  background worker "logical 
replication worker" (PID 1030506) exited with exit code 1
2020-09-14 11:05:27.571 EDT [1030509] LOG:  logical replication apply worker 
for subscription "sub1" has started
2020-09-14 11:05:28.227 EDT [1030509] ERROR:  cache lookup failed for relation 0
2020-09-14 11:05:28.228 EDT [947156] LOG:  background worker "logical 
replication worker" (PID 1030509) exited with exit code 1

The publisher's log shows no sign of distress:

2020-09-14 11:06:09.380 EDT [1030619] sub1 LOG:  statement: SELECT 
pg_catalog.set_config('search_path', '', false);
2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG:  received replication command: 
IDENTIFY_SYSTEM
2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG:  received replication command: 
START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', 
publication_names '"pub1"')
2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG:  starting logical decoding for 
slot "sub1"
2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL:  Streaming transactions 
committing after 0/163D848, reading WAL from 0/163CF08.
2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG:  logical decoding found 
consistent point at 0/163CF08
2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL:  There are no running 
transactions.
2020-09-14 11:06:10.468 EDT [1030621] sub1 LOG:  statement: SELECT 
pg_catalog.set_config('search_path', '', false);
2020-09-14 11:06:10.533 EDT [1030621] sub1 LOG:  received replication command: 
IDENTIFY_SYSTEM
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  received replication command: 
START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', 
publication_names '"pub1"')
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  starting logical decoding for 
slot "sub1"
2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL:  Streaming transactions 
committing after 0/163D848, reading WAL from 0/163CF08.
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  logical decoding found 
consistent point at 0/163CF08
2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL:  There are no running 
transactions.

I do not have time today to chase this further, but somebody should.

More generally, this seems like good evidence that we really oughta have a
buildfarm member that's running *all* the tests under CLOBBER_CACHE_ALWAYS
not just the core tests.  That seems impossibly expensive, but I realized
while watching these tests that a ridiculous fraction of the runtime is
being spent in repeated initdb calls.  On my machine, initdb takes about
12 minutes under CCA, so doing it separately for publisher and subscriber
means 24 minutes, which compares not at all favorably to the
circa-half-an-hour total runtime of each of the subscription test scripts.
We're surely not learning anything after the first CCA run of initdb, so
if we could find a way to skip that overhead for later runs, it'd make a
huge difference in the practicality of running these tests under CCA.

I recall having worked on a patch to make the regression tests run
initdb just once, creating a template directory tree, and then "cp -a"
that into place for each test.  I did not finish it, because it wasn't
showing a lot of advantage in a normal test run, but maybe the idea
could be resurrected for CCA and other slow builds.

Another idea is to make CCA a little more dynamic, say allow it to be
suppressed through an environment variable setting, and then use that
to speed up per-test initdbs.
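
For instance (a sketch only, with a made-up variable name), the
CLOBBER_CACHE_ALWAYS code in inval.c could consult the environment once
per process:

#ifdef CLOBBER_CACHE_ALWAYS
	{
		static int	clobber_enabled = -1;	/* -1: not yet checked */

		if (clobber_enabled < 0)
			clobber_enabled = (getenv("PG_SUPPRESS_CLOBBER_CACHE") == NULL);
		if (clobber_enabled)
			InvalidateSystemCaches();
	}
#endif

The per-test initdb invocations could then set that variable to run at
normal speed.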

regards, tom lane




Re: pg_dump --where option

2020-09-14 Thread Daniel Gustafsson
> On 14 Sep 2020, at 12:04, Surafel Temesgen  wrote:
> On Fri, Jul 31, 2020 at 1:38 AM Daniel Gustafsson  wrote:
> 
> >  $ pg_dump -d cary --where="test1:a3 = ( select max(aa1) from test2 )" > 
> > testdump2
> >  $ pg_dump: error: processing of table "public.test1" failed
> > 
> > both test1 and test2 exist in the database and the same subquery works 
> > under psql.
> This is because pg_dump uses schema-qualified object names. I added
> documentation about using schema-qualified names when using a subquery.

Documenting something is well and good, but isn't allowing arbitrary SQL
copy-pasted into the query (which isn't checked for schema qualification)
opening up for some of the ill-effects of CVE-2018-1058?

> I didn't add tests because single-quotes and double-quotes are meta-characters
> for PROVE too.

I'm not sure I follow. Surely tests can be added for this functionality?


How should one invoke this on a multibyte char table name which requires
quoting, like --table='"x"' (where x would be an mb char).  Reading the
original thread and trying the syntax from there, it's also not clear how table
names with colons should be handled.  I know they're not common, but if they're
not supported then the tradeoff should be documented.

A nearby thread [0] is adding functionality to read from an input file due to
the command line being too short.  Consumers of this might not run into the
issues mentioned there, but it doesn't seem far fetched that someone who does
also adds a small WHERE clause too.  Maybe these patches should join forces?

cheers ./daniel

[0] CAFj8pRB10wvW0CC9Xq=1XDs=zcqxer3cblcnza+qix4cuh-...@mail.gmail.com



Re: Function to execute a program

2020-09-14 Thread Stephen Frost
Greetings.

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Magnus Hagander (mag...@hagander.net) wrote:
> >> Would it make sense to have a pg_execute_program() that corresponds to COPY
> >> FROM PROGRAM? This would obviously have the same permissions restrictions
> >> as COPY FROM PROGRAM.
> 
> > I'd rather come up with a way to import this kind of object into PG by
> > using COPY rather than adding a different way to pull them in.
> 
> I'm not for overloading COPY to try to make it handle every data import
> use-case.  The issue here AIUI is that Magnus wants the program output
> to be read as an uninterpreted blob (which he'll then try to convert to
> jsonb or whatever, but that's not the concern of the import code).  This
> is exactly antithetical to COPY's mission of reading some rows that are
> made up of some columns and putting the result into a table.

I don't really think there's anything inherent in the fact that "COPY"
today only has one way to handle data that the user wants to import that
it should be required to always operate in that manner.

As for slowing down the current method- I don't think that we'd
implement such a change as just a modification to the existing optimized
parsing code as that just wouldn't make any sense and would slow COPY
down for this use-case, but having a COPY command that's able to work in
a few different modes when it comes to importing data seems like it
could be sensible, fast, and clear to users.

One could imagine creating some other top-level command to handle more
complex import cases than what COPY does today, but I don't actually
think that'd be an improvement.

> Yeah, we could no doubt add some functionality to disable all the
> row-splitting and column-splitting and associated escaping logic,
> but that's going to make COPY slower and more complicated.  And it
> still doesn't address wanting to use the result directly in a query
> instead of sticking it into a table.

The way that's handled for the cases that COPY does work with today is
file_fdw.  Ideally, we'd do the same here.
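
For the row-and-column case, that already looks roughly like this today (a
minimal sketch; the file name and columns are made up):

CREATE EXTENSION file_fdw;
CREATE SERVER files FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE measurements (ts timestamptz, value numeric)
    SERVER files OPTIONS (filename '/tmp/measurements.csv', format 'csv');
SELECT * FROM measurements WHERE value > 10;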

Ultimately, COPY absolutely *is* our general data import tool- it's just
that today we push some of the work to make things 'fit' on the user and
that ends up with pain points like exactly what Magnus has pointed out
here.  We should be looking to improve that situation, and I don't
really care for the solution to that being "create some random other new
thing for data import that users then have to know exists and learn how
to use".

Thanks,

Stephen


signature.asc
Description: PGP signature


RE: pg_restore causing deadlocks on partitioned tables

2020-09-14 Thread Domagoj Smoljanovic
Forgot to mention the versions:
pg_restore (PostgreSQL) 12.4
source/ destination databases also 12.4

D.

-Original Message-
From: Alvaro Herrera  
Sent: 14. rujna 2020. 16:40
To: Tom Lane 
Cc: Domagoj Smoljanovic ; 
pgsql-hack...@postgresql.org
Subject: Re: pg_restore causing deadlocks on partitioned tables

On 2020-Sep-14, Tom Lane wrote:

> Domagoj Smoljanovic  writes:
> > I have pg_restore running in parallel (3 or more) and processing large 
> > amount of data that is in partitioned tables. However it seems that 
> > sometime deadlock appears when one process is trying to process primary key 
> > on parent table while data still hasn’t been loaded into partitions. And 
> > acquires Exclusive Lock on the whole table. Then another process comes and 
> > tries to load one of the partitions with SharedLock but it fails.
> 
> > This of course doesn’t happen always; depending on the course of actions of 
> > the pg_restore. But often enough to cause frustration.
> 
> > Process 15858 waits for AccessShareLock on relation 233358134 of database 
> > 233346697; blocked by process 15861.
> > Process 15861 waits for AccessExclusiveLock on relation 233374757 of 
> > database 233346697; blocked by process 15858.
> > Process 15858: TRUNCATE TABLE ONLY myschema."myTable:2020-09-01"; 
> > Process 15861: ALTER TABLE ONLY myschema."myTable" ADD CONSTRAINT 
> > "pk_myTable" PRIMARY KEY ("ID", date);
> 
> Hm, this seems related to 2ba5b2db7, but not the same thing.
> Alvaro, any thoughts?

So apparently when we go to restore the table data for the partition, the 
TRUNCATE deadlocks with the PK addition ... that's pretty odd; shouldn't the 
constraint restore have waited until the data had been fully loaded?

-- 
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Function to execute a program

2020-09-14 Thread Tom Lane
Stephen Frost  writes:
> * Magnus Hagander (mag...@hagander.net) wrote:
>> Would it make sense to have a pg_execute_program() that corresponds to COPY
>> FROM PROGRAM? This would obviously have the same permissions restrictions
>> as COPY FROM PROGRAM.

> I'd rather come up with a way to import this kind of object into PG by
> using COPY rather than adding a different way to pull them in.

I'm not for overloading COPY to try to make it handle every data import
use-case.  The issue here AIUI is that Magnus wants the program output
to be read as an uninterpreted blob (which he'll then try to convert to
jsonb or whatever, but that's not the concern of the import code).  This
is exactly antithetical to COPY's mission of reading some rows that are
made up of some columns and putting the result into a table.

Yeah, we could no doubt add some functionality to disable all the
row-splitting and column-splitting and associated escaping logic,
but that's going to make COPY slower and more complicated.  And it
still doesn't address wanting to use the result directly in a query
instead of sticking it into a table.

regards, tom lane




Re: pg_restore causing deadlocks on partitioned tables

2020-09-14 Thread Alvaro Herrera
On 2020-Sep-14, Tom Lane wrote:

> Domagoj Smoljanovic  writes:
> > I have pg_restore running in parallel (3 or more) and processing large 
> > amount of data that is in partitioned tables. However it seems that 
> > sometime deadlock appears when one process is trying to process primary key 
> > on parent table while data still hasn’t been loaded into partitions. And 
> > acquires Exclusive Lock on the whole table. Then another process comes and 
> > tries to load one of the partitions with SharedLock but it fails.
> 
> > This of course doesn’t happen always; depending on the course of actions of 
> > the pg_restore. But often enough to cause frustration.
> 
> > Process 15858 waits for AccessShareLock on relation 233358134 of database 
> > 233346697; blocked by process 15861.
> > Process 15861 waits for AccessExclusiveLock on relation 233374757 of 
> > database 233346697; blocked by process 15858.
> > Process 15858: TRUNCATE TABLE ONLY myschema."myTable:2020-09-01";
> > Process 15861: ALTER TABLE ONLY myschema."myTable" ADD CONSTRAINT 
> > "pk_myTable" PRIMARY KEY ("ID", date);
> 
> Hm, this seems related to 2ba5b2db7, but not the same thing.
> Alvaro, any thoughts?

So apparently when we go to restore the table data for the partition,
the TRUNCATE deadlocks with the PK addition ... that's pretty odd;
shouldn't the constraint restore have waited until the data had been
fully loaded?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: On login trigger: take three

2020-09-14 Thread Pavel Stehule
On Mon, Sep 14, 2020 at 16:12, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
>
> On 14.09.2020 12:44, Pavel Stehule wrote:
> > Hi
> >
> > I am checking last patch, and there are notices
> >
> > 1. disable_session_start_trigger should be SU_BACKEND instead of SUSET
> >
> > 2. The documentation should be enhanced - there is no note about the
> > behavior when there are unhandled exceptions, or about the motivation
> > for this event trigger
> >
> > 3. regress tests should be enhanced - the cases with exceptions are
> > not tested
> >
> > 4. This trigger is not executed again after RESET ALL or DISCARD ALL -
> > it can be a problem if somebody wants to use this trigger for
> > initialisation of some session objects with some pooling solutions.
> >
> > 5. The error handling doesn't work well with cancellation. If the event
> > trigger waits for some event, then a cancel prevents the connection even
> > though the connected user is a superuser
> >
> > CREATE OR REPLACE FUNCTION on_login_proc2() RETURNS EVENT_TRIGGER AS
> > $$ begin perform pg_sleep(1); raise notice '%', fx1(100);raise
> > notice 'kuku kuku'; end  $$ language plpgsql;
> >
> > probably nobody will use pg_sleep in this routine, but there can be
> > wait on some locks ...
> >
> > Regards
> >
> > Pavel
> >
> >
> >
>
> Hi
> Thank you very much for looking at my patch for connection triggers.
> I have fixed 1-3 issues in the attached patch.
> Concerning 4 and 5 I have some doubts:
>
> 4. Should I add some extra GUC to allow firing of session_start trigger
> in case of  RESET ALL or DISCARD ALL ?
> Looks like such behavior contradicts the event name "session_start"...
> And do we really need it? If some pooler is using RESET ALL/DISCARD ALL
> to emulate session semantics, then most likely it provides a way to define
> custom actions that should be performed for session initialization. As far
> as I know, pgbouncer, for example, allows defining its own on-connect hooks.
>

If we introduce a built-in session trigger, we should define what a session
is. Your design is much more related to the process than to the session.
So the correct name should be a "process_start" trigger, or something should
be different. I think there are two different events - process_start and
session_start - and there should be two different event triggers. Maybe the
name "session_start" is just ambiguous and a different name should be used.


>
> 5. I do not quite understand your concern. If I define a trigger
> procedure which is blocked (as in your example), then I can use
> pg_cancel_backend to interrupt execution of the login trigger and a
> superuser can log in. What should be changed here?
>

You cannot run pg_cancel_backend, because you cannot open a new session.
There is a cycle.

Regards

Pavel



>
>
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: Function to execute a program

2020-09-14 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> Would it make sense to have a pg_execute_program() that corresponds to COPY
> FROM PROGRAM? This would obviously have the same permissions restrictions
> as COPY FROM PROGRAM.

Eh, perhaps.

> The usecase would be to for example execute a command that returns json
> format output, which could then be parsed and processed as part of a query.
> Today, COPY FROM PROGRAM cannot do this, as we can't read a multiline json
> value with COPY.

I'd rather come up with a way to import this kind of object into PG by
using COPY rather than adding a different way to pull them in.

In the "this is a crazy idea" category- have a way to specify an input
function to COPY, or a target data type (and then use its input
function) to which COPY just keeps feeding bytes until the function
comes back with "ok, I got a value to return" or something.

Would be really neat to have a way to COPY in a JSON, XML, or whatever
even if it's multi-line and even if it has multiple objects (ie: 10
complete JSON documents in a single file would become 10 rows of jsonb).
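
To make that concrete, a purely hypothetical sketch (the FORMAT value below
does not exist today; it is only meant to illustrate the idea):

CREATE TEMP TABLE docs (j jsonb);
-- hypothetical: each complete JSON document in the file becomes one row,
-- regardless of newlines inside or between the documents
COPY docs (j) FROM '/tmp/documents.json' WITH (FORMAT json_document);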

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: 回复:how to create index concurrently on partitioned table

2020-09-14 Thread Justin Pryzby
On Sat, Sep 12, 2020 at 10:35:34AM +0900, Michael Paquier wrote:
> On Fri, Sep 11, 2020 at 07:13:01PM -0500, Justin Pryzby wrote:
> > On Tue, Sep 08, 2020 at 01:31:05PM +0900, Michael Paquier wrote:
> >> - CIC on partitioned relations.  (Should we also care about DROP INDEX
> >> CONCURRENTLY as well?)
> > 
> > Do you have any idea what you think that should look like for DROP INDEX
> > CONCURRENTLY ?
> 
> Making the maintenance of the partition tree consistent to the user is
> the critical part here, so my guess on this matter is:
> 1) Remove each index from the partition tree and mark the indexes as
> invalid in the same transaction.  This makes sure that after commit no
> indexes would get used for scans, and the partition dependency tree
> is completely removed with the parent table.  That's close to what we
> do in index_concurrently_swap() except that we just want to remove the
> dependencies with the partitions, and not just swap them of course.
> 2) Switch each index to INDEX_DROP_SET_DEAD, one per transaction
> should be fine as that prevents inserts.
> 3) Finish the index drop.
> 
> Step 2) and 3) could be completely done for each index as part of
> index_drop().  The tricky part is to integrate 1) cleanly within the
> existing dependency machinery while still knowing about the list of
> partitions that can be removed.  I think that this part is not that
> straight-forward, but perhaps we could just make this logic part of
> RemoveRelations() when listing all the objects to remove.

Thanks.

I see three implementation ideas..

1. I think your way has an issue that the dependencies are lost.  If there's an
interruption, the user is maybe left with hundreds or thousands of detached
indexes to clean up.  This is strange since there's actually no detach command
for indexes (but they're implicitly "attached" when a matching parent index is
created).  A 2nd issue is that DETACH currently requires an exclusive lock (but
see Alvaro's WIP patch).

2. Maybe the easiest way is to mark all indexes invalid and then drop all
partitions (concurrently) and then the partitioned parent.  If interrupted,
this would leave a parent index marked "invalid", and some child tables with no
indexes.  I think this may be "ok".  The same thing is possible if a concurrent
build is interrupted, right ?

3. I have a patch which changes index_drop() to "expand" a partitioned index 
into
its list of children.  Each of these becomes a List:
| indexId, heapId, userIndexRelation, userHeapRelation, heaplocktag, heaprelid, 
indexrelid
The same process is followed as for a single index, but handling all partitions
at once in two transactions total.  Arguably, this is bad since that function
currently takes a single Oid but would now end up operating on a list of 
indexes.

Anyway, for now this is rebased on 83158f74d.

-- 
Justin
>From 94f3578d6a6ae9d7e33f902dd86c2b7cf0aee3e0 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 6 Jun 2020 17:42:23 -0500
Subject: [PATCH v7 1/3] Allow CREATE INDEX CONCURRENTLY on partitioned table

Note, this effectively reverts 050098b14, so take care to not reintroduce the
bug it fixed.

XXX: does pgstat_progress_update_param() break other commands progress ?
---
 doc/src/sgml/ref/create_index.sgml |   9 --
 src/backend/commands/indexcmds.c   | 135 +
 src/test/regress/expected/indexing.out |  60 ++-
 src/test/regress/sql/indexing.sql  |  18 +++-
 4 files changed, 166 insertions(+), 56 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 33aa64e81d..c780dc9547 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -657,15 +657,6 @@ Indexes:
 cannot.

 
-   
-Concurrent builds for indexes on partitioned tables are currently not
-supported.  However, you may concurrently build the index on each
-partition individually and then finally create the partitioned index
-non-concurrently in order to reduce the time where writes to the
-partitioned table will be locked out.  In this case, building the
-partitioned index is a metadata only operation.
-   
-
   
  
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f1b5f87e6a..d417404211 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -91,6 +91,7 @@ static bool ReindexRelationConcurrently(Oid relationOid, int options);
 
 static void ReindexPartitions(Oid relid, int options, bool isTopLevel);
 static void ReindexMultipleInternal(List *relids, int options);
+static void reindex_invalid_child_indexes(Oid indexRelationId);
 static void reindex_error_callback(void *args);
 static void update_relispartition(Oid relationId, bool newval);
 static bool CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts);
@@ -665,17 +666,6 @@ DefineIndex(Oid relationId,
 	partitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
 	

Re: pg_restore causing deadlocks on partitioned tables

2020-09-14 Thread Tom Lane
Domagoj Smoljanovic  writes:
> I have pg_restore running in parallel (3 or more) and processing large amount 
> of data that is in partitioned tables. However it seems that sometime 
> deadlock appears when one process is trying to process primary key on parent 
> table while data still hasn’t been loaded into partitions. And acquires 
> Exclusive Lock on the whole table. Then another process comes and tries to 
> load one of the partitions with SharedLock but it fails.

> This of course doesn’t happen always; depending on the course of actions of 
> the pg_restore. But often enough to cause frustration.

> Process 15858 waits for AccessShareLock on relation 233358134 of database 
> 233346697; blocked by process 15861.
> Process 15861 waits for AccessExclusiveLock on relation 233374757 of database 
> 233346697; blocked by process 15858.
> Process 15858: TRUNCATE TABLE ONLY myschema."myTable:2020-09-01";
> Process 15861: ALTER TABLE ONLY myschema."myTable" ADD CONSTRAINT 
> "pk_myTable" PRIMARY KEY ("ID", date);

Hm, this seems related to 2ba5b2db7, but not the same thing.
Alvaro, any thoughts?

regards, tom lane




Re: On login trigger: take three

2020-09-14 Thread Konstantin Knizhnik



On 14.09.2020 12:44, Pavel Stehule wrote:

Hi

I am checking last patch, and there are notices

1. disable_session_start_trigger should be SU_BACKEND instead of SUSET

2. The documentation should be enhanced - there is no note about the
behavior when there are unhandled exceptions, or about the motivation for
this event trigger


3. regress tests should be enhanced - the cases with exceptions are 
not tested


4. This trigger is not executed again after RESET ALL or DISCARD ALL - 
it can be a problem if somebody wants to use this trigger for 
initialisation of some session objects with some pooling solutions.


5. The error handling doesn't work well with cancellation. If the event
trigger waits for some event, then a cancel prevents the connection even
though the connected user is a superuser


CREATE OR REPLACE FUNCTION on_login_proc2() RETURNS EVENT_TRIGGER AS 
$$ begin perform pg_sleep(1); raise notice '%', fx1(100);raise 
notice 'kuku kuku'; end  $$ language plpgsql;


probably nobody will use pg_sleep in this routine, but there can be 
wait on some locks ...


Regards

Pavel





Hi
Thank you very much for looking at my patch for connection triggers.
I have fixed 1-3 issues in the attached patch.
Concerning 4 and 5 I have some doubts:

4. Should I add some extra GUC to allow firing of session_start trigger 
in case of  RESET ALL or DISCARD ALL ?

Looks like such behavior contradicts the event name "session_start"...
And do we really need it? If some pooler is using RESET ALL/DISCARD ALL
to emulate session semantics, then most likely it provides a way to define
custom actions that should be performed for session initialization. As far
as I know, pgbouncer, for example, allows defining its own on-connect hooks.


5. I do not quite understand your concern. If I define a trigger
procedure which is blocked (as in your example), then I can use
pg_cancel_backend to interrupt execution of the login trigger and a
superuser can log in. What should be changed here?




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/event-trigger.sgml b/doc/src/sgml/event-trigger.sgml
index 60366a9..f7c3c14 100644
--- a/doc/src/sgml/event-trigger.sgml
+++ b/doc/src/sgml/event-trigger.sgml
@@ -28,12 +28,31 @@
  An event trigger fires whenever the event with which it is associated
  occurs in the database in which it is defined. Currently, the only
  supported events are
+ session_start,
  ddl_command_start,
  ddl_command_end,
  table_rewrite
  and sql_drop.
  Support for additional events may be added in future releases.

+table_rewrite
+   
+ The session_start event occurs on backend start when connection with user was established.
+ As far as bug in trigger procedure can prevent normal login to the system, there are two mechanisms
+ for preventing it:
+ 
+  
+   
+ GUC disable_session_start_trigger which makes it possible to disable firing triggers on session startup.
+   
+  
+   
+ Ignoring errors in trigger procedure for superuser. Error message is delivered to client as NOTICE
+ in this case
+   
+  
+ 
+   
 

  The ddl_command_start event occurs just before the
diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index 7844880..a341fee 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -48,6 +48,8 @@
 #include "utils/rel.h"
 #include "utils/syscache.h"
 
+bool disable_session_start_trigger;
+
 typedef struct EventTriggerQueryState
 {
 	/* memory context for this state's objects */
@@ -130,6 +132,7 @@ CreateEventTrigger(CreateEventTrigStmt *stmt)
 	if (strcmp(stmt->eventname, "ddl_command_start") != 0 &&
 		strcmp(stmt->eventname, "ddl_command_end") != 0 &&
 		strcmp(stmt->eventname, "sql_drop") != 0 &&
+		strcmp(stmt->eventname, "session_start") != 0 &&
 		strcmp(stmt->eventname, "table_rewrite") != 0)
 		ereport(ERROR,
 (errcode(ERRCODE_SYNTAX_ERROR),
@@ -562,6 +565,9 @@ EventTriggerCommonSetup(Node *parsetree,
 	ListCell   *lc;
 	List	   *runlist = NIL;
 
+	/* Get the command tag. */
+	tag = parsetree ? CreateCommandTag(parsetree) : CMDTAG_CONNECT;
+
 	/*
 	 * We want the list of command tags for which this procedure is actually
 	 * invoked to match up exactly with the list that CREATE EVENT TRIGGER
@@ -577,22 +583,18 @@ EventTriggerCommonSetup(Node *parsetree,
 	 * relevant command tag.
 	 */
 #ifdef USE_ASSERT_CHECKING
+	if (event == EVT_DDLCommandStart ||
+		event == EVT_DDLCommandEnd ||
+		event == EVT_SQLDrop ||
+		event == EVT_Connect)
 	{
-		CommandTag	dbgtag;
-
-		dbgtag = CreateCommandTag(parsetree);
-		if (event == EVT_DDLCommandStart ||
-			event == EVT_DDLCommandEnd ||
-			event == EVT_SQLDrop)
-		{
-			if (!command_tag_event_trigger_ok(dbgtag))
-elog(ERROR, "unexpected command tag \"%s\"", GetCommandTagName(dbgtag));
-		}
-		else if (event == 

Re: Fix overflow at return wchar2char (src/backend/utils/adt/pg_locale.c)

2020-09-14 Thread Daniel Gustafsson
> On 14 Sep 2020, at 14:41, Ranier Vilela  wrote:

> 1. wchar2char has a mistake when checking the return of WideCharToMultiByte 
> call.
> result variable is unsigned, therefore, cannot be less than zero, returning 
> -1 is not an option.

If the objection is that an unsigned var is tested with <= 0, then changing the
semantics of the function seems a rather drastic solution:

/* A zero return is failure */
-   if (result <= 0)
-   result = -1;
+   if (result == 0)
+   return 0;

The comment for wchar2char explicitly state "This has the same API as the
standard wcstombs_l() function;", and man wcstombs_l shows:

RETURN VALUES
 The wcstombs() function returns the number of bytes converted (not
 including any terminating null), if successful; otherwise, it returns
 (size_t)-1.

It can of course be argued that the check should be "result == 0" as result is
of type size_t.  The original commit introducing this in 2007, 654dcfb9e4b6,
had an integer return variable, so it's just a carry-over from there.  Will
changing that buy us anything, except possibly silence a static analyzer?

> 2. strftime or strftime_win32, return cannot be less than zero.
> 
> 3. If strftime or strftime_win32, fails, why not abort the loop?

This recently changed in 7ad1cd31bfc, and the commit message along with the
comment above the code implies that an error is unlikely:

* MAX_L10N_DATA is sufficient buffer space for every known locale, and
* POSIX defines no strftime() errors.  (Buffer space exhaustion is not an
* error.)

..so it's probably a case of not optimizing for never-happens scenarios: the
fact that a strftime failure will trigger elog and not ereport is an additional
clue that an error is unlikely.

cheers ./daniel



Re: problem with RETURNING and update row movement

2020-09-14 Thread Amit Langote
On Mon, Sep 14, 2020 at 5:56 PM Etsuro Fujita  wrote:
> On Mon, Sep 14, 2020 at 3:53 PM Amit Langote  wrote:
> > On Sat, Sep 12, 2020 at 5:42 AM Alvaro Herrera  
> > wrote:
> > > I noticed that this bugfix has stalled, probably because the other
> > > bugfix has also stalled.
> > >
> > > It seems that cleanly returning system columns from table AM is not
> > > going to be a simple task -- whatever fix we get for that is likely not
> > > going to make it all the way to PG 12.  So I suggest that we should fix
> > > this bug in 11-13 without depending on that.
> >
> > Although I would be reversing course on what I said upthread, I tend
> > to agree with that, because the core idea behind the fix for this
> > issue does not seem likely to be invalidated by any conclusion
> > regarding the other issue.  That is, as far as the issue here is
> > concerned, we can't avoid falling back to using the source partition's
> > RETURNING projection whose scan tuple is provided using the source
> > partition's tuple slot.
>
> I agree on that point.
>
> > However, given that we have to pass the *new* tuple to the projection,
> > not the old one, we will need a "clean" way to transfer its (the new
> > tuple's) system columns into the source partition's tuple slot.  The
> > sketch I gave upthread of what that could look like was not liked by
> > Fujita-san much.
>
> IIUC, I think two issues are discussed in the thread [1]: (a) there is
> currently no way to define the set of meaningful system columns for a
> partitioned table that contains pluggable storages other than standard
> heap, and (b) even in the case where the set is defined as before,
> like partitioned tables that don’t contain any pluggable storages,
> system column values are not obtained from a tuple inserted into a
> partitioned table in cases as reported in [1], because virtual slots
> are assigned for partitioned tables [2][3].  (I think the latter is
> the original issue in the thread, though.)

Right, (b) can be solved by using a leaf partition's tuple slot as
proposed.  Although (a) needs a long term fix.

> I think we could fix this update-tuple-routing-vs-RETURNING issue
> without the fix for (a).  But to transfer system column values in a
> cleaner way, I think we need to fix (b) first so that we can always
> obtain/copy them from the new tuple moved to another partition by
> INSERT following DELETE.

Yes, I think you are right.  Although, I am still not sure how to
"copy" system columns from one partition's slot to another's, that is,
without assuming what they are.

> (I think we could address this issue for v11 independently of [1], of
> course.)

Yeah.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




pg_restore causing deadlocks on partitioned tables

2020-09-14 Thread Domagoj Smoljanovic
Hi all.

I tried searching for the response to this but couldn’t find any. Tried also 
posting to general but got no love there.

I have pg_restore running in parallel (3 or more) and processing large amount 
of data that is in partitioned tables. However it seems that sometime deadlock 
appears when one process is trying to process primary key on parent table while 
data still hasn’t been loaded into partitions. And acquires Exclusive Lock on 
the whole table. Then another process comes and tries to load one of the 
partitions with SharedLock but it fails.

This of course doesn’t happen always; depending on the course of actions of the 
pg_restore. But often enough to cause frustration.

Process 15858 waits for AccessShareLock on relation 233358134 of database 
233346697; blocked by process 15861.
Process 15861 waits for AccessExclusiveLock on relation 233374757 of database 
233346697; blocked by process 15858.
Process 15858: TRUNCATE TABLE ONLY myschema."myTable:2020-09-01";
Process 15861: ALTER TABLE ONLY myschema."myTable" ADD CONSTRAINT "pk_myTable" 
PRIMARY KEY ("ID", date);

Should this be treated as a bug or am I doing something wrong?

Disclaimer: --load-via-partition-root was NOT used, meaning that the warning from 
the pg_dump documentation should not be applicable.

Thanx,
Domagoj



Fix overflow at return wchar2char (src/backend/utils/adt/pg_locale.c)

2020-09-14 Thread Ranier Vilela
Hi,

1. wchar2char has a mistake when checking the return of WideCharToMultiByte
call.
result variable is unsigned, therefore, cannot be less than zero, returning
-1 is not an option.

2. strftime or strftime_win32, return cannot be less than zero.

3. If strftime or strftime_win32, fails, why not abort the loop?

regards,
Ranier Vilela


v1-0001-fix_overflow_wchar2char.patch
Description: Binary data


Re: TDE (Transparent Data Encryption) supported ?

2020-09-14 Thread Stephen Frost
Greetings,

We'd prefer it if you didn't top-post (just write some stuff at the top)
when you respond and post to these mailing lists.

* laurent.fe...@free.fr (laurent.fe...@free.fr) wrote:
> I come back to your comments about attack vectors. I know that TDE protects 
> against disk theft, not really more ..

That is a data-at-rest concern and TDE is one approach to addressing it.

> But the company has some internal rules and some of them require "At Rest" 
> encryption, nothing more is mentioned.
> Then, even if TDE is not THE solution in term of security, it is something 
> that companies want.

Disk-based encryption is available for basically all operating systems
and PostgreSQL works reasonably well on top of encrypted filesystems or
block devices.  That's all available today, works quite well to deal
with the "someone stole the disk" or "someone forgot to wipe the drive
before throwing it away" attack vectors.

In particular, I'd encourage you to look at Linux with LUKS for data at
rest encryption.  You can then simply run PostgreSQL on top of that and
be protected without any of the complications which TDE introduces.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Fix for parallel BTree initialization bug

2020-09-14 Thread Amit Kapila
On Fri, Sep 11, 2020 at 4:41 PM Amit Kapila  wrote:
>
> On Fri, Sep 11, 2020 at 8:07 AM Justin Pryzby  wrote:
> >
> I have tested this on HEAD. It would be great if you can verify in
> back branches as well. I'll also do it before commit.
>

I am planning to push this tomorrow after doing testing on
back-branches. Let me know if you have any comments.


-- 
With Regards,
Amit Kapila.




Re: Use incremental sort paths for window functions

2020-09-14 Thread Daniel Gustafsson
> On 8 Jul 2020, at 06:57, David Rowley  wrote:
> 
> Over on [1] someone was asking about chained window paths making use
> of already partially sorted input.  (The thread is on -general, so I
> guessed they're not using PG13.)

The [1] reference wasn't qualified, do you remember which thread it was?

> However, On checking PG13 to see if incremental sort would help their
> case, I saw it didn't. Looking at the code I saw that
> create_window_paths() and create_one_window_path() don't make any use
> of incremental sort paths.

Commit 728202b63cdcd7f counteracts this optimization in part since it orders
the windows such that the longest common prefix is executed first to allow
subsequent windows to skip sorting entirely.

That being said, it's only in part, and when the stars don't align with
subsequently shorter common prefixes then incremental sort can help.  A synthetic
unscientific test with three windows over 10M rows, where no common prefix
exists, shows consistent speedups (for worst cases) well past what can be
attributed to background noise.
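
As an illustration of the kind of shape where the new paths can apply (made-up
table and column names, not the exact test I ran): consecutive windows that
share only a leading sort column currently need a full re-sort each, whereas an
incremental sort could reuse the ordering on the shared column.

SELECT sum(x) OVER (ORDER BY a, b),
       sum(x) OVER (ORDER BY a, c),
       row_number() OVER (ORDER BY a, d)
FROM t;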

> I quickly put together the attached. It's only about 15 mins of work,
> but it seems worth looking at a bit more for some future commitfest.
> Yeah, I'll need to add some tests as I see nothing failed by changing
> this.

A few comments on the patch: there is no check for enable_incremental_sort, and
it lacks tests (as already mentioned) for the resulting plan.

cheers ./daniel



Re: TDE (Transparent Data Encryption) supported ?

2020-09-14 Thread Hans-Jürgen Schönig (PostgreSQL)
hi …

well, the reason why there is no key management is that we wanted to keep the 
interface open so that people 
can integrate with everything they want. i have seen keymanagement tools come 
and go.
postgresql is certainly gonna live longer than many of those things.
thus we intentionally decided to be flexible here.

as for postgresql 14: we are certainly willing to push that into 14 but the 
past couple of years
on the TDE front have been a little frustrating for us.
it seems we cannot reach consensus so it is pretty likely that we have to
keep maintaining it separately. i would love to see it in core but i don’t 
expect that to happen
under current circumstances. 

many thanks,

hans



> On 11.09.2020, at 17:19, Stephen Frost  wrote:
> 
> Greetings,
> 
> * laurent.fe...@free.fr (laurent.fe...@free.fr) wrote:
>> There is a fork named PostgreSQL 12.x TDE from Cybertec. The issue is that 
>> there is no key management at all.
> 
> Key management is definitely an understood issue in the community and
> there was a fair bit of work put into trying to solve that problem last
> year and hopefully that'll continue and perhaps be to a point where we
> have a solution in v14.  With that, perhaps we can also get TDE in, but
> there certainly aren't any guarantees.
> 
> What really needs to be considered, however, is what your attack vectors
> are and if TDE would really address them (often it doesn't, as with any
> TDE the keys end up at least in the memory of the database server and
> therefore available to an attacker, not to mention that the data ends up
> being sent to the application unencrypted, and no, TLS/SSL doesn't help
> that since an attacker on the server would be able to get at the data
> before it's wrapped in TLS).
> 
>> Using pgcrypto has an impact on the application then I have to give up this 
>> way.
> 
> pgcrypto is an option but, again, the keys end up on the database server
> and therefore it may not be very helpful, depending on what the attack
> vector is that you're concerned about.
> 
>> There is another alternative named "Client-Side Encryption'' that I have not 
>> looked at in detail yet. I'm afraid that this solution has an impact on the 
>> application too. And if there are two applications pointing to the same 
>> database I am wondering how the encryption key is shared between the two 
>> nodes.
> 
> Performing encryption of the sensitive data as early as possible is
> certainly the best answer, and that means doing it in the application
> before it ever reaches the database server.  Yes, that means that
> changes have to be made to the application, but it's a much better
> solution which will actually address real attack vectors, like the
> database server being compromised.
> 
> In general, as it relates to multiple applications connecting to the
> same database- I'd just downright discourage that.  PG is much lighter
> weight than other RDBMS solutions and you don't need to have 100
> different applications connecting to the same database server- instead
> have an independent cluster for each application and use certificates or
> other strong authentication mechanism to make sure that app A can only
> connect to the PG cluster for app A and not to any others- this
> completely removes the concern about an SQL injection attack on app A
> allowing the attacker to gain access to app B's data.
> 
> Another advantage of this is that the individual clusters are smaller,
> and they can be scaled independently, backed up independently, and
> restored independently, all of which are really good things.
> 
>> The last point is about the backups, whatever the solution, the data has to 
>> be in an encrypted format when "backuping".
> 
> Use a backup technology that provides encryption- pgbackrest does, and
> some others probably do too.
> 
> Thanks,
> 
> Stephen

--
Cybertec PostgreSQL International GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com 










Re: Avoid incorrect allocation in buildIndexArray

2020-09-14 Thread Ranier Vilela
On Sun, Sep 13, 2020 at 22:46, Michael Paquier
wrote:

> On Sat, Sep 12, 2020 at 12:40:49PM +0200, Julien Rouhaud wrote:
> > agreed.
>
> Ok, done as ac673a1 then.
>
Thanks Michael.

regards,
Ranier Vilela


Re: [PATCH] Automatic HASH and LIST partition creation

2020-09-14 Thread Anastasia Lubennikova

On 08.09.2020 17:03, Pavel Borisov wrote:


The patch lacks documentation, because I expect some details may
change during discussion. Other than that, the feature is ready
for review.

Hi, hackers!

From what I've read I see there is much interest in automatic 
partitions creation. (Overall discussion on the topic is partitioned 
into two threads: (1) 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1907150711080.22273%40lancre and 
(2) 
https://www.postgresql.org/message-id/flat/7fec3abb-c663-c0d2-8452-a46141be6...@postgrespro.ru 
(current thread) )


There were many syntax proposals and finally, there is a patch 
realizing one of them. So I'd like to review it.


The syntax proposed in the patch seems good enough for me and is in 
accordance with one of the proposals in the discussion. Maybe I'd 
prefer using the word AUTOMATICALLY/AUTO instead of CONFIGURATION with 
explicit meaning that using this syntax we'd get already 
(automatically) created partitions and don't need to create them 
manually, as in the existing state of postgresql declarative 
partitioning.


CREATE TABLE tbl (i int) PARTITION BY HASH (i) AUTOMATICALLY (MODULUS 3);
(partitions are created automatically)
vs
CREATE TABLE tbl (i int) PARTITION BY HASH (i); (partitions should be created
manually by use of PARTITION OF)

CREATE TABLE tbl (i char) PARTITION BY LIST (i) AUTOMATICALLY (VALUES
IN ('a', 'b'), ('c', 'd'), ('e','f') DEFAULT PARTITION tbl_default);

vs
CREATE TABLE tbl (i char) PARTITION BY LIST (i); (partitions should be created
manually by use of PARTITION OF)

I think this syntax can also be extended later with adding automatic 
creation of RANGE partitions, with IMMEDIATE/DEFERRED for 
dynamic/on-demand automatic partition creation, and with SUBPARTITION 
possibility.


But I don't have a strong preference for the word AUTOMATICALLY, 
moreover I saw opposition to using AUTO at the top of the discussion. 
I suppose we can go with the existing CONFIGURATION word.


I agree that the 'AUTOMATICALLY' keyword is more specific and probably less 
confusing for users. I've picked 'CONFIGURATION' simply because it is an 
already existing keyword. I would like to hear other opinions on that.



Compared with existing declarative partitioning, I think automatic 
creation simplifies the process for the end user, and I'd vote for 
committing it into Postgres. The patch is short and clean in code style. 
It has enough comments. Tests covering the new functionality are 
included. Yet it doesn't have documentation, and I'd suppose it's worth 
adding it. Even if there will be syntax changes, I hope they will not 
be more than the replacement of several words. The current syntax is 
described in the text of the patch.




Fair enough. The new patch contains a documentation draft. While writing it, 
I also noticed that the syntax introduced in the patch differs from the 
Greenplum one. For now, the list partitioning clause doesn't support the 
'PARTITION name' part that is supported in Greenplum. I don't think 
that we aim for 100% compatibility here. Still, the ability to provide 
table names is probably a good optional feature, especially for list 
partitions.
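
For illustration only - this is hypothetical syntax, not something the current
patch accepts - it could look roughly like:

CREATE TABLE tbl (i char) PARTITION BY LIST (i) CONFIGURATION
    (VALUES IN ('a', 'b') PARTITION tbl_ab,
     VALUES IN ('c', 'd') PARTITION tbl_cd
     DEFAULT PARTITION tbl_default);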


What do you think?


The patch applies cleanly and installcheck-world is passed.

Some minor things:

I've got a compiler warning:
parse_utilcmd.c:4280:15: warning: unused variable 'lc' [-Wunused-variable]


Fixed. This was also caught by cfbot. This version should pass it clean.



When the number of partitions is over the maximum value of int32 the 
output shows a generic syntax error. I don't think it is very 
important as it is not the case someone will make deliberately, but 
maybe it's better to output something like "Partitions number is more 
than the maximum supported value"
create table test (i int, t text) partition by hash (i) configuration 
(modulus );

ERROR:  syntax error at or near ""


This value is not a valid int32 number, thus parser throws the error 
before we have a chance to handle it more gracefully.




One more piece of nitpicking. Probably we can go just with a mention 
in documentation.
create table test (i int, t text) partition by hash (i) configuration 
(modulus );

ERROR:  out of shared memory
HINT:  You might need to increase max_locks_per_transaction.

Well, it looks like a legitimate error when we try to lock a lot of objects 
in one transaction. I will double-check whether we fail to release a lock 
somewhere.


Do we need to restrict the number of partitions, that can be created by 
this statement? With what number?  As far as I see, there is no such 
restriction for now, just a recommendation about performance issues. 
With automatic creation it becomes easier to mess with it.


Probably it's enough to mention it in the documentation and rely on users' 
common sense.



Typo:
+ /* Add statemets to create each partition after we create parent 
table */



Fixed.

Overall I see the patch almost ready for commit and I'd like to meet 
this 

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-09-14 Thread Dilip Kumar
On Mon, Sep 14, 2020 at 4:50 PM Amit Kapila  wrote:
>
> On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar  wrote:
> >
> > On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila  wrote:
> > >
> > >
> > > Yeah, this is right, and here is some initial analysis. It seems to be
> > > failing in below code:
> > > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}
> > >
> > > This list can have elements only in 'streaming' mode (need to enable
> > > 'streaming' with Create Subscription command) whereas none of the
> > > tests in 010_truncate.pl is using 'streaming', so this list should be
> > > empty (NULL). The two different assertion failures shown in BF reports
> > > in list_free code are as below:
> > > Assert(list->length > 0);
> > > Assert(list->length <= list->max_length);
> > >
> > > It seems to me that this list is not initialized properly when it is
> > > not used or maybe that is true in some special circumstances because
> > > we initialize it in get_rel_sync_entry(). I am not sure if CCI build
> > > is impacting this in some way.
> >
> >
> > Even I have analyzed this but did not find any reason why the
> > streamed_txns list should be anything other than NULL.  The only thing
> > is we are initializing the entry->streamed_txns to NULL and the list
> > free is checking  "if (list == NIL)" then return. However IMHO, that
> > should not be an issue becase NIL is defined as (List*) NULL.
> >
>
> Yeah, that is not the issue but it is better to initialize it with NIL
> for the sake of consistency. The basic issue here was we were trying
> to open/lock the relation(s) before initializing this list. Now, when
> we process the invalidations during open relation, we try to access
> this list in rel_sync_cache_relation_cb and that leads to assertion
> failure. I have reproduced the exact scenario of 010_truncate.pl via
> debugger. Basically, the backend on publisher has sent the
> invalidation after truncating the relation 'tab1' and while processing
> the truncate message if WALSender receives that message exactly after
> creating the RelSyncEntry for 'tab1', the Assertion shown in BF can be
> reproduced.

Yeah, this is an issue and I am also able to reproduce this manually
using gdb.  Basically, I inserted some data into the publication table
and after that I stopped in get_rel_sync_entry after creating the
entry and before calling GetRelationPublications.  Meanwhile, I
truncated the table and then hit the same issue you pointed out here.

> The attached patch will fix the issue. What do you think?

The patch looks good to me and fixing the reported issue.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Division in dynahash.c due to HASH_FFACTOR

2020-09-14 Thread David Rowley
On Thu, 10 Sep 2020 at 14:55, Thomas Munro  wrote:
>
> I wrote a draft commit message for Jakub's proposed change (attached),
> and did a little bit of testing, but I haven't seen a case where it
> wins yet; I need to work on something else now but thought I'd share
> this much anyway.  One observation is that the rule in init_htab had
> better agree with the expansion rule in hash_search_with_hash_value,
> otherwise you can size your hash table perfectly for n elements and
> then it still has to expand when you insert the nth element, which is
> why I changed >= to >.  Does this make sense?  Oh man, I don't like
> the mash-up of int, long, uint32, Size dynahash uses in various places
> and that are brought into relief by that comparison...

I just did some benchmarking with this patch using the same recovery
benchmark that I used in [1] and also the two patches that I posted in
[2]. Additionally, I added a PANIC at the end of recovery so that I
could repeat the recovery over and over again with the same WAL.

Without 0001-Remove-custom-fill-factor-support-from-dynahash.c.patch,
the time in seconds reported for recovery was as follows:

Run 1 65.2
Run 2 65.41
Run 3 65.1
Run 4 64.86
Run 5 62.29
Run 6 67.06
Run 7 63.35
Run 8 63.1
Run 9 62.8
Run 10 62.15

Avg 64.132
Med 64.105

After patching I got:

Run 1 60.69
Run 2 63.13
Run 3 63.24
Run 4 62.13
Run 5 63.81
Run 6 60.27
Run 7 62.85
Run 8 63.45
Run 9 59.6
Run 10 63.16

Avg 62.233
Med 62.99

I was quite happy to see 59.6 seconds in there on run 9 as I'd been
trying hard to get that benchmark to go below 1 minute on my machine.

perf appears to indicate that less time was spent in
hash_search_with_hash_value()

Unpatched:
  26.96%  postgres  postgres[.] PageRepairFragmentation
  12.07%  postgres  libc-2.31.so[.] __memmove_avx_unaligned_erms
  10.13%  postgres  postgres[.] hash_search_with_hash_value
   6.70%  postgres  postgres[.] XLogReadBufferExtended

Patched:
  27.70%  postgres  postgres[.] PageRepairFragmentation
  13.50%  postgres  libc-2.31.so[.] __memmove_avx_unaligned_erms
   9.63%  postgres  postgres[.] hash_search_with_hash_value
   6.44%  postgres  postgres[.] XLogReadBufferExtended

I looked over the patch and the only thing I saw was that we might
also want to remove the following line:

#define DEF_FFACTOR 1 /* default fill factor */

Also, a couple of things I noticed when looking at performance
profiles in detail.

1) The most expensive user of hash_search_with_hash_value() comes from
BufTableLookup() which passes HASH_FIND as the HASHACTION.
2) The code that was doing the divide in the unpatched version only
triggered when HASHACTION was HASH_ENTER or HASH_ENTER_NULL.

The 2nd most costly call to hash_search_with_hash_value() came in via
hash_search() via smgropen(). That does use HASH_ENTER, which could
have triggered the divide code. The main caller of smgropen() was
XLogReadBufferExtended().

So, it looks unlikely that any gains we are seeing are from improved
buffer lookups. It's more likely they're coming from more optimal
XLogReadBufferExtended()

David

[1] 
https://www.postgresql.org/message-id/caaphdvokwqazhiuxet8jsqupjkdph8dnuzdfusx9p7dxrjd...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAApHDvo9n-nOb3b4PYFx%2Bw9mqd2SSUHm_oAs039eZnZLqFGcbQ%40mail.gmail.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-09-14 Thread Amit Kapila
On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar  wrote:
>
> On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila  wrote:
> >
> >
> > Yeah, this is right, and here is some initial analysis. It seems to be
> > failing in below code:
> > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}
> >
> > This list can have elements only in 'streaming' mode (need to enable
> > 'streaming' with Create Subscription command) whereas none of the
> > tests in 010_truncate.pl is using 'streaming', so this list should be
> > empty (NULL). The two different assertion failures shown in BF reports
> > in list_free code are as below:
> > Assert(list->length > 0);
> > Assert(list->length <= list->max_length);
> >
> > It seems to me that this list is not initialized properly when it is
> > not used or maybe that is true in some special circumstances because
> > we initialize it in get_rel_sync_entry(). I am not sure if CCI build
> > is impacting this in some way.
>
>
> Even I have analyzed this but did not find any reason why the
> streamed_txns list should be anything other than NULL.  The only thing
> is we are initializing the entry->streamed_txns to NULL and the list
> free is checking  "if (list == NIL)" then return. However IMHO, that
> should not be an issue because NIL is defined as (List*) NULL.
>

Yeah, that is not the issue but it is better to initialize it with NIL
for the sake of consistency. The basic issue here was we were trying
to open/lock the relation(s) before initializing this list. Now, when
we process the invalidations during open relation, we try to access
this list in rel_sync_cache_relation_cb and that leads to assertion
failure. I have reproduced the exact scenario of 010_truncate.pl via
debugger. Basically, the backend on publisher has sent the
invalidation after truncating the relation 'tab1' and while processing
the truncate message if WALSender receives that message exactly after
creating the RelSyncEntry for 'tab1', the Assertion shown in BF can be
reproduced.

The attached patch will fix the issue. What do you think?

-- 
With Regards,
Amit Kapila.


v1-0001-Fix-initialization-of-RelationSyncEntry-for-strea.patch
Description: Binary data


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-14 Thread Ashutosh Sharma
On Sun, Sep 13, 2020 at 3:30 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > I have committed this version.
>
> This failure says that the test case is not entirely stable:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer=2020-09-12%2005%3A13%3A12
>
> diff -U3 
> /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out
>  
> /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out
> --- 
> /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out
>2020-09-11 06:31:36.0 +
> +++ 
> /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out
> 2020-09-12 11:40:26.0 +
> @@ -116,7 +116,6 @@
>   vacuum freeze htab2;
>   -- unused TIDs should be skipped
>   select heap_force_kill('htab2'::regclass, ARRAY['(0, 2)']::tid[]);
> - NOTICE:  skipping tid (0, 2) for relation "htab2" because it is marked 
> unused
>heap_force_kill
>   -
>
>
> sungazer's first run after pg_surgery went in was successful, so it's
> not a hard failure.  I'm guessing that it's timing dependent.
>
> The most obvious theory for the cause is that what VACUUM does with
> a tuple depends on whether the tuple's xmin is below global xmin,
> and a concurrent autovacuum could very easily be holding back global
> xmin.  While I can't easily get autovac to run at just the right
> time, I did verify that a concurrent regular session holding back
> global xmin produces the symptom seen above.  (To replicate, insert
> "select pg_sleep(...)" in heap_surgery.sql before "-- now create an unused
> line pointer"; run make installcheck; and use the delay to connect
> to the database manually, start a serializable transaction, and do
> any query to acquire a snapshot.)
>

Thanks for reporting. I'm able to reproduce the issue by adding some
delay just before "-- now create an unused line pointer" and using the
delay to start a new session at either the repeatable read or
serializable transaction isolation level and run a query on the
test table. To fix this, as you suggested, I've converted the test
table to a temp table. Attached is the patch with the changes.
Please have a look and let me know about any concerns.
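
For reference, the concurrent session only needs to acquire and hold a snapshot
while the test's VACUUM FREEZE runs, along these lines (any table works; the
query here is just for illustration):

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM pg_class;  -- acquires the transaction snapshot
-- keep this transaction open until heap_surgery.sql finishes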

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
index 9451c57..d2cf647 100644
--- a/contrib/pg_surgery/expected/heap_surgery.out
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -87,7 +87,7 @@ NOTICE:  skipping tid (0, 6) for relation "htab" because the item number is out
 
 rollback;
 -- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+create temp table htab2(a int);
 insert into htab2 values (100);
 update htab2 set a = 200;
 vacuum htab2;
@@ -175,6 +175,5 @@ ERROR:  "vw" is not a table, materialized view, or TOAST table
 select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
 ERROR:  "vw" is not a table, materialized view, or TOAST table
 -- cleanup.
-drop table htab2;
 drop view vw;
 drop extension pg_surgery;
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
index 8a27214..3635bd5 100644
--- a/contrib/pg_surgery/sql/heap_surgery.sql
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -41,7 +41,7 @@ select heap_force_freeze('htab'::regclass, ARRAY['(0, 0)', '(0, 6)']::tid[]);
 rollback;
 
 -- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+create temp table htab2(a int);
 insert into htab2 values (100);
 update htab2 set a = 200;
 vacuum htab2;
@@ -86,6 +86,5 @@ select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
 select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
 
 -- cleanup.
-drop table htab2;
 drop view vw;
 drop extension pg_surgery;


Re: pg_dump --where option

2020-09-14 Thread Surafel Temesgen
On Fri, Jul 31, 2020 at 1:38 AM Daniel Gustafsson  wrote:

>
> >  $ pg_dump -d cary --where="test1:a3 = ( select max(aa1) from test2 )" >
> testdump2
> >  $ pg_dump: error: processing of table "public.test1" failed
> >
> > both test1 and test2 exist in the database and the same subquery works
> under psql.
> >
>

This is because pg_dump uses schema-qualified object names; I have added
documentation saying to use schema-qualified names (for example,
public.test2 rather than test2) in the subquery.




> > I also notice that the regression tests for pg_dump is failing due to
> the patch, I think it is worth looking into the failure messages and also
> add some test cases on the new "where" clause to ensure that it can cover
> as many use cases as possible.
>
>
I fixed the regression test failure in the attached patch.

I didn’t add tests because single-quotes and double-quotes are
meta-characters for PROVE too.

regards

Surafel
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0b2e2de87b..7dc3041247 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1104,6 +1104,24 @@ PostgreSQL documentation
   
  
 
+ 
+  --where=table:filter_clause
+  
+   
+When dumping data for table, only include rows
+that meet the filter_clause condition.
+if --where contains subquery, uses schema-qualified name otherwise
+it is error because pg_dump uses schema-qualified object name to identifies the tables.
+This option is useful when you want to dump only a subset of a particular table.
+--where can be given more than once to provide different filters
+for multiple tables. Note that if multiple options refer to the same table,
+only the first filter_clause will be applied. If necessary, use quotes in your shell to
+provide an argument that contains spaces.
+E.g. --where=mytable:"created_at >= '2018-01-01' AND test = 'f'"
+   
+  
+ 
+
  
-?
--help
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d3ca54e4dc..418684e272 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -121,6 +121,8 @@ static SimpleStringList tabledata_exclude_patterns = {NULL, NULL};
 static SimpleOidList tabledata_exclude_oids = {NULL, NULL};
 static SimpleStringList foreign_servers_include_patterns = {NULL, NULL};
 static SimpleOidList foreign_servers_include_oids = {NULL, NULL};
+static SimpleStringList tabledata_where_patterns = {NULL, NULL};
+static SimpleOidList tabledata_where_oids = {NULL, NULL};
 
 static const CatalogId nilCatalogId = {0, 0};
 
@@ -156,7 +158,8 @@ static void expand_foreign_server_name_patterns(Archive *fout,
 static void expand_table_name_patterns(Archive *fout,
 	   SimpleStringList *patterns,
 	   SimpleOidList *oids,
-	   bool strict_names);
+	   bool strict_names,
+	   bool match_data);
 static NamespaceInfo *findNamespace(Archive *fout, Oid nsoid);
 static void dumpTableData(Archive *fout, TableDataInfo *tdinfo);
 static void refreshMatViewData(Archive *fout, TableDataInfo *tdinfo);
@@ -387,6 +390,7 @@ main(int argc, char **argv)
 		{"on-conflict-do-nothing", no_argument, _nothing, 1},
 		{"rows-per-insert", required_argument, NULL, 10},
 		{"include-foreign-data", required_argument, NULL, 11},
+		{"where", required_argument, NULL, 12},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -604,6 +608,10 @@ main(int argc, char **argv)
 		  optarg);
 break;
 
+			case 12:/* table data WHERE clause */
+simple_string_list_append(_where_patterns, optarg);
+break;
+
 			default:
 fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 exit_nicely(1);
@@ -806,17 +814,26 @@ main(int argc, char **argv)
 	{
 		expand_table_name_patterns(fout, _include_patterns,
    _include_oids,
-   strict_names);
+   strict_names, false);
 		if (table_include_oids.head == NULL)
 			fatal("no matching tables were found");
 	}
+
+	if (tabledata_where_patterns.head != NULL)
+	{
+		expand_table_name_patterns(fout, _where_patterns,
+   _where_oids,
+   true, true);
+		if (tabledata_where_oids.head == NULL)
+			fatal("no matching table was found");
+	}
 	expand_table_name_patterns(fout, _exclude_patterns,
 			   _exclude_oids,
-			   false);
+			   false, false);
 
 	expand_table_name_patterns(fout, _exclude_patterns,
 			   _exclude_oids,
-			   false);
+			   false, false);
 
 	expand_foreign_server_name_patterns(fout, _servers_include_patterns,
 		_servers_include_oids);
@@ -1047,6 +1064,7 @@ help(const char *progname)
 	printf(_("  --use-set-session-authorization\n"
 			 "   use SET SESSION AUTHORIZATION commands instead of\n"
 			 "   ALTER OWNER commands to set ownership\n"));
+	printf(_("  --where=TABLE:WHERE_CLAUSE   only dump selected rows for the given table\n"));
 
 	

Re: On login trigger: take three

2020-09-14 Thread Pavel Stehule
Hi

I have checked the latest patch, and here are my notes:

1. disable_session_start_trigger should be SU_BACKEND instead of SUSET

2. The documentation should be enhanced - there is no note about the
behaviour when there are unhandled exceptions, nor about the motivation for
this event trigger

3. The regression tests should be enhanced - the cases with exceptions are
not tested (see the sketch below)

4. This trigger is not executed again after RESET ALL or DISCARD ALL - it
can be a problem if somebody wants to use this trigger for initialisation
of some session objects with some pooling solutions.

5. Error handling doesn't work well for cancellation. If the event trigger
waits for some event, then cancelling it blocks the connection even though the
connected user is a superuser

CREATE OR REPLACE FUNCTION on_login_proc2() RETURNS EVENT_TRIGGER AS $$
begin
  perform pg_sleep(1);
  raise notice '%', fx1(100);
  raise notice 'kuku kuku';
end
$$ language plpgsql;

probably nobody will use pg_sleep in this routine, but there can be waits on
some locks ...
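
To make points 2 and 3 concrete, here is a minimal sketch of an exception-safe
login trigger.  The event name session_start is only my assumption based on
the GUC name disable_session_start_trigger, and the function and setting names
are hypothetical:

create or replace function on_login_init() returns event_trigger as $$
begin
  -- hypothetical per-session initialisation
  perform set_config('myapp.login_time', now()::text, false);
exception
  when others then
    -- trap errors here so a bug in the trigger cannot block logins
    raise warning 'login trigger failed: %', sqlerrm;
end;
$$ language plpgsql;

create event trigger on_login on session_start
  execute function on_login_init();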

Regards

Pavel


unusual use of "path" in pg_verifybackup?

2020-09-14 Thread Peter Eisentraut
pg_verifybackup and the associated backup manifest functionality uses 
"path", "path name", etc. throughout where other components might just 
say "file", "file name".  This isn't wrong, if you follow POSIX 
terminology, but it's a bit unusual, I think.  Specifically, option 
naming like


  -i, --ignore=RELATIVE_PATH  ignore indicated path
  -m, --manifest-path=PATHuse specified path for manifest
  -w, --wal-directory=PATHuse specified path for WAL files

might make one think that there is a search path involved.

Thoughts?  Is there a deeper reason behind this, or should we dial this 
back a bit perhaps?


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: problem with RETURNING and update row movement

2020-09-14 Thread Etsuro Fujita
On Mon, Sep 14, 2020 at 3:53 PM Amit Langote  wrote:
> On Sat, Sep 12, 2020 at 5:42 AM Alvaro Herrera  
> wrote:
> > I noticed that this bugfix has stalled, probably because the other
> > bugfix has also stalled.
> >
> > It seems that cleanly returning system columns from table AM is not
> > going to be a simple task -- whatever fix we get for that is likely not
> > going to make it all the way to PG 12.  So I suggest that we should fix
> > this bug in 11-13 without depending on that.
>
> Although I would be reversing course on what I said upthread, I tend
> to agree with that, because the core idea behind the fix for this
> issue does not seem likely to be invalidated by any conclusion
> regarding the other issue.  That is, as far as the issue here is
> concerned, we can't avoid falling back to using the source partition's
> RETURNING projection whose scan tuple is provided using the source
> partition's tuple slot.

I agree on that point.

> However, given that we have to pass the *new* tuple to the projection,
> not the old one, we will need a "clean" way to transfer its (the new
> tuple's) system columns into the source partition's tuple slot.  The
> sketch I gave upthread of what that could look like was not liked by
> Fujita-san much.

IIUC, I think two issues are discussed in the thread [1]: (a) there is
currently no way to define the set of meaningful system columns for a
partitioned table that contains pluggable storages other than standard
heap, and (b) even in the case where the set is defined as before,
like partitioned tables that don’t contain any pluggable storages,
system column values are not obtained from a tuple inserted into a
partitioned table in cases as reported in [1], because virtual slots
are assigned for partitioned tables [2][3].  (I think the latter is
the original issue in the thread, though.)

I think we could fix this update-tuple-routing-vs-RETURNING issue
without the fix for (a).  But to transfer system column values in a
cleaner way, I think we need to fix (b) first so that we can always
obtain/copy them from the new tuple moved to another partition by
INSERT following DELETE.

(I think we could address this issue for v11 independently of [1], of course.)
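
For context, here is a minimal sketch (hypothetical table names) of the kind
of statement under discussion - a cross-partition UPDATE with RETURNING of
system columns:

create table prt (a int, b text) partition by range (a);
create table prt1 partition of prt for values from (1) to (10);
create table prt2 partition of prt for values from (10) to (20);
insert into prt values (5, 'foo');
-- the UPDATE moves the row from prt1 to prt2 (internally a DELETE followed
-- by an INSERT); the open question above is how the system column values of
-- the newly inserted tuple are fed back through the source partition's
-- RETURNING projection
update prt set a = 15 where a = 5 returning tableoid::regclass, xmin, *;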

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/141051591267657%40mail.yandex.ru
[2] 
https://www.postgresql.org/message-id/CA%2BHiwqHrsNa4e0MfpSzv7xOM94TvX%3DR0MskYxYWfy0kjL0DAdQ%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/20200811180231.fcssvhelqpnydnx7%40alap3.anarazel.de




Re: TDE (Transparent Data Encryption) supported ?

2020-09-14 Thread laurent . feron
Thanks for your responses.
Coming back to your comments about attack vectors: I know that TDE protects
against disk theft, not much more.
But companies have internal rules, and some of them require "at rest"
encryption; nothing more is specified.
So even if TDE is not THE solution in terms of security, it is something that
companies want.
Laurent

- Mail original -
De: "Stephen Frost" 
À: "laurent feron" 
Cc: pgsql-hackers@lists.postgresql.org
Envoyé: Vendredi 11 Septembre 2020 17:19:21
Objet: Re: TDE (Transparent Data Encryption) supported ?

Greetings,

* laurent.fe...@free.fr (laurent.fe...@free.fr) wrote:
> There is a fork named PostgreSQL 12.x TDE from Cybertec. The issue is that 
> there is no key management at all.

Key management is definitely an understood issue in the community and
there was a fair bit of work put into trying to solve that problem last
year and hopefully that'll continue and perhaps be to a point where we
have a solution in v14.  With that, perhaps we can also get TDE in, but
there certainly aren't any guarantees.

What really needs to be considered, however, is what your attack vectors
are and if TDE would really address them (often it doesn't, as with any
TDE the keys end up at least in the memory of the database server and
therefore available to an attacker, not to mention that the data ends up
being sent to the application unencrypted, and no, TLS/SSL doesn't help
that since an attacker on the server would be able to get at the data
before it's wrapped in TLS).

> Using pgcrypto has an impact on the application then I have to give up this 
> way.

pgcrypto is an option but, again, the keys end up on the database server
and therefore it may not be very helpful, depending on what the attack
vector is that you're concerned about.
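
For reference, a minimal sketch (table, column and key names are just
placeholders) of column-level encryption with pgcrypto; note that the
passphrase still has to reach the server, which is exactly the limitation
described above:

create extension if not exists pgcrypto;
create table secrets (id int primary key, payload bytea);
-- the passphrase ('s3cret' here) is visible to the server, and to anyone
-- reading the statement logs, which is the attack-vector caveat above
insert into secrets values (1, pgp_sym_encrypt('sensitive data', 's3cret'));
select pgp_sym_decrypt(payload, 's3cret') from secrets where id = 1;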

> There is another alternative named "Client-Side Encryption'' that I have not 
> looked at in detail yet. I'm afraid that this solution has an impact on the 
> application too. And if there are two applications pointing to the same 
> database I am wondering how the encryption key is shared between the two 
> nodes.

Performing encryption of the sensitive data as early as possible is
certainly the best answer, and that means doing it in the application
before it ever reaches the database server.  Yes, that means that
changes have to be made to the application, but it's a much better
solution which will actually address real attack vectors, like the
database server being compromised.

In general, as it relates to multiple applications connecting to the
same database- I'd just downright discourage that.  PG is much lighter
weight than other RDBMS solutions and you don't need to have 100
different applications connecting to the same database server- instead
have an independent cluster for each application and use certificates or
other strong authentication mechanism to make sure that app A can only
connect to the PG cluster for app A and not to any others- this
completely removes the concern about an SQL injection attack on app A
allowing the attacker to gain access to app B's data.

Another advantage of this is that the individual clusters are smaller,
and they can be scaled independently, backed up independently, and
restored independently, all of which are really good things.

> The last point is about the backups, whatever the solution, the data has to 
> be in an encrypted format when "backuping".

Use a backup technology that provides encryption- pgbackrest does, and
some others probably do too.

Thanks,

Stephen




Re: Gripes about walsender command processing

2020-09-14 Thread Michael Paquier
On Sun, Sep 13, 2020 at 03:47:51PM -0400, Tom Lane wrote:
> While trying to fix this, I also observed that exec_replication_command
> fails to clean up the temp context it made for parsing the command string,
> if that turns out to be a SQL command.  This very accidentally fails to
> lead to a long-term memory leak, because that context will be a child of
> MessageContext so it'll be cleaned out in the next iteration of
> PostgresMain's main loop.  But it's still unacceptably sloppy coding
> IMHO, and it's closely related to a lot of other randomness in the
> function; it'd be better to have a separate early-exit path for SQL
> commands.

Looks fine seen from here.  +1.

> What I'd really like to do is adjust pgstat_report_activity so that
> callers are *required* to provide some kind of command string when
> transitioning into STATE_RUNNING state, ie something like
> 
>   Assert(state == STATE_RUNNING ? cmd_str != NULL : cmd_str == NULL);
> 
> However, before we could do that, we'd have to clean up the rat's nest
> of seemingly-inserted-with-the-aid-of-a-dartboard pgstat_report_activity
> calls in replication/logical/worker.c.  I couldn't make heads or tails
> of what is going on there, so I did not try.

For this one, I don't actually agree.  For now the state and the query
string, actually the activity string, are completely independent.  So
external modules like bgworkers can mix the state enum and the
activity string as they wish.  I think that this flexibility is useful
to keep.

> BTW, while I didn't change it here, isn't the second
> SnapBuildClearExportedSnapshot call in exec_replication_command just
> useless?  We already did that before parsing the command.

Indeed.
--
Michael


signature.asc
Description: PGP signature


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-09-14 Thread Dilip Kumar
On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila  wrote:
>
> On Mon, Sep 14, 2020 at 3:08 AM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > Pushed.
> >
> > Observe the following reports:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus=2020-09-13%2016%3A54%3A03
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes=2020-09-10%2009%3A08%3A03
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis=2020-09-05%2020%3A22%3A02
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet=2020-09-04%2001%3A52%3A03
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet=2020-09-03%2020%3A54%3A04
> >
> > These are all on HEAD, and all within the last ten days, and I see
> > nothing comparable in any branch before that.  So it's hard to avoid
> > the conclusion that somebody broke something about ten days ago.
> >
> > None of these animals provided gdb backtraces; but we do have a built-in
> > trace from several, and they all look like pgoutput.so is trying to
> > list_free() garbage, somewhere inside a relcache invalidation/rebuild
> > scenario:
> >
>
> Yeah, this is right, and here is some initial analysis. It seems to be
> failing in below code:
> rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}
>
> This list can have elements only in 'streaming' mode (need to enable
> 'streaming' with Create Subscription command) whereas none of the
> tests in 010_truncate.pl is using 'streaming', so this list should be
> empty (NULL). The two different assertion failures shown in BF reports
> in list_free code are as below:
> Assert(list->length > 0);
> Assert(list->length <= list->max_length);
>
> It seems to me that this list is not initialized properly when it is
> not used or maybe that is true in some special circumstances because
> we initialize it in get_rel_sync_entry(). I am not sure if CCI build
> is impacting this in some way.


I have also analyzed this but did not find any reason why the streamed_txns
list should be anything other than NULL.  The only thing is that we initialize
entry->streamed_txns to NULL, and list_free() checks "if (list == NIL)" and
returns early.  However, IMHO that should not be an issue because NIL is
defined as (List *) NULL.  I am doing further testing and investigation.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



