Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-12-25 Thread Masahiko Sawada
On Tue, 24 Dec 2019 at 17:21, Amit Kapila  wrote:
>
> On Tue, Dec 24, 2019 at 11:17 AM Masahiko Sawada
>  wrote:
> >
> > On Fri, 20 Dec 2019 at 22:30, Amit Kapila  wrote:
> > >
> > >
> > > The main aim of this feature is to reduce apply lag.  Because if we
> > > send all the changes together it can delay their apply because of
> > > network delay, whereas if most of the changes are already sent, then
> > > we will save the effort on sending the entire data at commit time.
> > > This in itself gives us decent benefits.  Sure, we can further improve
> > > it by having separate workers (dedicated to apply the changes) as you
> > > are suggesting and in fact, there is a patch for that as well(see the
> > > performance results and bgworker patch at [1]), but if try to shove in
> > > all the things in one go, then it will be difficult to get this patch
> > > committed (there are already enough things and the patch is quite big
> > > that to get it right takes a lot of energy).  So, the plan is
> > > something like that first we get the basic feature and then try to
> > > improve by having dedicated workers or things like that.  Does this
> > > make sense to you?
> > >
> >
> > Thank you for explanation. The plan makes sense. But I think in the
> > current design it's a problem that logical replication worker doesn't
> > receive changes (and doesn't check interrupts) during applying
> > committed changes even if we don't have a worker dedicated for
> > applying. I think the worker should continue to receive changes and
> > save them to temporary files even during applying changes.
> >
>
> Won't it defeat the purpose of this feature, which is to reduce the apply
> lag?  Basically, it can so happen that while applying commit, it
> constantly gets changes of other transactions which will delay the
> apply of the current transaction.

You're right. But it seems to me that this optimizes the apply lag only for
a transaction that made many changes. On the other hand, while such a large
transaction is being applied, the application of subsequent changes is
delayed.

>  Also, won't it create some further
> work to identify the order of commits?  Say while applying commit-1,
> it receives 5 other commits that are written to separate temporary
> files.  How will we later identify which transaction's WAL we need to
> apply first?  We might deduce by LSN's, but I think that could be
> tricky.  Another thing is that I think it could lead to some design
> complications as well because while applying commit, you need some
> sort of callback or something like that to receive and flush totally
> unrelated changes.  It could lead to another kind of failure mode
> wherein while applying commit if it tries to receive another
> transaction data and some failure happens while writing the data of
> that transaction.  I am not sure if it is a good idea to try something
> like that.

It's just an idea, but we might want to have new workers dedicated to applying
changes first, and then add the streaming option later. That way we can reduce
the flush lag depending on the use case. The commit order can be determined by
the receiver and shared with the applier in shared memory. Once we have
separated the workers, the streaming option can be introduced without such a
downside.

>
> > Otherwise
> > the buffer would be easily full and replication gets stuck.
> >
>
> Are you telling about network buffer?

Yes.

>   I think the best way as
> discussed is to launch new workers for streamed transactions, but we
> can do that as an additional feature. Anyway, as proposed, users can
> choose the streaming mode for subscriptions, so there is an option to
> turn this selectively.

Yes. But a user who wants to use this feature would want to replicate many
changes, and I guess the side effect is quite big. I think that at least we
need to make logical replication tolerate such a situation.

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




table partitioning and access privileges

2019-12-25 Thread Fujii Masao
Hi,

My customer reported to me that queries through a partitioned table ignore
each partition's SELECT, INSERT, UPDATE, and DELETE privileges; on the other
hand, only the TRUNCATE privilege specified for each partition is applied.
I'm not sure whether this behavior is expected or not. But anyway, wouldn't
it be better to document that? For example,

Access privileges may be defined and removed separately for each partition.
But note that queries through a partitioned table ignore each partition's
SELECT, INSERT, UPDATE and DELETE privileges, and apply only the TRUNCATE one.
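
For concreteness, a minimal sketch (hypothetical table and role names, run as
a superuser) of the behavior described above:

  CREATE TABLE measurement (logdate date) PARTITION BY RANGE (logdate);
  CREATE TABLE measurement_2019 PARTITION OF measurement
      FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
  CREATE ROLE app_user;
  GRANT SELECT ON measurement TO app_user;  -- no grant on the partition itself

  SET ROLE app_user;
  SELECT * FROM measurement;       -- allowed: only the parent's privileges are checked
  SELECT * FROM measurement_2019;  -- fails: permission denied for the partition
  RESET ROLE;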

Regards,

-- 
Fujii Masao




Re: table partition and column default

2019-12-25 Thread Fujii Masao
On Wed, Dec 25, 2019 at 5:47 PM Amit Langote  wrote:
>
> On Wed, Dec 25, 2019 at 5:40 PM Fujii Masao  wrote:
> > On Wed, Dec 25, 2019 at 1:56 PM Amit Langote  
> > wrote:
> > > IIRC, there was some discussion about implementing a feature whereby
> > > a partition's default will be used for an attribute if it's null even after
> > > considering the parent table's default, that is, when no default value
> > > is defined in the parent.  The details are at toward the end of this
> > > thread:
> > >
> > > https://www.postgresql.org/message-id/flat/578398af46350effe7111895a4856b87b02e000e.camel%402ndquadrant.com
> >
> > Thanks for pointing out that thread!
> >
> > As you mentioned in that thread, I also think that this current
> > behavior (maybe restriction) should be documented.
> > What about adding the note like "a partition's default value is
> > not applied when inserting a tuple through a partitioned table."
> > into the document?
>
> Agreed.
>
> > Patch attached.
>
> Thanks for creating the patch, looks good to me.

Thanks for reviewing the patch. Committed!
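
For reference, a minimal sketch (hypothetical table names) of the behavior
that the committed note documents:

  CREATE TABLE parent (a int, b int) PARTITION BY LIST (a);
  CREATE TABLE part1 PARTITION OF parent (b DEFAULT 42) FOR VALUES IN (1);

  INSERT INTO part1 (a) VALUES (1);   -- b gets the partition's default, 42
  INSERT INTO parent (a) VALUES (1);  -- routed to part1, but b stays NULL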

Regards,

-- 
Fujii Masao




Re: unsupportable composite type partition keys

2019-12-25 Thread Amit Langote
On Thu, Dec 26, 2019 at 3:21 AM Tom Lane  wrote:
> I wrote:
> > Amit Langote  writes:
> >> On Tue, Dec 24, 2019 at 10:59 AM Amit Langote  
> >> wrote:
> >>> Btw, does the memory leakage fix in this patch address any of the
> >>> pending concerns that were discussed on the "hyrax vs.
> >>> RelationBuildPartitionDesc" thread earlier this year[1]?
> >>> [1] 
> >>> https://www.postgresql.org/message-id/flat/3800.1560366716%40sss.pgh.pa.us#092b6b4f6bf75d2f3f90ef6a3b3eab5b
>
> >> I thought about this a little and I think it *does* address the main
> >> complaint in the above thread.
>
> It occurred to me to also recheck the original complaint in that thread,
> which was poor behavior in CLOBBER_CACHE_ALWAYS builds.

Thanks for taking the time to do that.

>  I didn't have
> the patience to run a full CCA test, but I did run update.sql, which
> we previously established was sufficient to show the problem.  There's
> no apparent memory bloat, either with HEAD or with the patch.  I also
> see the runtime (for update.sql on its own) dropping from about
> 474 sec in HEAD to 457 sec with the patch.  So that indicates that we're
> actually saving a noticeable amount of work, not just postponing it,
> at least under CCA scenarios where relcache entries get flushed a lot.

Yeah, as long as nothing in between those flushes needs to look at the
partition descriptor.

> I also tried to measure update.sql's runtime in a regular debug build
> (not CCA).  I get pretty repeatable results of 279ms on HEAD vs 273ms
> with patch, or about a 2% overall savings.  That's at the very limit of
> what I'd consider a reproducible difference, but still it seems to be
> real.  So that seems like evidence that forcing the partition data to be
> loaded immediately rather than on-demand is a loser from a performance
> standpoint as well as the recursion concerns that prompted this patch.

Agreed.

> Which naturally leads one to wonder whether forcing other relcache
> substructures (triggers, rules, etc) to be loaded immediately isn't
> a loser as well.  I'm still feeling like we're overdue to redesign how
> all of this works and come up with a more uniform, less fragile/ad-hoc
> approach.  But I don't have the time or interest to do that right now.

I suppose that if on-demand loading of partition descriptors can result in
up to 2% savings, we can perhaps expect slightly more by doing the same for
other substructures.  Also, the more uniformly the different substructures
are accessed, the better.

> Anyway, I've run out of reasons not to commit this patch, so I'll
> go do that.

Thank you.  I noticed that there are comments suggesting that certain
RelationData members are to be accessed using their RelationGet*
functions, but partitioning members do not have such comments.  How
about the attached?

Regards,
Amit
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 2239f791e8..5ce9d8a086 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -95,10 +95,15 @@ typedef struct RelationData
 	List	   *rd_fkeylist;	/* list of ForeignKeyCacheInfo (see below) */
 	bool		rd_fkeyvalid;	/* true if list has been computed */
 
+	/* data managed by RelationGetPartitionKey: */
 	PartitionKey rd_partkey;	/* partition key, or NULL */
 	MemoryContext rd_partkeycxt;	/* private context for rd_partkey, if any */
+
+	/* data managed by RelationGetPartitionDesc: */
 	PartitionDesc rd_partdesc;	/* partition descriptor, or NULL */
 	MemoryContext rd_pdcxt;		/* private context for rd_partdesc, if any */
+
+	/* data managed by RelationGetPartitionQual: */
 	List	   *rd_partcheck;	/* partition CHECK quals */
 	bool		rd_partcheckvalid;	/* true if list has been computed */
 	MemoryContext rd_partcheckcxt;	/* private cxt for rd_partcheck, if any */


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-25 Thread Kyotaro Horiguchi
At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> > - Reverted most post-v24nm changes to swap_relation_files().  Under
> >   "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> >   rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right 
> > if
> >   we're running CLUSTER for the second time in one transaction.  I used
> 
> I don't agree with that. As I think I have mentioned upthread, rel2 is
> wrongly marked as "new in this transaction" at that time, which hinders
> the opportunity of removal, and such entries wrongly persist for the
> backend's life and cause problems. (That was found by abort-time
> AssertPendingSyncs_RelationCache()..)

I played with the new version for a while and I don't see such a
problem. I don't recall clearly what I saw when I thought there was a
problem, but I have changed my mind and agree with that. It's more
reasonable and clearer as long as it works correctly.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-25 Thread Noah Misch
On Thu, Dec 26, 2019 at 12:46:39PM +0900, Kyotaro Horiguchi wrote:
> At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch  wrote in 
> >   Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did.  Making
> >   that work no matter what does ereport(ERROR) would be tricky and 
> > low-value.
> 
> Right about ereport, but I'm not sure about removing the whole assertion at abort.

You may think of a useful assert location that lacks the problems of asserting
at abort.  For example, I considered asserting in PortalRunMulti() and
PortalRun(), just after each command, if still in a transaction.

> > - Reverted most post-v24nm changes to swap_relation_files().  Under
> >   "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> >   rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right 
> > if
> >   we're running CLUSTER for the second time in one transaction.  I used
> 
> I don't agree with that. As I think I have mentioned upthread, rel2 is
> wrongly marked as "new in this transaction" at that time, which hinders
> the opportunity of removal, and such entries wrongly persist for the
> backend's life and cause problems. (That was found by abort-time
> AssertPendingSyncs_RelationCache()..)

I can't reproduce rel2's relcache entry wrongly persisting for the life of a
backend.  If that were happening, I would expect repeating a CLUSTER command N
times to increase hash_get_num_entries(RelationIdCache) by at least N.  I
tried that, but hash_get_num_entries(RelationIdCache) did not increase.  In a
non-assert build, how can I reproduce problems caused by incorrect
rd_createSubid on rel2?




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-25 Thread Kyotaro Horiguchi
Thank you for the findings.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch  wrote in 
> By improving AssertPendingSyncs_RelationCache() and by testing with
> -DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
> Would you fix these?

I'd like to do that; please give me some time.

> === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
> 
> A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
> when running "make check" under wal_level=minimal.  I test this way:
> 
> printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' 
> >$PWD/minimal.conf
> make check TEMP_CONFIG=$PWD/minimal.conf
> 
> Self-contained demonstration:
>   begin;
>   create table t (c int);
>   savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
>   commit;  -- assertion failure
> 
> 
> === Defect 2: Forgets to skip WAL due to oversimplification in heap_create()
> 
> In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
> to transfer WAL-skipped state to the new index relation.  Before v24nm, the
> new index relation skipped WAL unconditionally.  Since v24nm, the new index
> relation never skips WAL.  I've added a test to alter_table.sql that reveals
> this problem under wal_level=minimal.
> 
> 
> === Defect 3: storage.c checks size decrease of MAIN_FORKNUM only
> 
> storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated.  Is it
> possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
> net size decrease?  I haven't tested, but this sequence seems possible:
> 
>   TRUNCATE
> reduces MAIN_FORKNUM from 100 blocks to 0 blocks
> reduces FSM_FORKNUM from 3 blocks to 0 blocks
>   COPY
> raises MAIN_FORKNUM from 0 blocks to 110 blocks
> does not change FSM_FORKNUM
>   COMMIT
> should fsync, but wrongly chooses log_newpage_range() approach
> 
> If that's indeed a problem, beside the obvious option of tracking every fork's
> max_truncated, we could convert max_truncated to a bool and use fsync anytime
> the relation experienced an mdtruncate().  (While FSM_FORKNUM is not critical
> for database operations, the choice to subject it to checksums entails
> protecting it here.)  If that's not a problem, would you explain?
> 


> === Non-defect notes
> 
> Once you have a correct patch, would you run check-world with
> -DCLOBBER_CACHE_ALWAYS?  That may reveal additional defects.  It may take a
> day or more, but that's fine.

Sure.

> The new smgrimmedsync() calls are potentially fragile, because they sometimes
> target a file of a dropped relation.  However, the mdexists() test prevents
> anything bad from happening.  No change is necessary.  Example:
> 
>   SET wal_skip_threshold = 0;
>   BEGIN;
>   SAVEPOINT q;
>   CREATE TABLE t (c) AS SELECT 1;
>   ROLLBACK TO q;  -- truncates the relfilenode
>   CHECKPOINT;  -- unlinks the relfilenode
>   COMMIT;  -- calls mdexists() on the relfilenode
> 
> 
> === Notable changes in v30nm
> 
> - Changed "wal_skip_threshold * 1024" to an expression that can't overflow.
>   Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0.

Ahh, I wrongly understood that MAX_KILOBYTES inhibits that
setting. work_mem and maintenance_work_mem are cast to double or
long before calculation. In this case it's enough that the calculation
unit becomes kilobytes instead of bytes.

> - Changed AssertPendingSyncs_RelationCache() to open all relations on which
>   the transaction holds locks.  This permits detection of cases where
>   RelationNeedsWAL() returns true but storage.c will sync the relation.
>
>   Removed the assertions from RelationIdGetRelation().  Using
>   "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't
>   actually problematic, since invalidation updates rd_node while other code
>   updates rd_firstRelfilenodeSubid.  This is not a significant loss, now that
>   AssertPendingSyncs_RelationCache() opens relations.  (I considered making
>   the update of rd_firstRelfilenodeSubid more like rd_node, where we store it
>   somewhere until the next CommandCounterIncrement(), which would make it
>   actually affect RelationNeedsWAL().  That might have been better in general,
>   but it felt complex without clear benefits.)
> 
>   Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did.  Making
>   that work no matter what does ereport(ERROR) would be tricky and low-value.

Right about ereport, but I'm not sure about removing the whole assertion at abort.

> - Extracted the RelationTruncate() changes into new function
>   RelationPreTruncate(), so table access methods that can't use
>   RelationTruncate() have another way to request that work.

Sounds reasonable. Also the new behavior of max_truncated looks fine.

> - Changed wal_skip_threshold default to 2MB.  My second preference was for
>   4MB.  In your data, 2MB and 4MB had similar performance at optimal
>   wal_buffers, but 4MB performed worse at low 

Re: Implementing Incremental View Maintenance

2019-12-25 Thread Yugo Nagata
On Tue, 24 Dec 2019 07:07:35 +
"tsunakawa.ta...@fujitsu.com"  wrote:

> From: Yugo Nagata 
> > On Mon, 23 Dec 2019 08:08:53 +
> > "tsunakawa.ta...@fujitsu.com"  wrote:
> > > How about unlogged tables ?  I thought the point of using a temp table is 
> > > to
> > avoid WAL overhead.
> > 
> > Hmm... this might be another option. However, if we use unlogged tables,
> > we will need to create them in a special schema similar to pg_toast
> > to split this from user tables. Otherwise, we need to create and drop
> > unlogged tables repeatedly for each session.
> 
> Maybe we can create the work tables in the same schema as the materialized 
> view, following:
> 
> * Prefix the table name to indicate that the table is system-managed, thus 
> alluding to the user that manually deleting the table would break something.  
> This is like the system attribute __imv_count you are proposing.
> 
> * Describe the above in the manual.  Columns of serial and bigserial data 
> type similarly create sequences behind the scenes.
> 
> * Make the work tables depend on the materialized view by recording the 
> dependency in pg_depend, so that Dropping the materialized view will also 
> drop its work tables.

Maybe it works, but instead of using special names for work tables, we can
also create a schema whose name is special and place the work tables in it.
This will not annoy users with information they are not interested in when,
for example, psql meta-commands like \d are used.

Anyway, I understood that it is better to avoid creating and dropping
temporary tables per statement during view maintenance.
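
For illustration, a minimal sketch of the special-schema idea (all names are
hypothetical; a real implementation would presumably create a reserved,
system-managed schema internally, similar to pg_toast):

  -- dedicated schema keeping work tables out of users' sight (e.g. \d output)
  CREATE SCHEMA ivm_work;
  -- persistent unlogged work table, avoiding both WAL overhead and
  -- per-statement creation and removal
  CREATE UNLOGGED TABLE ivm_work.mv_orders_delta (LIKE public.orders);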

-- 
Yugo Nagata 




Re: Implementing Incremental View Maintenance

2019-12-25 Thread Takuma Hoshiai
Hi, 

Attached is the latest patch (v11) to add support for Incremental Materialized 
View Maintenance (IVM).

Differences from the previous patch (v10) include:
- Prohibit creating matviews including mutable functions

Matviews including mutable functions (for example now(), random(), etc.)
could result in data inconsistent with the base tables.
This patch adds a check for whether the requested matview definition includes
SELECTs using mutable functions. If so, an error is raised while creating the
matview.

This issue was reported by nuko-san.
https://www.postgresql.org/message-id/caf3gu1z950hqqjzwanbeg7pmuxlc+7uzmstfnlezm9iqdwe...@mail.gmail.com


Currently other IVM's support status is:

> IVM is a way to make materialized views up-to-date in which only
> incremental changes are computed and applied on views rather than
> recomputing the contents from scratch as REFRESH MATERIALIZED VIEW
> does. IVM can update materialized views more efficiently
> than recomputation when only small part of the view need updates.
> 
> There are two approaches with regard to timing of view maintenance:
> immediate and deferred. In immediate maintenance, views are updated in
> the same transaction where its base table is modified. In deferred
> maintenance, views are updated after the transaction is committed,
> for example, when the view is accessed, as a response to user command
> like REFRESH, or periodically in background, and so on. 
> 
> This patch implements a kind of immediate maintenance, in which
> materialized views are updated immediately in AFTER triggers when a
> base table is modified.
> 
> This supports views using:
>  - inner and outer joins including self-join
>  - some built-in aggregate functions (count, sum, avg, min, max)
>  - a part of subqueries
>-- simple subqueries in FROM clause
>-- EXISTS subqueries in WHERE clause
>  - DISTINCT and views with tuple duplicates
> 
> ===
> Here are major changes we made after the previous submitted patch:
> 
> * Aggregate functions are checked if they can be used in IVM 
>   using their OID. Per comments from Alvaro Herrera.
> 
>   For this purpose, Gen_fmgrtab.pl was modified so that OIDs of
>   aggregate functions are output to fmgroids.h.
> 
> * Some bug fixes including:
> 
>  - Mistake of tab-completion of psql pointed out by nuko-san
>  - A bug relating rename of matview pointed out by nuko-san
>  - spelling errors
>  - etc.
> 
> * Add documentations for IVM
> 
> * Patch is split into eleven parts to make review easier
>   as suggested by Amit Langote:
> 
>  - 0001: Add a new syntax:
>  CREATE INCREMENTAL MATERIALIZED VIEW
>  - 0002: Add a new column relisivm to pg_class
>  - 0003: Change trigger.c to allow prolonging the life span of tuplestores
>  containing Transition Tables generated via AFTER trigger
>  - 0004: Add the basic IVM feature using the counting algorithm:
>  This supports inner joins, DISTINCT, and tuple duplicates.
>  - 0005: Change GEN_fmgrtab.pl to output aggregate function's OIDs
>  - 0006: Add aggregates support for IVM
>  - 0007: Add subqueries support for IVM
>  - 0008: Add outer joins support for IVM
>  - 0009: Add IVM support to psql command
>  - 0010: Add regression tests for IVM
>  - 0011: Add documentations for IVM
> 
> ===
> Todo:
> 
> Currently, REFRESH and pg_dump/pg_restore is not supported, but
> we are working on them.
> 
> Also, TRUNCATE is not supported. When TRUNCATE command is executed
> on a base table, nothing occurs on materialized views. We are
> now considering another better options, like:
> 
> - Raise an error or warning when a base table is TRUNCATEed.
> - Make the view non-scannable (like WITH NO DATA)
> - Update the view in some ways. It would be easy for inner joins
>   or aggregate views, but there is some difficulty with outer joins.

Best Regards,

-- 
Takuma Hoshiai 


IVM_patches_v11.tar.gz
Description: Binary data


Re: Avoid full GIN index scan when possible

2019-12-25 Thread Alexander Korotkov
On Wed, Dec 25, 2019 at 8:25 AM Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> Patch requires further polishing including comments, minor refactoring
> etc.  I'm going to continue work on this.
>

I also run the same performance comparison as Nikita [1] on my laptop.
The results are shown below.  PostgreSQL was built with -O2 and
asserts enabled.

                                       | Query time, ms |
WHERE condition                        | master | patch |
---------------------------------------+--------+-------+
 a @> '{}'                             |    117 |   116 |
 a @> '{}' and b @> '{}'               |    150 |   146 |
 a @> '{}' and b @> '{}' and c @> '{}' |    168 |   167 |
 a @> '{}' and a @@ '1'                |    126 |   0.6 |
 a @> '{}' and a @@ '-1'               |    128 |   3.2 |
 a @@ '!-1' and a @@ '1'               |    127 |   0.7 |
 a @@ '!1' and a @@ '-1'               |    122 |   4.4 |

The performance effect looks similar to patch #4 by Nikita.  I tried to
add patch #4 to the comparison, but I caught an assertion failure.

TRAP: FailedAssertion("key->includeNonMatching", File: "ginget.c", Line: 1340)

I'm going to continue polishing my version of patch.

Links
1.
https://www.postgresql.org/message-id/f2889144-db1d-e3b2-db97-cfc8794cda43%40postgrespro.ru

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: unsupportable composite type partition keys

2019-12-25 Thread Amit Langote
On Thu, Dec 26, 2019 at 5:45 AM Tom Lane  wrote:
> Amit Langote  writes:
> >> On Sun, Dec 22, 2019 at 6:13 AM Tom Lane  wrote:
> >>> I wonder whether we couldn't also lift
> >>> the restriction against whole-row Vars in partition expressions.
> >>> Doesn't seem like there is much difference between such a Var and
> >>> a row(...)::table_rowtype expression.
>
> > I gave that a try and ended up with attached that applies on top of
> > your delay-loading-relcache-partition-data-2.patch.
>
> Pushed with minor fixes.

Thank you.

Regards,
Amit




Re: Implementing Incremental View Maintenance

2019-12-25 Thread Tatsuo Ishii
>> Another use case is a ticket selling system. The system shows how many
>> tickets remain in a real time manner. For this purpose it needs to
>> count the number of tickets already sold from a log table. By using
>> IVM, it could be accomplished in simple and effective way.
> 
> Wouldn't the app just have a table like ticket(id, name, quantity), decrement 
> the quantity when the ticket is sold, and read the current quantity to know 
> the remaining tickets?  If many consumers try to buy tickets for a popular 
> event, the materialized view refresh would limit the concurrency.

Yes, as long as the number of sold tickets is the only important data for
the system, that could be true. However, suppose the system wants to start a
sort of "campaign" and needs to collect statistics of counts depending on the
city each ticket buyer belongs to, so that a certain offer is limited to the
first 100 ticket buyers in each city. In this case IVM would give a more
flexible way to handle this kind of requirement than having an ad hoc
city-counts column in a table.
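
For illustration, a rough sketch of that per-city case using the syntax
proposed in this thread (table and column names are hypothetical):

  CREATE TABLE ticket_log (ticket_id int, buyer_city text, sold_at timestamptz);

  CREATE INCREMENTAL MATERIALIZED VIEW sold_per_city AS
      SELECT buyer_city, count(*) AS n_sold
      FROM ticket_log
      GROUP BY buyer_city;

  -- each INSERT into ticket_log immediately updates sold_per_city via the
  -- AFTER triggers, so the campaign logic can check n_sold for the buyer's city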

> I think we need to find a typical example of this.  That should be useful to 
> write the manual article, because it's better to caution users that the IMV 
> is a good fit for this case and not for that case.  Using real-world table 
> names in the syntax example will also be good.

In general I agree. I'd try to collect good real-world examples by
myself but my experience is limited. I hope people in this community
come up with such examples.

> "outweigh" means "exceed."  I meant that I'm wondering if and why users 
> prefer ON STATEMENT's benefit despite of its additional overhead on update 
> statements.

I already found at least one such user upthread, if I'm not
missing something.

> Bottom line: The use of triggers makes me hesitate, because I saw someone's 
> (probably Fujii san) article that INSERTs into inheritance-and-trigger-based 
> partitioned tables were 10 times slower than the declaration-based 
> partitioned tables.  I think I will try to find a good use case.

Great. In the meantime we will try to mitigate the overheads of IVM
(triggers are just one of them).

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-25 Thread Noah Misch
By improving AssertPendingSyncs_RelationCache() and by testing with
-DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
Would you fix these?


=== Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO

A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
when running "make check" under wal_level=minimal.  I test this way:

printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
make check TEMP_CONFIG=$PWD/minimal.conf

Self-contained demonstration:
  begin;
  create table t (c int);
  savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
  commit;  -- assertion failure


=== Defect 2: Forgets to skip WAL due to oversimplification in heap_create()

In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
to transfer WAL-skipped state to the new index relation.  Before v24nm, the
new index relation skipped WAL unconditionally.  Since v24nm, the new index
relation never skips WAL.  I've added a test to alter_table.sql that reveals
this problem under wal_level=minimal.


=== Defect 3: storage.c checks size decrease of MAIN_FORKNUM only

storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated.  Is it
possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
net size decrease?  I haven't tested, but this sequence seems possible:

  TRUNCATE
reduces MAIN_FORKNUM from 100 blocks to 0 blocks
reduces FSM_FORKNUM from 3 blocks to 0 blocks
  COPY
raises MAIN_FORKNUM from 0 blocks to 110 blocks
does not change FSM_FORKNUM
  COMMIT
should fsync, but wrongly chooses log_newpage_range() approach

If that's indeed a problem, beside the obvious option of tracking every fork's
max_truncated, we could convert max_truncated to a bool and use fsync anytime
the relation experienced an mdtruncate().  (While FSM_FORKNUM is not critical
for database operations, the choice to subject it to checksums entails
protecting it here.)  If that's not a problem, would you explain?


=== Non-defect notes

Once you have a correct patch, would you run check-world with
-DCLOBBER_CACHE_ALWAYS?  That may reveal additional defects.  It may take a
day or more, but that's fine.

The new smgrimmedsync() calls are potentially fragile, because they sometimes
target a file of a dropped relation.  However, the mdexists() test prevents
anything bad from happening.  No change is necessary.  Example:

  SET wal_skip_threshold = 0;
  BEGIN;
  SAVEPOINT q;
  CREATE TABLE t (c) AS SELECT 1;
  ROLLBACK TO q;  -- truncates the relfilenode
  CHECKPOINT;  -- unlinks the relfilenode
  COMMIT;  -- calls mdexists() on the relfilenode


=== Notable changes in v30nm

- Changed "wal_skip_threshold * 1024" to an expression that can't overflow.
  Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0.

- Changed AssertPendingSyncs_RelationCache() to open all relations on which
  the transaction holds locks.  This permits detection of cases where
  RelationNeedsWAL() returns true but storage.c will sync the relation.

  Removed the assertions from RelationIdGetRelation().  Using
  "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't
  actually problematic, since invalidation updates rd_node while other code
  updates rd_firstRelfilenodeSubid.  This is not a significant loss, now that
  AssertPendingSyncs_RelationCache() opens relations.  (I considered making
  the update of rd_firstRelfilenodeSubid more like rd_node, where we store it
  somewhere until the next CommandCounterIncrement(), which would make it
  actually affect RelationNeedsWAL().  That might have been better in general,
  but it felt complex without clear benefits.)

  Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did.  Making
  that work no matter what does ereport(ERROR) would be tricky and low-value.

- Extracted the RelationTruncate() changes into new function
  RelationPreTruncate(), so table access methods that can't use
  RelationTruncate() have another way to request that work.

- Changed wal_skip_threshold default to 2MB.  My second preference was for
  4MB.  In your data, 2MB and 4MB had similar performance at optimal
  wal_buffers, but 4MB performed worse at low wal_buffers.

- Reverted most post-v24nm changes to swap_relation_files().  Under
  "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
  rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right if
  we're running CLUSTER for the second time in one transaction.  I used
  relation_open(r1, NoLock) instead of AccessShareLock, because we deserve an
  assertion failure if we hold no lock at that point.

- Change toast_get_valid_index() to retain locks until end of transaction.
  When I adopted relation_open(r1, NoLock) in swap_relation_files(), that
  revealed that we retain no lock on the TOAST index.

- Ran pgindent and perltidy.  Updated some comments and names.

On Mon, Dec 09, 2019 at 

Re: proposal: schema variables

2019-12-25 Thread Pavel Stehule
n dumping the database (Issue 3). The script
> generated by pg_dump includes the CREATE EXTENSION statement as expected
> but also a redundant CREATE VARIABLE statement for the variable that
> belongs to the extension. As a result, one of course gets an error at
> restore time.
>

should be fixed now


> G) Row Level Security
>
> I did a test activating RLS on a table and creating a POLICY that
> references a schema variable in its USING and WITH CHECK clauses.
> Everything worked fine.
>
> H) psql
>
> A \dV meta-command displays all the created variables.
> I would change a little bit the provided view. More precisely I would:
> - rename "Constraint" into "Is nullable" and report it as a boolean
> - rename "Special behave" into "Is transactional" and report it as a
> boolean
> - change the order of columns so to have:
> Schema | Name | Type | Is nullable | Default | Owner | Is transactional |
> Transaction end action
> "Is nullable" being aside "Default"
>

I implemented your proposal


> I) Performance
>
> I just quickly looked at the performance, and didn't notice any issue.
>
> About variables read performance, I have noticed that:
>   select sum(1) from generate_series(1,1000);
> and
>   select sum(sv1) from generate_series(1,1000);
> have similar response times.
>
> About planning, a condition with a variable used as a constant is
> indexable, as if it were a literal.
>
> J) Documentation
>
> There are some wordings to improve in the documentation. But I am not the
> best person to give advice about english language ;-).
>
> However, aside the already mentionned lack of changes in the ALTER DEFAULT
> PRIVILEGES chapter, I also noticed :
> - line 50 of the patch, the sentence "(hidden attribute; must be
> explicitly selected)" looks false as the oid column of pg_variable is
> displayed, as for other tables of the catalog;
> - at several places, the word "behave" should be replaced by "behaviour"
> - line 433, a get_schema_variable() function is mentionned; is it a
> function that can really be called by users ?
>

should be fixed


> May be it would be interesting to also add a chapter in the Section V of
> the documentation, in order to more globally present the schema variables
> concept, aside the new or the modified statements.
>

We can finalize the documentation a little bit later, when it becomes clear
what related functionality is implemented.

updated patch attached





> K) Coding
>
> I am not able to appreciate the way the feature has been coded. So I let
> this for other reviewers ;-)
>
>
> To conclude, again, thanks a lot for this feature !
> And if I may add this. I dream of an additional feature: adding a SHARED
> clause to the CREATE VARIABLE statement in order to be able to create
> memory spaces that could be shared by all connections on the database and
> accessible in SQL and PL, under the protection of ACL. But that's another
> story ;-)
>
> Best regards. Philippe.
>

Thank you very much for review

>
> The new status of this patch is: Waiting on Author
>


schema-variables-20191225.patch.gz
Description: application/gzip


Re: unsupportable composite type partition keys

2019-12-25 Thread Tom Lane
Amit Langote  writes:
>> On Sun, Dec 22, 2019 at 6:13 AM Tom Lane  wrote:
>>> I wonder whether we couldn't also lift
>>> the restriction against whole-row Vars in partition expressions.
>>> Doesn't seem like there is much difference between such a Var and
>>> a row(...)::table_rowtype expression.

> I gave that a try and ended up with attached that applies on top of
> your delay-loading-relcache-partition-data-2.patch.

Pushed with minor fixes.

regards, tom lane




Re: Expose lock group leader pid in pg_stat_activity

2019-12-25 Thread Julien Rouhaud
On Wed, Dec 25, 2019 at 7:03 PM Julien Rouhaud  wrote:
>
> Guillaume (in Cc) recently pointed out [1] that it's currently not
> possible to retrieve the list of parallel workers for a given backend
> at the SQL level.  His use case was to develop a function in plpgsql
> to sample a given query wait event, but it's not hard to imagine other
> useful use cases for this information, for instance doing some
> analysis on the average number of workers per parallel query, or ratio
> of parallel queries.  IIUC parallel queries are for now the only user
> of lock groups, so this should work just fine.
>
> I'm attaching a trivial patch to expose the group leader pid if any
> in pg_stat_activity.  Quick example of usage:
>
> =# SELECT query, leader_pid,
>   array_agg(pid) filter(WHERE leader_pid != pid) AS members
> FROM pg_stat_activity
> WHERE leader_pid IS NOT NULL
> GROUP BY query, leader_pid;
>query   | leader_pid |members
> ---++---
>  select * from t1; |  28701 | {28728,28732}
> (1 row)
>
>
> [1] https://twitter.com/g_lelarge/status/1209486212190343168

And I just realized that I forgot to update rule.out, sorry about
that.  v2 attached.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dcb58115af..af0a04cfb1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -614,6 +614,11 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  integer
  Process ID of this backend
 
+
+ pid
+ integer
+ Process ID of the lock group leader, if any
+
 
  usesysid
  oid
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f7800f01a6..0c3eb08028 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -734,6 +734,7 @@ CREATE VIEW pg_stat_activity AS
 S.datid AS datid,
 D.datname AS datname,
 S.pid,
+S.leader_pid,
 S.usesysid,
 U.rolname AS usename,
 S.application_name,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index d9f78221aa..323eb89c86 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -545,7 +545,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_activity(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_ACTIVITY_COLS	29
+#define PG_STAT_GET_ACTIVITY_COLS	30
 	int			num_backends = pgstat_fetch_stat_numbackends();
 	int			curr_backend;
 	int			pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
@@ -684,6 +684,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			values[5] = CStringGetTextDatum(clipped_activity);
 			pfree(clipped_activity);
 
+			nulls[29] = true;
 			proc = BackendPidGetProc(beentry->st_procpid);
 			if (proc != NULL)
 			{
@@ -693,6 +694,12 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 wait_event_type = pgstat_get_wait_event_type(raw_wait_event);
 wait_event = pgstat_get_wait_event(raw_wait_event);
 
+if (proc->lockGroupLeader)
+{
+	values[29] = Int32GetDatum(proc->lockGroupLeader->pid);
+	nulls[29] = false;
+}
+
 			}
 			else if (beentry->st_backendType != B_BACKEND)
 			{
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ac8f64b219..6ff7322425 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5148,9 +5148,9 @@
   proname => 'pg_stat_get_activity', prorows => '100', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => 'int4',
-  proallargtypes => '{int4,oid,int4,oid,text,text,text,text,text,timestamptz,timestamptz,timestamptz,timestamptz,inet,text,int4,xid,xid,text,bool,text,text,int4,bool,text,numeric,text,bool,text,bool}',
-  proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,datid,pid,usesysid,application_name,state,query,wait_event_type,wait_event,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,backend_type,ssl,sslversion,sslcipher,sslbits,sslcompression,ssl_client_dn,ssl_client_serial,ssl_issuer_dn,gss_auth,gss_princ,gss_enc}',
+  proallargtypes => '{int4,oid,int4,oid,text,text,text,text,text,timestamptz,timestamptz,timestamptz,timestamptz,inet,text,int4,xid,xid,text,bool,text,text,int4,bool,text,numeric,text,bool,text,bool,int4}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,datid,pid,usesysid,application_name,state,query,wait_event_type,wait_event,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,backend_type,ssl,sslversion,sslcipher,sslbits,sslcompression,ssl_client_dn,ssl_client_serial,ssl_issuer_dn,gss_auth,gss_princ,gss_enc,leader_pid}',
   prosrc => 

Re: unsupportable composite type partition keys

2019-12-25 Thread Tom Lane
I wrote:
> Amit Langote  writes:
>> On Tue, Dec 24, 2019 at 10:59 AM Amit Langote  
>> wrote:
>>> Btw, does the memory leakage fix in this patch address any of the
>>> pending concerns that were discussed on the "hyrax vs.
>>> RelationBuildPartitionDesc" thread earlier this year[1]?
>>> [1] 
>>> https://www.postgresql.org/message-id/flat/3800.1560366716%40sss.pgh.pa.us#092b6b4f6bf75d2f3f90ef6a3b3eab5b

>> I thought about this a little and I think it *does* address the main
>> complaint in the above thread.

It occurred to me to also recheck the original complaint in that thread,
which was poor behavior in CLOBBER_CACHE_ALWAYS builds.  I didn't have
the patience to run a full CCA test, but I did run update.sql, which
we previously established was sufficient to show the problem.  There's
no apparent memory bloat, either with HEAD or with the patch.  I also
see the runtime (for update.sql on its own) dropping from about 
474 sec in HEAD to 457 sec with the patch.  So that indicates that we're
actually saving a noticeable amount of work, not just postponing it,
at least under CCA scenarios where relcache entries get flushed a lot.

I also tried to measure update.sql's runtime in a regular debug build
(not CCA).  I get pretty repeatable results of 279ms on HEAD vs 273ms
with patch, or about a 2% overall savings.  That's at the very limit of
what I'd consider a reproducible difference, but still it seems to be
real.  So that seems like evidence that forcing the partition data to be
loaded immediately rather than on-demand is a loser from a performance
standpoint as well as the recursion concerns that prompted this patch.

Which naturally leads one to wonder whether forcing other relcache
substructures (triggers, rules, etc) to be loaded immediately isn't
a loser as well.  I'm still feeling like we're overdue to redesign how
all of this works and come up with a more uniform, less fragile/ad-hoc
approach.  But I don't have the time or interest to do that right now.

Anyway, I've run out of reasons not to commit this patch, so I'll
go do that.

regards, tom lane




Expose lock group leader pid in pg_stat_activity

2019-12-25 Thread Julien Rouhaud
Hello,

Guillaume (in Cc) recently pointed out [1] that it's currently not
possible to retrieve the list of parallel workers for a given backend
at the SQL level.  His use case was to develop a function in plpgsql
to sample a given query wait event, but it's not hard to imagine other
useful use cases for this information, for instance doing some
analysis on the average number of workers per parallel query, or ratio
of parallel queries.  IIUC parallel queries are for now the only user
of lock groups, so this should work just fine.

I'm attaching a trivial patch to expose the group leader pid if any
in pg_stat_activity.  Quick example of usage:

=# SELECT query, leader_pid,
  array_agg(pid) filter(WHERE leader_pid != pid) AS members
FROM pg_stat_activity
WHERE leader_pid IS NOT NULL
GROUP BY query, leader_pid;
   query   | leader_pid |members
---++---
 select * from t1; |  28701 | {28728,28732}
(1 row)


[1] https://twitter.com/g_lelarge/status/1209486212190343168
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dcb58115af..af0a04cfb1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -614,6 +614,11 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  integer
  Process ID of this backend
 
+
+ pid
+ integer
+ Process ID of the lock group leader, if any
+
 
  usesysid
  oid
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f7800f01a6..0c3eb08028 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -734,6 +734,7 @@ CREATE VIEW pg_stat_activity AS
 S.datid AS datid,
 D.datname AS datname,
 S.pid,
+S.leader_pid,
 S.usesysid,
 U.rolname AS usename,
 S.application_name,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index d9f78221aa..323eb89c86 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -545,7 +545,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_activity(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_ACTIVITY_COLS	29
+#define PG_STAT_GET_ACTIVITY_COLS	30
 	int			num_backends = pgstat_fetch_stat_numbackends();
 	int			curr_backend;
 	int			pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
@@ -684,6 +684,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			values[5] = CStringGetTextDatum(clipped_activity);
 			pfree(clipped_activity);
 
+			nulls[29] = true;
 			proc = BackendPidGetProc(beentry->st_procpid);
 			if (proc != NULL)
 			{
@@ -693,6 +694,12 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 wait_event_type = pgstat_get_wait_event_type(raw_wait_event);
 wait_event = pgstat_get_wait_event(raw_wait_event);
 
+if (proc->lockGroupLeader)
+{
+	values[29] = Int32GetDatum(proc->lockGroupLeader->pid);
+	nulls[29] = false;
+}
+
 			}
 			else if (beentry->st_backendType != B_BACKEND)
 			{
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ac8f64b219..6ff7322425 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5148,9 +5148,9 @@
   proname => 'pg_stat_get_activity', prorows => '100', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => 'int4',
-  proallargtypes => '{int4,oid,int4,oid,text,text,text,text,text,timestamptz,timestamptz,timestamptz,timestamptz,inet,text,int4,xid,xid,text,bool,text,text,int4,bool,text,numeric,text,bool,text,bool}',
-  proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,datid,pid,usesysid,application_name,state,query,wait_event_type,wait_event,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,backend_type,ssl,sslversion,sslcipher,sslbits,sslcompression,ssl_client_dn,ssl_client_serial,ssl_issuer_dn,gss_auth,gss_princ,gss_enc}',
+  proallargtypes => '{int4,oid,int4,oid,text,text,text,text,text,timestamptz,timestamptz,timestamptz,timestamptz,inet,text,int4,xid,xid,text,bool,text,text,int4,bool,text,numeric,text,bool,text,bool,int4}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,datid,pid,usesysid,application_name,state,query,wait_event_type,wait_event,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,backend_type,ssl,sslversion,sslcipher,sslbits,sslcompression,ssl_client_dn,ssl_client_serial,ssl_issuer_dn,gss_auth,gss_princ,gss_enc,leader_pid}',
   prosrc => 'pg_stat_get_activity' },
 { oid => '3318',
   descr => 'statistics: information about progress of backends running maintenance command',


Re: Physical replication slot advance is not persistent

2019-12-25 Thread Alexey Kondratov

On 25.12.2019 16:51, Alexey Kondratov wrote:

On 25.12.2019 07:03, Kyotaro Horiguchi wrote:

As the result I think what is needed there is just checking if the
returned lsn is equal or larger than moveto. Doen't the following
change work?

-    if (XLogRecPtrIsInvalid(endlsn))
+    if (moveto <= endlsn)


Yep, it helps with physical replication slot persistence after 
advance, but the whole validation (moveto <= endlsn) does not make 
sense for me. The value of moveto should be >= than minlsn == 
confirmed_flush / restart_lsn, while endlsn == retlsn is also always 
initialized with confirmed_flush / restart_lsn. Thus, your condition 
seems to be true in any case, even if it was no-op one, which we were 
intended to catch.


I will recheck everything again and try to come up with something 
during this week.


If I get it correctly, then we already keep previous slot position in 
the minlsn, so we just have to compare endlsn with minlsn and treat 
endlsn <= minlsn as a no-op without slot state flushing.


Attached is a patch that does this, so it fixes the bug without 
affecting any user-facing behavior. A detailed comment section and DEBUG 
output are also added. What do you think now?


I have also forgotten to mention that all versions down to 11.0 should 
be affected by this bug.
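
For what it's worth, a minimal way to observe the problem (hypothetical slot
name):

  SELECT pg_create_physical_replication_slot('test_slot', true);
  SELECT pg_replication_slot_advance('test_slot', pg_current_wal_lsn());
  -- restart the server; without the fix the advanced position is lost,
  -- because the slot state was never flushed to disk
  SELECT slot_name, restart_lsn FROM pg_replication_slots;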



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

>From e08299ddf92abc3fb4e802e8b475097fa746c458 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Wed, 25 Dec 2019 20:12:42 +0300
Subject: [PATCH v2] Make physical replslot advance persistent

---
 src/backend/replication/slotfuncs.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6683fc3f9b..bc5c93b089 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -573,9 +573,17 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
 
-	/* Update the on disk state when lsn was updated. */
-	if (XLogRecPtrIsInvalid(endlsn))
+	/*
+	 * Update the on disk state when LSN was updated.  Here we rely on the facts
+	 * that: 1) minlsn is initialized with restart_lsn and confirmed_flush LSN for
+	 * physical and logical replication slot respectively, and 2) endlsn is set in
+	 * the same way by pg_*_replication_slot_advance, but after advance.  Thus,
+	 * endlsn <= minlsn is treated as a no-op.
+	 */
+	if (endlsn > minlsn)
 	{
+		elog(DEBUG1, "flushing replication slot '%s' state",
+			NameStr(MyReplicationSlot->data.name));
 		ReplicationSlotMarkDirty();
 		ReplicationSlotsComputeRequiredXmin(false);
 		ReplicationSlotsComputeRequiredLSN();

base-commit: 8ce3aa9b5914d1ac45ed3f9bc484f66b3c4850c7
-- 
2.17.1



Re: Disallow cancellation of waiting for synchronous replication

2019-12-25 Thread Maksim Milyutin

On 25.12.2019 14:27, Marco Slot wrote:




On Wed, Dec 25, 2019, 11:28 Maksim Milyutin > wrote:


But in this case locally committed data becomes visible to new incoming
transactions that is bad side-effect of this issue.


Your application should be prepared for that in any case.

At the point where synchronous replication waits, the commit has 
already been written to disk on the primary. If postgres 
restarts while waiting for replication then the write becomes 
immediately visible regardless of whether it was replicated.



Yes, this write is recovered after the instance starts. At first, I want to 
delegate this case to an external HA tool around PostgreSQL. It can handle 
stopping the instance and switching over to a sync replica, or start the 
current instance with connections from external users closed until all 
writes are replicated to sync replicas. Later, arguably, closing connections 
after the recovery process could be implemented inside the kernel.



I don't think throwing a PANIC actually prevents that and if it does 
it's coincidental.



PANIC brings down the instance and prevents clients from reading locally 
committed data. The HA tool takes further steps to close access to this 
data as described above.



That's far from ideal, but fixing it requires a much bigger change to 
streaming replication. The write should be replicated prior to commit 
on the primary, but applied after in a way where unapplied writes on 
the secondary can be overwritten/discarded if it turns out they did 
not commit on the primary.



Thanks for sharing your opinion about enhancing the synchronous commit 
protocol. My position is stated here [1]. I would like to see the positions 
of other members of the community.



1. 
https://www.postgresql.org/message-id/f3ffc220-e601-cc43-3784-f9bba66dc382%40gmail.com


--
Best regards,
Maksim Milyutin



Re: Add pg_file_sync() to adminpack

2019-12-25 Thread Julien Rouhaud
On Wed, Dec 25, 2019 at 2:01 PM Fujii Masao  wrote:
>
> Hi,
>
> I'd like to propose adding a pg_file_sync() function to contrib/adminpack.
> This function fsyncs the file or directory specified by its argument.
> IMO this is useful, for example, when you want to fsync a file that
> pg_file_write() writes out or that COPY TO exports data into,
> for durability. Thoughts?

+1, that seems like a useful wrapper.  Looking at existing functions,
I see that there's a pg_file_rename() in adminpack, but it doesn't use
durable_rename nor does it try to perform any fsync.  Same for
pg_file_unlink vs. durable_unlink.  It's probably worth fixing that at
the same time?




Re: Physical replication slot advance is not persistent

2019-12-25 Thread Alexey Kondratov

On 25.12.2019 07:03, Kyotaro Horiguchi wrote:

At Tue, 24 Dec 2019 20:12:32 +0300, Alexey Kondratov 
 wrote in

I dig into the code and it happens because of this if statement:

     /* Update the on disk state when lsn was updated. */
     if (XLogRecPtrIsInvalid(endlsn))
     {
         ReplicationSlotMarkDirty();
         ReplicationSlotsComputeRequiredXmin(false);
         ReplicationSlotsComputeRequiredLSN();
         ReplicationSlotSave();
     }

Yes, it seems just broken.


Attached is a small patch, which fixes this bug. I have tried to
stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))'
and now pg_logical_replication_slot_advance and
pg_physical_replication_slot_advance return InvalidXLogRecPtr if
no-op.

What do you think?

I think we shouldn't change the definition of
pg_*_replication_slot_advance since the result is user-facing.


Yes, that was my main concern too. OK.


The functions return an invalid value only when the slot had the
invalid value and failed to move the position. I think that happens
only for uninitialized slots.

Anyway what we should do there is dirtying the slot when the operation
can be assumed to have been succeeded.

As the result I think what is needed there is just checking if the
returned lsn is equal or larger than moveto. Doen't the following
change work?

-   if (XLogRecPtrIsInvalid(endlsn))
+   if (moveto <= endlsn)


Yep, it helps with physical replication slot persistence after advance, 
but the whole validation (moveto <= endlsn) does not make sense for me. 
The value of moveto should be >= than minlsn == confirmed_flush / 
restart_lsn, while endlsn == retlsn is also always initialized with 
confirmed_flush / restart_lsn. Thus, your condition seems to be true in 
any case, even if it was no-op one, which we were intended to catch.


Actually, if we do not want to change pg_*_replication_slot_advance, we 
can just add a straightforward validation that either confirmed_flush or 
restart_lsn has changed after the slot advance guts have executed. It will 
be a little bit bulky, but much clearer and will never be affected by 
changes to the pg_*_replication_slot_advance logic.



Another weird part I have found is this assignment inside 
pg_logical_replication_slot_advance:


/* Initialize our return value in case we don't do anything */
retlsn = MyReplicationSlot->data.confirmed_flush;

It looks redundant, since later we do the same assignment, which should 
be reachable in any case.


I will recheck everything again and try to come up with something during 
this week.



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





How to test GSSAPI based encryption support

2019-12-25 Thread Abbas Butt
Hi,

I want to test GSSAPI based encryption support added via commit ID
b0b39f72b9904bcb80f97b35837ccff1578aa4b8.

I have built source --with-gssapi.
After installation I added the following line in pg_hba.conf
hostgssenc  all  all   172.16.214.149/24   trust
Next I ran the server and ran psql using
./psql 'host=172.16.214.149 port=5432 dbname=postgres user=abbas
gssencmode=require'
and it resulted in the following error:
psql: error: could not connect to server: GSSAPI encryption required but
was impossible (possibly no credential cache, no server support, or using a
local socket)

What steps should I follow if I want to test just the encryption support?

If GSSAPI-based encryption support cannot be tested without GSSAPI
(Kerberos) based authentication, then what is the purpose of having trust
as an authentication method for hostgssenc connections?

Best Regards
-- 
*Abbas*
Architect

Ph: 92.334.5100153
Skype ID: gabbasb
www.enterprisedb.com


*Follow us on Twitter* @EnterpriseDB

Visit EnterpriseDB for tutorials, webinars, whitepapers and more



Add pg_file_sync() to adminpack

2019-12-25 Thread Fujii Masao
Hi,

I'd like to propose to add pg_file_sync() function into contrib/adminpack.
This function fsyncs the file or directory named by its argument.
IMO this is useful, for example, when you want to fsync the file that
pg_file_write() writes out or that COPY TO exports the data into,
for durability. Thoughts?
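
For example, a usage sketch (the file name is illustrative, and I'm
assuming the proposed signature pg_file_sync(filename text)):

    -- write a file under the data directory with adminpack's
    -- pg_file_write(), then flush it to disk with pg_file_sync()
    SELECT pg_file_write('export.csv', 'id,name' || chr(10) || '1,foo', false);
    SELECT pg_file_sync('export.csv');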

Regards,

-- 
Fujii Masao


pg_file_sync_v1.patch
Description: Binary data


Re: [HACKERS] Block level parallel vacuum

2019-12-25 Thread Masahiko Sawada
On Tue, 24 Dec 2019 at 15:46, Masahiko Sawada
 wrote:
>
> On Tue, 24 Dec 2019 at 15:44, Amit Kapila  wrote:
> >
> > On Tue, Dec 24, 2019 at 12:08 PM Masahiko Sawada
> >  wrote:
> > >
> > >
> > > The first patches look good to me. I'm reviewing the other patches and
> > > will post comments if there are any.
> > >
>
> Oops I meant first "two" patches look good to me.
>
> >
> > Okay, feel free to address few comments raised by Mahendra along with
> > whatever you find.
>
> Thanks!
>

I've attached an updated patch set, as the previous version conflicts
with the current HEAD. This patch set incorporates the review
comments, a few fixes, and the patch for
PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION. The 0001 patch is the same
as in the previous version.

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v40-0004-Add-ability-to-disable-leader-participation-in-p.patch
Description: Binary data


v40-0001-Introduce-IndexAM-fields-for-parallel-vacuum.patch
Description: Binary data


v40-0003-Add-FAST-option-to-vacuum-command.patch
Description: Binary data


v40-0002-Add-a-parallel-option-to-the-VACUUM-command.patch
Description: Binary data


Re: Online checksums patch - once again

2019-12-25 Thread Sergei Kornilov
Hello

> Attached is a v15 of the online checksums patchset (minus 0005), rebased on
> top of your v3 ProcSignalBarrier patch rather than Andres' PoC GlobalBarrier
> patch. It does take the, perhaps, controversial approach of replacing the
> SAMPLE barrier with the CHECKSUM barrier. The cfbot will be angry since this
> email doesn't contain the procsignalbarrier patch, but it sounded like that
> would go in shortly, so I opted for that.

ProcSignalBarrier was committed, so the online checksums patchset has no other 
pending dependencies and should apply cleanly on master, right? In that case 
the patchset needs another rebase; it does not apply...

regards, Sergei




Re: Disallow cancellation of waiting for synchronous replication

2019-12-25 Thread Marco Slot
On Wed, Dec 25, 2019, 11:28 Maksim Milyutin  wrote:

> But in this case locally committed data becomes visible to new incoming
> transactions, which is a bad side-effect of this issue.
>

Your application should be prepared for that in any case.

At the point where synchronous replication waits, the commit has already
been written to disk on the primary. If postgres restarts while waiting for
replication, then the write becomes immediately visible regardless of
whether it was replicated. I don't think throwing a PANIC actually prevents
that, and if it does, it's coincidental. The best you can do is signal to the
client that the commit status is unknown.

That's far from ideal, but fixing it requires a much bigger change to
streaming replication. The write should be replicated prior to commit on
the primary, but applied afterwards, in a way where unapplied writes on the
secondary can be overwritten or discarded if it turns out they did not commit
on the primary.

Marco


Re: ALTER TABLE support for dropping generation expression

2019-12-25 Thread Sergei Kornilov
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   tested, failed
Spec compliant:   not tested
Documentation:tested, passed

Hello

Patch does not apply to master. Could you rebase?

Code looks good and very similar to "ALTER TABLE ... ALTER COLUMN ... DROP 
IDENTITY"

I noticed one bug:

create table testdrop (i int, b int, m int GENERATED ALWAYS AS ( i*2) stored);
insert into testdrop(i,b) values (3,4);
alter table testdrop alter COLUMN m drop expression ;
alter table testdrop drop column i;

Here is no "m" column anymore. Possible due some forgotten dependency?

regards, Sergei

The new status of this patch is: Waiting on Author


Re: Disallow cancellation of waiting for synchronous replication

2019-12-25 Thread Andrey Borodin



> On 25 Dec 2019, at 15:28, Maksim Milyutin  wrote:
> 
>> Synchronous replication
>> does not guarantee that a committed write is actually on any replica,
>> but it does in general guarantee that a commit has been replicated
>> before sending a response to the client. That's arguably more
>> important, because the rest of the application might depend on the
>> transaction completing and replicating successfully. I don't know of
>> cases other than cancellation in which a response is sent to the
>> client without replication when synchronous replication is enabled.
> 
> 
> Yes, at query cancellation (e.g. by a timeout from the client driver) the client 
> receives a response saying the transaction completed (though with a warning that 
> not all client drivers can handle properly), and the guarantee that the 
> transaction was successfully replicated is *violated*.

We obviously need a design discussion here to address the issue. But the 
immediate question is: should we add this topic to the January CF items?

Best regards, Andrey Borodin.



Re: Disallow cancellation of waiting for synchronous replication

2019-12-25 Thread Maksim Milyutin

On 21.12.2019 13:34, Marco Slot wrote:


I do agree with the general sentiment that terminating the connection
is preferable over sending a response to the client (except when
synchronous replication was already disabled).



But in this case locally committed data becomes visible to new incoming 
transactions, which is a bad side-effect of this issue. Under failover those 
changes may be undone.




Synchronous replication
does not guarantee that a committed write is actually on any replica,
but it does in general guarantee that a commit has been replicated
before sending a response to the client. That's arguably more
important, because the rest of the application might depend on the
transaction completing and replicating successfully. I don't know of
cases other than cancellation in which a response is sent to the
client without replication when synchronous replication is enabled.



Yes, at query cancellation (e.g. by a timeout from the client driver) the client 
receives a response saying the transaction completed (though with a warning that 
not all client drivers can handle properly), and the guarantee that the 
transaction was successfully replicated is *violated*.
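
For illustration, roughly what that looks like from psql when the wait is
cancelled (message wording quoted from memory and may differ slightly
between versions):

    postgres=# COMMIT;
    -- Ctrl+C while waiting for the synchronous standby
    WARNING:  canceling wait for synchronous replication due to user request
    DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
    COMMIT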



--
Best regards,
Maksim Milyutin





Re: Disallow cancellation of waiting for synchronous replication

2019-12-25 Thread Maksim Milyutin

On 21.12.2019 00:19, Tom Lane wrote:


There is still a problem when the backend is not canceled but terminated [2].

Exactly.  If you don't have a fix that handles that case, you don't have
anything.  In fact, you've arguably made things worse, by increasing the
temptation to terminate or "kill -9" the nonresponsive session.



I assume that the backend termination which, in Andrey's patch proposal, 
causes termination of the PostgreSQL instance has to be handled by 
external HA agents that, as the parent process of postmaster, could 
intercept such terminations and make appropriate decisions, e.g., restart 
the PostgreSQL node in a state closed to external users (via pg_hba.conf 
manipulation) until all sync replicas have synchronized the changes from 
the master. The Stolon HA tool implements this strategy [1]. This logic 
(waiting until all replicas declared in synchronous_standby_names have 
replicated all WAL from the master) could be implemented inside the 
PostgreSQL kernel after the recovery process starts and before the database 
is opened to users, and this can be done separately later.


Another approach is to implement two-phase commit over the master and sync 
replicas (as Oracle did in old versions [2]), where the risk of getting 
locally committed data under instance restart or query cancellation is 
minimal (after the final commit phase has started). But this approach has 
a latency penalty and the complexity of resolving partial (prepared but not 
committed) transactions automatically when the coordinator (in this case 
the master node) fails. It would be nice if this approach were implemented 
later as an option of synchronous commit.



1. 
https://github.com/sorintlab/stolon/blob/master/doc/syncrepl.md#handling-postgresql-sync-repl-limits-under-such-circumstances


2. 
https://docs.oracle.com/cd/B28359_01/server.111/b28326/repmaster.htm#i33607


--
Best regards,
Maksim Milyutin





Re: table partition and column default

2019-12-25 Thread Amit Langote
On Wed, Dec 25, 2019 at 5:40 PM Fujii Masao  wrote:
> On Wed, Dec 25, 2019 at 1:56 PM Amit Langote  wrote:
> > IIRC, there was some discussion about implementing a feature whereby
> > a partition's default will be used for an attribute if it's null even after
> > considering the parent table's default, that is, when no default value
> > is defined in the parent.  The details are toward the end of this
> > thread:
> >
> > https://www.postgresql.org/message-id/flat/578398af46350effe7111895a4856b87b02e000e.camel%402ndquadrant.com
>
> Thanks for pointing out that thread!
>
> As you mentioned in that thread, I also think that this current
> behavior (maybe a restriction) should be documented.
> What about adding a note like "a partition's default value is
> not applied when inserting a tuple through a partitioned table."
> to the documentation?

Agreed.

> Patch attached.

Thanks for creating the patch, looks good to me.

Thanks,
Amit




Re: table partition and column default

2019-12-25 Thread Fujii Masao
On Wed, Dec 25, 2019 at 1:56 PM Amit Langote  wrote:
>
> Fujii-san,
>
> On Wed, Dec 25, 2019 at 12:19 PM Fujii Masao  wrote:
> >
> > Hi,
> >
> > As the document explains, column defaults can be specified separately for
> > each partition. But I found that INSERT via the partitioned table ignores
> > that default. Is this expected behavior or a bug?
> >
> > CREATE TABLE test (i INT, j INT) PARTITION BY RANGE (i);
> > CREATE TABLE test1 PARTITION OF test (j DEFAULT 99) FOR VALUES FROM (1) TO 
> > (10);
> > INSERT INTO test VALUES (1, DEFAULT);
> > INSERT INTO test1 VALUES (2, DEFAULT);
> > SELECT * FROM test;
> >  i |   j
> > ---+--------
> >  1 | (null)
> >  2 |     99
> > (2 rows)
> >
> > In the above example, INSERT directly into the partition uses
> > the default, but INSERT via the partitioned table does not.
>
> This is as of now expected.
>
> IIRC, there was some discussion about implementing a feature whereby
> a partition's default will be used for an attribute if it's null even after
> considering the parent table's default, that is, when no default value
> is defined in the parent.  The details are toward the end of this
> thread:
>
> https://www.postgresql.org/message-id/flat/578398af46350effe7111895a4856b87b02e000e.camel%402ndquadrant.com

Thanks for pointing out that thread!

As you mentioned in that thread, I also think that this current
behavior (maybe a restriction) should be documented.
What about adding a note like "a partition's default value is
not applied when inserting a tuple through a partitioned table."
to the documentation?
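
For illustration only (the table names below are made up, not from the
patch): a default defined on the partitioned parent itself is applied when
inserting through the parent, so that can serve as a workaround:

CREATE TABLE test2 (i INT, j INT DEFAULT 99) PARTITION BY RANGE (i);
CREATE TABLE test2_1 PARTITION OF test2 FOR VALUES FROM (1) TO (10);
INSERT INTO test2 VALUES (1, DEFAULT);  -- j is 99 here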

Patch attached.

Regards,

-- 
Fujii Masao


default_in_partition.patch
Description: Binary data


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-25 Thread Kyotaro Horiguchi
At Tue, 24 Dec 2019 16:35:35 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I rebased the patch and changed the default value for the GUC variable
> wal_skip_threshold to 4096 kilobytes in config.sgml, storage.c and
> guc.c. 4096kB was chosen as it is a nice round number close to 500 pages *
> 8kB = 4000kB.

The value in the doc was not correct. I fixed only that value, changing it
from 3192kB to 4096kB.
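
For reference, a minimal postgresql.conf sketch for exercising the patch
(values are illustrative; wal_skip_threshold is the GUC proposed here, and
I believe max_wal_senders must be 0 for wal_level = minimal):

    wal_level = minimal
    max_wal_senders = 0          # required for wal_level = minimal
    wal_skip_threshold = 4096kB  # relations no larger than this are WAL-logged
                                 # at commit instead of being fsynced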

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 2f184c140ab442ee29103be830b3389b71e8e609 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v29] Rework WAL-skipping optimization

While wal_level=minimal, we omit WAL-logging for certain
operations on relfilenodes that are created in the current
transaction. The files are fsynced at commit. The machinery
accelerates bulk-insertion operations, but it fails for certain
sequences of operations, and a crash just after commit may leave
broken table files.

This patch overhauls the machinery so that WAL-logging of all
operations is omitted for such relfilenodes. This patch also
introduces a new feature whereby small files are emitted as a WAL
record instead of being synced. The new GUC variable
wal_skip_threshold controls the threshold.
---
 doc/src/sgml/config.sgml |  43 ++--
 doc/src/sgml/perform.sgml|  47 +
 src/backend/access/gist/gistutil.c   |  31 ++-
 src/backend/access/gist/gistxlog.c   |  21 ++
 src/backend/access/heap/heapam.c |  45 +---
 src/backend/access/heap/heapam_handler.c |  22 +-
 src/backend/access/heap/rewriteheap.c|  21 +-
 src/backend/access/nbtree/nbtsort.c  |  41 +---
 src/backend/access/rmgrdesc/gistdesc.c   |   5 +
 src/backend/access/transam/README|  47 -
 src/backend/access/transam/xact.c|  15 ++
 src/backend/access/transam/xloginsert.c  |  10 +-
 src/backend/access/transam/xlogutils.c   |  17 +-
 src/backend/catalog/heap.c   |   4 +
 src/backend/catalog/storage.c| 257 +--
 src/backend/commands/cluster.c   |  31 +++
 src/backend/commands/copy.c  |  58 +
 src/backend/commands/createas.c  |  11 +-
 src/backend/commands/matview.c   |  12 +-
 src/backend/commands/tablecmds.c |  11 +-
 src/backend/storage/buffer/bufmgr.c  | 123 ++-
 src/backend/storage/smgr/md.c|  35 ++-
 src/backend/storage/smgr/smgr.c  |  37 
 src/backend/utils/cache/relcache.c   | 122 ---
 src/backend/utils/misc/guc.c |  13 ++
 src/bin/psql/input.c |   1 +
 src/include/access/gist_private.h|   2 +
 src/include/access/gistxlog.h|   1 +
 src/include/access/heapam.h  |   3 -
 src/include/access/rewriteheap.h |   2 +-
 src/include/access/tableam.h |  18 +-
 src/include/catalog/storage.h|   5 +
 src/include/storage/bufmgr.h |   4 +
 src/include/storage/smgr.h   |   1 +
 src/include/utils/rel.h  |  57 +++--
 src/include/utils/relcache.h |   8 +-
 src/test/regress/pg_regress.c|   2 +
 37 files changed, 839 insertions(+), 344 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..d893864c40 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2481,21 +2481,14 @@ include_dir 'conf.d'
 levels.  This parameter can only be set at server start.


-In minimal level, WAL-logging of some bulk
-operations can be safely skipped, which can make those
-operations much faster (see ).
-Operations in which this optimization can be applied include:
-
- CREATE TABLE AS
- CREATE INDEX
- CLUSTER
- COPY into tables that were created or truncated in the same
- transaction
-
-But minimal WAL does not contain enough information to reconstruct the
-data from a base backup and the WAL logs, so replica or
-higher must be used to enable WAL archiving
-() and streaming replication.
+In minimal level, no information is logged for
+tables or indexes for the remainder of a transaction that creates or
+truncates them.  This can make bulk operations much faster (see
+).  But minimal WAL does not contain
+enough information to reconstruct the data from a base backup and the
+WAL logs, so replica or higher must be used to
+enable WAL archiving () and
+streaming replication.


 In logical level, the same information is logged as
@@ -2887,6 +2880,26 @@ include_dir 'conf.d'
   
  
 
+ 
+  wal_skip_threshold (integer)
+  
+   wal_skip_threshold configuration parameter
+  
+  
+  
+   
+When wal_level is minimal and a
+transaction commits after creating or 

Re: Assert failure due to "drop schema pg_temp_3 cascade" for temporary tables and \d+ is not showing any info after dropping temp table schema

2019-12-25 Thread Michael Paquier
On Wed, Dec 25, 2019 at 10:07:58AM +0530, Mahendra Singh wrote:
> Yes, you are right that we can drop the temporary schema of other sessions.

I have mentioned that upthread, and basically we need to use
isAnyTempNamespace() here.  My mistake.

> While applying the attached patch on HEAD, I got the warnings below:

The patch applies cleanly for me.
--
Michael


signature.asc
Description: PGP signature