Re: Server crash with Master-Slave configuration.

2019-12-24 Thread Prabhat Sahu
On Wed, Dec 25, 2019 at 8:01 AM Michael Paquier  wrote:

> On Tue, Dec 24, 2019 at 05:29:25PM +0530, Prabhat Sahu wrote:
> > While performing the operations below with a Master-Slave configuration,
> > the Slave crashed.
> > Below are the steps to reproduce:
> >
> > -- create a Slave using pg_basebackup and start:
> > ./pg_basebackup -v -R -D d2 -p 55510
> > mkdir /home/centos/ts1
> >
> > -- Session 1(Master):
> > ./psql postgres -p 55510
> >
> > CREATE TABLESPACE ts1 location '/home/centos/ts1';
>
> Your mistake is here.  Both primary and standby are on the same host,
> so CREATE TABLESPACE points to a path that overlaps for both
> clusters, as the tablespace path is registered when the WAL is
> replayed, leading to various weird behaviors.  What you need to do
> instead is to create the tablespace before taking the base backup,
> and then take the base backup using pg_basebackup's
> --tablespace-mapping.

Thanks Michael for pointing it out. I have re-tested the scenario
with the "--tablespace-mapping=OLDDIR=NEWDIR" option of pg_basebackup, and now
it's working fine.
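
For the archives, the sequence that now works for me looks like this (the
Slave's tablespace directory name is only illustrative):

-- on the Master (port 55510):
CREATE TABLESPACE ts1 LOCATION '/home/centos/ts1';

-- then create the Slave, remapping the tablespace to a separate path:
mkdir /home/centos/ts1_slave
./pg_basebackup -v -R -D d2 -p 55510 \
    --tablespace-mapping=/home/centos/ts1=/home/centos/ts1_slave
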
But I think a proper error message, instead of a crash, would be better.


> --
> Michael
>


-- 

With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.

The Postgres Database Company


Re: unsupportable composite type partition keys

2019-12-24 Thread Tom Lane
Amit Langote  writes:
> On Tue, Dec 24, 2019 at 10:59 AM Amit Langote  wrote:
>> Btw, does the memory leakage fix in this patch address any of the
>> pending concerns that were discussed on the "hyrax vs.
>> RelationBuildPartitionDesc" thread earlier this year[1]?
>> [1] 
>> https://www.postgresql.org/message-id/flat/3800.1560366716%40sss.pgh.pa.us#092b6b4f6bf75d2f3f90ef6a3b3eab5b

> I thought about this a little and I think it *does* address the main
> complaint in the above thread.

I experimented with the test shown in [1].  This patch does prevent that
case from accumulating copies of the partition descriptor.

(The performance of that test case is still awful, more or less O(N^2)
in the number of repetitions.  But I think what's going on is that it
repeatedly creates and deletes the same catalog entries, and we're not
smart enough to recognize that the dead row versions are fully dead,
so lots of time is wasted by unique-index checks.  It's not clear
that that's of any interest for real-world cases.)

I remain of the opinion that this is a pretty crummy, ad-hoc way to manage
the partition descriptor caching.  It's less bad than before, but I'm
still concerned that holding a relcache entry open for any long period
could result in bloat if the cache entry is rebuilt many times meanwhile
--- and there's no strong reason to think that can't happen.  Still,
maybe we can wait to solve that until there's some evidence that it
does happen in useful cases.

I also poked at the test case mentioned in the other thread about foreign
keys across lots of partitions [2].  Again, this patch gets rid of the
memory bloat, though the performance is still pretty awful with lots of
partitions for other reasons.

regards, tom lane

[1] https://www.postgresql.org/message-id/10797.1552679128%40sss.pgh.pa.us
[2] 
https://www.postgresql.org/message-id/OSAPR01MB374809E8DE169C8BF2B82CBD9F6B0%40OSAPR01MB3748.jpnprd01.prod.outlook.com




RE: Implementing Incremental View Maintenance

2019-12-24 Thread tsunakawa.ta...@fujitsu.com
From: Tatsuo Ishii 
> AFAIK the benefit of ON STATEMENT is that the transaction can see the
> result of updates to the base tables. With ON COMMIT, the transaction
> does not see the result until the transaction commits.
> 
> Well, I can see use cases for IVM in both DWH and OLTP.
> 
> For example, a user creates DWH-like data using a materialized
> view.  After the initial data is loaded, the data is seldom updated.
> However, one day the user wants to change just one row to see how it
> affects the whole DWH data.  IVM will help here because it can be
> done in a shorter time than loading the whole data.

> I heard that REFRESH ON STATEMENT of Oracle was added after the ON
> COMMIT materialized view. So I suspect Oracle realized that there are
> needs/use cases for ON STATEMENT, but I am not sure.

Yes, it was added relatively recently in Oracle Database 12.2.  As the 
following introduction to new features shows, the benefits are described as 
twofold:
1) The transaction can see the refreshed view result without committing.
2) The materialized view log is not needed.

I guess from these that the ON STATEMENT refresh mode can be useful when the 
user wants to experiment with some changes to see how a data change could 
affect the analytics result, without persisting the change.  I think that type 
of experiment is done in completely or almost static data marts where the user 
is allowed to modify the data freely.  The ON STATEMENT refresh mode wouldn't 
be for a DWH that requires high-performance, regular and/or continuous data 
loading and maintenance based on a rigorous discipline.  But I'm still not 
sure if this is a real-world use case...

https://docs.oracle.com/en/database/oracle/oracle-database/19/dwhsg/release-changes.html#GUID-2A2D6E3B-A3FD-47A8-82A3-1EF95AEF5993
--
ON STATEMENT refresh mode for materialized views 
The ON STATEMENT refresh mode refreshes materialized views every time a DML 
operation is performed on any base table, without the need to commit the 
transaction. This mode does not require you to maintain materialized view logs 
on the base tables. 
--


> Another use case is a ticket selling system. The system shows how many
> tickets remain in real time. For this purpose it needs to
> count the number of tickets already sold from a log table. By using
> IVM, it could be accomplished in a simple and effective way.

Wouldn't the app just have a table like ticket(id, name, quantity), decrement 
the quantity when a ticket is sold, and read the current quantity to know the 
remaining tickets?  If many consumers try to buy tickets for a popular event, 
the materialized view refresh would limit the concurrency.
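
Something like this (an illustrative sketch):

CREATE TABLE ticket (id int PRIMARY KEY, name text, quantity int);

-- a ticket is sold: decrement atomically, only while stock remains
UPDATE ticket SET quantity = quantity - 1 WHERE id = 1 AND quantity > 0;

-- remaining tickets, in real time
SELECT quantity FROM ticket WHERE id = 1;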


> Here are some use cases suitable for IVM that I can think of:
> 
> - Users are creating home-made triggers to get data from tables.  Since
>   IVM could eliminate some of those triggers, we could expect less
>   maintenance cost and fewer bugs accidentally brought in when the
>   triggers were created.
> 
> - Any use case in which the cost of refreshing the whole result table
>   (materialized view) is so expensive that it justifies the cost of
>   updating the base tables.  See the example use cases above.

I think we need to find a typical example of this.  That should be useful for 
writing the manual article, because it's better to caution users about which 
cases IVM is a good fit for and which it is not.  Using real-world table names 
in the syntax examples will also be good.


> > * Do you think the benefit of ON STATEMENT (i.e. not having to use a
> > materialized view log) outweighs the drawback of ON STATEMENT (i.e. DML
> > overhead)?
> 
> Outweighs what?

"outweigh" means "exceed."  I meant that I'm wondering if and why users prefer 
ON STATEMENT's benefit despite of its additional overhead on update statements.


Bottom line: The use of triggers makes me hesitate, because I saw someone's 
(probably Fujii-san's) article showing that INSERTs into 
inheritance-and-trigger-based partitioned tables were 10 times slower than 
into declaratively partitioned tables.  I think I will try to find a good use 
case.


Regards
Takayuki Tsunakawa





Re: Avoid full GIN index scan when possible

2019-12-24 Thread Alexander Korotkov
On Sat, Nov 23, 2019 at 2:39 AM Nikita Glukhov  wrote:
> Attached is the 8th version of the patches.

I've read this thread.  I decided to rewrite the patch in a way that
I find simpler and clearer.  Attached is a draft patch written from
scratch, except for the regression tests, which were copied "as is".
It is based on the discussion in this thread as well as my own ideas.
It works as follows.

1) A new GinScanKey->excludeOnly flag is introduced.  This flag means
that the scan key might be satisfied even if none of its entries match
the row.  So, such scan keys are useful only for additional checks of
results returned by other keys.  That is, an excludeOnly scan key is
designed for exclusion of already obtained results.
2) Initially, no hidden scan entries are appended to
GIN_SEARCH_MODE_ALL scan keys.  They are appended after getting
statistics about the search modes applied to particular attributes.
3) We append only one GIN_CAT_EMPTY_QUERY scan entry, when all scan
keys are GIN_SEARCH_MODE_ALL.  If there is at least one normal scan key,
no GIN_CAT_EMPTY_QUERY is appended.
4) No hidden entries are appended to a GIN_SEARCH_MODE_ALL scan key if
there are normal scan keys for the same column.  Otherwise a
GIN_CAT_NULL_KEY hidden entry is appended.
5) GIN_SEARCH_MODE_ALL scan keys, which don't have a GIN_CAT_EMPTY_QUERY
hidden entry, are marked with the excludeOnly flag.  So, they are used
to filter the results of other scan keys.
6) If a GIN_CAT_NULL_KEY hidden entry is found, then the scan key
doesn't match, independently of the result of the consistent function
call.

Therefore, the attached patch removes unnecessary GIN_CAT_EMPTY_QUERY
scan entries without losing the positive effect of filtering in
GIN_SEARCH_MODE_ALL scan keys.
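
To illustrate, here is a self-contained sketch (not the actual patch
code) of one way to read rules 2-5 for a single GIN_SEARCH_MODE_ALL
scan key; the names below are illustrative, not from the patch:

#include <stdbool.h>

typedef struct
{
    bool    addEmptyQuery;  /* append the GIN_CAT_EMPTY_QUERY entry? */
    bool    addNullKey;     /* append a GIN_CAT_NULL_KEY entry? */
    bool    excludeOnly;    /* only filters results of other keys */
} AllKeyPlan;

static void
planAllModeKey(AllKeyPlan *plan, bool allKeysAreAllMode,
               bool firstAllModeKey, bool columnHasNormalKey)
{
    /* Rule 3: a single GIN_CAT_EMPTY_QUERY entry, and only when every
     * scan key is GIN_SEARCH_MODE_ALL. */
    plan->addEmptyQuery = allKeysAreAllMode && firstAllModeKey;
    /* Rule 4: no hidden entry if a normal key already scans this
     * column; otherwise append GIN_CAT_NULL_KEY. */
    plan->addNullKey = !plan->addEmptyQuery && !columnHasNormalKey;
    /* Rule 5: keys without GIN_CAT_EMPTY_QUERY become excludeOnly. */
    plan->excludeOnly = !plan->addEmptyQuery;
}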

The patch requires further polishing, including comments, minor
refactoring, etc.  I'm going to continue working on this.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-Avoid-GIN-full-scan-for-empty-ALL-keys-v09.patch
Description: Binary data


Re: relpages of btree indexes are not truncating even after deleting all the tuples from table and doing vacuum

2019-12-24 Thread Mahendra Singh
On Tue, 24 Dec 2019 at 02:41, Peter Geoghegan  wrote:
>
> On Mon, Dec 23, 2019 at 11:05 AM Mahendra Singh  wrote:
> > From the above example, we can see that after deleting all the tuples
> > from the table and running the vacuum command, the size of the table is
> > reduced but the size of the index relation is the same as before vacuum.
>
> VACUUM is only able to make existing empty pages in indexes recyclable
> by future page splits within the same index. It is not possible for it
> to reclaim space for the filesystem. Workload characteristics tend to
> determine whether or not this limitation is truly important.

Thank you for the clarification.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com




Re: table partition and column default

2019-12-24 Thread Amit Langote
Fujii-san,

On Wed, Dec 25, 2019 at 12:19 PM Fujii Masao  wrote:
>
> Hi,
>
> As the document explains, column defaults can be specified separately for
> each partition. But I found that INSERT via the partitioned table ignores
> that default. Is this expected behavior or a bug?
>
> CREATE TABLE test (i INT, j INT) PARTITION BY RANGE (i);
> CREATE TABLE test1 PARTITION OF test (j DEFAULT 99) FOR VALUES FROM (1) TO 
> (10);
> INSERT INTO test VALUES (1, DEFAULT);
> INSERT INTO test1 VALUES (2, DEFAULT);
> SELECT * FROM test;
>  i |   j
> ---+
>  1 | (null)
>  2 | 99
> (2 rows)
>
> In the above example, an INSERT accessing the partition directly uses
> the default, but an INSERT via the partitioned table does not.

This is, as of now, expected.

IIRC, there was some discussion about implementing a feature whereby a
partition's default will be used for an attribute if it's still null
even after considering the parent table's default, that is, when no
default value is defined in the parent.  The details are toward the end
of this thread:

https://www.postgresql.org/message-id/flat/578398af46350effe7111895a4856b87b02e000e.camel%402ndquadrant.com
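
In the meantime, if the default should also apply to rows routed through
the parent, it has to be defined on the parent; an untested sketch based
on your example:

CREATE TABLE test (i INT, j INT DEFAULT 99) PARTITION BY RANGE (i);
CREATE TABLE test1 PARTITION OF test FOR VALUES FROM (1) TO (10);
INSERT INTO test VALUES (1, DEFAULT);  -- j is 99, the parent's default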

Thanks,
Amit




Re: Assert failure due to "drop schema pg_temp_3 cascade" for temporary tables and \d+ is not showing any info after dropping temp table schema

2019-12-24 Thread Mahendra Singh
On Wed, 25 Dec 2019 at 07:52, Michael Paquier  wrote:
>
> On Tue, Dec 24, 2019 at 04:50:58PM +0530, Mahendra Singh wrote:
> > We can fix this problem by either one way 1) reset myTempNamespace to
> > invalid while drooping schema of temp table 2) should not allow to drop
> > temporary table schema
>
> (Please note that it is better not to cross-post on multiple lists, so

Sorry, I was not aware of the multiple mailing lists. I will take care in future mails.

> I have removed pgsql-bugs from CC.)

Thanks.

> There is a little bit more to that, as we would basically need to do
> the work of RemoveTempRelationsCallback() once the temp schema is
> dropped, callback registered when the schema is correctly created at
> transaction commit (also we need to make sure that
> RemoveTempRelationsCallback is not called or unregistered if we were
> to authorize DROP SCHEMA on a temp schema).  And then all the reset
> done at the beginning of AtEOXact_Namespace() would need to happen.
>

Thanks for the quick, detailed analysis.

> Anyway, as dropping a temporary schema leads to an inconsistent
> behavior when recreating new temporary objects in a session that
> dropped it, that nobody has actually complained about the matter, and
> that in concept a temporary schema is linked to the session that
> created it, I think that we have a lot of arguments to just forbid the
> operation from happening.  Please note as well that it is possible to
> drop temporary schemas of other sessions, still this is limited to
> owners of the schema.

Yes, you are right that we can drop the temporary schemas of other sessions.

Even after applying your attached patch, I am getting the same assert
failure, because I am still able to drop a temporary schema from another
session. So I think we should not allow dropping any temporary schema
from any session.
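
For example, running the statement from the subject line as the schema
owner in a second session:

DROP SCHEMA pg_temp_3 CASCADE;

still reproduces the assert failure.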

> In short, let's tighten the logic, and we had better back-patch this
> one all the way down, 9.4 being broken.  Attached is a patch to do

Yes, I also verified that we have to back-patch down to v9.4.

> that.  The error message generated depends on the state of the session
> so I have not added a test for this reason, and the check is added
> before the ACL check.  We could make the error message more generic,
> like "cannot drop temporary namespace".  Any thoughts?

I think we can make the error message "cannot drop temporary schema".

While applying the attached patch on HEAD, I got the warnings below:

[mahendra@localhost postgres]$ git apply drop-temp-schema-v1.patch
drop-temp-schema-v1.patch:9: trailing whitespace.
/*
drop-temp-schema-v1.patch:10: trailing whitespace.
 * Prevent drop of a temporary schema as this would mess up with
drop-temp-schema-v1.patch:11: trailing whitespace.
 * the end-of-session callback cleaning up all temporary objects.
drop-temp-schema-v1.patch:12: trailing whitespace.
 * As the in-memory state is not cleaned up either here, upon
drop-temp-schema-v1.patch:13: trailing whitespace.
 * recreation of a temporary schema within the same session the
error: patch failed: src/backend/commands/dropcmds.c:101
error: src/backend/commands/dropcmds.c: patch does not apply

I think the above warnings are due to trailing CRs in the patch.
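
If so, stripping the CRs before applying should fix both the warnings and
the apply failure, e.g.:

dos2unix drop-temp-schema-v1.patch
git apply drop-temp-schema-v1.patch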

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com




Re: Physical replication slot advance is not persistent

2019-12-24 Thread Kyotaro Horiguchi
At Tue, 24 Dec 2019 20:12:32 +0300, Alexey Kondratov 
 wrote in 
> I dug into the code, and it happens because of this if statement:
> 
>     /* Update the on disk state when lsn was updated. */
>     if (XLogRecPtrIsInvalid(endlsn))
>     {
>         ReplicationSlotMarkDirty();
>         ReplicationSlotsComputeRequiredXmin(false);
>         ReplicationSlotsComputeRequiredLSN();
>         ReplicationSlotSave();
>     }

Yes, it seems just broken.

> Attached is a small patch, which fixes this bug. I have tried to
> stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))'
> and now pg_logical_replication_slot_advance and
> pg_physical_replication_slot_advance return InvalidXLogRecPtr if
> no-op.
> 
> What do you think?

I think we shouldn't change the definition of
pg_*_replication_slot_advance since the result is user-facing.

The functions return an invalid value only when the slot had an
invalid value and failed to move the position. I think that happens
only for uninitialized slots.

Anyway, what we should do there is dirty the slot when the operation
can be assumed to have succeeded.

As a result, I think what is needed there is just checking whether the
returned lsn is equal to or larger than moveto. Doesn't the following
change work?

-   if (XLogRecPtrIsInvalid(endlsn))
+   if (moveto <= endlsn)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Assert failure due to "drop schema pg_temp_3 cascade" for temporary tables and \d+ is not showing any info after dropping temp table schema

2019-12-24 Thread Michael Paquier
On Wed, Dec 25, 2019 at 12:18:26PM +0900, Kyotaro Horiguchi wrote:
> Still, the owner can drop the temporary namespace of another session, or
> the pg_toast_temp_x of the current session.

Arf.  Yes, this had better be isAnyTempNamespace() so that we complain
about all of them.
--
Michael




table partition and column default

2019-12-24 Thread Fujii Masao
Hi,

As the document explains, column defaults can be specified separately for
each partition. But I found that INSERT via the partitioned table ignores
that default. Is this expected behavior or a bug?

CREATE TABLE test (i INT, j INT) PARTITION BY RANGE (i);
CREATE TABLE test1 PARTITION OF test (j DEFAULT 99) FOR VALUES FROM (1) TO (10);
INSERT INTO test VALUES (1, DEFAULT);
INSERT INTO test1 VALUES (2, DEFAULT);
SELECT * FROM test;
 i |   j
---+
 1 | (null)
 2 | 99
(2 rows)

In the above example, an INSERT accessing the partition directly uses
the default, but an INSERT via the partitioned table does not.

Regards,

-- 
Fujii Masao




Re: Assert failure due to "drop schema pg_temp_3 cascade" for temporary tables and \d+ is not showing any info after drooping temp table schema

2019-12-24 Thread Kyotaro Horiguchi
At Wed, 25 Dec 2019 11:22:03 +0900, Michael Paquier  wrote 
in 
> Anyway, as dropping a temporary schema leads to an inconsistent
> behavior when recreating new temporary objects in a session that
> dropped it, that nobody has actually complained about the matter, and
> that in concept a temporary schema is linked to the session that

Agreed.

> created it, I think that we have a lot of arguments to just forbid the
> operation from happening.  Please note as well that it is possible to
> drop temporary schemas of other sessions, still this is limited to
> owners of the schema.
> 
> In short, let's tighten the logic, and we had better back-patch this
> one all the way down, 9.4 being broken.  Attached is a patch to do
> that.  The error message generated depends on the state of the session
> so I have not added a test for this reason, and the check is added
> before the ACL check.  We could make the error message more generic,
> like "cannot drop temporary namespace".  Any thoughts?

Just inhibiting the action seems reasonable to me.

Still, the owner can drop the temporary namespace of another session, or
the pg_toast_temp_x of the current session.

isTempNamespace(address.objectId) doesn't work for that purpose.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Server crash with Master-Slave configuration.

2019-12-24 Thread Michael Paquier
On Tue, Dec 24, 2019 at 05:29:25PM +0530, Prabhat Sahu wrote:
> While performing the operations below with a Master-Slave configuration,
> the Slave crashed.
> Below are the steps to reproduce:
> 
> -- create a Slave using pg_basebackup and start:
> ./pg_basebackup -v -R -D d2 -p 55510
> mkdir /home/centos/ts1
> 
> -- Session 1(Master):
> ./psql postgres -p 55510
> 
> CREATE TABLESPACE ts1 location '/home/centos/ts1';

Your mistake is here.  Both primary and standby are on the same host,
so CREATE TABLESPACE points to a path that overlaps for both
clusters, as the tablespace path is registered when the WAL is
replayed, leading to various weird behaviors.  What you need to do
instead is to create the tablespace before taking the base backup,
and then take the base backup using pg_basebackup's
--tablespace-mapping.
--
Michael




Re: pgsql: Superuser can permit passwordless connections on postgres_fdw

2019-12-24 Thread Michael Paquier
On Fri, Dec 20, 2019 at 10:17:20PM -0500, Tom Lane wrote:
> Yeah, it's sort of annoying that the buildfarm didn't notice this
> aspect of things.  I'm not sure I want to spend cycles on checking
> it in every test run, though.
> 
> Maybe we could have -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS
> enable a check for that aspect along with what it does now?

Makes sense to restrict that under the flag.  Perhaps a catalog scan
of pg_authid at the end of pg_regress and isolationtester then?
--
Michael




Re: Online checksums verification in the backend

2019-12-24 Thread Masahiko Sawada
On Tue, 24 Dec 2019 at 16:09, Julien Rouhaud  wrote:
>
> On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada  wrote:
> >
> > On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud  wrote:
> > >
> > > This brings the second consideration: how to report the list of corrupted
> > > blocks to end users.  As I said this is for now returned via the SRF,
> > > but this is clearly not ideal and should rather be made available more
> > > globally.  One usage of this information could be block level
> > > recovery.  I'm Cc-ing Sawada-san, as I know he's working on this and
> > > mentioned me that he had ideas on passing the list of corrupted blocks
> > > using the stat collector.
> >
> > Yes, the list of corrupted pages is necessary for single-page
> > recovery. Apart from single-page recovery, I think it's helpful for
> > DBAs if they can find the corrupted blocks in the server logs and in
> > a system view.
> >
> > I've also tried to report corrupted pages to the stats collector
> > while researching single-page recovery in PostgreSQL, but one
> > problem is that the statistics in the stats collector are cleared
> > during crash recovery. I want the information about block corruption
> > to survive even when the server goes down.
>
> Yes, having the list of corrupted blocks surviving a crash-and-restart
> cycle, and also available after a clean shutdown is definitely
> important.
>
> > And we might want to add checksums to the
> > permanent file holding the information about database corruption. The
> > correctness of this information would be important, because we can fix
> > a database by restoring some tables from a logical backup or by doing
> > reindex etc. as long as we have non-broken information about the
> > database corruption.
>
> Agreed
>
> > > Finally, the read and locking considerations.  I tried to cover that
> > > extensively in the comments, but here are some details on how I tried
> > > to make the check safe while trying to keep the overhead as low as
> > > possible.  First thing is that this is only doing buffered reads,
> > > without any attempt to discard OS cache.  Therefore, any discrepancy
> > > between the OS cache and the disk cannot be detected unless you do
> > > other actions, such as sync / drop_caches on GNU/Linux.
> > >
> > > An access share lock on the currently checked relation is held,
> > > meaning that it can't get deleted/truncated.  The total number of
> > > blocks for the given fork is retrieved first, so any new block will be
> > > ignored.  Such new blocks are considered out of scope as being written
> > > after the start of the check.
> > >
> > > Each time a buffer is being checked, the target buffer mapping
> > > partition lock is acquired in shared mode, to prevent concurrent
> > > eviction.  If the buffer is found in shared buffers, it's pinned and
> > > released immediately, just to get the state.
> >
> > I wonder if there is a possibility that blocks on disk can be corrupted
> > even if they are loaded into shared buffers. ISTM the above method
> > cannot detect such corruption. Reading and checking blocks fast is
> > attractive, but I thought it's also important to check blocks precisely,
> > without overlooking anything.
>
> It can definitely happen, and it's the usual doomsday scenario:
> the database works fine for months, then postgres is restarted, say
> for a minor version upgrade, and then boom: the most popular blocks
> that are constantly used read-only were corrupted on disk but never
> evicted from shared buffers, and you have a major outage.  I have
> witnessed that unfortunately too many times.  This is especially bad
> as in this kind of scenario you typically discover the corruption
> once all backups contain only the corrupted blocks.
>
> Note that in the approach I'm suggesting, I do verify blocks that are
> loaded in shared buffers; I only ignore the dirty blocks, as they'll
> be written by the checkpointer or recovery process in case of an unclean
> shutdown.  A buffer pin isn't necessary to avoid a torn page read; an IO
> lock also guarantees that and causes less overhead.  The included TAP
> test should also detect the corruption of a
> present-in-shared-buffers-non-dirty block.  It could however be
> improved, e.g. by calling pg_prewarm to make sure that it's indeed in
> shared_buffers, and also by doing the same test after a clean restart to
> make sure that it's hitting the not-in-shared-buffers case.

It reads the blocks from disk even if they are loaded in shared
buffers. Now I understand. Thanks!

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Assert failure due to "drop schema pg_temp_3 cascade" for temporary tables and \d+ is not showing any info after dropping temp table schema

2019-12-24 Thread Michael Paquier
On Tue, Dec 24, 2019 at 04:50:58PM +0530, Mahendra Singh wrote:
> We can fix this problem in either of two ways: 1) reset myTempNamespace
> to invalid while dropping the schema of a temp table, or 2) do not allow
> dropping a temporary table's schema.

(Please note that it is better not to cross-post on multiple lists, so
I have removed pgsql-bugs from CC.) 

There is a little bit more to that, as we would basically need to do
the work of RemoveTempRelationsCallback() once the temp schema is
dropped, callback registered when the schema is correctly created at
transaction commit (also we need to make sure that
RemoveTempRelationsCallback is not called or unregistered if we were
to authorize DROP SCHEMA on a temp schema).  And then all the reset
done at the beginning of AtEOXact_Namespace() would need to happen.

Anyway, as dropping a temporary schema leads to an inconsistent
behavior when recreating new temporary objects in a session that
dropped it, that nobody has actually complained about the matter, and
that in concept a temporary schema is linked to the session that
created it, I think that we have a lot of arguments to just forbid the
operation from happening.  Please note as well that it is possible to
drop temporary schemas of other sessions, still this is limited to
owners of the schema.

In short, let's tighten the logic, and we had better back-patch this
one all the way down, 9.4 being broken.  Attached is a patch to do
that.  The error message generated depends on the state of the session
so I have not added a test for this reason, and the check is added
before the ACL check.  We could make the error message more generic,
like "cannot drop temporary namespace".  Any thoughts?
--
Michael
diff --git a/src/backend/commands/dropcmds.c b/src/backend/commands/dropcmds.c
index be7a40d5d2..13a7c327f3 100644
--- a/src/backend/commands/dropcmds.c
+++ b/src/backend/commands/dropcmds.c
@@ -101,6 +101,20 @@ RemoveObjects(DropStmt *stmt)
 		 errhint("Use DROP AGGREGATE to drop aggregate functions.")));
 		}
 
+		/*
+		 * Prevent drop of a temporary schema as this would mess up with
+		 * the end-of-session callback cleaning up all temporary objects.
+		 * As the in-memory state is not cleaned up either here, upon
+		 * recreation of a temporary schema within the same session the
+		 * temporary object handling would be inconsistent.
+		 */
+		if (stmt->removeType == OBJECT_SCHEMA &&
+			isTempNamespace(address.objectId))
+			ereport(ERROR,
+					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+					 errmsg("cannot drop temporary namespace \"%s\"",
+							get_namespace_name(address.objectId))));
+
 		/* Check permissions. */
 		namespaceId = get_object_namespace(&address);
 		if (!OidIsValid(namespaceId) ||




Re: mdclose() does not cope w/ FileClose() failure

2019-12-24 Thread Kyotaro Horiguchi
At Tue, 24 Dec 2019 11:57:39 -0800, Noah Misch  wrote in 
> On Mon, Dec 23, 2019 at 07:41:49PM +0900, Kyotaro Horiguchi wrote:
> > If I understand it correctly, it is mentioning the number of all the
> > segment files in a fork, not the length of the md_seg_fds arrays at a
> > certain moment. But actually _fdvec_resize is called for every segment
> > opening during mdnblocks (just after mdopen), and every segment
> > closing during mdclose and mdtruncate as mentioned here. We are going
> > to omit pallocs only in the decreasing case.
> 
> That is a good point.  How frequently one adds 1 GiB of data is not the main
> issue.  mdclose() and subsequent re-opening of all segments will be more
> relevant to overall performance.

Yes, that's exactly what I meant.

> > If we regard repalloc as far faster than FileOpen/FileClose, or we care
> > only about the increase of the segment number of mdopen'ed files and
> > don't care about the frequent resizes that happen during the functions
> > above, then the comment is right and we may resize the array in a
> > segment-by-segment manner.
> 
> In most cases, the array will fit into a power-of-two chunk, so repalloc()
> already does the right thing.  Once the table has more than ~1000 segments (~1
> TiB table size), the allocation will get a single-chunk block, and every
> subsequent repalloc() will call realloc().  Even then, repalloc() probably is
> far faster than File operations.  Likely, I should just accept the extra
> repalloc() calls and drop the "else if" change in _fdvec_resize().

I'm not sure which is better. If we say we know that
repalloc(AllocSetRealloc) doesn't free memory at all, there's no point
in calling repalloc for shrinking, and we could omit that in the
name of optimization.  If we say we want to free memory as much as
possible, we should call repalloc, pretending to believe that that
happens.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Should we rename amapi.h and amapi.c?

2019-12-24 Thread Michael Paquier
On Tue, Dec 24, 2019 at 02:22:22PM +0100, Fabien COELHO wrote:
> The change does not attempt to keep included files in ab order. Should it do
> that, or is it fixed later by some reindentation phase?

Yeah, it should.  Committed after fixing all that stuff.
--
Michael




Re: unsupportable composite type partition keys

2019-12-24 Thread Amit Langote
On Wed, Dec 25, 2019 at 2:42 AM Tom Lane  wrote:
> Amit Langote  writes:
> > On Tue, Dec 24, 2019 at 12:00 AM Tom Lane  wrote:
> >> BTW, I forgot to mention: while I think the patch to forbid pseudotypes
> >> by using CheckAttributeType() can be back-patched, I'm leaning towards
> >> not back-patching the other patch.  The situation where we get into
> >> infinite recursion seems not very likely in practice, and it's not
> >> going to cause any crash or data loss, so I think we can just say
> >> "sorry that's not supported before v13".  The patch as I'm proposing
> >> it seems rather invasive for a back-branch fix.
>
> > It is indeed.
>
> > Just to be sure, by going with "unsupported before v13", which one do you 
> > mean:
>
> > * documenting it as so
> > * giving an error in such cases, like the patch in the first email on
> > this thread did
> > * doing nothing really
>
> I was thinking "do nothing in the back branches".  I don't believe we
> can detect such cases reliably (at least not without complicated logic,
> which'd defeat the point), so I don't think giving an error is actually
> feasible, and I doubt that documenting it would be useful.  If we get
> some field complaints about this, it'd be time enough to reconsider.

Sure, thanks for the reply.

Regards,
Amit




Re: Implementing Incremental View Maintenance

2019-12-24 Thread Tatsuo Ishii
> Materialized views remind me of their use in a data warehouse.  Oracle covers 
> the topic in its Database Data Warehousing Guide, and Microsoft has just 
> started to offer the materialized view feature in its Azure Synapse Analytics 
> (formerly SQL Data Warehouse).  AWS has also previewed Redshift's 
> materialized view feature at re:Invent 2019.  Are you targeting the data 
> warehouse (analytics) workload?
> 
> IIUC, to put it (overly) simply, the data warehouse has two kinds of tables:
> 
> * Facts (transaction data): e.g. sales, user activity
> Large amount.  INSERT only on a regular basis (ETL/ELT) or continuously 
> (streaming)
> 
> * Dimensions (master/reference data): e.g. product, customer, time, country
> Small amount.  Infrequently INSERTed or UPDATEd.
> 
> 
> The proposed trigger-based approach does not seem to be suitable for the 
> facts, because the trigger overhead imposed on data loading may offset or 
> exceed the time saved by incrementally refreshing the materialized views.

I think that depends on the use case of the DWH. If the freshness of the
materialized view tables is important for a user, then the cost of the
trigger overhead may be acceptable for that user.

> Then, does the proposed feature fit the dimension tables?  If the 
> materialized view is only based on the dimension data, then the full REFRESH 
> of the materialized view wouldn't take so long.  The typical materialized 
> view should join the fact and dimension tables.  Then, the fact table will 
> have to have the triggers, causing the data loading slowdown.
> 
> I'm saying this because I'm concerned about the trigger-based overhead.  As 
> you know, Oracle uses materialized view logs to save changes and 
> incrementally apply them later to the materialized views (REFRESH ON 
> STATEMENT materialized views don't require the materialized view log, so 
> they might use triggers).  Does any commercial-grade database implement 
> materialized views using triggers?  I couldn't find relevant information 
> regarding Azure Synapse and Redshift.

I heard that REFRESH ON STATEMENT of Oracle was added after the ON
COMMIT materialized view. So I suspect Oracle realized that there are
needs/use cases for ON STATEMENT, but I am not sure.

> If our only handy option is a trigger, can we minimize the overhead by doing 
> the view maintenance at transaction commit?

I am not sure it's worth the trouble. If it involves some form of
logging, then I think it should be used for deferred IVM first, because
that has more use cases than ON COMMIT IVM.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: mdclose() does not cope w/ FileClose() failure

2019-12-24 Thread Noah Misch
On Mon, Dec 23, 2019 at 07:41:49PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 22 Dec 2019 12:21:00 -0800, Noah Misch  wrote in 
> > On Sun, Dec 22, 2019 at 01:19:30AM -0800, Noah Misch wrote:
> > > An alternative would be to call
> > > _fdvec_resize() after every FileClose(), like mdtruncate() does;
> > > however, the repalloc() overhead could be noticeable.  (mdclose() is
> > > called much more
> > > frequently than mdtruncate().)
> > 
> > I can skip repalloc() when the array length decreases, to assuage
> > mdclose()'s worry.  In the mdclose() case, the final
> > _fdvec_resize(reln, fork, 0) will still pfree() the array.  Array
> > elements that mdtruncate() frees today will instead persist to end of
> > transaction.  That is okay, since mdtruncate() crossing more than one
> > segment boundary is fairly infrequent.  For it to happen, you must
> > either create a >2G relation and then TRUNCATE it in the same
> > transaction, or VACUUM must find >1-2G of unused space at the end of
> > the relation.  I'm now inclined to do it that way, attached.
> 
>	 * It doesn't seem worthwhile complicating the code by having a more
>	 * aggressive growth strategy here; the number of segments doesn't
>	 * grow that fast, and the memory context internally will sometimes
> -	 * avoid doing an actual reallocation.
> +	 * avoid doing an actual reallocation.  Likewise, since the number of
> +	 * segments doesn't shrink that fast, don't shrink at all.  During
> +	 * mdclose(), we'll pfree the array at nseg==0.
> 
> If I understand it correctly, it is mentioning the number of all the
> segment files in a fork, not the length of the md_seg_fds arrays at a
> certain moment. But actually _fdvec_resize is called for every segment
> opening during mdnblocks (just after mdopen), and every segment
> closing during mdclose and mdtruncate as mentioned here. We are going
> to omit pallocs only in the decreasing case.

That is a good point.  How frequently one adds 1 GiB of data is not the main
issue.  mdclose() and subsequent re-opening of all segments will be more
relevant to overall performance.

> If we regard repalloc as far faster than FileOpen/FileClose, or we care
> only about the increase of the segment number of mdopen'ed files and
> don't care about the frequent resizes that happen during the functions
> above, then the comment is right and we may resize the array in a
> segment-by-segment manner.

In most cases, the array will fit into a power-of-two chunk, so repalloc()
already does the right thing.  Once the table has more than ~1000 segments (~1
TiB table size), the allocation will get a single-chunk block, and every
subsequent repalloc() will call realloc().  Even then, repalloc() probably is
far faster than File operations.  Likely, I should just accept the extra
repalloc() calls and drop the "else if" change in _fdvec_resize().




Re: Allow an alias to be attached directly to a JOIN ... USING

2019-12-24 Thread Fabien COELHO



On Tue, 17 Sep 2019, Alvaro Herrera wrote:


> Indeed, that seems like a problem, and it's a good question.  You can
> see this on unpatched master with SELECT x.filler FROM
> (pgbench_tellers AS t JOIN b USING (bid)) AS x.


> I'm not sure I understand why that problem is a blocker for this patch.


As discussed on another thread,


https://www.postgresql.org/message-id/flat/2aa57950-b1d7-e9b6-0770-fa592d565...@2ndquadrant.com

the patch does not conform to the spec:

  SQL:2016 Part 2 Foundation, Section 7.10 <joined table>

Basically "x" is expected to include *ONLY* joined attributes with USING, 
i.e. above only x.bid should exists, and per-table aliases are expected to 
still work for other attributes.
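
For instance, with the spec behavior something like this should be
accepted (untested sketch against the standard pgbench schema):

SELECT x.bid, t.tbalance, b.bbalance
  FROM (pgbench_tellers AS t JOIN pgbench_branches AS b USING (bid)) AS x;

-- x exposes only the USING column, so x.bid is valid and x.tbalance is
-- not, while t.tbalance and b.bbalance stay reachable via the table aliases.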


ISTM that this patch could be "returned with feedback".

--
Fabien.




Re: unsupportable composite type partition keys

2019-12-24 Thread Tom Lane
Amit Langote  writes:
> On Tue, Dec 24, 2019 at 12:00 AM Tom Lane  wrote:
>> BTW, I forgot to mention: while I think the patch to forbid pseudotypes
>> by using CheckAttributeType() can be back-patched, I'm leaning towards
>> not back-patching the other patch.  The situation where we get into
>> infinite recursion seems not very likely in practice, and it's not
>> going to cause any crash or data loss, so I think we can just say
>> "sorry that's not supported before v13".  The patch as I'm proposing
>> it seems rather invasive for a back-branch fix.

> It is indeed.

> Just to be sure, by going with "unsupported before v13", which one do you 
> mean:

> * documenting it as so
> * giving an error in such cases, like the patch in the first email on
> this thread did
> * doing nothing really

I was thinking "do nothing in the back branches".  I don't believe we
can detect such cases reliably (at least not without complicated logic,
which'd defeat the point), so I don't think giving an error is actually
feasible, and I doubt that documenting it would be useful.  If we get
some field complaints about this, it'd be time enough to reconsider.

regards, tom lane




Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

2019-12-24 Thread Tom Lane
Alvaro Herrera  writes:
> Am I the only one bothered by the fact that this patch (and all
> downstream discussion) reduces the term "bitwise equality" to simply
> "bitwise"?  It reads really strange to me, both in the resulting SQL
> grammar as well as in struct names, code comments etc.  "This operator
> class is bitwise."

I agree, that's really poor English.

regards, tom lane




Physical replication slot advance is not persistent

2019-12-24 Thread Alexey Kondratov

Hi Hackers,

I have accidentally noticed that pg_replication_slot_advance only 
changes the in-memory state of the slot when its type is physical. Its 
new value does not survive a restart.


Reproduction steps:

1) Create new slot and remember its restart_lsn

SELECT pg_create_physical_replication_slot('slot1', true);
SELECT * from pg_replication_slots;

2) Generate some dummy WAL

CHECKPOINT;
SELECT pg_switch_wal();
CHECKPOINT;
SELECT pg_switch_wal();

3) Advance slot to the value of pg_current_wal_insert_lsn()

SELECT pg_replication_slot_advance('slot1', '0/160001A0');

4) Check that restart_lsn has been updated

SELECT * from pg_replication_slots;

5) Restart the server and check restart_lsn again. It is back to the same 
value as in step 1, i.e. the advance has been lost.



I dug into the code, and it happens because of this if statement:

    /* Update the on disk state when lsn was updated. */
    if (XLogRecPtrIsInvalid(endlsn))
    {
        ReplicationSlotMarkDirty();
        ReplicationSlotsComputeRequiredXmin(false);
        ReplicationSlotsComputeRequiredLSN();
        ReplicationSlotSave();
    }

Actually, endlsn is always a valid LSN after the execution of the 
replication slot advance guts. It works for logical slots only by 
chance, since there is an implicit ReplicationSlotMarkDirty() call 
inside LogicalConfirmReceivedLocation.


Attached is a small patch, which fixes this bug. I have tried to
stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))'
and now pg_logical_replication_slot_advance and
pg_physical_replication_slot_advance return InvalidXLogRecPtr if
no-op.

What do you think?


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

P.S. CCed Simon and Michael as they are the last ones who seriously touched 
the pg_replication_slot_advance code.


>From 36d1fa2a89b3fb354a813354496df475ee11b62e Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Tue, 24 Dec 2019 18:21:50 +0300
Subject: [PATCH v1] Make phsycal replslot advance persistent

---
 src/backend/replication/slotfuncs.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 46e6dd4d12..826708d3f6 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -358,12 +358,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
  * The LSN position to move to is compared simply to the slot's restart_lsn,
  * knowing that any position older than that would be removed by successive
  * checkpoints.
+ *
+ * Returns InvalidXLogRecPtr if no-op.
  */
 static XLogRecPtr
 pg_physical_replication_slot_advance(XLogRecPtr moveto)
 {
 	XLogRecPtr	startlsn = MyReplicationSlot->data.restart_lsn;
-	XLogRecPtr	retlsn = startlsn;
+	XLogRecPtr	retlsn = InvalidXLogRecPtr;
 
 	if (startlsn < moveto)
 	{
@@ -386,6 +388,8 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
  * because we need to digest WAL to advance restart_lsn allowing to recycle
  * WAL and removal of old catalog tuples.  As decoding is done in fast_forward
  * mode, no changes are generated anyway.
+ *
+ * Returns InvalidXLogRecPtr if no-op.
  */
 static XLogRecPtr
 pg_logical_replication_slot_advance(XLogRecPtr moveto)
@@ -393,7 +397,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 	LogicalDecodingContext *ctx;
 	ResourceOwner old_resowner = CurrentResourceOwner;
 	XLogRecPtr	startlsn;
-	XLogRecPtr	retlsn;
+	XLogRecPtr	retlsn = InvalidXLogRecPtr;
 
 	PG_TRY();
 	{
@@ -414,9 +418,6 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 		 */
 		startlsn = MyReplicationSlot->data.restart_lsn;
 
-		/* Initialize our return value in case we don't do anything */
-		retlsn = MyReplicationSlot->data.confirmed_flush;
-
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
@@ -480,9 +481,9 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 			 * better than always losing the position even on clean restart.
 			 */
 			ReplicationSlotMarkDirty();
-		}
 
-		retlsn = MyReplicationSlot->data.confirmed_flush;
+			retlsn = MyReplicationSlot->data.confirmed_flush;
+		}
 
 		/* free context, call shutdown callback */
 		FreeDecodingContext(ctx);
@@ -575,7 +576,7 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	nulls[0] = false;
 
 	/* Update the on disk state when lsn was updated. */
-	if (XLogRecPtrIsInvalid(endlsn))
+	if (!XLogRecPtrIsInvalid(endlsn))
 	{
 		ReplicationSlotMarkDirty();
 		ReplicationSlotsComputeRequiredXmin(false);
-- 
2.17.1



Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

2019-12-24 Thread Alvaro Herrera


> @@ -106,6 +106,18 @@ CREATE OPERATOR CLASS name [ DEFAULT ]
>  
> 
>  
> +
> +NOT BITWISE
> +
> + 
> +  If present, the operator class equality is not the same as equivalence.
> +  For example, two numerics can compare equal but have different scales.
> +  Most opclasses implement bitwise equal comparison, alternative 
> behaviour
> +  must be set explicitly.
> + 
> +
> +   

Am I the only one bothered by the fact that this patch (and all
downstream discussion) reduces the term "bitwise equality" to simply
"bitwise"?  It reads really strange to me, both in the resulting SQL
grammar as well as in struct names, code comments etc.  "This operator
class is bitwise."

-- 
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




weird libpq GSSAPI comment

2019-12-24 Thread Alvaro Herrera
I found this comment in fe-connect.c:

/*
 * If GSSAPI is enabled and we have a credential cache, try to
 * set it up before sending startup messages.  If it's already
 * operating, don't try SSL and instead just build the startup
 * packet.
 */

I'm not sure I understand this correctly.  Why does it say "just build
the startup" packet about the SSL thing, when in reality the SSL block
below is unrelated to the GSS logic?  If I consider that SSL is just a
typo for GSS, then the comment doesn't seem to describe the logic
either, because what it does is go to CONNECTION_GSS_STARTUP state which
*doesn't* "build the startup packet" in the sense of pqBuildStartupPacket2/3,
but instead it just does pqPacketSend (which is what the SSL block below
calls "request SSL instead of sending the startup packet").

Also, it says "... and we have a credential cache, try to set it up..." but I
think it should say "if we *don't* have a credential cache".

Now that I've read this code half a dozen times, I think I'm starting to
vaguely understand how it works, but I would have expected the comment
to explain it so that I didn't have to do that.

Can we discuss a better wording for this comment?  I wrote this, but I
don't think it captures all the nuances in this code:

/*
 * If GSSAPI is enabled, we need a credential cache; we may
 * already have it, or set it up if not.  Then, if we don't
 * have a GSS context, request it and switch to
 * CONNECTION_GSS_STARTUP to wait for the response.
 *
 * Fail altogether if GSS is required but cannot be had.
 */

Thanks!

-- 
Álvaro Herrera
http://www.twitter.com/alvherre




Re: Should we rename amapi.h and amapi.c?

2019-12-24 Thread Fabien COELHO


Bonjour Michaël,


> the syscache mainly, so instead the attached patch does the following
> changes:
> - amapi.h -> indexam.h
> - amapi.c -> indexamapi.c.  Here we have an equivalent in access/table/
> as tableamapi.c.
> - amvalidate.c -> indexamvalidate.c
> - amvalidate.h -> indexamvalidate.h
> - genam.c -> indexgenam.c



Patch applies cleanly, compiles, make check-world ok.

The change does not attempt to keep the included files in alphabetical 
order. Should it do that, or is it fixed later by some reindentation phase?


--
Fabien.

Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

2019-12-24 Thread Anastasia Lubennikova

19.12.2019 18:19, Antonin Houska wrote:

> Anastasia Lubennikova  wrote:
>
> > I attached a new version with the pg_opclass documentation update.
> >
> > One more thing I am uncertain about is array_ops. Arrays may contain
> > bitwise and non-bitwise element types.
> > What is the correct value of opcisbitwise for array_ops itself?
>
> How about setting opcisbitwise to false for the array_ops opclass and checking
> opcisbitwise of the element type whenever we need to know whether the array is
> "bitwise equal"? When checking array_eq(), I wondered whether the existence of
> the "expanded array" format is a problem, but it does not seem to be: the
> conversion of an "expanded" value to a "flat" value and then back to "expanded"
> should not change the array contents.
>
> Anyway, in the current version of the patch I see that array_ops opclasses
> have opcisbitwise=true. It should be false even if you don't use the approach
> of checking the element type.
>
> Besides that, I think that record_ops is similar to array_ops and therefore it
> should not set opcisbitwise to true.
>
> I also remember that, when thinking about the problem in the context of the
> aggregate push down patch, I considered some of the geometric types
> problematic. For example, box_eq() uses this expression
>
> #define FPeq(A,B)   (fabs((A) - (B)) <= EPSILON)
>
> so equality does not imply bitwise equality here. Maybe you should only set
> the flag for btree opclasses for now.


Thank you for pointing out the issue with geometric opclasses.
If I understand it correctly, regular float types are not bitwise either.
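
For example:

SELECT '1.0'::numeric = '1.00'::numeric;  -- true, but the stored values
                                          -- differ in scale
SELECT '-0'::float8 = '0'::float8;        -- true, but the bit patterns
                                          -- differ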

I updated the patchset.
The first patch now contains only infrastructure changes
and the second one sets opcisbitwise for btree opclasses in pg_opclass.dat.

I've tried to be conservative and only mark types that are 100%
bitwise-safe.

See the attached v2-Opclass-isbitwise.out file.

Non-atomic types, such as record, range, json and enum, depend on their
element types.  Text can be considered bitwise (i.e. texteq uses memcmp)
only when specific collation clauses are satisfied.


We could make this 'opcisbitwise' parameter an enum (or char) instead of
a boolean, to mark "always bitwise", "never bitwise" and "maybe bitwise".
Though, I doubt it will be helpful in any real use case.


What do you think?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 891c03c4b8f37cf6711d6ae84f4e75573c123ec8
Author: Anastasia 
Date:   Mon Dec 23 19:52:22 2019 +0300

v4-Opclass-bitwise-equality-0001.patch
Add new infrastructure to mark an opclass as BITWISE. Actual values of this parameter will be added in the following patch.

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c915885..9b5c33b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3111,7 +3111,7 @@ create operator public.>^ (
 create operator family my_op_family using btree;
 create function my_op_cmp(a int, b int) returns int as
   $$begin return btint4cmp(a, b); end $$ language plpgsql;
-create operator class my_op_class for type int using btree family my_op_family as
+create operator class my_op_class bitwise for type int using btree family my_op_family as
  operator 1 public.<^,
  operator 3 public.=^,
  operator 5 public.>^,
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4f29e7c..48263d6 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -825,7 +825,7 @@ create operator family my_op_family using btree;
 create function my_op_cmp(a int, b int) returns int as
   $$begin return btint4cmp(a, b); end $$ language plpgsql;
 
-create operator class my_op_class for type int using btree family my_op_family as
+create operator class my_op_class bitwise for type int using btree family my_op_family as
  operator 1 public.<^,
  operator 3 public.=^,
  operator 5 public.>^,
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 55694c4..87cc344 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -4544,6 +4544,13 @@ SCRAM-SHA-256$iteration count:
   Type of data stored in index, or zero if same as opcintype
  
 
+ <row>
+  <entry><structfield>opcisbitwise</structfield></entry>
+  <entry><type>bool</type></entry>
+  <entry></entry>
+  <entry>True if the operator class equality is the same as equivalence</entry>
+ </row>
+
 

   
diff --git a/doc/src/sgml/ref/create_opclass.sgml b/doc/src/sgml/ref/create_opclass.sgml
index dd5252f..8d0ddfc 100644
--- a/doc/src/sgml/ref/create_opclass.sgml
+++ b/doc/src/sgml/ref/create_opclass.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-CREATE OPERATOR CLASS name [ DEFAULT ] FOR TYPE data_type
+CREATE OPERATOR CLASS name [ DEFAULT | BITWISE ] FOR TYPE data_type
   USING index_method [ FAMILY family_name ] AS
   {  OPERATOR strategy_number operator_name [ ( op_type, op_type ) ] [ FOR SEARCH | FOR ORDER BY sort_family_name ]

Re: [HACKERS] Restricting maximum keep segments by repslots

2019-12-24 Thread Kyotaro Horiguchi
I'm very sorry for being late to reply.

At Wed, 2 Oct 2019 17:08:07 +0200, Jehan-Guillaume de Rorthais 
 wrote in 
> On Tue, 30 Jul 2019 21:30:45 +0900 (Tokyo Standard Time)
> Kyotaro Horiguchi  wrote:
> > > In "pg_replication_slots" view, the new "wal_status" field is misleading.
> > > Consider this sentence and the related behavior from documentation
> > > (catalogs.sgml):
> > > 
> > >   keeping means that some of them are to be removed by
> > > the next checkpoint.
> > > 
> > > "keeping" appears when the current checkpoint will delete some WAL further
> > > than "current_lsn - max_slot_wal_keep_size", but still required by at 
> > > least
> > > one slot. As some WAL required by some slots will be deleted quite soon,
> > > probably before anyone can react, "keeping" status is misleading here. We
> > > are already in the red zone.  
> > 
> > It may be "losing", which would be less misleading.
> 
> Indeed, "losing" is a better match for this state.
>
> However, what's the point of this state from the admin point of view? In
> various situations, the admin will have no time to react immediately and
> fix whatever could help.
> 
> How useful is this specific state?

If we assume "losing" segments to be "lost", a segment once "lost" could
return to the "keeping" or "streaming" state. That is intuitively
impossible. On the other hand, if we assume them to be "keeping", they
should not be removed by the next checkpoint, but actually they can be
removed. The state "losing" means such an unstable state, different from
both "lost" and "keeping".

> > > I would expect this "wal_status" to be:
> > > 
> > > - streaming: slot lag between 0 and "max_wal_size"
> > > - keeping: slot lag between "max_wal_size" and "max_slot_wal_keep_size". 
> > > the
> > >   slot actually protect some WALs from being deleted
> > > - lost: slot lag greater than max_slot_wal_keep_size. The slot couldn't
> > > protect some WAL from deletion
> > 
> > I agree that comparing to max_wal_size is meaningful. The revised
> > version behaves as that.
>
> The v16-0006 patch doesn't apply anymore because of commit 709d003fbd. Here is
> the fix:
> 
>   --- a/src/backend/access/transam/xlogreader.c
>   +++ b/src/backend/access/transam/xlogreader.c
>   @@ -304,7 +304,7
>   -   XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size);
>   +   XLByteToSeg(targetPagePtr, targetSegNo, state->segcxt.ws_segsize);
> 
> I suppose you might have more refactoring to do in regard to Alvaro's
> review. 
> 
> I confirm the new patch behaves correctly in my tests with regard to the
> "wal_status" field.

Thanks for testing. I fixed it in the attached patch.

> +  Availability of WAL records claimed by this
> +  slot. streaming, keeping,
> 
> Slots are keeping WALs, not WAL records. Shouldn't it be "Availability of WAL
> files claimed by this slot"?

I chose "record" since a slot points to a record. I'm not sure, but I'm
fine with "file". Fixed catalogs.sgml and config.sgml that way.

> +  streaming means that the claimed records are
> +  available within max_wal_size. keeping means
> 
> I wonder if streaming is the appropriate name here. The WALs required might be
> available for streaming, but the slot not active, thus not "streaming". What
> about merging with the "active" field, in the same fashion as
> pg_stat_activity.state? We would have an enum "pg_replication_slots.state" 
> with
> the following states:
> * inactive: non active slot
> * active: activated, required WAL within max_wal_size
> * keeping: activated, max_wal_size is exceeded but still held by replication
>   slots or wal_keep_segments.
> * lost: some WAL are definitely lost
> 
> Thoughts?

In the first place, I realized that I had missed a point about the
relationship between max_wal_size and max_slot_wal_keep_size
here. Since v15 of this patch, GetLsnAvailablity returns
"streaming" when the restart_lsn is within max_wal_size. That behavior
makes sense when max_slot_wal_keep_size > max_wal_size. However, in
the contrary case, restart_lsn could be lost even if it is within
max_wal_size. So we would see "streaming" (or "normal") even though
restart_lsn is already lost. That is broken.

In short, the "streaming/normal" state is useless if
max_slot_wal_keep_size < max_wal_size.


Finally I used the following wordings.

(there's no "inactive" wal_state)

* normal: required WAL within max_wal_size when max_slot_wal_keep_size
  is larger than max_wal_size.
* keeping: required segments are held by replication slots or
  wal_keep_segments.

* losing: required segments are about to be removed or may already be
  removed, but streaming is not dead yet.

* lost: cannot continue streaming using this slot.
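
For illustration, monitoring could then be done with a query like the
following (a sketch against the patched view only; the wal_status
column and its values come from this patch, not released PostgreSQL):

    SELECT slot_name, active, restart_lsn, wal_status
      FROM pg_replication_slots
     WHERE wal_status IN ('losing', 'lost');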

> [...]
> > > * "remain" should be NULL if "max_slot_wal_keep_size=-1 or if the slot 
> > > isn't
> > >   active  
> > 
> > The revised  version shows the following statuses.
> > 
> >streaming / NULL max_slot_wal_keep_size is -1
> >unkown/ NULL mswks 

Server crash with Master-Slave configuration.

2019-12-24 Thread Prabhat Sahu
Hi,

While performing the below operations with a master-slave configuration,
the slave crashed.
Below are the steps to reproduce:

-- create a Slave using pg_basebackup and start:
./pg_basebackup -v -R -D d2 -p 55510
mkdir /home/centos/ts1

-- Session 1(Master):
./psql postgres -p 55510

CREATE TABLESPACE ts1 location '/home/centos/ts1';
CREATE TABLE tab1 (c1 INTEGER, c2 TEXT, c3 point) tablespace ts1;
insert into tab1 (select x, x||'_c2',point (x,x) from
generate_series(1,10) x);

-- Cancel the below update query in middle and then vacuum:
update tab1 set c1=c1+2 , c3=point(10,10) where c1 <=9;
vacuum(analyze) tab1(c3, c2);

postgres=# update tab1 set c1=c1+2 , c3=point(10,10) where c1 <=9;
^CCancel request sent
ERROR:  canceling statement due to user request

postgres=# vacuum(analyze) tab1(c3, c2);
VACUUM

OR

postgres=# vacuum(analyze) tab1(c3, c2);
ERROR:  index "pg_toast_16385_index" contains unexpected zero page at block
0
HINT:  Please REINDEX it.

-- session 2: (slave)
./psql postgres -p 55520

-- The below select query crashes the server:
select count(*) from tab1_2;

postgres=# select count(*) from tab1_2;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

-- Below is the stack trace:
[centos@parallel-vacuum-testing bin]$ gdb -q -c d2/core.20509 postgres
Reading symbols from /home/centos/PGsrc/postgresql/inst/bin/postgres...done.
[New LWP 20509]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: startup   recovering
00010006   '.
Program terminated with signal 6, Aborted.
#0  0x7f42d2565337 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64
krb5-libs-1.15.1-37.el7_7.2.x86_64 libcom_err-1.42.9-16.el7.x86_64
libselinux-2.5-14.1.el7.x86_64 openssl-libs-1.0.2k-19.el7.x86_64
pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x7f42d2565337 in raise () from /lib64/libc.so.6
#1  0x7f42d2566a28 in abort () from /lib64/libc.so.6
#2  0x00a94c55 in errfinish (dummy=0) at elog.c:590
#3  0x00a9729a in elog_finish (elevel=22, fmt=0xb30a10 "WAL
contains references to invalid pages") at elog.c:1465
#4  0x0057cb10 in log_invalid_page (node=..., forkno=MAIN_FORKNUM,
blkno=470, present=false) at xlogutils.c:96
#5  0x0057d64e in XLogReadBufferExtended (rnode=...,
forknum=MAIN_FORKNUM, blkno=470, mode=RBM_NORMAL) at xlogutils.c:472
#6  0x0057d386 in XLogReadBufferForRedoExtended (record=0x1b4a9c8,
block_id=0 '\000', mode=RBM_NORMAL, get_cleanup_lock=true,
buf=0x7ffda55b39d4)
at xlogutils.c:390
#7  0x004f12b5 in heap_xlog_clean (record=0x1b4a9c8) at
heapam.c:7744
#8  0x004f4ebe in heap2_redo (record=0x1b4a9c8) at heapam.c:8891
#9  0x0056cceb in StartupXLOG () at xlog.c:7202
#10 0x0086cb0c in StartupProcessMain () at startup.c:170
#11 0x00582150 in AuxiliaryProcessMain (argc=2,
argv=0x7ffda55b4600) at bootstrap.c:451
#12 0x0086ba0f in StartChildProcess (type=StartupProcess) at
postmaster.c:5461
#13 0x0086685d in PostmasterMain (argc=5, argv=0x1b49d50) at
postmaster.c:1392
#14 0x00775bb1 in main (argc=5, argv=0x1b49d50) at main.c:210
(gdb)

-- 

With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.

The Postgres Database Company


Re: string literal continuations in C

2019-12-24 Thread Joe Conway
On 12/23/19 2:51 PM, Alvaro Herrera wrote:
> Per a recent thread, these patches remove string literals split with
> \-escaped newlines.  The first is for the message "materialize mode
> required, but it is not allowed in this context" where it's more
> prevalent, and we keep perpetuating it; the second is for other
> messages, whose bulk is in dblink and tablefunc.  I think the split is
> pointless and therefore propose to push both together as a single
> commit, but maybe somebody would like me to leave those contrib modules
> alone.

I take it since I was explicitly CC'd that the contrib comment was aimed
at me? I likely copied the convention from somewhere else in the
backend, but I don't care either way if you want to change them. However
I guess we should coordinate since I've been berated regarding error
codes and will likely go change at least two of them in tablefunc soon
(not likely before Thursday though)...

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Assert failure due to "drop schema pg_temp_3 cascade" for temporary tables, and \d+ is not showing any info after dropping the temp table schema

2019-12-24 Thread Mahendra Singh
Hi hackers,

In another thread, "[HACKERS] Block level parallel vacuum" [1], Prabhat Kumar
Sahu reported a random assert failure, but he hit it only once and was not
able to reproduce it. In that thread [2], Amit Kapila suggested some points
to reproduce the assert. I tried them and was able to reproduce it
consistently.

Below are the steps to reproduce the assert:
*Configure setting:*
log_min_messages=debug1
autovacuum_naptime = 5s
autovacuum = on

postgres=# create temporary table temp1(c1 int);
CREATE TABLE
postgres=# \d+
                             List of relations
  Schema   | Name  | Type  |  Owner   | Persistence |  Size   | Description
-----------+-------+-------+----------+-------------+---------+-------------
 pg_temp_3 | temp1 | table | mahendra | temporary   | 0 bytes |
(1 row)

postgres=# drop schema pg_temp_3 cascade;
NOTICE:  drop cascades to table temp1
DROP SCHEMA
postgres=# \d+
Did not find any relations.
postgres=# create temporary table temp2(c1 int);
CREATE TABLE
postgres=# \d+
Did not find any relations.
postgres=# select pg_sleep(6);
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
postgres=#


*Stack Trace:*
elinux-2.5-12.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64
openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64
xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x7f80b2ef9277 in __GI_raise (sig=sig@entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f80b2efa968 in __GI_abort () at abort.c:90
#2  0x00ecdd4e in ExceptionalCondition (conditionName=0x11a9bcb
"strvalue != NULL", errorType=0x11a9bbb "FailedAssertion",
fileName=0x11a9bb0 "snprintf.c", lineNumber=442)
at assert.c:67
#3  0x00f80122 in dopr (target=0x7ffe902e44d0, format=0x10e8fe5
".%s\"", args=0x7ffe902e45b8) at snprintf.c:442
#4  0x00f7f821 in pg_vsnprintf (str=0x18cd480 "autovacuum: dropping
orphan temp table \"postgres.", '\177' ..., count=1024,
fmt=0x10e8fb8 "autovacuum: dropping orphan temp table \"%s.%s.%s\"",
args=0x7ffe902e45b8) at snprintf.c:195
#5  0x00f74cb3 in pvsnprintf (buf=0x18cd480 "autovacuum: dropping
orphan temp table \"postgres.", '\177' ..., len=1024,
fmt=0x10e8fb8 "autovacuum: dropping orphan temp table \"%s.%s.%s\"",
args=0x7ffe902e45b8) at psprintf.c:110
#6  0x00f7775b in appendStringInfoVA (str=0x7ffe902e45d0,
fmt=0x10e8fb8 "autovacuum: dropping orphan temp table \"%s.%s.%s\"",
args=0x7ffe902e45b8) at stringinfo.c:149
#7  0x00ecf5de in errmsg (fmt=0x10e8fb8 "autovacuum: dropping
orphan temp table \"%s.%s.%s\"") at elog.c:832
#8  0x00aef625 in do_autovacuum () at autovacuum.c:2253
#9  0x00aedfae in AutoVacWorkerMain (argc=0, argv=0x0) at
autovacuum.c:1693
#10 0x00aed82f in StartAutoVacWorker () at autovacuum.c:1487
#11 0x00b1773a in StartAutovacuumWorker () at postmaster.c:5562
#12 0x00b16c13 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5279
#13 <signal handler called>
#14 0x7f80b2fb8c53 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:81
#15 0x00b0da27 in ServerLoop () at postmaster.c:1691
#16 0x00b0cfa2 in PostmasterMain (argc=3, argv=0x18cb290) at
postmaster.c:1400
#17 0x0097868a in main (argc=3, argv=0x18cb290) at main.c:210

ereport(LOG,
		(errmsg("autovacuum: dropping orphan temp table \"%s.%s.%s\"",
				get_database_name(MyDatabaseId),
				get_namespace_name(classForm->relnamespace),
				NameStr(classForm->relname))));

I debugged and found that "get_namespace_name(classForm->relnamespace)" was
returning NULL, so it was crashing.

This bug is introduced or exposed from below mentioned commit:
*commit 246a6c8f7b237cc1943efbbb8a7417da9288f5c4*
Author: Michael Paquier 
Date:   Mon Aug 13 11:49:04 2018 +0200

Make autovacuum more aggressive to remove orphaned temp tables

Commit dafa084, added in 10, made the removal of temporary orphaned
tables more aggressive.  This commit makes an extra step into the

Before the above commit, we were not getting any assert failure, but \d+ was
not showing any temp table info after "drop schema pg_temp_3 cascade" (for
those tables created after dropping the schema).

As per my analysis, I can see that while dropping the schema of a temporary
table, we are not setting myTempNamespace to invalid, so when a temporary
table is created again, we are not creating a proper schema.

We can fix this problem in one of two ways: 1) reset myTempNamespace to
invalid while dropping the temp table schema, or 2) do not allow dropping
the temporary table schema.

Please let me know your thoughts.
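
For option 1), a minimal sketch could look like this (the statics are
the existing ones in src/backend/catalog/namespace.c; the function name
and its call site in the schema-drop path are only assumptions):

/*
 * Hypothetical helper: when the schema being dropped is this backend's
 * active temp namespace, forget it so that the next CREATE TEMP TABLE
 * sets up a proper temp schema again.
 */
static void
ResetTempNamespaceIfDropped(Oid namespaceId)
{
	if (OidIsValid(myTempNamespace) && myTempNamespace == namespaceId)
	{
		myTempNamespace = InvalidOid;
		myTempToastNamespace = InvalidOid;
		baseSearchPathValid = false;	/* recompute search_path next time */
	}
}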

Re: Implementing Incremental View Maintenance

2019-12-24 Thread legrand legrand
Yugo Nagata wrote
> On Mon, 23 Dec 2019 03:41:18 -0700 (MST)
> legrand legrand 

> legrand_legrand@

>  wrote:
> 
> [ ...]
> 
>> I would even
>> prefer a common "table" shared between all sessions like GLOBAL TEMPORARY
>> TABLE (or something similar) as described here:
>> https://www.postgresql.org/message-id/flat/157703426606.1198.2452090605041230054.pgcf%40coridan.postgresql.org#331e8344bbae904350af161fb43a0aa6
> 
> Although I have not looked into this thread, this may be help if this is
> implemented. However, it would be still necessary to truncate the table
> before the view maintenance because such tables always exist and can be
> accessed and modified by any users.
> 
> -- 
> Yugo Nagata 

> nagata@.co

> 

For information, in this table data is PRIVATE to each session; it can be
purged on the ON COMMIT event and disappears at SESSION end.
Yes, this feature would only be useful once it's implemented. And you are
right, some data has to be deleted on the ON STATEMENT event (not sure if
TRUNCATE is global or session specific in this situation).





--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: backup manifests

2019-12-24 Thread Suraj Kharage
Thank you for review comments.

On Fri, Dec 20, 2019 at 9:14 PM Robert Haas  wrote:

> On Fri, Dec 20, 2019 at 8:24 AM Suraj Kharage
>  wrote:
> > Thank you for review comments.
>
> Thanks for the new version.
>
> +  --verify-backup 
>
> Whitespace.
>
Corrected.


>
> +struct manifesthash_hash *hashtab;
>
> Uh, I had it in mind that you would nuke this line completely, not
> just remove "typedef" from it. You shouldn't need a global variable
> here.
>

Removed.


> + if (buf == NULL)
>
> pg_malloc seems to have an internal check such that it never returns
> NULL. I don't see anything like this test in other callers.
>

Yeah, removed this check.


>
> The order of operations in create_manifest_hash() seems unusual:
>
> + fd = open(manifest_path, O_RDONLY, 0);
> + if (fstat(fd, &stat))
> + buf = pg_malloc(stat.st_size);
> + hashtab = manifesthash_create(1024, NULL);
> ...
> + entry = manifesthash_insert(hashtab, filename, &found);
> ...
> + close(fd);
>
> I would have expected open-fstat-read-close to be consecutive, and the
> manifesthash stuff all done afterwards. In fact, it seems like reading
> the file could be a separate function.
>

Yes, I created a new function that reads the file and returns the buffer.


>
> + if (strncmp(checksum, "SHA256", 6) == 0)
>
> This isn't really right; it would give a false match if we had a
> checksum algorithm with a name like SHA2560 or SHA256C or
> SHA256ExceptWayBetter. The right thing to do is find the colon first,
> and then probably overwrite it with '\0' so that you have a string
> that you can pass to parse_checksum_algorithm().
>

Corrected this check. The suggestion below allows us to put '\0' within the
line. Since SHA256 is what is used for the backup manifest itself, we can
feed that line early to the checksum machinery.


>
> + /*
> + * we don't have checksum type in the header, so need to
> + * read through the first file enttry to find the checksum
> + * type for the manifest file and initilize the checksum
> + * for the manifest file itself.
> + */
>
> This seems to be proceeding on the assumption that the checksum type
> for the manifest itself will always be the same as the checksum type
> for the first file in the manifest. I don't think that's the right
> approach. I think the manifest should always have a SHA256 checksum,
> regardless of what type of checksum is used for the individual files
> within the manifest. Since the volume of data in the manifest is
> presumably very small compared to the size of the database cluster
> itself, I don't think there should be any performance problem there.
>
Made the change in the backup manifest as well as in the backup validator
patch. Thanks to Rushabh Lathia for the offline discussion and help.

To examine the first word of each line, I am using the below check:
if (strncmp(line, "File", 4) == 0)
{
..
}
else if (strncmp(line, "Manifest-Checksum", 17) == 0)
{
..
}
else
error

strncmp might not be right here, but we cannot put '\0' within the line (to
find the first word) before we recognize the line type.
All the lines except the last one (where we have the manifest checksum) are
fed to the checksum machinery to calculate the manifest checksum,
so update_checksum() should be called after recognizing the type, i.e. if
it is a File type record. Do you see any issues with this?
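
To make the intended flow concrete, here is a rough sketch (illustrative
only, not the patch itself; update_checksum() is the helper from the
patch, and the tab delimiter after the keyword is an assumption, which
also avoids the "SHA256"-prefix pitfall raised earlier):

/*
 * Sketch: classify a manifest line by its first word without writing
 * '\0' into the buffer, and feed every line except the trailing
 * Manifest-Checksum line to the running manifest checksum.
 */
static void
process_manifest_line(char *line, int len)
{
	if (len > 5 && strncmp(line, "File\t", 5) == 0)
	{
		update_checksum(line, len);	/* File records count toward the checksum */
		/* ... parse the file entry ... */
	}
	else if (len > 18 && strncmp(line, "Manifest-Checksum\t", 18) == 0)
	{
		/* last line: verify against the running checksum, don't feed it */
	}
	else
		pg_log_error("unexpected line in backup manifest: \"%s\"", line);
}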

+ filesize = atol(size);
>
> Using strtol() would allow for some error checking.
>
corrected.
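
A sketch of the strtol()-based version (the names are made up):

#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative only: parse the manifest size field with error checking. */
static bool
parse_manifest_filesize(const char *size, long *filesize)
{
	char	   *endptr;

	errno = 0;
	*filesize = strtol(size, &endptr, 10);
	return (errno == 0 && endptr != size && *endptr == '\0' && *filesize >= 0);
}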


>
> + * Increase the checksum by its lable length so that we can
> + checksum = checksum + checksum_lable_length;
>
> Spelling.
>
corrected.


>
> + pg_log_error("invalid record found in \"%s\"", manifest_path);
>
> Error message needs work.
>
> +VerifyBackup(void)
> +create_manifest_hash(char *manifest_path)
> +nextLine(char *buf)
>
> Your function names should be consistent with the surrounding style,
> and with each other, as far as possible. Three different conventions
> within the same patch and source file seems over the top.
>
> Also keep in mind that you're not writing code in a vacuum. There's a
> whole file of code here, and around that, a whole project.
> scan_data_directory() is a good example of a function whose name is
> clearly too generic. It's not a general-purpose function for scanning
> the data directory; it's specifically a support function for verifying
> a backup. Yet, the name gives no hint of this.
>
> +verify_file(struct dirent *de, char fn[MAXPGPATH], struct stat st,
> + char relative_path[MAXPGPATH], manifesthash_hash *hashtab)
>
> I think I commented on the use of char[] parameters in my previous review.
>
> + /* Skip backup manifest file. */
> + if (strcmp(de->d_name, "backup_manifest") == 0)
> + return;
>
> Still looks like this will be skipped at any level of the directory
> hierarchy, not just the top. And why are we skipping backup_manifest
> here bug pg_wal in scan_data_directory? That's a rhetorical question,
> because I know the answer: verify_file() is only getting called for
> files, so you can't use it to skip 

pgbench - use pg logging capabilities

2019-12-24 Thread Fabien COELHO


Bonjour Michaël, hello devs,

As suggested in "cce64a51", this patch make pgbench use postgres logging 
capabilities.


I tried to use fatal/error/warning/info/debug where appropriate.

Some printing to stderr remains for pgbench-specific output.

The patch fixes an inconsistent test case name that I noticed in passing.

--
Fabien.

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 6ab79eef99..a95f238127 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -70,6 +70,14 @@
 #define M_PI 3.14159265358979323846
 #endif
 
+/*
+ * Convenient shortcuts
+ */
+#define fatal pg_log_fatal
+#define error pg_log_error
+#define warning pg_log_warning
+#define info pg_log_info
+#define debug pg_log_debug
 
 #define ERRCODE_UNDEFINED_TABLE  "42P01"
 
@@ -541,7 +549,7 @@ static ParsedScript sql_script[MAX_SCRIPTS];	/* SQL script files */
 static int	num_scripts;		/* number of scripts in sql_script[] */
 static int64 total_weight = 0;
 
-static int	debug = 0;			/* debug flag */
+static int	debug_level = 0;	/* debug flag */
 
 /* Builtin test scripts */
 typedef struct BuiltinScript
@@ -787,14 +795,12 @@ strtoint64(const char *str, bool errorOK, int64 *result)
 
 out_of_range:
 	if (!errorOK)
-		fprintf(stderr,
-"value \"%s\" is out of range for type bigint\n", str);
+		error("value \"%s\" is out of range for type bigint", str);
 	return false;
 
 invalid_syntax:
 	if (!errorOK)
-		fprintf(stderr,
-"invalid input syntax for type bigint: \"%s\"\n", str);
+		error("invalid input syntax for type bigint: \"%s\"", str);
 	return false;
 }
 
@@ -810,16 +816,14 @@ strtodouble(const char *str, bool errorOK, double *dv)
 	if (unlikely(errno != 0))
 	{
 		if (!errorOK)
-			fprintf(stderr,
-	"value \"%s\" is out of range for type double\n", str);
+			error("value \"%s\" is out of range for type double", str);
 		return false;
 	}
 
 	if (unlikely(end == str || *end != '\0'))
 	{
 		if (!errorOK)
-			fprintf(stderr,
-	"invalid input syntax for type double: \"%s\"\n", str);
+			error("invalid input syntax for type double: \"%s\"", str);
 		return false;
 	}
 	return true;
@@ -1149,7 +1153,7 @@ executeStatement(PGconn *con, const char *sql)
 	res = PQexec(con, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
-		fprintf(stderr, "%s", PQerrorMessage(con));
+		fatal("%s", PQerrorMessage(con));
 		exit(1);
 	}
 	PQclear(res);
@@ -1164,8 +1168,8 @@ tryExecuteStatement(PGconn *con, const char *sql)
 	res = PQexec(con, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
-		fprintf(stderr, "%s", PQerrorMessage(con));
-		fprintf(stderr, "(ignoring this error and continuing anyway)\n");
+		error("%s", PQerrorMessage(con));
+		info("(ignoring this error and continuing anyway)");
 	}
 	PQclear(res);
 }
@@ -1211,8 +1215,7 @@ doConnect(void)
 
 		if (!conn)
 		{
-			fprintf(stderr, "connection to database \"%s\" failed\n",
-	dbName);
+			error("connection to database \"%s\" failed", dbName);
 			return NULL;
 		}
 
@@ -1230,8 +1233,8 @@ doConnect(void)
 	/* check to see that the backend connection was successfully made */
 	if (PQstatus(conn) == CONNECTION_BAD)
 	{
-		fprintf(stderr, "connection to database \"%s\" failed:\n%s",
-dbName, PQerrorMessage(conn));
+		error("connection to database \"%s\" failed", dbName);
+		error("%s", PQerrorMessage(conn));
 		PQfinish(conn);
 		return NULL;
 	}
@@ -1360,9 +1363,8 @@ makeVariableValue(Variable *var)
 
 		if (!strtodouble(var->svalue, true, &dv))
 		{
-			fprintf(stderr,
-	"malformed variable \"%s\" value: \"%s\"\n",
-	var->name, var->svalue);
+			error("malformed variable \"%s\" value: \"%s\"",
+  var->name, var->svalue);
 			return false;
 		}
 		setDoubleValue(&var->value, dv);
@@ -1425,8 +1427,7 @@ lookupCreateVariable(CState *st, const char *context, char *name)
 		 */
 		if (!valid_variable_name(name))
 		{
-			fprintf(stderr, "%s: invalid variable name: \"%s\"\n",
-	context, name);
+			error("%s: invalid variable name: \"%s\"", context, name);
 			return NULL;
 		}
 
@@ -1635,7 +1636,7 @@ coerceToBool(PgBenchValue *pval, bool *bval)
 	}
 	else		/* NULL, INT or DOUBLE */
 	{
-		fprintf(stderr, "cannot coerce %s to boolean\n", valueTypeName(pval));
+		error("cannot coerce %s to boolean", valueTypeName(pval));
 		*bval = false;			/* suppress uninitialized-variable warnings */
 		return false;
 	}
@@ -1680,7 +1681,7 @@ coerceToInt(PgBenchValue *pval, int64 *ival)
 
 		if (isnan(dval) || !FLOAT8_FITS_IN_INT64(dval))
 		{
-			fprintf(stderr, "double to int overflow for %f\n", dval);
+			error("double to int overflow for %f", dval);
 			return false;
 		}
 		*ival = (int64) dval;
@@ -1688,7 +1689,7 @@ coerceToInt(PgBenchValue *pval, int64 *ival)
 	}
 	else		/* BOOLEAN or NULL */
 	{
-		fprintf(stderr, "cannot coerce %s to int\n", valueTypeName(pval));
+		error("cannot coerce %s to int", valueTypeName(pval));
 		return false;
 	}
 }
@@ -1709,7 +1710,7 @@ coerceToDouble(PgBenchValue *pval, double *dval)
 	}
 	

Re: unsupportable composite type partition keys

2019-12-24 Thread Amit Langote
On Mon, Dec 23, 2019 at 6:42 PM Amit Langote  wrote:
> On Sun, Dec 22, 2019 at 6:13 AM Tom Lane  wrote:
> > > I wonder whether we couldn't also lift
> > > the restriction against whole-row Vars in partition expressions.
> > > Doesn't seem like there is much difference between such a Var and
> > > a row(...)::table_rowtype expression.
> >
> > I didn't look into that either.  I wouldn't propose back-patching that,
> > but it'd be interesting to try to fix it in HEAD.
>
> Agreed.

I gave that a try and ended up with the attached, which applies on top of
your delay-loading-relcache-partition-data-2.patch.

Thanks,
Amit
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 7657608dd7..a5963c367d 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -190,17 +190,13 @@ index_get_partition(Relation partition, Oid indexId)
  * in the hierarchy, and they must also have the same types, the attnos may
  * be different.
  *
- * If found_whole_row is not NULL, *found_whole_row returns whether a
- * whole-row variable was found in the input expression.
- *
  * Note: this will work on any node tree, so really the argument and result
  * should be declared "Node *".  But a substantial majority of the callers
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
 map_partition_varattnos(List *expr, int fromrel_varno,
-						Relation to_rel, Relation from_rel,
-						bool *found_whole_row)
+						Relation to_rel, Relation from_rel)
 {
 	bool		my_found_whole_row = false;
 
@@ -217,9 +213,6 @@ map_partition_varattnos(List *expr, int fromrel_varno,
 											&my_found_whole_row);
 	}
 
-	if (found_whole_row)
-		*found_whole_row = my_found_whole_row;
-
 	return expr;
 }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 53a8f1610a..1b410110f9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15172,11 +15172,6 @@ ComputePartitionAttrs(ParseState *pstate, Relation rel, List *partParams, AttrNu
 				 * system column references.
 				 */
 				pull_varattnos(expr, 1, &expr_attrs);
-				if (bms_is_member(0 - FirstLowInvalidHeapAttributeNumber,
-								  expr_attrs))
-					ereport(ERROR,
-							(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
-							 errmsg("partition key expressions cannot contain whole-row references")));
 				for (i = FirstLowInvalidHeapAttributeNumber; i < 0; i++)
 				{
 					if (bms_is_member(i - FirstLowInvalidHeapAttributeNumber,
@@ -15196,7 +15191,8 @@ ComputePartitionAttrs(ParseState *pstate, Relation rel, List *partParams, AttrNu
 			{
 				AttrNumber	attno = i + FirstLowInvalidHeapAttributeNumber;
 
-				if (TupleDescAttr(RelationGetDescr(rel), attno - 1)->attgenerated)
+				if (AttributeNumberIsValid(attno) &&
+					TupleDescAttr(RelationGetDescr(rel), attno - 1)->attgenerated)
 					ereport(ERROR,
 							(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 							 errmsg("cannot use generated column in partition key"),
@@ -15451,7 +15447,6 @@ QueuePartitionConstraintValidation(List **wqueue, Relation scanrel,
 	for (i = 0; i < partdesc->nparts; i++)
 	{
 		Relation	part_rel;
-		bool		found_whole_row;
 		List	   *thisPartConstraint;
 
 		/*
@@ -15465,10 +15460,7 @@ QueuePartitionConstraintValidation(List **wqueue, Relation scanrel,
 		 */
 		thisPartConstraint =
 			map_partition_varattnos(partConstraint, 1,
-									part_rel, scanrel, &found_whole_row);
-		/* There can never be a whole-row reference here */
-		if (found_whole_row)
-			elog(ERROR, "unexpected whole-row reference found in partition constraint");
+ 

Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-24 Thread Amit Khandekar
On Tue, 24 Dec 2019 at 10:41, Amit Kapila  wrote:
>
> On Fri, Dec 20, 2019 at 9:31 AM Amit Khandekar  wrote:
> > Attached are the patches from master back up to 94 branch.
> >
> > PG 9.4 and 9.5 have a common patch to be applied :
> > pg94_95_use_vfd_for_logrep.patch
> > From PG 9.6 onwards, each version has a separate patch.
> >
> > For PG 9.6, there is no logical decoding perl test file. So I have
> > made a new file 006_logical_decoding_spill.pl that has only the
> > specific testcase. Also, for building the test_decoding.so, I had to
> > add the EXTRA_INSTALL=contrib/test_decoding line in the
> > src/test/recovery/Makefile, because this is the first time we are
> > using the plugin in the 9.6 tap test.
> >
>
> I am not sure if we need to go that far for 9.6 branch.  If the other
> tests for logical decoding are not present, then I don't see why we
> need to create a new test file for this test only.  Also, I think this
> will make the patch the same for 9.4,9.5 and 9.6.

Ok. I tested pg94_95_use_vfd_for_logrep.patch for 9.6 branch, and it
works there. So please use this patch for all the three branches.

>
> > From PG 10 onwards, pgstat_report_*() calls around read() are removed
> > in the patch, because FileRead() itself reports the wait events.
> >
>
> Why there are different patches for 10 and 11?
For PG 10, OpenTransientFile() and PathNameOpenFile() each have an
extra parameter for specifying file creation modes such as S_IRUSR or
S_IWUSR. For 11, these functions don't accept the mode flags; rather, the
file is always opened with PG_FILE_MODE_OWNER. Because of these
differences in the calls, the PG 10 patch does not apply on 11.
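
For illustration, the difference boils down to this (sketch only;
backend code would not normally be version-conditional, and "path" is a
placeholder):

static int
open_spill_file(char *path)
{
#if PG_VERSION_NUM < 110000
	/* PG 10: the caller passes the file creation mode explicitly */
	return OpenTransientFile(path, O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
#else
	/* PG 11+: the mode is implied (PG_FILE_MODE_OWNER) */
	return OpenTransientFile(path, O_RDONLY | PG_BINARY);
#endif
}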


>
> We should try to minimize the difference between patches in different 
> branches.
Ok.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Should we rename amapi.h and amapi.c?

2019-12-24 Thread Michael Paquier
On Tue, Dec 24, 2019 at 09:32:23AM +0100, Julien Rouhaud wrote:
> Looks good to me.  There are still references to amapi.c in various
> .po files, but those should rather be taken care of with the next
> update-po cycle right?

Yes, these are updated as part of the translation updates.
--
Michael


signature.asc
Description: PGP signature


Re: Minimal logical decoding on standbys

2019-12-24 Thread Amit Khandekar
On Thu, 19 Dec 2019 at 01:02, Rahila Syed  wrote:
>
> Hi,
>
>> Hi, do you consistently get this failure on your machine ? I am not
>> able to get this failure, but I am going to analyze when/how this can
>> fail. Thanks
>>
> Yes, I am getting it each time I run make -C src/test/recovery/ check 
> PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
> Also, there aren't any errors in logs indicating the cause.

Thanks for the reproduction. Finally I could reproduce the behaviour.
It occurs once in 7-8 runs of the test on my machine. The issue is:
on the master, the catalog_xmin does not immediately get updated; that
happens only after the hot standby feedback reaches the master. And I
haven't used wait_for_xmins() for these failing cases. I should use
that. Working on the same ...

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Should we rename amapi.h and amapi.c?

2019-12-24 Thread Julien Rouhaud
On Tue, Dec 24, 2019 at 3:57 AM Michael Paquier  wrote:
>
> On Mon, Dec 23, 2019 at 12:28:36PM -0800, Ashwin Agrawal wrote:
> > I had raised the same earlier and [1] has response from Andres, which was
> > "We probably should rename it, but not in 12..."
> >
> > [1]
> > https://www.postgresql.org/message-id/20190508215135.4eljnhnle5xp3jwb%40alap3.anarazel.de
>
> Okay, glad to see that this has been mentioned.  So let's do some
> renaming for v13 then.  I have studied first if we had better remove
> amapi.c, then move amvalidate() to amvalidate.c and the handler lookup
> routine to indexam.c as it already exists, but keeping things ordered
> as they are makes sense to limit spreading too many dependencies with
> the syscache mainly, so instead the attached patch does the following
> changes:
> - amapi.h -> indexam.h
> - amapi.c -> indexamapi.c.  Here we have an equivalent in access/table/
> as tableamapi.c.
> - amvalidate.c -> indexamvalidate.c
> - amvalidate.h -> indexamvalidate.h
> - genam.c -> indexgenam.c
>
> Please note that we have also amcmds.c and amcmds.c in the code, but
> the former could be extended to have utilities for table AMs, and the
> latter applies to both, so they are better left untouched in my
> opinion.

Looks good to me.  There are still references to amapi.c in various
.po files, but those should rather be taken care of with the next
update-po cycle right?




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-12-24 Thread Amit Kapila
On Tue, Dec 24, 2019 at 11:17 AM Masahiko Sawada
 wrote:
>
> On Fri, 20 Dec 2019 at 22:30, Amit Kapila  wrote:
> >
> >
> > The main aim of this feature is to reduce apply lag.  Because if we
> > send all the changes together it can delay their apply because of
> > network delay, whereas if most of the changes are already sent, then
> > we will save the effort on sending the entire data at commit time.
> > This in itself gives us decent benefits.  Sure, we can further improve
> > it by having separate workers (dedicated to apply the changes) as you
> > are suggesting and in fact, there is a patch for that as well(see the
> > performance results and bgworker patch at [1]), but if try to shove in
> > all the things in one go, then it will be difficult to get this patch
> > committed (there are already enough things and the patch is quite big
> > that to get it right takes a lot of energy).  So, the plan is
> > something like that first we get the basic feature and then try to
> > improve by having dedicated workers or things like that.  Does this
> > make sense to you?
> >
>
> Thank you for explanation. The plan makes sense. But I think in the
> current design it's a problem that logical replication worker doesn't
> receive changes (and doesn't check interrupts) during applying
> committed changes even if we don't have a worker dedicated for
> applying. I think the worker should continue to receive changes and
> save them to temporary files even during applying changes.
>

Won't it defeat the purpose of this feature, which is to reduce the
apply lag?  Basically, it can so happen that while applying a commit, it
constantly gets changes of other transactions, which will delay the
apply of the current transaction.  Also, won't it create some further
work to identify the order of commits?  Say while applying commit-1, it
receives 5 other commits that are written to separate temporary
files.  How will we later identify which transaction's WAL we need to
apply first?  We might deduce it by LSNs, but I think that could be
tricky.  Another thing is that I think it could lead to some design
complications as well, because while applying a commit, you need some
sort of callback or something like that to receive and flush totally
unrelated changes.  It could lead to another kind of failure mode
wherein, while applying a commit, it tries to receive another
transaction's data and some failure happens while writing the data of
that transaction.  I am not sure if it is a good idea to try something
like that.

> Otherwise
> the buffer would be easily full and replication gets stuck.
>

Are you talking about the network buffer?  I think the best way, as
discussed, is to launch new workers for streamed transactions, but we
can do that as an additional feature.  Anyway, as proposed, users can
choose the streaming mode for subscriptions, so there is an option to
turn this on selectively.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Columns correlation and adaptive query optimization

2019-12-24 Thread Konstantin Knizhnik
New version of the patch, implicitly adding multicolumn statistics in
the auto_explain extension and using them in the optimizer for more
precise estimation of join selectivity.
This patch fixes a race condition while adding statistics, and restricts
the generated statistics name to fit in 64 bytes (NameData).
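
For reference, enabling the new behavior could look like this (GUC
names as defined in the diff below; the threshold value is arbitrary):

LOAD 'auto_explain';
SET auto_explain.add_statistics_threshold = 10.0; -- actual/estimated rows ratio
SET auto_explain.suggest_only = on;  -- only log the suggested CREATE STATISTICS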


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index a9536c2..3f9d6ef 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
 #include 
+#include 
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -33,7 +53,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -218,6 +240,30 @@ _PG_init(void)
 			 NULL,
 			 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+			 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+			 "Zero disables implicit creation of multicolumn statistic.",
+			 &auto_explain_add_statistics_threshold,
+			 0.0,
+			 0.0,
+			 INT_MAX,
+			 PGC_SUSET,
+			 0,
+			 NULL,
+			 NULL,
+			 NULL);
+
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+			 "Do not create statistic but just record in WAL suggested create statistics statement.",
+			 NULL,
+			 &auto_explain_suggest_only,
+			 false,
+			 PGC_SUSET,
+			 0,
+			 NULL,
+			 NULL,
+			 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -353,6 +399,256 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+	ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index varno = 0;
+		Bitmapset* colmap = NULL;
+
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+Var *var = (Var *) node;
+if (cols == NULL || var->varnoold == varno)
+{
+	varno = var->varnoold;
+	if (var->varoattno > 0 &&
+		!bms_is_member(var->varoattno, colmap) &&
+		varno >= 1 &&
+		varno <= list_length(es->rtable) &&
+		list_length(cols) < STATS_MAX_DIMENSIONS)
+	{
+		RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+		if (rte->rtekind == RTE_RELATION)
+		{
+			ColumnRef  *col = makeNode(ColumnRef);
+			char *colname = get_rte_attribute_name(rte, var->varoattno);
+			col->fields = list_make1(makeString(colname));
+			cols = lappend(cols, col);
+			colmap = bms_add_member(colmap, var->varoattno);
+		}
+	}
+}
+else
+{
+	continue;
+}
+			}
+			vars = 

Re: Implementing Incremental View Maintenance

2019-12-24 Thread Tatsuo Ishii
> Unfortunately, it's not clear to me which of ON STATEMENT or ON COMMIT the 
> user should choose.  The benefit of ON STATEMENT is that the user does not 
> have to create and maintain the materialized view log.  But I'm not sure if 
> and when the benefit defeats the performance overhead on DML statements.  
> It's not disclosed whether ON STATEMENT uses triggers.

AFAIK the benefit of ON STATEMENT is that the transaction can see the
result of updates to the base tables. With ON COMMIT, the transaction
does not see the result until the transaction commits.

> Could you give your opinion on the following to better understand the 
> proposed feature and/or Oracle's ON STATEMENT refresh mode?
> 
> * What use case does the feature fit?
> If the trigger makes it difficult to use in the data ware house, does the 
> feature target OLTP?

Well, I can see use cases of IVM in both DWH and OLTP.

For example, a user creates DWH-like data using a materialized
view. After the initial data is loaded, the data is seldom updated.
However, one day the user wants to change just one row to see how it
affects the whole DWH data. IVM will help here because it can be
done in a shorter time than reloading the whole data.

Another use case is a ticket selling system. The system shows how many
tickets remain in real time. For this purpose it needs to
count the number of tickets already sold from a log table. By using
IVM, this can be accomplished in a simple and effective way.
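
As a sketch, using the CREATE INCREMENTAL MATERIALIZED VIEW syntax
proposed in this thread (the table and columns are made up):

CREATE TABLE ticket_sales_log (event_id int, sold_at timestamptz);

CREATE INCREMENTAL MATERIALIZED VIEW tickets_sold AS
  SELECT event_id, count(*) AS sold
    FROM ticket_sales_log
   GROUP BY event_id;

Each INSERT into ticket_sales_log is reflected in tickets_sold
immediately, so the number of remaining tickets can be derived in real
time.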

> What kind of data and query would benefit most from the feature (e.g. join of 
> a large sales table and a small product table, where the data volume and 
> frequency of data loading is ...)?
> In other words, this is about what kind of example we can recommend as a 
> typical use case of this feature.

Here are some use cases suitable for IVM that I can think of:

- Users are creating home-made triggers to get data from tables. Since
  IVM could eliminate some of those triggers, we could expect lower
  maintenance cost and fewer bugs accidentally brought in when the
  triggers were created.

- Any use case in which refreshing the whole result table
  (materialized view) is so expensive that it justifies the extra cost
  imposed on updates of the base tables. See the example use cases above.

> * Do you think the benefit of ON STATEMENT (i.e. do not have to use 
> materialized view log) outweighs the drawback of ON  STATEMENT (i.g. DML 
> overhead)?

Outweighs what?

> * Do you think it's important to refresh the materialized view after every 
> statement, or the per-statement refresh is not a requirement but simply the 
> result of implementation?

I think it's important to refresh the materialized view after every
statement, and the benefit for users is apparent because it brings
real-time data refresh to users.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp