Re: Is it useful to record whether plans are generic or custom?

2020-07-22 Thread torikoshia

On 2020-07-20 13:57, torikoshia wrote:


As I proposed earlier in this thread, I'm now trying to add information
about generic/custom plan to pg_stat_statements.
I'll share the idea and the poc patch soon.


Attached is a PoC patch.

The main purpose is to decide (1) the user interface and (2) the
way to get the plan type in pg_stat_statements.

(1) the user interface
I added a new boolean column 'generic_plan' to both
pg_stat_statements view and the member of the hash key of
pg_stat_statements.

This is because, as Legrand pointed out, the feature seems
useful only if all the counters for a queryid are kept
separately for executions using a generic plan and those
using a custom one.
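
For illustration, with the PoC applied the same queryid can then show up
as two rows, one per plan type; e.g. (column names as in the patch, the
query itself is just an example):

SELECT queryid, generic_plan, calls, total_exec_time
  FROM pg_stat_statements
 ORDER BY queryid, generic_plan;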

I thought it might be preferable to add a GUC to enable
or disable this feature, but the fact that it changes the
hash key makes that harder.

(2) the way to get the plan type in pg_stat_statements
To know whether the plan is generic or not, I added a
member to CachedPlan and read it from the ActivePortal in
the ExecutorStart_hook.
I wanted to do this in the ExecutorEnd_hook instead, but the
ActivePortal is not available at ExecutorEnd, so I keep the
value in a global variable newly defined in pg_stat_statements.
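
To make that concrete, a minimal sketch of what the hook does (assuming
the new CachedPlan member is called is_generic; details may differ from
the attached patch):

static void
pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
{
	/*
	 * Remember whether the active portal is running a generic cached plan;
	 * the value is read back later when the statistics entry is updated.
	 */
	is_generic_plan = (ActivePortal != NULL &&
					   ActivePortal->cplan != NULL &&
					   ActivePortal->cplan->is_generic);

	if (prev_ExecutorStart)
		prev_ExecutorStart(queryDesc, eflags);
	else
		standard_ExecutorStart(queryDesc, eflags);
}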


Any thoughts?

This is a PoC patch and I'm going to do the following later:

- update pg_stat_statements version
- change the default value for the newly added parameter in
  pg_stat_statements_reset() from -1 to 0 (since the defaults for
  the other parameters are all 0)
- add regression tests and update docs



Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

From 793eafad8e988b6754c9d89e0ea14b64b07eef81 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi
Date: Wed, 22 Jul 2020 16:00:04 +0900
Subject: [PATCH] [poc] Previously the number of custom and generic plans was
 recorded only in pg_prepared_statements, meaning we could only track them
 for the current session. This patch records them in pg_stat_statements,
 which enables tracking them across all sessions of the PostgreSQL instance.

---
 .../pg_stat_statements--1.6--1.7.sql  |  3 +-
 .../pg_stat_statements--1.7--1.8.sql  |  1 +
 .../pg_stat_statements/pg_stat_statements.c   | 44 +++
 src/backend/utils/cache/plancache.c   |  2 +
 src/include/utils/plancache.h |  1 +
 5 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.6--1.7.sql b/contrib/pg_stat_statements/pg_stat_statements--1.6--1.7.sql
index 6fc3fed4c9..5ab0a26b77 100644
--- a/contrib/pg_stat_statements/pg_stat_statements--1.6--1.7.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.6--1.7.sql
@@ -12,7 +12,8 @@ DROP FUNCTION pg_stat_statements_reset();
 /* Now redefine */
 CREATE FUNCTION pg_stat_statements_reset(IN userid Oid DEFAULT 0,
 	IN dbid Oid DEFAULT 0,
-	IN queryid bigint DEFAULT 0
+	IN queryid bigint DEFAULT 0,
+	IN generic_plan int DEFAULT -1
 )
 RETURNS void
 AS 'MODULE_PATHNAME', 'pg_stat_statements_reset_1_7'
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql b/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql
index 0f63f08f7e..0d7c4e7343 100644
--- a/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql
@@ -15,6 +15,7 @@ DROP FUNCTION pg_stat_statements(boolean);
 CREATE FUNCTION pg_stat_statements(IN showtext boolean,
 OUT userid oid,
 OUT dbid oid,
+OUT generic_plan bool,
 OUT queryid bigint,
 OUT query text,
 OUT plans int8,
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 14cad19afb..5d74dc04cd 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -78,9 +78,11 @@
 #include "storage/ipc.h"
 #include "storage/spin.h"
 #include "tcop/utility.h"
+#include "tcop/pquery.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/plancache.h"
 
 PG_MODULE_MAGIC;
 
@@ -156,6 +158,7 @@ typedef struct pgssHashKey
 	Oid			userid;			/* user OID */
 	Oid			dbid;			/* database OID */
 	uint64		queryid;		/* query identifier */
+	bool			is_generic_plan;
 } pgssHashKey;
 
 /*
@@ -266,6 +269,9 @@ static int	exec_nested_level = 0;
 /* Current nesting depth of planner calls */
 static int	plan_nested_level = 0;
 
+/* Current plan type */
+static bool	is_generic_plan = false;
+
 /* Saved hook values in case of unload */
 static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
 static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;
@@ -367,7 +373,7 @@ static char *qtext_fetch(Size query_offset, int query_len,
 		 char *buffer, Size buffer_size);
 static bool need_gc_qtexts(void);
 static void gc_qtexts(void);
-static void entry_reset(Oid userid, Oid dbid, uint64 queryid);
+static void entry_reset(Oid userid, Oid dbid, uint64 queryid, int generic_plan);
 static void AppendJumble(pgssJumbleState *jstate,
 		 cons

Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)

2020-07-22 Thread Georgios



‐‐‐ Original Message ‐‐‐
On Tuesday, July 21, 2020 11:52 PM, Peter Geoghegan  wrote:

> On Mon, Jul 6, 2020 at 1:35 AM Georgios Kokolatos
> gkokola...@protonmail.com wrote:
>
> > As a general overview, the series of patches in the mail thread do match 
> > their description. The addition of the stricter, explicit use of 
> > instrumentation does improve the design as the distinction of the use cases 
> > requiring a pin or a lock is made more clear. The added commentary is 
> > descriptive and appears grammatically correct, at least to a non native 
> > speaker.
>
> I didn't see this review until now because it ended up in gmail's spam
> folder. :-(
>
> Thanks for taking a look at it!

No worries at all. It happens and it was beneficial for me to read the patch.

//Georgios
>
> --
>
> Peter Geoghegan






Re: Parallel Seq Scan vs kernel read ahead

2020-07-22 Thread David Rowley
On Wed, 22 Jul 2020 at 18:17, k.jami...@fujitsu.com
 wrote:
> Even though I read the documentation [1][2] on parallel query, I might not 
> have
> understood it clearly yet. So thank you very much for explaining more simply how 
> the
> relation size, GUCs, and reloption affect the query planner's behavior.
> So in this test case, I shouldn't force the workers to have the same values for 
> workers
> planned and workers launched, correct? I should just let the optimizer make its 
> own decision.

What you want to do is force the planner's hand with each test as to
how many parallel workers it uses by setting the reloption
parallel_workers to the number of workers you want to test.  Just make
sure the executor has enough workers to launch by setting
max_parallel_workers and max_worker_processes to something high enough
to conduct the tests without the executor failing to launch any
workers.
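
In other words, something along these lines for each test run (the values
and the t_heap table are only illustrative; max_worker_processes needs a
server restart):

# postgresql.conf
max_worker_processes = 32
max_parallel_workers = 32

-- per test, in psql
SET max_parallel_workers_per_gather = 32;
ALTER TABLE t_heap SET (parallel_workers = 8);  -- the worker count under test
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t_heap;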

> Maybe the relation size is also small, as you mentioned, which is why the query 
> optimizer decided
> to use only 6 workers in my previous test. So let me first see whether the results 
> vary
> with the above configuration while testing different values for 
> parallel_workers.

The parallel_workers reloption overrides the planner's choice of
how many parallel workers to use. Just make sure the executor has
enough to use. You'll be able to determine that from Workers
Planned matching Workers Launched.

David




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-07-22 Thread Andrey V. Lepikhov

On 7/16/20 2:14 PM, Amit Langote wrote:

Hi Andrey,

Thanks for this work.  I have been reading through your patch and
here's a what I understand it does and how:

The patch aims to fix the restriction that COPYing into a foreign
table can't use multi-insert buffer mechanism effectively.  That's
because copy.c currently uses the ExecForeignInsert() FDW API which
can be passed only 1 row at a time.  postgres_fdw's implementation
issues an `INSERT INTO remote_table VALUES (...)` statement to the
remote side for each row, which is pretty inefficient for bulk loads.
The patch introduces a new FDW API ExecForeignCopyIn() that can
receive multiple rows and copy.c now calls it every time it flushes
the multi-insert buffer so that all the flushed rows can be sent to
the remote side in one go.  postgres_fdw now issues a `COPY
remote_table FROM STDIN` to the remote server and
postgresExecForeignCopyIn() funnels the tuples flushed by the local
copy to the remote server, which is waiting for tuples on the COPY protocol.


Fine


Here are some comments on the patch.

* Why the "In" in these API names?

+   /* COPY a bulk of tuples into a foreign relation */
+   BeginForeignCopyIn_function BeginForeignCopyIn;
+   EndForeignCopyIn_function EndForeignCopyIn;
+   ExecForeignCopyIn_function ExecForeignCopyIn;


I used an analogy from copy.c.


* fdwhandler.sgml should be updated with the description of these new APIs.




* As far as I can tell, the following copy.h additions are for an FDW
to use copy.c to obtain an external representation (char string) to
send to the remote side of the individual rows that are passed to
ExecForeignCopyIn():

+typedef void (*copy_data_dest_cb) (void *outbuf, int len);
+extern CopyState BeginForeignCopyTo(Relation rel);
+extern char *NextForeignCopyRow(CopyState cstate, TupleTableSlot *slot);
+extern void EndForeignCopyTo(CopyState cstate);

So, an FDW's ExecForeignCopyIn() calls copy.c: NextForeignCopyRow()
which in turn calls copy.c: CopyOneRowTo() which fills
CopyState.fe_msgbuf.  The data_dest_cb() callback that runs after
fe_msgbuf contains the full row simply copies it into a palloc'd char
buffer whose pointer is returned back to ExecForeignCopyIn().  I
wonder why not let FDWs implement the callback and pass it to copy.c
through BeginForeignCopyTo()?  For example, you could implement a
pgfdw_copy_data_dest_cb() in postgres_fdw.c which gets a direct
pointer of fe_msgbuf to send it to the remote server.
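
Such a callback could be as small as the following sketch, where copy_conn
stands for the already-established connection to the remote server (the
name is made up here):

static void
pgfdw_copy_data_dest_cb(void *outbuf, int len)
{
	/* outbuf points at copy.c's fe_msgbuf contents for the current row */
	if (PQputCopyData(copy_conn, (char *) outbuf, len) <= 0)
		ereport(ERROR,
				(errmsg("could not send COPY data to remote server: %s",
						PQerrorMessage(copy_conn))));
}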

It is a good point! Thank you.


Do you think all FDWs would want to use copy.c like above?  If not,
maybe the above APIs are really postgres_fdw-specific?  Anyway, adding
comments above the definitions of these functions would be helpful.

Agreed


* I see that the remote copy is performed from scratch on every call
of postgresExecForeignCopyIn(), but wouldn't it be more efficient to
send the `COPY remote_table FROM STDIN` in
postgresBeginForeignCopyIn() and end it in postgresEndForeignCopyIn()
when there are no errors during the copy?
It is not possible. The FDW shares one connection among all foreign 
relations on a server. If two or more partitions are placed on the same 
foreign server, you will have problems with concurrent COPY commands. Maybe 
we can create a new connection for each partition?


I tried implementing these two changes -- pgfdw_copy_data_dest_cb()
and sending `COPY remote_table FROM STDIN` only once instead of on
every flush -- and I see significant speedup.  Please check the
attached patch that applies on top of yours.

I integrated the first change and rejected the second for the reason given above.

One problem I spotted when trying my patch but didn't spend much time
debugging is that local COPY cannot be interrupted by Ctrl+C anymore, but
that should be fixable by adjusting PG_TRY() blocks.

Thanks


* ResultRelInfo.UseBulkModifying should be ri_usesBulkModify for consistency.

+1

I will post a new version of the patch a little bit later.

--
regards,
Andrey Lepikhov
Postgres Professional




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-22 Thread Amit Kapila
On Wed, Jul 22, 2020 at 10:20 AM Dilip Kumar  wrote:
>
> On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila  wrote:
> >
> > On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar  wrote:
> > >
> > > There was one warning in release mode in the last version in 0004 so
> > > attaching a new version.
> > >
> >
> > Today, I was reviewing patch
> > v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
> > small problem with it.
> >
> > + /*
> > + * Execute the invalidations for xid-less transactions,
> > + * otherwise, accumulate them so that they can be processed at
> > + * the commit time.
> > + */
> > + if (!ctx->fast_forward)
> > + {
> > + if (TransactionIdIsValid(xid))
> > + {
> > + ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
> > +   invals->nmsgs, invals->msgs);
> > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
> > +   buf->origptr);
> > + }
> >
> > I think we need to set ReorderBufferXidSetCatalogChanges even when
> > ctx->fast-forward is true because we are dependent on that flag for
> > snapshot build (see SnapBuildCommitTxn).  We are already doing the
> > same way in DecodeCommit where even though we skip adding
> > invalidations for fast-forward cases but we do set the flag to
> > indicate that this txn has catalog changes.  Is there any reason to do
> > things differently here?
>
> I think it is wrong; we should call ReorderBufferXidSetCatalogChanges
> even in fast-forward mode.
>

Thanks for the change.  I have one more minor comment in the patch
0001-WAL-Log-invalidations-at-command-end-with-wal_le.

 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+ int nmsgs; /* number of shared inval msgs */
+ SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+} xl_xact_invalidations;

I see that we already have a structure xl_xact_invals in the code
which has the same members, so I think it is better to use that
instead of defining a new one.
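
For reference, the existing definition in access/xact.h already has
exactly this shape:

typedef struct xl_xact_invals
{
	int			nmsgs;			/* number of shared inval msgs */
	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
} xl_xact_invals;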

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Global temporary tables

2020-07-22 Thread wenjing zeng


> On 14 Jul 2020, at 22:28, Pavel Stehule wrote:
> 
> 
> 
> On Fri, 10 Jul 2020 at 11:04, wenjing zeng wrote:
> HI all
> 
> I started using my personal email to respond to community issue.
> 
> 
> 
>> On 7 Jul 2020, at 18:05, Pavel Stehule wrote:
>> 
>> Hi
>>  
>> GTT Merge the latest PGMaster and resolves conflicts.
>> 
>> 
>> 
>> I tested it and it looks fine. I think it is very usable in current form, 
>> but still there are some issues:
>> 
>> postgres=# create global temp table foo(a int);
>> CREATE TABLE
>> postgres=# insert into foo values(10);
>> INSERT 0 1
>> postgres=# alter table foo add column x int;
>> ALTER TABLE
>> postgres=# analyze foo;
>> WARNING:  reloid 16400 not support update attstat after add colunm
>> WARNING:  reloid 16400 not support update attstat after add colunm
>> ANALYZE
> This is a limitation that we can completely eliminate.
> 
>> 
>> Please, can you summarize what is done, what limits are there, what can be 
>> implemented hard, what can be implemented easily?
> Sure.
> 
> The current version of the GTT implementation supports all regular table 
> operations.
> 1 what is done
> 1.1 insert/update/delete on GTT.
> 1.2 The GTT supports all types of indexes, and the query statement supports 
> the use of GTT indexes to speed up the reading of data in the GTT.
> 1.3 GTT statistics keep a copy of the GTT local statistics, which are 
> provided to the optimizer to choose the best query plan.
> 1.4 analyze vacuum GTT.
> 1.5 truncate cluster GTT.
> 1.6 all DDL on GTT.
> 1.7 GTT table can use  GTT sequence  or Regular sequence.
> 1.8 Support for creating views on GTT.
> 1.9 Support for creating views on foreign key.
> 1.10 support global temp partition.
> 
> I feel like I cover all the necessary GTT requirements.
> 
> For CLUSTER on a GTT, I think it's complicated.
> I'm not sure the current implementation is quite reasonable. Maybe you can 
> help review it.
> 
> 
>> 
>> 
>> 
>> I found one open question - how can be implemented table locks - because 
>> data is physically separated, then we don't need table locks as protection 
>> against race conditions. 
> Yes, but GTT DML and DDL still require table locking.
> 1 DML requires table locks (RowExclusiveLock) to ensure that 
> definitions do not change at run time (DDL may modify or delete them).
> This part of the implementation does not actually change the code,
> because DML on a GTT does not block other sessions' DML.
> 
> 2 For truncate/analyze/vacuum/reindex/cluster, a GTT is now treated like DML: 
> these only modify local data and do not modify the GTT definition.
> So I lowered the table lock level held on the GTT; only RowExclusiveLock is needed.
> 
> 3 For DDL that needs to modify the GTT table definition (DROP GTT, ALTER 
> GTT), 
> an exclusive table lock is required (AccessExclusiveLock), 
> as is the case for regular tables.
> This part of the implementation also does not actually change the code.
> 
> Summary: What I have done is adjust the GTT lock levels for different types 
> of statements based on the above thinking.
> For example, for TRUNCATE on a GTT, I reduced the table lock level held to 
> RowExclusiveLock,
> so we can truncate data in the same GTT from different sessions at the 
> same time.
> 
> What do you think about table locks on GTT?
> 
> I am thinking about explicit LOCK statements. Some applications use explicit 
> locking from some reasons - typically as protection against race conditions. 
> 
> But on GTT race conditions are not possible. So my question is - does the 
> exclusive lock on a GTT prevent other sessions from inserting into their own 
> instances of the same GTT?
In my opinion, since a GTT always works on the session's private data, 
there is no need to do anything while holding the lock, so the LOCK statement 
should do nothing (the same is true for Oracle GTT).

What do you think?

> 
> At what level are table locks active? The shared part of the GTT, or the 
> session-instance part of the GTT?
I don't quite understand what you mean, could you explain it a little bit?



Wenjing



> 
> 
> 
> 
> 
> Wenjing
> 
> 
>> 
>> Now, table locks are implemented on a global level. So exclusive lock on GTT 
>> in one session block insertion on the second session. Is it expected 
>> behaviour? It is safe, but maybe it is too strict. 
>> 
>> We should define what table lock is meaning on GTT.
>> 
>> Regards
>> 
>> Pavel
>>  
>> Pavel
>> 
>> 
>>> With Regards,
>>> Prabhat Kumar Sahu
>>> EnterpriseDB: http://www.enterprisedb.com 
>> 
>> 
> 



Re: expose parallel leader in CSV and log_line_prefix

2020-07-22 Thread Julien Rouhaud
On Tue, Jul 21, 2020 at 6:33 AM Michael Paquier  wrote:
>
> On Mon, Jul 20, 2020 at 11:12:31PM -0500, Justin Pryzby wrote:
> > On Tue, Jul 21, 2020 at 12:51:45PM +0900, Michael Paquier wrote:
> > The documentation could talk about either:
> >
> > 1) "lock group leader" - low-level, raw view of the internal data structure
> > (with a secondary mention that "for a parallel process, this is its parallel
> > leader).
> > 2) "parallel leaders" high-level, user-facing, "cooked" view;
> >
> > Right now it doesn't matter, but it seems that if we document the high-level
> > "parallel leader", then we don't need to accomodate future uses (at least 
> > until
> > the future happens).
>
> Hmm.  Not sure.  This sounds like material for a separate and larger
> patch.
>
> >>
> >> -   Process ID of the parallel group leader if this process is or
> >> -   has been involved in parallel query, or null. This field is set
> >> -   when a process wants to cooperate with parallel workers, and
> >> -   remains set as long as the process exists. For a parallel group 
> >> leader,
> >> -   this field is set to its own process ID. For a parallel worker,
> >> -   this field is set to the process ID of the parallel group leader.
> >> +   Process ID of the parallel group leader if this process is involved
> >> +   in parallel query, or null.  For a parallel group leader, this 
> >> field
> >> +   is NULL.
> >>
> >
> > FWIW , I prefer something like my earlier phrase:
> >
> > | For a parallel worker, this is the Process ID of its leader process.  Null
> > | for processes which are not parallel workers.
>
> I preferred mine, and it seems to me that the first sentence of the
> previous patch covers already both things mentioned in your sentence.
> It also seems to me that it is an important thing to directly outline
> that this field remains NULL for group leaders.

I agree that Michael's version seems less error prone and makes
everything crystal clear, so +1 for it.




Re: Parallel copy

2020-07-22 Thread vignesh C
Thanks for reviewing and providing the comments Ashutosh.
Please find my thoughts below:

On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma 
wrote:
>
> Some review comments (mostly) from the leader side code changes:
>
> 1) Do we need a DSM key for the FORCE_QUOTE option? I think FORCE_QUOTE
option is only used with COPY TO and not COPY FROM so not sure why you have
added it.
>
> PARALLEL_COPY_KEY_FORCE_QUOTE_LIST
>

Fixed

> 2) Should we be allocating the parallel copy data structure only when it
is confirmed that the parallel copy is allowed?
>
> pcdata = (ParallelCopyData *) palloc0(sizeof(ParallelCopyData));
> cstate->pcdata = pcdata;
>
> Or, if you want it to be allocated before confirming if Parallel copy is
allowed or not, then I think it would be good to allocate it in
*cstate->copycontext* memory context so that when EndCopy is called towards
the end of the COPY FROM operation, the entire context itself gets deleted
thereby freeing the memory space allocated for pcdata. In fact it would be
good to ensure that all the local memory allocated inside the ctstate
structure gets allocated in the *cstate->copycontext* memory context.
>

Fixed

> 3) Should we allow Parallel Copy when the insert method is
CIM_MULTI_CONDITIONAL?
>
> +   /* Check if the insertion mode is single. */
> +   if (FindInsertMethod(cstate) == CIM_SINGLE)
> +   return false;
>
> I know we have added checks in CopyFrom() to ensure that if any trigger
(before row or instead of) is found on any of partition being loaded with
data, then COPY FROM operation would fail, but does it mean that we are
okay to perform parallel copy on partitioned table. Have we done some
performance testing with the partitioned table where the data in the input
file needs to be routed to the different partitions?
>

Partition data is handled as Amit described in one of the earlier mails
[1].  My colleague Bharath has run a performance test with a partitioned table;
he will be sharing the results.

> 4) There are lot of if-checks in IsParallelCopyAllowed function that are
checked in CopyFrom function as well which means in case of Parallel Copy
those checks will get executed multiple times (first by the leader and from
second time onwards by each worker process). Is that required?
>

It is called from BeginParallelCopy, so it will be called only once. This
change is ok.

> 5) Should the worker process be calling this function when the leader has
already called it once in ExecBeforeStmtTrigger()?
>
> /* Verify the named relation is a valid target for INSERT */
> CheckValidResultRel(resultRelInfo, CMD_INSERT);
>

Fixed.

> 6) I think it would be good to re-write the comments atop
ParallelCopyLeader(). From the present comments it appears as if you were
trying to put the information pointwise but somehow you ended up putting in
a paragraph. The comments also have some typos like *line beaks* which
possibly means line breaks. This is applicable for other comments as well
where you
>

Fixed.

> 7) Is the following checking equivalent to IsWorker()? If so, it would be
good to replace it with an IsWorker like macro to increase the readability.
>
> (IsParallelCopy() && !IsLeader())
>

Fixed.

These have been fixed and the new patch is attached as part of my previous
mail.
[1] -
https://www.postgresql.org/message-id/CAA4eK1LQPxULxw8JpucX0PwzQQRk%3Dq4jG32cU1us2%2B-mtzZUQg%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: [PATCH] fix GIN index search sometimes losing results

2020-07-22 Thread Tom Lane
Pavel Borisov  writes:
> For 0002-remove-calc-not-flag.patch
> The patch changes the behavior which is now considered default. This is true 
> in RUM module and maybe in some other tsearch side modules. Applying the 
> patch can make code more beautiful but possibly will not give some 
> performance gain and bug is anyway fixed by patch 0001.

I'd be willing to compromise on just adding TS_EXEC_CALC_NOT to the
calls that are missing it today.  But I don't see why that's really
a great idea --- it still leaves a risk-of-omission hazard for future
callers.  Calculating NOTs correctly really ought to be the default
behavior.

What do you think of replacing TS_EXEC_CALC_NOT with a different
flag having the opposite sense, maybe called TS_EXEC_SKIP_NOT?
If anyone really does need that behavior, they could still get it,
but they'd have to be explicit.
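
To illustrate with the GIN consistent-function caller, the change would be
roughly the following (TS_EXEC_SKIP_NOT is only the proposed name, not an
existing flag):

/* today: NOT is handled correctly only if the caller remembers the flag */
res = TS_execute(GETQUERY(query), &gcv,
				 TS_EXEC_CALC_NOT | TS_EXEC_PHRASE_NO_POS,
				 checkcondition_gin);

/* proposed: correct NOT handling by default ... */
res = TS_execute(GETQUERY(query), &gcv,
				 TS_EXEC_PHRASE_NO_POS,
				 checkcondition_gin);

/* ... and only a caller that really wants the old shortcut opts out */
res = TS_execute(GETQUERY(query), &gcv,
				 TS_EXEC_SKIP_NOT | TS_EXEC_PHRASE_NO_POS,
				 checkcondition_gin);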

> Overall I'd recommend patch 0001-make-callbacks-ternary.patch and close the 
> issue.

The other issue we have to agree on is whether we want to sneak this
fix into v13, or wait another year for it.  I feel like it's pretty
late to be making potentially API-breaking changes, but on the other
hand this is undoubtedly a bug fix.

regards, tom lane




RE: Parallel Seq Scan vs kernel read ahead

2020-07-22 Thread k.jami...@fujitsu.com
On Tuesday, July 21, 2020 7:33 PM, Amit Kapila wrote:
> On Tue, Jul 21, 2020 at 3:08 PM k.jami...@fujitsu.com 
> wrote:
> >
> > On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote:
> > > On Tue, Jul 21, 2020 at 8:06 AM k.jami...@fujitsu.com
> > > 
> > > wrote:
> > > >
> > > > I am definitely missing something. Perhaps I think I could not
> > > > understand why there's no I/O difference between the Master and
> > > > Patched (V2). Or has it been already improved even without this patch?
> > > >
> > >
> > > I don't think it is strange that you are not seeing much difference
> > > because as per the initial email by Thomas this patch is not
> > > supposed to give benefits on all systems.  I think we wanted to
> > > check that the patch should not regress performance in cases where
> > > it doesn't give benefits.  I think it might be okay to run with a
> > > higher number of workers than you have CPUs in the system as we
> > > wanted to check if such cases regress as shown by Soumyadeep above
> > > [1].  Can you once try with
> > > 8 and or 10 workers as well?
> > >
> >
> > You are right. Kindly excuse me on that part, which only means it may
> > or may not have any benefits on the filesystem I am using. But for
> > other fs, as we can see from David's benchmarks significant 
> > results/benefits.
> >
> > Following your advice on regression test case, I increased the number
> > of workers, but the query planner still capped the workers at 6, so
> > the results from 6 workers onwards are almost the same.
> >
> 
> I am slightly confused if the number of workers are capped at 6, then what 
> exactly
> the data at 32 worker count means?  If you want query planner to choose more
> number of workers, then I think either you need to increase the data or use 
> Alter
> Table  Set (parallel_workers = );

Oops, I'm sorry: the "workers" labelled in those tables actually means 
max_parallel_workers_per_gather
and not parallel_workers. I thought the _per_gather setting corresponds to or 
controls
the workers planned/launched values in the query planner, and those are the 
numbers that I used in the tables.

I used the default max_parallel_workers & max_worker_processes, which are 8 by 
default in postgresql.conf.
IOW, I ran all those tests with a maximum of 8 processes set. But the query 
planner capped both the
Workers Planned and Launched at 6 for some reason when increasing the value for
max_parallel_workers_per_gather. 

However, when I used ALTER TABLE SET (parallel_workers = N) based on your 
suggestion,
the query planner used that value only for "Workers Planned", but not 
for "Workers Launched". 
The behavior of the query planner is also different when I also set 
max_worker_processes
and max_parallel_workers to parallel_workers + 1.

For example (ran on Master),
1. Set same value as parallel_workers, but "Workers Launched" and "Workers 
Planned" do not match.
max_worker_processes = 8
max_parallel_workers = 8
ALTER TABLE t_heap SET (parallel_workers = 8);
ALTER TABLE
SET max_parallel_workers_per_gather = 8;
SET
test=# EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
QUERY PLAN
--
 Finalize Aggregate  (cost=619778.66..619778.67 rows=1 width=8) (actual 
time=16316.295..16316.295 rows=1 loops=1)
   Buffers: shared read=442478 dirtied=442478 written=44
   ->  Gather  (cost=619777.83..619778.64 rows=8 width=8) (actual 
time=16315.528..16316.668 rows=8 loops=1)
 Workers Planned: 8
 Workers Launched: 7
 Buffers: shared read=442478 dirtied=442478 written=44
 ->  Partial Aggregate  (cost=618777.83..618777.84 rows=1 width=8) 
(actual time=16305.092..16305.092 rows=1 loops=8)
   Buffers: shared read=442478 dirtied=442478 written=44
   ->  Parallel Seq Scan on t_heap  (cost=0.00..583517.86 
rows=14103986 width=0) (actual time=0.725..14290.117 rows=1250 loops=8)
 Buffers: shared read=442478 dirtied=442478 written=44
 Planning Time: 5.327 ms
   Buffers: shared hit=17 read=10
 Execution Time: 16316.915 ms
(13 rows)

2. Match the workers launched and workers planned values (parallel_workers + 1)
max_worker_processes = 9
max_parallel_workers = 9

ALTER TABLE t_heap SET (parallel_workers = 8);
ALTER TABLE;
SET max_parallel_workers_per_gather = 8;
SET

test=# EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
QUERY PLAN
--
 Finalize Aggregate  (cost=619778.66..619778.67 rows=1 width=8) (actual 
time=16783.944..16783.944 rows=1 loops=1)
   Buffers: shared read=442478 dirtied=442478 written=442190
   ->  Ga

RE: Parallel Seq Scan vs kernel read ahead

2020-07-22 Thread k.jami...@fujitsu.com
On Wednesday, July 22, 2020 2:21 PM (GMT+9), David Rowley wrote:

> On Wed, 22 Jul 2020 at 16:40, k.jami...@fujitsu.com 
> wrote:
> > I used the default max_parallel_workers & max_worker_processes which is 8 by
> default in postgresql.conf.
> > IOW, I ran all those tests with maximum of 8 processes set. But my
> > query planner capped both the Workers Planned and Launched at 6 for
> > some reason when increasing the value for max_parallel_workers_per_gather.
> 
> max_parallel_workers_per_gather just imposes a limit on the planner as to the
> maximum number of parallel workers it may choose for a given parallel portion 
> of
> a plan. The actual number of workers the planner will decide is best to use is
> based on the size of the relation. More pages = more workers. It sounds like 
> in
> this case the planner didn't think it was worth using more than 6 workers.
> 
> The parallel_workers reloption, when not set to -1 overwrites the planner's
> decision on how many workers to use. It'll just always try to use
> "parallel_workers".
>
> > However, when I used the ALTER TABLE SET (parallel_workers = N) based
> > from your suggestion, the query planner acquired that set value only for
> "Workers Planned", but not for "Workers Launched".
> > The behavior of query planner is also different when I also set the
> > value of max_worker_processes and max_parallel_workers to parallel_workers
> + 1.
> 
> When it comes to execution, the executor is limited to how many parallel 
> worker
> processes are available to execute the plan. If all workers happen to be busy 
> with
> other tasks then it may find itself having to process the entire query in 
> itself
> without any help from workers.  Or there may be workers available, just not as
> many as the planner picked to execute the query.

Even though I read the documentation [1][2] on parallel query, I might not have
understood it clearly yet. So thank you very much for explaining more simply how 
the 
relation size, GUCs, and reloption affect the query planner's behavior.
So in this test case, I shouldn't force the workers to have the same values for 
workers
planned and workers launched, correct? I should just let the optimizer make its 
own decision.

> The number of available workers is configured with the
> "max_parallel_workers". That's set in postgresql.conf.   PostgreSQL
> won't complain if you try to set a relation's parallel_workers reloption to a 
> number
> higher than the "max_parallel_workers" GUC.
> "max_parallel_workers" is further limited by "max_worker_processes".
> Likely you'll want to set both those to at least 32 for this test, then just 
> adjust the
> relation's parallel_workers setting for each test.
> 
Thank you for the advice. For the same test case [3], I will use the following 
configuration:
shared_buffers = 32MB
max_parallel_workers = 32
max_worker_processes = 32

Maybe the relation size is also small, as you mentioned, which is why the query 
optimizer decided
to use only 6 workers in my previous test. So let me first see whether the results 
vary
with the above configuration while testing different values for 
parallel_workers.

Kind regards,
Kirk Jamison

[1] https://www.postgresql.org/docs/13/how-parallel-query-works.html
[2] https://www.postgresql.org/docs/current/runtime-config-resource.html
[3] 
https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB=xyj192ezcnwgfcca_wj5ghvm7sv8oe...@mail.gmail.com



Re: expose parallel leader in CSV and log_line_prefix

2020-07-22 Thread Alvaro Herrera
On 2020-Jul-21, Michael Paquier wrote:

> On Mon, Jul 20, 2020 at 11:12:31PM -0500, Justin Pryzby wrote:

> >> +   Process ID of the parallel group leader if this process is involved
> >> +   in parallel query, or null.  For a parallel group leader, this 
> >> field
> >> +   is NULL.
> >>

> > | For a parallel worker, this is the Process ID of its leader process.  Null
> > | for processes which are not parallel workers.

How about we combine both.  "Process ID of the parallel group leader, if
this process is a parallel query worker.  NULL if this process is a
parallel group leader or does not participate in parallel query".

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel Seq Scan vs kernel read ahead

2020-07-22 Thread Soumyadeep Chakraborty
Hi David,

Apologies for the delay, I had missed these emails.

On Tue, Jul 14, 2020 at 8:52 PM David Rowley  wrote:

> It would be good to know if the
> regression is repeatable or if it was affected by some other process.


These are the latest results on the same setup as [1].
(TL;DR: the FreeBSD VM with Google Persistent Disk is too unstable -
there is too much variability in performance to conclude that there is a
regression)


With 1 rows:

master (606c384598):

max_parallel_workers_per_gather    Time (seconds)
  0                                20.09s
  1                                 9.77s
  2                                 9.92s
  6                                 9.55s


v2 patch (applied on 606c384598):

max_parallel_workers_per_gather    Time (seconds)
  0                                18.34s
  1                                 9.68s
  2                                 9.15s
  6                                 9.11s


The above results were averaged across 3 runs with little or no
deviation between runs. The absolute values are very different from the
results reported in [1].

So, I tried to repro the regression as I had reported in [1] with
15000 rows:

master (449e14a561)

max_parallel_workers_per_gather    Time (seconds)
  0                                 42s, 42s
  1                                395s, 393s
  2                                404s, 403s
  6                                403s, 403s

Thomas' patch (applied on 449e14a561):

max_parallel_workers_per_gather    Time (seconds)
  0                                 43s, 43s
  1                                203s, 42s
  2                                 42s, 42s
  6                                 44s, 43s


v2 patch (applied on 449e14a561):

max_parallel_workers_per_gather    Time (seconds)
  0                                274s, 403s
  1                                419s, 419s
  2                                448s, 448s
  6                                137s, 419s


As you can see, I got wildly different results with 15000 rows (even
between runs of the same experiment).
I don't think that the environment is stable enough to tell whether there is
any regression.

What I can say is that there are no processes apart from Postgres
running on the system. Also, the environment is pretty constrained -
just 1G of free hard drive space before the start of every run, when I
have 15000 rows, apart from the mere 32M of shared buffers and only
4G of RAM.

I don't know much about Google Persistent Disk to be very honest.
Basically, I just provisioned one when I provisioned a GCP VM for testing on
FreeBSD, as Thomas had mentioned that FreeBSD UFS is a bad case for
parallel seq scan.

> It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
> track_io_timing = on; for each value of max_parallel_workers.

As for running EXPLAIN ANALYZE, running that on this system incurs a
non-trivial amount of overhead. The overhead is simply staggering. This
is the result of pg_test_timing in the FreeBSD GCP VM:

$ /usr/local/pgsql/bin/pg_test_timing -d 50
Testing timing overhead for 50 seconds.
Per loop time including overhead: 4329.80 ns
Histogram of timing durations:
  < us   % of total      count
     1      0.00000          0
     2      0.00000          0
     4      3.08896     356710
     8     95.97096   11082616
    16      0.37748      43591
    32      0.55502      64093
    64      0.00638        737
   128      0.00118        136
   256      0.00002          2

As a point of comparison, on my local Ubuntu workstation:

$ /usr/local/pgsql/bin/pg_test_timing -d 50
Testing timing overhead for 50 seconds.
Per loop time including overhead: 22.65 ns
Histogram of timing durations:
  < us   % of total      count
     1     97.73691 2157634382
     2      2.26246   49945854
     4      0.00039       8711
     8      0.00016       3492
    16      0.00008       1689
    32      0.00000         63
    64      0.00000          1

This is why I opted to simply use \timing on.

Regards,

Soumyadeep (VMware)

[1] 
https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB=xyj192ezcnwgfcca_wj5ghvm7sv8oe...@mail.gmail.com




Re: Parallel Seq Scan vs kernel read ahead

2020-07-22 Thread Soumyadeep Chakraborty
On Tue, Jul 21, 2020 at 9:33 PM Thomas Munro  wrote:
>
> On Wed, Jul 22, 2020 at 3:57 PM Amit Kapila  wrote:
> > Yeah, that is true but every time before the test the same amount of
> > data should be present in shared buffers (or OS cache) if any which
> > will help in getting consistent results.  However, it is fine to
> > reboot the machine as well if that is a convenient way.
>
> We really should have an extension (pg_prewarm?) that knows how to
> kick stuff out of PostgreSQL's shared buffers and the page cache
> (POSIX_FADV_DONTNEED).
>
>
+1. Clearing the OS page cache on FreeBSD is non-trivial during testing.
You can't do this on FreeBSD:
sync; echo 3 > /proc/sys/vm/drop_caches

Also, it would be nice to evict only those pages from the OS page cache
that are Postgres pages instead of having to drop everything.
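
A rough sketch of the targeted approach, per relation segment file (the
path is only an example):

#include <fcntl.h>
#include <unistd.h>

/* Ask the kernel to drop its cached pages for one relation segment. */
int fd = open("base/16384/16385", O_RDONLY);	/* illustrative path */

if (fd >= 0)
{
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);	/* len = 0: whole file */
	close(fd);
}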

Regards,
Soumyadeep (VMware)




RE: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-07-22 Thread Floris Van Nee
Hi Andy,

A small thing I found:

+static List *
+get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
+                           List *const_exprs,
+                           List *const_expr_opfamilies,
+                           Bitmapset *used_varattrs,
+                           bool *useful,
+                           bool *multi_nullvals)
…
+ indexpr_item = list_head(unique_index->indexprs);
+ for(c = 0; c < unique_index->ncolumns; c++)
+ {

I believe the for loop must be over unique_index->nkeycolumns, rather than 
ncolumns: it shouldn't include the extra non-key columns. This can currently 
lead to invalid memory accesses a few lines later, when the code does an array 
access of unique_index->opfamily[c]; that array only has nkeycolumns entries.

-Floris


From: Andy Fan 
Sent: Sunday 19 July 2020 5:03 AM
To: Dmitry Dolgov <9erthali...@gmail.com>
Cc: David Rowley ; PostgreSQL Hackers 
; Tom Lane ; Ashutosh 
Bapat ; rushabh.lat...@gmail.com
Subject: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey [External]

Fixed a test case in v10.

--
Best Regards
Andy Fan


Re: Default setting for enable_hashagg_disk

2020-07-22 Thread Robert Haas
On Tue, Jul 14, 2020 at 6:49 PM Peter Geoghegan  wrote:
> Maybe I missed your point here. The problem is not so much that we'll
> get HashAggs that spill -- there is nothing intrinsically wrong with
> that. While it's true that the I/O pattern is not as sequential as a
> similar group agg + sort, that doesn't seem like the really important
> factor here. The really important factor is that in-memory HashAggs
> can be blazingly fast relative to *any* alternative strategy -- be it
> a HashAgg that spills, or a group aggregate + sort that doesn't spill,
> whatever. We're mostly concerned about keeping the one available fast
> strategy than we are about getting a new, generally slow strategy.

I don't know; it depends. Like, if the less-sequential I/O pattern
that is caused by a HashAgg is not really any slower than a
Sort+GroupAgg, then whatever. The planner might as well try a HashAgg
- because it will be fast if it stays in memory - and if it doesn't
work out, we've lost little by trying. But if a Sort+GroupAgg is
noticeably faster than a HashAgg that ends up spilling, then there is
a potential regression. I thought we had evidence that this was a real
problem, but if that's not the case, then I think we're fine as far as
v13 goes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] fix GIN index search sometimes losing results

2020-07-22 Thread Pavel Borisov
On Wed, 22 Jul 2020 at 19:10, Tom Lane wrote:

> Pavel Borisov  writes:
> > For 0002-remove-calc-not-flag.patch
> > The patch changes the behavior which is now considered default. This is
> true in RUM module and maybe in some other tsearch side modules. Applying
> the patch can make code more beautiful but possibly will not give some
> performance gain and bug is anyway fixed by patch 0001.
>
> I'd be willing to compromise on just adding TS_EXEC_CALC_NOT to the
> calls that are missing it today.  But I don't see why that's really
> a great idea --- it still leaves a risk-of-omission hazard for future
> callers.  Calculating NOTs correctly really ought to be the default
> behavior.
>
> What do you think of replacing TS_EXEC_CALC_NOT with a different
> flag having the opposite sense, maybe called TS_EXEC_SKIP_NOT?
> If anyone really does need that behavior, they could still get it,
> but they'd have to be explicit.
>
> > Overall I'd recommend patch 0001-make-callbacks-ternary.patch and close
> the issue.
>
> The other issue we have to agree on is whether we want to sneak this
> fix into v13, or wait another year for it.  I feel like it's pretty
> late to be making potentially API-breaking changes, but on the other
> hand this is undoubtedly a bug fix.
>
> regards, tom lane
>

I am convinced patch 0001 is necessary and enough to fix a bug, so I think
it's very much worth adding it to v13.

As for 0002 I see the beauty of this change but I also see the value of
leaving defaults as they were before.
The change of CALC_NOT behavior doesn't seem to be a source of big changes,
though. I'm just not convinced it is very much needed.
The best way, I think, is to leave 0002 until the next version and add
commentary in the code that the default behavior of not applying
TS_EXEC_CALC_NOT is inherited from the past, so basically any caller
should set this flag (see patch 0003-add-comments-on-calc-not).

Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com


0003-add-comments-on-calc-not.diff
Description: Binary data


Re: OpenSSL randomness seeding

2020-07-22 Thread Daniel Gustafsson
> On 22 Jul 2020, at 07:00, Noah Misch  wrote:
> 
> On Tue, Jul 21, 2020 at 02:13:32PM +0200, Daniel Gustafsson wrote:
>> The silver lining here is that while OpenSSL nooped RAND_cleanup, they also
>> changed what is mixed into seeding so we are still not sharing a sequence.  
>> To
>> fix this, changing the RAND_cleanup call to RAND_poll should be enough to
>> ensure re-seeding after forking across all supported OpenSSL versions.  Patch
>> 0001 implements this along with a comment referencing when it can be removed
>> (which most likely won't be for quite some time).
>> 
>> Another thing that stood out when reviewing this code is that we optimize for
>> RAND_poll failing in pg_strong_random, when we already have RAND_status
>> checking for a sufficiently seeded RNG for us.  ISTM that we can simplify the
>> code by letting RAND_status do the work as per 0002, and also (while 
>> unlikely)
>> survive any transient failures in RAND_poll by allowing all the retries we've
>> defined for the loop.
> 
>> Thoughts on these?
> 
> These look good.  I'll push them on Saturday or later.

Thanks for picking it up!

>  I wondered whether to
> do both RAND_cleanup() and RAND_poll(), to purge all traces of the old seed on
> versions supporting both.  Since that would strictly (albeit negligibly)
> increase seed predictability, I like your decision here.

That's a good question.  I believe that if one actually does use RAND_cleanup as
a re-seeding mechanism then that can break FIPS-enabled OpenSSL installations,
as RAND_cleanup resets the RNG method from the FIPS RNG to the built-in one.  I
would be inclined to follow the upstream recommendation of using RAND_poll
exclusively, but I'm far from an expert here.
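
For reference, the 0001 change boils down to something like this in the
post-fork path (a sketch, not the exact hunk):

#ifdef USE_OPENSSL
	/*
	 * In the child, right after fork(): mix fresh entropy into the pool
	 * rather than clearing it, so parent and child don't end up sharing
	 * a PRNG sequence.
	 */
	RAND_poll();
#endif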

> Do you happen to know how OpenSSL 1.1.1 changed its reaction to forks in order
> to make the RAND_poll() superfluous?  (No need to research it if you don't.)

I'm not entirely sure, but I do believe that 1.1.1 ported over the RNG from the
FIPS module, which re-seeds itself with fork() protection.  There was however a
bug, fixed in 1.1.1d or thereabouts as CVE-2019-1549, where the fork protection
wasn't activated by default, so there is that.  Since that bug was found,
tests have been introduced to catch any regression there, which is
comforting.

cheers ./daniel



Re: Infinities in type numeric

2020-07-22 Thread Dean Rasheed
On Tue, 21 Jul 2020 at 23:18, Tom Lane  wrote:
>
> Here's a v4 that syncs numeric in_range() with the new behavior of
> float in_range(), and addresses your other comments too.
>

LGTM.

Regards,
Dean




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-22 Thread Soumyadeep Chakraborty
Hello,

I think we should really term this feature, as it stands, as a means to
solely stop WAL writes from happening.

The feature doesn't truly make the system read-only (e.g. dirty buffer
flushes may still happen after the system has been put into a read-only
state), which does make it confusing to a degree.

Ideally, if we were to have a read-only system, we should be able to run
pg_checksums on it, or take file-system snapshots etc, without the need
to shut down the cluster. It would also enable an interesting use case:
we should also be able to do a live upgrade on any running cluster and
entertain read-only queries at the same time, given that all the
cluster's files will be immutable?

So if we are not going to address those cases, we should change the
syntax and remove the notion of read-only. It could be:

ALTER SYSTEM SET wal_writes TO off|on;
or
ALTER SYSTEM SET prohibit_wal TO off|on;

If we are going to try to make it truly read-only, and cater to the
other use cases, we have to:

Perform a checkpoint before declaring the system read-only (i.e. before
the command returns). This may be expensive of course, as Andres has
pointed out in this thread, but it is a price that has to be paid. If we
do this checkpoint, then we can avoid an additional shutdown checkpoint
and an end-of-recovery checkpoint (if we restart the primary after a
crash while in read-only mode). Also, we would have to prevent any
operation that touches control files, which I am not sure we do today in
the current patch.

Why not have the best of both worlds? Consider:

ALTER SYSTEM SET read_only to {off, on, wal};

-- on: wal writes off + no writes to disk
-- off: default
-- wal: only wal writes off

Of course, there can probably be better syntax for the above.

Regards,

Soumyadeep (VMware)




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-22 Thread SATYANARAYANA NARLAPURAM
+1 to this feature; I have been thinking about it for some time. There
are several use cases for marking a database read-only (no transaction log
generation). Some examples in a hosted service scenario are: 1/ when a
customer runs out of storage space; 2/ upgrading the server to a different
major version (the current server can be set to read-only, a new one can be
built, and then DNS switched over); 3/ if a user wants to force a database to
be read-only and not accept writes, perhaps for importing/exporting a database.

Thanks,
Satya

On Wed, Jul 22, 2020 at 3:04 PM Soumyadeep Chakraborty <
soumyadeep2...@gmail.com> wrote:

> Hello,
>
> I think we should really term this feature, as it stands, as a means to
> solely stop WAL writes from happening.
>
> The feature doesn't truly make the system read-only (e.g. dirty buffer
> flushes may succeed the system being put into a read-only state), which
> does make it confusing to a degree.
>
> Ideally, if we were to have a read-only system, we should be able to run
> pg_checksums on it, or take file-system snapshots etc, without the need
> to shut down the cluster. It would also enable an interesting use case:
> we should also be able to do a live upgrade on any running cluster and
> entertain read-only queries at the same time, given that all the
> cluster's files will be immutable?
>
> So if we are not going to address those cases, we should change the
> syntax and remove the notion of read-only. It could be:
>
> ALTER SYSTEM SET wal_writes TO off|on;
> or
> ALTER SYSTEM SET prohibit_wal TO off|on;
>
> If we are going to try to make it truly read-only, and cater to the
> other use cases, we have to:
>
> Perform a checkpoint before declaring the system read-only (i.e. before
> the command returns). This may be expensive of course, as Andres has
> pointed out in this thread, but it is a price that has to be paid. If we
> do this checkpoint, then we can avoid an additional shutdown checkpoint
> and an end-of-recovery checkpoint (if we restart the primary after a
> crash while in read-only mode). Also, we would have to prevent any
> operation that touches control files, which I am not sure we do today in
> the current patch.
>
> Why not have the best of both worlds? Consider:
>
> ALTER SYSTEM SET read_only to {off, on, wal};
>
> -- on: wal writes off + no writes to disk
> -- off: default
> -- wal: only wal writes off
>
> Of course, there can probably be better syntax for the above.
>
> Regards,
>
> Soumyadeep (VMware)
>
>
>


Why it is not possible to create custom AM which behaves similar to btree?

2020-07-22 Thread Konstantin Knizhnik

Hi hackers.

I tried to create LSM AM which can be used instead of nbtree.
I looked at contrib/btree_gin and contrib/isn and tried to do the following:

CREATE OPERATOR FAMILY lsm3_float_ops USING lsm3;

CREATE OPERATOR CLASS float4_ops DEFAULT
    FOR TYPE float4 USING lsm3 FAMILY lsm3_float_ops AS
    OPERATOR 1  <,
    OPERATOR 2  <=,
    OPERATOR 3  =,
    OPERATOR 4  >=,
    OPERATOR 5  >,
    FUNCTION 1  btfloat4cmp(float4,float4);

CREATE OPERATOR CLASS float8_ops DEFAULT
    FOR TYPE float8 USING lsm3 FAMILY lsm3_float_ops AS
    OPERATOR 1  <,
    OPERATOR 2  <=,
    OPERATOR 3  =,
    OPERATOR 4  >=,
    OPERATOR 5  >,
    FUNCTION 1  btfloat8cmp(float8,float8);


ALTER OPERATOR FAMILY lsm3_float_ops USING lsm3 ADD
    OPERATOR 1  < (float4,float8),
    OPERATOR 1  < (float8,float4),

    OPERATOR 2  <= (float4,float8),
    OPERATOR 2  <= (float8,float4),

    OPERATOR 3  = (float4,float8),
    OPERATOR 3  = (float8,float4),

    OPERATOR 4  >= (float4,float8),
    OPERATOR 4  >= (float8,float4),

    OPERATOR 5  > (float4,float8),
    OPERATOR 5  > (float8,float4),

    FUNCTION 1  btfloat48cmp(float4,float8),
    FUNCTION 1  btfloat84cmp(float8,float4);


But then I get error for btfloat48cmp and btfloat84cmp functions:

ERROR:  associated data types must be specified for index support function

If I replace lsm3 with btree in ALTER FAMILY, then there is no error.
I wonder if it is possible in Postgres to define a custom index AM which can 
handle comparisons of different types, i.e.


create table t(pk bigint);
create index on t using lsm3(pk);
select * from t where pk=1;

I failed to make Postgres use the index in this case. The index is used only if 
I rewrite the query in this way:

select * from t where pk=1::bigint;

Thanks in advance,
Konstantin









Re: Infinities in type numeric

2020-07-22 Thread Tom Lane
Dean Rasheed  writes:
> On Tue, 21 Jul 2020 at 23:18, Tom Lane  wrote:
>> Here's a v4 that syncs numeric in_range() with the new behavior of
>> float in_range(), and addresses your other comments too.

> LGTM.

Pushed.  Thanks again for reviewing!

regards, tom lane




Re: Why it is not possible to create custom AM which behaves similar to btree?

2020-07-22 Thread Tom Lane
Konstantin Knizhnik  writes:
> But then I get error for btfloat48cmp and btfloat84cmp functions:
> ERROR:  associated data types must be specified for index support function

You need to specify the amproclefttype and amprocrighttype types you
want the function to be registered under.  The core code knows that
for btree, those are the same as the actual parameter types of the
function; but there's no reason to make such an assumption for other AMs.
So you have to write it out; perhaps

 ...
 FUNCTION 1(float4,float8) btfloat48cmp(float4,float8),
 ...
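
Spelled out for both cross-type support functions, that part of the
ALTER OPERATOR FAMILY would read roughly:

ALTER OPERATOR FAMILY lsm3_float_ops USING lsm3 ADD
    -- ... operator entries as before ...
    FUNCTION 1 (float4, float8) btfloat48cmp(float4, float8),
    FUNCTION 1 (float8, float4) btfloat84cmp(float8, float4);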

regards, tom lane






Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-22 Thread Soumyadeep Chakraborty
Hi Amul,

On Tue, Jun 16, 2020 at 6:56 AM amul sul  wrote:
> The proposed feature is built atop of super barrier mechanism commit[1] to
> coordinate
> global state changes to all active backends.  Backends which executed
> ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
> process to change the requested WAL read/write state aka WAL prohibited and
> WAL
> permitted state respectively.  When the checkpointer process sees the WAL
> prohibit
> state change request, it emits a global barrier and waits until all
> backends that
> participate in the ProcSignal absorbs it.

Why should the checkpointer have the responsibility of setting the state
of the system to read-only? Maybe this should be the postmaster's
responsibility - the checkpointer should just handle requests to
checkpoint. I think the backend requesting the read-only transition
should signal the postmaster, which in turn, will take on the aforesaid
responsibilities. The postmaster, could also additionally request a
checkpoint, using RequestCheckpoint() (if we want to support the
read-onlyness discussed in [1]). checkpointer.c should not be touched by
this feature.

Following on, any condition variable used by the backend to wait for the
ALTER SYSTEM command to finish (the patch uses
CheckpointerShmem->readonly_cv) could be housed in ProcGlobal.
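
A very rough sketch of the backend side under that division of labour
(every name below that isn't an existing function -- the signal reason, the
condition variable, the wait event, the state check -- is hypothetical,
just to show the shape):

/* backend executing ALTER SYSTEM READ ONLY (names are hypothetical) */
SendPostmasterSignal(PMSIGNAL_WAL_PROHIBIT);	/* let the postmaster coordinate */

/* wait until the postmaster reports that the state change is complete */
ConditionVariablePrepareToSleep(&ProcGlobal->walProhibitCV);
while (!WalProhibitStateIsComplete())
	ConditionVariableSleep(&ProcGlobal->walProhibitCV,
						   WAIT_EVENT_WAL_PROHIBIT_STATE_CHANGE);
ConditionVariableCancelSleep();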

Regards,
Soumyadeep (VMware)

[1] 
https://www.postgresql.org/message-id/CAE-ML%2B-zdWODAyWNs_Eu-siPxp_3PGbPkiSg%3DtoLeW9iS_eioA%40mail.gmail.com




Re: expose parallel leader in CSV and log_line_prefix

2020-07-22 Thread Michael Paquier
On Wed, Jul 22, 2020 at 11:36:05AM -0400, Alvaro Herrera wrote:
> How about we combine both.  "Process ID of the parallel group leader, if
> this process is a parallel query worker.  NULL if this process is a
> parallel group leader or does not participate in parallel query".

Sounds fine to me.  Thanks.

Do others have any objections with this wording?
--
Michael


signature.asc
Description: PGP signature


Re: expose parallel leader in CSV and log_line_prefix

2020-07-22 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Jul 22, 2020 at 11:36:05AM -0400, Alvaro Herrera wrote:
>> How about we combine both.  "Process ID of the parallel group leader, if
>> this process is a parallel query worker.  NULL if this process is a
>> parallel group leader or does not participate in parallel query".

> Sounds fine to me.  Thanks.
> Do others have any objections with this wording?

Is "NULL" really le mot juste here?  If we're talking about text strings,
as the thread title implies (I've not read the patch), then I think you
should say "empty string", because the SQL concept of null doesn't apply.

regards, tom lane




Re: expose parallel leader in CSV and log_line_prefix

2020-07-22 Thread Michael Paquier
On Wed, Jul 22, 2020 at 08:59:04PM -0400, Tom Lane wrote:
> Is "NULL" really le mot juste here?  If we're talking about text strings,
> as the thread title implies (I've not read the patch), then I think you
> should say "empty string", because the SQL concept of null doesn't apply.

Sorry for the confusion.  This part of the thread applies to the open
item for v13 related to pg_stat_activity's leader_pid.  A different
thread should have been spawned for this specific topic, but things
are as they are..
--
Michael


signature.asc
Description: PGP signature


Re: Default setting for enable_hashagg_disk

2020-07-22 Thread Peter Geoghegan
On Tue, Jul 21, 2020 at 1:30 PM Bruce Momjian  wrote:
> On Tue, Jul 14, 2020 at 03:49:40PM -0700, Peter Geoghegan wrote:
> > Maybe I missed your point here. The problem is not so much that we'll
> > get HashAggs that spill -- there is nothing intrinsically wrong with
> > that. While it's true that the I/O pattern is not as sequential as a
> > similar group agg + sort, that doesn't seem like the really important
> > factor here. The really important factor is that in-memory HashAggs
> > can be blazingly fast relative to *any* alternative strategy -- be it
> > a HashAgg that spills, or a group aggregate + sort that doesn't spill,
> whatever. We're more concerned about keeping the one available fast
> strategy than we are about getting a new, generally slow strategy.
>
> Do we have any data that in-memory HashAggs are "blazingly fast relative
> to *any* alternative strategy?"  The data I have tested myself and what
> I saw from Tomas was that spilling sort or spilling hash are both 2.5x
> slower.  Are we sure the quoted statement is true?

I admit that I was unclear in the remarks you quote here. I placed too
much emphasis on the precise cross-over point at which a hash agg that
didn't spill in Postgres 12 spills now. That was important to Andres,
who was concerned about the added I/O, especially with things like
cloud providers [1] -- it's not desirable to go from no I/O to lots of
I/O when upgrading, regardless of how fast your disks for temp files
are. But that was not the point I was trying to make (though it's a
good point, and one that I agree with).

I'll take another shot at it. I'll use Andres' test case in [1].
Specifically this query (I went with this example because it was
convenient):

SELECT a, array_agg(b) FROM (SELECT generate_series(1, 1)) a(a),
(SELECT generate_series(1, 1)) b(b) GROUP BY a HAVING
array_length(array_agg(b), 1) = 0;

The planner generally prefers a hashagg here, though it's not a
particularly sympathetic case for hash agg. For one thing the input to
the sort is already sorted. For another, there isn't skew. But the
planner seems to have it right, at least when everything fits in
memory, because that takes ~17.6 seconds with a group agg + sort vs
~13.2 seconds with an in-memory hash agg. Importantly, hash agg's peak
memory usage is 1443558kB (once we get to the point that no spilling
is required), whereas for the sort we're using 7833229kB for the
quicksort. Don't forget that in-memory hash agg is using ~5.4x less
memory in this case on account of the way hash agg represents things.
It's faster, and much much more efficient once you take a holistic
view (i.e. something like work done per second per KB of memory).

Clearly the precise "query operation spills" cross-over point isn't
that relevant to query execution time (on my server with a fast nvme
SSD), because if I give the sort 95% - 99% of the memory it needs to
be an in-memory quicksort then it makes a noticeable difference, but
not a huge difference. I get one big run and one tiny run in the
tuplesort. The query itself takes ~23.4 seconds -- higher than 17.6
seconds, but not all that much higher considering we have to write and
read ~7GB of data. If I try to do approximately the same thing with
hash agg (give it very slightly less than optimal memory) I find that
the difference is smaller -- it takes ~14.1 seconds (up from ~13.2
seconds). It looks like my original remarks are totally wrong so far,
because it's as if the performance hit is entirely explainable as the
extra temp file I/O (right down to the fact that hash agg takes a
smaller hit because it has much less to write out to disk). But let's
keep going.

= Sort vs Hash =

We'll focus on how the group agg + sort case behaves as we take memory
away. What I notice is that it literally doesn't matter how much
memory I take away any more (now that the sort has started to spill).
I said that it was ~23.4 seconds when we have two runs, but if I keep
taking memory away so that we get 10 runs it takes 23.2 seconds. If
there are 36 runs it takes 22.8 seconds. And if there are 144 runs
(work_mem is 50MB, down from the "optimal" required for the sort to be
internal, ~7GB) then it takes 21.9 seconds. So it gets slightly
faster, not slower. We really don't need very much memory to do the
sort in one pass, and it pretty much doesn't matter how many runs we
need to merge provided it doesn't get into the thousands, which is
quite rare (when random I/O from multiple passes finally starts to
bite).

Now for hash agg -- this is where it gets interesting. If we give it
about half the memory it needs (work_mem 700MB) we still have 4
batches and it hardly changes -- it takes 19.8 seconds, which is
slower than the 4 batch case that took 14.1 seconds but not that
surprising. 300MB still gets 4 batches which now takes ~23.5 seconds.
200MB gets 2424 batches and takes ~27.7 seconds -- a big jump! With
100MB it takes ~31.1 seconds (3340 batches). 50MB it's ~32.8 seconds
(3

Re: Parallel copy

2020-07-22 Thread Bharath Rupireddy
On Wed, Jul 22, 2020 at 7:56 PM vignesh C  wrote:
>
> Thanks for reviewing and providing the comments Ashutosh.
> Please find my thoughts below:
>
> On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma 
wrote:
> >
> > Some review comments (mostly) from the leader side code changes:
> >
> > 3) Should we allow Parallel Copy when the insert method is
CIM_MULTI_CONDITIONAL?
> >
> > +   /* Check if the insertion mode is single. */
> > +   if (FindInsertMethod(cstate) == CIM_SINGLE)
> > +   return false;
> >
> > I know we have added checks in CopyFrom() to ensure that if any trigger
(before row or instead of) is found on any of partition being loaded with
data, then COPY FROM operation would fail, but does it mean that we are
okay to perform parallel copy on partitioned table. Have we done some
performance testing with the partitioned table where the data in the input
file needs to be routed to the different partitions?
> >
>
> Partition data is handled like what Amit had told in one of earlier mails
[1].  My colleague Bharath has run performance test with partition table,
he will be sharing the results.
>

I ran tests for partitioned use cases - the results are similar to those of
non-partitioned cases [1].

Test case 1 (exec time in sec): copy from csv file, 5.1GB, 10 million tuples,
4 range partitions, 3 indexes on integer columns, unique data.
Test case 2 (exec time in sec): copy from csv file, 5.1GB, 10 million tuples,
4 range partitions, unique data.

parallel workers    test case 1         test case 2
0                   205.403 (1X)        135 (1X)
2                   114.724 (1.79X)      59.388 (2.27X)
4                    99.017 (2.07X)      56.742 (2.34X)
8                    99.722 (2.06X)      66.323 (2.03X)
16                   98.147 (2.09X)      66.054 (2.04X)
20                   97.723 (2.1X)       66.389 (2.03X)
30                   97.048 (2.11X)      70.568 (1.91X)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


[PATCH] keep the message consistent in buffile.c

2020-07-22 Thread Lu, Chenyang
Hi hackers,

When I analyzed this commit:
https://github.com/postgres/postgres/commit/7897e3bb902c557412645b82120f4d95f7474906
I noticed that the message was not consistent with the previous one in
‘src/backend/storage/file/buffile.c’.
To keep the messages consistent, I made the attached patch.


See the attachment for the patch.


Best regards





buffile.patch
Description: buffile.patch


Re: [PATCH] keep the message consistent in buffile.c

2020-07-22 Thread Thomas Munro
On Thu, Jul 23, 2020 at 3:24 PM Lu, Chenyang  wrote:
> When I analyze this commit:
>
> https://github.com/postgres/postgres/commit/7897e3bb902c557412645b82120f4d95f7474906
>
> I noticed that the message was not consistent with the previous one in 
> ‘src/backend/storage/file/buffile.c’
>
> To keep the message consistent, I made the patch.

Thanks.  I will push this later today.




Re: Parallel copy

2020-07-22 Thread Amit Kapila
On Thu, Jul 23, 2020 at 8:51 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Wed, Jul 22, 2020 at 7:56 PM vignesh C  wrote:
> >
> > Thanks for reviewing and providing the comments Ashutosh.
> > Please find my thoughts below:
> >
> > On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma 
> wrote:
> > >
> > > Some review comments (mostly) from the leader side code changes:
> > >
> > > 3) Should we allow Parallel Copy when the insert method is
> CIM_MULTI_CONDITIONAL?
> > >
> > > +   /* Check if the insertion mode is single. */
> > > +   if (FindInsertMethod(cstate) == CIM_SINGLE)
> > > +   return false;
> > >
> > > I know we have added checks in CopyFrom() to ensure that if any
> trigger (before row or instead of) is found on any of partition being
> loaded with data, then COPY FROM operation would fail, but does it mean
> that we are okay to perform parallel copy on partitioned table. Have we
> done some performance testing with the partitioned table where the data in
> the input file needs to be routed to the different partitions?
> > >
> >
> > Partition data is handled like what Amit had told in one of earlier
> mails [1].  My colleague Bharath has run performance test with partition
> table, he will be sharing the results.
> >
>
> I ran tests for partitioned use cases - results are similar to that of non
> partitioned cases[1].
>

I could see gains of up to 10-11 times for non-partitioned cases [1]. Can we
use a similar test case here as well (with one of the indexes on a text
column, or having a gist index) to see its impact?

[1] -
https://www.postgresql.org/message-id/CALj2ACVR4WE98Per1H7ajosW8vafN16548O2UV8bG3p4D3XnPg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Implement UNLOGGED clause for COPY FROM

2020-07-22 Thread Amit Kapila
On Wed, Jul 22, 2020 at 11:11 AM osumi.takami...@fujitsu.com
 wrote:
>
> > If you are going to suggest users not to replicate such tables then why 
> > can't you
> > suggest them to create such tables as UNLOGGED in the first place?  Another
> > idea could be that you create an 'unlogged'
> > table, copy the data to it.  Then perform Alter Table .. SET Logged and 
> > attach it to
> > the main table.  I think for this you need the main table to be partitioned 
> > but I
> > think if that is possible then it might be better than all the hacking you 
> > are
> > proposing to do in the server for this special operation.
> Thank you for your comment.
>
> At the beginning, I should have mentioned this function was
> for data warehouse, where you need to load large amounts of data
> in the shortest amount of time.
> Sorry for my bad explanation.
>
> Based on the fact that a data warehouse cannot, in general, be separated
> from applications like B.I. tools,
> we cannot easily define unlogged tables at the beginning.
> Basically, such tools don't support defining unlogged tables, as far as I know.
>
> And if you want to do so, you need *modification or fix of the existing
> application*,
> which is implemented by a third party and is commercially available for data
> analytics.
> In other words, to make CREATE UNLOGGED TABLE available in that application,
> you must revise the product's source code of the application directly,
> which is an act that invalidates the warranty from the software company of
> the B.I. tool.
> In my opinion, it would be unrealistic for everyone to do so.
>

So, does this mean that the data warehouse application's need could be
satisfied if the table had been 'UNLOGGED'?  However, they still need
'COPY UNLOGGED ..' syntax because they don't have control over the table
definition.  I think if this is the case, then the application user should
find a way for this.  BTW, if the application is anyway going to execute
PG-native syntax like 'COPY UNLOGGED ..', then why can't they simply set
tables as UNLOGGED by using Alter Table?

IIUC, the case here is that the applications are not allowed to change the
standard definitions of tables owned/created by the B.I. tool, and instead
they use some 'Copy Unlogged' sort of syntax provided by other databases to
load data at a faster speed, and now they expect PG to have a similar
alternative.  If that is true, then it is possible that those databases don't
have a 'Create Unlogged Table ...' type of syntax, which PG has, and that is
why they have provided such an alternative for Copy-style commands.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




RE: Global snapshots

2020-07-22 Thread tsunakawa.ta...@fujitsu.com
Hello,

While thinking about the following issues Andrey raised with the current
approach, I'm getting puzzled and can't help asking a few things.  Please
forgive me if I'm missing some past discussions.

> 1. Dependency on clocks synchronization
> 2. Needs guarantees of monotonically increasing of the CSN in the case 
> of an instance restart/crash etc.
> 3. We need to delay increasing of OldestXmin because it can be needed 
> for a transaction snapshot at another node.

While Clock-SI seems to be considered the most promising approach to global
serializability here,

* Why does Clock-SI get so much attention?  How did Clock-SI become the only
choice?

* Clock-SI was devised in Microsoft Research.  Does Microsoft or some other 
organization use Clock-SI?


Has anyone examined the following Multiversion Commitment Ordering (MVCO)?
Although I haven't understood it yet, it insists that no concurrency control
information, including timestamps, needs to be exchanged among the cluster
nodes.  I'd appreciate it if someone could give an opinion.

Commitment Ordering Based Distributed Concurrency Control for Bridging Single 
and Multi Version Resources.
 Proceedings of the Third IEEE International Workshop on Research Issues on 
Data Engineering: Interoperability in Multidatabase Systems (RIDE-IMS), Vienna, 
Austria, pp. 189-198, April 1993. (also DEC-TR 853, July 1992)
https://ieeexplore.ieee.org/document/281924?arnumber=281924


The author of the above paper, Yoav Raz, seems to have been strongly
passionate, at least until 2011, about convincing people of the power of
Commitment Ordering (CO) for global serializability.  However, he complains
(sadly) that almost all researchers ignore his theory, as written on the
following site and on the Wikipedia page for Commitment Ordering.  Does anyone
know why CO is ignored?

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co


FWIW, some researchers including Michael Stonebraker evaluated the performance
of various distributed concurrency control methods in 2017.  Has anyone looked
at this?  (I don't mean there was some promising method that we might want to
adopt.)

An Evaluation of Distributed Concurrency Control
Rachael Harding, Dana Van Aken, Andrew Pavlo, and Michael Stonebraker. 2017.
Proc. VLDB Endow. 10, 5 (January 2017), 553-564. 
https://doi.org/10.14778/3055540.3055548


Regards
Takayuki Tsunakawa



Re: Parallel copy

2020-07-22 Thread Ashutosh Sharma
I think, when doing the performance testing for partitioned tables, it would
be good to also mention the distribution of data in the input file. One
possible data distribution could be that we have, let's say, 100 tuples in the
input file, and every consecutive tuple belongs to a different partition.

On Thu, Jul 23, 2020 at 8:51 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Wed, Jul 22, 2020 at 7:56 PM vignesh C  wrote:
> >
> > Thanks for reviewing and providing the comments Ashutosh.
> > Please find my thoughts below:
> >
> > On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma 
> wrote:
> > >
> > > Some review comments (mostly) from the leader side code changes:
> > >
> > > 3) Should we allow Parallel Copy when the insert method is
> CIM_MULTI_CONDITIONAL?
> > >
> > > +   /* Check if the insertion mode is single. */
> > > +   if (FindInsertMethod(cstate) == CIM_SINGLE)
> > > +   return false;
> > >
> > > I know we have added checks in CopyFrom() to ensure that if any
> trigger (before row or instead of) is found on any of partition being
> loaded with data, then COPY FROM operation would fail, but does it mean
> that we are okay to perform parallel copy on partitioned table. Have we
> done some performance testing with the partitioned table where the data in
> the input file needs to be routed to the different partitions?
> > >
> >
> > Partition data is handled like what Amit had told in one of earlier
> mails [1].  My colleague Bharath has run performance test with partition
> table, he will be sharing the results.
> >
>
> I ran tests for partitioned use cases - results are similar to that of non
> partitioned cases[1].
>
> Test case 1 (exec time in sec): copy from csv file, 5.1GB, 10 million
> tuples, 4 range partitions, 3 indexes on integer columns, unique data.
> Test case 2 (exec time in sec): copy from csv file, 5.1GB, 10 million
> tuples, 4 range partitions, unique data.
>
> parallel workers    test case 1         test case 2
> 0                   205.403 (1X)        135 (1X)
> 2                   114.724 (1.79X)      59.388 (2.27X)
> 4                    99.017 (2.07X)      56.742 (2.34X)
> 8                    99.722 (2.06X)      66.323 (2.03X)
> 16                   98.147 (2.09X)      66.054 (2.04X)
> 20                   97.723 (2.1X)       66.389 (2.03X)
> 30                   97.048 (2.11X)      70.568 (1.91X)
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-22 Thread Amit Kapila
On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar  wrote:
>
> You are right.  I have changed it.
>

Thanks, I have pushed the second patch in this series which is
0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest
patch.  I will continue working on remaining patches.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Global snapshots

2020-07-22 Thread Masahiko Sawada
On Mon, 13 Jul 2020 at 20:18, Amit Kapila  wrote:
>
> On Fri, Jul 10, 2020 at 8:46 AM Masahiko Sawada
>  wrote:
> >
> > On Wed, 8 Jul 2020 at 21:35, Amit Kapila  wrote:
> > >
> > >
> > > Cool. While studying, if you can try to think whether this approach is
> > > different from the global coordinator based approach then it would be
> > > great.  Here is my initial thought apart from other reasons the global
> > > coordinator based design can help us to do the global transaction
> > > management and snapshots.  It can allocate xids for each transaction
> > > and then collect the list of running xacts (or CSN) from each node and
> > > then prepare a global snapshot that can be used to perform any
> > > transaction. OTOH, in the design proposed in this patch, we don't need any
> > > coordinator to manage transactions and snapshots because each node's
> > > current CSN will be sufficient for snapshot and visibility as
> > > explained above.
> >
> > Yeah, my thought is the same as you. Since both approaches have strong
> > points and weak points I cannot mention which is a better approach,
> > but that 2PC patch would go well together with the design proposed in
> > this patch.
> >
>
> I also think with some modifications we might be able to integrate
> your 2PC patch with the patches proposed here.  However, if we decide
> not to pursue this approach then it is uncertain whether your proposed
> patch can be further enhanced for global visibility.

Yes. I think even if we decide not to pursue this approach, that's not a
reason to drop the 2PC patch. If so, we would need to consider the design of
the 2PC patch again so that it generically resolves the atomic commit
problem.

> Does it make
> sense to dig the design of this approach a bit further so that we can
> be somewhat more sure that pursuing your 2PC patch would be a good
> idea and we can, in fact, enhance it later for global visibility?

Agreed.

> AFAICS, Andrey has mentioned couple of problems with this approach
> [1], the details of which I am also not sure at this stage but if we
> can dig those it would be really great.
>
> > > Now, sure this assumes that there is no clock skew
> > > on different nodes or somehow we take care of the same (Note that in
> > > the proposed patch the CSN is a timestamp.).
> >
> > As far as I read Clock-SI paper, we take care of the clock skew by
> > putting some waits on the transaction start and reading tuples on the
> > remote node.
> >
>
> Oh, but I am not sure if this patch is able to solve that, and if so, how?

I'm not sure about the details, but as far as I read the patch, I guess the
transaction will sleep in GlobalSnapshotSync() when the received global CSN
is greater than the local global CSN.
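
To illustrate the mechanism, here is a minimal sketch of that wait, assuming
the CSN is a timestamp in the same units as the local clock. GlobalSnapshotSync
is the routine named above; GlobalCSN and GenerateLocalCSN are stand-in names
for this example, not necessarily what the patch uses.

/*
 * Illustrative sketch, not the patch's actual code: if the CSN received from
 * a remote node is ahead of this node's clock, wait until the local clock
 * catches up before using the snapshot, as Clock-SI prescribes.
 */
static void
GlobalSnapshotSync(GlobalCSN remote_csn)
{
    for (;;)
    {
        GlobalCSN   local_csn = GenerateLocalCSN();    /* assumed helper */

        if (local_csn >= remote_csn)
            break;

        /* Sleep a little and re-check; pg_usleep() takes microseconds. */
        pg_usleep(1000L);
    }
}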

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Resetting spilled txn statistics in pg_stat_replication

2020-07-22 Thread Masahiko Sawada
On Thu, 16 Jul 2020 at 20:01, Amit Kapila  wrote:
>
> On Thu, Jul 16, 2020 at 4:04 PM Masahiko Sawada
>  wrote:
> >
> > On Thu, 16 Jul 2020 at 18:16, Amit Kapila  wrote:
> > >
> > > On Thu, Jul 16, 2020 at 1:45 PM Masahiko Sawada
> > >  wrote:
> > > >
> > > > A possible solution would be to add an in-use flag to
> > > > PgStat_ReplSlotStats indicating whether the stats for slot is used or
> > > > not. When receiving a drop message for a slot, the stats collector
> > > > just marks the corresponding stats as unused. When receiving the stats
> > > > report for a new slot but there is no unused stats slot, ignore it.
> > > > What do you think?
> > > >
> > >
> > > As of now, you have a boolean flag msg.m_drop to distinguish the drop
> > > message but we don't have a similar way to distinguish the 'create'
> > > message.  What if have a way to distinguish 'create' message (we can
> > > probably keep some sort of flag to indicate the type of message
> > > (create, drop, update)) and then if the slot with the same name
> > > already exists, we ignore such a message.  Now, we also need a way to
> > > create the entry for a slot for a normal stats update message as well
> > > to accommodate for the lost 'create' message.  Does that make sense?
> >
> > I might be missing your point, but even if we have 'create' message,
> > the problem can happen if when slots are full the user drops slot
> > ‘slot_a’, creates slot ‘slot_b', and messages arrive in the reverse
> > order?
> >
>
> In that case, also, we should drop the 'create' message of 'slot_b' as
> we don't have space but later when an 'update' message arrives with
> stats for the 'slot_b', we will create the entry.

Agreed.

>  I am also thinking:
> what if we send only 'update' and 'drop' messages?  The message ordering
> problem can still happen, but we will lose one 'update' message in that
> case?

Yes, I think so too. We will lose one 'update' message at a maximum.

I've updated the patch so that the stats collector ignores the
'update' message if the slot stats array is already full.
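
For illustration, a rough sketch of what the collector side could look like
with that behaviour (the function, message fields, and array names here are
assumptions for the example, not necessarily what the patch uses):

/*
 * Illustrative sketch only: handle a per-slot stats message, ignoring an
 * 'update' for a slot that cannot be found or created because the stats
 * array is already full.
 */
static void
pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
{
    /* Find the slot's entry, creating one only for non-drop messages. */
    int     idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);

    if (idx < 0)
        return;                 /* array full or slot unknown: ignore */

    if (msg->m_drop)
        pgstat_reset_replslot_entry(idx);   /* mark the entry unused again */
    else
    {
        replSlotStats[idx].spill_txns  += msg->m_spill_txns;
        replSlotStats[idx].spill_count += msg->m_spill_count;
        replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
    }
}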

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 048ccc0988..1aebe82ef6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -314,6 +314,15 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
  
 
+ 
+  pg_stat_replication_slotspg_stat_replication_slots
+  One row per replication slot, showing statistics about
+   replication slot usage.
+   See 
+   pg_stat_replication_slots for details.
+  
+ 
+
  
   pg_stat_wal_receiverpg_stat_wal_receiver
   Only one row, showing statistics about the WAL receiver from
@@ -2511,7 +2520,86 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 
  
 
- 
+ 
+  pg_stat_replication_slots
+
+  
+   pg_stat_replication_slots
+  
+
+   
+The pg_stat_replication_slots view will contain
+one row per replication slot, showing statistics about replication
+slot usage.
+   
+
+   
+pg_stat_replication_slots View
+
+ 
+  
+   
+ Column Type
+
+
+ Description
+   
+  
+ 
+
+ 
+  
+   
+ name text
+
+
+ A unique, cluster-wide identifier for the replication slot
+   
+  
+
+  
+   
+ spill_txns bigint
+
+
+ Number of transactions spilled to disk after the memory used by
+ logical decoding exceeds logical_decoding_work_mem. The
+ counter gets incremented both for toplevel transactions and
+ subtransactions.
+   
+  
+
+  
+   
+ spill_count bigint
+
+
+ Number of times transactions were spilled to disk. Transactions
+ may get spilled repeatedly, and this counter gets incremented on every
+ such invocation.
+   
+  
+
+  
+   
+ spill_bytes bigint
+
+
+ Amount of decoded transaction data spilled to disk.
+   
+  
+ 
+
+   
+
+   
+Tracking of spilled transactions works only for logical replication.  In
+physical replication, the tracking mechanism will display 0 for spilled
+statistics.
+   
+  
+
+  
   pg_stat_wal_receiver
 
   
@@ -4710,6 +4798,26 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 can be granted EXECUTE to run the function.

   
+
+  
+
+
+  pg_stat_reset_replication_slot
+
+pg_stat_reset_replication_slot ( text )
+void
+   
+   
+ Resets statistics to zero for a single replication slot, or for all
+ replication slots in the cluster.  If the argument is NULL, all counters
+ show

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-07-22 Thread Andrey V. Lepikhov

On 7/16/20 2:14 PM, Amit Langote wrote:

Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Version 5 of the patch, with changes prompted by Amit's comments.

--
regards,
Andrey Lepikhov
Postgres Professional
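
(Before the patch itself, a quick orientation: the sketch below shows how an
FDW handler might expose the bulk-COPY callbacks the attached patch adds to
FdwRoutine. The three callback names come from the patch's commit message
below; the my* functions and everything else are illustrative assumptions,
not code from the patch.)

#include "postgres.h"

#include "fmgr.h"
#include "foreign/fdwapi.h"

PG_FUNCTION_INFO_V1(my_fdw_handler);

Datum
my_fdw_handler(PG_FUNCTION_ARGS)
{
    FdwRoutine *routine = makeNode(FdwRoutine);

    /* Existing row-at-a-time modification path. */
    routine->BeginForeignModify = myBeginForeignModify;
    routine->ExecForeignInsert  = myExecForeignInsert;
    routine->EndForeignModify   = myEndForeignModify;

    /*
     * New bulk path: start a remote COPY ... FROM STDIN, stream each tuple's
     * text representation through it, then send EOF.  The my* callbacks are
     * hypothetical and assumed to use the signatures the patch adds to
     * fdwapi.h.
     */
    routine->BeginForeignCopyIn = myBeginForeignCopyIn;
    routine->ExecForeignCopyIn  = myExecForeignCopyIn;
    routine->EndForeignCopyIn   = myEndForeignCopyIn;

    PG_RETURN_POINTER(routine);
}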
>From 24465d61d6f0ec6a45578d252bda1690ac045543 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Thu, 9 Jul 2020 11:16:56 +0500
Subject: [PATCH] Fast COPY FROM into the foreign or sharded table.

This feature enables bulk COPY into a foreign table when multi-insert is
possible and the foreign table has a non-zero number of columns.

The FDW API was extended with the following routines:
* BeginForeignCopyIn
* EndForeignCopyIn
* ExecForeignCopyIn

BeginForeignCopyIn and EndForeignCopyIn initialize and free
the CopyState used for bulk COPY.  The ExecForeignCopyIn routine sends a
'COPY ... FROM STDIN' command to the foreign server, iteratively sends
tuples via the CopyTo() machinery, and finally sends EOF on that connection.

The code that constructed the list of columns for a given foreign relation
in the deparseAnalyzeSql() routine is split out into deparseRelColumnList().
It is reused in deparseCopyFromSql().

Added TAP tests for specific corner cases of the COPY FROM STDIN operation.

By analogy with CopyFrom(), the CopyState structure was extended
with a data_dest_cb callback.  It is used to send the text representation
of a tuple to a custom destination.
The PgFdwModifyState structure is extended with a cstate field,
which is needed to avoid repeated initialization of the CopyState.  Also for
this reason the CopyTo() routine was split into the set of routines
CopyToStart()/CopyTo()/CopyToFinish().

Discussion: https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

Authors: Andrey Lepikhov, Ashutosh Bapat, Amit Langote
---
 contrib/postgres_fdw/deparse.c|  60 -
 .../postgres_fdw/expected/postgres_fdw.out|  33 ++-
 contrib/postgres_fdw/postgres_fdw.c   | 146 +++
 contrib/postgres_fdw/postgres_fdw.h   |   1 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |  28 ++
 doc/src/sgml/fdwhandler.sgml  |  74 ++
 src/backend/commands/copy.c   | 247 +++---
 src/backend/executor/execMain.c   |   1 +
 src/backend/executor/execPartition.c  |  34 ++-
 src/include/commands/copy.h   |  11 +
 src/include/foreign/fdwapi.h  |  15 ++
 src/include/nodes/execnodes.h |   8 +
 12 files changed, 547 insertions(+), 111 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index ad37a74221..a37981ff66 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -184,6 +184,8 @@ static void appendAggOrderBy(List *orderList, List *targetList,
 static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 	deparse_expr_cxt *context);
+static List *deparseRelColumnList(StringInfo buf, Relation rel,
+  bool enclose_in_parens);
 
 /*
  * Helper functions
@@ -1758,6 +1760,20 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte,
 		 withCheckOptionList, returningList, retrieved_attrs);
 }
 
+/*
+ * Deparse COPY FROM into given buf.
+ * We need to use list of parameters at each query.
+ */
+void
+deparseCopyFromSql(StringInfo buf, Relation rel)
+{
+	appendStringInfoString(buf, "COPY ");
+	deparseRelation(buf, rel);
+	(void) deparseRelColumnList(buf, rel, true);
+
+	appendStringInfoString(buf, " FROM STDIN ");
+}
+
 /*
  * deparse remote UPDATE statement
  *
@@ -2061,6 +2077,30 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel)
  */
 void
 deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
+{
+	appendStringInfoString(buf, "SELECT ");
+	*retrieved_attrs = deparseRelColumnList(buf, rel, false);
+
+	/* Don't generate bad syntax for zero-column relation. */
+	if (list_length(*retrieved_attrs) == 0)
+		appendStringInfoString(buf, "NULL");
+
+	/*
+	 * Construct FROM clause
+	 */
+	appendStringInfoString(buf, " FROM ");
+	deparseRelation(buf, rel);
+}
+
+/*
+ * Construct the list of columns of given foreign relation in the order they
+ * appear in the tuple descriptor of the relation. Ignore any dropped columns.
+ * Use column names on the foreign server instead of local names.
+ *
+ * Optionally enclose the list in parentheses.
+ */
+static List *
+deparseRelColumnList(StringInfo buf, Relation rel, bool enclose_in_parens)
 {
 	Oid			relid = RelationGetRelid(rel);
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2069,10 +2109,8 @@ deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
 	List	   *options;
 	ListCell   *lc;
 	bool		first = true;
+	List	   *retrieved_attrs = NIL;
 
-	*retrieved_attrs = NIL;
-
-	appendStringInfoString(buf, "SELECT ");
 	for (i = 0; i < tupdesc->natts; i++)
 	{
 		/* Ignore dropped columns. */
@@ -2081,6 +2119,9 @@ deparseAnalyzeSql(StringInfo 

Re: [Proposal] Global temporary tables

2020-07-22 Thread Pavel Stehule
> I am thinking about explicit LOCK statements. Some applications use
> explicit locking for various reasons - typically as protection against race
> conditions.
>
> But on a GTT, race conditions are not possible. So my question is - does an
> exclusive lock on a GTT prevent other sessions from inserting into their own
> instances of the same GTT?
>
> In my opinion, with a GTT we always work on the session's private data;
> there is no need to do anything by holding the lock, so the lock statement
> should do nothing (the same is true for Oracle GTT).
>
> What do you think?
>
>
> At what level are table locks active? The shared part of the GTT, or the
> session-instance part of the GTT?
>
> I don't quite understand what you mean, could you explain it a little bit?
>

It is about perspective, how we should see GTT tables. A GTT is a mix of two
concepts - session private (data) and session shared (catalog). Hypothetically
we can place locks on the private part (no effect) or on the shared part (the
usual effect as we know it). Both can make sense, and both have advantages and
disadvantages. I am a little afraid about the behaviour of stupid ORM systems -
but the most important part of a table is its data - and so I prefer the empty
lock implementation for GTT.

Regards

Pavel



>
>
> Wenjing
>
>> Wenjing
>>
>> Now, table locks are implemented at a global level, so an exclusive lock on
>> a GTT in one session blocks insertion in a second session. Is that the
>> expected behaviour? It is safe, but maybe it is too strict.
>>
>> We should define what a table lock means on a GTT.
>>
>> Regards
>>
>> Pavel
>>
>>
>>> Pavel
>>>
>>>
 With Regards,
 Prabhat Kumar Sahu
 EnterpriseDB: http://www.enterprisedb.com


