Re: [HACKERS] Parallel Index Scans

2016-12-22 Thread tushar

On 12/22/2016 01:35 PM, tushar wrote:

On 12/22/2016 09:49 AM, Amit Kapila wrote:

I think you can focus on the handling of array scan keys for testing.
In general, one of my colleagues has shown interest in testing this
patch and I think he has tested as well but never posted his findings.
I will request him to share his findings and what kind of tests he has
done, if any.
Sure, we (Prabhat and I) have done some testing for this feature
internally but never published the test scripts on this forum. PFA the
SQL scripts (along with the expected .out files) we used for testing,
for your ready reference.


In addition, we generated an LCOV (code coverage) report and compared
the files changed by the "Parallel index scan" patch. You can see the
"with patch" vs. "without patch" numbers in the attached .pdf file.


We also ran sqlsmith against PG v10 + PIS (parallel index scan)
patches and found a crash, but it occurs on plain PG v10 (without any
patches applied) as well:


postgres=# select
  70 as c0,
  pg_catalog.has_server_privilege(
    cast(ref_0.indexdef as text),
    cast(cast(coalesce((select name from pg_catalog.pg_settings
                        limit 1 offset 16),
                       null) as text) as text)) as c1,
  pg_catalog.pg_export_snapshot() as c2,
  ref_0.indexdef as c3,
  ref_0.indexname as c4
from
  pg_catalog.pg_indexes as ref_0
where (ref_0.tablespace = ref_0.tablespace)
  or (46 = 22)
limit 103;
TRAP: FailedAssertion("!(keylen < 64)", File: "hashfunc.c", Line: 139)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset:
2016-12-23 11:19:50.627 IST [2314] LOG:  server process (PID 2322) was terminated by signal 6: Aborted
2016-12-23 11:19:50.627 IST [2314] DETAIL:  Failed process was running:
select
  70 as c0,
  pg_catalog.has_server_privilege(
    cast(ref_0.indexdef as text),
    cast(cast(coalesce((select name from pg_catalog.pg_settings
                        limit 1 offset 16),
                       null) as text) as text)) as c1,
  pg_catalog.pg_export_snapshot() as c2,
  ref_0.indexdef as c3,
  ref_0.indexname as c4
from
  pg_catalog.pg_indexes as ref_0
where (ref_0.tablespace = ref_0.tablespace)
  or (46 = 22)
limit 103;
2016-12-23 11:19:50.627 IST [2314] LOG:  terminating any other active server processes
2016-12-23 11:19:50.627 IST [2319] WARNING:  terminating connection because of crash of another server process
2016-12-23 11:19:50.627 IST [2319] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2016-12-23 11:19:50.627 IST [2319] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2016-12-23 11:19:50.629 IST [2323] FATAL:  the database system is in recovery mode

Failed.
!> 2016-12-23 11:19:50.629 IST [2314] LOG:  all server processes terminated; reinitializing
2016-12-23 11:19:50.658 IST [2324] LOG:  database system was interrupted; last known up at 2016-12-23 11:19:47 IST
2016-12-23 11:19:50.810 IST [2324] LOG:  database system was not properly shut down; automatic recovery in progress
2016-12-23 11:19:50.812 IST [2324] LOG:  invalid record length at 0/155E408: wanted 24, got 0
2016-12-23 11:19:50.812 IST [2324] LOG:  redo is not required
2016-12-23 11:19:50.819 IST [2324] LOG:  MultiXact member wraparound protections are now enabled
2016-12-23 11:19:50.822 IST [2314] LOG:  database system is ready to accept connections
2016-12-23 11:19:50.822 IST [2328] LOG:  autovacuum launcher started

--
regards, tushar





Re: [HACKERS] multi-level partitions and partition-wise joins

2016-12-22 Thread Ashutosh Bapat
Another question: do we want to commit the code for creating
unflattened hierarchy separately or along with partition-wise join?

On Fri, Dec 23, 2016 at 9:58 AM, Ashutosh Bapat
 wrote:
> On Thu, Dec 22, 2016 at 10:52 PM, Robert Haas  wrote:
>> On Wed, Dec 21, 2016 at 11:31 PM, Ashutosh Bapat
>>  wrote:
>>> Given the scenario described above, it looks like we have to retain
>>> partition hierarchy in the form of inheritance hierarchy in order to
>>> implement partition-wise joins for multi-leveled partition tables. Is
> >> that the right thing to do? PFA a patch created by Amit Langote to
> >> translate partition hierarchy into inheritance hierarchy. Is this
> >> a step in the right direction?
>>
>> I am not sure whether Amit's patch is the right way to go.  I don't
>> fully understand it, and I remember complaining about some aspects of
>> it before, such as this unexplained and fairly random-looking
>> exception:
>>
>> +/*
>> + * Do not flatten the inheritance hierarchy if partitioned table, unless
>> + * this is the result relation.
>> + */
>>
>> However, I think the overall idea of doing flattening later in the
>> process for partitioned tables is probably correct.  It's not that we
>> shouldn't do flattening at all -- the final Plan shouldn't involve
>> nested Append nodes -- but maybe we should delay it. Perhaps the Path
>> tree retains the structure and the final Plan flattens it.
>
> While creating append paths we flatten any append paths added to the children.
>
>> We might
>> consider doing that way for both inheritance trees and partitioning,
>> just so we don't have two different code paths to validate.
>>
>
> AFAIU the reason why we chose to flatten the inheritance hierarchy is
> multiple inheritance. Since the same child can inherit from two
> parents, in an unflattened version its paths would be included twice.
> It would be clumsy to keep the inheritance unflattened but not include
> a relation more than once in the final plan tree.
>
> However, for partitioned tables, we are guaranteed that there's only a
> single parent and thus every child relation will be considered only
> once. We will need separate code to handle (possible) multiple
> inheritance and strictly single inheritance imposed by partitioning.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




[HACKERS] Parallel Append implementation

2016-12-22 Thread Amit Khandekar
Currently an Append plan node does not execute its subplans in
parallel; there is no distribution of workers across its subplans. The
second subplan starts running only after the first subplan finishes,
even though the individual subplans may themselves be running parallel
scans.

Secondly, we create a partial Append path for an appendrel only if all
of its member subpaths are partial paths. If one or more of the
subpaths is a non-partial path, only a non-parallel Append is created.
So whatever node sits on top of the Append cannot use a parallel plan;
for example, a select count(*) won't be split into partial aggregates
if the underlying Append is not partial.

The attached patch removes both of the above restrictions.  There has
already been a mail thread [1] discussing an approach suggested by
Robert Haas for implementing this feature; this patch uses that
approach.

Attached is pgbench_create_partition.sql (derived from the one
included in the above thread), which distributes pgbench_accounts
table data into 3 partitions pgbench_accounts_[1-3]. The queries below
use this schema.

Consider a query such as:
select count(*) from pgbench_accounts;

Now suppose two of the partitions do not allow parallel scans:
alter table pgbench_accounts_1 set (parallel_workers=0);
alter table pgbench_accounts_2 set (parallel_workers=0);

On HEAD, because some of the partitions have non-parallel scans, the
whole Append is non-parallel:

 Aggregate
   ->  Append
         ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts
         ->  Seq Scan on pgbench_accounts_1
         ->  Seq Scan on pgbench_accounts_2
         ->  Seq Scan on pgbench_accounts_3

Whereas with the patch, the plan looks like this:

 Finalize Aggregate
   ->  Gather
         Workers Planned: 6
         ->  Partial Aggregate
               ->  Parallel Append
                     ->  Parallel Seq Scan on pgbench_accounts
                     ->  Seq Scan on pgbench_accounts_1
                     ->  Seq Scan on pgbench_accounts_2
                     ->  Parallel Seq Scan on pgbench_accounts_3

Above, a Parallel Append is generated, and it executes all these
subplans in parallel, with one worker executing each of the sequential
scans and multiple workers executing each of the parallel subplans.


=== Implementation details ===

--- Adding parallel-awareness ---

In a given worker, this Append plan node will execute just like the
usual partial Append node: it will run a subplan until completion. The
subplan may or may not be a partial parallel-aware plan, such as a
parallel scan. After the subplan is done, Append will choose the next
subplan. This is where it differs from the current partial Append
plan: it is parallel-aware. The Append nodes in the workers will be
aware that there are other Append nodes running in parallel, and will
coordinate with them while choosing the next subplan.

--- Distribution of workers ---

The coordination info is stored in a shared array with one element per
subplan. Each element records the number of workers currently
executing that subplan and the maximum number of workers that should
execute it at the same time. For non-partial subplans, the maximum is
always 1. To choose the next subplan, the Append executor takes a
spinlock and sequentially iterates over the array to find a subplan
having the fewest workers currently executing it AND not already being
executed by its maximum number of workers. Once it finds one, it
increments that subplan's current worker count and releases the
spinlock, so that other workers can choose their next subplan if they
are waiting.

This way, workers would be fairly distributed across subplans.
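
To make that concrete, here is a minimal sketch of the coordination
structure and the selection logic. All names here (ParallelAppendDesc,
the pa_* fields, choose_next_subplan) are illustrative only, not taken
from the patch:

#include "postgres.h"
#include "storage/spin.h"

typedef struct ParallelAppendInfo
{
	int		pa_num_workers;	/* workers currently executing this subplan */
	int		pa_max_workers;	/* cap; always 1 for non-partial subplans */
} ParallelAppendInfo;

typedef struct ParallelAppendDesc
{
	slock_t		pa_mutex;	/* protects the array below */
	ParallelAppendInfo pa_info[FLEXIBLE_ARRAY_MEMBER];
} ParallelAppendDesc;

/*
 * Return the index of the subplan with the fewest active workers that
 * still has room under its cap, claiming one worker slot on it before
 * releasing the spinlock.  Returns -1 if every subplan is saturated.
 */
static int
choose_next_subplan(ParallelAppendDesc *padesc, int nplans)
{
	int		i;
	int		best = -1;

	SpinLockAcquire(&padesc->pa_mutex);
	for (i = 0; i < nplans; i++)
	{
		ParallelAppendInfo *info = &padesc->pa_info[i];

		if (info->pa_num_workers >= info->pa_max_workers)
			continue;		/* already at its worker cap */
		if (best < 0 ||
			info->pa_num_workers < padesc->pa_info[best].pa_num_workers)
			best = i;
	}
	if (best >= 0)
		padesc->pa_info[best].pa_num_workers++;	/* claim before unlocking */
	SpinLockRelease(&padesc->pa_mutex);

	return best;
}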

The shared array needs to be initialized and made available to
workers. For this, we can do exactly what sequential scan does to be
parallel-aware: use a function ExecAppendInitializeDSM(), similar to
ExecSeqScanInitializeDSM(), in the leader to allocate the array, and
ExecAppendInitializeWorker() in each worker to attach to the shared
array.
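
The shape of those two functions, following the ExecSeqScan* pattern
(again a sketch: ParallelAppendDesc is from the sketch above, the
as_padesc field is hypothetical, and the shm_toc estimate step is
omitted for brevity):

#include "postgres.h"
#include "access/parallel.h"
#include "nodes/execnodes.h"
#include "storage/shm_toc.h"

/* Leader: allocate and initialize the shared array in the DSM segment. */
void
ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt)
{
	ParallelAppendDesc *padesc;
	Size		size;

	size = offsetof(ParallelAppendDesc, pa_info) +
		node->as_nplans * sizeof(ParallelAppendInfo);
	padesc = shm_toc_allocate(pcxt->toc, size);
	SpinLockInit(&padesc->pa_mutex);
	memset(padesc->pa_info, 0, node->as_nplans * sizeof(ParallelAppendInfo));
	/* caller would also set each pa_max_workers: 1 for non-partial plans */
	shm_toc_insert(pcxt->toc, node->ps.plan->plan_node_id, padesc);
	node->as_padesc = padesc;	/* hypothetical field in AppendState */
}

/* Worker: look up the array the leader created. */
void
ExecAppendInitializeWorker(AppendState *node, shm_toc *toc)
{
	node->as_padesc = shm_toc_lookup(toc, node->ps.plan->plan_node_id);
}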


=== Generating a partial Append plan having non-partial subplans ===

In set_append_rel_pathlist(), while generating a partial path for the
Append, also include the non-partial child subpaths besides the
partial subpaths. This way, the partial Append can contain a mix of
partial and non-partial child paths. But for a given child, its path
will be either the cheapest partial path or the cheapest non-partial
path.

A non-partial child path is included only if it is parallel-safe. If a
child has no parallel-safe path, a partial Append path is not
generated. This behaviour also automatically excludes paths that have
a Gather node beneath them.

Finally, when it comes to creating a partial Append path using these
child paths, we also need to store a bitmapset indicating which of

Re: [HACKERS] Parallel Index-only scan

2016-12-22 Thread rafia.sabih
Please find the attached file for the latest version of patch.

parallel_index_only_v2.patch



--
View this message in context: 
http://postgresql.nabble.com/Parallel-Index-only-scan-tp5934352p5936010.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.




Re: [HACKERS] multi-level partitions and partition-wise joins

2016-12-22 Thread Ashutosh Bapat
On Thu, Dec 22, 2016 at 10:52 PM, Robert Haas  wrote:
> On Wed, Dec 21, 2016 at 11:31 PM, Ashutosh Bapat
>  wrote:
>> Given the scenario described above, it looks like we have to retain
>> partition hierarchy in the form of inheritance hierarchy in order to
>> implement partition-wise joins for multi-leveled partition tables. Is
>> that the right thing to do? PFA a patch created by Amit Langote to
>> translate partition hierarchy into inheritance hierarchy. Is this
>> a step in the right direction?
>
> I am not sure whether Amit's patch is the right way to go.  I don't
> fully understand it, and I remember complaining about some aspects of
> it before, such as this unexplained and fairly random-looking
> exception:
>
> +/*
> + * Do not flatten the inheritance hierarchy if partitioned table, unless
> + * this is the result relation.
> + */
>
> However, I think the overall idea of doing flattening later in the
> process for partitioned tables is probably correct.  It's not that we
> shouldn't do flattening at all -- the final Plan shouldn't involve
> nested Append nodes -- but maybe we should delay it. Perhaps the Path
> tree retains the structure and the final Plan flattens it.

While creating append paths we flatten any append paths added to the children.

> We might
> consider doing that way for both inheritance trees and partitioning,
> just so we don't have two different code paths to validate.
>

AFAIU the reason why we chose to flatten the inheritance hierarchy is
multiple inheritance. Since the same child can inherit from two
parents, in an unflattened version its paths would be included twice.
It would be clumsy to keep the inheritance unflattened but not include
a relation more than once in the final plan tree.

However, for partitioned tables, we are guaranteed that there's only a
single parent and thus every child relation will be considered only
once. We will need separate code to handle (possible) multiple
inheritance and strictly single inheritance imposed by partitioning.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] [patch] psql tab completion for ALTER DEFAULT PRIVILEGES

2016-12-22 Thread Stephen Frost
Gilles,

* Gilles Darold (gilles.dar...@dalibo.com) wrote:
> Added to next commitfest. To explain this patch more: the completion
> of the SQL command:
> 
> ALTER DEFAULT PRIVILEGES FOR ROLE xxx [tab]

Here is a cleaned up patch for master and 9.6.  Tomorrow I'll look into
what we can do for 9.5 and earlier, which are also wrong, but the code
is quite a bit different.

Note that beyond just changing the comments, I removed the alternative
spelling of 'role' when doing tab completion: there's no difference
between 'role' and 'user', so there's no point in making the user have
to pick one when they're tab-completing.  Of course, we still accept
both, and if the user chooses to write out 'for user', we will handle
that correctly and continue the tab completion beyond that.

Thanks!

Stephen
From 1f7eb8473d40497b67cc30b40aefe2e1529317c0 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Thu, 22 Dec 2016 22:00:55 -0500
Subject: [PATCH] Fix tab completion in psql for ALTER DEFAULT PRIVILEGES

When providing tab completion for ALTER DEFAULT PRIVILEGES, we are
including the list of roles as possible options for completion after the
GRANT or REVOKE.  Further, we accept FOR ROLE/IN SCHEMA at the same time
and in either order, but the tab completion was only working for one or
the other.  Lastly, we weren't using the actual list of allowed kinds of
objects for default privileges for completion after the 'GRANT X ON' but
instead were completing to what 'GRANT X ON' supports, which isn't the
same at all.

Address these issues by improving the forward tab-completion for ALTER
DEFAULT PRIVILEGES and then constrain and correct how the tail
completion is done when it is for ALTER DEFAULT PRIVILEGES.

Author: Gilles Darold, cleaned up and comments added by me.
Discussion: https://www.postgresql.org/message-id/1614593c-e356-5b27-6dba-66320a9bc...@dalibo.com

Back-patch to 9.2.
---
 src/bin/psql/tab-complete.c | 57 ++---
 1 file changed, 48 insertions(+), 9 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index cd64c39..02c8d60 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1570,13 +1570,31 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH_CONST("PASSWORD");
 	/* ALTER DEFAULT PRIVILEGES */
 	else if (Matches3("ALTER", "DEFAULT", "PRIVILEGES"))
-		COMPLETE_WITH_LIST3("FOR ROLE", "FOR USER", "IN SCHEMA");
+		COMPLETE_WITH_LIST2("FOR ROLE", "IN SCHEMA");
 	/* ALTER DEFAULT PRIVILEGES FOR */
 	else if (Matches4("ALTER", "DEFAULT", "PRIVILEGES", "FOR"))
-		COMPLETE_WITH_LIST2("ROLE", "USER");
-	/* ALTER DEFAULT PRIVILEGES { FOR ROLE ... | IN SCHEMA ... } */
-	else if (Matches6("ALTER", "DEFAULT", "PRIVILEGES", "FOR", "ROLE|USER", MatchAny) ||
-		Matches6("ALTER", "DEFAULT", "PRIVILEGES", "IN", "SCHEMA", MatchAny))
+		COMPLETE_WITH_CONST("ROLE");
+	/* ALTER DEFAULT PRIVILEGES IN */
+	else if (Matches4("ALTER", "DEFAULT", "PRIVILEGES", "IN"))
+		COMPLETE_WITH_CONST("SCHEMA");
+	/* ALTER DEFAULT PRIVILEGES FOR ROLE|USER ... */
+	else if (Matches6("ALTER", "DEFAULT", "PRIVILEGES", "FOR", "ROLE|USER",
+					  MatchAny))
+		COMPLETE_WITH_LIST3("GRANT", "REVOKE", "IN SCHEMA");
+	/* ALTER DEFAULT PRIVILEGES IN SCHEMA ... */
+	else if (Matches6("ALTER", "DEFAULT", "PRIVILEGES", "IN", "SCHEMA",
+					  MatchAny))
+		COMPLETE_WITH_LIST3("GRANT", "REVOKE", "FOR ROLE");
+	/* ALTER DEFAULT PRIVILEGES IN SCHEMA ... FOR */
+	else if (Matches7("ALTER", "DEFAULT", "PRIVILEGES", "IN", "SCHEMA",
+					  MatchAny, "FOR"))
+		COMPLETE_WITH_CONST("ROLE");
+	/* ALTER DEFAULT PRIVILEGES FOR ROLE|USER ... IN SCHEMA ... */
+	/* ALTER DEFAULT PRIVILEGES IN SCHEMA ... FOR ROLE|USER ... */
+	else if (Matches9("ALTER", "DEFAULT", "PRIVILEGES", "FOR", "ROLE|USER",
+					  MatchAny, "IN", "SCHEMA", MatchAny) ||
+			 Matches9("ALTER", "DEFAULT", "PRIVILEGES", "IN", "SCHEMA",
+					  MatchAny, "FOR", "ROLE|USER", MatchAny))
 		COMPLETE_WITH_LIST2("GRANT", "REVOKE");
 	/* ALTER DOMAIN  */
 	else if (Matches3("ALTER", "DOMAIN", MatchAny))
@@ -2566,10 +2584,22 @@ psql_completion(const char *text, int start, int end)
 	else if (TailMatches2("FOREIGN", "SERVER"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_servers);
 
-/* GRANT && REVOKE --- is allowed inside CREATE SCHEMA, so use TailMatches */
+/*
+ * GRANT and REVOKE are allowed inside CREATE SCHEMA and
+ * ALTER DEFAULT PRIVILEGES, so use TailMatches
+ */
 	/* Complete GRANT/REVOKE with a list of roles and privileges */
 	else if (TailMatches1("GRANT|REVOKE"))
-		COMPLETE_WITH_QUERY(Query_for_list_of_roles
+		/*
+		 * With ALTER DEFAULT PRIVILEGES, restrict completion
+		 * to grantable privileges (can't grant roles)
+		 */
+		if (HeadMatches3("ALTER","DEFAULT","PRIVILEGES"))
+			COMPLETE_WITH_LIST10("SELECT", "INSERT", "UPDATE",
+								 "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER",
+								 "EXECUTE", "USAGE", "ALL");
+		else
+			COMPLETE_WITH_QUERY(Query_for_list_of_roles
 			" 

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-12-22 Thread Amit Kapila
On Thu, Dec 22, 2016 at 6:59 PM, Tomas Vondra
 wrote:
> Hi,
>
> But as discussed with Amit in Tokyo at pgconf.asia, I got access to a
> Power8e machine (IBM 8247-22L to be precise). It's a much smaller machine
> compared to the x86 one, though - it only has 24 cores in 2 sockets, 128GB
> of RAM and less powerful storage, for example.
>
> I've repeated a subset of x86 tests and pushed them to
>
> https://bitbucket.org/tvondra/power8-results-2
>
> The new results are prefixed with "power-" and I've tried to put them right
> next to the "same" x86 tests.
>
> In all cases the patches significantly reduce the contention on
> CLogControlLock, just like on x86. Which is good and expected.
>

The results look positive.  Do you think we can conclude, based on all
the tests you and Dilip have done, that we can move forward with this
patch (in particular group-update), or do you still want to do more
tests?  I am aware that in one of the tests we observed that reducing
contention on CLOGControlLock increased the contention on
WALWriteLock, but I feel we can leave that point as a note to the
committer and let him take a final call.  From the code perspective,
Robert and Andres have already taken one pass of review and I have
addressed all their comments, so more review of the code can surely
help, but I think that is not a big deal considering the patch is
relatively small.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] RLS related docs

2016-12-22 Thread Joe Conway
On 09/15/2016 02:34 PM, Joe Conway wrote:
> On 09/15/2016 01:33 PM, Robert Haas wrote:
>> On Sun, Aug 28, 2016 at 4:23 PM, Joe Conway  wrote:
> For COPY, I think perhaps it would be more logical to put the new note
> immediately after the third note which describes the privileges
> required, since it's kind of related, and then we can talk about the
> RLS policies required, e.g.:
>
> If row-level security is enabled for the table, COPY table TO is
> internally converted to COPY (SELECT * FROM table) TO, and the
> relevant security policies are applied. Currently, COPY FROM is not
> supported for tables with row-level security.

 This sounds better than what I had, so I will do it that way.
>>>
>>> Apologies for the delay, but new patch attached. Assuming no more
>>> comments, will commit this, backpatched to 9.5, in a day or two.
>>
>> I don't think this was ever committed, but my comment is that it seems
>> to be exposing rather more of the implementation than is probably
>> wise.  Can't we say that SELECT policies will apply rather than saying
>> that it is internally converted to a SELECT?

Committed that way, backpatched to 9.5.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-22 Thread Michael Paquier
On Fri, Dec 23, 2016 at 8:13 AM, Robert Haas  wrote:
> On Thu, Dec 22, 2016 at 2:34 PM, Andres Freund  wrote:
>> On 2016-12-22 08:32:56 -0800, Andres Freund wrote:
>>> I plan to commit this later today.  Hope I got the reviewers roughly right.
>>
>> And pushed. Thanks for the work on this everyone.
>
> Cool.  Also, +1 for the important/unimportant terminology.  I like that.

Thanks for the commit.
-- 
Michael




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-22 Thread Kohei KaiGai
2016-12-23 8:24 GMT+09:00 Robert Haas :
> On Wed, Dec 7, 2016 at 11:01 PM, Kohei KaiGai  wrote:
>> Regardless of the ExpandedObject, does the flatten format need to
>> contain fully flatten data chunks?
>
> I suspect it does, and I think that's why this isn't going to get very
> far without a super-varlena format.
>
Yep, I'm now investigating how to implement the typlen == -3 approach.
It will likely be the most straightforward infrastructure for other
potential use cases beyond matrix/vector.

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-22 Thread Kohei KaiGai
2016-12-23 8:23 GMT+09:00 Robert Haas :
> On Wed, Dec 7, 2016 at 10:44 PM, Kohei KaiGai  wrote:
>>> Handling objects >1GB at all seems like the harder part of the
>>> problem.
>>>
>> I mostly get your point. Does the last line above refer to the amount
>> of data in objects >1GB, even if the "super-varlena" format allows
>> 64-bit lengths?
>
> Sorry, I can't understand your question about what I wrote.
>
I thought you were pointing out that handling a large amount of data
is the harder part, even if the data format allows >1GB. Right?

-- 
KaiGai Kohei 




Re: [HACKERS] Potential data loss of 2PC files

2016-12-22 Thread Michael Paquier
On Fri, Dec 23, 2016 at 6:33 AM, Jim Nasby  wrote:
> On 12/22/16 12:02 PM, Andres Freund wrote:
>> On December 22, 2016 6:44:22 PM GMT+01:00, Robert Haas
>>  wrote:
>>> On Thu, Dec 22, 2016 at 12:39 PM, Andres Freund 
>>> wrote:
>>>> It makes more sense if you mentally separate between filename(s) and
>>>> file contents.  Having to do filesystem metadata transactions for an
>>>> fsync intended to sync contents would be annoying...
>>>
>>> I thought that's why there's fdatasync.
>>
>> Not quite IIRC: that doesn't deal with file size increase.  All this
>> would be easier if hardlinks didn't exist IIUC. It's basically a
>> question whether dentry, inode or contents need to be synced.  Yes, it
>> sucks.
>
> IIRC this isn't the first time we've run into this problem... should
> pg_fsync() automatically fsync the directory as well? I guess we'd need a
> flag to disable that for performance critical areas where we know we
> don't need it (presumably just certain WAL fsyncs).

I am not sure that would be a good idea performance-wise. The case of
the 2PC files is quite special anyway, as just doing the sync for
everything at checkpoint phase would be enough.
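
For reference, the kind of directory fsync being discussed looks
roughly like this (a minimal POSIX sketch, not PostgreSQL's actual
code; the function name is made up):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * After creating, renaming or unlinking a file, fsync the parent
 * directory so the directory entry itself reaches stable storage.
 * Returns 0 on success, -1 on error (with errno set).
 */
static int
fsync_parent_dir(const char *dirpath)
{
	int		fd = open(dirpath, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fsync(fd) < 0)			/* on Linux a directory fd can be fsync'd */
	{
		int		save_errno = errno;

		close(fd);
		errno = save_errno;
		return -1;
	}
	return close(fd);
}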
-- 
Michael




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-12-22 Thread Amit Kapila
On Thu, Dec 22, 2016 at 9:56 PM, Robert Haas  wrote:
> On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila  wrote:
>>> I'll review after that, since I have other things to review meanwhile.
>>
>> Attached, please find the rebased patch with this e-mail.
>> There is no fundamental change in patch except for adapting the new
>> locking strategy in squeeze operation.  I have done the sanity testing
>> of the patch with master-standby setup and Kuntal has helped me to
>> verify it with his wal-consistency checker patch.
>
> This patch again needs a rebase,
>

I had completed the rebase last night and left it running overnight
tests to check the sanity of the patch, but I guess now I need another
rebase.  Anyway, I feel this is all for the betterment of the final
patch.

> but before you do that I'd like to
> make it harder by applying the attached patch to remove
> _hash_chgbufaccess(), which I think is a bad plan for more or less the
> same reasons that motivated the removal of _hash_wrtbuf() in commit
> 25216c98938495fd741bf585dcbef45b3a9ffd40. I think there's probably
> more simplification and cleanup that can be done afterward in the wake
> of this; what I've done here is just a mechanical replacement of
> _hash_chgbufaccess() with LockBuffer() and/or MarkBufferDirty().

The patch replaces the usage of HASH_READ/HASH_WRITE with
BUFFER_LOCK_SHARE/BUFFER_LOCK_EXCLUSIVE, which will leave the hash
code using two sets of defines for the same purpose.  Now, we could
say it is better to replace HASH_READ/HASH_WRITE throughout the hash
index code, or maybe the usage this patch introduces is okay; note
that btree uses similar terminology (BT_READ/BT_WRITE).  Other than
that, your patch looks good.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] DROP FUNCTION of multiple functions

2016-12-22 Thread Robert Haas
On Thu, Dec 1, 2016 at 9:32 PM, Peter Eisentraut
 wrote:
> I think it would be better to get rid of objargs and have objname be a
> general Node that can contain more specific node types so that there is
> some amount of type tracking.  FuncWithArgs would be one such type,
> Typename would be another, Value would be used for simple strings, and
> we could create some other ones, or stick with lcons for some simple
> cases.  But then we don't have to make stuff into one-item lists just
> to satisfy the currently required List.
>
> That's the general idea.  But that's a rather big change that I would
> rather break down into smaller pieces.  I have a separate patch in
> progress for that, which I have attached here.  It breaks some
> regression tests in object_address.sql, which I haven't evaluated yet,
> but that's the idea.

I think I disagree with this concept. I wouldn't shed many tears if
objname and objargs got replaced with some other kind of name
representation.  But I don't think that should be done incrementally,
because then we'll be in this transitional zone where it's unclear
what the right way to do things is for a long time, possibly forever.
I'd be fine with a plan to rip out objname/objargs and replace it with
something less arbitrary, but until that's done I think new code
should stick with the existing representations.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [COMMITTERS] pgsql: Build HTML documentation using XSLT stylesheets by default

2016-12-22 Thread Tom Lane
Peter Eisentraut  writes:
> On 11/30/16 11:37 PM, Tom Lane wrote:
>> OK, I updated docbook-style-xsl to 1.79.1 from Fedora rawhide (building
>> and installing that was quite painless btw, didn't need a pile of build
>> dependencies like I'd feared it would take).  The extraneous commas are
>> gone, and the speed is better but still not really up to DSSSL speed:
>> 1m44s (vs 1m5s with old toolchain).  So there's some additional piece
>> that needs fixing, but that's certainly the worst of it.

> I've done a few more tweaks, and now it actually runs faster for me than
> the old build.

For me it's now 1m35s, which is better than the last round but not
quite up to the old speed.  It's tolerable though.  Thanks for hacking
on it.

regards, tom lane




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-22 Thread Robert Haas
On Wed, Dec 7, 2016 at 11:01 PM, Kohei KaiGai  wrote:
> Regardless of the ExpandedObject, does the flatten format need to
> contain fully flatten data chunks?

I suspect it does, and I think that's why this isn't going to get very
far without a super-varlena format.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] varlena beyond 1GB and matrix

2016-12-22 Thread Robert Haas
On Wed, Dec 7, 2016 at 10:44 PM, Kohei KaiGai  wrote:
>> Handling objects >1GB at all seems like the harder part of the
>> problem.
>>
> I mostly get your point. Does the last line above refer to the amount
> of data in objects >1GB, even if the "super-varlena" format allows
> 64-bit lengths?

Sorry, I can't understand your question about what I wrote.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-22 Thread Robert Haas
On Thu, Dec 22, 2016 at 2:34 PM, Andres Freund  wrote:
> On 2016-12-22 08:32:56 -0800, Andres Freund wrote:
>> I plan to commit this later today.  Hope I got the reviewers roughly right.
>
> And pushed. Thanks for the work on this everyone.

Cool.  Also, +1 for the important/unimportant terminology.  I like that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Declarative partitioning - another take

2016-12-22 Thread Robert Haas
On Thu, Dec 22, 2016 at 3:35 AM, Amit Langote
 wrote:
> While working on that, I discovered yet-another-bug having to do with the
> tuple descriptor that's used as we route a tuple down a partition tree. If
> attnums of given key attribute(s) are different on different levels, it
> would be incorrect to use the original slot's (one passed by ExecInsert())
> tuple descriptor to inspect the original slot's heap tuple, as we go down
> the tree.  It might cause a spurious "partition not found" failure at
> some level due to looking at the wrong field in the input tuple, because
> of using the wrong tuple descriptor (the root table's attnums are not
> always the same as those of other partitioned tables in the tree).
> Patch 0001 fixes that, including a test.

I committed this, but I'm a bit uncomfortable with it: should the
TupleTableSlot be part of the ModifyTableState rather than the EState?

> It also addresses the problem I mentioned previously that once
> tuple-routing is done, we failed to switch to a slot with the leaf
> partition's tupdesc (IOW, continued to use the original slot with root
> table's tupdesc, causing spurious failures due to differences in attnums
> between the leaf partition and the root table).
>
> Further patches 0002, 0003 and 0004 fix bugs that I sent one-big-patch for
> in my previous message.  Each patch has a test for the bug it's meant to fix.

Regarding 0002, I think that this is kind of a strange fix.  Wouldn't
it be better to get hold of the original tuple instead of reversing
the conversion?  And what of the idea of avoiding the conversion in
the (probably very common) case where we can?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [COMMITTERS] pgsql: Build HTML documentation using XSLT stylesheets by default

2016-12-22 Thread Peter Eisentraut
On 11/30/16 11:37 PM, Tom Lane wrote:
> Peter Eisentraut  writes:
>> OK, I got it.  The component of concern is the DocBook XSL stylesheets,
>> called docbook-style-xsl on RH-like systems (docbook-xsl on Debian).  If
>> it runs too slow, it's probably too old.
> 
> OK, I updated docbook-style-xsl to 1.79.1 from Fedora rawhide (building
> and installing that was quite painless btw, didn't need a pile of build
> dependencies like I'd feared it would take).  The extraneous commas are
> gone, and the speed is better but still not really up to DSSSL speed:
> 1m44s (vs 1m5s with old toolchain).  So there's some additional piece
> that needs fixing, but that's certainly the worst of it.

I've done a few more tweaks, and now it actually runs faster for me than
the old build.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pg_background contrib module proposal

2016-12-22 Thread Robert Haas
On Thu, Dec 22, 2016 at 4:41 PM, Jim Nasby  wrote:
> On 12/22/16 4:20 AM, amul sul wrote:
>> • pg_background_detach : This API takes the process id and detaches
>> the background process. The stored worker's session is not dropped
>> until this is called.
>
> When I hear "detach" I think that whatever I'm detaching from is going to
> stick around, which I don't think is the case here, right? I'd suggest
> pg_background_close() instead.

Uh, I think it is.  At least in the original version of this patch,
pg_background_detach() leaves the spawned process running but says
that you don't care to read any results it may generate.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [patch] psql tab completion for ALTER DEFAULT PRIVILEGES

2016-12-22 Thread Stephen Frost
Gilles,

* Gilles Darold (gilles.dar...@dalibo.com) wrote:
> Le 20/11/2016 à 15:46, Gilles Darold a écrit :
> > When tab-completing after ALTER DEFAULT PRIVILEGES ... GRANT|REVOKE,
> > currently psql injects completion from the GRANT|REVOKE order, rather
> > than the one expected.
> >
> > A patch is attached. It adds the right completion to GRANT|REVOKE after
> > ALTER DEFAULT PRIVILEGES and after FOR ROLE|USER + IN SCHEMA.
> 
> Added to next commitfest. To explain this patch more: the completion
> of the SQL command:

I've started looking at this.  First off, it looks pretty good and seems
like it's actually a bug fix which should be back-patched, since the
current behavior in released branches is also wrong.  There have been
some changes in this area, so it might not be practical to go all the
way back; we will have to see once I start getting into it.

One minor nit is that multi-line comments should be of the form:

/*
 * ...
 */

The tab-completion code does do some like this:

/* ... */
/* ... */

Which is probably alright, but you've added some like:

/* ...
   */

Which we really don't do.  I'll clean that up and might do a bit of
word-smithing on the comments also, so no need for a new patch, but
thought I'd mention it for future patches.

Thanks!

Stephen




Re: [HACKERS] [BUG?] pg_event_trigger_ddl_commands() error with ALTER TEXT SEARCH CONFIGURATION

2016-12-22 Thread Stephen Frost
Artur,

* Artur Zakirov (a.zaki...@postgrespro.ru) wrote:
> 2016-12-21 20:34 GMT+03:00 Stephen Frost :
> > Did you happen to look at adding a regression test for this to
> > test_ddl_deparse?
> 
> Of course. I updated the patch.

I added a few comments and back-patched the relevant bits as discussed.

Thanks for the bug report and patch!

Stephen




Re: [HACKERS] Indirect indexes

2016-12-22 Thread Alvaro Herrera
Here's a still-WIP patch that's a better candidate for inclusion.  In
this patch, I have created an executor node for indirect index scans.
This node is created whenever an indexscan path is chosen for an
indirect index.  The planner treats indirect indexes just like regular
indexes, except that they are explicitly excluded from index-only scans
and bitmap scans.

Indirect indexes are more costly to scan than regular indexes, because
of the need to descend the primary key index as well.  However, the
fact that they can be ignored for HOT considerations makes them
worthwhile in update-heavy cases.

This patch only implements btree indirect indexes, but it is possible
to implement them with other index types too.  The btree case is
probably not terribly interesting in conjunction with Pavan's WARM,
but on the other hand GIN indirect indexes are expected to remain
useful.  I have not implemented GIN indirect indexes yet, to keep the
patch small; once we have btree indirect indexes, implementing GIN
should be easy.

There are a few broken things yet, such as "REINDEX TABLE pg_class" and
some other operations specifically on pg_class.  This one in particular
breaks the regression tests, but that shouldn't be terribly difficult to
fix.  Other things I know about that need further work:

* The killtuple implementation is bogus: it may delete entries that are
visible to scans other than our own (in particular it may delete entries
that are being concurrently created, I think).

* We still pass PK values in scan->xs_ctup.t_self.  Shouldn't be
difficult to fix, if a bit invasive.

* Only one primary key column is supported.  Should be an easy fix if
the above is fixed.

* Fill in the UNIQUE_CHECK_INSERT_SINGLETON bits, to avoid duplicate
entries in the indirect index

* Figure out what's the deal with validate_index_heapscan

* Figure out if we need to apply ExecQual in IndirectIndexNext

* Verify whether NumOrderByKeys != 0 is necessary.  If so, test it.

* There's a probably bogus assertion in get_index_paths

* Better implementation of RelationGetPrimaryKey?  Maybe save PK in
relcache?


-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1b45a4c..9f899c7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -92,6 +92,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amstorage = true;
amroutine->amclusterable = false;
amroutine->ampredlocks = false;
+   amroutine->amcanindirect = false;
amroutine->amkeytype = InvalidOid;
 
amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index f07eedc..1bc91d2 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->amstorage = true;
amroutine->amclusterable = false;
amroutine->ampredlocks = false;
+   amroutine->amcanindirect = false;
amroutine->amkeytype = InvalidOid;
 
amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b8aa9bc..4ec34d5 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -69,6 +69,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amstorage = true;
amroutine->amclusterable = true;
amroutine->ampredlocks = false;
+   amroutine->amcanindirect = false;
amroutine->amkeytype = InvalidOid;
 
amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0eeb37d..7cfb776 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -66,6 +66,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amstorage = false;
amroutine->amclusterable = false;
amroutine->ampredlocks = false;
+   amroutine->amcanindirect = false;
amroutine->amkeytype = INT4OID;
 
amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..9e91b41 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -96,10 +96,10 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
 				HeapTuple newtup, HeapTuple old_key_tup,
 				bool all_visible_cleared, bool new_all_visible_cleared);
 static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
-				Bitmapset *hot_attrs,
-				Bitmapset *key_attrs, Bitmapset *id_attrs,
+				Bitmapset *hot_attrs, Bitmapset *key_attrs,
+				Bitmapset 

Re: [HACKERS] pg_background contrib module proposal

2016-12-22 Thread Jim Nasby

On 12/22/16 4:20 AM, amul sul wrote:

• pg_background_detach : This API takes the process id and detaches
the background process. The stored worker's session is not dropped
until this is called.


When I hear "detach" I think that whatever I'm detaching from is going 
to stick around, which I don't think is the case here, right? I'd 
suggest pg_background_close() instead.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




Re: [HACKERS] Potential data loss of 2PC files

2016-12-22 Thread Jim Nasby

On 12/22/16 12:02 PM, Andres Freund wrote:


> On December 22, 2016 6:44:22 PM GMT+01:00, Robert Haas  wrote:
>> On Thu, Dec 22, 2016 at 12:39 PM, Andres Freund  wrote:
>>> It makes more sense if you mentally separate between filename(s) and
>>> file contents.  Having to do filesystem metadata transactions for an
>>> fsync intended to sync contents would be annoying...
>>
>> I thought that's why there's fdatasync.
>
> Not quite IIRC: that doesn't deal with file size increase.  All this
> would be easier if hardlinks didn't exist IIUC. It's basically a question
> whether dentry, inode or contents need to be synced.  Yes, it sucks.


IIRC this isn't the first time we've run into this problem... should 
pg_fsync() automatically fsync the directory as well? I guess we'd need 
a flag to disable that for performance critical areas where we know we 
don't need it (presumably just certain WAL fsyncs).

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-22 Thread Andres Freund
Hi,

On 2016-12-22 08:32:56 -0800, Andres Freund wrote:
> I plan to commit this later today.  Hope I got the reviewers roughly right.

And pushed. Thanks for the work on this everyone.

Andres




Re: [HACKERS] Ilegal type cast in _hash_doinsert()

2016-12-22 Thread Robert Haas
On Thu, Dec 22, 2016 at 2:04 AM, Mithun Cy  wrote:
> There seems to be some base bug in _hash_doinsert().
>
> +    * XXX this is useless code if we are only storing hash keys.
> +    */
> +   if (itemsz > HashMaxItemSize((Page) metap))
> +       ereport(ERROR,
> +               (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
> +                errmsg("index row size %zu exceeds hash maximum %zu",
> +                       itemsz, HashMaxItemSize((Page) metap)),
> +                errhint("Values larger than a buffer page cannot be
> +                         indexed.")));
>
> "metap" (HashMetaPage) is not a Page (they are entirely different
> structures), so the above cast is not legal. Attached is a patch to
> correct this by passing the actual Page which stores "HashMetaPage".

So, let's see here.  HashMaxItemSize uses page only to call
PageGetPageSize, which accesses page->pd_pagesize_version, stored 18
bytes into the structure.   Instead we're passing it a pointer to
HashMetaPageData, where at that offset it will find hashm_bsize which
on my system has a value of 8152, which is close enough to the actual
page size of 8192 that this check fails to fire in error.

It seems like a better idea to declare a new variable for this rather
than reusing one of the correct type that happens not to be doing
anything during this part of the function, so I committed and
back-patched this that way.
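
For readers following along, the shape of the fix described above looks
roughly like this (a sketch based on the quoted code; the variable name
is illustrative, not the committed patch verbatim):

	Page		metapage = BufferGetPage(metabuf);

	if (itemsz > HashMaxItemSize(metapage))
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("index row size %zu exceeds hash maximum %zu",
						itemsz, HashMaxItemSize(metapage)),
				 errhint("Values larger than a buffer page cannot be indexed.")));

This hands HashMaxItemSize() a real Page (the buffer holding the
metapage) instead of a HashMetaPage cast to Page, so PageGetPageSize()
reads the actual page size rather than hashm_bsize.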

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Read/write expanded objects versus domains and CASE

2016-12-22 Thread Tom Lane
I looked into the problem reported in bug #14472,
https://www.postgresql.org/message-id/20161221214744.25622.71...@wrigleys.postgresql.org
Although the submitter thought it was related to bug #14414, it isn't
particularly.  The failure scenario is that the input value to a
CoerceToDomain execution node is a read-write expanded datum.  We were
blindly passing that to any CHECK constraint expressions for the domain
type, which leaves called functions at liberty to modify or even delete
the expanded object.  Correct behavior is to pass a read-only pointer to
the CHECK expressions and then return the original read-write pointer as
the expression result.

I nosed around for other occurrences of the same problem and soon
realized that CASE with an "arg" expression has a similar issue,
since the "arg" value may get passed to multiple test expressions.
It'd be substantially harder to make a test case for that in the
current state of the code --- to get a failure, you'd need a
plpgsql function to be the equality operator for some data type ---
but it's surely not impossible.

Also, domain_check() could in principle be called with a r/w datum,
so it should also protect against this.

The fix for this is nominally simple, to call MakeExpandedObjectReadOnly
at the appropriate places; but that requires having the data type's typlen
at hand, which we don't in these places.

In the domain cases, the typlen is cheaply accessible through the domain's
typcache entry, which we could get at as long as we don't mind using
DomainConstraintRef.tcache, which had been documented as private to
typcache.c.  I don't see any particularly strong reason not to allow
callers to use it, though.

In the CASE case, there seems no help for it except to expend a
get_typlen() syscache lookup during executor setup.  It's kind of annoying
to do that to support a corner case that may very well never occur in the
field, but I don't see another alternative.

So I'm proposing the attached patch (sans test cases as yet).
Any objections?

regards, tom lane

diff --git a/src/backend/executor/execQual.c b/src/backend/executor/execQual.c
index 743e7d6..ab09691 100644
*** a/src/backend/executor/execQual.c
--- b/src/backend/executor/execQual.c
*** ExecEvalCase(CaseExprState *caseExpr, Ex
*** 2992,3003 
  
  	if (caseExpr->arg)
  	{
  		bool		arg_isNull;
  
! 		econtext->caseValue_datum = ExecEvalExpr(caseExpr->arg,
! 												 econtext,
! 												 &arg_isNull,
! 												 NULL);
  		econtext->caseValue_isNull = arg_isNull;
  	}
  
--- 2992,3009 
  
  	if (caseExpr->arg)
  	{
+ 		Datum		arg_value;
  		bool		arg_isNull;
  
! 		arg_value = ExecEvalExpr(caseExpr->arg,
! 								 econtext,
! 								 &arg_isNull,
! 								 NULL);
! 		/* Since caseValue_datum may be read multiple times, force to R/O */
! 		econtext->caseValue_datum =
! 			MakeExpandedObjectReadOnly(arg_value,
! 									   arg_isNull,
! 									   caseExpr->argtyplen);
  		econtext->caseValue_isNull = arg_isNull;
  	}
  
*** ExecEvalCoerceToDomain(CoerceToDomainSta
*** 4127,4137 
  	 * nodes. We must save and restore prior setting of
  	 * econtext's domainValue fields, in case this node is
  	 * itself within a check expression for another domain.
  	 */
  	save_datum = econtext->domainValue_datum;
  	save_isNull = econtext->domainValue_isNull;
  
! 	econtext->domainValue_datum = result;
  	econtext->domainValue_isNull = *isNull;
  
  	conResult = ExecEvalExpr(con->check_expr,
--- 4133,4150 
  	 * nodes. We must save and restore prior setting of
  	 * econtext's domainValue fields, in case this node is
  	 * itself within a check expression for another domain.
+ 	 *
+ 	 * Also, if we are working with a read-write expanded
+ 	 * datum, be sure that what we pass to CHECK expressions
+ 	 * is a read-only pointer; else called functions might
+ 	 * modify or even delete the expanded object.
  	 */
  	save_datum = econtext->domainValue_datum;
  	save_isNull = econtext->domainValue_isNull;
  
! 	econtext->domainValue_datum =
! 		MakeExpandedObjectReadOnly(result, *isNull,
! 								   cstate->constraint_ref->tcache->typlen);
  	econtext->domainValue_isNull = *isNull;
  
  	conResult = ExecEvalExpr(con->check_expr,
*** ExecInitExpr(Expr *node, PlanState *pare
*** 4939,4944 
--- 4952,4959 
  }
  cstate->args = outlist;
  cstate->defresult = ExecInitExpr(caseexpr->defresult, parent);
+ if (caseexpr->arg)
+ 	cstate->argtyplen = get_typlen(exprType((Node *) caseexpr->arg));
  state = (ExprState *) cstate;
  			}
  			break;
diff --git a/src/backend/utils/adt/domains.c b/src/backend/utils/adt/domains.c
index 265..1cd80ae 100644
*** a/src/backend/utils/adt/domains.c
--- b/src/backend/utils/adt/domains.c
***
*** 35,40 
--- 35,41 
  #include "executor/executor.h"
  #include 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-22 Thread Claudio Freire
On Thu, Dec 22, 2016 at 12:22 PM, Claudio Freire  wrote:
> On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
>  wrote:
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, failed
>> Implements feature:   not tested
>> Spec compliant:   not tested
>> Documentation:not tested
>>
>> Hi,
>> I haven't read through the thread yet, just tried to apply the patch and run 
>> tests.
>> And it seems that the last attached version is outdated now. It doesn't 
>> apply to the master
>> and after I've tried to fix merge conflict, it segfaults at initdb.
>
>
> I'll rebase when I get some time to do it and post an updated version

Attached rebased patches. I called them both v3 to be consistent.

I'm not sure how you ran it, but this works fine for me:

./configure --enable-debug --enable-cassert
make clean
make check

... after a while ...

===
 All 168 tests passed.
===

I reckon the above is equivalent to installcheck, but doesn't require
sudo or actually installing the server, so installcheck should work
assuming the install went ok.
From f7c6a46ea2d721e472237e00fd2404081a4ddc75 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Fri, 2 Sep 2016 23:33:48 -0300
Subject: [PATCH 1/2] Vacuum: prefetch buffers on backward scan

This patch speeds up truncation by prefetching buffers during
the backward scan phase just prior to truncation. This optimization
helps rotating media because disk rotation induces a buffer-to-buffer
latency of roughly a whole rotation when reading backwards; such
disks are usually optimized for forward scans. Prefetching buffers
in larger chunks ameliorates the issue considerably and is otherwise
harmless. The prefetch amount should also not be an issue on SSDs,
which won't benefit from the optimization, but won't be hurt either.
---
 src/backend/commands/vacuumlazy.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b5fb325..854bce3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1825,7 +1825,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 static BlockNumber
 count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 {
-	BlockNumber blkno;
+	BlockNumber blkno, prefetchBlkno;
 	instr_time	starttime;
 
 	/* Initialize the starttime if we check for conflicting lock requests */
@@ -1833,6 +1833,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 	/* Strange coding of loop control is needed because blkno is unsigned */
 	blkno = vacrelstats->rel_pages;
+	prefetchBlkno = blkno & ~0x1f;
+	prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
 	while (blkno > vacrelstats->nonempty_pages)
 	{
 		Buffer		buf;
@@ -1882,6 +1884,23 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 		blkno--;
 
+		/*
+		 * Speed up reads on rotating rust by prefetching a few blocks ahead.
+		 * Backward scans on rotary disks are slow if done one buffer at a time
+		 * because readahead won't kick in on most OSes, and each buffer will
+		 * have to wait for the platter to do a full rotation.
+		 * Should be harmless on SSD.
+		 */
+		if (blkno <= prefetchBlkno) {
+			BlockNumber pblkno;
+			if (prefetchBlkno >= 32)
+prefetchBlkno -= 32;
+			else
+prefetchBlkno = 0;
+			for (pblkno = prefetchBlkno; pblkno < blkno; pblkno++)
+PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno);
+		}
+
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
  RBM_NORMAL, vac_strategy);
 
-- 
1.8.4.5

From 084805d565dced136903262cfad4fd395468a9bb Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 2/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c | 295 +-
 1 file changed, 230 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 854bce3..dfb0612 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -91,6 +91,7 @@
  * tables.
  */
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
 
 /*
  * Before we consider skipping a page that's marked as clean in
@@ -98,6 +99,27 @@
  */
 #define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	int			num_dead_tuples;	/* # of entries in the 
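
In outline, the structure and its lookup work like this - a sketch with
illustrative names (only DeadTuplesSegment and its num_dead_tuples field
above are the patch's own; the real patch also keeps enough ordering
information to binary-search the segment directory itself):

typedef struct DeadTuplesSegment
{
	int				num_dead_tuples;	/* entries used in dt_tids */
	int				max_dead_tuples;	/* allocated size of dt_tids */
	ItemPointerData *dt_tids;			/* sorted array of dead tids */
} DeadTuplesSegment;

typedef struct DeadTuplesMultiArray
{
	int				num_segs;		/* segments in use */
	DeadTuplesSegment *segs;		/* each roughly 2x bigger than the last */
} DeadTuplesMultiArray;

/*
 * lazy_tid_reaped()-style lookup: walk the segment directory and binary
 * search inside each segment.  vac_cmp_itemptr is vacuumlazy.c's existing
 * tid comparator.
 */
static bool
multiarray_contains(DeadTuplesMultiArray *arr, ItemPointer itemptr)
{
	int			i;

	for (i = 0; i < arr->num_segs; i++)
	{
		DeadTuplesSegment *seg = &arr->segs[i];

		if (bsearch(itemptr, seg->dt_tids, seg->num_dead_tuples,
					sizeof(ItemPointerData), vac_cmp_itemptr) != NULL)
			return true;
	}
	return false;
}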

Re: [HACKERS] increasing the default WAL segment size

2016-12-22 Thread Tomas Vondra

On 12/20/2016 02:19 PM, Andres Freund wrote:

On 2016-12-20 08:10:29 -0500, Robert Haas wrote:

We could use the GUC assign hook to compute a mask and a shift, so
that this could be written as (CurrPos & mask_variable) == 0.  That
would avoid the division instruction, though not the memory access.


I suspect that'd be fine.



I hope this is all in the noise, though.


Could very well be.



I know this code is hot but I think it'll be hard to construct a
test case where the bottleneck is anything other than the speed at
which the disk can absorb bytes.


I don't think that's really true. Heikki's WAL changes made a *BIG*
difference. And pretty small changes in xlog.c can make noticeable
throughput differences both in single and multi-threaded
workloads. E.g. witnessed by the fact that the crc computation used to
be a major bottleneck (and the crc32c instruction still shows up
noticeably in profiles).  SSDs have become fast enough that it's
increasingly hard to saturate them.



It's not just SSDs. RAID controllers with write cache (which is 
typically just DDR3 memory anyway) have about the same effect even with 
spinning rust.


So yes, this might make a measurable difference.
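
FWIW, the mask/shift idea from upthread is cheap to sketch. Something like
this, assuming wal_segment_size stays constrained to powers of two (names
are illustrative, not actual xlog.c code):

static uint64	SegmentSizeMask;	/* bytes per segment - 1 */
static int		SegmentSizeShift;	/* log2(bytes per segment) */

/* GUC assign hook: runs once per change, never in the WAL hot path */
static void
assign_wal_segment_size(int newval, void *extra)
{
	SegmentSizeMask = (uint64) newval - 1;
	SegmentSizeShift = 0;
	while ((1 << SegmentSizeShift) < newval)
		SegmentSizeShift++;
}

/* hot-path replacements for division/modulo by the segment size */
#define IsSegmentBoundary(xlogptr)	(((xlogptr) & SegmentSizeMask) == 0)
#define PtrToSegNo(xlogptr)			((xlogptr) >> SegmentSizeShift)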

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Potential data loss of 2PC files

2016-12-22 Thread Andres Freund


On December 22, 2016 6:44:22 PM GMT+01:00, Robert Haas  
wrote:
>On Thu, Dec 22, 2016 at 12:39 PM, Andres Freund 
>wrote:
>> It makes more sense if you mentally separate filename(s) from
>file contents.  Having to do filesystem metadata transactions for an
>fsync intended to sync contents would be annoying...
>
>I thought that's why there's fdatasync.

Not quite IIRC: that doesn't deal with file size increase.  All this would be 
easier if hardlinks didn't exist IIUC. It's basically a question of whether 
the dentry, inode or contents need to be synced.   Yes, it sucks.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] An attempt to reduce WALWriteLock contention

2016-12-22 Thread Tomas Vondra

On 12/22/2016 04:00 PM, Kuntal Ghosh wrote:

Hello all,


...


\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid();
\watch 0.5
HEAD

48642 LWLockNamed | WALWriteLock

With Patch
--
31889 LWLockNamed | WALFlushLock
25212 LWLockNamed | WALWriteLock



How do these counts compare to the other wait events? For example 
CLogControlLock, which is what Amit's patch [1] is about?


[1] 
https://www.postgresql.org/message-id/flat/84c22fbb-b9c4-a02f-384b-b4feb2c67193%402ndquadrant.com


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-22 Thread Joe Conway
On 12/22/2016 06:55 AM, Tom Lane wrote:
> Joe Conway  writes:
>> On 12/21/2016 09:20 PM, Tom Lane wrote:
>>> I see that you need to pass the PGconn into dblink_res_error() now, but
>>> what's the point of the new "bool fail" parameter?
> 
>> That part isn't new -- we added it sometime prior to 9.2:
> 
> Oh!  I misread the patch --- something about an unluckily-placed line
> wrap and not looking very closely :-(.  Yeah, it's fine as is.
> Sorry for the noise.

Thanks -- committed/backpatched to 9.2

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2016-12-22 Thread Robert Haas
On Wed, Dec 21, 2016 at 8:00 PM, Amit Langote
 wrote:
> On 2016/12/22 0:31, Robert Haas wrote:
>> On Tue, Dec 20, 2016 at 12:22 PM, Alvaro Herrera
>>  wrote:
>>> Robert Haas wrote:
>>>> Implement table partitioning.
>>>
>>> I thought it was odd to use rd_rel->reloftype as a boolean in
>>> ATExecAttachPartition, but apparently we do it elsewhere too, so let's
>>> leave that complaint for another day.
>>
>> Ugh.  I agree - that's bad style.
>
> Agreed, fixed in the attached patch.
>
>>> What I also found off in the same function is that we use
>>> SearchSysCacheCopyAttName() on each attribute and then don't free the
>>> result, and don't ever use the returned tuple either.  A simple fix, I
>>> thought, just remove the "Copy" and add a ReleaseSysCache().
>>
>> Or use SearchSysCacheExists.
>
> Done, too.
>
>>> But then I
>>> noticed this whole thing is rather strange -- why not pass a boolean
>>> flag down to CreateInheritance() and from there to
>>> MergeAttributesIntoExisting() to implement the check?  That seems less
>>> duplicative.
>>
>> Hmm, that would be another way to do it.
>
> MergeAttributesIntoExisting() is called by ATExecAddInherit() as well,
> where we don't want to check that.  Sure, we can only check if
> child_is_partition, but I thought it'd be better to keep the shared code
> (between regular inheritance and partitioning) as close to the old code
> as possible.
>
> Attached patch also fixes a couple of other minor issues.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Potential data loss of 2PC files

2016-12-22 Thread Robert Haas
On Thu, Dec 22, 2016 at 12:39 PM, Andres Freund  wrote:
> It makes more sense if you mentally separate filename(s) from file 
> contents.  Having to do filesystem metadata transactions for an fsync 
> intended to sync contents would be annoying...

I thought that's why there's fdatasync.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Potential data loss of 2PC files

2016-12-22 Thread Andres Freund


On December 22, 2016 5:50:38 PM GMT+01:00, Robert Haas  
wrote:
>On Wed, Dec 21, 2016 at 8:30 PM, Michael Paquier
> wrote:
>> Hi all,
>>
>> 2PC files are created using RecreateTwoPhaseFile() in two places
>currently:
>> - at replay on a XLOG_XACT_PREPARE record.
>> - At checkpoint with CheckPointTwoPhase().
>>
>> Now RecreateTwoPhaseFile() is careful to call pg_fsync() to be sure
>> that the 2PC files find their way into disk. But one piece is
>missing:
>> the parent directory pg_twophase is not fsync'd. At replay this is
>> more sensitive if there is a PREPARE record followed by a checkpoint
>> record. If there is a power failure after the checkpoint completes
>> there is a risk to lose 2PC status files here.
>>
>> It seems to me that we really should have CheckPointTwoPhase() call
>> fsync() on pg_twophase to be sure that no files are lost here. There
>> is no point to do this operation in RecreateTwoPhaseFile() as if
>there
>> are many 2PC transactions to replay performance would be impacted,
>and
>> we don't care about the durability of those files until a checkpoint
>> moves the redo pointer. I have drafted the patch attached to address
>> this issue.
>>
>> I am adding that as well to the next CF for consideration.
>
>It's pretty stupid that operating systems don't guarantee that after
>calling pg_fsync() on a file, the file will also be visible in the parent
>directory.  There's not much of a use case for wanting to make sure that
>the file will have the correct contents but not caring whether it's
>there at all.

It makes more sense if you mentally separate filename(s) from file 
contents.  Having to do filesystem metadata transactions for an fsync intended 
to sync contents would be annoying...
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] dblink get_connect_string() passes FDW option "updatable" to the connect string, connection fails.

2016-12-22 Thread Joe Conway
On 12/18/2016 02:47 PM, Corey Huinker wrote:
> On Sun, Dec 18, 2016 at 4:57 PM, Michael Paquier wrote:
>> On Mon, Dec 19, 2016 at 6:48 AM, Joe Conway wrote:
>>> Maybe if "CREATE FOREIGN DATA WRAPPER" had a way to specify that the FDW
>>> supports a libpq connection it would make sense to allows other FDWs
>>> with this attribute, but since there is none the current state strikes
>>> me as a bad idea.
>>> Thoughts?

>> libpq is proper to the implementation of the FDW, not the wrapper on
>> top of it, so using in the CREATE FDW command a way to do the
>> decision-making that does not look right to me. Filtering things at
>> connection attempt is a better solution.

> The only benefit I see would be giving the user a slightly more clear
> error message like ('dblink doesn't work with %', 'mysql') instead of
> the error about the failure of the connect string generated by the
> options that did overlap. 

Ok, I committed Corey's patch with only minor whitespace changes.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2016-12-22 Thread Robert Haas
On Thu, Dec 22, 2016 at 12:12 AM, Craig Ringer  wrote:
> On 22 December 2016 at 09:55, Craig Ringer  wrote:
>
>> Updated.
>>
>> If you think it's better to just take XidGenLock again, let me know.
>
> Here's a further update that merges in Robert's changes from the patch
> posted upthread.

This patch contains unresolved merge conflicts.  Maybe
SetPendingTransactionIdLimit could happen in TruncateCLOG rather than
the caller.  Instead of using an LWLock, how about a spin lock?
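
Roughly like this - the names are made up and the state would really live
in shared memory (with SpinLockInit() at startup), but the protected
section is just a couple of instructions with no failure paths, which is
exactly what spinlocks are for:

static slock_t	pendingXidLimitLock;
static TransactionId pendingXidLimit = InvalidTransactionId;

static void
SetPendingTransactionIdLimit(TransactionId newLimit)
{
	SpinLockAcquire(&pendingXidLimitLock);
	pendingXidLimit = newLimit;
	SpinLockRelease(&pendingXidLimitLock);
}

static TransactionId
GetPendingTransactionIdLimit(void)
{
	TransactionId limit;

	SpinLockAcquire(&pendingXidLimitLock);
	limit = pendingXidLimit;
	SpinLockRelease(&pendingXidLimitLock);
	return limit;
}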

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] multi-level partitions and partition-wise joins

2016-12-22 Thread Robert Haas
On Wed, Dec 21, 2016 at 11:31 PM, Ashutosh Bapat
 wrote:
> Given the scenario described above, it looks like we have to retain
> partition hierarchy in the form of inheritance hierarchy in order to
> implement partition-wise joins for multi-leveled partition tables. Is
> that the right thing to do? PFA a patch written by Amit Langote to
> translate partition hierarchy into inheritance hierarchy. Is this
> something on the right direction?

I am not sure whether Amit's patch is the right way to go.  I don't
fully understand it, and I remember complaining about some aspects of
it before, such as this unexplained and fairly random-looking
exception:

+/*
+ * Do not flatten the inheritance hierarchy if partitioned table, unless
+ * this is the result relation.
+ */

However, I think the overall idea of doing flattening later in the
process for partitioned tables is probably correct.  It's not that we
shouldn't do flattening at all -- the final Plan shouldn't involve
nested Append nodes -- but maybe we should delay it.  Perhaps the Path
tree retains the structure and the final Plan flattens it.  We might
consider doing that way for both inheritance trees and partitioning,
just so we don't have two different code paths to validate.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Potential data loss of 2PC files

2016-12-22 Thread Robert Haas
On Wed, Dec 21, 2016 at 8:30 PM, Michael Paquier
 wrote:
> Hi all,
>
> 2PC files are created using RecreateTwoPhaseFile() in two places currently:
> - at replay on a XLOG_XACT_PREPARE record.
> - At checkpoint with CheckPointTwoPhase().
>
> Now RecreateTwoPhaseFile() is careful to call pg_fsync() to be sure
> that the 2PC files find their way into disk. But one piece is missing:
> the parent directory pg_twophase is not fsync'd. At replay this is
> more sensitive if there is a PREPARE record followed by a checkpoint
> record. If there is a power failure after the checkpoint completes
> there is a risk to lose 2PC status files here.
>
> It seems to me that we really should have CheckPointTwoPhase() call
> fsync() on pg_twophase to be sure that no files are lost here. There
> is no point doing this operation in RecreateTwoPhaseFile() as, if there
> are many 2PC transactions to replay, performance would be impacted, and
> we don't care about the durability of those files until a checkpoint
> moves the redo pointer. I have drafted the patch attached to address
> this issue.
>
> I am adding that as well to the next CF for consideration.

It's pretty stupid that operating systems don't guarantee that after
calling pg_fsync() on a file, the file will also be visible in the parent
directory.  There's not much of a use case for wanting to make sure that
the file will have the correct contents but not caring whether it's
there at all.
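
For the record, the portable workaround is the usual one: fsync a file
descriptor opened on the directory itself, which is roughly what the
proposed CheckPointTwoPhase() change has to do for pg_twophase.  A plain
POSIX sketch (PostgreSQL's own wrapper for this is fsync_fname() in fd.c):

#include <fcntl.h>
#include <unistd.h>

/* Make the directory entries themselves durable. */
static int
fsync_parent_dir(const char *dirpath)
{
	int			fd = open(dirpath, O_RDONLY);

	if (fd < 0)
		return -1;

	/*
	 * Some platforms refuse to fsync a directory, so real callers may
	 * need to treat certain errors as non-fatal, as fd.c does.
	 */
	if (fsync(fd) < 0)
	{
		close(fd);
		return -1;
	}

	return close(fd);
}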

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-22 Thread Andres Freund
Hi,

On 2016-12-21 13:28:54 -0800, Andres Freund wrote:
> A mime-type of invalid/octet-stream? That's an, uh, odd choice.
>
> Working on committing this (tomorrow morning, not tonight).  There's
> some relatively minor things I want to change:
>
> - I don't like the name XLogSetFlags() - it's completely unclear what
>   those flags refer to - it could just as well be replay
>   related. XLogSetRecordFlags()?
> - Similarly I don't like the name "progress LSN" much. What does
>   "progress" really mean in that?  Maybe "consistency LSN"?
> - It's currently required to avoid triggering archive timeouts and
>   checkpoints triggering each other, but I'm nervous marking all xlog
>   switches as unimportant. I think it'd be better to only mark timeout
>   triggered switches as such.

Here's an updated version of this. Besides the above (with "consistency
LSN" now named "lastImportantAt" instead of the previous
lastProgressAt), I changed how the skipping works in the bgwriter: I
don't see any need to involve the checkpoint location there. This also
allows dropping GetLastCheckpointPtr().  Besides that I did a fair amount
of comment-smithing.

I plan to commit this later today.  Hope I got the reviewers roughly right.

Regards,

Andres
From c5c9e9ce114d5c058e171caaa172ccb2ac066f13 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 22 Dec 2016 08:31:32 -0800
Subject: [PATCH] Skip checkpoints, archiving on idle systems.

Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.

To make that easier, allow marking individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an action was performed, it is now easy to decide
whether that action needs to be repeated.

Use the new facility for checkpoints, archive timeout and standby
snapshots.

Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
https://www.postgresql.org/message-id/20151016203031.3019.72...@wrigleys.postgresql.org
https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkyv_45d...@mail.gmail.com
Backpatch: -
---
 doc/src/sgml/config.sgml  |  10 +--
 src/backend/access/heap/heapam.c  |  10 +--
 src/backend/access/transam/xact.c |   2 +-
 src/backend/access/transam/xlog.c | 118 +++---
 src/backend/access/transam/xlogfuncs.c|   2 +-
 src/backend/access/transam/xloginsert.c   |  24 --
 src/backend/postmaster/bgwriter.c |   8 +-
 src/backend/postmaster/checkpointer.c |  45 
 src/backend/replication/logical/message.c |   2 +-
 src/backend/storage/ipc/standby.c |  11 ++-
 src/include/access/xlog.h |  12 ++-
 src/include/access/xlog_internal.h|   4 +-
 src/include/access/xloginsert.h   |   2 +-
 13 files changed, 173 insertions(+), 77 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1b98c416e0..b6b20a368e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2852,12 +2852,10 @@ include_dir 'conf.d'
 parameter is greater than zero, the server will switch to a new
 segment file whenever this many seconds have elapsed since the last
 segment file switch, and there has been any database activity,
-including a single checkpoint.  (Increasing
-checkpoint_timeout will reduce unnecessary
-checkpoints on an idle system.)
-Note that archived files that are closed early
-due to a forced switch are still the same length as completely full
-files.  Therefore, it is unwise to use a very short
+including a single checkpoint (checkpoints are skipped if there is
+no database activity).  Note that archived files that are closed
+early due to a forced switch are still the same length as completely
+full files.  Therefore, it is unwise to use a very short
 archive_timeout  it will bloat your archive
 storage.  archive_timeout settings of a minute or so are
 usually reasonable.  You should consider using streaming replication,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1a0d..ea579a00be 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-12-22 Thread Robert Haas
On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila  wrote:
>> I'll review after that, since I have other things to review meanwhile.
>
> Attached, please find the rebased patch.
> There is no fundamental change in the patch except for adapting the new
> locking strategy in the squeeze operation.  I have done sanity testing
> of the patch with a master-standby setup and Kuntal has helped me to
> verify it with his wal-consistency checker patch.

This patch again needs a rebase, but before you do that I'd like to
make it harder by applying the attached patch to remove
_hash_chgbufaccess(), which I think is a bad plan for more or less the
same reasons that motivated the removal of _hash_wrtbuf() in commit
25216c98938495fd741bf585dcbef45b3a9ffd40. I think there's probably
more simplification and cleanup that can be done afterward in the wake
of this; what I've done here is just a mechanical replacement of
_hash_chgbufaccess() with LockBuffer() and/or MarkBufferDirty().  The
point is that having MarkBufferDirty() calls happen implicitly inside
some other function is not what we want for write-ahead logging.  Your
patch gets rid of that, too; this is just doing it somewhat more
thoroughly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0eeb37d..1fa087a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -274,7 +274,7 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 	 * Reacquire the read lock here.
 	 */
 	if (BufferIsValid(so->hashso_curbuf))
-		_hash_chgbufaccess(rel, so->hashso_curbuf, HASH_NOLOCK, HASH_READ);
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
 
 	/*
 	 * If we've already initialized this scan, we can just advance it in the
@@ -354,7 +354,7 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 
 	/* Release read lock on current buffer, but keep it pinned */
 	if (BufferIsValid(so->hashso_curbuf))
-		_hash_chgbufaccess(rel, so->hashso_curbuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
 	scan->xs_ctup.t_self = so->hashso_heappos;
@@ -524,7 +524,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	orig_ntuples = metap->hashm_ntuples;
 	memcpy(&local_metapage, metap, sizeof(local_metapage));
 	/* release the lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 	/* Scan the buckets that we know exist */
 	cur_bucket = 0;
@@ -576,9 +576,9 @@ loop_top:
 			 * (and thus can't be further split), update our cached metapage
 			 * data.
 			 */
-			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			LockBuffer(metabuf, BUFFER_LOCK_SHARE);
 			memcpy(&local_metapage, metap, sizeof(local_metapage));
-			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+			LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 		}
 
 		bucket_buf = buf;
@@ -597,7 +597,7 @@ loop_top:
 	}
 
 	/* Write-lock metapage and check for split since we started */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
 
 	if (cur_maxbucket != metap->hashm_maxbucket)
@@ -605,7 +605,7 @@ loop_top:
 		/* There's been a split, so process the additional bucket(s) */
 		cur_maxbucket = metap->hashm_maxbucket;
 		memcpy(&local_metapage, metap, sizeof(local_metapage));
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 		goto loop_top;
 	}
 
@@ -821,7 +821,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 * page
 		 */
 		if (retain_pin)
-			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
@@ -836,7 +836,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	if (buf != bucket_buf)
 	{
 		_hash_relbuf(rel, buf);
-		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+		LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
 	}
 
 	/*
@@ -866,7 +866,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 			bstrategy);
 	else
-		_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
 }
 
 void
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 59c4213..b51b487 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -102,7 +102,7 @@ restart_insert:
 		lowmask = metap->hashm_lowmask;
 
 		/* Release metapage lock, but keep pin. */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 		/*
 		 * If the previous 

Re: [HACKERS] patch: function xmltable

2016-12-22 Thread Alvaro Herrera
Pavel Stehule wrote:

> another update - lot of cleaning

Ah, the tupledesc stuff in this one appears much more reasonable to me.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SET NOT NULL [NOT VALID / CONCURRENTLY]?

2016-12-22 Thread Robert Haas
On Wed, Dec 21, 2016 at 7:55 PM, Joel Jacobson  wrote:

> Attached is the function SET_NOT_NULL(_Schema name, _Table name, _Column
> name) which does the following:
>
> 1. LOCK TABLE %I.%I IN ACCESS EXCLUSIVE MODE
> just like the normal DDL commands would do
>
> 2. SELECT EXISTS (SELECT 1 FROM %I.%I WHERE %I IS NULL)
> which is fast if there is an index on the column
>
> 3. UPDATE pg_catalog.pg_attribute SET attnotnull = TRUE
> WHERE attrelid = %L::oid
> AND   attname  = %L
>
> Pragmatically, would this be a safe approach?
>

Hmm, I don't see a problem with it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Hang in pldebugger after git commit : 98a64d0

2016-12-22 Thread Robert Haas
On Wed, Dec 21, 2016 at 9:34 PM, Amit Kapila  wrote:
> On Wed, Dec 21, 2016 at 9:46 PM, Robert Haas  wrote:
>> On Tue, Dec 20, 2016 at 11:32 PM, Amit Kapila  
>> wrote:
>>> Ashutosh Sharma has helped to test that pldebugger issue is fixed with
>>> attached version.
>>
>> Committed.
>>
>
> Thanks!
>
>> Patch by Amit Kapial.  Reported and tested by Ashutosh Sharma, who
>> also provided some analysis.  Further analysis by Michael Paquier.
>
> In the commit note, my name is not spelled correctly, but I think there is
> nothing we can do about it.

Sorry, it's too late to fix now.  But you spelled my name wrong today
too, so maybe we're even.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] postgres_fdw bug in 9.6

2016-12-22 Thread Robert Haas
On Wed, Dec 21, 2016 at 11:04 AM, Ashutosh Bapat
 wrote:
> Some review comments
>
> 1. postgres_fdw doesn't push down semi and anti joins so you may want to
> discount these two too.
> +   jointype == JOIN_SEMI ||
> +   jointype == JOIN_ANTI);

But in the future, it might.  We shouldn't randomly leave foot-guns
lying around if there's an easy alternative.

> 3. Adding new members to JoinPathExtraData may save some work for postgres_fdw
> and other FDWs which would use CreateLocalJoinPath(), but it will add a few 
> bytes
> to the structure even when there is no "FULL foreign join which requires EPQ"
> involved in the query. That may not be much of a memory overhead since the
> structure is used locally in add_paths_to_joinrel(), but it might be something
> to think about. Instead, what if we call select_mergejoin_clauses() within
> CreateLocalJoinPath() to get that information?

I think that's exactly backwards.  The few bytes of storage don't
matter, but extra CPU cycles might.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-22 Thread Claudio Freire
On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
 wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, failed
> Implements feature:   not tested
> Spec compliant:   not tested
> Documentation:not tested
>
> Hi,
> I haven't read through the thread yet, just tried to apply the patch and run 
> tests.
> And it seems that the last attached version is outdated now. It doesn't apply 
> to the master
> and after I've tried to fix merge conflict, it segfaults at initdb.


I'll rebase when I get some time to do it and post an updated version




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-22 Thread Anastasia Lubennikova
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Hi,
I haven't read through the thread yet, just tried to apply the patch and run 
tests.
And it seems that the last attached version is outdated now. It doesn't apply 
to master,
and after I tried to fix the merge conflict, it segfaults at initdb. 

So, looking forward to a new version for review.

The new status of this patch is: Waiting on Author



Re: [HACKERS] Minor correction in alter_table.sgml

2016-12-22 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Alvaro Herrera  writes:
> >> I had considered removing those but thought that some users might think
> >> that they're only "altering" one table and therefore felt it made sense
> >> to keep those explicitly listed.
> 
> > I agree with Stephen; it's not crystal clear from the user's POV that
> > the command is altering two tables.  It's worth mentioning this
> > explicitly; otherwise this is just a documented gotcha.
> 
> Well, it already is shown explicitly in the syntax summary.  The text
> is simply trying to restate that in an easily remembered fashion, and
> the more exceptions, the harder it is to remember.  You might as well
> forget trying to provide a rule at all and just say something like
> "Most forms of ALTER TABLE can be combined, except as shown in the
> syntax diagram."

I do wonder if perhaps we should change 'action' to something like
'combinable_action' or something more explicit which we could easily
refer to later in a statement along the lines of:

  Multiple combinable_actions specified in a single ALTER TABLE
  statement will be applied together in a single pass over the table.

> (Of course, maybe the question we ought to be asking here is why
> ATTACH/DETACH PARTITION failed to go with the flow and be a
> combinable action.)

I did wonder that myself but haven't looked at the code.  I'm guessing
there's a reason it's that way.

Thanks!

Stephen




[HACKERS] An attempt to reduce WALWriteLock contention

2016-12-22 Thread Kuntal Ghosh
Hello all,

In a recent post[1] by Robert, wait events for different LWLocks have been
analyzed. The results clearly indicate a significant lock contention
overhead on WAL Write locks. To get an idea of this overhead, we did the
following two tests.

1. Hacked the code to comment out WAL write and flush calls to see the
overhead of WAL writing. The TPS for read-write pgbench tests at 300 scale
factor with 64 client count increased from 27871 to 45068.

2. Hacked the code to comment out WAL flush calls to see the overhead of
WAL flushing (fsync=off). The TPS for read-write pgbench tests at 300 scale
factor with 64 client count increased from 27871 to 41835.

All the tests have been performed for 15 minutes with following pg
configurations:
max_wal_size: 40GB
checkpoint_timeout: 15mins
maintenance_work_mem: 4GB
checkpoint_completion_target: 0.9
Shared buffer: 8GB
(Other settings have default values)

From the above experiments, it is clear that flush is the main cost in WAL
writing, which is no surprise, but still, the above data shows the exact
overhead of flush. Robert and Amit suggested (in offline discussions) using a
separate WALFlushLock to flush the WAL data. The idea is to take the WAL
flush calls out of WAL Write Lock and introduce a new lock (WAL Flush Lock)
to flush the data. This should allow simultaneous OS writes while an fsync is
in progress. LWLockAcquireOrWait is used for the newly introduced WAL Flush
Lock to accumulate flush calls. We did a pgbench read/write (s.f. 300) test
with the above configurations for various client counts. But we didn't see
any performance improvement; rather, it decreased by 10%-12%. Hence, to
measure the wait events, we performed a run for 30 minutes with 64 clients.
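
The flush path with the new lock looks roughly like this (a simplified
sketch - WALFlushLock is the patch's new lock, and the LogwrtResult
bookkeeping is heavily abbreviated compared to the real xlog.c code):

static void
XLogFlushUpto(XLogRecPtr record)
{
	for (;;)
	{
		/* done if somebody else's fsync already covered our request */
		if (record <= LogwrtResult.Flush)
			return;

		/*
		 * If the lock is busy, sleep until the holder releases it, then
		 * recheck: the holder has likely flushed our data for us.
		 */
		if (!LWLockAcquireOrWait(WALFlushLock, LW_EXCLUSIVE))
			continue;

		/* we hold the lock: sync everything written so far */
		issue_xlog_fsync(openLogFile, openLogSegNo);
		LogwrtResult.Flush = LogwrtResult.Write;
		LWLockRelease(WALFlushLock);
		return;
	}
}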

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid();
\watch 0.5
HEAD

48642 LWLockNamed | WALWriteLock

With Patch
--
31889 LWLockNamed | WALFlushLock
25212 LWLockNamed | WALWriteLock

The contention on WAL Write Lock was reduced, but together with WAL Flush
lock, the total contention got increased. We also measured the number of
times fsync() and write() have been called for a 10-minutes pgbench
read/write test with 16 clients. We noticed a huge increase in write()
system calls and this is happening as we've reduced the contention on WAL
Write Lock.

Due to the reduced contention on WAL Write Lock, a lot of backends are
issuing small OS writes, sometimes on the same 8KB page, i.e., write calls
are not properly accumulated. For example,
backend 1 - 1 KB write() - 15-20 micro secs
backend 2 - 1 KB write() - 15-20 micro secs
backend 3 - 1 KB write() - 15-20 micro secs
backend 4 - 1 KB write() - 15-20 micro secs
But, if we accumulate these 4 requests, 4KB can be written in 50-60 micro
secs. Apart from that, we are also paying for a lock acquire and lock
release around each OS write and lseek(). For the same reason, when an fsync
is in progress, we are not able to accumulate sufficient data for the next
fsync. This also increases the contention on WAL Flush Lock. So, we tried
adding a delay (pg_usleep) before flush/write to accumulate data. But, this
severely increases the contention on WAL flush locks.

To reduce the contention on WAL Write Lock further, Amit suggested the
following change on top of the existing patch:
Backend as Write Leader:
Except for one proc, all other procs will wait for their write location to
be written to the OS buffer. Each proc will advertise its write location
and wait on its semaphore until that location has been written. Only the
leader will compete for WALWriteLock. After the data is written, it wakes
all the procs for which it has written WAL, and once done with waking them
it releases the WALWriteLock (a rough sketch follows the numbers below).
Ashutosh and Amit have helped a lot with the implementation of this idea.
Even with this idea, we didn't see any noticeable performance improvement
in synchronous_commit=on mode, however there was no regression. Again, to
measure the wait events, we performed a 30-minute run with 64 clients
(pgbench r/w test with s.f. 300).

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid();
\watch 0.5
HEAD

48642  LWLockNamed | WALWriteLock

With Patch
--
38952 LWLockNamed | WALFlushLock
1679 LWLockNamed | WALWriteLock

We reduced the contention on WAL write locks. The reason is that only the
group leader is competing for the write lock on behalf of a group of procs.
Still, the number of small write requests is not reduced.
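
For reference, the write-leader scheme is patterned on the group XID
clearing in procarray.c; compressed to its skeleton it looks like this
(walWriteUpto, walGroupNext, walGroupMember and firstWALWriter are
hypothetical fields, not actual PGPROC/XLogCtl members; firstWALWriter
starts out as INVALID_PGPROCNO):

static pg_atomic_uint32 firstWALWriter;		/* head of pgprocno list */

static void
XLogWriteGroup(PGPROC *proc, XLogRecPtr upto)
{
	uint32		nextidx;

	/* advertise our request and link ourselves into the pending list */
	proc->walWriteUpto = upto;
	proc->walGroupMember = true;
	nextidx = pg_atomic_read_u32(&firstWALWriter);
	while (true)
	{
		pg_atomic_write_u32(&proc->walGroupNext, nextidx);
		if (pg_atomic_compare_exchange_u32(&firstWALWriter,
										   &nextidx,
										   (uint32) proc->pgprocno))
			break;
	}

	if (nextidx != INVALID_PGPROCNO)
	{
		/* not the leader: sleep until the leader has written our WAL */
		for (;;)
		{
			PGSemaphoreLock(&proc->sem);
			if (!proc->walGroupMember)
				break;
		}
		return;
	}

	/* leader: write WAL for the whole group under WALWriteLock */
	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
	/*
	 * Detach the list with pg_atomic_exchange_u32(&firstWALWriter,
	 * INVALID_PGPROCNO), take the max walWriteUpto over the group, and
	 * issue one large write() covering all of it.
	 */
	LWLockRelease(WALWriteLock);

	/*
	 * Then walk the detached list, clear each follower's walGroupMember
	 * flag and wake it with PGSemaphoreUnlock(&follower->sem).
	 */
}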

Finally, we performed some tests with synchronous_commit=off and data that
doesn't fit in shared buffers. This should accumulate the data properly for
writes without waiting on locks or semaphores. Besides, write and fsync can
be done simultaneously. The next results are for various scale factors and
shared buffer sizes. (Please see below for the system configuration):

./pgbench -c $threads -j $threads -T 900 -M prepared postgres

Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-22 Thread Tom Lane
Joe Conway  writes:
> On 12/21/2016 09:20 PM, Tom Lane wrote:
>> I see that you need to pass the PGconn into dblink_res_error() now, but
>> what's the point of the new "bool fail" parameter?

> That part isn't new -- we added it sometime prior to 9.2:

Oh!  I misread the patch --- something about an unluckily-placed line
wrap and not looking very closely :-(.  Yeah, it's fine as is.
Sorry for the noise.

regards, tom lane




Re: [HACKERS] Minor correction in alter_table.sgml

2016-12-22 Thread Tom Lane
Alvaro Herrera  writes:
>> I had considered removing those but thought that some users might think
>> that they're only "altering" one table and therefore felt it made sense
>> to keep those explicitly listed.

> I agree with Stephen; it's not crystal clear from the user's POV that
> the command is altering two tables.  It's worth mentioning this
> explicitly; otherwise this is just a documented gotcha.

Well, it already is shown explicitly in the syntax summary.  The text
is simply trying to restate that in an easily remembered fashion, and
the more exceptions, the harder it is to remember.  You might as well
forget trying to provide a rule at all and just say something like
"Most forms of ALTER TABLE can be combined, except as shown in the
syntax diagram."

(Of course, maybe the question we ought to be asking here is why
ATTACH/DETACH PARTITION failed to go with the flow and be a
combinable action.)

regards, tom lane




Re: [HACKERS] Minor correction in alter_table.sgml

2016-12-22 Thread Alvaro Herrera
Stephen Frost wrote:
> Tom,
> 
> * Tom Lane (t...@sss.pgh.pa.us) wrote:

> > BTW, so far as the HEAD version of this patch goes, I notice that
> > ATTACH PARTITION and DETACH PARTITION were recently added to the
> > list of exceptions.  But they're not exceptions according to this
> > wording: they act on more than one table (the parent and the
> > partition), no?  So we could simplify the sentence some more by
> > removing them again.
> 
> I had considered removing those but thought that some users might think
> that they're only "altering" one table and therefore felt it made sense
> to keep those explicitly listed.

I agree with Stephen; it's not crystal clear from the user's POV that
the command is altering two tables.  It's worth mentioning this
explicitly; otherwise this is just a documented gotcha.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Minor correction in alter_table.sgml

2016-12-22 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> So maybe something like
> >> 
> >> All the forms of ALTER TABLE that act on a single table,
> >> except RENAME and SET SCHEMA, can be combined into a
> >> list of multiple alterations to be applied together.
> 
> > Committed and back-patch'd that way.
> 
> BTW, so far as the HEAD version of this patch goes, I notice that
> ATTACH PARTITION and DETACH PARTITION were recently added to the
> list of exceptions.  But they're not exceptions according to this
> wording: they act on more than one table (the parent and the
> partition), no?  So we could simplify the sentence some more by
> removing them again.

I had considered removing those but thought that some users might think
that they're only "altering" one table and therefore felt it made sense
to keep those explicitly listed.

I don't hold that position very strongly, but wanted to explain my
thinking there.  If you still feel it'd be better to keep it simple
rather than be explicit for those cases then I'll change it.

Thanks!

Stephen




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-12-22 Thread Tomas Vondra

Hi,


The attached results show that:

(a) master shows the same zig-zag behavior - No idea why this wasn't
observed on the previous runs.

(b) group_update actually seems to improve the situation, because the
performance keeps stable up to 72 clients, while on master the
fluctuation starts way earlier.

I'll redo the tests with a newer kernel - this was on 3.10.x which is
what Red Hat 7.2 uses, I'll try on 4.8.6. Then I'll try with the patches
you submitted, if the 4.8.6 kernel does not help.

Overall, I'm convinced this issue is unrelated to the patches.


I've been unable to rerun the tests on this hardware with a newer 
kernel, so nothing new on the x86 front.


But as discussed with Amit in Tokyo at pgconf.asia, I got access to a 
Power8e machine (IBM 8247-22L to be precise). It's a much smaller 
machine compared to the x86 one, though - it only has 24 cores in 2 
sockets, 128GB of RAM and less powerful storage, for example.


I've repeated a subset of x86 tests and pushed them to

https://bitbucket.org/tvondra/power8-results-2

The new results are prefixed with "power-" and I've tried to put them 
right next to the "same" x86 tests.


In all cases the patches significantly reduce the contention on 
CLogControlLock, just like on x86. Which is good and expected.


Otherwise the results are rather boring - no major regressions compared 
to master, and all the patches perform almost exactly the same. Compare 
for example this:


* http://tvondra.bitbucket.org/#dilip-300-unlogged-sync

* http://tvondra.bitbucket.org/#power-dilip-300-unlogged-sync

So the results seem much smoother compared to x86, and the performance 
difference is roughly 3x, which matches the 24 vs. 72 cores.


For pgbench, the difference is much more significant, though:

* http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

* http://tvondra.bitbucket.org/#power-pgbench-300-unlogged-sync-skip

So, we're doing ~40k on Power8, but 220k on x86 (which is ~6x more, so 
double per-core throughput). My first guess was that this is due to the 
x86 machine having better I/O subsystem, so I've reran the tests with 
data directory in tmpfs, but that produced almost the same results.


Of course, this observation is unrelated to this patch.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pg_background contrib module proposal

2016-12-22 Thread amul sul
Hi all,

As we have discussed previously, we need to rework this patch as a client of
Peter Eisentraut's background sessions code[1].

Attaching a trial-version patch to discuss the possible design and API.
We could have the following APIs :

• pg_background_launch : This function starts and stores a new background
session, and returns the process id of the background worker.

• pg_background_run : This API takes the process id and an SQL command as
input parameters. Using this process id, the stored worker's session is
retrieved and the given SQL command is executed under it.

• pg_background_result : This API takes the process id as input
parameter and returns the result of the command executed through the
background worker's session.  Same as it was before, but now results can
be fetched in LIFO order, i.e. the result of the last query executed using
pg_background_run will be fetched first.

• pg_background_detach : This API takes the process id and detaches the
background process. The stored worker's session is not dropped until this
is called.

• TBC : an API to discard the result of the last query, or discard
everything altogether?

• TBC : How about having one more API to list all existing sessions?


Kindly share your thoughts/suggestions.  Note that the attached patch is a
WIP version; code, comments & behaviour could be vague.

--
Quick demo:
--
Apply the attached patch on top of Peter Eisentraut's
0001-Add-background-sessions.patch[1]

postgres=# select pg_background_launch();
 pg_background_launch
----------------------
                21004
(1 row)

postgres=# select pg_background_run(21004, 'vacuum verbose foo');
 pg_background_run
-------------------

(1 row)

postgres=# select * from pg_background_result(21004) as (x text);
INFO:  vacuuming "public.foo"
INFO:  "foo": found 0 removable, 5 nonremovable row versions in 1 out of 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
   x
--------
 VACUUM
(1 row)

postgres=# select pg_background_run(21004, 'select * from foo');
 pg_background_run
-------------------

(1 row)

postgres=# select * from pg_background_result(21004) as (x int);
 x
---
 1
 2
 3
 4
 5
(5 rows)

postgres=# select pg_background_detach(21004);
 pg_background_detach
----------------------

(1 row)


References :
[1] 
https://www.postgresql.org/message-id/e1c2d331-ee6a-432d-e9f5-dcf85cffaf29%402ndquadrant.com.


Regards,
Amul Sul


0002-pg_background_worker_as_client_of_bgsession_trial.patch
Description: Binary data



Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-22 Thread Amit Kapila
On Thu, Dec 22, 2016 at 3:10 AM, Andres Freund  wrote:
> On 2016-12-21 16:35:28 -0500, Robert Haas wrote:
>> On Wed, Dec 21, 2016 at 4:28 PM, Andres Freund  wrote:
>> > - Similarly I don't like the name "progress LSN" much. What does
>> >   "progress" really mean in that". Maybe "consistency LSN"?
>>
>> Whoa.  -1 from me for "consistency LSN".  Consistency has to with
>> whether the cluster has recovered up to the minimum recovery point or
>> whatever -- that is -- questions like "am i going to run into torn
>> pages?" and "should I expect some heap tuples to maybe be missing
>> index tuples, or the other way around?".
>
> That's imo pretty much what progress LSN currently describes. Have there
> been any records which are important for durability/consistency and
> hence need to be archived and such.
>
>
>> What I think "progress LSN"
>> is getting at -- actually fairly well -- is whether we're getting
>> anything *important* done, not whether we are consistent.  I don't
>> mind changing the name, but not to consistency LSN.
>
> Well, progress could just as well be replay. Or the actual insertion
> point. Or up to where we've written out. Or synced out. Or
> replicated
>
> Open to other suggestions - I'm not really happy with consistency LSN,
> but definitely unhappy with progress LSN.
>

last_essential_LSN?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal for CSN based snapshots

2016-12-22 Thread Amit Kapila
On Wed, Aug 24, 2016 at 2:24 PM, Heikki Linnakangas  wrote:
>
> And here are the results on the 72 core machine (thanks again, Alexander!).
> The test setup was the same as on the 32-core machine, except that I ran it
> with more clients since the system has more CPU cores. In summary, in the
> best case, the patch increases throughput by about 10%. That peak is with 64
> clients. Interestingly, as the number of clients increases further, the gain
> evaporates, and the CSN version actually performs worse than unpatched
> master. I don't know why that is. One theory is that by eliminating one
> bottleneck, we're now hitting another bottleneck which doesn't degrade as
> gracefully when there's contention.
>

Quite possible, and I think it could be due to either CLOGControlLock or
WALWriteLock.  Are you planning to work on this for this release? If not,
do you think we should meanwhile pursue Andres's "caching the snapshot"
idea?  Last time Mithun did some benchmarks [1] after fixing some issues
in the patch and he found a noticeable performance increase.

[1] - 
https://www.postgresql.org/message-id/CAD__Ouic1Tvnwqm6Wf6j7Cz1Kk1DQgmy0isC7%3DOgX%2B3JtfGk9g%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Declarative partitioning - another take

2016-12-22 Thread Amit Langote
On 2016/12/22 1:50, Robert Haas wrote:
> On Wed, Dec 21, 2016 at 5:33 AM, Amit Langote
>  wrote:
>> Breaking changes into multiple commits/patches does not seem to work for
>> adding regression tests.  So, I've combined multiple patches into a single
>> patch which is now patch 0002 in the attached set of patches.
> 
> Ugh, seriously?  It's fine to combine closely related bug fixes but
> not all of these are.  I don't see why you can't add some regression
> tests in one patch and then add some more in the next patch.

I managed to do that this time around with the attached set of patches.
Guess I gave up too easily in the previous attempt.

While working on that, I discovered yet another bug having to do with the
tuple descriptor that's used as we route a tuple down a partition tree. If
attnums of given key attribute(s) are different on different levels, it
would be incorrect to use the original slot's (one passed by ExecInsert())
tuple descriptor to inspect the original slot's heap tuple, as we go down
the tree.  It might cause a spurious "partition not found" at some level
due to looking at an incorrect field in the input tuple because of using
the wrong tuple descriptor (the root table's attnums are not always the
same as those of other partitioned tables in the tree).  Patch 0001 fixes
that, including a test.
It also addresses the problem I mentioned previously that once
tuple-routing is done, we failed to switch to a slot with the leaf
partition's tupdesc (IOW, continued to use the original slot with the root
table's tupdesc, causing spurious failures due to differences in attnums
between the leaf partition and the root table).
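
The conversion step itself is the standard tupconvert.c dance; roughly, at
each parent/child step of the descent (simplified - the patch builds and
caches the map in each PartitionDispatch rather than per tuple):

	TupleConversionMap *map;

	/* NULL map means parent and child descriptors already match */
	map = convert_tuples_by_name(RelationGetDescr(parent),
								 RelationGetDescr(partrel),
								 gettext_noop("could not convert row type"));
	if (map != NULL)
		tuple = do_convert_tuple(tuple, map);

	/* key attnums in 'tuple' now match partrel's tuple descriptor */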

Further patches 0002, 0003 and 0004 fix bugs that I sent one-big-patch for
in my previous message.  Each patch has a test for the bug it's meant to fix.

Patch 0005 is the same old "Add some more tests for tuple-routing" per [1]:

> Meanwhile, committed the latest 0001 and the elog() patch.

Thanks!

Regards,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZ86v1G%2Bzx9etMiSQaBBvYMKfU-iitqZArSh5z0n8Q4cA%40mail.gmail.com
From cac625b22990c12a537eaa4c77434017a2303b92 Mon Sep 17 00:00:00 2001
From: amit 
Date: Thu, 22 Dec 2016 15:33:41 +0900
Subject: [PATCH 1/5] Use correct tuple descriptor before and after routing a
 tuple

When we have a multi-level partitioned hierarchy where tables at
successive levels have different attnums for a partition key attribute(s),
we must use the tuple descriptor of a partitioned table of the given level
to inspect the appropriately converted representation of the input tuple
before computing the partition key to perform tuple-routing.

So, store in each PartitionDispatch object a standalone TupleTableSlot
initialized with the TupleDesc of the corresponding partitioned table,
along with a TupleConversionMap to map tuples from its parent's
rowtype to its own rowtype.

After the routing is done and a leaf partition returned, we must use the
leaf partition's tuple descriptor, not the root table's, for roughly the
same reasons as above.  For the next row and so on, we must then switch
back to the root table's tuple descriptor.  To that end, a dedicated
TupleTableSlot is allocated in EState, called es_partition_tuple_slot,
whose descriptor is set to that of the given leaf partition for every
input tuple after it's routed.

Reported by: n/a
Patch by: Amit Langote
Reports: n/a
---
 src/backend/catalog/partition.c| 74 --
 src/backend/commands/copy.c| 31 +-
 src/backend/executor/nodeModifyTable.c | 27 +
 src/include/catalog/partition.h|  7 
 src/include/nodes/execnodes.h  |  3 ++
 src/test/regress/expected/insert.out   | 37 +
 src/test/regress/sql/insert.sql| 26 
 7 files changed, 190 insertions(+), 15 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 9980582b77..fca874752f 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -923,13 +923,19 @@ RelationGetPartitionQual(Relation rel, bool recurse)
 	return generate_partition_qual(rel, recurse);
 }
 
-/* Turn an array of OIDs with N elements into a list */
-#define OID_ARRAY_TO_LIST(arr, N, list) \
+/*
+ * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
+ * append pointer rel to the list 'parents'.
+ */
+#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
 	do\
 	{\
 		int		i;\
-		for (i = 0; i < (N); i++)\
-			(list) = lappend_oid((list), (arr)[i]);\
+		for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
+		{\
+			(partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
+			(parents) = lappend((parents), (rel));\
+		}\
 	} while(0)
 
 /*
@@ -944,11 +950,13 @@ PartitionDispatch *
 RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
  int *num_parted, List **leaf_part_oids)
 {
-	PartitionDesc rootpartdesc =