[HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-05 Thread Filip Rembiałkowski
- new GUC in "Statement Behaviour" section, notify_duplicate_removal
(default true)

Initial discussion in this thread:
http://www.postgresql.org/message-id/CAP_rwwmpzk9=sbjrztod05bdctyc43wnknu_m37dygvl4sa...@mail.gmail.com

Rationale: for some legitimate use cases, duplicate removal is not
required, and it gets O(N^2) cost on large COPY/ insert transactions.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..9fb5504 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6095,6 +6095,24 @@ SET XML OPTION { DOCUMENT | CONTENT };
   
  
 
+ 
+  notify_duplicate_removal (bool)
+  
+   notify_duplicate_removal configuration parameter
+  
+  
+  
+   
+Try to remove duplicate messages while processing NOTIFY.  When
+on (the default), the server will avoid duplicate messages
+(with same channel and payload).  Setting this variable to
+off can increase performance in some situations - at the
+cost of having duplicate messages in NOTIFY queue.  See  for more information.
+   
+  
+ 
+
  
 
  
diff --git a/doc/src/sgml/ref/notify.sgml b/doc/src/sgml/ref/notify.sgml
index 4dd5608..86a9bed 100644
--- a/doc/src/sgml/ref/notify.sgml
+++ b/doc/src/sgml/ref/notify.sgml
@@ -95,16 +95,17 @@ NOTIFY channel [ , 
If the same channel name is signaled multiple times from the same
-   transaction with identical payload strings, the
-   database server can decide to deliver a single notification only.
-   On the other hand, notifications with distinct payload strings will
-   always be delivered as distinct notifications. Similarly, notifications from
-   different transactions will never get folded into one notification.
-   Except for dropping later instances of duplicate notifications,
+   transaction with identical payload strings, and
+   notify_duplicate_removal is set to true, the database server
+   can decide to deliver a single notification only.  On the other hand,
+   notifications with distinct payload strings will always be delivered as
+   distinct notifications. Similarly, notifications from different
+   transactions will never get folded into one notification.  Except for
+   dropping later instances of duplicate notifications,
NOTIFY guarantees that notifications from the same
transaction get delivered in the order they were sent.  It is also
-   guaranteed that messages from different transactions are delivered in
-   the order in which the transactions committed.
+   guaranteed that messages from different transactions are delivered in the
+   order in which the transactions committed.
   
 
   
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index c39ac3a..a7bc9f1 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -364,8 +364,9 @@ static bool amRegisteredListener = false;
 /* has this backend sent notifications in the current transaction? */
 static bool backendHasSentNotifications = false;
 
-/* GUC parameter */
+/* GUC parameters */
 bool		Trace_notify = false;
+bool		notify_duplicate_removal = true;
 
 /* local function prototypes */
 static bool asyncQueuePagePrecedes(int p, int q);
@@ -570,9 +571,12 @@ Async_Notify(const char *channel, const char *payload)
 	 errmsg("payload string too long")));
 	}
 
-	/* no point in making duplicate entries in the list ... */
-	if (AsyncExistsPendingNotify(channel, payload))
-		return;
+	if (notify_duplicate_removal)
+	{
+		/* check for duplicate entries in the list */
+		if (AsyncExistsPendingNotify(channel, payload))
+			return;
+	}
 
 	/*
 	 * The notification list needs to live until end of transaction, so store
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 83b8388..b737c29 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1617,6 +1617,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"notify_duplicate_removal", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Remove duplicate messages during NOTIFY."),
+			NULL
+		},
+		_duplicate_removal,
+		true,
+		NULL, NULL, NULL
+	},
 
 	/* End-of-list marker */
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 029114f..2831c1b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -536,6 +536,7 @@
 #xmloption = 'content'
 #gin_fuzzy_search_limit = 0
 #gin_pending_list_limit = 4MB
+#notify_duplicate_removal = on
 
 # - Locale and Formatting -
 
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index b4c13fa..c572691 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -23,6 +23,7 @@
 #define NUM_ASYNC_BUFFERS	8
 
 extern bool Trace_notify;
+extern bool notify_duplicate_removal;
 extern volatile sig_atomic_t 

Re: [HACKERS] Sanity checking for ./configure options?

2016-02-05 Thread David Fetter
On Wed, Feb 03, 2016 at 06:02:57PM -0600, Jim Nasby wrote:
> I just discovered that ./configure will happily accept '--with-pgport=' (I
> was actually doing =$PGPORT, and didn't realize $PGPORT was empty). What you
> end up with is a compile error in guc.c, with no idea why it's broken. Any
> reason not to have configure or at least make puke if pgport isn't valid?

That seems like a good idea.

I've been getting rejection to happen with phrases like

--with-pgport=${PGPORT:?}

which while it looks a little odd, only adds 4 characters to each
shell variable.

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-02-05 Thread Tomas Vondra

On 02/04/2016 09:59 AM, Michael Paquier wrote:

On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund  wrote:

On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:

And there is no actual risk of data loss


Huh?


More precise: what I mean here is that should an OS crash or a power
failure happen, we would fall back to recovery at next restart, so we
would not actually *lose* data.


Except that we actually can't perform the recovery properly because we 
may not have the last WAL segment (or multiple segments), so we can't 
replay the last batch of transactions. And we don't even notice that.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] First-draft release notes for next week's back-branch releases

2016-02-05 Thread Michael Paquier
On Sat, Feb 6, 2016 at 7:11 AM, Tom Lane  wrote:
> I've done a first pass at next week's release notes; please review.
>
> Committed at
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7008e70d105b572821406744ce080771b74c06ab
> and should be visible in the website's devel-branch docs after the next
> guaibasaurus buildfarm run, due a couple hours from now.
>
> As usual, I just dumped all the notes into one branch's release-N.N.sgml
> file, and will sort out which ones apply to which branches later.  I chose
> to put them into 9.4, though, not 9.5, since many of these issues are
> already fixed in 9.5.0 and will not need to appear in the 9.5.1 section.

+
+ 
+  Ensure that dynloader.h is included in the installed
+  header files in MSVC builds (Michael Paquier)
+ 
+

Bruce is the main author of this patch. I used what he did as a base
to build a version correct for MSVC.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL Audit Extension

2016-02-05 Thread Stephen Frost
All,

* Robert Haas (robertmh...@gmail.com) wrote:
> OK, I'll bite: I'm worried that this patch will be a maintenance
> burden.  It's easy to imagine that changes to core will result in the
> necessity or at least desirability of changes to pgaudit, but I'm
> definitely not prepared to insist that future authors try to insist
> that future patch submitters have to understand this code and update
> it as things change.

I agree with this concern in general, and have since pgaudit was first
proposed for inclusion by Simon two years ago.  Having auditing as an
extension is what makes this into an issue though.  Were auditing baked
into core, it'd be far more straight-forward, much easier to maintain,
and would be updated as we add new core capabilities naturally.  David's
comments about how pgaudit has to work are entirely accurate- everything
ends up having to be reconstructed in a very painful way.

> The set of things that the patch can audit is pretty arbitrary and not
> well tied into the core code.  

This also speaks to the difficulties of having auditing implemented as
an extension.

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Yeah.  Auditing strikes me as a fine example of something for which there
> is no *technical* reason to need to put it in core.  It might need some
> more hooks than we have now, but that's no big deal.

I disagree with this, quite strongly.  There are very good, technical,
reasons why auditing should be a core capability.  Adding more hooks,
exposing various internal functions for use by extensions, etc, doesn't
change that any extension would have to be constantly updated as new
capabilities are added to core and auditing, as a capability, would
continue to be limited by virtue of being implemented as an extension.

I don't mean to detract anything from what 2ndQ and David have done with
pgaudit (which is the only auditing solution of its kind that I'm aware
of, so I don't understand the "competing implementations" argument which
has been brought up at all- there's really only one).  They've done just
as much as each version of PG has allowed them to do.  It's a much
better solution than what we have today (which is basically just
log_statement = 'all', which no one should ever consider to be a real
auditing solution).  I've already stated that I'd be willing to take on
resonsibility for maintaining it as an extension (included with PG or
not), but it's not the end-all, be-all of auditing for PG and we should
be continuing to look at implementing a full in-core auditing solution.

I'm still of the opinion that we should include pgaudit for the reason
that it's a good incremental step towards proper in-core auditing, but
there's clearly no consensus on that and doesn't appear that further
discussion is likely to change that.

An in-core auditing solution would provide us with proper grammar
support, ability to directly mark objects for auditing in the catalog,
allow us to much more easily maintain auditing capability over time as
a small incremental bit of work for each new feature (with proper
in-core infrastructure for it) and generally be a far better technical
solution.  Leveraging the GRANT system is quite cute, and does work, but
it's certainly non-intuitive and is only because we've got no better
way, due to it being implemented as an extension.

Looking at pgaudit and the other approaches to auditing which have been
developed (eg: applications which sit in front of PG and essentially
have to reimplement large bits of PG to then audit the commands sent
before passing them to PG, or hacks which try to make sense out of log
files full of SQL statements) make it quite clear, in my view, that
attempts to bolt-on auditing to PG result in a poorer solution, from a
technical perspective, than what this project is known for and capable
of.  To make true progress towards that, however, we need to get past
the thinking that auditing doesn't need to be in-core or that it should
be a second-class citizen feature or that we don't need it in PG.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] PostgreSQL Audit Extension

2016-02-05 Thread Stephen Frost
* Joe Conway (m...@joeconway.com) wrote:
> On 02/05/2016 10:16 AM, Stephen Frost wrote:
> > An in-core auditing solution would provide us with proper grammar
> > support, ability to directly mark objects for auditing in the catalog,
> > allow us to much more easily maintain auditing capability over time as
> > a small incremental bit of work for each new feature (with proper
> > in-core infrastructure for it) and generally be a far better technical
> > solution.  Leveraging the GRANT system is quite cute, and does work, but
> > it's certainly non-intuitive and is only because we've got no better
> > way, due to it being implemented as an extension.
> 
> I think one additional item needed would be the ability for the audit
> logs to be sent to a different location than the standard logs.

Indeed, reworking the logging to be supportive of multiple destinations
with tagging of the source, etc, has long been a desire of mine (and
others), though that's largely independent of auditing itself.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Using quicksort for every external sort run

2016-02-05 Thread Robert Haas
On Thu, Feb 4, 2016 at 6:14 AM, Peter Geoghegan  wrote:
> The economics of using 4MB or even 20MB to sort 10GB of data are
> already preposterously bad for everyone that runs a database server,
> no matter how budget conscious they may be. I can reluctantly accept
> that we need to still use a heap with very low work_mem settings to
> avoid the risk of a regression (in the event of a strong correlation)
> on general principle, but I'm well justified in proposing "just don't
> do that" as the best practical advice.
>
> I thought I had your agreement on that point, Robert; is that actually the 
> case?

Peter and I spent a few hours talking on Skype this morning about this
point and I believe we have agreed on an algorithm that I think will
address all of my concerns and hopefully also be acceptable to him.
Peter, please weigh in and let me know if I've gotten anything
incorrect here or if you think of other concerns afterwards.

The basic idea is that we will add a new GUC with a name like
replacement_sort_mem that will have a default value in the range of
20-30MB; or possibly we will hardcode this value, but for purposes of
this email I'm going to assume it's a GUC.  If the value of work_mem
or maintenance_work_mem, whichever applies, is smaller than the value
of replacement_sort_mem, then the latter has no effect.  However, if
replacement_sort_mem is the smaller value, then the amount of memory
that can be used for a heap with replacement selection is limited to
replacement_sort_mem: we can use more memory than that in total for
the sort, but the amount that can be used for a heap is restricted to
that value.  The way we do this is explained in more detail below.
One thing I just thought of (after the call) is that it might be
better for this GUC to be in units of tuples rather than in units of
memory; it's not clear to me why the optimal heap size should be
dependent on the tuple size, so we could have a threshold like 300,000
tuples or whatever.   But that's a secondary issue and I might be
wrong about it: the point is that in order to have a chance of
winning, a heap used for replacement selection needs to be not very
big at all by the standards of modern hardware, so the plan is to
limit it to a size at which it may have a chance.

Here's how that will work, assuming Peter and I understand each other:

1. We start reading the input data.  If we reach the end of the input
data before (maintenance_)work_mem is exhausted, then we can simply
quicksort the data and we're done.  This is no different than what we
already do today.

2. If (maintenance_)work_mem fills up completely, we will quicksort
all of the data we have in memory.  We will then regard the tail end
of that sorted data, in an amount governed by replacement_sort_mem, as
a heap, and use it to perform replacement selection until no tuples
remain for the current run.  Meanwhile, the rest of the sorted data
remains in memory untouched.  Logically, we're constructing a run of
tuples which is split between memory and disk: the head of the run
(what fits in all of (maintenance_)work_mem except for
replacement_sort_mem) is in memory, and the tail of the run is on
disk.

3. If we reach the end of input before replacement selection runs out
of tuples for the current run, and if it finds no tuples for the next
run prior to that time, then we are done.  All of the tuples form a
single run and we can return the tuples in memory first followed by
the tuples on disk.  This case is highly likely to be a huge win over
what we have today, because (a) some portion of the tuples were sorted
via quicksort rather than heapsort and that's faster, (b) the tuples
that were sorted using a heap were sorted using a small heap rather
than a big one, and (c) we only wrote out the minimal number of tuples
to tape instead of, as we would have done today, all of them.

4. If we reach this step, then replacement selection with a small heap
wasn't able to sort the input in a single run.  We have a bunch of
sorted data in memory which is the head of the same run whose tail is
already on disk; we now spill all of these tuples to disk.  That
leaves only the heapified tuples in memory.  We just ignore the fact
that they are a heap and treat them as unsorted.  We repeatedly do the
following: read tuples until work_mem is full, sort them, and dump the
result to disk as a run.  When all runs have been created, we merge
runs just as we do today.

This algorithm seems very likely to beat what we do today in
practically all cases.  The benchmarking Peter and others have already
done shows that building runs with quicksort rather than replacement
selection can often win even if the larger number of tapes requires a
multi-pass merge.  The only cases where it didn't seem to be a clear
win involved data that was already in sorted order, or very close to
it.  But with this algorithm, presorted input is fine: we'll quicksort
some of it (which is faster than replacement 

Re: [HACKERS] Development with Eclipse - Wrong error messages in IDE

2016-02-05 Thread Jason Petersen
> On Feb 3, 2016, at 2:38 AM, Peter Moser  wrote:
> 
> Does anyone had similar problems? Do I have to configure Eclipse to 
> understand the PG_RMGR macro or is there another possibility to teach Eclipse 
> these macros?

I just built 9.6 under Eclipse CDT to try this out and was able to open e.g. 
heapam.c without any error markers.

I added PostgreSQL as a “Makefile Project with Existing Code” after running 
./configure from the command-line. After that, I built the project from within 
Eclipse by adding the ‘all’ make target and running it.

One setting I usually change: right-click the project, pick Properties, then 
drill down through C/C++ General -> Preprocessor Include Paths. In the Provider 
pane, there is an entry for “CDT GCC Build Output Parser”. I’m not sure if this 
is strictly necessary, but I set my “Container to keep discovered entries” 
setting to “File”.

Basically, Eclipse scans the make output for -I flags, then notes all the 
includes used to build each file, so the static analyzer, etc. can have more 
accurate information (it is crucial that the “Compiler command pattern” in this 
window be a regex that will match the compiler binary you use, so if you have 
/usr/local/bin/gcc, and “gcc” is the pattern, you are in for trouble).

After running the build, Eclipse should now know what includes are used for 
each file and stop whining. If it ever seems to have problems, you can kick it 
by running a clean target, then all, then picking “Project -> C/C++ Index -> 
Rebuild” (I think).

--
Jason Petersen
Software Engineer | Citus Data
303.736.9255
ja...@citusdata.com



signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [HACKERS] PostgreSQL Audit Extension

2016-02-05 Thread Joe Conway
On 02/05/2016 10:16 AM, Stephen Frost wrote:
> An in-core auditing solution would provide us with proper grammar
> support, ability to directly mark objects for auditing in the catalog,
> allow us to much more easily maintain auditing capability over time as
> a small incremental bit of work for each new feature (with proper
> in-core infrastructure for it) and generally be a far better technical
> solution.  Leveraging the GRANT system is quite cute, and does work, but
> it's certainly non-intuitive and is only because we've got no better
> way, due to it being implemented as an extension.

I think one additional item needed would be the ability for the audit
logs to be sent to a different location than the standard logs.

> To make true progress towards that, however, we need to get past
> the thinking that auditing doesn't need to be in-core or that it should
> be a second-class citizen feature or that we don't need it in PG.

+1

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Development with Eclipse - Wrong error messages in IDE

2016-02-05 Thread Peter Moser

Peter Moser wrote:


I have some strange error message inside Eclipse, that some symbols cannot
be found. I work with version 9.6 currently. For instance,

Symbol 'RM_HEAP_ID' could not be resolved
src/backend/access/heap/heapam.c

It affects all occurrences of symbols that are defined in
src/include/access/rmgrlist.h. Eclipse just says "Syntax error" here.

However, the source code compiles and runs without any compile-time error or
warning. It is just an IDE problem I think, but it distracts me from finding
real bugs.


Disclaimer: I've never used eclipse.

The problem is some perhaps-too-clever stuff we do to avoid repetitive
declarations of things.  The rmgr stuff uses a PG_RMGR macro, which is
defined differently in src/backend/access/transam/rmgr.c and
src/include/access/rmgr.h; the latter contains the actual enum
definition.  On the other hand Eclipse is trying to be too clever by
processing the C files, but not actually getting it completely right
(which is understandable, really).  We have other similar cases, such as
grammar keywords (kwlist.h)

I'm afraid that you'd have to teach Eclipse to deal with such things
(which might be tricky) or live with it.



Ok,
thank you for the comment.

I think, I can live with it.

Perhaps, when I have some spare time I give it a try to solve this 
"non-issue"...


Cheers,
Peter


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2016-02-05 Thread pokurev
Hello,

Please find attached updated patch.
>The point of having pgstat_report_progress_update_counter() is so that 
>you can efficiently update a single counter without having to update 
>everything, when only one counter has changed.  But here you are 
>calling this function a whole bunch of times in a row, which 
>completely misses the point - if you are updating all the counters, 
>it's more efficient to use an interface that does them all at once 
>instead of one at a time.

The pgstat_report_progress_update_counter() is called at appropriate places in 
the attached patch.

>So I've spent a fair amount of time debugging really-long-running 
>VACUUM processes with customers, and generally what I really want to 
>know is:
 What block number are we at? <<<

Agreed. The attached patch reported current block number.

Regards,
Vinayak


Vacuum_progress_checker_v11.patch
Description: Vacuum_progress_checker_v11.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-05 Thread Michael Paquier
On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier
 wrote:
> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
>  wrote:
>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas  wrote:
>>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
>>>  wrote:
 Yes, please let's use the custom language, and let's not care of not
 more than 1 level of nesting so as it is possible to represent
 pg_stat_replication in a simple way for the user.
>>>
>>> "not" is used twice in this sentence in a way that renders me not able
>>> to be sure that I'm not understanding it not properly.
>>
>> 4 times here. Score beaten.
>>
>> Sorry. Perhaps I am tired... I was just wondering if it would be fine
>> to only support configurations up to one level of nested objects, like
>> that:
>> 2[node1, node2, node3]
>> node1, 2[node2, node3], node3
>> In short, we could restrict things so as we cannot define a group of
>> nodes within an existing group.
>
> No, actually, that's stupid. Having up to two nested levels makes more
> sense, a quite common case for this feature being something like that:
> 2{node1,[node2,node3]}
> In short, sync confirmation is waited from node1 and (node2 or node3).
>
> Flattening groups of nodes with a new catalog will be necessary to
> ease the view of this data to users:
> - group name?
> - array of members with nodes/groups
> - group type: quorum or priority
> - number of items to wait for in this group

So, here are some thoughts to make that more user-friendly. I think
that the critical issue here is to properly flatten the meta data in
the custom language and represent it properly in a new catalog,
without messing up too much with the existing pg_stat_replication that
people are now used to for 5 releases since 9.0. So, I would think
that we will need to have a new catalog, say
pg_stat_replication_groups with the following things:
- One line of this catalog represents the status of a group or of a single node.
- The status of a node/group is either sync or potential, if a
node/group is specified more than once, it may be possible that it
would be sync and potential depending on where it is defined, in which
case setting its status to 'sync' has the most sense. If it is in sync
state I guess.
- Move sync_priority and sync_state, actually an equivalent from
pg_stat_replication into this new catalog, because those represent the
status of a node or group of nodes.
- group name, and by that I think that we had perhaps better make
mandatory the need to append a name with a quorum or priority group.
The group at the highest level is forcibly named as 'top', 'main', or
whatever if not directly specified by the user. If the entry is
directly a node, use the application_name.
- Type of group, quorum or priority
- Elements in this group, an element can be a group name or a node
name, aka application_name. If group is of type priority, the elements
are listed in increasing order. So the elements with lower priority
get first, etc. We could have one column listing explicitly a list of
integers that map with the elements of a group but it does not seem
worth it, what users would like to know is what are the nodes that are
prioritized. This covers the former 'priority' field of
pg_stat_replication.

We may have a good idea of how to define a custom language, still we
are going to need to design a clean interface at catalog level more or
less close to what is written here. If we can get a clean interface,
the custom language implemented, and TAP tests that take advantage of
this user interface to check the node/group statuses, I guess that we
would be in good shape for this patch.

Anyway that's not a small project, and perhaps I am over-complicating
the whole thing.

Thoughts?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] IF (NOT) EXISTS in psql-completion

2016-02-05 Thread Kyotaro HORIGUCHI
Hello,

I considered how to make tab-completion robust for syntactical
noises, in other words, optional words in syntax. Typically "IF
(NOT) EXISTS", UNIQUE and TEMPORARY are words that don't affect
further completion. However, the current delimit-matching
mechanism is not so capable (or is complexty-prone) to live with
such noises. I have proposed to use regular expressions or
simplified one for the robustness but it was too complex to be
applied.

This is another answer for the problem. Removal of such words
on-the-fly makes further matching more robust.

Next, currently some CREATE xxx subsyntaxes of CREATE SCHEMA are
matched using TailMatching but it makes difficult the
options-removal operations, which needs forward matching.

So I introduced two things to resolve them by this patch.

1. HEAD_SHIFT macro

  It shifts the beginning index in previous_words for *MatchN
  macros. When the varialbe is 3 and previous_words is as
  following,

  {"NOT", "IF", "CONCURRENTLY", "INDEX", "UNIQUE", "CREATE"}

  Match3("CONCURRENTLY", "IF", "NOT") reutrns true. HeadMatches
  and TailMatches works on the same basis. This allows us to
  match "CREATE xxx" subsyntaxes of CREATE SCHEMA
  independently. SHIFT_TO_LAST1() macro finds the last appearance
  of specified word and HEAD_SHIFT to there if found.

2. MidMatchAndRemoveN() macros

  These macros remove specified words starts from specfied number
  of words after the current beginning. Having head_shift = 0 and
  the following previous_words,

  {"i_t1_a", "EXISTS", "IF", "INDEX", "DROP"}

  MidMatchAndRemove2(2, "IF", "EXISTS") leaves the following
  previous_words.

  {"i_t1_a", "INDEX", "DROP"}


Using these things, the patch as whole does the following things.

A. Allows "IF (NOT) EXISTS" at almost everywhere it allowed
   syntactically.

   The boilerplate is like the following,

   | else if (MatchesN(words before IF EXISTS))
   |  COMPLETE_WITH_XXX(... "UNION SELECT 'IF EXISTS'");
   | else if (HeadMatchesN(words before "IF EXISTS") &&
   |  MidMatchAndRemoveM(N, words to be removed) &&
   |  MatchesN(words before "IF EXISTS"))
   |  COMPLETE_WITH_();

   The first "else if" makes suggestion for the 'thing' with "IF
   EXISTS".

   The MidMatchAndRemoveM in the second "else if" actually
   removes "IF EXISTS" if match and the third MatchesN or all
   matching macros ever after don't see the removed words.  So
   they can make a suggestion without seeing such noises.

   This looks a bit hackery but works well in the current
   framework.


B. Masks the part in CREATE SCHEMA unrelated to the last CREATE
  subsyntax currently focused on.

   |else if (HeadMatches2("CREATE", "SCHEMA") &&
   | SHIFT_TO_LAST1("CREATE") &&
   | false) {} /* FALL THROUGH */

   The result of this, for the query like this,

CREATE SCHEMA foo bar baz CREATE foe fee CREATE hoge hage

   all the following part of psql_completion works as if the
   current query is just "CREATE hoge hage".


Does anybody have suggestions, opinions or objections?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 82928996a0887882212d05aca3406a3bd3cf22e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 5 Feb 2016 16:50:35 +0900
Subject: [PATCH] Suggest IF (NOT) EXISTS for tab-completion of psql

This patch lets psql to suggest "IF (NOT) EXISTS". Addition to that,
since this patch introduces some mechanism for syntactical robustness,
it allows psql completion to omit some optional part on matching.
---
 src/bin/psql/tab-complete.c | 559 
 1 file changed, 463 insertions(+), 96 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5f27120..d95698a 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -656,6 +656,10 @@ static const SchemaQuery Query_for_list_of_matviews = {
 "   FROM pg_catalog.pg_roles "\
 "  WHERE substring(pg_catalog.quote_ident(rolname),1,%d)='%s'"
 
+#define Query_for_list_of_rules \
+"SELECT pg_catalog.quote_ident(rulename) FROM pg_catalog.pg_rules "\
+" WHERE substring(pg_catalog.quote_ident(rulename),1,%d)='%s'"
+
 #define Query_for_list_of_grant_roles \
 " SELECT pg_catalog.quote_ident(rolname) "\
 "   FROM pg_catalog.pg_roles "\
@@ -763,6 +767,11 @@ static const SchemaQuery Query_for_list_of_matviews = {
 "SELECT pg_catalog.quote_ident(tmplname) FROM pg_catalog.pg_ts_template "\
 " WHERE substring(pg_catalog.quote_ident(tmplname),1,%d)='%s'"
 
+#define Query_for_list_of_triggers \
+"SELECT pg_catalog.quote_ident(tgname) FROM pg_catalog.pg_trigger "\
+" WHERE substring(pg_catalog.quote_ident(tgname),1,%d)='%s' AND "\
+"   NOT tgisinternal"
+
 #define Query_for_list_of_fdws \
 " SELECT pg_catalog.quote_ident(fdwname) "\
 "   FROM pg_catalog.pg_foreign_data_wrapper "\
@@ -906,7 +915,7 @@ static const pgsql_thing_t words_after_create[] = {
 	{"PARSER", 

Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-05 Thread Etsuro Fujita

On 2016/02/04 21:42, Ashutosh Bapat wrote:


* Is it safe to replace outerjoinpath with its fdw_outerpath the
following way?  I think that if the join relation represented by
outerjoinpath has local conditions that can't be executed remotely,
we have to keep outerjoinpath in the path tree; we will otherwise
fail to execute the local conditions.  No?

+   /*
+* If either inner or outer path is a
ForeignPath corresponding to
+* a pushed down join, replace it with the
fdw_outerpath, so that we
+* maintain path for EPQ checks built
entirely of local join
+* strategies.
+*/
+   if (IsA(joinpath->outerjoinpath, ForeignPath))
+   {
+   ForeignPath *foreign_path;
+   foreign_path = (ForeignPath
*)joinpath->outerjoinpath;
+   if
(foreign_path->path.parent->reloptkind == RELOPT_JOINREL)
+   joinpath->outerjoinpath =
foreign_path->fdw_outerpath;
+   }



all the conditions (local and remote) should be part of fdw_outerpath as
well, since that's the alternate local path, which should produce (when
converted to the plan) the same result as the foreign path.
fdw_outerpath should be a local path set when paths for
outerjoinpath->parent was being created. Am I missing something?


I assumed by mistake that only the remote conditions were evaluated in a 
plan created from each fdw_outerpath.  Sorry for that.  I think that is 
a good idea!


Btw, IIUC, I think the patch fails to adjust the targetlist of the top 
plan created that way, to output the fdw_scan_tlist, as discussed in [1] 
(ie, I think the attached patch is needed, which is created on top of 
your patch pg_fdw_join_v8.patch).


Best regards,
Etsuro Fujita

[1] 
http://www.postgresql.org/message-id/ca+tgmoba4mskgquicgt5ckbpqj-tmpqefht_wy49ndwa91w...@mail.gmail.com
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***
*** 1067,1072  postgresGetForeignPlan(PlannerInfo *root,
--- 1067,1076 
  
  		/* Build the list of columns to be fetched from the foreign server. */
  		fdw_scan_tlist = build_tlist_to_deparse(foreignrel);
+ 
+ 		/* Replace the targetlist of outer_plan with fdw_scan_tlist, if any */
+ 		if (outer_plan)
+ 			outer_plan->targetlist = fdw_scan_tlist;
  	}
  
  	/*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] First-draft release notes for next week's back-branch releases

2016-02-05 Thread Tom Lane
I've done a first pass at next week's release notes; please review.

Committed at
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7008e70d105b572821406744ce080771b74c06ab
and should be visible in the website's devel-branch docs after the next
guaibasaurus buildfarm run, due a couple hours from now.

As usual, I just dumped all the notes into one branch's release-N.N.sgml
file, and will sort out which ones apply to which branches later.  I chose
to put them into 9.4, though, not 9.5, since many of these issues are
already fixed in 9.5.0 and will not need to appear in the 9.5.1 section.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL Audit Extension

2016-02-05 Thread Robert Haas
On Fri, Feb 5, 2016 at 1:16 PM, Stephen Frost  wrote:
> An in-core auditing solution would provide us with proper grammar
> support, ability to directly mark objects for auditing in the catalog,
> allow us to much more easily maintain auditing capability over time as
> a small incremental bit of work for each new feature (with proper
> in-core infrastructure for it) and generally be a far better technical
> solution.  Leveraging the GRANT system is quite cute, and does work, but
> it's certainly non-intuitive and is only because we've got no better
> way, due to it being implemented as an extension.

Yeah, I agree.  Let me clarify my position: I'm not opposed to in-core
auditing.  I'm also not in favor of it.  If somebody makes a proposal
in that area, I'll probably have an opinion on it, but until then I
don't.  What I'm opposed to is shipping this in core rather than
letting it continue to exist outside core.  I see no benefit to that
and some possible downsides that I think justify rejecting it.  Of
course other people are entitled to their own opinions; what I'm
saying is: that's my opinion.  If somebody came along with a proposal
to put auditing in core that offered the advantages you cite here,
none of my objections to this would be objections to that.  Would I
have other objections?  Maybe, but not necessarily.

For example, one thing that occurs to me is that, at least in some
cases, we've got a built-in way of finding out all of the objects that
a query touches: the list of PlanInvalItems.  Is it a clever idea to
try to drive auditing off that list, or a terrible idea?  I'm not
sure, but clearly if it's a good idea that would make this largely
self-maintaining, which IMHO completely changes the calculus about
whether it's worth doing.  Also, we've got quite a few
ObjectAddress-related facilities in the core system now that know how
to do various kinds of things with any sort of SQL-accessible object
in the system.  Leveraging that infrastructure seems like a big plus.
I'd be much more likely to support a proposal that does this stuff in
a generic way that touches every SQL object type, and contains little
or no hard-wired information about particular SQL object types in the
auditing code itself.  I'm not blind to the fact that auditing
relation access is more valuable than auditing text search parser
access, but it's harder to predict what we might think about the next
three SQL object types we add, and I don't want the burden to be on
the people who add that stuff to teach auditing about it if somebody
wants that.  Rather, I think the auditing system should know about
everything we've got, and which stuff actually gets audited should be
a matter of configuration.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-05 Thread Robert Haas
On Fri, Feb 5, 2016 at 4:23 AM, Ashutosh Bapat
 wrote:
> I have corrected this in this set of patches. Also, I have included the
> change to build the join relation description while constructing fpinfo in
> the main patch since that avoids repeated building of the same at a small
> cost of constructing relation name for base relations, which goes waste if
> that relation is not going to be part of any pushable join tree.
>
> Ran pgindent as well.

pg_fdw_join_v9.patch does not aplpy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-05 Thread Robert Haas
On Fri, Feb 5, 2016 at 9:09 AM, Ashutosh Bapat
 wrote:
>> Would it be nuts to set fdw_scan_tlist in all cases?  Would the code
>> come out simpler than what we have now?
>
> PFA the patch that can be applied on v9 patches.
>
> If there is a whole-row reference for base relation, instead of adding that
> as an additional column deparseTargetList() creates a list of all the
> attributes of that foreign table . The whole-row reference gets constructed
> during projection. This saves some network bandwidth while transferring the
> data from the foreign server. If we build the target list for base relation,
> we should include Vars for all the columns (like deparseTargetList). Thus we
> borrow some code from deparseTargetList to get all the attributes of a
> relation. I included this logic in function build_tlist_from_attrs_used(),
> which is being called by build_tlist_to_deparse(). So, before calling
> deparseSelectStmtForRel() the callers can just call build_tlist_to_deparse()
> and pass the targetlist to deparseSelectStmtForRel() and use the same to be
> handed over to the core code as fdw_scan_tlist. build_tlist_to_deparse()
> also constructs retrieved_attrs list, so that doesn't need to be passed
> around in deparse routines.
>
> But we now have similar code in three places deparseTargetList(),
> deparseAnalyzeSql() and build_tlist_from_attrs_used(). So, I re-wrote
> deparseTargetList() (which is now used to deparse returning list) to call
> build_tlist_from_attrs_used() followed by deparseExplicitTargetList(). The
> later and its minion deparseVar requires a deparse_expr_cxt to be passed.
> deparse_expr_cxt has a member to store RelOptInfo of the relation being
> deparsed. The callers of deparseReturningList() do not have it and thus
> deparse_expr_cxt required changes, which in turn required changes in other
> places. All in all, a larger refactoring. It touches more places than
> necessary for foreign join push down. So, I think, we should try that as a
> separate refactoring work. We may just do the work described in the first
> paragraph for join pushdown, but we will then be left with duplicate code,
> which I don't think is worth the output.

Yeah, I'm not convinced this is actually simpler; at first look, it
seems like it is just moving the complexity around.  I don't like the
fact that there are so many places where we have to do one thing for a
baserel and something totally different for a joinrel, which was the
point of my comment.  But this seems to add one instance of that and
remove one instance of that, so I don't see how we're coming out
better than a wash.  Maybe it's got more merit than I'm giving it
credit for, and I'm just not seeing it right at the moment...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-05 Thread Ashutosh Bapat
> Btw, IIUC, I think the patch fails to adjust the targetlist of the top
> plan created that way, to output the fdw_scan_tlist, as discussed in [1]
> (ie, I think the attached patch is needed, which is created on top of your
> patch pg_fdw_join_v8.patch).
>
>
fdw_scan_tlist represents the output fetched from the foreign server and is
not necessarily the output of ForeignScan. ForeignScan node's output is
represented by tlist argument to.

1119 return make_foreignscan(tlist,
1120 local_exprs,
1121 scan_relid,
1122 params_list,
1123 fdw_private,
1124 fdw_scan_tlist,
1125 remote_exprs,
1126 outer_plan);

This tlist is built using build_path_tlist() for all join plans. IIUC, all
of them output the same targetlist. We don't need to make sure that
targetlist match as long as we are using the targetlist passed in by
create_scan_plan(). Do you have a counter example?
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-02-05 Thread Michael Paquier
On Thu, Feb 4, 2016 at 2:34 PM, Michael Paquier
 wrote:
> On Thu, Feb 4, 2016 at 12:02 PM, Michael Paquier
>  wrote:
>> On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote:
>>> Not wrong, and this leads to the following:
>>> void rename_safe(const char *old, const char *new, bool isdir, int elevel);
>>> Controlling elevel is necessary per the multiple code paths that would
>>> use it. Some use ERROR, most of them FATAL, and a bit of WARNING. Does
>>> that look fine?
>>
>> After really coding it, I finished with the following thing:
>> +int
>> +rename_safe(const char *old, const char *new)
>>
>> There is no need to extend that for directories, well we could of
>> course but all the renames happen on files so I see no need to make
>> that more complicated. More refactoring of the other rename() calls
>> could be done as well by extending rename_safe() with a flag to fsync
>> the data within a critical section, particularly for the replication
>> slot code. I have let that out to not complicate more the patch.
>
> Andres just poked me (2m far from each other now) regarding the fact
> that we should fsync even after the link() calls when
> HAVE_WORKING_LINK is used. So we could lose some meta data here. Mea
> culpa. And the patch misses that.

So, attached is an updated patch that adds a new routine link_safe()
to ensure that a hard link is on-disk persistent. link() is called
twice in timeline.c and once in xlog.c, so those three code paths are
impacted. I noticed as well that my previous patch was sometimes doing
palloc calls in a critical section (oops), I fixed that on the way.

Thoughts welcome.
-- 
Michael


xlog-fsync-v5.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-05 Thread Joshua Berkus

> We may have a good idea of how to define a custom language, still we
> are going to need to design a clean interface at catalog level more or
> less close to what is written here. If we can get a clean interface,
> the custom language implemented, and TAP tests that take advantage of
> this user interface to check the node/group statuses, I guess that we
> would be in good shape for this patch.
> 
> Anyway that's not a small project, and perhaps I am over-complicating
> the whole thing.

Yes.  The more I look at this, the worse the idea of custom syntax looks.  Yes, 
I realize there are drawbacks to using JSON, but this is worse.

Further, there's a lot of horse-cart inversion here.  This proposal involves 
letting the syntax for sync_list configuration determine the feature set for 
N-sync.  That's backwards; we should decide the total list of features we want 
to support, and then adopt a syntax which will make it possible to have them.

-- 
Josh Berkus
Red Hat OSAS
(opinions are my own)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-05 Thread Michael Paquier
On Fri, Feb 5, 2016 at 12:19 PM, Masahiko Sawada  wrote:
> On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier
>  wrote:
> I agree with adding new system catalog to easily checking replication
> status for user. And group name will needed for this.
> What about adding group name with ":" to immediately after set of
> standbys like follows?

This way is fine for me.

> 2[local, 2[london1, london2, london3]:london, (tokyo1, tokyo2):tokyo]
>
> Also, regarding sync replication according to configuration, the view
> I'm thinking is following definition.
>
> =# \d pg_synchronous_replication
>  Column  |  Type   | Modifiers
> -+---+---
>  name| text  |
>  sync_type | text  |
>  wait_num  | integer  |
>  sync_priority | inteter   |
>  sync_state| text  |
>  member| text[] |
>  level | integer  |
>  write_location| pg_lsn  |
>  flush_location| pg_lsn  |
>  apply_location   | pg_lsn   |
>
> - "name" : node name or group name, or "main" meaning top level node.

Check.

> - "sync_type" : 'priority' or 'quorum' for group node, otherwise NULL.

That would be one or the other.

> - "wait_num" : number of nodes/groups to wait for in this group.

Check. This is taken directly from the meta data.

> - "sync_priority" : priority of node/group in this group. "main" node has "0".
>   - the standby is in quorum group always has
> priority 1.
>   - the standby is in priority group has
> priority according to definition order.

This is a bit confusing if the same node or group in in multiple
groups. My previous suggestion was to list the elements of the group
in increasing order of priority. That's an important point.

> - "sync_state" : 'sync' or 'potential' or 'quorum'.
>  - the standby is in quorum group is always 'quorum'.
>  - the standby is in priority group is 'sync'
> / 'potential'.

potential and quorum are the same thing, no? The only difference is
based on the group type here.

> - "member" : array of members for group node, otherwise NULL.

This can be NULL only when the entry is a node.

> - "level" : nested level. "main" node is level 0.

Not sure this one is necessary.

> - "write/flush/apply_location" : group/node calculated LSN according
> to configuration.

This does not need to be part of this catalog, that's a representation
of the data that is part of the WAL sender.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count_nulls(VARIADIC "any")

2016-02-05 Thread Marko Tiikkaja

On 2016-02-05 05:06, Tom Lane wrote:

I wrote:

Pavel Stehule  writes:

[ num_nulls_v6.patch ]



I started looking through this.  It seems generally okay, but I'm not
very pleased with the function name "num_notnulls".  I think it would
be better as "num_nonnulls", as I see Oleksandr suggested already.


Not hearing any complaints, I pushed it with that change and some other
cosmetic adjustments.


Thanks Tom and Pavel and everyone who provided feedback.


.m


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2016-02-05 Thread Rushabh Lathia
On Wed, Feb 3, 2016 at 3:31 PM, Etsuro Fujita 
wrote:

> On 2016/01/28 15:20, Rushabh Lathia wrote:
>
>> On Thu, Jan 28, 2016 at 11:33 AM, Etsuro Fujita
>> > wrote:
>>
>> On 2016/01/27 21:23, Rushabh Lathia wrote:
>>
>> If I understood correctly, above documentation means, that if
>> FDW have
>> DMLPushdown APIs that is enough. But in reality thats not the
>> case, we
>> need  ExecForeignInsert, ExecForeignUpdate, or ExecForeignDelete
>> in case
>> DML is not pushable.
>>
>> And here fact is DMLPushdown APIs are optional for FDW, so that
>> if FDW
>> don't have DMLPushdown APIs they can still very well perform the
>> DML
>> operations using ExecForeignInsert, ExecForeignUpdate, or
>> ExecForeignDelete.
>>
>
> So documentation should be like:
>>
>> If the IsForeignRelUpdatable pointer is set to NULL, foreign
>> tables are
>> assumed to be insertable, updatable, or deletable if the FDW
>> provides
>> ExecForeignInsert, ExecForeignUpdate, or ExecForeignDelete
>> respectively,
>>
>> If FDW provides DMLPushdown APIs and the DML are pushable to the
>> foreign
>> server, then FDW still needs ExecForeignInsert,
>> ExecForeignUpdate, or
>> ExecForeignDelete for the non-pushable DML operation.
>>
>> What's your opinion ?
>>
>
> I agree that we should add this to the documentation, too.
>>
>
> I added docs to the IsForeignRelUpdatable documentation.  Also, a brief
> introductory remark has been added at the beginning of the DML pushdown
> APIs' documentation.
>
> BTW, if I understand correctly, I think we should also modify
>> relation_is_updatabale() accordingly.  Am I right?
>>
>
> Yep, we need to modify relation_is_updatable().
>>
>
> I thought I'd modify that function in the same way as
> CheckValidResultRel(), but I noticed that we cannot do that, because we
> don't have any information on whether each update is pushed down to the
> remote server by PlanDMLPushdown, during relation_is_updatabale().


Sorry I didn't get you here. Can't resultRelInfo->ri_FdwPushdown gives
information update whether update is pushed down safe or not ? What my
concern here is, lets say resultRelInfo->ri_FdwPushdown marked as true
(PlanDMLPushdown return true), but later into CheckValidResultRel() it
found out that missing BeginDMLPushdown, IterateDMLPushdown and
EndDMLPushdown APIs and it will end up with error.

What I think CheckValidResultRel() should do is, if
resultRelInfo->ri_FdwPushdown is true then check for the DMLPushdown API
and if it doesn't find those API then check for traditional APIs
(ExecForeignInsert, ExecForeignUpdate or ExecForeignDelete). And when it
doesn't find both then it should return an error.

I changed CheckValidResultRel(), where

1) Don't throw an error if resultRelInfo->ri_FdwPushdown is true and
DMLPushdown APIs are missing as query can still perform operation with
traditional ExecForeign APIs. So in this situation just marked
resultRelInfo->ri_FdwPushdown to false.

(Wondering can we add the checks for DMLPushdown APIs into PlanDMLPushdown
as additional check? Means PlanDMLPushdown should return true only if FDW
provides the BeginDMLPushdown & IterateDMLPushdown & EndDMLPushdown APIs ?
What you say ?)

2) Don't throw an error if resultRelInfo->ri_FdwPushdown is true and
DMLPushdown APIs is present but ExecForeign APIs are missing.
3) Throw an error if resultRelInfo->ri_FdwPushdown is false and ExecForeign
APIs are missing.

Attaching is the WIP patch here, do share your thought.
(need to apply on top of V6 patch)


So, I left that function as-is.  relation_is_updatabale() is just used for
> display in the information_schema views, so ISTM that that function is fine
> as-is.  (As for CheckValidResultRel(), I revised it so as to check the
> presence of DML pushdown APIs after checking the existing APIs if the given
> command will be pushed down.  The reason is because we assume the presence
> of the existing APIs, anyway.)
>
>
I revised other docs and some comments, mostly for consistency.
>
> Attached is an updated version of the patch, which has been created on top
> of the updated version of the bugfix patch posted by Robert in [1]
> (attached).
>
>
> Best regards,
> Etsuro Fujita
>
> [1]
> http://www.postgresql.org/message-id/ca+tgmoz40j2uc5ac1nxu03oj4crvolks15xx+ptfp-1u-8z...@mail.gmail.com
>



-- 
Rushabh Lathia
diff --git a/src/backend/executor/.execMain.c.swp b/src/backend/executor/.execMain.c.swp
index 6f54fd9..8a29cc6 100644
Binary files a/src/backend/executor/.execMain.c.swp and b/src/backend/executor/.execMain.c.swp differ
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index eb24051..04d90ea 100644
--- a/src/backend/executor/execMain.c
+++ 

Re: [HACKERS] Relation extension scalability

2016-02-05 Thread Amit Kapila
On Tue, Feb 2, 2016 at 9:19 PM, Andres Freund  wrote:
>
> On 2016-01-28 16:53:08 +0530, Dilip Kumar wrote:
> > test_script:
> > 
> > ./psql -d postgres -c "truncate table data"
> > ./psql -d postgres -c "checkpoint"
> > ./pgbench -f copy_script -T 120 -c$ -j$ postgres
> >
> > Shared Buffer48GB
> > Table:Unlogged Table
> > ench -c$ -j$ -f -M Prepared postgres
> >
> > ClientsBasepatch
> > 1178   180
> > 2337   338
> > 4265   601
> > 8167   805
>
> Could you also measure how this behaves for an INSERT instead of a COPY
> workload?
>

I think such a test will be useful.

>
> I'm doubtful that anything that does the victim buffer search while
> holding the extension lock will actually scale in a wide range of
> scenarios.
>

I think the problem for victim buffer could be visible if the blocks
are dirty and it needs to write the dirty buffer and especially as
the patch is written where after acquiring the extension lock, it again
tries to extend the relation without checking if it can get a page with
space from FSM.  It seems to me that we should re-check the
availability of page because while one backend is waiting on extension
lock, other backend might have added pages.  To re-check the
availability we might want to use something similar to
LWLockAcquireOrWait() semantics as used during WAL writing.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2016-02-05 Thread Rushabh Lathia
On Fri, Feb 5, 2016 at 4:46 PM, Rushabh Lathia 
wrote:

>
>
> On Wed, Feb 3, 2016 at 3:31 PM, Etsuro Fujita  > wrote:
>
>> On 2016/01/28 15:20, Rushabh Lathia wrote:
>>
>>> On Thu, Jan 28, 2016 at 11:33 AM, Etsuro Fujita
>>> >
>>> wrote:
>>>
>>> On 2016/01/27 21:23, Rushabh Lathia wrote:
>>>
>>> If I understood correctly, above documentation means, that if
>>> FDW have
>>> DMLPushdown APIs that is enough. But in reality thats not the
>>> case, we
>>> need  ExecForeignInsert, ExecForeignUpdate, or ExecForeignDelete
>>> in case
>>> DML is not pushable.
>>>
>>> And here fact is DMLPushdown APIs are optional for FDW, so that
>>> if FDW
>>> don't have DMLPushdown APIs they can still very well perform the
>>> DML
>>> operations using ExecForeignInsert, ExecForeignUpdate, or
>>> ExecForeignDelete.
>>>
>>
>> So documentation should be like:
>>>
>>> If the IsForeignRelUpdatable pointer is set to NULL, foreign
>>> tables are
>>> assumed to be insertable, updatable, or deletable if the FDW
>>> provides
>>> ExecForeignInsert, ExecForeignUpdate, or ExecForeignDelete
>>> respectively,
>>>
>>> If FDW provides DMLPushdown APIs and the DML are pushable to the
>>> foreign
>>> server, then FDW still needs ExecForeignInsert,
>>> ExecForeignUpdate, or
>>> ExecForeignDelete for the non-pushable DML operation.
>>>
>>> What's your opinion ?
>>>
>>
>> I agree that we should add this to the documentation, too.
>>>
>>
>> I added docs to the IsForeignRelUpdatable documentation.  Also, a brief
>> introductory remark has been added at the beginning of the DML pushdown
>> APIs' documentation.
>>
>> BTW, if I understand correctly, I think we should also modify
>>> relation_is_updatabale() accordingly.  Am I right?
>>>
>>
>> Yep, we need to modify relation_is_updatable().
>>>
>>
>> I thought I'd modify that function in the same way as
>> CheckValidResultRel(), but I noticed that we cannot do that, because we
>> don't have any information on whether each update is pushed down to the
>> remote server by PlanDMLPushdown, during relation_is_updatabale().
>
>
> Sorry I didn't get you here. Can't resultRelInfo->ri_FdwPushdown gives
> information update whether update is pushed down safe or not ? What my
> concern here is, lets say resultRelInfo->ri_FdwPushdown marked as true
> (PlanDMLPushdown return true), but later into CheckValidResultRel() it
> found out that missing BeginDMLPushdown, IterateDMLPushdown and
> EndDMLPushdown APIs and it will end up with error.
>
> What I think CheckValidResultRel() should do is, if
> resultRelInfo->ri_FdwPushdown is true then check for the DMLPushdown API
> and if it doesn't find those API then check for traditional APIs
> (ExecForeignInsert, ExecForeignUpdate or ExecForeignDelete). And when it
> doesn't find both then it should return an error.
>
> I changed CheckValidResultRel(), where
>
> 1) Don't throw an error if resultRelInfo->ri_FdwPushdown is true and
> DMLPushdown APIs are missing as query can still perform operation with
> traditional ExecForeign APIs. So in this situation just marked
> resultRelInfo->ri_FdwPushdown to false.
>
> (Wondering can we add the checks for DMLPushdown APIs into PlanDMLPushdown
> as additional check? Means PlanDMLPushdown should return true only if FDW
> provides the BeginDMLPushdown & IterateDMLPushdown & EndDMLPushdown APIs ?
> What you say ?)
>
>
On another thought, we should not give responsibility to check for the APIs
to the FDW. So may be we should call PlanDMLPushdown only if
BeginDMLPushdown & IterateDMLPushdown & EndDMLPushdown APIs are present
into FDW. That means prepare DMLPushdown plan only when all the required
APIs are available with FDW. This will also reduce the changes into
CheckValidResultRel().

Thanks Ashutosh Bapat for healthy discussion.

PFA patch.



> 2) Don't throw an error if resultRelInfo->ri_FdwPushdown is true and
> DMLPushdown APIs is present but ExecForeign APIs are missing.
> 3) Throw an error if resultRelInfo->ri_FdwPushdown is false and
> ExecForeign APIs are missing.
>
> Attaching is the WIP patch here, do share your thought.
> (need to apply on top of V6 patch)
>
>
> So, I left that function as-is.  relation_is_updatabale() is just used for
>> display in the information_schema views, so ISTM that that function is fine
>> as-is.  (As for CheckValidResultRel(), I revised it so as to check the
>> presence of DML pushdown APIs after checking the existing APIs if the given
>> command will be pushed down.  The reason is because we assume the presence
>> of the existing APIs, anyway.)
>>
>>
> I revised other docs and some comments, mostly for consistency.
>>
>> Attached is an updated version of the patch, 

Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

2016-02-05 Thread Peter Geoghegan
On Wed, Feb 3, 2016 at 11:31 AM, Robert Haas  wrote:
> Thanks for the review.  I fixed the OID conflict, tweaked a few
> comments, and committed this.

Thanks. I noticed a tiny, preexisting typo in the string abbreviated
key code. The attached patch fixes it.

-- 
Peter Geoghegan
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 1a74e5e..5e7536a 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -2154,7 +2154,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		len = bpchartruelen(authoritative_data, len);
 
 	/*
-	 * If we're using the C collation, use memcmp(), rather than strxfrm(), to
+	 * If we're using the C collation, use memcpy(), rather than strxfrm(), to
 	 * abbreviate keys.  The full comparator for the C locale is always
 	 * memcmp().  It would be incorrect to allow bytea callers (callers that
 	 * always force the C collation -- bytea isn't a collatable type, but this

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-05 Thread Ashutosh Bapat
Would it be nuts to set fdw_scan_tlist in all cases?  Would the code
> come out simpler than what we have now?
>
>
PFA the patch that can be applied on v9 patches.

If there is a whole-row reference for base relation, instead of adding that
as an additional column deparseTargetList() creates a list of all the
attributes of that foreign table . The whole-row reference gets constructed
during projection. This saves some network bandwidth while transferring the
data from the foreign server. If we build the target list for base
relation, we should include Vars for all the columns (like
deparseTargetList). Thus we borrow some code from deparseTargetList to get
all the attributes of a relation. I included this logic in function
build_tlist_from_attrs_used(), which is being called by
build_tlist_to_deparse(). So, before calling deparseSelectStmtForRel() the
callers can just call build_tlist_to_deparse() and pass the targetlist to
deparseSelectStmtForRel() and use the same to be handed over to the core
code as fdw_scan_tlist. build_tlist_to_deparse() also constructs
retrieved_attrs list, so that doesn't need to be passed around in deparse
routines.

But we now have similar code in three places deparseTargetList(),
deparseAnalyzeSql() and build_tlist_from_attrs_used(). So, I re-wrote
deparseTargetList() (which is now used to deparse returning list) to call
build_tlist_from_attrs_used() followed by deparseExplicitTargetList(). The
later and its minion deparseVar requires a deparse_expr_cxt to be passed.
deparse_expr_cxt has a member to store RelOptInfo of the relation being
deparsed. The callers of deparseReturningList() do not have it and thus
deparse_expr_cxt required changes, which in turn required changes in other
places. All in all, a larger refactoring. It touches more places than
necessary for foreign join push down. So, I think, we should try that as a
separate refactoring work. We may just do the work described in the first
paragraph for join pushdown, but we will then be left with duplicate code,
which I don't think is worth the output.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index fb72f45..7a2a67b 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -93,9 +93,10 @@ typedef struct foreign_loc_cxt
 typedef struct deparse_expr_cxt
 {
PlannerInfo *root;  /* global planner state */
-   RelOptInfo *foreignrel; /* the foreign relation we are planning 
for */
+   Relids  relids; /* Relids for which to deparse 
*/
StringInfo  buf;/* output buffer to append to */
List  **params_list;/* exprs that will become remote Params 
*/
+   boolis_joinrel; /* Are we deparsing for a join 
relation */
 } deparse_expr_cxt;
 
 #define REL_ALIAS_PREFIX   "r"
@@ -122,8 +123,7 @@ static void deparseTargetList(StringInfo buf,
  Bitmapset *attrs_used,
  List **retrieved_attrs,
  bool qualify_col);
-static void deparseExplicitTargetList(List *tlist, List **retrieved_attrs,
- deparse_expr_cxt *context);
+static void deparseExplicitTargetList(List *tlist, deparse_expr_cxt *context);
 static void deparseReturningList(StringInfo buf, PlannerInfo *root,
 Index rtindex, Relation rel,
 bool trig_after_row,
@@ -151,13 +151,15 @@ static void printRemoteParam(int paramindex, Oid 
paramtype, int32 paramtypmod,
 deparse_expr_cxt *context);
 static void printRemotePlaceholder(Oid paramtype, int32 paramtypmod,
   deparse_expr_cxt *context);
-static void deparseSelectSql(List *tlist, List **retrieved_attrs,
-deparse_expr_cxt *context);
+static void deparseSelectSql(List *tlist, RelOptInfo *rel,
+deparse_expr_cxt 
*context);
 static void deparseLockingClause(deparse_expr_cxt *context);
 static void appendOrderByClause(List *pathkeys, deparse_expr_cxt *context);
 static void appendConditions(List *exprs, deparse_expr_cxt *context);
 static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root,
RelOptInfo *joinrel, bool use_alias, 
List **params_list);
+static List *build_tlist_from_attrs_used(Index rtindex, Bitmapset *attrs_used,
+   PlannerInfo *root, List 
**retrieved_attrs);
 
 
 /*
@@ -715,26 +717,119 @@ deparse_type_name(Oid type_oid, int32 typemod)
 }
 
 /*
+ * Convert input bitmap representation of columns into targetlist of
+ * corresponding Var nodes.
+ *
+ * List of 

[HACKERS] 9.6 Release Schedule

2016-02-05 Thread Dave Page
Per discussion at the Brussels developer meeting and within the
release team, the high level schedule for 9.6 will be:

Beta: May (before PGCon)
Release: September

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Comment typo in slot.c

2016-02-05 Thread Robert Haas
On Thu, Feb 4, 2016 at 4:39 AM, Michael Paquier
 wrote:
> I just bumped into the following typo in slot.c:
> /*
>  * If we'd now fail - really unlikely - we wouldn't know
> whether this slot
>  * would persist after an OS crash or not - so, force a restart. The
> -* restart would try to fysnc this again till it works.
> +* restart would try to fsync this again till it works.
>  */
> START_CRIT_SECTION();

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2016-02-05 Thread Robert Haas
On Thu, Feb 4, 2016 at 11:30 PM, Amit Kapila  wrote:
> On Fri, Feb 5, 2016 at 3:17 AM, Robert Haas  wrote:
>>
>> On Thu, Feb 4, 2016 at 7:00 AM, Amit Kapila 
>> wrote:
>> > [ new patch ]
>>
>> I've committed this after heavy rewriting.  In particular, I merged
>> two loops, used local variables to get rid of the very long and quite
>> repetitive lines, and did various cleanup on the documentation and
>> comments.
>>
>
> I think there is a typo in the committed code, patch to fix the same
> is attached.

Thanks!  Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

2016-02-05 Thread Robert Haas
On Fri, Feb 5, 2016 at 6:14 AM, Peter Geoghegan  wrote:
> On Wed, Feb 3, 2016 at 11:31 AM, Robert Haas  wrote:
>> Thanks for the review.  I fixed the OID conflict, tweaked a few
>> comments, and committed this.
>
> Thanks. I noticed a tiny, preexisting typo in the string abbreviated
> key code. The attached patch fixes it.

Gosh, I must have looked at that line 10 times without seeing that
mistake.  Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-05 Thread Etsuro Fujita

On 2016/02/04 21:57, Ashutosh Bapat wrote:


One more: I think the following in postgresGetForeignJoinPaths() is
a good idea, but I think it's okay to just check whether
root->rowMarks is non-NIL, because that since we have rowmarks for
all base relations except the target, if we have
root->parse->commandType==CMD_DELETE (or
root->parse->commandType==CMD_UPDATE), then there would be at least
one non-target base relation in the joinrel, which would have a rowmark.



Sorry, I am unable to understand it fully. But what you are suggesting
that if there are root->rowMarks, then we are sure that there is at
least one base relation apart from the target, which needs locking rows.
Even if we don't have one, still changes in a row of target relation
after it was scanned, can result in firing EPQ check, which would need
the local plan to be executed, thus even if root->rowMarks is NIL, EPQ
check can fire and we will need alternate local plan.


Yeah, I think that is true, but if root->rowMarks==NIL, we won't have 
non-target foreign tables, and therefore postgresGetForeignJoinPaths() 
will never be called.  No?


Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-05 Thread Masahiko Sawada
On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier
 wrote:
> On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier
>  wrote:
>> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
>>  wrote:
>>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas  wrote:
 On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
  wrote:
> Yes, please let's use the custom language, and let's not care of not
> more than 1 level of nesting so as it is possible to represent
> pg_stat_replication in a simple way for the user.

 "not" is used twice in this sentence in a way that renders me not able
 to be sure that I'm not understanding it not properly.
>>>
>>> 4 times here. Score beaten.
>>>
>>> Sorry. Perhaps I am tired... I was just wondering if it would be fine
>>> to only support configurations up to one level of nested objects, like
>>> that:
>>> 2[node1, node2, node3]
>>> node1, 2[node2, node3], node3
>>> In short, we could restrict things so as we cannot define a group of
>>> nodes within an existing group.
>>
>> No, actually, that's stupid. Having up to two nested levels makes more
>> sense, a quite common case for this feature being something like that:
>> 2{node1,[node2,node3]}
>> In short, sync confirmation is waited from node1 and (node2 or node3).
>>
>> Flattening groups of nodes with a new catalog will be necessary to
>> ease the view of this data to users:
>> - group name?
>> - array of members with nodes/groups
>> - group type: quorum or priority
>> - number of items to wait for in this group
>
> So, here are some thoughts to make that more user-friendly. I think
> that the critical issue here is to properly flatten the meta data in
> the custom language and represent it properly in a new catalog,
> without messing up too much with the existing pg_stat_replication that
> people are now used to for 5 releases since 9.0. So, I would think
> that we will need to have a new catalog, say
> pg_stat_replication_groups with the following things:
> - One line of this catalog represents the status of a group or of a single 
> node.
> - The status of a node/group is either sync or potential, if a
> node/group is specified more than once, it may be possible that it
> would be sync and potential depending on where it is defined, in which
> case setting its status to 'sync' has the most sense. If it is in sync
> state I guess.
> - Move sync_priority and sync_state, actually an equivalent from
> pg_stat_replication into this new catalog, because those represent the
> status of a node or group of nodes.
> - group name, and by that I think that we had perhaps better make
> mandatory the need to append a name with a quorum or priority group.
> The group at the highest level is forcibly named as 'top', 'main', or
> whatever if not directly specified by the user. If the entry is
> directly a node, use the application_name.
> - Type of group, quorum or priority
> - Elements in this group, an element can be a group name or a node
> name, aka application_name. If group is of type priority, the elements
> are listed in increasing order. So the elements with lower priority
> get first, etc. We could have one column listing explicitly a list of
> integers that map with the elements of a group but it does not seem
> worth it, what users would like to know is what are the nodes that are
> prioritized. This covers the former 'priority' field of
> pg_stat_replication.
>
> We may have a good idea of how to define a custom language, still we
> are going to need to design a clean interface at catalog level more or
> less close to what is written here. If we can get a clean interface,
> the custom language implemented, and TAP tests that take advantage of
> this user interface to check the node/group statuses, I guess that we
> would be in good shape for this patch.
>
> Anyway that's not a small project, and perhaps I am over-complicating
> the whole thing.
>

I agree with adding new system catalog to easily checking replication
status for user. And group name will needed for this.
What about adding group name with ":" to immediately after set of
standbys like follows?

2[local, 2[london1, london2, london3]:london, (tokyo1, tokyo2):tokyo]

Also, regarding sync replication according to configuration, the view
I'm thinking is following definition.

=# \d pg_synchronous_replication
 Column  |  Type   | Modifiers
-+---+---
 name| text  |
 sync_type | text  |
 wait_num  | integer  |
 sync_priority | inteter   |
 sync_state| text  |
 member| text[] |
 level | integer  |
 write_location| pg_lsn  |
 flush_location| pg_lsn  |
 apply_location   | pg_lsn   |

- "name" : node name or group name, or "main" meaning top level node.
- "sync_type" : 'priority' or 'quorum' for group 

[HACKERS] Fix for OpenSSL error queue bug

2016-02-05 Thread Peter Geoghegan
Attached patch fixes an issue reported by William Felipe Welter about
a year ago [1]. It is loosely based on his original patch.

As Heikki goes into on that thread, the appropriate action seems to be
to constantly reset the error queue, and to make sure that we
ourselves clear the queue consistently. (Note that we might not have
consistently called ERR_get_error() in the event of an OOM within
SSLerrmessage(), for example). I have not changed backend code in the
patch, though; I felt that we had enough control of the general
situation there for it to be unnecessary to lock everything down.

The interface that OpenSSL exposes for all of this is very poorly
thought out. It's not exactly clear how a client of OpenSSL can be a
"good citizen" in handling the error queue. Correctly using the
library is only ever described in terms of a very exact thing that
must happen or must not happen. There is no overarching concept of how
things fit together so that each OpenSSL client doesn't clobber the
other. It's all rather impractical. Clearly, this patch needs careful
review.

[1] 
http://www.postgresql.org/message-id/flat/20150224030956.2529.83...@wrigleys.postgresql.org#20150224030956.2529.83...@wrigleys.postgresql.org
-- 
Peter Geoghegan
From d5433bc562f792afd75ef8664729db9e6a60ee62 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 26 Jan 2016 15:11:15 -0800
Subject: [PATCH] Distrust external OpenSSL clients; clear err queue

OpenSSL has an unfortunate tendency to mix up per-session state error
handling, and per-thread OpenSSL error handling.  This can cause
problems when programs that link to libpq with OpenSSL enabled have some
other use of OpenSSL; without care, one caller of OpenSSL may cause
problems for the other caller.

To fix, don't trust other users of OpenSSL to clear the per-thread error
queue.  Instead, clear the entire per-thread queue ahead of certain I/O
operations (these I/O operations mostly need to call SSL_get_error() to
check for success, which relies on the queue being empty).  This is a
bit aggressive, but it's pretty clear that the other callers have a very
dubious claim to ownership of the per-thread queue; problem reports
involving PHP scripts that use both PHP's OpenSSL extension, and the
pgsql PHP extension (libpq) are themselves evidence of this.

Finally, be more careful about clearing our own error queue, so as to
not cause these problems ourself.  It's possibly that control previously
did not always reach SSLerrmessage(), where ERR_get_error() was supposed
to be called to clear the queue's earliest code.  Make sure
ERR_get_error() is always called, so as to spare other users of OpenSSL
the possibility of similar problems caused by libpq (as opposed to
problems caused by a third party OpenSSL library like PHP's OpenSSL
extension).
---
 src/interfaces/libpq/fe-secure-openssl.c | 125 +--
 1 file changed, 104 insertions(+), 21 deletions(-)

diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index 133546b..270d184 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -70,7 +70,7 @@ static int verify_peer_name_matches_certificate_name(PGconn *conn,
 static void destroy_ssl_system(void);
 static int	initialize_SSL(PGconn *conn);
 static PostgresPollingStatusType open_client_SSL(PGconn *);
-static char *SSLerrmessage(void);
+static char *SSLerrmessage(unsigned long errcode);
 static void SSLerrfree(char *buf);
 
 static int	my_sock_read(BIO *h, char *buf, int size);
@@ -152,7 +152,7 @@ pgtls_open_client(PGconn *conn)
 			!SSL_set_app_data(conn->ssl, conn) ||
 			!my_SSL_set_fd(conn, conn->sock))
 		{
-			char	   *err = SSLerrmessage();
+			char	   *err = SSLerrmessage(ERR_get_error());
 
 			printfPQExpBuffer(>errorMessage,
    libpq_gettext("could not establish SSL connection: %s\n"),
@@ -209,11 +209,37 @@ pgtls_read(PGconn *conn, void *ptr, size_t len)
 	int			result_errno = 0;
 	char		sebuf[256];
 	int			err;
+	unsigned long errcode;
 
 rloop:
 	SOCK_ERRNO_SET(0);
+
+	/*
+	 * Prepare to call SSL_get_error() by clearing thread's OpenSSL error
+	 * queue.  In general, the current thread's error queue must be empty
+	 * before the TLS/SSL I/O operation is attempted, or SSL_get_error()
+	 * will not work reliably.  Since the possibility exists that other
+	 * OpenSSL clients running in the same thread but not under our control
+	 * will fail to call ERR_get_error() themselves (after their own I/O
+	 * operations), pro-actively clear the per-thread error queue now.
+	 */
+	ERR_clear_error();
 	n = SSL_read(conn->ssl, ptr, len);
 	err = SSL_get_error(conn->ssl, n);
+
+	/*
+	 * Other clients of OpenSSL may fail to call ERR_get_error(), but we
+	 * always do (so as to not cause problems for OpenSSL clients that
+	 * don't call ERR_clear_error() defensively, but are responsible enough
+	 * to call ERR_get_error() to clear error 

[HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-05 Thread Tom Lane
I have sussed what's happening in bug #13908.  Basically, commit
45f6240a8fa9d355 ("Pack tuples in a hash join batch densely, to save
memory") broke things for the case where a hash join is using a skew
table.  The reason is that that commit only changed the storage of tuples
going into the main hash table; tuples going into the skew table are
still allocated with a palloc apiece, without being put into the "chunk"
storage.  Now, if we're loading the hash table and we find that we've
exceeded the storage allowed for skew tuples, ExecHashRemoveNextSkewBucket
wants to push some skew tuples back into the main hash table; and it
believes that linking such tuples into the appropriate main hashbucket
chain is sufficient for that.  Which it was before the aforesaid commit,
and still is in simple cases.  However, if we later try to do
ExecHashIncreaseNumBatches, that function contains new code that assumes
that it can find all tuples in the main hashtable by scanning the "chunk"
storage directly.  Thus, the pushed-back tuples are not scanned and are
neither re-entered into the hash table nor dumped into a batch file.
So they won't get joined.

It looks like ExecHashIncreaseNumBuckets, if it were to run after some
executions of ExecHashRemoveNextSkewBucket, would break things in the same
way.  That's not what's happening in this particular test case, though.

I'm of the opinion that this is a stop-ship bug for 9.5.1.  Barring
somebody producing a fix over the weekend, I will deal with it by
reverting the aforementioned commit.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-05 Thread Tom Lane
Robert Haas  writes:
> On Fri, Feb 5, 2016 at 10:17 AM, Filip Rembiałkowski
>  wrote:
>> - new GUC in "Statement Behaviour" section, notify_duplicate_removal

> I agree with what Merlin said about this:
> http://www.postgresql.org/message-id/CAHyXU0yoHe8Qc=yc10ahu1nfia1tbhsg+35ds-oeueuapo7...@mail.gmail.com

Yeah, I agree that a GUC for this is quite unappetizing.

One idea would be to build a hashtable to aid with duplicate detection
(perhaps only once the pending-notify list gets long).

Another thought is that it's already the case that duplicate detection is
something of a "best effort" activity; note for example the comment in
AsyncExistsPendingNotify pointing out that we don't collapse duplicates
across subtransactions.  Would it be acceptable to relax the standards
a bit further?  For example, if we only checked for duplicates among the
last N notification list entries (for N say around 100), we'd probably
cover just about all the useful cases, and the runtime would stay linear.
The data structure isn't tremendously conducive to that, but it could be
done.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: GiST support for UUIDs

2016-02-05 Thread Paul Jungwirth

On 12/25/2015 03:23 AM, Ildus Kurbangaliev wrote:

On Fri, 25 Dec 2015 13:34:25 +0300
Teodor Sigaev  wrote:

First, variables (high and low) should not be declared in the middle
of code. Second, assert will fail if ua.v64[0] == ub.v64[0] and
ua.v64[1] > ub.v64[1] although it's a possible and allowed case.
Third, actually you multiply high value by 2^64 anf low by 2^-64.
Seems, it's needed to do only one multiplication.


Thank you for review. Fixed.


Just wanted to follow up on this and see about getting it added to the 
next commitfest. It looks like Iluds's latest patch 
(btree_gist_uuid_5.patch) doesn't appear on the commitfest page here:


https://commitfest.postgresql.org/7/332/

Also the last few emails of the thread don't show up, although you can 
read them here:


http://www.postgresql.org/message-id/20151225142340.46e577dd@lp

I'm not sure the next step here. Do I click "Move to next CF" in the 
"Status" dropdown? Apologies for not being familiar with the protocol. 
I'd be sad to see this work just get ignored.


Thanks,
Paul



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-05 Thread Robert Haas
On Fri, Feb 5, 2016 at 10:17 AM, Filip Rembiałkowski
 wrote:
> - new GUC in "Statement Behaviour" section, notify_duplicate_removal
> (default true)
>
> Initial discussion in this thread:
> http://www.postgresql.org/message-id/CAP_rwwmpzk9=sbjrztod05bdctyc43wnknu_m37dygvl4sa...@mail.gmail.com
>
> Rationale: for some legitimate use cases, duplicate removal is not required,
> and it gets O(N^2) cost on large COPY/ insert transactions.

I agree with what Merlin said about this:

http://www.postgresql.org/message-id/CAHyXU0yoHe8Qc=yc10ahu1nfia1tbhsg+35ds-oeueuapo7...@mail.gmail.com

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LLVM Address Sanitizer (ASAN) and valgrind support

2016-02-05 Thread Noah Misch
On Mon, Sep 07, 2015 at 05:05:10PM +0100, Greg Stark wrote:
> I feel like I remember hearing about this before but I can't find any
> mention of it in my mail archives. It seems pretty simple to add
> support for LLVM's Address Sanitizer (asan) by using the hooks we
> already have for valgrind.

Nice.

> In fact I think this would actually be sufficient. I'm not sure what
> the MEMPOOL valgrind stuff is though. I don't think it's relevant to
> address sanitizer which only tracks references to free'd or
> unallocated pointers.

Documentation: http://valgrind.org/docs/manual/mc-manual.html#mc-manual.mempools

A Valgrind memory pool is the precise analog of a PostgreSQL memory context.
The design of the PostgreSQL Valgrind support gave mcxt.c primary
responsibility for validating and invalidating pointers, and mcxt.c uses
MEMPOOL client requests to do so.  While Valgrind could catch all the same
errors with just VALGRIND_MAKE_MEM_{DEFINED,NOACCESS,UNDEFINED}, the MEMPOOL
client requests improve the attribution of errors.  You get this:

==00:00:14:08.472 471537==  Address 0x8ecd848 is 24 bytes inside a block of 
size 25 client-defined
==00:00:14:08.472 471537==at 0x7717FD: MemoryContextAlloc (mcxt.c:589)

Instead of this:

==00:00:02:29.587 18940==  Address 0x6a1b51b is 83,835 bytes inside a block of 
size 262,144 alloc'd
==00:00:02:29.587 18940==at 0x4C2353F: malloc (vg_replace_malloc.c:236)
==00:00:02:29.587 18940==by 0x7FD12F: AllocSetAlloc (aset.c:792)
==00:00:02:29.587 18940==by 0x7FEE9D: MemoryContextAlloc (mcxt.c:524)

> I don't even see any need offhand for a configure flag or autoconf
> test. We could have a configure flag just to be consistent with
> valgrind but it seems pointless. If you're compiling with asan I don't
> see any reason to not use it. I'm building this to see if it works
> now.

I agree.  A flag guards Valgrind client requests, because we'd otherwise have
no idea whether the user plans to run the binary under Valgrind.  For ASAN,
all relevant decisions happen at build time.

> --- a/src/include/utils/memdebug.h
> +++ b/src/include/utils/memdebug.h
> @@ -19,6 +19,27 @@
> 
>  #ifdef USE_VALGRIND
>  #include 
> +
> +#elif __has_feature(address_sanitizer) || defined(__SANITIZE_ADDRESS__)
> +
> +#include 
> +
> +#define VALGRIND_MAKE_MEM_DEFINED(addr, size) \
> + ASAN_UNPOISON_MEMORY_REGION(addr, size)
> +
> +#define VALGRIND_MAKE_MEM_NOACCESS(addr, size) \
> + ASAN_POISON_MEMORY_REGION(addr, size)
> +
> +#define VALGRIND_MAKE_MEM_UNDEFINED(addr, size) \
> + ASAN_UNPOISON_MEMORY_REGION(addr, size)
> +
> +#define VALGRIND_CHECK_MEM_IS_DEFINED(addr, size) do {} while (0)
> +#define VALGRIND_CREATE_MEMPOOL(context, redzones, zeroed) do {} while (0)
> +#define VALGRIND_DESTROY_MEMPOOL(context) do {} while (0)
> +#define VALGRIND_MEMPOOL_ALLOC(context, addr, size) do {} while (0)
> +#define VALGRIND_MEMPOOL_FREE(context, addr) do {} while (0)
> +#define VALGRIND_MEMPOOL_CHANGE(context, optr, nptr, size) do {} while (0)

aset.c relies on the fact that VALGRIND_MEMPOOL_ALLOC() has an implicit
VALGRIND_MAKE_MEM_UNDEFINED() and VALGRIND_MEMPOOL_FREE() has an implicit
VALGRIND_MAKE_MEM_NOACCESS().  #define those two accordingly.  If ASAN has no
equivalent of MEMPOOL client requests, it's fine to stub out the rest.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-05 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> I'm of the opinion that this is a stop-ship bug for 9.5.1.  Barring
> somebody producing a fix over the weekend, I will deal with it by
> reverting the aforementioned commit.

Agreed.

Thanks!

Stephen


signature.asc
Description: Digital signature