Re: [HACKERS] Parallelized polymorphic aggs, and aggtype vs aggoutputtype

2016-06-21 Thread David Rowley
On 20 June 2016 at 22:19, David Rowley  wrote:
> I've gone and implemented the dummy argument approach for
> deserialization functions.
>
> If we go with this, I can then write the docs for 35.10 which'll serve
> to explain parallel user defined aggregates in detail.

I've attached my proposed patch for the user defined aggregate docs.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


create_aggregate_docs.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] primary_conninfo missing from pg_stat_wal_receiver

2016-06-21 Thread Michael Paquier
On Wed, Jun 22, 2016 at 12:04 AM, Stephen Frost  wrote:
> Ugh.  I would certainly rather not have yet another special, hard-coded,
> bit of logic that magically makes things available to a superuser when
> it's not available to regular users.
> What that results in is the need to have a new default role to control
> access to that column for the non-superuser case.

OK, we could always update system_views.sql to revoke all rights from
public...  This ship has not sailed yet.

> As for the password showing up, sorry, but we need a solution for *that*
> regardless of the rest- the password shouldn't be exposed to anyone, nor
> should it be sent and kept in the backend's memory for longer than
> necessary.  I'm not suggesting we've got that figured out already, but
> that's where we should be trying to go.

This connection string is stored in shared memory in WalRcvData, and
has been for a couple of releases now, so it already has a high
footprint, though I agree that making it available at SQL level makes
things even worse.  Looking at the code, as libpq does not link
directly to libpqcommon, I think that the cleanest solution is to do
something similar to wchar.c, i.e. split out the parsing and deparsing
routines that we are going to use into a file located in src/backend/,
and then build libpq using it.  Writing a patch for that would not be
that complicated.  What is stored in WalRcvData would then be the
connection string with its password field filtered out, or we could
just obfuscate it with ***; it may still be useful for the user to
know that a password has been used.
-- 
Michael




Re: [HACKERS] Missing checks when malloc returns NULL...

2016-06-21 Thread Michael Paquier
On Tue, Jun 21, 2016 at 10:46 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> - mcxt.c uses that, which is surprising:
>> @@ -704,7 +704,8 @@ MemoryContextCreate(NodeTag tag, Size size,
>> {
>> /* Special case for startup: use good ol' malloc */
>> node = (MemoryContext) malloc(needed);
>> -   Assert(node != NULL);
>> +   if (node == NULL)
>> +   elog(PANIC, "out of memory");
>> }
>> I think that a PANIC is cleaner here instead of a simple crash.
>
> But the elog mechanism assumes that memory contexts are working.
> If this ever actually did fire, no good would come of it.

OK, there is not much that we can do here then. What about the rest?
Those seem like legit concerns to me.
-- 
Michael


malloc-nulls-v2.patch
Description: invalid/octet-stream



[HACKERS] signed division in hash_search_with_hash_value(ENTER) has high overhead

2016-06-21 Thread Andres Freund
Hi,

During several profile runs I've seen the division in
	if (action == HASH_ENTER || action == HASH_ENTER_NULL)
	{
		/*
		 * Can't split if running in partitioned mode, nor if frozen, nor if
		 * table is the subject of any active hash_seq_search scans.  Strange
		 * order of these tests is to try to check cheaper conditions first.
		 */
		if (!IS_PARTITIONED(hctl) && !hashp->frozen &&
			hctl->freeList[0].nentries / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
			!has_seq_scans(hashp))
			(void) expand_table(hashp);
	}
taking up significant amounts of time.  Which is not particularly
surprising: a signed 64-bit integer division (which is what we're dealing
with here) takes up to ~100 cycles, is microcoded, and prevents
instruction-level parallelism.

We could cast to unsigned long - which would be faster - but it seems
like it'd be better to compute a threshold in init_htab()/expand_table(),
and make that
    && hctl->freeList[0].nentries >= hctl->next_expansion
or somesuch.

I don't plan to do that, but I wanted to document it, should somebody
else be motivated to look into this.

Greetings,

Andres Freund




Re: [HACKERS] Speaking of breaking compatibility...standard_conforming_strings

2016-06-21 Thread Bruce Momjian
On Tue, May 24, 2016 at 04:07:02PM -0400, David G. Johnston wrote:
> On Tue, May 24, 2016 at 3:57 PM, Tom Lane  wrote:
> 
> "David G. Johnston"  writes:
> > I just noticed this comment in scan.l:
> > /*
> >  * GUC variables.  This is a DIRECT violation of the warning given at the
> >  * head of gram.y, ie flex/bison code must not depend on any GUC variables;
> >  * as such, changing their values can induce very unintuitive behavior.
> >  * But we shall have to live with it as a short-term thing until the switch
> >  * to SQL-standard string syntax is complete.
> >  */
> > int backslash_quote = BACKSLASH_QUOTE_SAFE_ENCODING;
> > bool escape_string_warning = true;
> > bool standard_conforming_strings = true;
> 
> > I'm not exactly sure what else needs to happen to remove these forbidden
> > GUCs and if we are not prepared to do this now when will we ever be...
> 
> Dunno, are you prepared to bet that nobody is turning off
> standard_conforming_strings anymore?
> 
> In any case, we keep adding new violations of this rule (cf
> operator_precedence_warning) so I have little hope that it will ever be
> completely clean.
> 
> 
> I tend to hold the same position.  I'd probably update the last sentence of
> the comment to reflect that reality.

I changed it to:

 * But we shall have to live with it until we can remove these variables.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] MultiXactId error after upgrade to 9.3.4

2016-06-21 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Robert Haas wrote:
> > On Fri, Jun 17, 2016 at 9:33 AM, Andrew Gierth
> >  wrote:
> > >> "Robert" == Robert Haas  writes:
> > >  >> Why is the correct rule not "check for and ignore pre-upgrade mxids
> > >  >> before even trying to fetch members"?
> > >
> > >  Robert> I entirely believe that's the correct rule, but doesn't
> > >  Robert> implementing it require a crystal balll?
> > >
> > > Why would it? Pre-9.3 mxids are identified by the combination of flag
> > > bits in the infomask, see Alvaro's patch.
> > 
> > I see the patch, but I don't see much explanation of why the patch is
> > correct, which I think is pretty scary in view of the number of
> > mistakes we've already made in this area.
> 
> ... and actually the patch fails one isolation test in 9.3, which is
> why I haven't pushed (I haven't tested 9.4 but I suppose it should be
> the same).  I'm looking into that now.

Ah, it should have been obvious; it's failing because 9.3 and 9.4 lack
commit 27846f02c176, which removed
MultiXactHasRunningRemoteMembers(), so the straight backpatch plus
conflict fixes left one GetMultiXactIdMembers call with the allow_old
flag set to true.  The attached patch fixes that omission.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit 17507b2a04b8c381997410d1fe3fbaacc34f5a31
Author: Alvaro Herrera 
AuthorDate: Tue Jun 21 18:07:49 2016 -0400
CommitDate: Tue Jun 21 18:07:49 2016 -0400

fixup MultiXactHasRunningRemoteMembers

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 15de62d..efbca6f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1448,7 +1448,7 @@ MultiXactHasRunningRemoteMembers(MultiXactId multi)
 	int			nmembers;
 	int			i;
 
-	nmembers = GetMultiXactIdMembers(multi, , true);
+	nmembers = GetMultiXactIdMembers(multi, , false);
 	if (nmembers <= 0)
 		return false;
 
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 9d7050a..931e2fb 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -701,7 +701,9 @@ HeapTupleSatisfiesUpdate(HeapTupleHeader tuple, CommandId curcid,
 
 if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
 {
-	if (MultiXactHasRunningRemoteMembers(xmax))
+	if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
+		return HeapTupleMayBeUpdated;
+	else if (MultiXactHasRunningRemoteMembers(xmax))
 		return HeapTupleBeingUpdated;
 	else
 		return HeapTupleMayBeUpdated;
@@ -725,6 +727,7 @@ HeapTupleSatisfiesUpdate(HeapTupleHeader tuple, CommandId curcid,
 
 /* not LOCKED_ONLY, so it has to have an xmax */
 Assert(TransactionIdIsValid(xmax));
+Assert(!HEAP_LOCKED_UPGRADED(tuple->t_infomask));
 
 /* updating subtransaction must have aborted */
 if (!TransactionIdIsCurrentTransactionId(xmax))



Re: [HACKERS] Requesting external_pid_file with postgres -C when not initialized lead to coredump

2016-06-21 Thread Michael Paquier
On Wed, Jun 22, 2016 at 2:02 AM, Alvaro Herrera
 wrote:
> alain radix wrote:
>> So, here is my first patch for PostgreSQL.
>
> Looking forward to further ones,

A comment on this patch: this is an incorrect approach anyway.
PostmasterMain relies on this value being NULL to decide whether the
PID file should be written or not, so this patch applied as-is would
result in a message output to stderr unless the logic there is changed
as well.
-- 
Michael




Re: [HACKERS] Rethinking behavior of force_parallel_mode = regress

2016-06-21 Thread Tom Lane
Robert Haas  writes:
> On Sat, Jun 18, 2016 at 4:49 PM, Tom Lane  wrote:
>> With that thought in mind, I propose that the behavior of
>> force_parallel_mode = regress is ill-designed so far as EXPLAIN is
>> concerned.  What it ought to do is suppress *all* Gathers from the output,
>> not just ones that were added in response to force_parallel_mode itself.

> No, that doesn't sound like a very good idea.  If you do that, then
> you have no hope of the differences being *zero*, because any place
> that the regression tests are intended to produce a parallel plan is
> going to look different.

Well, sure, but in those areas you just set force_parallel_mode to on.

> The charter of force_parallel_mode=regress
> is that any regression test that passes normally should still pass
> with that setting.

I would like that charter to include scenarios with other nondefault GUC
settings, to the extent we can reasonably make it work, because setting
*only* force_parallel_mode is pretty sad in terms of test coverage.
Or hadn't you noticed all the bugs we flushed from cover as soon as we
tried changing the cost values?

regards, tom lane




Re: [HACKERS] Rethinking behavior of force_parallel_mode = regress

2016-06-21 Thread Robert Haas
On Sat, Jun 18, 2016 at 4:49 PM, Tom Lane  wrote:
> As of HEAD it is possible to get through all of our regression tests
> with these settings:
>
> alter system set force_parallel_mode = regress;
> alter system set max_parallel_workers_per_gather = 2;
> alter system set parallel_tuple_cost = 0;
> alter system set parallel_setup_cost = 0;
> alter system set min_parallel_relation_size = 0;
>
> although there are quite a number of cosmetic differences in the outputs
> for the core regression tests.  (Curiously, contrib, pl, and isolation
> seem to pass without any diffs.)  In view of the number of bugs we've been
> able to identify with this setup, it would be nice to reduce the volume of
> the cosmetic differences to make it easier to review the diffs by hand.
> I'm not sure there's much that can be done about the row-ordering diffs;
> some randomness in the output order from a parallel seqscan seems
> inevitable.  But we could tamp down the EXPLAIN output differences, which
> are much harder to review anyway.
>
> With that thought in mind, I propose that the behavior of
> force_parallel_mode = regress is ill-designed so far as EXPLAIN is
> concerned.  What it ought to do is suppress *all* Gathers from the output,
> not just ones that were added in response to force_parallel_mode itself.

No, that doesn't sound like a very good idea.  If you do that, then
you have no hope of the differences being *zero*, because any place
that the regression tests are intended to produce a parallel plan is
going to look different.  The charter of force_parallel_mode=regress
is that any regression test that passes normally should still pass
with that setting.  This change would clearly break that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reviewing freeze map code

2016-06-21 Thread Robert Haas
On Tue, Jun 21, 2016 at 3:46 PM, Andres Freund  wrote:
> On 2016-06-21 15:38:25 -0400, Robert Haas wrote:
>> On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund  wrote:
>> >> I'm also a bit dubious that LockAcquire is safe to call in general
>> >> with interrupts held.
>> >
>> > Looks like we could just acquire the tuple-lock *before* doing the
>> > toast_insert_or_update/RelationGetBufferForTuple, but after releasing
>> > the buffer lock. That'd allow us to do avoid doing the nested locking,
>> > should make the recovery just a goto l2;, ...
>>
>> Why isn't that racey?  Somebody else can grab the tuple lock after we
>> release the buffer content lock and before we acquire the tuple lock.
>
> Sure, but by the time the tuple lock is released, they'd have updated
> xmax. So once we acquired that we can just do
> if (xmax_infomask_changed(oldtup.t_data->t_infomask, infomask) ||
>     !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data), xwait))
>     goto l2;
> which is fine, because we've not yet done the toasting.

I see.

> I'm not sure whether this approach is better than deleting potentially
> toasted data though. It's probably faster, but will likely touch more
> places in the code, and eat up an infomask bit (infomask & HEAP_MOVED
> == HEAP_MOVED in my prototype).

Ugh.  That's not very desirable at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

2016-06-21 Thread Merlin Moncure
Hello hackers,

Observe the following test case (apologies if this is a well-understood
problem):

create temp table foo as select generate_series(1,100) id;
create index on foo(id);

create temp table bar as select id, id % 10 = 0 as good from
generate_series(1,100) id;
create index on bar(good);

analyze foo;
analyze bar;

explain analyze select * from foo where false or exists (select 1 from
bar where good and foo.id = bar.id);  -- A
explain analyze select * from foo where exists (select 1 from bar
where good and foo.id = bar.id);  -- B

These queries are trivially verified as identical but give very different plans.
A gives
  QUERY PLAN
─
 Seq Scan on foo  (cost=0.00..4459425.00 rows=50 width=4) (actual time=13.299..130.271 rows=10 loops=1)
   Filter: (alternatives: SubPlan 1 or hashed SubPlan 2)
   Rows Removed by Filter: 90
   SubPlan 1
     ->  Index Scan using bar_good_idx on bar  (cost=0.42..4.45 rows=1 width=0) (never executed)
           Index Cond: (good = true)
           Filter: (good AND (foo.id = id))
   SubPlan 2
     ->  Index Scan using bar_good_idx on bar bar_1  (cost=0.42..4.44 rows=1 width=4) (actual time=0.024..0.055 rows=10 loops=1)
           Index Cond: (good = true)
           Filter: good
 Planning time: 0.103 ms
 Execution time: 130.312 ms

B gives
  QUERY PLAN
───
 Nested Loop  (cost=4.87..12.91 rows=1 width=4) (actual time=0.075..0.161 rows=10 loops=1)
   ->  HashAggregate  (cost=4.45..4.46 rows=1 width=4) (actual time=0.058..0.060 rows=10 loops=1)
         Group Key: bar.id
         ->  Index Scan using bar_good_idx on bar  (cost=0.42..4.44 rows=1 width=4) (actual time=0.018..0.045 rows=10 loops=1)
               Index Cond: (good = true)
               Filter: good
   ->  Index Only Scan using foo_id_idx on foo  (cost=0.42..8.44 rows=1 width=4) (actual time=0.009..0.009 rows=1 loops=10)
         Index Cond: (id = bar.id)
         Heap Fetches: 10
 Planning time: 0.193 ms
 Execution time: 0.187 ms

This is a general problem with OR expressions, while AND expressions
will generally pass the optimization through.  The 'old school'
approach is to rewrite the OR expression as a UNION ALL, but this can
have unpleasant downstream effects on the query in real-world
scenarios.  The question is: can the one-time-filter logic be expanded
such that the first query can functionally be rewritten into the
second one?  This type of query happens a lot when trying to mix
multiple different filtering expressions (a 'filter mode', if you
will) in a single query based on a user-supplied switch.  Food for
thought.
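The 'old school' rewrite mentioned above would look roughly like this
against the test tables.  This is a sketch only, and `:mode` is a
hypothetical stand-in for the user-supplied switch, not something from the
queries above:

```sql
-- Each OR arm becomes its own UNION ALL branch; the branches are made
-- mutually exclusive so the union stays equivalent to the original OR
-- without introducing duplicate rows.  ":mode" is hypothetical.
select * from foo
where :mode = 'all'
union all
select * from foo
where not (:mode = 'all')
  and exists (select 1 from bar where good and foo.id = bar.id);
```

Planned this way, each branch can use its own index strategy, which is why
the rewrite helps; the bookkeeping needed to keep the branches disjoint is
one of the unpleasant downstream effects mentioned above.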

merlin



Re: [HACKERS] MultiXactId error after upgrade to 9.3.4

2016-06-21 Thread Alvaro Herrera
Robert Haas wrote:
> On Fri, Jun 17, 2016 at 9:33 AM, Andrew Gierth
>  wrote:
> >> "Robert" == Robert Haas  writes:
> >  >> Why is the correct rule not "check for and ignore pre-upgrade mxids
> >  >> before even trying to fetch members"?
> >
> >  Robert> I entirely believe that's the correct rule, but doesn't
> >  Robert> implementing it require a crystal balll?
> >
> > Why would it? Pre-9.3 mxids are identified by the combination of flag
> > bits in the infomask, see Alvaro's patch.
> 
> I see the patch, but I don't see much explanation of why the patch is
> correct, which I think is pretty scary in view of the number of
> mistakes we've already made in this area.

... and actually the patch fails one isolation test in 9.3, which is
why I haven't pushed (I haven't tested 9.4 but I suppose it should be
the same).  I'm looking into that now.

> The comments just say:
> 
> + * A tuple that has HEAP_XMAX_IS_MULTI and HEAP_XMAX_LOCK_ONLY but neither of
> + * XMAX_EXCL_LOCK and XMAX_KEYSHR_LOCK must come from a tuple that was
> + * share-locked in 9.2 or earlier and then pg_upgrade'd.
> 
> Why must that be true?

The reason this must be true is that in 9.2 and earlier, multixacts were
only used to lock tuples FOR SHARE, which produced that specific bit
pattern.  I suppose I need to make this comment more explicit.

> + * We must not try to resolve such multixacts locally, because the result 
> would
> + * be bogus, regardless of where they stand with respect to the current valid
> + * range.
> 
> What about other pre-upgrade mxacts that don't have this exact bit
> pattern?  Why can't we try to resolve them and end up in trouble just
> as easily?

There shouldn't be any.  Back then, it was not possible to have tuples
locked and updated at the same time; FOR UPDATE (the only other locking
mode back then) didn't allow other lockers, so the only possibility was
FOR SHARE with that bit pattern.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Reviewing freeze map code

2016-06-21 Thread Andres Freund
On 2016-06-21 15:38:25 -0400, Robert Haas wrote:
> On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund  wrote:
> >> I'm also a bit dubious that LockAcquire is safe to call in general
> >> with interrupts held.
> >
> > Looks like we could just acquire the tuple-lock *before* doing the
> > toast_insert_or_update/RelationGetBufferForTuple, but after releasing
> > the buffer lock. That'd allow us to do avoid doing the nested locking,
> > should make the recovery just a goto l2;, ...
> 
> Why isn't that racey?  Somebody else can grab the tuple lock after we
> release the buffer content lock and before we acquire the tuple lock.

Sure, but by the time the tuple lock is released, they'd have updated
xmax. So once we acquired that we can just do
if (xmax_infomask_changed(oldtup.t_data->t_infomask, infomask) ||
    !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data), xwait))
    goto l2;
which is fine, because we've not yet done the toasting.


I'm not sure whether this approach is better than deleting potentially
toasted data though. It's probably faster, but will likely touch more
places in the code, and eat up an infomask bit (infomask & HEAP_MOVED
== HEAP_MOVED in my prototype).


Andres




Re: [HACKERS] [COMMITTERS] pgsql: Try again to fix the way the scanjoin_target is used with partia

2016-06-21 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jun 21, 2016 at 1:28 PM, Tom Lane  wrote:
>> Also, I got rid of the target_parallel argument to
>> apply_projection_to_path, as I thought that that was just way too much
>> interconnection between apply_projection_to_path and its callers than
>> is justified for what it saves (namely one call of has_parallel_hazard
>> when planning a Gather).  In general, having that argument could *add*
>> extra calls of has_parallel_hazard, since callers might have to do
>> that computation whether or not a Gather is present.

> I had a feeling you weren't going to like that, but it also didn't
> seem great to redo that computation for every path.  Right now, we
> only need it for Gather paths, but if we start marking subquery RTEs
> as parallel-safe and fix upper rels to correctly set
> consider_parallel, I have a feeling this is going to be needed more.
> But feel free to ignore that for now since I don't have a completely
> well-thought-out theory on this.

If that does start being a problem, I'd be inclined to address it by
teaching PathTarget to track whether its contents are parallel-safe.
For now, though, I think we don't need the complication.

regards, tom lane




Re: [HACKERS] Reviewing freeze map code

2016-06-21 Thread Robert Haas
On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund  wrote:
>> I'm also a bit dubious that LockAcquire is safe to call in general
>> with interrupts held.
>
> Looks like we could just acquire the tuple-lock *before* doing the
> toast_insert_or_update/RelationGetBufferForTuple, but after releasing
> the buffer lock. That'd allow us to do avoid doing the nested locking,
> should make the recovery just a goto l2;, ...

Why isn't that racey?  Somebody else can grab the tuple lock after we
release the buffer content lock and before we acquire the tuple lock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Re: [COMMITTERS] pgsql: Try again to fix the way the scanjoin_target is used with partia

2016-06-21 Thread Robert Haas
On Tue, Jun 21, 2016 at 1:28 PM, Tom Lane  wrote:
> I wrote:
>> If we keep it like this, we definitely ought to refactor things so that
>> fewer places are aware of the possibility of the Result getting omitted.
>> Maybe push that logic into create_projection_path?  If we did, we might
>> not need a separate apply_projection_to_path function at all?  Too tired
>> to decide about it right now.
>
> After some contemplation, it seems to me that the right thing is to make
> ProjectionPath explicitly aware of the possibility that it's just a
> placeholder used for the purpose of not modifying the input Path node,
> and have create_projection_path calculate the correct costing either way.
> In this design, the principal difference between create_projection_path
> and apply_projection_to_path is that the latter assumes it can scribble
> directly on the given Path while the former doesn't modify that Path.

+1.

> There is one other difference: I did not do anything about making
> create_projection_path push the target below a Gather.  In principle,
> it could do that by cloning the given GatherPath, but right now I think
> that'd be unreachable code so I did not bother.

Works for me.

> I had hoped that this would result in simplifying create_projection_plan
> so that it just makes a Result or not according to what
> create_projection_path decided, but there's one regression test case that
> fails (in the sense of showing a Result in the plan that isn't really
> needed).  That happens because create_merge_append_plan adds sort columns
> to the tlist and so a tlist match is possible after that happens when it
> didn't match before.  For the moment I kluged create_projection_plan so
> that that keeps working, but I wonder if it'd be better to just accept an
> extra Result in that case.

Not sure if it's better to accept an extra Result in that case, but I
agree that's ugly.

> Also, I got rid of the target_parallel argument to
> apply_projection_to_path, as I thought that that was just way too much
> interconnection between apply_projection_to_path and its callers than
> is justified for what it saves (namely one call of has_parallel_hazard
> when planning a Gather).  In general, having that argument could *add*
> extra calls of has_parallel_hazard, since callers might have to do
> that computation whether or not a Gather is present.

I had a feeling you weren't going to like that, but it also didn't
seem great to redo that computation for every path.  Right now, we
only need it for Gather paths, but if we start marking subquery RTEs
as parallel-safe and fix upper rels to correctly set
consider_parallel, I have a feeling this is going to be needed more.
But feel free to ignore that for now since I don't have a completely
well-thought-out theory on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_bsd_indent - improvements around offsetof and sizeof

2016-06-21 Thread Bruce Momjian
On Tue, Jun 21, 2016 at 03:22:09PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Wed, May 25, 2016 at 03:13:23PM -0400, Tom Lane wrote:
> >> ... If we make a change like this, I think we should
> >> *strongly* consider reindenting all the live back branches along with
> >> HEAD.
> 
> > Uh, we have been running on back branches anytime the pgindent rules
> > change as part of policy, e.g.:
> > commit 2616a5d300e5bb5a2838d2a065afa3740e08727f
> 
> That was not actually a pgindent run, but a one-time application of a very
> limited filter.

Ah, yes, entab -m (only C comment periods).

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] pg_bsd_indent - improvements around offsetof and sizeof

2016-06-21 Thread Tom Lane
Bruce Momjian  writes:
> On Wed, May 25, 2016 at 03:13:23PM -0400, Tom Lane wrote:
>> ... If we make a change like this, I think we should
>> *strongly* consider reindenting all the live back branches along with
>> HEAD.

> Uh, we have been applying such changes to back branches anytime the
> pgindent rules change, as a matter of policy, e.g.:
>   commit 2616a5d300e5bb5a2838d2a065afa3740e08727f

That was not actually a pgindent run, but a one-time application of a very
limited filter.

regards, tom lane




Re: [HACKERS] Gin index on array of uuid

2016-06-21 Thread Tom Lane
Enrique MailingLists  writes:
> Currently creating an index on an array of UUID involves defining an
> operator class. I was wondering if this would be a valid request to add as
> part of the uuid-ossp extension? This seems like a reasonable operator to
> support as a default for UUIDs.

This makes me itch, really, because if we do this then we should logically
do it for every other add-on type.

It seems like we are not that far from being able to have just one GIN
opclass on "anyarray".  The only parts of this declaration that are
UUID-specific are the comparator function and the storage type, both of
which could be gotten without that much trouble, one would think.
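For concreteness, the manual opclass being discussed (today's workaround
for uuid[]) follows the pattern of the built-in GIN array opclasses: the
type's comparator plus the generic GIN array support routines, with uuid as
the storage type.  This is a sketch from memory, not a vetted definition;
the support-function signatures should be checked against the built-in
array opclass entries in pg_amproc:

```sql
CREATE OPERATOR CLASS _uuid_ops DEFAULT
    FOR TYPE uuid[] USING gin AS
    OPERATOR 1 &&(anyarray, anyarray),
    OPERATOR 2 @>(anyarray, anyarray),
    OPERATOR 3 <@(anyarray, anyarray),
    OPERATOR 4 =(anyarray, anyarray),
    FUNCTION 1 uuid_cmp(uuid, uuid),
    FUNCTION 2 ginarrayextract(anyarray, internal, internal),
    FUNCTION 3 ginqueryarrayextract(anyarray, internal, smallint,
                                    internal, internal, internal, internal),
    FUNCTION 4 ginarrayconsistent(internal, smallint, anyarray, integer,
                                  internal, internal, internal, internal),
    STORAGE uuid;
```

As noted above, only the comparator (uuid_cmp) and the STORAGE type are
UUID-specific; everything else is the generic anyarray machinery.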

> Any downsides to adding this as a default?

Well, it'd likely break things at dump/reload time for people who had
already created a competing "default for _uuid" opclass manually.  I'm not
entirely sure, but possibly replacing the core opclasses with a single one
that is "default for anyarray" could avoid such failures.  We'd have to
figure out ambiguity resolution rules.

regards, tom lane




Re: [HACKERS] pg_bsd_indent - improvements around offsetof and sizeof

2016-06-21 Thread Bruce Momjian
On Wed, May 25, 2016 at 03:13:23PM -0400, Tom Lane wrote:
> Robert Haas  writes:
> > On Sun, May 22, 2016 at 4:16 PM, Piotr Stefaniak
> >  wrote:
> >> I think I've managed to improve pg_bsd_indent's handling of two types of
> >> cases.
> 
> > Wow, that seems pretty great.  I haven't scrutinized your changes to
> > pg_bsd_indent, but effect_on_pg.diff looks like a large improvement.
> 
> I'm excited about this too, not least because it suggests that maybe
> bsdindent isn't quite as opaque as it appears.  I'd love to see a fix
> for its brain damage around function pointer typedef formatting, too.
> 
> Assuming this patch withstands more careful review, we will need to think
> about project policy for how/when to apply such fixes.  The last time
> we made any real change to pgindent's behavior was when we changed its
> wrapping of comment blocks back around 8.1 ... and I cursed that decision
> at least weekly for the next five years, because it caused constant
> back-patching pain.  If we make a change like this, I think we should
> *strongly* consider reindenting all the live back branches along with
> HEAD.

Uh, we have been running on back branches anytime the pgindent rules
change as part of policy, e.g.:

commit 2616a5d300e5bb5a2838d2a065afa3740e08727f
Author: Bruce Momjian 
Date:   Tue May 6 11:26:26 2014 -0400

Remove tabs after spaces in C comments

This was not changed in HEAD, but will be done later as part of a
pgindent run.  Future pgindent runs will also do this.

Report by Tom Lane

Backpatch through all supported branches, but not HEAD


-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Gin index on array of uuid

2016-06-21 Thread Peter Geoghegan
On Tue, Jun 21, 2016 at 11:42 AM, Enrique MailingLists
 wrote:
> This would be helpful for people trying to use arrays of UUIDs in cloud
> environments which limit root access.

I have personally seen numerous requests for this from users of Heroku
Postgres. So, I agree that there is a demand for this.

-- 
Peter Geoghegan




[HACKERS] Gin index on array of uuid

2016-06-21 Thread Enrique MailingLists
Currently creating an index on an array of UUID involves defining an
operator class. I was wondering if this would be a valid request to add as
part of the uuid-ossp extension? This seems like a reasonable operator to
support as a default for UUIDs. Any downsides to adding this as a default?

http://stackoverflow.com/questions/19959735/postgresql-gin-index-on-array-of-uuid

This is the definition from the stack overflow reference:

CREATE OPERATOR CLASS _uuid_ops DEFAULT
  FOR TYPE _uuid USING gin AS
  OPERATOR 1 &&(anyarray, anyarray),
  OPERATOR 2 @>(anyarray, anyarray),
  OPERATOR 3 <@(anyarray, anyarray),
  OPERATOR 4 =(anyarray, anyarray),
  FUNCTION 1 uuid_cmp(uuid, uuid),
  FUNCTION 2 ginarrayextract(anyarray, internal, internal),
  FUNCTION 3 ginqueryarrayextract(anyarray, internal, smallint,
internal, internal, internal, internal),
  FUNCTION 4 ginarrayconsistent(internal, smallint, anyarray, integer,
internal, internal, internal, internal),
  STORAGE uuid;


This would be helpful for people trying to use arrays of UUIDs in cloud
environments which limit root access.

Thank you,
Enrique


Re: [HACKERS] Reviewing freeze map code

2016-06-21 Thread Andres Freund
On 2016-06-21 13:03:24 -0400, Robert Haas wrote:
> On Tue, Jun 21, 2016 at 12:54 PM, Andres Freund  wrote:
> >> I don't quite understand the intended semantics of this proposed flag.
> >
> > Whenever the flag is set, we have to acquire the heavyweight tuple lock
> > before continuing. That guarantees nobody else can modify the tuple,
> > while the lock is released, without requiring to modify more than one
> > hint bit.  That should fix the torn page issue, no?
> 
> Yeah, I guess that would work.
> 
> >> If we don't already have the tuple lock at that point, we can't go
> >> acquire it before releasing the content lock, can we?
> >
> > Why not?  Afaics the way that tuple locks are used, the nesting should
> > be fine.
> 
> Well, the existing places where we acquire the tuple lock within
> heap_update() are all careful to release the page lock first, so I'm
> skeptical that doing it the other order is safe.  Certainly, if we've
> got some code that grabs the page lock and then the tuple lock and
> other code that grabs the tuple lock and then the page lock, that's a
> deadlock waiting to happen.

Just noticed this piece of code while looking into this:
UnlockReleaseBuffer(buffer);
if (have_tuple_lock)
    UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
if (vmbuffer != InvalidBuffer)
    ReleaseBuffer(vmbuffer);
return result;

seems weird to release the vmbuffer after the tuplelock...


> I'm also a bit dubious that LockAcquire is safe to call in general
> with interrupts held.

Looks like we could just acquire the tuple-lock *before* doing the
toast_insert_or_update/RelationGetBufferForTuple, but after releasing
the buffer lock. That'd allow us to do avoid doing the nested locking,
should make the recovery just a goto l2;, ...

Andres




Re: [HACKERS] 10.0

2016-06-21 Thread José Luis Tallón

On 06/20/2016 10:14 PM, Robert Haas wrote:
> On Mon, Jun 20, 2016 at 4:00 PM, David G. Johnston
>  wrote:
>> 10.x is the desired output.
>
> 10.x is the output that some people desire.


(explicitly skipped up-thread to add this -- please forgive my jumping in)

Since we are still (as a community) debating this, I felt the need to 
add yet another possibility ...

(/me dons some asbestos underwear)

next version:
100100 [server_version] will be  (gasp!)  10g R1 (for marketing people) 
or simply 10.1.0 for the rest of the world.

Next update would be 1001001, a.k.a. 10g R1 u1

i.e. we would skip 10.0 altogether, preserving everybody's sanity
in the process.



IMV,  10g R1 uX should plainly be a different *major* release than 10g 
R2 uY

and hence require a proper "migration" (i.e. pg_upgrade)
...or at least that's what several multi-billion software companies 
have taught most everybody to believe :$




> A significant number of
> people, including me, would prefer to stick with the current
> three-part versioning scheme, possibly with some change to the
> algorithm for bumping the first digit (e.g. every 5 years like
> clockwork).

That's another story. Either 5 (for historical reasons) or 10 years
(sounds good)
 ... or whenever it makes sense! (such as when a multi-modal, hybrid 
row+column, vectored, MPP, super-duper-distributed, automagically 
self-scaling or the like release is ready)



Thanks,

/ J.L.





Re: [HACKERS] Requesting external_pid_file with postgres -C when not initialized lead to coredump

2016-06-21 Thread Alvaro Herrera
alain radix wrote:
> Hi,
> 
> I faced a coredump when reading the value of the parameter
> "external_pid_file" when it was not initialized in postgresql.conf
> This came from the value not being specified to be initialized to an empty
> string in guc.c in the ConfigureNamesString array.
> The behavior can easily be tested with the following commands:
> initdb test
> postgres -D test -C external_pid_file
> 
> I faced the problem with version 9.3, 9.5 and 9.6 beta 1
> This seems to come from a long time ago.
> 
> I wrote a patch (with help from Stéphane Schildknecht) to correct the
> problem with a proper initialization.

This was fixed six days ago in a slightly different manner:

commit 0b0baf26211a98a8593279fa24635bc4dd9ac99d
Author: Tom Lane 
AuthorDate: Thu Jun 16 12:17:03 2016 -0400
CommitDate: Thu Jun 16 12:17:38 2016 -0400

Avoid crash in "postgres -C guc" for a GUC with a null string value.

Emit "(null)" instead, which was the behavior all along on platforms
that don't crash, eg OS X.  Per report from Jehan-Guillaume de Rorthais.
Back-patch to 9.2 where -C option was introduced.

Michael Paquier

Report: <20160615204036.2d35d86a@firost>


> The patch also removed a useless initialization of cluster_name to save a
> little memory.

Yeah, we could do away with that I suppose.


> So, here is my first patch for PostgreSQL.

Looking forward to further ones,

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Reviewing freeze map code

2016-06-21 Thread Robert Haas
On Tue, Jun 21, 2016 at 12:54 PM, Andres Freund  wrote:
>> I don't quite understand the intended semantics of this proposed flag.
>
> Whenever the flag is set, we have to acquire the heavyweight tuple lock
> before continuing. That guarantees nobody else can modify the tuple,
> while the lock is released, without requiring to modify more than one
> hint bit.  That should fix the torn page issue, no?

Yeah, I guess that would work.

>> If we don't already have the tuple lock at that point, we can't go
>> acquire it before releasing the content lock, can we?
>
> Why not?  Afaics the way that tuple locks are used, the nesting should
> be fine.

Well, the existing places where we acquire the tuple lock within
heap_update() are all careful to release the page lock first, so I'm
skeptical that doing it the other order is safe.  Certainly, if we've
got some code that grabs the page lock and then the tuple lock and
other code that grabs the tuple lock and then the page lock, that's a
deadlock waiting to happen.  I'm also a bit dubious that LockAcquire
is safe to call in general with interrupts held.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] MultiXactId error after upgrade to 9.3.4

2016-06-21 Thread Robert Haas
On Fri, Jun 17, 2016 at 9:33 AM, Andrew Gierth
 wrote:
>> "Robert" == Robert Haas  writes:
>  >> Why is the correct rule not "check for and ignore pre-upgrade mxids
>  >> before even trying to fetch members"?
>
>  Robert> I entirely believe that's the correct rule, but doesn't
>  Robert> implementing it require a crystal balll?
>
> Why would it? Pre-9.3 mxids are identified by the combination of flag
> bits in the infomask, see Alvaro's patch.

I see the patch, but I don't see much explanation of why the patch is
correct, which I think is pretty scary in view of the number of
mistakes we've already made in this area.  The comments just say:

+ * A tuple that has HEAP_XMAX_IS_MULTI and HEAP_XMAX_LOCK_ONLY but neither of
+ * XMAX_EXCL_LOCK and XMAX_KEYSHR_LOCK must come from a tuple that was
+ * share-locked in 9.2 or earlier and then pg_upgrade'd.

Why must that be true?

+ * We must not try to resolve such multixacts locally, because the result would
+ * be bogus, regardless of where they stand with respect to the current valid
+ * range.

What about other pre-upgrade mxacts that don't have this exact bit
pattern?  Why can't we try to resolve them and end up in trouble just
as easily?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reviewing freeze map code

2016-06-21 Thread Andres Freund
On 2016-06-21 10:50:36 -0400, Robert Haas wrote:
> On Mon, Jun 20, 2016 at 11:51 PM, Andres Freund  wrote:
> >> > So far the best idea I have - and it's really not a good one - is to
> >> > invent a new hint-bit that tells concurrent updates to acquire a
> >> > heavyweight tuple lock, while releasing the page-level lock. If that
> >> > hint bit does not require any other modifications - and we don't need an
> >> > xid in xmax for this use case - that'll avoid doing all the other
> >> > `already_marked` stuff early, which should address the correctness
> >> > issue.
> >> >
> >>
> >> Don't we need to clear such a flag in case of error?  Also don't we need to
> >> reset it later, like when modifying the old page later before WAL.
> >
> > If the flag just says "acquire a heavyweight lock", then there's no need
> > for that. That's cheap enough to just do if it's errorneously set.  At
> > least I can't see any reason.
> 
> I don't quite understand the intended semantics of this proposed flag.

Whenever the flag is set, we have to acquire the heavyweight tuple lock
before continuing. That guarantees nobody else can modify the tuple,
while the lock is released, without requiring to modify more than one
hint bit.  That should fix the torn page issue, no?

> If we don't already have the tuple lock at that point, we can't go
> acquire it before releasing the content lock, can we?

Why not?  Afaics the way that tuple locks are used, the nesting should
be fine.

Andres




Re: [HACKERS] [GENERAL] PgQ and pg_dump

2016-06-21 Thread Martín Marqués
2016-06-21 13:08 GMT-03:00 Robert Haas :
> On Thu, Jun 16, 2016 at 1:46 PM, Martín Marqués  
> wrote:
>> The comment is accurate on what is going to be dumpable and what's not
> from the code. In our case, as the pgq schema is not dumpable because
> it comes from an extension, other objects it contains will not be
> dumpable as well.
>>
>> That's the reason why the PgQ event tables created by
>> pgq.create_queue() are not dumped.
>
> That sucks.

Yes, and I'm surprised we haven't had any bug report yet on
inconsistent dumps. The patch that changed pg_dump's behavior on
extension objects is more than a year old.

I'll find some time today to add tests and check for other objects
that are not dumped for the same reason.

Cheers,

-- 
Martín Marqués    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Choosing the cheapest optimizer cost

2016-06-21 Thread Bruce Momjian
On Tue, Jun 21, 2016 at 11:17:19AM -0400, Robert Haas wrote:
> If the index scans are parameterized by values from the seq scan,
> which is likely the situation in which this sort of plan will be
> generated, we'll pay the extra cost of building the hash table once
> per row in something_big.
> 
> I think we should consider switching from a nested loop to a hash join
> on the fly if the outer relation turns out to be bigger than expected.
> We could work out during planning what the expected breakeven point
> is; if the actual outer row count passes that, switch to a hash join.
> This has been discussed before, but nobody's tried to do the work,
> AFAIK.

Yes, the idea of either adjusting the execution plan when counts are
inaccurate, or feeding information about misestimation back to the
optimizer for future queries is something I hope we try someday.
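To make the switch-at-breakeven idea concrete, here is a toy sketch of an
executor that starts as a nested loop and builds a hash table once the outer
side passes the planner's estimated breakeven row count. Everything here
(function names, the flat-list "relations", the breakeven parameter) is
hypothetical illustration, not how the PostgreSQL executor is actually coded:

```python
def adaptive_join(outer_rows, inner_rows, key, breakeven):
    """Toy join: nested-loop probing until the outer side exceeds
    `breakeven`, then switch to a hash join for the remaining rows.
    Purely illustrative pseudocode for the on-the-fly switch idea."""
    results = []
    hash_table = None
    for i, outer in enumerate(outer_rows):
        if hash_table is None and i >= breakeven:
            # Outer relation turned out bigger than estimated: pay the
            # one-time cost of building the hash table now.
            hash_table = {}
            for inner in inner_rows:
                hash_table.setdefault(inner[key], []).append(inner)
        if hash_table is not None:
            matches = hash_table.get(outer[key], [])
        else:
            # Stand-in for the per-row index probe of a nested loop.
            matches = [inner for inner in inner_rows
                       if inner[key] == outer[key]]
        results.extend((outer, m) for m in matches)
    return results
```

The join result is the same whichever side of the breakeven the outer row
count lands on; only the cost profile changes, which is the point of the
proposal.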

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Lets (not) break all the things. Was: [pgsql-advocacy] 9.6 -> 10.0

2016-06-21 Thread Bruce Momjian
On Tue, Jun 21, 2016 at 12:12:34PM -0400, Robert Haas wrote:
> >> > What is confusing you?
> >>
> >> I don't think I'm confused.  Sure, you can do that, but the effects of
> >> any writes performed on the new cluster will not be there when you
> >> revert back to the old cluster.  So you will have effectively lost
> >> data, unless you somehow have the ability to re-apply all of those
> >> write transactions somehow.
> >
> > Yes, that is true.  I assume _revert_ means something really bad
> > happened and you don't want those writes because they are somehow
> > corrupt.
> 
> I think that it's pretty likely you could, say, upgrade to a new major
> release, discover that it has a performance problem or some other bug
> that causes a problem for you, and want to go back to the older
> release.  There's not really an easy way to do that, because a pg_dump
> taken from the new system might not restore on the older one.  Logical
> replication - e.g. Slony - can provide a way, but we don't have
> anything in core that can do it.

Yes, there is data loss in a rollback to the old cluster, no question.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Lets (not) break all the things. Was: [pgsql-advocacy] 9.6 -> 10.0

2016-06-21 Thread Bruce Momjian
On Tue, Jun 21, 2016 at 08:56:09PM +0800, Craig Ringer wrote:
> Also, if you run *with* --link, IIRC there's no guarantee that the old version
> will be happy to see any new infomask bits etc introduced by the new Pg. I

Well, we only write system tables in pg_upgrade in the new cluster, and
those are not hard linked.  As far as I know, we never write to anything
we hard link from the old cluster.

> think there will also be issues with oid to relfilenode mappings in pg_class 
> if
> the new cluster did any VACUUM FULLs or anything. It seems likely to be a bit

pg_upgrade turns off all vacuums.

> risky to fall back on the old cluster once you've upgraded with --link . TBH 
> it
> never even occurred to me that it'd be possible at all until you mentioned.

Well, with --link, you can't start the old cluster, and that is
documented, and pg_control is renamed to prevent accidental start.  I
think it is possible to start the old cluster before the new cluster is
started.

> I always thought of pg_upgrade as a one-way no-going-back process either way,
> really. Either due to a fork in history (without --link) or due to possibly
> incompatible datadir changes (with --link).

Yes.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Lets (not) break all the things. Was: [pgsql-advocacy] 9.6 -> 10.0

2016-06-21 Thread Robert Haas
On Tue, Jun 21, 2016 at 11:34 AM, Bruce Momjian  wrote:
> On Tue, Jun 21, 2016 at 08:19:55AM -0400, Robert Haas wrote:
>> On Mon, Jun 20, 2016 at 10:08 PM, Bruce Momjian  wrote:
>> >> No, not really.  Once you let write transactions into the new cluster,
>> >> there's no way to get back to the old server version no matter which
>> >> option you used.
>> >
>> > Yes, there is, and it is documented:
>> >
>> > If you ran pg_upgrade without
>> > --link or did not start the new server, the
>> > old cluster was not modified except that, if linking
>> > started, a .old suffix was appended to
>> > $PGDATA/global/pg_control.  To reuse the old
>> > cluster, possibly remove the .old suffix from
>> > $PGDATA/global/pg_control; you can then restart the
>> > old cluster.
>> >
>> > What is confusing you?
>>
>> I don't think I'm confused.  Sure, you can do that, but the effects of
>> any writes performed on the new cluster will not be there when you
>> revert back to the old cluster.  So you will have effectively lost
>> data, unless you somehow have the ability to re-apply all of those
>> write transactions somehow.
>
> Yes, that is true.  I assume _revert_ means something really bad
> happened and you don't want those writes because they are somehow
> corrupt.

I think that it's pretty likely you could, say, upgrade to a new major
release, discover that it has a performance problem or some other bug
that causes a problem for you, and want to go back to the older
release.  There's not really an easy way to do that, because a pg_dump
taken from the new system might not restore on the older one.  Logical
replication - e.g. Slony - can provide a way, but we don't have
anything in core that can do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] PgQ and pg_dump

2016-06-21 Thread Robert Haas
On Thu, Jun 16, 2016 at 1:46 PM, Martín Marqués  wrote:
> The comment is accurate on what is going to be dumpable and what's not
> from the code. In our case, as the pgq schema is not dumpable because
> it comes from an extension, other objects it contains will not be
> dumpable as well.
>
> That's the reason why the PgQ event tables created by
> pgq.create_queue() are not dumped.

That sucks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-06-21 Thread Robert Haas
On Thu, Jun 16, 2016 at 3:28 AM, Amit Kapila  wrote:
>> Incomplete splits can be completed either by vacuum or insert as both
>> needs exclusive lock on bucket.  If vacuum finds split-in-progress flag on a
>> bucket then it will complete the split operation, vacuum won't see this flag
>> if actually split is in progress on that bucket as vacuum needs cleanup lock
>> and split retains pin till end of operation.  To make it work for Insert
>> operation, one simple idea could be that if insert finds split-in-progress
>> flag, then it releases the current exclusive lock on bucket and tries to
>> acquire a cleanup lock on bucket, if it gets cleanup lock, then it can
>> complete the split and then the insertion of tuple, else it will have a
>> exclusive lock on bucket and just perform the insertion of tuple.  The
>> disadvantage of trying to complete the split in vacuum is that split might
>> require new pages and allocating new pages at time of vacuum is not
>> advisable.  The disadvantage of doing it at time of Insert is that Insert
>> might skip it even if there is some scan on the bucket is going on as scan
>> will also retain pin on the bucket, but I think that is not a big deal.  The
>> actual completion of split can be done in two ways: (a) scan the new bucket
>> and build a hash table with all of the TIDs you find there.  When copying
>> tuples from the old bucket, first probe the hash table; if you find a match,
>> just skip that tuple (idea suggested by Robert Haas offlist) (b) delete all
>> the tuples that are marked as moved_by_split in the new bucket and perform
>> the split operation from the beginning using old bucket.
>
> I have completed the patch with respect to incomplete splits and delayed
> cleanup of garbage tuples.  For incomplete splits, I have used the option
> (a) as mentioned above.  The incomplete splits are completed if the
> insertion sees split-in-progress flag in a bucket.

It seems to me that there is a potential performance problem here.  If
the split is still being performed, every insert will see the
split-in-progress flag set.  The in-progress split retains only a pin
on the primary bucket, so other backends could also get an exclusive
lock, which is all they need for an insert.  It seems that under this
algorithm they will now take the exclusive lock, release the exclusive
lock, try to take a cleanup lock, fail, again take the exclusive lock.
That seems like a lot of extra monkeying around.  Wouldn't it be
better to take the exclusive lock and then afterwards check if the pin
count is 1?  If so, even though we only intended to take an exclusive
lock, it is actually a cleanup lock.  If not, we can simply proceed
with the insertion.  This way you avoid unlocking and relocking the
buffer repeatedly.
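A toy simulation of that "exclusive lock that turns out to be a cleanup
lock" check, in case it helps; the Buffer class and the call shapes are
stand-ins for illustration only, not the real buffer manager API:

```python
class Buffer:
    """Minimal stand-in for a shared buffer: a pin count plus an
    exclusive content lock (single-threaded toy, no real concurrency)."""
    def __init__(self):
        self.pin_count = 0
        self.locked = False

def insert_into_bucket(buf, complete_split, do_insert):
    """Insert while holding the exclusive lock; if our pin is the only
    one, the exclusive lock is effectively a cleanup lock, so finish the
    pending split in passing instead of unlocking and relocking."""
    buf.pin_count += 1          # every scan/insert holds a pin
    buf.locked = True           # take the exclusive content lock
    try:
        if buf.pin_count == 1:
            complete_split()    # nobody else is looking: safe to clean up
        do_insert()             # either way, the insert can proceed
    finally:
        buf.locked = False
        buf.pin_count -= 1
```

The point of the sketch is only the ordering: one lock acquisition, a pin
count check, and no release/reacquire cycle in either branch.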

> The second major thing
> this new version of patch has achieved is cleanup of garbage tuples i.e the
> tuples that are left in old bucket during split.  Currently (in HEAD), as
> part of a split operation, we clean the tuples from old bucket after moving
> them to new bucket, as we have heavy-weight locks on both old and new bucket
> till the whole split operation.  In the new design, we need to take cleanup
> lock on old bucket and exclusive lock on new bucket to perform the split
> operation and we don't retain those locks till the end (release the lock as
> we move on to overflow buckets).  Now to cleanup the tuples we need a
> cleanup lock on a bucket which we might not have at split-end.  So I choose
> to perform the cleanup of garbage tuples during vacuum and when re-split of
> the bucket happens as during both the operations, we do hold cleanup lock.
> We can extend the cleanup of garbage to other operations as well if
> required.

I think it's OK for the squeeze phase to be deferred until vacuum or a
subsequent split, but simply removing dead tuples seems like it should
be done earlier if possible.  As I noted in my last email, it seems
like any process that gets an exclusive lock can do that, and probably
should.  Otherwise, the index might become quite bloated.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-06-21 Thread Robert Haas
On Tue, May 10, 2016 at 8:09 AM, Amit Kapila  wrote:
>
> For making hash indexes usable in production systems, we need to improve its 
> concurrency and make them crash-safe by WAL logging them.  The first problem 
> I would like to tackle is improve the concurrency of hash indexes.  First 
> advantage, I see with improving concurrency of hash indexes is that it has 
> the potential of out performing btree for "equal to" searches (with my WIP 
> patch attached with this mail, I could see hash index outperform btree index 
> by 20 to 30% for very simple cases which are mentioned later in this e-mail). 
>   Another advantage as explained by Robert [1] earlier is that if we remove 
> heavy weight locks under which we perform arbitrarily large number of 
> operations, it can help us to sensibly WAL log it.  With this patch, I would 
> also like to make hash indexes capable of completing the incomplete_splits 
> which can occur due to interrupts (like cancel) or errors or crash.
>
> I have studied the concurrency problems of hash index and some of the 
> solutions proposed for same previously and based on that came up with below 
> solution which is based on idea by Robert [1], community discussion on thread 
> [2] and some of my own thoughts.
>
> Maintain a flag that can be set and cleared on the primary bucket page, call 
> it split-in-progress, and a flag that can optionally be set on particular 
> index tuples, call it moved-by-split. We will allow scans of all buckets and 
> insertions into all buckets while the split is in progress, but (as now) we 
> will not allow more than one split for a bucket to be in progress at the same 
> time.  We start the split by updating metapage to incrementing the number of 
> buckets and set the split-in-progress flag in primary bucket pages for old 
> and new buckets (lets number them as old bucket - N+1/2; new bucket - N + 1 
> for the matter of discussion). While the split-in-progress flag is set, any 
> scans of N+1 will first scan that bucket, ignoring any tuples flagged 
> moved-by-split, and then ALSO scan bucket N+1/2. To ensure that vacuum 
> doesn't clean any tuples from old or new buckets till this scan is in 
> progress, maintain a pin on both of the buckets (first pin on old bucket 
> needs to be acquired). The moved-by-split flag never has any effect except 
> when scanning the new bucket that existed at the start of that particular 
> scan, and then only if the split-in-progress flag was also set at that time.

You really need parentheses in (N+1)/2.  Because you are not trying to
add 1/2 to N.  https://en.wikipedia.org/wiki/Order_of_operations
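In C's integer arithmetic the two spellings really do diverge; a quick
sketch, using Python's truncating // to mimic C's integer division:

```python
# '/' binds tighter than '+', so N + 1/2 is just N in C
# (1/2 truncates to 0), while (N + 1)/2 halves the incremented value.
N = 5
without_parens = N + 1 // 2     # 5 + 0 == 5: not a bucket number at all
with_parens = (N + 1) // 2      # 6 // 2 == 3: the intended old bucket
```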

> Once the split operation has set the split-in-progress flag, it will begin 
> scanning bucket (N+1)/2.  Every time it finds a tuple that properly belongs 
> in bucket N+1, it will insert the tuple into bucket N+1 with the 
> moved-by-split flag set.  Tuples inserted by anything other than a split 
> operation will leave this flag clear, and tuples inserted while the split is 
> in progress will target the same bucket that they would hit if the split were 
> already complete.  Thus, bucket N+1 will end up with a mix of moved-by-split 
> tuples, coming from bucket (N+1)/2, and unflagged tuples coming from parallel 
> insertion activity.  When the scan of bucket (N+1)/2 is complete, we know 
> that bucket N+1 now contains all the tuples that are supposed to be there, so 
> we clear the split-in-progress flag on both buckets.  Future scans of both 
> buckets can proceed normally.  Split operation needs to take a cleanup lock 
> on primary bucket to ensure that it doesn't start if there is any Insertion 
> happening in the bucket.  It will leave the lock on primary bucket, but not 
> pin as it proceeds for next overflow page.  Retaining pin on primary bucket 
> will ensure that vacuum doesn't start on this bucket till the split is 
> finished.

In the second-to-last sentence, I believe you have reversed the words
"lock" and "pin".

> Insertion will happen by scanning the appropriate bucket and needs to retain 
> pin on primary bucket to ensure that concurrent split doesn't happen, 
> otherwise split might leave this tuple unaccounted.

What do you mean by "unaccounted"?

> Now for deletion of tuples from (N+1/2) bucket, we need to wait for the 
> completion of any scans that began before we finished populating bucket N+1, 
> because otherwise we might remove tuples that they're still expecting to find 
> in bucket (N+1)/2. The scan will always maintain a pin on primary bucket and 
> Vacuum can take a buffer cleanup lock (cleanup lock includes Exclusive lock 
> on bucket and wait till all the pins on buffer becomes zero) on primary 
> bucket for the buffer.  I think we can relax the requirement for vacuum to 
> take cleanup lock (instead take Exclusive Lock on buckets where no split has 
> happened) with the additional flag has_garbage which will be set on primary 
> bucket, if any tuples have 

[HACKERS] Requesting external_pid_file with postgres -C when not initialized lead to coredump

2016-06-21 Thread alain radix
Hi,

I faced a coredump when reading the value of the parameter
"external_pid_file" when it was not initialized in postgresql.conf.
This came from the value not being specified to be initialized to an empty
string in guc.c in the ConfigureNamesString array.
The behavior can easily be tested with the following commands:
initdb test
postgres -D test -C external_pid_file

I faced the problem with versions 9.3, 9.5 and 9.6 beta 1.
This seems to date from a long time ago.

I wrote a patch (with help from Stéphane Schildknecht) to correct the
problem with a proper initialization.
The patch also removed a useless initialization of cluster_name to save a
little memory.

So, here is my first patch for PostgreSQL.

Best regards.

-- 
Alain Radix


external_pid_file.patch
Description: Binary data



Re: [HACKERS] Lets (not) break all the things. Was: [pgsql-advocacy] 9.6 -> 10.0

2016-06-21 Thread Bruce Momjian
On Tue, Jun 21, 2016 at 08:19:55AM -0400, Robert Haas wrote:
> On Mon, Jun 20, 2016 at 10:08 PM, Bruce Momjian  wrote:
> >> No, not really.  Once you let write transactions into the new cluster,
> >> there's no way to get back to the old server version no matter which
> >> option you used.
> >
> > Yes, there is, and it is documented:
> >
> > If you ran pg_upgrade without
> > --link or did not start the new server, the
> > old cluster was not modified except that, if linking
> > started, a .old suffix was appended to
> > $PGDATA/global/pg_control.  To reuse the old
> > cluster, possibly remove the .old suffix from
> > $PGDATA/global/pg_control; you can then restart the
> > old cluster.
> >
> > What is confusing you?
> 
> I don't think I'm confused.  Sure, you can do that, but the effects of
> any writes performed on the new cluster will not be there when you
> revert back to the old cluster.  So you will have effectively lost
> data, unless you somehow have the ability to re-apply all of those
> write transactions somehow.

Yes, that is true.  I assume _revert_ means something really bad
happened and you don't want those writes because they are somehow
corrupt.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Choosing the cheapest optimizer cost

2016-06-21 Thread Robert Haas
On Wed, Jun 15, 2016 at 3:46 PM, Bruce Momjian  wrote:
> Right now, the optimizer chooses the path with the cheapest cost.
>
> However, do we take into account the behavior of the plan in handling
> mis-estimated row counts?

No.

> For example, if a path has a log(n) behavior
> for changes in the row count, and another plan that is slightly cheaper
> has a log(n^2) behavior, should we choose the former, knowing that the
> row counts are often inaccurate?

Maybe.  It's not really that simple, though.  In practice, the
decision we have to make is something like: should we use a nested
loop here, which will be better if either the inner or outer relation
has < 2 rows, or should we use a hash join, which will be better if
both sides are big?  The nested loop can be really, really bad if it
turns out that the inner side which we expected to have 1 row actually
has 1,000,000 rows, but the hash join can lose, too.  Consider:

Nested Loop Left Join
-> Seq Scan on something_big
-> Hash Join
  -> Index Scan on at_most_one_row_expected
  -> Hash
-> Index Scan on not_very_many_rows

If the index scans are parameterized by values from the seq scan,
which is likely the situation in which this sort of plan will be
generated, we'll pay the extra cost of building the hash table once
per row in something_big.

I think we should consider switching from a nested loop to a hash join
on the fly if the outer relation turns out to be bigger than expected.
We could work out during planning what the expected breakeven point
is; if the actual outer row count passes that, switch to a hash join.
This has been discussed before, but nobody's tried to do the work,
AFAIK.
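The breakeven idea is easy to work out with a toy cost model (all constants below are invented for illustration; this is not planner code):

```python
# At what outer row count does a hash join beat a nested loop?
INNER_COST_PER_PROBE = 4.0   # nested loop: rescan the inner per outer row
HASH_BUILD_COST = 100.0      # hash join: build the hash table once...
HASH_PROBE_COST = 0.5        # ...then a cheap probe per outer row

def nestloop_cost(outer_rows):
    return outer_rows * INNER_COST_PER_PROBE

def hashjoin_cost(outer_rows):
    return HASH_BUILD_COST + outer_rows * HASH_PROBE_COST

def breakeven():
    # solve n * INNER_COST_PER_PROBE == HASH_BUILD_COST + n * HASH_PROBE_COST
    return HASH_BUILD_COST / (INNER_COST_PER_PROBE - HASH_PROBE_COST)

print(f"switch to hash join past ~{breakeven():.1f} outer rows")
print(nestloop_cost(10) < hashjoin_cost(10))      # True: nestloop wins small
print(nestloop_cost(1000) > hashjoin_cost(1000))  # True: hash join wins big
```

The planner could compute this crossover during planning; the executor would only need to compare the actual outer row count against it at run time.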

> I suppose one approach would be to track not only the path costs, but
> the handling of mis-estimated, and account for that in the final path
> choice?  Do we already do this by giving less stable plans higher costs?
> Does that have the same effect?

The problem with tracking more things during planning is that it
increases the cost of planning.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] primary_conninfo missing from pg_stat_wal_receiver

2016-06-21 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Michael Paquier  writes:
> > On Tue, Jun 21, 2016 at 11:29 AM, Tom Lane  wrote:
> >> What I would want to know is whether this specific change is actually a
> >> good idea.  In particular, I'm concerned about the possible security
> >> implications of exposing primary_conninfo --- might it not contain a
> >> password, for example?
> 
> > Yes it could, as a connection string, but we make the information of
> > this view only visible to superusers. For the others, that's just
> > NULL.
> 
> Well, that's okay for now, but I'm curious to hear Stephen Frost's
> opinion on this.  He's been on the warpath to decrease our dependence
> on superuser-ness for protection purposes.  Seems to me that having
> one column in this view that is a lot more security-sensitive than
> the others is likely to be an issue someday.

Ugh.  I would certainly rather not have yet another special, hard-coded,
bit of logic that magically makes things available to a superuser when
it's not available to regular users.

What that results in is the need to have a new default role to control
access to that column for the non-superuser case.

As for the password showing up, sorry, but we need a solution for *that*
regardless of the rest - the password shouldn't be exposed to anyone, nor
should it be sent and kept in the backend's memory for longer than
necessary.  I'm not suggesting we've got that figured out already, but
that's where we should be trying to go.

Apologies, I've not followed this thread entirely, so these comments are
based only on what's quoted above.  I'll try to read the complete thread
tomorrow.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Reviewing freeze map code

2016-06-21 Thread Robert Haas
On Tue, Jun 21, 2016 at 10:47 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund  wrote:
>>> Consider what happens if we, after restarting at l2, notice that we
>>> can't actually insert, but return in the !HeapTupleMayBeUpdated
>>> branch.
>
>> OK, I see what you mean.  Still, that doesn't seem like such a
>> terrible cost.  If you try to update a tuple and if it looks like you
>> can update it but then after TOASTing you find that the status of the
>> tuple has changed such that you can't update it after all, then you
>> might need to go set xmax = MyTxid() on all of the TOAST tuples you
>> created (whose CTIDs we could save someplace, so that it's just a
>> matter of finding them by CTID to kill them).
>
> ... and if you get an error or crash partway through that, what happens?

Then the transaction is aborted anyway, and we haven't leaked anything
because VACUUM will clean it up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-21 Thread Robert Haas
On Mon, Jun 20, 2016 at 11:51 PM, Andres Freund  wrote:
>> > So far the best idea I have - and it's really not a good one - is to
>> > invent a new hint-bit that tells concurrent updates to acquire a
>> > heavyweight tuple lock, while releasing the page-level lock. If that
>> > hint bit does not require any other modifications - and we don't need an
>> > xid in xmax for this use case - that'll avoid doing all the other
>> > `already_marked` stuff early, which should address the correctness
>> > issue.
>> >
>>
>> Don't we need to clear such a flag in case of error?  Also don't we need to
>> reset it later, like when modifying the old page later before WAL.
>
> If the flag just says "acquire a heavyweight lock", then there's no need
> for that. That's cheap enough to just do if it's erroneously set.  At
> least I can't see any reason.

I don't quite understand the intended semantics of this proposed flag.
If we don't already have the tuple lock at that point, we can't go
acquire it before releasing the content lock, can we?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund  wrote:
>> Consider what happens if we, after restarting at l2, notice that we
>> can't actually insert, but return in the !HeapTupleMayBeUpdated
>> branch.

> OK, I see what you mean.  Still, that doesn't seem like such a
> terrible cost.  If you try to update a tuple and if it looks like you
> can update it but then after TOASTing you find that the status of the
> tuple has changed such that you can't update it after all, then you
> might need to go set xmax = MyTxid() on all of the TOAST tuples you
> created (whose CTIDs we could save someplace, so that it's just a
> matter of finding them by CTID to kill them).

... and if you get an error or crash partway through that, what happens?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-06-21 Thread Robert Haas
On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund  wrote:
> On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
>> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund  wrote:
>> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
>> >> I
>> >> mean, suppose we just don't do any of that before we go off to do
>> >> toast_insert_or_update and RelationGetBufferForTuple.  Eventually,
>> >> when we reacquire the page lock, we might find that somebody else has
>> >> already updated the tuple, but couldn't that be handled by
>> >> (approximately) looping back up to l2 just as we do in several other
>> >> cases?
>> >
>> > We'd potentially have to undo a fair amount more work: the toasted data
>> > would have to be deleted and such, just to retry. Which isn't going to
>> > super easy, because all of it will be happening with the current cid (we
>> > can't just increase CommandCounterIncrement() for correctness reasons).
>>
>> Why would we have to delete the TOAST data?  AFAIUI, the tuple points
>> to the TOAST data, but not the other way around.  So if we change our
>> mind about where to put the tuple, I don't think that requires
>> re-TOASTing.
>
> Consider what happens if we, after restarting at l2, notice that we
> can't actually insert, but return in the !HeapTupleMayBeUpdated
> branch. If the caller doesn't error out - and there's certainly callers
> doing that - we'd "leak" a toasted datum. Unless the transaction aborts,
> the toasted datum would never be cleaned up, because there's no datum
> pointing to it, so no heap_delete will ever recurse into the toast
> datum (via toast_delete()).

OK, I see what you mean.  Still, that doesn't seem like such a
terrible cost.  If you try to update a tuple and if it looks like you
can update it but then after TOASTing you find that the status of the
tuple has changed such that you can't update it after all, then you
might need to go set xmax = MyTxid() on all of the TOAST tuples you
created (whose CTIDs we could save someplace, so that it's just a
matter of finding them by CTID to kill them).  That's not likely to
happen particularly often, though, and when it does happen it's not
insanely expensive.  We could also reduce the cost by letting the
caller of heap_update() decide whether to back out the work; if the
caller intends to throw an error anyway, then there's no point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Precedence of new phrase search tsquery operator

2016-06-21 Thread Teodor Sigaev

On Wed, Jun 8, 2016 at 7:13 PM, Tom Lane  wrote:

It appears that the new <-> operator has been made to have exactly the
same grammatical precedence as the existing & (AND) operator.  Thus,
for example, 'a & b <-> c'::tsquery means something different from
'b <-> c & a'::tsquery:
I find this surprising.  My intuitive feeling is that <-> ought to
bind tighter than & (and therefore also tighter than |).  What's
the reasoning for making it act like this?


ah, now we remember :)  The idea about the equivalence of & and <->
operators appeared in a situation where <-> degenerates to & in the
absence of positional information. Looks like we mixed up different
things; will fix.


The attached patch changes the precedence of the operators to |, &, <->, !
in ascending order.  BTW, it also simplifies a bit the code around printing
and parsing of tsquery.
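Under an ascending precedence with <-> binding tighter than & (| lowest), 'a & b <-> c' and 'b <-> c & a' group the same way. A toy precedence-climbing parser (plain Python on single-word operands; the real grammar lives in the tsquery parser, and parentheses and ! are omitted for brevity) illustrates the effect:

```python
import re

# Ascending precedence: | lowest, then &, then <-> (toy model only).
PREC = {"|": 1, "&": 2, "<->": 3}
TOKEN = re.compile(r"\s*(<->|[|&]|[a-z]+)")

def tokenize(s):
    toks, pos = [], 0
    while pos < len(s):
        m = TOKEN.match(s, pos)
        toks.append(m.group(1))
        pos = m.end()
    return toks

def parse(toks, min_prec=1):
    """Precedence climbing: consume operators at or above min_prec."""
    node = toks.pop(0)                     # operand
    while toks and toks[0] in PREC and PREC[toks[0]] >= min_prec:
        op = toks.pop(0)
        rhs = parse(toks, PREC[op] + 1)    # left-associative
        node = (op, node, rhs)
    return node

# With <-> binding tighter than &, 'a & b <-> c' groups as a & (b <-> c):
print(parse(tokenize("a & b <-> c")))   # ('&', 'a', ('<->', 'b', 'c'))
print(parse(tokenize("b <-> c & a")))   # ('&', ('<->', 'b', 'c'), 'a')
```

Both inputs now yield the same grouping (only the operand order differs), removing the surprise Tom described upthread.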


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


phrase_predecence-2.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Missing checks when malloc returns NULL...

2016-06-21 Thread Tom Lane
Michael Paquier  writes:
> - mcxt.c uses that, which is surprising:
> @@ -704,7 +704,8 @@ MemoryContextCreate(NodeTag tag, Size size,
> {
> /* Special case for startup: use good ol' malloc */
> node = (MemoryContext) malloc(needed);
> -   Assert(node != NULL);
> +   if (node == NULL)
> +   elog(PANIC, "out of memory");
> }
> I think that a PANIC is cleaner here instead of a simple crash.

But the elog mechanism assumes that memory contexts are working.
If this ever actually did fire, no good would come of it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Lets (not) break all the things. Was: [pgsql-advocacy] 9.6 -> 10.0

2016-06-21 Thread Craig Ringer
On 21 June 2016 at 20:19, Robert Haas  wrote:

> On Mon, Jun 20, 2016 at 10:08 PM, Bruce Momjian  wrote:
> > On Fri, May 20, 2016 at 07:40:53PM -0400, Robert Haas wrote:
> >> On Mon, May 16, 2016 at 3:36 AM, Bruce Momjian 
> wrote:
> >> > On Sun, May 15, 2016 at 03:23:52PM -0500, Jim Nasby wrote:
> >> >> 2) There's no ability at all to revert, other than restore a backup.
> That
> >> >> means if you pull the trigger and discover some major performance
> problem,
> >> >> you have no choice but to deal with it (you can't switch back to the
> old
> >> >> version without losing data).
> >> >
> >> > In --link mode only
> >>
> >> No, not really.  Once you let write transactions into the new cluster,
> >> there's no way to get back to the old server version no matter which
> >> option you used.
> >
> > Yes, there is, and it is documented:
> >
> > If you ran pg_upgrade without
> > --link or did not start the new server, the
> > old cluster was not modified except that, if linking
> > started, a .old suffix was appended to
> > $PGDATA/global/pg_control.  To reuse the old
> > cluster, possibly remove the .old suffix from
> > $PGDATA/global/pg_control; you can then restart the
> > old cluster.
> >
> > What is confusing you?
>
> I don't think I'm confused.  Sure, you can do that, but the effects of
> any writes performed on the new cluster will not be there when you
> revert back to the old cluster.  So you will have effectively lost
> data, unless you somehow have the ability to re-apply all of those
> write transactions somehow.
>

Also, if you run *with* --link, IIRC there's no guarantee that the old
version will be happy to see any new infomask bits etc introduced by the
new Pg. I think there will also be issues with oid to relfilenode mappings
in pg_class if the new cluster did any VACUUM FULLs or anything. It seems
likely to be a bit risky to fall back on the old cluster once you've
upgraded with --link.  TBH it never even occurred to me that it'd be
possible at all until you mentioned it.

I always thought of pg_upgrade as a one-way no-going-back process either
way, really. Either due to a fork in history (without --link) or due to
possibly incompatible datadir changes (with --link).

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Postgres_fdw join pushdown - wrong results with whole-row reference

2016-06-21 Thread Etsuro Fujita

On 2016/06/21 21:37, Ashutosh Bapat wrote:


How about using a system column, e.g. ctid, for the CASE WHEN
conversion; in Rushabh's example the reference to "r1" would be
converted with "CASE WHEN r1.ctid IS NOT NULL THEN ROW(r1.empno,
r1.ename, r1.job, r1.mgr, r1.hiredate, r1.sal, r1.comm, r1.deptno)
END".  IMO that would be much simpler than Ashutosh's approach.



A foreign table can have a view, a regular table, another foreign table
or a materialised view as its target. A view does not support any of the
system columns, so none of them are available.


You are right.  Sorry for the noise.

Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres_fdw join pushdown - wrong results with whole-row reference

2016-06-21 Thread Ashutosh Bapat
> How about using a system column, e.g. ctid, for the CASE WHEN conversion;
> in Rushabh's example the reference to "r1" would be converted with "CASE
> WHEN r1.ctid IS NOT NULL THEN ROW(r1.empno, r1.ename, r1.job, r1.mgr,
> r1.hiredate, r1.sal, r1.comm, r1.deptno) END".  IMO that would be much
> simpler than Ashutosh's approach.
>
>
A foreign table can have a view, a regular table, another foreign table or
a materialised view as its target. A view does not support any of the
system columns, so none of them are available.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Postgres_fdw join pushdown - wrong results with whole-row reference

2016-06-21 Thread Etsuro Fujita

On 2016/06/21 20:42, Ashutosh Bapat wrote:

On Tue, Jun 21, 2016 at 4:36 PM, Amit Langote wrote:



On 2016/06/21 16:27, Rushabh Lathia wrote:
> Now I was under the impression that IS NOT NULL should always be the
> inverse of IS NULL, but clearly here that's not the case with wholerow.
> But further looking at the document, it says a different thing for
> wholerow:
>
> https://www.postgresql.org/docs/9.5/static/functions-comparison.html
>
> Note: If the expression is row-valued, then IS NULL is true when the row
> expression itself is null or when all the row's fields are null, while
> IS NOT NULL is true when the row expression itself is non-null and all
> the row's fields are non-null. Because of this behavior, IS NULL and IS
> NOT NULL do not always return inverse results for row-valued
> expressions, i.e., a row-valued expression that contains both NULL and
> non-null values will return false for both tests. This definition
> conforms to the SQL standard, and is a change from the inconsistent
> behavior exhibited by PostgreSQL versions prior to 8.2.
>
> And as the above documentation clearly says, IS NULL and IS NOT NULL do
> not always return inverse results for row-valued expressions. So we need
> to change the deparse logic in postgres_fdw - how?  Maybe use IS NULL
> rather than IS NOT NULL?
>
> Input/thought?



Perhaps - NOT expr IS NULL?  Like in the attached.



As the documentation describes, row IS NULL is going to return true when
all the columns in a row are NULL, even though the row itself is not null.
So, with your patch a row (null, null, null) is going to be output as a
NULL row. Is that right?


Yeah, I think so.


In an outer join we have to differentiate between a row being null
(because there was no joining row on the nullable side) and a non-null row
with all column values being null. If we cast the whole-row expression
to text, e.g. r.*::text, and test the resultant value for nullness, it
gives us what we want. A null row cast to text is null, and a row with
all null values cast to text is not null.
postgres=# select (null, null, null)::text is not null;
 ?column?
--
 t
(1 row)

postgres=# select null::record::text is not null;
 ?column?
--
 f
(1 row)

In find_coercion_pathway(), we resort to I/O coercion for the explicit
record::text coercion. This should always work the way we want, as
record_out is a strict function and for non-null values it outputs at
least the parentheses, which will be treated as non-null text.
2253 /*
2254  * If we still haven't found a possibility, consider automatic casting
2255  * using I/O functions.  We allow assignment casts to string types and
2256  * explicit casts from string types to be handled this way.  (The
2257  * CoerceViaIO mechanism is a lot more general than that, but this is
2258  * all we want to allow in the absence of a pg_cast entry.)  It would
2259  * probably be better to insist on explicit casts in both directions,
2260  * but this is a compromise to preserve something of the pre-8.3
2261  * behavior that many types had implicit (yipes!) casts to text.
2262  */



PFA the patch with the cast to text. This is probably uglier than
expected, but I don't know any better test to find nullness of a record,
the way we want here. The patch also includes the expected output
changes in the EXPLAIN output.


How about using a system column eg, ctid, for the CASE WHEN conversion; 
in Rushabh's example the reference to "r1" would be converted with "CASE 
WHEN r1.ctid IS NOT NULL THEN ROW(r1.empno, r1.ename, r1.job, r1.mgr, 
r1.hiredate, r1.sal, r1.comm, r1.deptno) END".  IMO I think that that 
would be much simpler than Ashutosh's approach.


Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Lets (not) break all the things. Was: [pgsql-advocacy] 9.6 -> 10.0

2016-06-21 Thread Robert Haas
On Mon, Jun 20, 2016 at 10:08 PM, Bruce Momjian  wrote:
> On Fri, May 20, 2016 at 07:40:53PM -0400, Robert Haas wrote:
>> On Mon, May 16, 2016 at 3:36 AM, Bruce Momjian  wrote:
>> > On Sun, May 15, 2016 at 03:23:52PM -0500, Jim Nasby wrote:
>> >> 2) There's no ability at all to revert, other than restore a backup. That
>> >> means if you pull the trigger and discover some major performance problem,
>> >> you have no choice but to deal with it (you can't switch back to the old
>> >> version without losing data).
>> >
>> > In --link mode only
>>
>> No, not really.  Once you let write transactions into the new cluster,
>> there's no way to get back to the old server version no matter which
>> option you used.
>
> Yes, there is, and it is documented:
>
> If you ran pg_upgrade without
> --link or did not start the new server, the
> old cluster was not modified except that, if linking
> started, a .old suffix was appended to
> $PGDATA/global/pg_control.  To reuse the old
> cluster, possibly remove the .old suffix from
> $PGDATA/global/pg_control; you can then restart the
> old cluster.
>
> What is confusing you?

I don't think I'm confused.  Sure, you can do that, but the effects of
any writes performed on the new cluster will not be there when you
revert back to the old cluster.  So you will have effectively lost
data, unless you somehow have the ability to re-apply all of those
write transactions somehow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel query and temp_file_limit

2016-06-21 Thread Robert Haas
On Mon, Jun 20, 2016 at 11:01 PM, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> On Wed, May 18, 2016 at 3:40 AM, Robert Haas  wrote:
>>> What I'm tempted to do is trying to document that, as a point of
>>> policy, parallel query in 9.6 uses up to (workers + 1) times the
>>> resources that a single session might use.  That includes not only CPU
>>> but also things like work_mem and temp file space.  This obviously
>>> isn't ideal, but it's what could be done by the ship date.
>
>> Where would that be documented, though? Would it need to be noted in
>> the case of each such GUC?
>
> Why can't we just note this in the number-of-workers GUCs?  It's not like
> there even *is* a GUC for many of our per-process resource consumption
> behaviors.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres_fdw join pushdown - wrong results with whole-row reference

2016-06-21 Thread Ashutosh Bapat
On Tue, Jun 21, 2016 at 4:36 PM, Amit Langote  wrote:

> On 2016/06/21 16:27, Rushabh Lathia wrote:
> > Now I was under the impression that IS NOT NULL should always be the
> > inverse of IS NULL, but clearly here that's not the case with
> > wholerow. But further looking at the document, it says a different
> > thing for wholerow:
> >
> > https://www.postgresql.org/docs/9.5/static/functions-comparison.html
> >
> > Note: If the expression is row-valued, then IS NULL is true when the
> > row expression itself is null or when all the row's fields are null,
> > while IS NOT NULL is true when the row expression itself is non-null
> > and all the row's fields are non-null. Because of this behavior, IS
> > NULL and IS NOT NULL do not always return inverse results for
> > row-valued expressions, i.e., a row-valued expression that contains
> > both NULL and non-null values will return false for both tests. This
> > definition conforms to the SQL standard, and is a change from the
> > inconsistent behavior exhibited by PostgreSQL versions prior to 8.2.
> >
> > And as the above documentation clearly says, IS NULL and IS NOT NULL
> > do not always return inverse results for row-valued expressions. So we
> > need to change the deparse logic in postgres_fdw - how?  Maybe use IS
> > NULL rather than IS NOT NULL?
> >
> > Input/thought?
>
> Perhaps - NOT expr IS NULL?  Like in the attached.
>
>
As the documentation describes, row IS NULL is going to return true when
all the columns in a row are NULL, even though the row itself is not null.
So, with your patch a row (null, null, null) is going to be output as a
NULL row. Is that right?

In an outer join we have to differentiate between a row being null
(because there was no joining row on the nullable side) and a non-null row
with all column values being null. If we cast the whole-row expression to
text, e.g. r.*::text, and test the resultant value for nullness, it gives
us what we want. A null row cast to text is null, and a row with all null
values cast to text is not null.
postgres=# select (null, null, null)::text is not null;
 ?column?
--
 t
(1 row)

postgres=# select null::record::text is not null;
 ?column?
--
 f
(1 row)

In find_coercion_pathway(), we resort to I/O coercion for the explicit
record::text coercion. This should always work the way we want, as
record_out is a strict function and for non-null values it outputs at
least the parentheses, which will be treated as non-null text.
2253 /*
2254  * If we still haven't found a possibility, consider automatic casting
2255  * using I/O functions.  We allow assignment casts to string types and
2256  * explicit casts from string types to be handled this way.  (The
2257  * CoerceViaIO mechanism is a lot more general than that, but this is
2258  * all we want to allow in the absence of a pg_cast entry.)  It would
2259  * probably be better to insist on explicit casts in both directions,
2260  * but this is a compromise to preserve something of the pre-8.3
2261  * behavior that many types had implicit (yipes!) casts to text.
2262  */


PFA the patch with the cast to text. This is probably uglier than expected,
but I don't know any better test to find nullness of a record, the way we
want here. The patch also includes the expected output changes in the
EXPLAIN output.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


pg_null_wr.patch
Description: application/download

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres_fdw join pushdown - wrong results with whole-row reference

2016-06-21 Thread Amit Langote
On 2016/06/21 16:27, Rushabh Lathia wrote:
> Now I was under the impression that IS NOT NULL should always be the
> inverse of IS NULL, but clearly here that's not the case with wholerow.
> But further looking at the document, it says a different thing for
> wholerow:
> 
> https://www.postgresql.org/docs/9.5/static/functions-comparison.html
> 
> Note: If the expression is row-valued, then IS NULL is true when the row
> expression itself is null or when all the row's fields are null, while
> IS NOT NULL is true when the row expression itself is non-null and all
> the row's fields are non-null. Because of this behavior, IS NULL and IS
> NOT NULL do not always return inverse results for row-valued
> expressions, i.e., a row-valued expression that contains both NULL and
> non-null values will return false for both tests. This definition
> conforms to the SQL standard, and is a change from the inconsistent
> behavior exhibited by PostgreSQL versions prior to 8.2.
> 
> And as the above documentation clearly says, IS NULL and IS NOT NULL do
> not always return inverse results for row-valued expressions. So we need
> to change the deparse logic in postgres_fdw - how?  Maybe use IS NULL
> rather than IS NOT NULL?
> 
> Input/thought?

Perhaps - NOT expr IS NULL?  Like in the attached.
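The documented behavior is easy to sanity-check with a small model (plain Python, with SQL NULL modelled as None and a NULL row itself as None; this is not postgres_fdw code). It also shows why NOT (expr IS NULL), unlike expr IS NOT NULL, treats a partly-null joined row as present:

```python
# Row-valued IS NULL / IS NOT NULL per the SQL standard, modelled on
# tuples of fields.  None stands in for SQL NULL; a NULL row itself is
# represented as None.
def row_is_null(row):
    # true if the row is null OR every field is null
    return row is None or all(f is None for f in row)

def row_is_not_null(row):
    # true if the row is non-null AND every field is non-null
    return row is not None and all(f is not None for f in row)

mixed = (1, None, 3)                 # joined row with some NULL columns
print(row_is_null(mixed))            # False
print(row_is_not_null(mixed))        # False: not the inverse!
print(not row_is_null(mixed))        # True: NOT (row IS NULL) differs
```

So a CASE WHEN guarded by "r1.* IS NOT NULL" wrongly drops rows whose fields are partly NULL, while "NOT r1.* IS NULL" keeps them.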

explain verbose select e, e.empno, d.deptno, d.dname from f_emp e left
join f_dept d on e.deptno = d.deptno and e.sal > 3000 order by 2, 3 limit 10;



                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=100.00..136.86 rows=10 width=236)
   Output: e.*, e.empno, d.deptno, d.dname
   ->  Foreign Scan  (cost=100.00..2304.10 rows=598 width=236)
 Output: e.*, e.empno, d.deptno, d.dname
 Relations: (public.f_emp e) LEFT JOIN (public.f_dept d)
 Remote SQL: SELECT CASE WHEN NOT r1.* IS NULL THEN ROW(r1.empno,
r1.ename, r1.job, r1.mgr, r1.hiredate, r1.sal, r1.comm, r1.deptno) END,
r1.empno, r2.deptno, r2.dname FROM (public.emp r1 LEFT JOIN public.dept r2
ON (((r1.sal > 3000::numeric)) AND ((r1.deptno = r2.deptno)))) ORDER BY
r1.empno ASC NULLS LAST, r2.deptno ASC NULLS LAST
(6 rows)

select e, e.empno, d.deptno, d.dname from f_emp e left join f_dept d on
e.deptno = d.deptno and e.sal > 3000 order by 2, 3 limit 10;
  e | empno | deptno |   dname
---+---++
 (7369,SMITH,CLERK,7902,1980-12-17,800.00,,20) |  7369 ||
 (7499,ALLEN,SALESMAN,7698,1981-02-20,1600.00,300.00,30)   |  7499 ||
 (7521,WARD,SALESMAN,7698,1981-02-22,1250.00,500.00,30)|  7521 ||
 (7566,JONES,MANAGER,7839,1981-04-02,2975.00,,20)  |  7566 ||
 (7654,MARTIN,SALESMAN,7698,1981-09-28,1250.00,1400.00,30) |  7654 ||
 (7698,BLAKE,MANAGER,7839,1981-05-01,2850.00,,30)  |  7698 ||
 (7782,CLARK,MANAGER,7839,1981-06-09,2450.00,,10)  |  7782 ||
 (7788,SCOTT,ANALYST,7566,1987-04-19,3000.00,,20)  |  7788 ||
 (7839,KING,PRESIDENT,,1981-11-17,5000.00,,10) |  7839 | 10 | ACCOUNTING
 (7844,TURNER,SALESMAN,7698,1981-09-08,1500.00,0.00,30)|  7844 ||
(10 rows)


Thanks,
Amit
diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index c91f3a5..7446506 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -1644,9 +1644,9 @@ deparseColumnRef(StringInfo buf, int varno, int varattno, PlannerInfo *root,
 		 */
 		if (qualify_col)
 		{
-			appendStringInfoString(buf, "CASE WHEN ");
+			appendStringInfoString(buf, "CASE WHEN NOT ");
 			ADD_REL_QUALIFIER(buf, varno);
-			appendStringInfo(buf, "* IS NOT NULL THEN ");
+			appendStringInfo(buf, "* IS NULL THEN ");
 		}
 
 		appendStringInfoString(buf, "ROW(");
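For anyone following the deparse change above: the reason the test has to be
written as NOT r1.* IS NULL is standard SQL row-value semantics, nothing
postgres_fdw-specific. For a row value, IS NULL is true only when every field
is NULL, and IS NOT NULL is true only when every field is non-NULL, so the
two are not complements; NOT (row IS NULL) is the correct "at least one field
is non-NULL" test. A minimal sketch:

```sql
-- Row-value NULL tests are not complements when the fields are mixed:
SELECT ROW(1, NULL) IS NULL     AS isnull,     -- f: not every field is NULL
       ROW(1, NULL) IS NOT NULL AS isnotnull,  -- f: not every field is non-NULL
       NOT ROW(1, NULL) IS NULL AS notisnull;  -- t: at least one field is non-NULL
```

(IS NULL binds tighter than NOT, so the last expression parses as
NOT (ROW(1, NULL) IS NULL).)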

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 10.0

2016-06-21 Thread Cédric Villemain
On 20/06/2016 22:41, Alvaro Herrera wrote:
> Tom Lane wrote:
>> Robert Haas  writes:
>>> On Mon, Jun 20, 2016 at 4:00 PM, David G. Johnston
>>>  wrote:
 10.x is the desired output.
>>
>>> 10.x is the output that some people desire.  A significant number of
>>> people, including me, would prefer to stick with the current
>>> three-part versioning scheme, possibly with some change to the
>>> algorithm for bumping the first digit (e.g. every 5 years like
>>> clockwork).
>>
>> If we were going to do it like that, I would argue for "every ten years
>> like clockwork", e.g. 10.0.x is next after 9.9.x.  But in point of fact,
>> Robert, you already made your case for that approach and nobody else
>> cared for it.
> 
> I voted for this approach initially too, and I think it has merit --
> notably, that it would stop this discussion.  It was said that moving
> to two-part numbers would stop all discussion, but it seems to have had
> exactly the opposite effect.

If voting is still possible, then I agree: no changes please!
It won't make things any easier to have a 10g or a 10.8 to explain instead
of a 10.0.8, and I'm not sure it makes things easier to give up the chance
to bump either of the two major parts should that prove interesting in the
future, as it was for 7.4->8 and 8.4->9 (the 9 series is «new»: it's the
first time we've gone past .4 before bumping the first digit, but it's also
the first time we've found a way to shorten the release cycle).

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Declarative partitioning

2016-06-21 Thread Amit Langote
On 2016/06/21 16:49, Ashutosh Bapat wrote:
> Hi Amit,
> I tried creating partitioned table by range on an expression like
> CREATE TABLE pt1_e (a int, b int, c varchar) PARTITION BY RANGE(a +
> b);
>   ERROR:  syntax error at or near
> "+"
> 
>   LINE 1: ...E pt1_e (a int, b int, c varchar) PARTITION BY RANGE(a + b);
> 
> But when I try to create partitioned table by range on expression chr(a) it
> works.

The expression needs to be parenthesized (function call expressions don't
need to be, though).  So something like the following would work:

CREATE TABLE pt1_e (a int, b int, c varchar) PARTITION BY RANGE(add(a, b));

Thanks,
Amit
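
For the record, the a + b case from the original report should then work once
wrapped in its own parentheses — a sketch, assuming the partition-key grammar
follows the same convention as expressions in CREATE INDEX:

```sql
-- Arbitrary expressions need their own parentheses; function calls don't.
CREATE TABLE pt1_e (a int, b int, c varchar) PARTITION BY RANGE ((a + b));
```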






Re: [HACKERS] Declarative partitioning

2016-06-21 Thread Ashutosh Bapat
Hi Amit,
I tried creating partitioned table by range on an expression like
CREATE TABLE pt1_e (a int, b int, c varchar) PARTITION BY RANGE(a +
b);
  ERROR:  syntax error at or near
"+"

  LINE 1: ...E pt1_e (a int, b int, c varchar) PARTITION BY RANGE(a + b);

But when I try to create partitioned table by range on expression chr(a) it
works.

On Thu, Jun 9, 2016 at 7:20 AM, Amit Langote 
wrote:

> On 2016/06/08 22:22, Ashutosh Bapat wrote:
> > On Mon, May 23, 2016 at 3:35 PM, Amit Langote wrote
> >>
> >> [...]
> >>
> >> I made a mistake in the last version of the patch which caused a
> relcache
> >> field to be pfree'd unexpectedly.  Attached updated patches.
> >
> > 0003-... patch does not apply cleanly. It has some conflicts in
> pg_dump.c.
> > I have tried fixing the conflict in attached patch.
>
> Thanks.  See attached rebased patches.
>
> Regards,
> Amit
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


[HACKERS] Postgres_fdw join pushdown - wrong results with whole-row reference

2016-06-21 Thread Rushabh Lathia
The query below returns a wrong result when the join gets pushed down to the
remote server.

(PFA fdw_setup.sql, to create objects for the test)

postgres=# select e, e.empno, d.deptno, d.dname from f_emp e left join
f_dept d on e.deptno = d.deptno and e.sal > 3000 order by 2, 3 limit 10;
                             e                             | empno | deptno |   dname
-----------------------------------------------------------+-------+--------+------------
                                                           |  7369 |        |
 (7499,ALLEN,SALESMAN,7698,1981-02-20,1600.00,300.00,30)   |  7499 |        |
 (7521,WARD,SALESMAN,7698,1981-02-22,1250.00,500.00,30)    |  7521 |        |
                                                           |  7566 |        |
 (7654,MARTIN,SALESMAN,7698,1981-09-28,1250.00,1400.00,30) |  7654 |        |
                                                           |  7698 |        |
                                                           |  7782 |        |
                                                           |  7788 |        |
                                                           |  7839 |     10 | ACCOUNTING
 (7844,TURNER,SALESMAN,7698,1981-09-08,1500.00,0.00,30)    |  7844 |        |
(10 rows)


Here, the whole-row reference comes out NULL even though empno is non-NULL.
If we remove the LIMIT clause from the query, the join is not pushed to the
remote side, and in that case we get the correct output.

postgres=# select e, e.empno, d.deptno, d.dname from f_emp e left join
f_dept d on e.deptno = d.deptno and e.sal > 3000 order by 2, 3;
                             e                             | empno | deptno |   dname
-----------------------------------------------------------+-------+--------+------------
 (7369,SMITH,CLERK,7902,1980-12-17,800.00,,20)             |  7369 |        |
 (7499,ALLEN,SALESMAN,7698,1981-02-20,1600.00,300.00,30)   |  7499 |        |
 (7521,WARD,SALESMAN,7698,1981-02-22,1250.00,500.00,30)    |  7521 |        |
 (7566,JONES,MANAGER,7839,1981-04-02,2975.00,,20)          |  7566 |        |
 (7654,MARTIN,SALESMAN,7698,1981-09-28,1250.00,1400.00,30) |  7654 |        |
 (7698,BLAKE,MANAGER,7839,1981-05-01,2850.00,,30)          |  7698 |        |
 (7782,CLARK,MANAGER,7839,1981-06-09,2450.00,,10)          |  7782 |        |
 (7788,SCOTT,ANALYST,7566,1987-04-19,3000.00,,20)          |  7788 |        |
 (7839,KING,PRESIDENT,,1981-11-17,5000.00,,10)             |  7839 |     10 | ACCOUNTING
 (7844,TURNER,SALESMAN,7698,1981-09-08,1500.00,0.00,30)    |  7844 |        |
 (7876,ADAMS,CLERK,7788,1987-05-23,1100.00,,20)            |  7876 |        |
 (7900,JAMES,CLERK,7698,1981-12-03,950.00,,30)             |  7900 |        |
 (7902,FORD,ANALYST,7566,1981-12-03,3000.00,,20)           |  7902 |        |
 (7934,MILLER,CLERK,7782,1982-01-23,1300.00,,10)           |  7934 |        |
(14 rows)

Explain verbose output for the query with LIMIT clause is:

postgres=# explain verbose select e, e.empno, d.deptno, d.dname from f_emp
e left join f_dept d on e.deptno = d.deptno and e.sal > 3000 order by 2, 3
limit 10;



                                   QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=100.00..136.86 rows=10 width=236)
   Output: e.*, e.empno, d.deptno, d.dname
   ->  Foreign Scan  (cost=100.00..2304.10 rows=598 width=236)
 Output: e.*, e.empno, d.deptno, d.dname
 Relations: (public.f_emp e) LEFT JOIN (public.f_dept d)
         Remote SQL: SELECT CASE WHEN r1.* IS NOT NULL THEN ROW(r1.empno,
r1.ename, r1.job, r1.mgr, r1.hiredate, r1.sal, r1.comm, r1.deptno) END,
r1.empno, r2.deptno, r2.dname FROM (public.emp r1 LEFT JOIN public.dept r2
ON (((r1.sal > 3000::numeric)) AND ((r1.deptno = r2.deptno)))) ORDER BY
r1.empno ASC NULLS LAST, r2.deptno ASC NULLS LAST
(6 rows)

Looking further, I found that the problem is that we convert the whole-row
reference into a ROW(...) expression and guard it with a CASE clause.

So, in the above example the reference to "r" is converted into
"CASE WHEN r1.* IS NOT NULL THEN ROW(r1.empno, r1.ename, r1.job, r1.mgr,
r1.hiredate, r1.sal, r1.comm, r1.deptno) END"

Here r1.* IS NOT NULL behaves unexpectedly: it returns TRUE only when all
the columns in the whole row are NOT NULL.

Example with a normal table (postgres_fdw not involved):

postgres=# select r, r.* is null as isnull, r.* is not null as isnotnull
from emp r;
                             r                             | isnull | isnotnull
-----------------------------------------------------------+--------+-----------
 (7369,SMITH,CLERK,7902,1980-12-17,800.00,,20)             | f      | f
 (7499,ALLEN,SALESMAN,7698,1981-02-20,1600.00,300.00,30)   | f      | t
 (7521,WARD,SALESMAN,7698,1981-02-22,1250.00,500.00,30)    | f      | t
 (7566,JONES,MANAGER,7839,1981-04-02,2975.00,,20)          | f      | f