Re: [HACKERS] Maintenance on git.postgresql.org

2015-12-22 Thread Magnus Hagander
On Tue, Dec 22, 2015 at 2:56 PM, Magnus Hagander wrote:

> I'm going to run some upgrade tasks on git.postgresql.org over the next
> hour or two. Thus it's likely to drop off the net a few times and come
> back, and things might work on and off.
>
> Anything pushed should be safe there normally, but it might be
> inaccessible for some periods of time, giving strange error messages :)
>
> Also, expect the web management interface to be offline a bit longer than
> the actual git server.
>
> This does *not* affect gitmaster, just git.postgresql.org.
>
>
This maintenance should now be complete. If there is something that does
not work then please let me know, as it's actually broken then :)


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


[HACKERS] Remove Windows crash dump support?

2015-12-22 Thread Craig Ringer
Hi all

Back in 2010 I submitted a small feature to allow the creation of minidumps
when backends crashed; see commit dcb09b595f88a3bca6097a6acc17bf2ec935d55f .

At the time Windows lacked useful support for postmortem debugging and
crash-dump management in the operating system itself, especially for
applications running as services. That has since improved considerably.

The feature was also included in 9.4, which I think was the PostgreSQL
release that was really rock solid on Windows. Consequently it's never
served the purpose I wrote it for, a way to make it easy to get crash data
from end users into the hands of a developer without needing to
remote-debug the users' system.

It's practically dead code and we might as well remove it unless there's
someone out there using it who's keeping quiet about it.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] A typo in syncrep.c

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 3:18 AM, Kyotaro HORIGUCHI wrote:
> Thank you Robert and sorry for bothering you with a silly question!
>
> I understand what I did clearly thanks to your attentive indication.
>
> At Mon, 21 Dec 2015 07:50:40 -0500, Robert Haas wrote:
>> >> >> + * lead the client to believe that the transaction is aborted, which
>> >> No, that's correct the way it is.  What you're proposing wouldn't
>> >> exactly be wrong, but it's a little less clear and direct.
>> >
>> > Hmm. I thought they are equal in meaning and make clearer, but I
>> > understand they have such difference. Thank you for correcting
>> > it.
>>
>> The difference is that if you say "the transaction aborted" you mean
>> that the transaction did something - specifically, it aborted.  If you
>> say that "the transaction is aborted" you are talking about the state
>> in which the transaction ended up, without really saying how it got
>> that way.
>
> What I made here was a mistake of the word class of the
> "transaction" by somehow omitting the "that" in the original
> sentence. It is not the objective case as in the case where the
> "that" is omitted, but the subjective case there. Then the
> "aborted" is not the objective complement but the past tense. The
> "that" had been returned in my mind before I knew it but, after
> all, adding "is" there utterly changes the meaning as you pointed
> out.

Actually, you might be surprised to hear that you can omit the word
"that" here without changing the meaning.  I tend to avoid that in
formal writing for clarity but the word isn't technically necessary.

>>  In this case we are talking about whether the client might
>> think that the transaction did something (aborted), not what the
>> client might think about the state we ended up in (aborted), so the
>> current wording seems better to me.
>
> I understand that you're completely right. Sorry for my silly
> mistake.

It's not a silly mistake.  And I do appreciate you taking the time to proofread.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel joins, and better parallel explain

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 4:14 AM, Dilip Kumar  wrote:
> On Fri, Dec 18, 2015 at 8:47 PM, Robert wrote:
>>> Yes, you are right, that create_gather_path() sets parallel_safe to false
>>> unconditionally but whenever we are building a non partial path, that
>>> time
>>> we should carry forward the parallel_safe state to its parent, and it
>>> seems
>>> like that part is missing here..
>
>>Ah, right.  Woops.  I can't exactly replicate your results, but I've
>>attempted to fix this in a systematic way in the new version attached
>>here (parallel-join-v3.patch).
>
> I have tested with the latest patch; the problem is solved.
>
> During my testing I observed one more behaviour in the hash join, where
> parallel hash join is taking more time compared to normal hash join,

I think the gather-reader-order patch will fix this.  Here's a test
with all three patches.

rhaas=# SET max_parallel_degree=0;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.192 ms
  count
-
 250
(1 row)

Time: 11331.425 ms
rhaas=# SET max_parallel_degree=1;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.185 ms
  count
-
 250
(1 row)

Time: 8796.190 ms
rhaas=# SET max_parallel_degree=2;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.192 ms
  count
-
 250
(1 row)

Time: 8153.258 ms
rhaas=# SET max_parallel_degree=3;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.187 ms
  count
-
 250
(1 row)

Time: 6198.163 ms
rhaas=# SET max_parallel_degree=4;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.190 ms
  count
-
 250
(1 row)

Time: 7511.970 ms
rhaas=# SET max_parallel_degree=5;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.152 ms
  count
-
 250
(1 row)

Time: 7651.862 ms
rhaas=# SET max_parallel_degree=6;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.195 ms
  count
-
 250
(1 row)

Time: 7584.073 ms

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Some questions about the array.

2015-12-22 Thread Tom Lane
Uriy Zhuravlev  writes:
> On Monday, December 21, 2015 20:28:43 MSK, Tom Lane wrote:
>> With is_slice false, the only valid
>> case is lidx==NULL, uidx!=NULL, as before for non-slice notation.

> But now it becomes valid syntax:
> select ('{1,2,3,4}'::int[])[NULL:NULL];
> I do not think it's logical, especially if a function is used in [:].
> Unexpected behavior. 

I think you are confused about the difference between a NULL constant
(which would give rise to an A_Const syntax tree node) and a NULL
syntax tree pointer (which cannot arise from any actual syntactical
construct, and would only be present if the grammar put it there due
to lack of any corresponding item in the input).

regards, tom lane




Re: [HACKERS] fix for readline terminal size problems when window is resized with open pager

2015-12-22 Thread Merlin Moncure
On Wed, Dec 16, 2015 at 2:12 PM, Tom Lane  wrote:
> Merlin Moncure  writes:
>> On Wed, Dec 16, 2015 at 1:29 PM, Tom Lane  wrote:
>>> So I now think that print.c shouldn't be involved at all, and the right
>>> thing to do is just have gets_interactive() invoke the resize function
>>> immediately before calling readline().  That will still leave a window
>>> for SIGWINCH to be missed, but it's a pretty tiny one.
>
>> right -- agreed.
>
> I tried that and found out that the first call dumps core because readline
> hasn't been initialized yet.  To fix that, we need an explicit call to
> rl_initialize() instead of depending on readline() to do it for us.
> So I arrive at the attached modified patch.
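For reference, a rough sketch of that approach (this is not the committed psql
patch; it only assumes GNU Readline's rl_initialize(), rl_resize_terminal()
and readline()):

    #include <stdio.h>
    #include <stdbool.h>
    #include <readline/readline.h>

    char *
    gets_interactive_sketch(const char *prompt)
    {
        static bool initialized = false;

        if (!initialized)
        {
            /* must be explicit, or the resize call below can crash
             * before readline() has had a chance to initialize */
            rl_initialize();
            initialized = true;
        }

        /* re-read the terminal size just before prompting, so a SIGWINCH
         * delivered while the pager was open is picked up; a tiny race
         * window remains */
        rl_resize_terminal();

        return readline(prompt);
    }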

This is working great.  Is there anything left for me to do here?

merlin




Re: [HACKERS] Remove Windows crash dump support?

2015-12-22 Thread Magnus Hagander
On Tue, Dec 22, 2015 at 3:53 PM, Craig Ringer  wrote:

> On 22 December 2015 at 22:50, Craig Ringer  wrote:
>
>> Hi all
>>
>> Back in 2010 I submitted a small feature to allow the creation of
>> minidumps when backends crashed; see
>> commit dcb09b595f88a3bca6097a6acc17bf2ec935d55f .
>>
>> At the time Windows lacked useful support for postmortem debugging and
>> crash-dump management in the operating system itself, especially for
>> applications running as services. That has since improved considerably.
>>
>> The feature was also included in 9.4
>>
>
> Ahem. 9.1. This is what I get for multi-tasking between writing this and
> packaging an extension for 9.4.
>
>
In which version(s) of Windows was this improvement added? I think that's
really the part that matters here, not necessarily which version of
PostgreSQL.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] fix for readline terminal size problems when window is resized with open pager

2015-12-22 Thread Tom Lane
Merlin Moncure  writes:
> This is working great.  Is there anything left for me to do here?

Nope, it's committed.

regards, tom lane




Re: [HACKERS] Patch to improve a few appendStringInfo* calls

2015-12-22 Thread Robert Haas
On Mon, Dec 21, 2015 at 6:28 PM, David Rowley wrote:
>  According to grep -rE "appendStringInfoString\(.*\.data\);" . we have 13
> such matches. None of them seem to be in very performance critical places,
> perhaps with the exception of xmlconcat().
>
> Would you say that we should build a macro to wrap up a call to
> appendBinaryStringInfo() or invent another function which looks similar?

The macro probably would have a multiple evaluation hazard, so I'd be
inclined to invent a function.  I mean, yeah, you could use a do {  }
while (0) block to fix the multiple evaluation, but complex macros are
harder to read than functions, and not necessarily any faster.
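For illustration, a minimal sketch of what such a function could look like
(the name is made up here; it just wraps the existing appendBinaryStringInfo
so the source buffer's length is reused instead of recomputed by strlen()):

    #include "postgres.h"
    #include "lib/stringinfo.h"

    /* Hypothetical helper: append the contents of one StringInfo to another. */
    static void
    appendStringInfoStringInfo(StringInfo dst, StringInfo src)
    {
        appendBinaryStringInfo(dst, src->data, src->len);
    }

The naive macro version,

    #define appendStringInfoStringInfo(dst, src) \
        appendBinaryStringInfo((dst), (src)->data, (src)->len)

evaluates (src) twice, which is the multiple-evaluation hazard if the argument
has side effects.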

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Some questions about the array.

2015-12-22 Thread Yury Zhuravlev



I think you are confused about the difference between a NULL constant
(which would give rise to an A_Const syntax tree node) and a NULL
syntax tree pointer (which cannot arise from any actual syntactical
construct, and would only be present if the grammar put it there due
to lack of any corresponding item in the input).

regards, tom lane


You're right. I will try to provide a patch as soon as possible.
Thanks.

--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-22 Thread Aleksander Alekseev
> > Actually, I'd like to improve all partitioned hashes instead of
> > improve only one case.  
> 
> Yeah.  I'm not sure that should be an LWLock rather than a spinlock,
> but we can benchmark it both ways.  

I would like to share some preliminary results. I tested four
implementations:

- no locks and no element stealing from other partitions;
- single LWLock per partitioned table;
- single spinlock per partitioned table;
- NUM_LOCK_PARTITIONS spinlocks per partitioned table;

Interestingly "Shared Buffer Lookup Table" (see buf_table.c) has 128
partitions. The constant NUM_BUFFER_PARTITIONS was increased from 16 to
128 in commit 3acc10c9: 

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3acc10c997f916f6a741d0b4876126b7b08e3892;hp=952872698d9443fdf9b808a1376017f00c91065a

Obviously after splitting a freelist into NUM_LOCK_PARTITIONS
partitions (and assuming that all necessary locking/unlocking is done
on calling side) tables can't have more than NUM_LOCK_PARTITIONS
partitions because it would cause race conditions. For this reason I
had to define NUM_BUFFER_PARTITIONS as NUM_LOCK_PARTITIONS and compare
behaviour of PostgreSQL depending on different values of
NUM_LOCK_PARTITIONS.

So here are results:

Core i7, pgbench -j 8 -c 8 -T 30 pgbench
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock 
PARTITIONS | (99ccb2) |  |  |  |  array   
---|--|--|--|--|--
   |  295.4   |  297.4   |  299.4   |  285.6   |  302.7   
(1 << 4)   |  286.1   |  300.5   |  283.4   |  300.9   |  300.4   
   |  300.0   |  300.0   |  302.1   |  300.7   |  300.3   
---|--|--|--|--|--
   |  |  296.7   |  299.9   |  298.8   |  298.3
(1 << 5)   |      |  301.9   |  302.2   |  305.7   |  306.3
   |  |  287.7   |  301.0   |  303.0   |  304.5
---|--|--|--|--|--
   |  |  296.4   |  300.5   |  302.9   |  304.6   
(1 << 6)   |      |  301.7   |  305.6   |  306.4   |  302.3   
   |  |  299.6   |  304.5   |  306.6   |  300.4   
---|--|--|--|--|--
   |  |  295.9   |  298.7   |  295.3   |  305.0   
(1 << 7)   |      |  299.5   |  300.5   |  299.0   |  310.2   
   |  |  287.8   |  285.9   |  300.2   |  302.2   

Core i7, pgbench -j 8 -c 8 -f big_table.sql -T 30 my_database
(3 test, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock 
PARTITIONS | (99ccb2) |  |  |  |  array   
---|--|--|--|--|--
   |  505.1   |  521.3   |  511.1   |  524.4   |  501.6   
(1 << 4)   |  452.4   |  467.4   |  509.2   |  472.3   |  453.7   
   |  435.2   |  462.4   |  445.8   |  467.9   |  467.0   
---|--|--|--|--|--
   |  |  514.8   |  476.3   |  507.9   |  510.6   
(1 << 5)   |      |  457.5   |  491.2   |  464.6   |  431.7   
   |  |  442.2   |  457.0   |  495.5   |  448.2
---|--|--|--|--|--
   |  |  516.4   |  502.5   |  468.0   |  521.3   
(1 << 6)   |      |  463.6   |  438.7   |  488.8   |  455.4   
   |  |  434.2   |  468.1   |  484.7   |  433.5   
---|--|--|--|--|--
   |  |  513.6   |  459.4   |  519.6   |  510.3   
(1 << 7)   |      |  470.1   |  454.6   |  445.5   |  415.9   
   |  |  459.4   |  489.7   |  457.1   |  452.8   

60-core server, pgbench -j 64 -c 64 -T 30 pgbench
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock 
PARTITIONS | (99ccb2) |  |  |  |  array   
---|--|--|--|--|--
   |  3156.2  |  3157.9  |  3542.0  |  3444.3  |  3472.4  
(1 << 4)   |  3268.5  |  3444.7  |  3485.7  |  3486.0  |  3500.5  
   |  3251.2  |  3482.3  |  3398.7  |  3587.1  |  3557.7  
---|--|--|--|--|--
   |  |  3352.7  |  3556.0  |  3543.3  |  3526.8 
(1 << 5)   |      |  3465.0  |  3475.2  |  3486.9  |  3528.4  
   |  |  3410.0  |  3482.0  |  3493.7  |  3444.9  
---|--|--|--|--|--
   |  |  3437.8  |  3413.1  |  3445.8  |  3481.6  
(1 << 6)   |      |  3470.1  |  3478.4  |  3538.5  |  3579.9  
   |  |  3450.8  |  3431.1  |  3509.0  |  3512.5  
---|--|--|--|--|--
   |  

Re: [HACKERS] Remove Windows crash dump support?

2015-12-22 Thread Tom Lane
Magnus Hagander  writes:
> On Tue, Dec 22, 2015 at 3:53 PM, Craig Ringer  wrote:
>>> Back in 2010 I submitted a small feature to allow the creation of
>>> minidumps when backends crashed; see
>>> commit dcb09b595f88a3bca6097a6acc17bf2ec935d55f .
>>> At the time Windows lacked useful support for postmortem debugging and
>>> crash-dump management in the operating system itself, especially for
>>> applications running as services. That has since improved considerably.

> In which version(s) of Windows was this improvement added? I think that's
> really the part that matters here, not necessarily which version of
> PostgreSQL.

Even if it's all good in recent Windows, the impression I have is that an
awful lot of people are still running older versions, so I'd be pretty
hesitant to just drop the code.  Also, AFAICS it's pretty self-contained
and hence not much of a drag on development.  Is there any positive reason
to remove it?  If the native facilities on newer Windows are better,
maybe we should teach the crash handler to install itself only on older
versions?

regards, tom lane




Re: [HACKERS] Possible marginally-incompatible change to array subscripting

2015-12-22 Thread Pavel Stehule
2015-12-22 18:34 GMT+01:00 Robert Haas :

> On Tue, Dec 22, 2015 at 11:51 AM, Tom Lane  wrote:
> > I'm reviewing Yury Zhuravlev's patch to allow array slice boundaries to
> be
> > omitted, for example "a[4:]" means "the slice extending from element 4 to
> > the last element of a".  It strikes me that there's an improvement we
> > could easily make for the case where a mixture of slice and non-slice
> > syntax appears, that is something like "a[3:4][5]".  Now, this has always
> > meant a slice, and the way we've traditionally managed that is to treat
> > simple subscripts as being the range upper bound with a lower bound of 1;
> > that is, what this example means is exactly "a[3:4][1:5]".
> >
> > ISTM that if we'd had Yury's code in there from the beginning, what we
> > would define this as meaning is "a[3:4][:5]", ie the implied range runs
> > from whatever the array lower bound is up to the specified subscript.
> >
> > This would make no difference of course for the common case where the
> > array lower bound is 1, but it seems a lot less arbitrary when it isn't.
> > So I think we should strongly consider changing it to mean that, even
> > though it would be non-backwards-compatible in such cases.
> >
> > Comments?
>
> Gosh, our arrays are strange.  I would have expected a[3:4][5] to mean
> a[3:4][5:5].
>

exactly,

Pavel


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

2015-12-22 Thread Peter Geoghegan
On Tue, Dec 22, 2015 at 9:31 AM, Robert Haas  wrote:
>> It has some nice properties, because unsigned integers are so simple
>> and flexible. For example, we can do it with int4 sometimes, and then
>> compare two attributes at once on 64-bit platforms. Maybe the second
>> attribute (the second set of 4 bytes) isn't an int4, though -- it's
>> any other type's abbreviated key (which can be assumed to have a
>> compatible representation). That's one more advanced possibility.
>
> Yikes.

I'm not suggesting we'd want to do that immediately, or even at all.
My point is -- stuff like this becomes possible. My intuition is that
it might come up in a bunch of places.

>> Another issue is that abbreviated keys in indexes are probably not
>> going to take 8 bytes, because they'll go in the ItemId array. It's
>> likely to be very useful to be able to take the first two bytes, and
>> know that a uint16 comparison is all that is needed, and have a basis
>> to perform an interpolation search rather than a binary search
>> (although that needs to be adaptive to different distributions, I
>> think it could be an effective technique -- there are many cache
>> misses for binary searches on internal B-Tree pages).
>
> Hmm, that seems iffy to me.  There are plenty of data sets where 8
> bytes is enough to get some meaningful information about what part of
> the keyspace you're in, but 2 bytes isn't.

Your experience with abbreviated keys for sorting is unlikely to
perfectly carry over here. That's because the 2 bytes are only put in
the internal pages (typically less than 1% of the total). These
internal pages typically have relatively low density anyway (we don't
use the B* tree technique), so I think that the level of bloat
would be tiny. At the same time, these are the range of values that
naturally have very wide fan-out. For example, the entire range of
values in the index is represented at the root page.

All of that having been said, yes, it could still be useless. But
then, the overhead will still tend to be nil, due to the fact that
abbreviated key generation can be so effectively amortized (it happens
only once, per page split), and because branch prediction will tend to
remove any penalty.
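For reference, a generic sketch of the interpolation-search idea mentioned
above, over a sorted array of 2-byte abbreviated keys (illustrative only, not
anything from the nbtree code):

    #include <stdint.h>

    /* Return the index of target in keys[0..n-1], or -1 if absent.  The probe
     * position is interpolated from the key range rather than bisected. */
    static int
    interpolation_search(const uint16_t *keys, int n, uint16_t target)
    {
        int         lo = 0;
        int         hi = n - 1;

        while (lo <= hi && target >= keys[lo] && target <= keys[hi])
        {
            int         mid;

            if (keys[hi] == keys[lo])
                break;
            /* guess a position proportional to where target falls in range */
            mid = lo + (int) (((double) (target - keys[lo]) /
                               (keys[hi] - keys[lo])) * (hi - lo));
            if (keys[mid] == target)
                return mid;
            if (keys[mid] < target)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return (lo <= hi && keys[lo] == target) ? lo : -1;
    }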

-- 
Peter Geoghegan




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 10:39 AM, Aleksander Alekseev wrote:
> Obviously after splitting a freelist into NUM_LOCK_PARTITIONS
> partitions (and assuming that all necessary locking/unlocking is done
> on calling side) tables can't have more than NUM_LOCK_PARTITIONS
> partitions because it would cause race conditions. For this reason I
> had to define NUM_BUFFER_PARTITIONS as NUM_LOCK_PARTITIONS and compare
> behaviour of PostgreSQL depending on different values of
> NUM_LOCK_PARTITIONS.

This is not at all obvious.  There is no obvious reason why the number
of ways that the freelist is partitioned needs to have any relation at
all to the number of ways that the hash table itself is partitioned.
Probably, each process should pick a freelist affinity based on
MyBackendId % number_of_partitions or something like that.  If it
doesn't find one there, it can search the others, but always starting
with the one for which it has an affinity.  That way, the number of
processes contending on a single spinlock will be limited to
max_connections/number_of_partitions except when the table is very
nearly full.  That may happen a lot, though, for the buffer mapping
table, so we might need to over-allocate to compensate.  We want it to
be the case that a backend will almost always find a free entry on its
own freelist when it needs one, and only occasionally need to visit
any other freelist.
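A schematic sketch of that idea (the names and structure are illustrative
only, not the actual dynahash code):

    #include "postgres.h"
    #include "storage/spin.h"

    #define NUM_FREELISTS 32        /* assumed number of freelist partitions */

    typedef struct FreeListPartition
    {
        slock_t     mutex;          /* spinlock protecting this freelist */
        void       *freeList;       /* head of the free-element chain */
    } FreeListPartition;

    /* Try the backend's "home" freelist first, then scan the others. */
    static void *
    get_free_element(FreeListPartition *parts, int my_backend_id)
    {
        int         home = my_backend_id % NUM_FREELISTS;
        int         i;

        for (i = 0; i < NUM_FREELISTS; i++)
        {
            FreeListPartition *p = &parts[(home + i) % NUM_FREELISTS];
            void       *elem;

            SpinLockAcquire(&p->mutex);
            elem = p->freeList;
            if (elem != NULL)
                p->freeList = *(void **) elem;  /* pop; link assumed to be
                                                 * stored at element start */
            SpinLockRelease(&p->mutex);

            if (elem != NULL)
                return elem;
        }
        return NULL;                /* every freelist is empty */
    }

With this shape, a backend normally contends with only about
max_connections/NUM_FREELISTS other backends on its home spinlock, and the
cross-partition scan only runs when its own freelist is empty.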

> Now regarding 60-core server:
>
> - One spinlock per hash table doesn't scale. I personally was expecting
>   this;
> - LWLock's and array of spinlocks do scale on NUMA up to a certain
>   point;
> - Best results are shown by "no locks";
>
> I believe that "no locks" implementation is the best one since it is at
> least 3 times faster on NUMA than any other implementation. Also it is
> simpler and doesn't have stealing-from-other-freelists logic that
> executes rarely and therefore is a likely source of bugs. Regarding ~16
> elements of freelists which in some corner cases could but wouldn't be
> used --- as I mentioned before, I believe it's not such a big problem.
> Also it's a small price to pay for 3 times more TPS.

I doubt it.  Having the system become fundamentally less reliable is a
big cost.  If the system tries to find a lock table entry and fails,
the user's query is going to error out.  They are not at that point
going to say, well, it's good that my system runs fast even though I
can't run pg_dump.  They are going to say, running pg_dump used to
work and now it fails.  The consequences of failing to make an entry
in the buffer mapping table are at least as bad.

And there's a second point, too, which is that you haven't proven that
accepting the costs of your proposed model is even necessary.  You've
tried a few experiments with other approaches, not even all of the
ones that were proposed in the earlier thread, and you don't seem to
have done any investigation of why they didn't perform as well, or if
you did, it's not discussed in this email.  So maybe those problems
can be fixed, after all.

> Regarding NUM_LOCK_PARTITIONS (and NUM_BUFFER_PARTITIONS) I have some
> doubts. For sure Robert had a good reason for committing 3acc10c9.
> Unfortunately I'm not familiar with a story behind this commit. What do
> you think?

See the thread "Scaling shared buffer eviction".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Possible marginally-incompatible change to array subscripting

2015-12-22 Thread Joshua D. Drake

On 12/22/2015 10:01 AM, Robert Haas wrote:

On Tue, Dec 22, 2015 at 12:55 PM, Tom Lane  wrote:

Robert Haas  writes:

On Tue, Dec 22, 2015 at 11:51 AM, Tom Lane  wrote:

ISTM that if we'd had Yury's code in there from the beginning, what we
would define this as meaning is "a[3:4][:5]", ie the implied range runs
from whatever the array lower bound is up to the specified subscript.



Gosh, our arrays are strange.  I would have expected a[3:4][5] to mean
a[3:4][5:5].


Yeah, probably, now that you mention it ... but that seems like too much
of a compatibility break.  Or does anyone want to argue for just doing
that and never mind the compatibility issues?  This is a pretty weird
corner case already; there can't be very many people relying on it.


To be honest, I'd be inclined not to change the semantics at all.  But
that's just me.



I think a sane approach is better than a safe approach.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: [HACKERS] A Typo in regress/sql/privileges.sql

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 1:39 PM, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Tue, Dec 22, 2015 at 4:36 AM, Peter Geoghegan  wrote:
>> > On Fri, Dec 18, 2015 at 12:52 PM, Robert Haas wrote:
>> >> Mind you, I don't think "inference specification" is very good
>> >> terminology, but what's there right now is just wrong.
>> >
>> > It doesn't appear in the documentation. The term "inference
>> > specification" only appears where it's necessary to precisely describe
>> > the input to unique index inference.
>>
>> Well, we can change this to say "inference specification", but I still
>> think calling it the "ON CONFLICT" clause would be clearer in this
>> context.
>
> TBH I'm kinda inclined to sort this out by removing all usage of the
> word "inference" everywhere --- error messages and code comments and
> documentation wording, and replace it with some other wording as
> appropriate for each context.

I would not object to that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Using quicksort for every external sort run

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 4:37 PM, Peter Geoghegan  wrote:
>> This looks like voodoo to me.  I assume you tested it and maybe it
>> gives correct answers, but it's got to be some kind of world record
>> for number of arbitrary constants per SLOC, and there's no real
>> justification for any of it.  The comments say, essentially, well, we
>> do this because it works.  But suppose I try it on some new piece of
>> hardware and it doesn't work well.  What do I do?  Email the author
>> and ask him to tweak the arbitrary constants?
>
> That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
> DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
> something, though.
>
> MaxAllocHugeSize is used fairly arbitrarily in pg_stat_statements.c.
> And that part (the MaxAllocSize part of my patch) only defines a point
> after which we require a really favorable case for replacement
> selection/quicksort with spillover to proceed. It's a safety valve. We
> try to err on the side of not using replacement selection.

Sure, there are arbitrary numbers all over the code, driven by
empirical observations about what factors are important to model.  But
this is not that.  You don't have a thing called seq_page_cost and a
thing called cpu_tuple_cost and then say, well, empirically the ratio
is about 100:1, so let's make the former 1 and the latter 0.01.  You
just have some numbers, and it's not clear what, if anything, they
actually represent.  In the space of 7 lines of code, you introduce 9
nameless constants:

The crossover point is clamped to a minimum of 40% [constant #1] and a
maximum of 85% [constant #2] when the size of the SortTuple array is
no more than MaxAllocSize.  Between those bounds, the crossover point
is 90% [constant #3] minus 7.5% [constant #4] per 32-byte increment
[constant #5] of estimated average tuple size.  On the other hand,
when the size of the SortTuple array exceeds MaxAllocSize, the
crossover point is either 90% [constant #6] or 95% [constant #7]
depending on whether the average tuple size is greater than 32 bytes
[constant #8].  But if the row count hit is less than 50% [constant
#9] of the rows we've already seen, then we ignore it and do not use
selection.
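Transcribed literally into code, that description amounts to something like
the following (purely to make the constants visible in one place; this is not
the patch's actual code):

    #include <stdbool.h>

    static double
    crossover_point(double avg_tuple_size, bool sorttuples_fit_maxallocsize)
    {
        double      crossover;

        if (sorttuples_fit_maxallocsize)
        {
            /* start at 90% and refund 7.5% per 32-byte increment of avg size */
            crossover = 0.90 - 0.075 * (avg_tuple_size / 32.0);
            /* clamp to the 40%..85% range */
            if (crossover < 0.40)
                crossover = 0.40;
            if (crossover > 0.85)
                crossover = 0.85;
        }
        else
        {
            /* SortTuple array past MaxAllocSize: 90% or 95%, depending on
             * whether the average tuple size exceeds 32 bytes */
            crossover = (avg_tuple_size > 32.0) ? 0.95 : 0.90;
        }
        /* the 50%-of-rows-seen rule is applied separately, before trusting
         * the row-count estimate at all */
        return crossover;
    }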

You make no attempt to justify why any of these numbers are correct,
or what underlying physical reality they represent.  The comment which
describes the manner in which crossover point is computed for
SortTuple volumes under 1GB says "Starting from a threshold of 90%,
refund 7.5% per 32 byte average-size-increment."  That is a precise
restatement of what the code does, but it doesn't attempt to explain
why it's a good idea.  Perhaps the reader should infer that the
crossover point drops as the tuples get bigger, except that in the
over-1GB case, a larger tuple size causes the crossover point to go
*up* while in the under-1GB case, a larger tuple size causes the
crossover point to go *down*.  Concretely, if we're sorting 44,739,242
224-byte tuples, the estimated crossover point is 40%.  If we're
sorting 44,739,243 224-byte tuples, the estimated crossover point is
95%.  That's an extremely sharp discontinuity, and it seems very
unlikely that any real system behaves that way.

I'm prepared to concede that constant #9 - ignoring the input row
estimate if we've already seen twice that many rows - probably doesn't
need a whole lot of justification here, and what justification it does
need is provided by the fact that (we think) replacement selection
only wins when there are going to be less than 2 quicksorted runs.
But the other 8 constants here have to have reasons why they exist,
what they represent, and why they have the values they do, and that
explanation needs to be something that can be understood by people
besides you.  The overall cost model needs some explanation of the
theory of operation, too.

In my opinion, reasoning in terms of a crossover point is a strange
way of approaching the problem.  What would be more typical at least
in our code, and I suspect in general, is do a cost estimate of using
selection and a cost estimate of not using selection and compare them.
Replacement selection has a CPU cost and an I/O cost, each of which is
estimable based on the tuple count, chosen comparator, and expected
I/O volume.  Quicksort has those same costs, in different amounts.  If
those respective costs are accurately estimated, then you can pick the
strategy with the lower cost and expect to win.
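In rough outline, that comparison would have a shape like this (a sketch only;
every formula and parameter here is a placeholder rather than a worked-out
cost model):

    #include <math.h>
    #include <stdbool.h>

    static bool
    replacement_selection_wins(double ntuples, double heap_tuples,
                               double comparison_cost,
                               double rs_io_bytes, double qs_io_bytes,
                               double io_cost_per_byte)
    {
        /* replacement selection: ~n * log2(heap size) comparisons plus its I/O */
        double      rs_cost = ntuples * log2(heap_tuples) * comparison_cost +
                              rs_io_bytes * io_cost_per_byte;

        /* quicksorted runs: ~n * log2(n) comparisons plus their I/O */
        double      qs_cost = ntuples * log2(ntuples) * comparison_cost +
                              qs_io_bytes * io_cost_per_byte;

        return rs_cost < qs_cost;
    }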

>> I wonder if there's a way to accomplish what you're trying to do here
>> that avoids the need to have a cost model at all.  As I understand it,
>> and please correct me wherever I go off the rails, the situation is:
>>
>> 1. If we're sorting a large amount of data, such that we can't fit it
>> all in memory, we will need to produce a number of sorted runs and
>> then merge those runs.  If we generate each run using a heap with
>> replacement selection, rather than quicksort, we will produce runs
>> that are, on the 

Re: [HACKERS] [PROPOSAL] Backup and recovery of pg_statistic

2015-12-22 Thread Bruce Momjian
On Fri, Nov 27, 2015 at 06:52:59PM +0300, Dmitry Ivanov wrote:
> Proposed changes to pg_dump
> ===
> 
> Now that the approach has been developed, it may be applied to improve the 
> 'pg_dump' utility. Some minor code changes would make the 'pg_dump' emit 
> specially-formed recovery INSERTS for 'pg_statistic' in the 'binary-upgrade' 
> mode, thus allowing us to restore saved stats after an upgrade.

I think this is great.  My only question for the list is how much does
this dumping create new compatibility requirements between major
versions.  While pg_upgrade could disable it, pg_dump can't because it
doesn't know what PG version it is being loaded into.  Can we create a
format that is portable to all future PG versions, or at least detect
incompatibility?  I am thinking we would need some pg_statistic version
number in the dump data that can cause the data to be ignored.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +




Re: [HACKERS] Using quicksort for every external sort run

2015-12-22 Thread Peter Geoghegan
On Tue, Dec 22, 2015 at 2:57 PM, Robert Haas  wrote:
> On Tue, Dec 22, 2015 at 4:37 PM, Peter Geoghegan  wrote:
>> That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
>> DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
>> something, though.
>>

> Sure, there are arbitrary numbers all over the code, driven by
> empirical observations about what factors are important to model.  But
> this is not that.  You don't have a thing called seq_page_cost and a
> thing called cpu_tuple_cost and then say, well, empirically the ratio
> is about 100:1, so let's make the former 1 and the latter 0.01.  You
> just have some numbers, and it's not clear what, if anything, they
> actually represent.

What I find difficult to accept about what you say here is that at
*this* level, something like cost_sort() has little to recommend it.
It costs a sort of a text attribute at the same level as the cost of
sorting the same tuples using an int4 attribute (based on the default
cpu_operator_cost for C functions -- without any attempt to
differentiate text and int4).

Prior to 9.5, sorting text took about 5 - 10 times longer than this
similar int4 sort. That's a pretty big difference, and yet I recall no
complaints. The cost of a comparison in a sort can hardly be
considered in isolation, anyway -- cache efficiency is at least as
important.

Of course, the point is that the goal of a cost model is not to
simulate reality as closely as possible -- it's to produce a good
outcome for performance purposes under realistic assumptions.
Realistic assumptions include that you can't hope to account for
certain differences in cost. Avoiding a terrible outcome is very
important, but the worst case for useselection() is no worse than
today's behavior (or a lost opportunity to do better than today's
behavior).

Recently, the paper that was posted to the list about the Postgres
optimizer stated formally what I know I had a good intuitive sense of
for a long time: that better selectivity estimates are much more
important than better cost models in practice. The "empirical
observations" driving something like DEFAULT_EQ_SEL are very weak --
but what are you gonna do?

> The crossover point is clamped to a minimum of 40% [constant #1] and a
> maximum of 85% [constant #2] when the size of the SortTuple array is
> no more than MaxAllocSize.  Between those bounds, the crossover point
> is 90% [constant #3] minus 7.5% [constant #4] per 32-byte increment
> [constant #5] of estimated average tuple size.  On the other hand,
> when the estimated average tuple size exceeds MaxAllocSize, the
> crossover point is either 90% [constant #6] or 95% [constant #7]
> depending on whether the average tuple size is greater than 32 bytes
> [constant #8].  But if the row count hit is less than 50% [constant
> #9] of the rows we've already seen, then we ignore it and do not use
> selection.
>
> You make no attempt to justify why any of these numbers are correct,
> or what underlying physical reality they represent.

Just like selfuncs.h for the most part, then.

> The comment which
> describes the manner in which crossover point is computed for
> SortTuple volumes under 1GB says "Starting from a threshold of 90%,
> refund 7.5% per 32 byte average-size-increment."  That is a precise
> restatement of what the code does, but it doesn't attempt to explain
> why it's a good idea.  Perhaps the reader should infer that the
> crossover point drops as the tuples get bigger, except that in the
> over-1GB case, a larger tuple size causes the crossover point to go
> *up* while in the under-1GB case, a larger tuple size causes the
> crossover point to go *down*.  Concretely, if we're sorting 44,739,242
> 224-byte tuples, the estimated crossover point is 40%.  If we're
> sorting 44,739,243 244-byte tuples, the estimated crossover point is
> 95%.  That's an extremely sharp discontinuity, and it seems very
> unlikely that any real system behaves that way.

Again, the goal of the cost model is not to model reality as such.
This cost model is conservative about using replacement selection. It
makes sense when you consider that there tends to be a lot fewer
external sorts on a realistic workload -- if we can cut that number in
half, which seems quite possible, that's pretty good, especially from
a DBA's practical perspective. I want to buffer DBAs against suddenly
incurring more I/O, but not at the risk of having a far longer sort
for the first run. Or with minimal exposure to that risk. The cost
model weighs the cost of the hint being wrong to some degree (which is
indeed novel). I think it makes sense in light of the cost and
benefits in this case, although I will add that I'm not entirely
comfortable with it. I just don't imagine that there is a solution
that I will be fully comfortable with. There may be one that
superficially looks correct, but I see little point in that.

> I'm prepared to concede that constant #9 - 

Re: [HACKERS] Possible marginally-incompatible change to array subscripting

2015-12-22 Thread Tom Lane
Pavel Stehule  writes:
> 2015-12-22 18:34 GMT+01:00 Robert Haas :
>> On Tue, Dec 22, 2015 at 11:51 AM, Tom Lane  wrote:
>>> ISTM that if we'd had Yury's code in there from the beginning, what we
>>> would define this as meaning is "a[3:4][:5]", ie the implied range runs
>>> from whatever the array lower bound is up to the specified subscript.

>> Gosh, our arrays are strange.  I would have expected a[3:4][5] to mean
>> a[3:4][5:5].

> exactly,

Since it's not clear that we've got consensus on doing anything
differently, I've adjusted the current patch to preserve the existing
behavior here (and added some regression tests showing that behavior).
If we do decide to change it, it'd be more appropriate to make that
change in a separate commit, anyway.

regards, tom lane




Re: [HACKERS] Freeze avoidance of very large table.

2015-12-22 Thread Bruce Momjian
On Thu, Dec 17, 2015 at 06:44:46AM +, Simon Riggs wrote:
> >> Thank you for having a look.
> >
> > I would not bother mentioning this detail in the pg_upgrade manual page:
> >
> > +   Since the format of visibility map has been changed in version 9.6,
> > +   pg_upgrade creates and rewrites a new <literal>_vm</literal>
> > +   file even if upgrading from 9.5 or before to 9.6 or later with link
> > +   mode (-k).
> 
> Really?  I know we don't always document things like this, but it
> seems like a good idea to me that we do so.
> 
> 
> Agreed.
> 
> For me, rewriting the visibility map is a new data loss bug waiting to happen.
> I am worried that the group is not taking seriously the potential for
> catastrophe here. I think we can do it, but I think it needs these things
> 
> * Clear notice that it is happening unconditionally and unavoidably
> * Log files showing it has happened, action by action
> * Very clear mechanism for resolving an incomplete or interrupted upgrade
> process. Which VMs got upgraded? Which didn't?
> * Ability to undo an upgrade attempt, somehow, ideally automatically by 
> default
> * Ability to restart a failed upgrade attempt without doing a "double 
> upgrade",
> i.e. ensure transformation is immutable

Have you forgotten how pg_upgrade works?  This new vm file is only
created on the new cluster, which is not usable if pg_upgrade doesn't
complete successfully.  pg_upgrade never modifies the old cluster.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +




Re: [HACKERS] Some questions about the array.

2015-12-22 Thread Tom Lane
Yury Zhuravlev  writes:
> New patch version in attachment.  

I've committed this with a number of revisions, mostly but not entirely
cosmetic.  Worthy of note:

* I did not like the way you were inserting the replacement subscript
values:

+   arrays = (AnyArrayType *) DatumGetArrayTypeP(array_source);
+   indexexpr = AARR_LBOUND(arrays)[i] + AARR_DIMS(arrays)[i] - 1;

If the source array is toasted, this causes an extra detoast operation
for *each* omitted subscript.  That could be pretty high overhead for
a large array.  The best way to avoid that is to postpone the actual
substitution of the replacement subscript values into array_get_slice
and array_set_slice; which complicates their APIs a bit more, but those
were pretty long argument lists already.

* Having done that, there was no very good reason for the blanket
prohibition on using omitted subscripts in the slice-set case.  We only
really need to fail if we're constructing the array from scratch, when
we don't have any existing subscript limits to substitute.

regards, tom lane




[HACKERS] pgbench --latency-limit option

2015-12-22 Thread Tatsuo Ishii
While playing with 9.5's pgbench, I faced with a strange behavior.

$ pgbench -p 11002 --rate 2 --latency-limit 1 -c 10 -T 10 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
duration: 10 s
number of transactions actually processed: 16
number of transactions skipped: 0 (0.000 %)
number of transactions above the 1.0 ms latency limit: 16 (100.000 %)
latency average: 5.518 ms
latency stddev: 1.834 ms
rate limit schedule lag: avg 0.694 (max 1.823) ms
tps = 1.599917 (including connections establishing)
tps = 1.600319 (excluding connections establishing)

From the pgbench manual:

  --latency-limit=limit

    Transactions which last more than limit milliseconds
    are counted and reported separately, as late.

    When throttling is used (--rate=...), transactions that
    lag behind schedule by more than limit ms, and thus
    have no hope of meeting the latency limit, are not sent to the server
    at all. They are counted and reported separately as
    skipped.

In my understanding, this says: any transaction that takes longer than the
duration specified by --latency-limit (in this case 1.0 ms) will not
be sent to the server. In the case above, all (16) transactions were behind
the latency limit of 1.0 ms:

number of transactions above the 1.0 ms latency limit: 16 (100.000 %)

So the number of skipped transactions should be 16 (100%), rather
than:

number of transactions skipped: 0 (0.000 %)

in this case I think.

Am I missing something?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-12-22 Thread Thomas Munro
On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro wrote:
> On Fri, Dec 18, 2015 at 7:38 AM, Masahiko Sawada wrote:
> [000-_multi_sync_replication_v3.patch]
>
> Hi Masahiko,
>
> I haven't tested this version of the patch but I have some comments on the 
> code.
>
> +/* Is this wal sender considerable one? */
> +bool
> +SyncRepActiveListedWalSender(int num)
>
> Maybe "Is this wal sender managing a standby that is streaming and
> listed as a synchronous standby?"
>
> +/*
> + * Obtain three palloc'd arrays containing position of standbys currently
> + * considered as synchronous, and its length.
> + */
> +int
> +SyncRepGetSyncStandbys(int *sync_standbys)
>
> This comment seems to be out of date.  I would say "Populate a
> caller-supplied array which must have enough space for ...  Returns
> ...".
>
> +/*
> + * Obtain standby currently considered as synchronous using
> + * '1-priority' method.
> + */
> +int
> +SyncRepGetSyncStandbysOnePriority(int *sync_standbys)
> + ... code ...
>
> Why do we need a separate function and code path for this case?  If
> you used SyncRepGetSyncStandbysPriority with a size of 1, should it
> not produce the same result in the same time complexity?
>
> +/*
> + * Obtain standby currently considered as synchronous using
> + * 'priority' method.
> + */
> +int
> +SyncRepGetSyncStandbysPriority(int *sync_standbys)
>
> I would say something more descriptive, maybe like this: "Populates a
> caller-supplied buffer with the walsnds indexes of the highest
> priority active synchronous standbys, up to the a limit of
> 'synchronous_standby_num'.  The order of the results is undefined.
> Returns the number of results  actually written."
>
> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
> above, then this function could be renamed to SyncRepGetSyncStandbys.
> I think it would be a tiny bit nicer if it also took a Size n argument
> along with the output buffer pointer.
>
> As for the body of that function (which I won't paste here), it
> contains an algorithm to find the top K elements in an array of N
> elements.  It does that with a linear search through the top K seen so
> far for each value in the input array, so its worst case is O(KN)
> comparisons.  Some of the sorting gurus on this list might have
> something to say about that but my take is that it seems fine for the
> tiny values of K and N that we're dealing with here, and it's nice
> that it doesn't need any space other than the output buffer, unlike
> some other top-K algorithms which would win for larger inputs.
>
> + /* Found sync standby */
>
> This comment would be clearer as "Found lowest priority standby, so replace 
> it".
>
> + if (walsndloc->sync_standby_priority == priority &&
> + walsnd->sync_standby_priority < priority)
> + sync_standbys[j] = i;
>
> In this case, couldn't you also update 'priority' directly, and break
> out of the loop immediately?

Oops, I didn't think that though: you can't break from the loop, you
still need to find the new lowest priority, so I retract that bit.

> Wouldn't "lowest_priority" be a better
> variable name than "priority"?  It might be good to say "lowest"
> rather than "highest" in the nearby comments, to be consistent with
> other parts of the code including the function name (lower priority
> number means higher priority!).
>
> +/*
> + * Obtain currently synced LSN: write and flush,
> + * using '1-prioirty' method.
>
> s/prioirty/priority/
>
> + */
> +bool
> +SyncRepGetSyncLsnsOnePriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>
> Similar to the earlier case, why have a special case for 1-priority?
> Wouldn't SyncRepGetSyncLsnsPriority produce the same result when is
> synchronous_standby_num == 1?
>
> +/*
> + * Obtain currently synced LSN: write and flush,
> + * using 'prioirty' method.
>
> s/prioirty/priority/
>
> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
> +{
> + int *sync_standbys = NULL;
> + int num_sync;
> + int i;
> + XLogRecPtr synced_write = InvalidXLogRecPtr;
> + XLogRecPtr synced_flush = InvalidXLogRecPtr;
> +
> + sync_standbys = (int *) palloc(sizeof(int) * synchronous_standby_num);
>
> Would a fixed size buffer on the stack (of compile time constant size)
> be better than palloc/free in here and elsewhere?
>
> + /*
> + for (i = 0; i < num_sync; i++)
> + {
> + volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
> + if (walsndloc == MyWalSnd)
> + {
> + found = true;
> + break;
> + }
> + }
> + */
>
> Dead code.
>
> + if (synchronous_replication_method == SYNC_REP_METHOD_1_PRIORITY)
> + synchronous_standby_num = 1;
> + else
> + synchronous_standby_num = pg_atoi(lfirst(list_head(elemlist)),
> sizeof(int), 0);
>
> Should we detect if synchronous_standby_num > the number of listed
> servers, which would be a nonsensical configuration?  Should we also
> impose some other kind of constant limits, like must be >= 0 (I
> haven't tried but I wonder if -1 leads to very 

Re: [HACKERS] [PROPOSAL] Backup and recovery of pg_statistic

2015-12-22 Thread Tom Lane
Bruce Momjian  writes:
> On Fri, Nov 27, 2015 at 06:52:59PM +0300, Dmitry Ivanov wrote:
>> Now that the approach has been developed, it may be applied to improve the 
>> 'pg_dump' utility. Some minor code changes would make the 'pg_dump' emit 
>> specially-formed recovery INSERTS for 'pg_statistic' in the 'binary-upgrade' 
>> mode, thus allowing us to restore saved stats after an upgrade.

> I think this is great.  My only question for the list is how much does
> this dumping create new compatibility requirements between major
> versions.

That concern is exactly the reason why we never did this originally.
In particular, emitting raw INSERTs into pg_statistic is just plain
foolish; we have changed the rowtype of pg_statistic in the past and
are likely to do so again.  At a minimum we would need such a facility
to be proof against addition of more statistic "slots" (more columns)
to pg_statistic.

While at Salesforce, I did some work on this problem, and came up
with code that emitted calls to functions that would insert data for
a single "slot".  That is, the dump consisted of calls to functions
defined more or less like this:

pg_load_statistics_slot (table_oid regclass,
 attr_name name,
 slot_kind int,
 slot_op oid,
 slot_numbers float4[],
 slot_values anyarray);

and there was another one to load the non-slot columns of a pg_statistic
entry.  And of course there was code to emit such a dump, producing one
dump statement per occupied "slot" in pg_statistic plus one call to
the other function per pg_statistic row.

An API like this seems a good deal more future-proof than plain INSERTs.
Even if the underlying pg_statistic representation changes, such functions
could try to adapt the presented data to the new format, or at worst they
could be redefined as no-ops.

The major problem that I never had a very good solution for is how the
"load" function could validate that the presented data was actually valid
for the specified table attribute's data type and slot kind.  The absolute
minimum that you would want it to do is to cross-check that the
slot_values array has the correct element datatype, because if it doesn't
the backend is just about certain to crash when it tries to use the data.
I think there are probably other assumptions that would need to be
validated to be safe, depending on the code that looks at each slot kind.
I have not found an answer other than the load function knowing all there
is to know about each defined slot kind; which seems like a maintenance
problem, not to mention breaking extensions (like PostGIS) that define
their own slot kinds.
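For concreteness, a schematic sketch of that minimum element-type cross-check,
assuming a load function shaped like the one above (the function name and
argument positions are hypothetical):

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/array.h"
    #include "utils/lsyscache.h"

    PG_FUNCTION_INFO_V1(pg_load_statistics_slot_sketch);

    Datum
    pg_load_statistics_slot_sketch(PG_FUNCTION_ARGS)
    {
        Oid         reloid = PG_GETARG_OID(0);
        Name        attname = PG_GETARG_NAME(1);
        ArrayType  *slot_values = PG_GETARG_ARRAYTYPE_P(5);
        AttrNumber  attnum = get_attnum(reloid, NameStr(*attname));

        if (attnum == InvalidAttrNumber)
            ereport(ERROR,
                    (errmsg("column does not exist: %s", NameStr(*attname))));

        /* minimal safety check: the values array must match the column's type */
        if (ARR_ELEMTYPE(slot_values) != get_atttype(reloid, attnum))
            ereport(ERROR,
                    (errmsg("slot_values element type does not match column type")));

        /* slot-kind-specific validation and the actual pg_statistic update
         * would go here */
        PG_RETURN_VOID();
    }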

Failing any good solution to that, we could probably get by by restricting
use of these functions to superusers (which is certainly where direct use
of INSERTs would have to stop).  But that puts a pretty considerable crimp
in the usefulness of the stats dump/load process.  Ideally, ordinary users
could use this facility to transfer statistics for their own tables, just
as they can use pg_dump ... but without adequate validity checking, it's
way too much of a security hazard to allow that.

I don't have access to this code anymore, but could probably persuade
Salesforce to donate it.  Or we could just write something similar
from scratch.

regards, tom lane




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-12-22 Thread Thomas Munro
On Fri, Dec 18, 2015 at 7:38 AM, Masahiko Sawada  wrote:
[000-_multi_sync_replication_v3.patch]

Hi Masahiko,

I haven't tested this version of the patch but I have some comments on the code.

+/* Is this wal sender considerable one? */
+bool
+SyncRepActiveListedWalSender(int num)

Maybe "Is this wal sender managing a standby that is streaming and
listed as a synchronous standby?"

+/*
+ * Obtain three palloc'd arrays containing position of standbys currently
+ * considered as synchronous, and its length.
+ */
+int
+SyncRepGetSyncStandbys(int *sync_standbys)

This comment seems to be out of date.  I would say "Populate a
caller-supplied array which must have enough space for ...  Returns
...".

+/*
+ * Obtain standby currently considered as synchronous using
+ * '1-priority' method.
+ */
+int
+SyncRepGetSyncStandbysOnePriority(int *sync_standbys)
+ ... code ...

Why do we need a separate function and code path for this case?  If
you used SyncRepGetSyncStandbysPriority with a size of 1, should it
not produce the same result in the same time complexity?

+/*
+ * Obtain standby currently considered as synchronous using
+ * 'priority' method.
+ */
+int
+SyncRepGetSyncStandbysPriority(int *sync_standbys)

I would say something more descriptive, maybe like this: "Populates a
caller-supplied buffer with the walsnds indexes of the highest
priority active synchronous standbys, up to the a limit of
'synchronous_standby_num'.  The order of the results is undefined.
Returns the number of results  actually written."

If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
above, then this function could be renamed to SyncRepGetSyncStandbys.
I think it would be a tiny bit nicer if it also took a Size n argument
along with the output buffer pointer.

As for the body of that function (which I won't paste here), it
contains an algorithm to find the top K elements in an array of N
elements.  It does that with a linear search through the top K seen so
far for each value in the input array, so its worst case is O(KN)
comparisons.  Some of the sorting gurus on this list might have
something to say about that but my take is that it seems fine for the
tiny values of K and N that we're dealing with here, and it's nice
that it doesn't need any space other than the output buffer, unlike
some other top-K algorithms which would win for larger inputs.
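For reference, the general shape of that linear top-K scan written generically
(a sketch, not the patch's code; a smaller priority number means a higher
priority):

    /* Write the indexes of the k lowest-numbered entries of priority[0..n-1]
     * into out[] and return how many were written.  Worst case is O(k*n)
     * comparisons, with no extra space beyond out[]. */
    static int
    top_k_by_priority(const int *priority, int n, int k, int *out)
    {
        int         nout = 0;
        int         i;
        int         j;

        for (i = 0; i < n; i++)
        {
            int         worst;

            if (nout < k)
            {
                out[nout++] = i;
                continue;
            }

            /* find the kept entry with the worst (largest) priority number */
            worst = 0;
            for (j = 1; j < nout; j++)
                if (priority[out[j]] > priority[out[worst]])
                    worst = j;

            /* replace it if the current candidate beats it */
            if (priority[i] < priority[out[worst]])
                out[worst] = i;
        }
        return nout;
    }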

+ /* Found sync standby */

This comment would be clearer as "Found lowest priority standby, so replace it".

+ if (walsndloc->sync_standby_priority == priority &&
+ walsnd->sync_standby_priority < priority)
+ sync_standbys[j] = i;

In this case, couldn't you also update 'priority' directly, and break
out of the loop immediately?  Wouldn't "lowest_priority" be a better
variable name than "priority"?  It might be good to say "lowest"
rather than "highest" in the nearby comments, to be consistent with
other parts of the code including the function name (lower priority
number means higher priority!).
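
For what it's worth, here is a rough standalone sketch of the linear top-K
selection being discussed (all names and values are illustrative; this is not
the patch's code): scan every candidate once and, once the output buffer is
full, replace the kept entry that has the worst, i.e. highest, priority number.

#include <stdio.h>

#define NUM_CANDIDATES 8
#define NUM_SYNC 3				/* K: number of standbys to keep */

int
main(void)
{
	int			priorities[NUM_CANDIDATES] = {5, 2, 7, 1, 9, 3, 8, 4};
	int			sync_standbys[NUM_SYNC];	/* indexes of the kept candidates */
	int			num_sync = 0;
	int			i,
				j,
				worst;

	for (i = 0; i < NUM_CANDIDATES; i++)
	{
		if (num_sync < NUM_SYNC)
		{
			/* output buffer not full yet: keep unconditionally */
			sync_standbys[num_sync++] = i;
			continue;
		}

		/* find the kept entry with the worst (highest) priority number */
		worst = 0;
		for (j = 1; j < num_sync; j++)
			if (priorities[sync_standbys[j]] > priorities[sync_standbys[worst]])
				worst = j;

		/* replace it only if the new candidate has a lower (better) number */
		if (priorities[i] < priorities[sync_standbys[worst]])
			sync_standbys[worst] = i;
	}

	/* the order of the kept entries is undefined, as noted above */
	for (i = 0; i < num_sync; i++)
		printf("kept candidate %d (priority %d)\n",
			   sync_standbys[i], priorities[sync_standbys[i]]);
	return 0;
}

That is O(K*N) comparisons in the worst case, which seems perfectly adequate
for the tiny K and N involved here.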

+/*
+ * Obtain currently synced LSN: write and flush,
+ * using '1-prioirty' method.

s/prioirty/priority/

+ */
+bool
+SyncRepGetSyncLsnsOnePriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)

Similar to the earlier case, why have a special case for 1-priority?
Wouldn't SyncRepGetSyncLsnsPriority produce the same result when
synchronous_standby_num == 1?

+/*
+ * Obtain currently synced LSN: write and flush,
+ * using 'prioirty' method.

s/prioirty/priority/

+SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
+{
+ int *sync_standbys = NULL;
+ int num_sync;
+ int i;
+ XLogRecPtr synced_write = InvalidXLogRecPtr;
+ XLogRecPtr synced_flush = InvalidXLogRecPtr;
+
+ sync_standbys = (int *) palloc(sizeof(int) * synchronous_standby_num);

Would a fixed size buffer on the stack (of compile time constant size)
be better than palloc/free in here and elsewhere?
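
Purely for illustration, a minimal standalone sketch of that alternative
(caller-supplied fixed-size stack buffer plus an explicit capacity argument,
so no palloc/pfree is needed); the cap and the names are assumptions, not
anything taken from the patch:

#include <stdio.h>
#include <stddef.h>

#define MAX_SYNC_STANDBYS 8		/* assumed compile-time cap */

/* writes at most 'n' entries into 'out'; returns how many were written */
static int
get_sync_standbys(int *out, size_t n)
{
	int			candidates[] = {4, 7, 2};	/* stand-in for a scan of walsnds */
	size_t		i;

	for (i = 0; i < n && i < sizeof(candidates) / sizeof(candidates[0]); i++)
		out[i] = candidates[i];
	return (int) i;
}

int
main(void)
{
	int			sync_standbys[MAX_SYNC_STANDBYS];	/* stack, no palloc/pfree */
	int			num_sync = get_sync_standbys(sync_standbys, MAX_SYNC_STANDBYS);
	int			i;

	for (i = 0; i < num_sync; i++)
		printf("%d\n", sync_standbys[i]);
	return 0;
}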

+ /*
+ for (i = 0; i < num_sync; i++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
+ if (walsndloc == MyWalSnd)
+ {
+ found = true;
+ break;
+ }
+ }
+ */

Dead code.

+ if (synchronous_replication_method == SYNC_REP_METHOD_1_PRIORITY)
+ synchronous_standby_num = 1;
+ else
+ synchronous_standby_num = pg_atoi(lfirst(list_head(elemlist)),
sizeof(int), 0);

Should we detect if synchronous_standby_num > the number of listed
servers, which would be a nonsensical configuration?  Should we also
impose some other kind of constant limits, like must be >= 0 (I
haven't tried but I wonder if -1 leads to very large palloc) and must
be <= MAX_XXX (smallish sanity check number like 256, rather than the
INT_MAX limit imposed by pg_atoi), so that we could use that constant
to size stack buffers in the places where you currently palloc?
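
A standalone sketch of the sort of sanity checks meant here (the function
name, the messages and the cap are illustrative assumptions, not the patch's
code):

#include <stdio.h>
#include <stdbool.h>

#define MAX_SYNC_STANDBYS 256	/* assumed smallish cap, also usable for stack buffers */

static bool
check_synchronous_standby_num(int num_sync, int num_listed_standbys)
{
	if (num_sync < 1 || num_sync > MAX_SYNC_STANDBYS)
	{
		fprintf(stderr, "synchronous_standby_num must be between 1 and %d\n",
				MAX_SYNC_STANDBYS);
		return false;
	}
	if (num_sync > num_listed_standbys)
	{
		fprintf(stderr, "synchronous_standby_num (%d) exceeds the number of listed standbys (%d)\n",
				num_sync, num_listed_standbys);
		return false;
	}
	return true;
}

int
main(void)
{
	/* e.g. synchronous_standby_names = '2, node_a, node_b, node_c' */
	printf("%d\n", check_synchronous_standby_num(2, 3));	/* accepted */
	printf("%d\n", check_synchronous_standby_num(-1, 3));	/* rejected: out of range */
	printf("%d\n", check_synchronous_standby_num(4, 3));	/* rejected: more than listed */
	return 0;
}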

Could 1-priority mode be inferred from the use of a non-number in the
leading position, and if so, does the mode concept even need to exist,
especially if SyncRepGetSyncLsnsOnePriority and
SyncRepGetSyncStandbysOnePriority aren't really needed either 

Re: [HACKERS] Possible marginally-incompatible change to array subscripting

2015-12-22 Thread Jim Nasby

On 12/22/15 12:01 PM, Tom Lane wrote:

Yury Zhuravlev  writes:

If you are going to break backwards compatibility, could arrays be made to
work in a style similar to C/C++/Python/Ruby and other languages?
I'm sorry to bring up this thread again...


I am not sure just exactly how incompatible that would be, but surely it
would break enormously more code than what we're discussing here.
So no, I don't think any such proposal has a chance.  There are degrees
of incompatibility, and considering a small/narrow one does not mean that
we'd also consider major breakage.


As I see it, the biggest problem with our arrays is that they can't 
decide if they're a simple array (which means >1 dimension is an array 
of arrays) or a matrix (all slices in a dimension must be the same 
size). They seem to be more like matrices than arrays, but then there's
a bunch of places that completely ignore dimensionality. It would be 
nice to standardize them one way or another, but it seems like the 
breakage from that would be horrific.


One could theoretically construct a custom "type" that followed more 
traditional semantics, but then you'd lose all the syntax... which I 
suspect would make any such "type" all but unusable. The other problem 
would be having it deal with any other data type, but at least there's 
ways you can work around that for the most part.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] A Typo in regress/sql/privileges.sql

2015-12-22 Thread Peter Geoghegan
On Tue, Dec 22, 2015 at 11:01 AM, Robert Haas  wrote:
>> TBH I'm kinda inclined to sort this out by removing all usage of the
>> word "inference" everywhere --- error messages and code comments and
>> documentation wording, and replace it with some other wording as
>> appropriate for each context.
>
> I would not object to that.

Why? There is nothing wrong with the term inference. It's perfectly
descriptive. It's also a term I've used as part of many talks about ON
CONFLICT. No one has said a word about it before now.

-- 
Peter Geoghegan




Re: [HACKERS] On columnar storage (2)

2015-12-22 Thread Michael Paquier
On Tue, Dec 22, 2015 at 11:43 PM, Alvaro Herrera
 wrote:
> Michael Paquier wrote:
>> On Wed, Dec 9, 2015 at 3:10 PM, Jeff Janes  wrote:
>> > Could we get this rebased past the merge of the parallel execution commits?
>>
>> +1. Alvaro, Tomas, Simon, what are the next plans with those patches?
>
> Yeah, I've been working intermittently on getting the whole tree rebased
> and squashed, because after the last submission we made a lot of
> progress.  I'll repost later.  I think it should be marked "returned
> with feedback" for now.

Ok. Noted. Thanks.
-- 
Michael




Re: [HACKERS] A Typo in regress/sql/privileges.sql

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 2:23 PM, Peter Geoghegan  wrote:
> On Tue, Dec 22, 2015 at 11:01 AM, Robert Haas  wrote:
>>> TBH I'm kinda inclined to sort this out by removing all usage of the
>>> word "inference" everywhere --- error messages and code comments and
>>> documentation wording, and replace it with some other wording as
>>> appropriate for each context.
>>
>> I would not object to that.
>
> Why? There is nothing wrong with the term inference. It's perfectly
> descriptive. It's also a term I've used as part of many talks about ON
> CONFLICT. No one has said a word about it before now.

If it's an axiom that there is nothing wrong with the term inference,
then obviously we should not change anything.  But that seems to me to
be putting the cart before the horse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 12:31 PM, Robert Haas  wrote:
> On Mon, Dec 21, 2015 at 2:55 PM, Peter Geoghegan  wrote:
>> On Mon, Dec 21, 2015 at 10:53 AM, Robert Haas  wrote:
>>> PFA my proposal for comment changes for 9.5 and master.  This is based
>>> on your 0001, but I edited somewhat.  Please let me know your
>>> thoughts.  I am not willing to go further and rearrange actual code in
>>> 9.5 at this point; it just isn't necessary.
>>
>> Fine by me. But this revision hasn't made the important point at all
>> -- which is that 0002 is safe. That's a stronger guarantee than the
>> abbreviated key representation being pass-by-value.
>
> Right.  I don't think that we should back-patch that stuff into 9.5.

OK, so I've gone ahead and committed and back-patched that.  Can you
please rebase and repost the remainder as a 9.6 proposal?

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] A Typo in regress/sql/privileges.sql

2015-12-22 Thread Peter Geoghegan
On Tue, Dec 22, 2015 at 3:38 PM, Robert Haas  wrote:
> In my opinion a term more closely coupled to the concrete syntax would
> be easier to understand.  I have no objection to referring to the
> *process* of trying to deduce a suitable index from the ON CONFLICT
> clause as "inference".  But calling the ON CONFLICT clause an
> "inference specification" is, in my opinion, an unnecessary oblique
> way of referring to it.   If you renamed InferenceElem to
> InsertOnConflictElem, I think that would be strictly more clear.

The documentation uses the term "unique index inference" to introduce
the concept. It then uses "inference" as a shorthand a couple of times
when the context is very well established. So I don't see that I've
done that at all.

As for the one user-visible error messages where the term "inference
specification" is used, that message also has a hint that draws
particular attention to what is meant:

if (onConflictClause->action == ONCONFLICT_UPDATE && !infer)
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
 errmsg("ON CONFLICT DO UPDATE requires inference
specification or constraint name"),
 errhint("For example, ON CONFLICT (column_name)."),
 parser_errposition(pstate,
  exprLocation((Node *) onConflictClause))));

(There is one appearance of "inference specification" in a defensive
elog() call).

So I still don't understand why anyone takes issue with this. It's a
total mystery to me.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2015-12-22 Thread Peter Geoghegan
On Tue, Dec 22, 2015 at 9:10 AM, Robert Haas  wrote:
> So I was looking at the 0001 patch

Thanks. I'm going to produce a revision of 0002 shortly, so perhaps
hold off on that one. The big change there will be to call
grow_memtuples() to allow us to increase the number of slots without
palloc() overhead spuriously being weighed (since the memory for the
final on-the-fly merge phase doesn't have palloc() overhead). Also,
will incorporate what Jeff and you wanted around terminology.

> This looks like voodoo to me.  I assume you tested it and maybe it
> gives correct answers, but it's got to be some kind of world record
> for number of arbitrary constants per SLOC, and there's no real
> justification for any of it.  The comments say, essentially, well, we
> do this because it works.  But suppose I try it on some new piece of
> hardware and it doesn't work well.  What do I do?  Email the author
> and ask him to tweak the arbitrary constants?

That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
something, though.

MaxAllocHugeSize is used fairly arbitrarily in pg_stat_statements.c.
And that part (the MaxAllocSize part of my patch) only defines a point
after which we require a really favorable case for replacement
selection/quicksort with spillover to proceed. It's a safety valve. We
try to err on the side of not using replacement selection.

> I wonder if there's a way to accomplish what you're trying to do here
> that avoids the need to have a cost model at all.  As I understand it,
> and please correct me wherever I go off the rails, the situation is:
>
> 1. If we're sorting a large amount of data, such that we can't fit it
> all in memory, we will need to produce a number of sorted runs and
> then merge those runs.  If we generate each run using a heap with
> replacement selection, rather than quicksort, we will produce runs
> that are, on the average, about twice as long, which means that we
> will have fewer runs to merge at the end.
>
> 2. Replacement selection is slower than quicksort on a per-tuple
> basis.  Furthermore, merging more runs isn't necessarily any slower
> than merging fewer runs.  Therefore, building runs via replacement
> selection tends to lose even though it tends to reduce the number of
> runs to merge.  Even when having a larger number of runs results in an
> increase in the number merge passes, we save so much time building the
> runs that we often (maybe not always) still come out ahead.

I'm with you so far. I'll only add: doing multiple passes ought to be
very rare anyway.

> 3. However, when replacement selection would result in a single run,
> and quicksort results in multiple runs, using quicksort loses.  This
> is especially true when the amount of data we have is between one
> and two times work_mem.  If we fit everything into one run, we do not
> need to write any data to tape, but if we overflow by even a single
> tuple, we have to write a lot of data to tape.

No, this is where you lose me. I think that it's basically not true
that replacement selection can ever be faster than quicksort, even in
the cases where the conventional wisdom would have you believe so
(e.g. what you say here). Unless you have very little memory relative
to data size, or something along those lines. The conventional wisdom
obviously needs some revision, but it was perfectly correct in the
1970s and 1980s.

However, where replacement selection can still help is avoiding I/O
*entirely*. If we can avoid spilling 95% of tuples in the first place,
and quicksort the remaining (heapified) tuples that were not spilled,
and merge an in-memory run with an on-tape run, then we can win big.
Quicksort is not amenable to incremental spilling at all. I call this
"quicksort with spillover" (it is a secondary optimization that the
patch adds). This shows up in EXPLAIN ANALYZE, and avoids a stark
discontinuity in the cost function of sorts. That could really help
with admission control, and simplifying the optimizer, making merge
joins less scary. So with the patch, "quicksort with spillover" and
"replacement selection" are almost synonymous, except that we
acknowledge the historic importance of replacement selection to some
degree. The patch completely discards the conventional use of
replacement selection -- it just preserves its priority queue (heap)
implementation where incrementalism is thought to be particularly
useful (avoiding I/O entirely).

But this comparison has nothing to do with comparing the master branch
with my patch, since the master branch never attempts to avoid I/O
having committed to an external sort. It uses replacement selection in
a way that is consistent with the conventional wisdom, wisdom which
has now been shown to be obsolete.
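
To make the per-tuple cost point concrete, here is a small self-contained C
benchmark (illustrative only, and deliberately unrelated to tuplesort.c) that
sorts the same random integers once through a binary min-heap, the
priority-queue structure replacement selection relies on, and once with
qsort(); on typical hardware the heap path tends to lose, largely because of
its cache behaviour once the heap no longer fits in cache.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void
heap_push(int *heap, int *n, int v)
{
	int			i = (*n)++;

	heap[i] = v;
	while (i > 0)
	{
		int			parent = (i - 1) / 2;
		int			tmp;

		if (heap[parent] <= heap[i])
			break;
		tmp = heap[parent];
		heap[parent] = heap[i];
		heap[i] = tmp;
		i = parent;
	}
}

static int
heap_pop(int *heap, int *n)
{
	int			top = heap[0];
	int			i = 0;

	heap[0] = heap[--(*n)];
	for (;;)
	{
		int			l = 2 * i + 1;
		int			r = l + 1;
		int			s = i;
		int			tmp;

		if (l < *n && heap[l] < heap[s])
			s = l;
		if (r < *n && heap[r] < heap[s])
			s = r;
		if (s == i)
			break;
		tmp = heap[s];
		heap[s] = heap[i];
		heap[i] = tmp;
		i = s;
	}
	return top;
}

static int
cmp_int(const void *a, const void *b)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	return (x > y) - (x < y);
}

int
main(void)
{
	const int	n = 1000000;
	int		   *a = malloc(n * sizeof(int));
	int		   *heap = malloc(n * sizeof(int));
	int			i;
	int			hn = 0;
	clock_t		t;

	srand(42);
	for (i = 0; i < n; i++)
		a[i] = rand();

	t = clock();
	for (i = 0; i < n; i++)
		heap_push(heap, &hn, a[i]);
	for (i = 0; i < n; i++)
		(void) heap_pop(heap, &hn);
	printf("heap (priority queue): %.3f s\n",
		   (double) (clock() - t) / CLOCKS_PER_SEC);

	t = clock();
	qsort(a, n, sizeof(int), cmp_int);
	printf("qsort:                 %.3f s\n",
		   (double) (clock() - t) / CLOCKS_PER_SEC);

	free(a);
	free(heap);
	return 0;
}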

BTW, I think that abandoning incrementalism (replacement selection)
will have future benefits for memory management. I bet we can get away
with one big palloc() 

Re: [HACKERS] Additional role attributes && superuser review

2015-12-22 Thread Stephen Frost
* Michael Paquier (michael.paqu...@gmail.com) wrote:
> On Tue, Dec 22, 2015 at 2:54 PM, Amit Langote
>  wrote:
> > On 2015/12/22 14:05, Michael Paquier wrote:
> >> On Tue, Dec 22, 2015 at 1:41 AM, Stephen Frost  wrote:
> >>> Updated and rebased patch attached which takes the 'pg_switch_xlog'
> >>> default role back out, leaving us with:
> >>>
> >>> pg_monitor - View privileged info
> >>> pg_backup - start/stop backups, switch xlog, create restore points
> >>> pg_replay - Pause/resume xlog replay on replicas
> >>> pg_replication - Create/destroy/etc replication slots
> >>> pg_rotate_logfile - Request logfile rotation
> >>> pg_signal_backend - Signal other backends (cancel query/terminate)
> >>> pg_file_settings - View configuration settings in all config files
> >>
> >> Thanks. This looks fine to me. I just have one single comment:
> >>
> >> +   Request logfile rotation.
> >> s/logfile/transaction log file/
> >
> > Looks correct as is. Or maybe "server's log file" as in:
> >
> > 9.26.2. Server Signaling Functions
> >
> > pg_rotate_logfile(): Rotate server's log file
> 
> You're right, this is not a WAL segment, just a normal log file. Your
> phrasing is better.

Works for me.

Updated patch attached.  I'll give it another good look and then commit
it, barring objections.

Thanks!

Stephen
From f493051be96514c0f0e178ef74d6d824f702a7c2 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Wed, 18 Nov 2015 11:50:57 -0500
Subject: [PATCH 1/3] Add note regarding permissions in pg_catalog

Add a note to the system catalog section pointing out that while
modifying the permissions on catalog tables is possible, it's
unlikely to have the desired effect.
---
 doc/src/sgml/catalogs.sgml | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 97ef618..3b7768c 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -21,6 +21,17 @@
particularly esoteric operations, such as adding index access methods.
   
 
+  
+   
+Changing the permissions on objects in the system catalogs, while
+possible, is unlikely to have the desired effect as the internal
+lookup functions use a cache and do not check the permissions nor
+policies of tables in the system catalog.  Further, permission
+changes to objects in the system catalogs are not preserved by
+pg_dump or across upgrades.
+   
+  
+
  
   Overview
 
-- 
2.5.0


From dece838e3ad74549c6f07dc878c8637dc2db674d Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Wed, 30 Sep 2015 07:04:55 -0400
Subject: [PATCH 2/3] Reserve the "pg_" namespace for roles

This will prevent users from creating roles which begin with "pg_" and
will check for those roles before allowing an upgrade using pg_upgrade.

This will allow for default roles to be provided at initdb time.
---
 doc/src/sgml/ref/psql-ref.sgml  |  8 +--
 src/backend/catalog/catalog.c   |  5 ++--
 src/backend/commands/user.c | 40 
 src/backend/utils/adt/acl.c | 41 +
 src/bin/pg_dump/pg_dumpall.c|  2 ++
 src/bin/pg_upgrade/check.c  | 40 ++--
 src/bin/psql/command.c  |  4 ++--
 src/bin/psql/describe.c |  5 +++-
 src/bin/psql/describe.h |  2 +-
 src/bin/psql/help.c |  4 ++--
 src/include/utils/acl.h |  1 +
 src/test/regress/expected/rolenames.out | 18 +++
 src/test/regress/sql/rolenames.sql  |  8 +++
 13 files changed, 166 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 6d0cb3d..76bb642 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1365,13 +1365,15 @@ testdb=
 
 
   
-\dg[+] [ pattern ]
+\dg[S+] [ pattern ]
 
 
 Lists database roles.
 (Since the concepts of users and groups have been
 unified into roles, this command is now equivalent to
 \du.)
+By default, only user-created roles are shown; supply the
+S modifier to include system roles.
 If pattern is specified,
 only those roles whose names match the pattern are listed.
 If the form \dg+ is used, additional information
@@ -1525,13 +1527,15 @@ testdb=
   
 
   
-\du[+] [ pattern ]
+\du[S+] [ pattern ]
 
 
 Lists database roles.
 (Since the concepts of users and groups have been
 unified into roles, this command is now equivalent to
 \dg.)
+By default, only user-created roles are shown; supply the
+S modifier to include system roles.
 If pattern is specified,
 only those roles whose names match the pattern 

Re: [HACKERS] A Typo in regress/sql/privileges.sql

2015-12-22 Thread Peter Geoghegan
On Tue, Dec 22, 2015 at 2:58 PM, Robert Haas  wrote:
> If it's an axiom that there is nothing wrong with the term inference,
> then obviously we should not change anything.  But that seems to me to
> be putting the cart before the horse.

OK, then. What's wrong with the term inference?

-- 
Peter Geoghegan




Re: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

2015-12-22 Thread Peter Geoghegan
On Tue, Dec 22, 2015 at 2:59 PM, Robert Haas  wrote:
>> Right.  I don't think that we should back-patch that stuff into 9.5.
>
> OK, so I've gone ahead and committed and back-patched that.  Can you
> please rebase and repost the remainder as a 9.6 proposal?

OK. I don't know why you didn't acknowledge in your revision to
sortsupport.h that bitwise inequality must be a reliable proxy for
authoritative value inequality, which is a stronger restriction than
mandating that abbreviated keys always be pass-by-value, but I'm not
going to argue.
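
For anyone following along, a toy standalone illustration of the guarantee in
question (ordinary C, not PostgreSQL code): with a prefix-style abbreviation,
unequal abbreviated keys prove that the full values differ, so the expensive
comparison can be skipped, while equal abbreviated keys prove nothing and
force the full comparison.

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdbool.h>

/* toy abbreviation: pack the first 8 bytes of the string into an integer */
static uint64_t
abbrev_key(const char *s)
{
	uint64_t	k = 0;
	size_t		len = strlen(s);

	memcpy(&k, s, len < 8 ? len : 8);
	return k;
}

static bool
datums_equal(const char *a, const char *b, int *full_comparisons)
{
	if (abbrev_key(a) != abbrev_key(b))
		return false;			/* abbreviated inequality is authoritative */
	(*full_comparisons)++;		/* abbreviated tie: do the real comparison */
	return strcmp(a, b) == 0;
}

int
main(void)
{
	int			full = 0;

	/* same 8-byte prefix: abbreviated keys tie, full comparison is needed */
	printf("%d\n", datums_equal("aardvark-one", "aardvark-two", &full));
	/* different prefixes: the full comparison is skipped entirely */
	printf("%d\n", datums_equal("aardvark-one", "zebra", &full));
	printf("full comparisons performed: %d\n", full);
	return 0;
}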

-- 
Peter Geoghegan




Re: [HACKERS] A Typo in regress/sql/privileges.sql

2015-12-22 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Dec 22, 2015 at 4:36 AM, Peter Geoghegan  wrote:
> > On Fri, Dec 18, 2015 at 12:52 PM, Robert Haas  wrote:
> >> Mind you, I don't think "inference specification" is very good
> >> terminology, but what's there right now is just wrong.
> >
> > It doesn't appear in the documentation. The term "inference
> > specification" only appears where it's necessary to precisely describe
> > the input to unique index inference.
> 
> Well, we can change this to say "inference specification", but I still
> think calling it the "ON CONFLICT" clause would be clearer in this
> context.

TBH I'm kinda inclined to sort this out by removing all usage of the
word "inference" everywhere --- error messages and code comments and
documentation wording, and replace it with some other wording as
appropriate for each context.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Possible marginally-incompatible change to array subscripting

2015-12-22 Thread Yury Zhuravlev

This would make no difference of course for the common case where the
array lower bound is 1, but it seems a lot less arbitrary when it isn't.
So I think we should strongly consider changing it to mean that, even
though it would be non-backwards-compatible in such cases.

Comments?


If you are going to break backwards compatibility, could arrays be made to
work in a style similar to C/C++/Python/Ruby and other languages?

I'm sorry to bring up this thread again...


ISTM that if we'd had Yury's code in there from the beginning, what we
would define this as meaning is "a[3:4][:5]", ie the implied range runs
from whatever the array lower bound is up to the specified subscript.


a[3:4][:5] instead of a[3:4][5], at least that is logical. But then what will
result from a[3:4][5]? One element?

Thanks. 


--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Possible marginally-incompatible change to array subscripting

2015-12-22 Thread Tom Lane
Robert Haas  writes:
> On Tue, Dec 22, 2015 at 11:51 AM, Tom Lane  wrote:
>> ISTM that if we'd had Yury's code in there from the beginning, what we
>> would define this as meaning is "a[3:4][:5]", ie the implied range runs
>> from whatever the array lower bound is up to the specified subscript.

> Gosh, our arrays are strange.  I would have expected a[3:4][5] to mean
> a[3:4][5:5].

Yeah, probably, now that you mention it ... but that seems like too much
of a compatibility break.  Or does anyone want to argue for just doing
that and never mind the compatibility issues?  This is a pretty weird
corner case already; there can't be very many people relying on it.

Another point worth realizing is that the implicit insertion of "1:"
happens in the parser, meaning that existing stored views/rules will dump
out with that added and hence aren't going to change meaning no matter
what we decide here.

(BTW, now that I've read the patch a bit further, it actually silently
changed the semantics as I'm suggesting already.  We could undo that
without too much extra code, but I feel that we shouldn't.  Robert's
idea seems like a plausible alternative, but it would take a nontrivial
amount of code to implement it unless we are willing to double-evaluate
such a subscript.)

regards, tom lane




Re: [HACKERS] Possible marginally-incompatible change to array subscripting

2015-12-22 Thread Tom Lane
Yury Zhuravlev  writes:
> If you are going to break backwards compatibility, could arrays be made to
> work in a style similar to C/C++/Python/Ruby and other languages?
> I'm sorry to bring up this thread again...

I am not sure just exactly how incompatible that would be, but surely it
would break enormously more code than what we're discussing here.
So no, I don't think any such proposal has a chance.  There are degrees
of incompatibility, and considering a small/narrow one does not mean that
we'd also consider major breakage.

regards, tom lane




Re: [HACKERS] Possible marginally-incompatible change to array subscripting

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 12:55 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Dec 22, 2015 at 11:51 AM, Tom Lane  wrote:
>>> ISTM that if we'd had Yury's code in there from the beginning, what we
>>> would define this as meaning is "a[3:4][:5]", ie the implied range runs
>>> from whatever the array lower bound is up to the specified subscript.
>
>> Gosh, our arrays are strange.  I would have expected a[3:4][5] to mean
>> a[3:4][5:5].
>
> Yeah, probably, now that you mention it ... but that seems like too much
> of a compatibility break.  Or does anyone want to argue for just doing
> that and never mind the compatibility issues?  This is a pretty weird
> corner case already; there can't be very many people relying on it.

To be honest, I'd be inclined not to change the semantics at all.  But
that's just me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] A Typo in regress/sql/privileges.sql

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 4:36 AM, Peter Geoghegan  wrote:
> On Fri, Dec 18, 2015 at 12:52 PM, Robert Haas  wrote:
>> Mind you, I don't think "inference specification" is very good
>> terminology, but what's there right now is just wrong.
>
> It doesn't appear in the documentation. The term "inference
> specification" only appears where it's necessary to precisely describe
> the input to unique index inference.

Well, we can change this to say "inference specification", but I still
think calling it the "ON CONFLICT" clause would be clearer in this
context.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] A Typo in regress/sql/privileges.sql

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 6:10 PM, Peter Geoghegan  wrote:
> On Tue, Dec 22, 2015 at 2:58 PM, Robert Haas  wrote:
>> If it's an axiom that there is nothing wrong with the term inference,
>> then obviously we should not change anything.  But that seems to me to
>> be putting the cart before the horse.
>
> OK, then. What's wrong with the term inference?

In my opinion a term more closely coupled to the concrete syntax would
be easier to understand.  I have no objection to referring to the
*process* of trying to deduce a suitable index from the ON CONFLICT
clause as "inference".  But calling the ON CONFLICT clause an
"inference specification" is, in my opinion, an unnecessary oblique
way of referring to it.   If you renamed InferenceElem to
InsertOnConflictElem, I think that would be strictly more clear.

I also kind of dislike the way that this feature has commandeered the
word "inference" for its own use.  The server infers a lot of things,
and it might infer more in the future, but if somebody else ever wants
to use the word for something, they're going to have a hard time not
getting false hits when they grep.  Like there are comments like this
in the regression tests:

-- inference succeeds:
-- inference fails:

That's pretty darn generic.  Whether the command succeeds or fails
will be clear enough from the output.

Mind you, I don't think the problems with inference are a
super-serious issue, or I would have raised a stink sooner.  But I
don't find it particularly appealing, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Possible marginally-incompatible change to array subscripting

2015-12-22 Thread Tom Lane
Jim Nasby  writes:
> One could theoretically construct a custom "type" that followed more 
> traditional semantics, but then you'd lose all the syntax... which I 
> suspect would make any such "type" all but unusable. The other problem 
> would be having it deal with any other data type, but at least there's 
> ways you can work around that for the most part.

Yeah.  We've speculated a bit about allowing other datatypes to have
access to the subscript syntax, which could be modeled as allowing
'a[b]' to be an overloadable operator.  That seems possibly doable if
someone wanted to put time into it.  However, that still leaves a
heck of a lot of functionality on the table, such as automatic creation of
array types corresponding to new scalar types, not to mention the parser's
understanding of "anyarray" vs "anyelement" polymorphism.  I have no idea
how we might make those things extensible.

regards, tom lane




Re: [HACKERS] Getting sorted data from foreign server for merge join

2015-12-22 Thread Ashutosh Bapat
Thanks.

On Wed, Dec 23, 2015 at 12:24 AM, Robert Haas  wrote:

> On Mon, Dec 21, 2015 at 6:34 AM, Ashutosh Bapat
>  wrote:
> >> I went over this patch in some detail today and did a lot of cosmetic
> >> cleanup.  The results are attached.  I'm fairly happy with this
> >> version, but let me know what you think.  Of course, feedback from
> >> others is more than welcome also.
> >>
> >
> > Attached patch with some cosmetic changes (listed here for your quick
> > reference)
> > 1. , was replaced with ; in comment "inner join, expressions in the " at
> one
> > place, which is correct, but missed other place.
> > 2. The comment "First, consider whether any each active EC is
> potentially"
> > should use either "any" or "each". I have reworded it as "First, consider
> > whether any of the active ECs is potentially ...". Or we can use "First,
> > find all of the active ECs which are potentially ".
> > 3. "having the remote side due the sort generally won't be any worse
> ..." -
> > instead of "due" we should use "do"?
> > 4. Added static prototype of function get_useful_ecs_for_relation().
> > 5. The comment "Extract unique EC for query, if any, so we don't
> consider it
> > again." is too crisp. Phrase "Unique EC for query" is confusing; EC can
> not
> > be associated with a query per say and EC's are always unique because of
> > canonicalisation. May be we should reword it as "Extract single EC for
> > ordering of query, if any, so we don't consider it again." Is that
> cryptic
> > as well?
>
> Thanks.  I committed this version with one small tweak.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Add scale(numeric)

2015-12-22 Thread Pavel Stehule
Hi

2015-11-19 3:27 GMT+01:00 Marko Tiikkaja :

> Hi,
>
> As suggested in 5643125e.1030...@joh.to, here's a patch for extracting
> the scale out of a numeric.
>
> This is 2016-01 CF material, but if someone wants to bas^H^H^Hsay nice
> things in the meanwhile, feel free.
>

This patch is pretty trivial and without any possible issues, but I don't see
a use case for this feature.

Why should it be in Postgres?

Pavel


>
>
> .m
>


Re: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

2015-12-22 Thread Peter Geoghegan
On Tue, Dec 22, 2015 at 2:59 PM, Robert Haas  wrote:
> OK, so I've gone ahead and committed and back-patched that.  Can you
> please rebase and repost the remainder as a 9.6 proposal?

I attach a rebased patch for 9.6 only.


-- 
Peter Geoghegan
From 4bc33a602822ef0d56222f03174d71bbb4cd126b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Thu, 9 Jul 2015 15:59:07 -0700
Subject: [PATCH] Reuse abbreviated keys in ordered [set] aggregates

When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.  Only strict abbreviated key binary inequality is considered,
which is safe.
---
 src/backend/catalog/index.c|  2 +-
 src/backend/executor/nodeAgg.c | 20 ++---
 src/backend/executor/nodeSort.c|  2 +-
 src/backend/utils/adt/orderedsetaggs.c | 33 ++-
 src/backend/utils/sort/tuplesort.c | 76 --
 src/include/utils/tuplesort.h  |  4 +-
 6 files changed, 95 insertions(+), 42 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c10be3d..c6737c2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3073,7 +3073,7 @@ validate_index_heapscan(Relation heapRelation,
 			}
 
 			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-  &ts_val, &ts_isnull);
+  &ts_val, &ts_isnull, NULL);
 			Assert(tuplesort_empty || !ts_isnull);
 			if (!tuplesort_empty)
 			{
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2e36855..f007c38 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -539,7 +539,8 @@ fetch_input_tuple(AggState *aggstate)
 
 	if (aggstate->sort_in)
 	{
-		if (!tuplesort_gettupleslot(aggstate->sort_in, true, aggstate->sort_slot))
+		if (!tuplesort_gettupleslot(aggstate->sort_in, true, aggstate->sort_slot,
+	NULL))
 			return NULL;
 		slot = aggstate->sort_slot;
 	}
@@ -894,8 +895,8 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
  * The one-input case is handled separately from the multi-input case
  * for performance reasons: for single by-value inputs, such as the
  * common case of count(distinct id), the tuplesort_getdatum code path
- * is around 300% faster.  (The speedup for by-reference types is less
- * but still noticeable.)
+ * is around 300% faster.  (The speedup for by-reference types without
+ * abbreviated key support is less but still noticeable.)
  *
  * This function handles only one grouping set (already set in
  * aggstate->current_set).
@@ -913,6 +914,8 @@ process_ordered_aggregate_single(AggState *aggstate,
 	MemoryContext workcontext = aggstate->tmpcontext->ecxt_per_tuple_memory;
 	MemoryContext oldContext;
 	bool		isDistinct = (pertrans->numDistinctCols > 0);
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	FunctionCallInfo fcinfo = >transfn_fcinfo;
 	Datum	   *newVal;
 	bool	   *isNull;
@@ -932,7 +935,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 	 */
 
 	while (tuplesort_getdatum(pertrans->sortstates[aggstate->current_set],
-			  true, newVal, isNull))
+			  true, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working context for evaluation of the equality
@@ -950,6 +953,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 			haveOldVal &&
 			((oldIsNull && *isNull) ||
 			 (!oldIsNull && !*isNull &&
+			  oldAbbrevVal == newAbbrevVal &&
 			  DatumGetBool(FunctionCall2(>equalfns[0],
 		 oldVal, *newVal)
 		{
@@ -965,6 +969,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 pfree(DatumGetPointer(oldVal));
 			/* and remember the new one for subsequent equality checks */
 			oldVal = *newVal;
+			oldAbbrevVal = newAbbrevVal;
 			oldIsNull = *isNull;
 			haveOldVal = true;
 		}
@@ -1002,6 +1007,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
 	TupleTableSlot *slot2 = pertrans->uniqslot;
 	int			numTransInputs = pertrans->numTransInputs;
 	int			numDistinctCols = pertrans->numDistinctCols;
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	bool		haveOldValue = false;
 	int			i;
 
@@ -1012,7 +1019,7 @@ process_ordered_aggregate_multi(AggState *aggstate,
 		ExecClearTuple(slot2);
 
 	while (tuplesort_gettupleslot(pertrans->sortstates[aggstate->current_set],
-  true, slot1))
+  true, slot1, &newAbbrevVal))
 	{
 		/*
 		 * Extract the first numTransInputs columns as datums to pass to the
@@ -1023,6 +1030,7 @@ process_ordered_aggregate_multi(AggState *aggstate,
 
 		if (numDistinctCols == 0 ||
 			!haveOldValue ||
+			newAbbrevVal != oldAbbrevVal ||
 			!execTuplesMatch(slot1, slot2,
 			 numDistinctCols,
 			 pertrans->sortColIdx,
@@ -1046,6 +1054,8 @@ process_ordered_aggregate_multi(AggState *aggstate,

Re: [HACKERS] Remove Windows crash dump support?

2015-12-22 Thread Alex Ignatov

On 22.12.2015 18:28, Magnus Hagander wrote:



On Tue, Dec 22, 2015 at 3:53 PM, Craig Ringer wrote:


On 22 December 2015 at 22:50, Craig Ringer wrote:

Hi all

Back in 2010 I submitted a small feature to allow the creation
of minidumps when backends crashed; see
commit dcb09b595f88a3bca6097a6acc17bf2ec935d55f .

At the time Windows lacked useful support for postmortem
debugging and crash-dump management in the operating system
its self, especially for applications running as services.
That has since improved considerably.

The feature was also included in 9.4


Ahem. 9.1. This is what I get for multi-tasking between writing
this and packaging an extension for 9.4.


In which version(s) of Windows was this improvement added? I think 
that's really the part that matters here, not necessarily which 
version of PostgreSQL.



--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Hi all!
I think that you have been able to debug crash dumps for as long as windbg
has existed.
Also, I think the number of Postgres installations on Windows is so tiny
because even today people think it is not as solid as the unix version;
that's why you think that nobody uses your code ;).


Today, if my memory serves me right, this code cannot deal with buffer
overflows. Am I right?
Maybe we need to add this functionality instead of dropping support for it
entirely?


--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Some questions about the array.

2015-12-22 Thread Yury Zhuravlev
A new patch version is attached.
Thanks. 


--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/doc/src/sgml/array.sgml b/doc/src/sgml/array.sgml
index 4385a09..305cabb 100644
--- a/doc/src/sgml/array.sgml
+++ b/doc/src/sgml/array.sgml
@@ -257,6 +257,26 @@ SELECT schedule[1:2][1:1] FROM sal_emp WHERE name = 'Bill';
 (1 row)
 
 
+  It is possible to omit the lower-bound or
+  upper-bound
+  of a slice specifier; the missing bound defaults to the corresponding lower or upper bound of the array's subscript range.
+
+
+SELECT schedule[:][:] FROM sal_emp WHERE name = 'Bill';
+
+schedule
+
+ {{meeting,lunch},{training,presentation}}
+(1 row)
+
+SELECT schedule[:2][2:] FROM sal_emp WHERE name = 'Bill';
+
+schedule
+
+ {{lunch},{presentation}}
+(1 row)
+
+
   If any dimension is written as a slice, i.e., contains a colon, then all
   dimensions are treated as slices.  Any dimension that has only a single
   number (no colon) is treated as being from 1
diff --git a/src/backend/executor/execQual.c b/src/backend/executor/execQual.c
index 29f058c..f586d57 100644
--- a/src/backend/executor/execQual.c
+++ b/src/backend/executor/execQual.c
@@ -268,10 +268,12 @@ ExecEvalArrayRef(ArrayRefExprState *astate,
 	bool		eisnull;
 	ListCell   *l;
 	int			i = 0,
-j = 0;
+j = 0,
+indexexpr;
 	IntArray	upper,
 lower;
 	int		   *lIndex;
+	AnyArrayType *arrays;
 
 	array_source = ExecEvalExpr(astate->refexpr,
 econtext,
@@ -293,6 +295,7 @@ ExecEvalArrayRef(ArrayRefExprState *astate,
 	foreach(l, astate->refupperindexpr)
 	{
 		ExprState  *eltstate = (ExprState *) lfirst(l);
+		eisnull = false;
 
 		if (i >= MAXDIM)
 			ereport(ERROR,
@@ -300,10 +303,29 @@ ExecEvalArrayRef(ArrayRefExprState *astate,
 	 errmsg("number of array dimensions (%d) exceeds the maximum allowed (%d)",
 			i + 1, MAXDIM)));
 
-		upper.indx[i++] = DatumGetInt32(ExecEvalExpr(eltstate,
-	 econtext,
-	 ,
-	 NULL));
+		if (eltstate == NULL)
+		{
+			if (isAssignment)
+ereport(ERROR,
+	(errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
+	 errmsg("cannot determine upper index for empty array")));
+
+			if (astate->refattrlength > 0)
+ereport(ERROR,
+	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+	 errmsg("slices of fixed-length arrays not implemented")));
+
+			arrays = (AnyArrayType *)DatumGetArrayTypeP(array_source);
+			indexexpr = AARR_LBOUND(arrays)[i] + AARR_DIMS(arrays)[i] - 1;
+		}
+		else
+			indexexpr = DatumGetInt32(ExecEvalExpr(eltstate,
+   econtext,
+   ,
+   NULL));
+
+		upper.indx[i++] = indexexpr;
+
 		/* If any index expr yields NULL, result is NULL or error */
 		if (eisnull)
 		{
@@ -321,6 +343,7 @@ ExecEvalArrayRef(ArrayRefExprState *astate,
 		foreach(l, astate->reflowerindexpr)
 		{
 			ExprState  *eltstate = (ExprState *) lfirst(l);
+			eisnull = false;
 
 			if (j >= MAXDIM)
 ereport(ERROR,
@@ -328,10 +351,19 @@ ExecEvalArrayRef(ArrayRefExprState *astate,
 		 errmsg("number of array dimensions (%d) exceeds the maximum allowed (%d)",
 j + 1, MAXDIM)));
 
-			lower.indx[j++] = DatumGetInt32(ExecEvalExpr(eltstate,
-		 econtext,
-		 ,
-		 NULL));
+			if (eltstate == NULL)
+			{
+arrays = (AnyArrayType *)DatumGetArrayTypeP(array_source);
+indexexpr = AARR_LBOUND(arrays)[j];
+			}
+			else
+indexexpr = DatumGetInt32(ExecEvalExpr(eltstate,
+	   econtext,
+	   ,
+	   NULL));
+
+			lower.indx[j++] = indexexpr;
+
 			/* If any index expr yields NULL, result is NULL or error */
 			if (eisnull)
 			{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 26264cb..fe848d5 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2417,6 +2417,7 @@ _copyAIndices(const A_Indices *from)
 
 	COPY_NODE_FIELD(lidx);
 	COPY_NODE_FIELD(uidx);
+	COPY_SCALAR_FIELD(is_slice);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index aa6e102..00f98ef 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2162,6 +2162,7 @@ _equalAIndices(const A_Indices *a, const A_Indices *b)
 {
 	COMPARE_NODE_FIELD(lidx);
 	COMPARE_NODE_FIELD(uidx);
+	COMPARE_SCALAR_FIELD(is_slice);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 012c14b..5c6326d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2773,6 +2773,7 @@ _outA_Indices(StringInfo str, const A_Indices *node)
 
 	WRITE_NODE_FIELD(lidx);
 	WRITE_NODE_FIELD(uidx);
+	WRITE_BOOL_FIELD(is_slice);
 }
 
 static void
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 7916df8..4afc572 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -13191,6 +13191,31 @@ indirection_el:
 	

Re: [HACKERS] custom function for converting human readable sizes to bytes

2015-12-22 Thread Pavel Stehule
Hi

2015-12-21 16:11 GMT+01:00 Robert Haas :

> On Sun, Dec 20, 2015 at 4:54 AM, Pavel Stehule 
> wrote:
> > new update:
> >
> > 1. unit searching is case insensitive
> >
> > 2. initial support for binary byte prefixes - KiB, MiB, ..  (IEC
> standard),
> > change behave for SI units
> >
> > Second point is much more complex then it is looking - if pg_size_bytes
> > should be consistent with pg_size_pretty.
> >
> > The current pg_size_pretty and transformations in guc.c are based on
> JEDEC
> > standard. Using this standard for GUC has sense - using it for object
> sizes
> > is probably unhappy.
> >
> > I tried to fix (and enhance) pg_size_pretty - now reports correct units,
> and
> > via second parameter it allows to specify base: 2 (binary, IEC  -
> default)
> > or 10 (SI).
>
> -1 from me.  I don't think we should muck with the way pg_size_pretty
> works.
>

new update - I reverted changes in pg_size_pretty

Regards

Pavel


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index 60b9a09..a73f37b
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*** postgres=# SELECT * FROM pg_xlogfile_nam
*** 17607,17612 
--- 17607,17615 
  pg_relation_size
 
 
+ pg_size_bytes
+
+
  pg_size_pretty
 
 
*** postgres=# SELECT * FROM pg_xlogfile_nam
*** 17677,17682 
--- 17680,17696 
 


+
+ pg_size_bytes(text)
+ 
+bigint
+
+  Converts a size given in a human-readable format with size units
+  into bytes.  The parameter is a case-insensitive string.  The following
+  units are supported: B, kB, MB, GB, TB, PB, KiB, MiB, GiB, TiB, PiB.
+
+   
+   
 
  pg_size_pretty(bigint)
  
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
new file mode 100644
index 5ee59d0..99f10dc
*** a/src/backend/utils/adt/dbsize.c
--- b/src/backend/utils/adt/dbsize.c
***
*** 31,36 
--- 31,63 
  #include "utils/relmapper.h"
  #include "utils/syscache.h"
  
+ #define MAX_UNIT_LEN		3
+ #define MAX_DIGITS		20
+ 
+ typedef struct
+ {
+ 	char		unit[MAX_UNIT_LEN + 1];
+ 	long int	multiplier;
+ } unit_multiplier;
+ 
+ static const unit_multiplier unit_multiplier_table[] =
+ {
+ 	{"kiB", 1024L},
+ 	{"MiB", 1024L * 1024},
+ 	{"GiB", 1024L * 1024 * 1024},
+ 	{"TiB", 1024L * 1024 * 1024 * 1024},
+ 	{"PiB", 1024L * 1024 * 1024 * 1024 * 1024},
+ 	{"B", 1L},
+ 	{"KB", 1000L},
+ 	{"MB", 1000L * 1000},
+ 	{"GB", 1000L * 1000 * 1000},
+ 	{"TB", 1000L * 1000 * 1000 * 1000},
+ 	{"PB", 1000L * 1000 * 1000 * 1000 * 1000},
+ 
+ 	{""}/* end of table marker */
+ };
+ 
+ 
  /* Divide by two and round towards positive infinity. */
  #define half_rounded(x)   (((x) + ((x) < 0 ? 0 : 1)) / 2)
  
*** pg_size_pretty(PG_FUNCTION_ARGS)
*** 559,566 
  else
  {
  	size >>= 10;
! 	snprintf(buf, sizeof(buf), INT64_FORMAT " TB",
! 			 half_rounded(size));
  }
  			}
  		}
--- 586,600 
  else
  {
  	size >>= 10;
! 	if (Abs(size) < limit2)
! 		snprintf(buf, sizeof(buf), INT64_FORMAT " TB",
!  half_rounded(size));
! 	else
! 	{
! 		size >>= 10;
! 		snprintf(buf, sizeof(buf), INT64_FORMAT " PB",
!  half_rounded(size));
! 	}
  }
  			}
  		}
*** pg_size_pretty_numeric(PG_FUNCTION_ARGS)
*** 689,696 
  {
  	/* size >>= 10 */
  	size = numeric_shift_right(size, 10);
! 	size = numeric_half_rounded(size);
! 	result = psprintf("%s TB", numeric_to_cstring(size));
  }
  			}
  		}
--- 723,741 
  {
  	/* size >>= 10 */
  	size = numeric_shift_right(size, 10);
! 
! 	if (numeric_is_less(numeric_absolute(size), limit2))
! 	{
! 		size = numeric_half_rounded(size);
! 		result = psprintf("%s TB", numeric_to_cstring(size));
! 	}
! 	else
! 	{
! 		/* size >>= 10 */
! 		size = numeric_shift_right(size, 10);
! 		size = numeric_half_rounded(size);
! 		result = psprintf("%s PB", numeric_to_cstring(size));
! 	}
  }
  			}
  		}
*** pg_size_pretty_numeric(PG_FUNCTION_ARGS)
*** 700,705 
--- 745,853 
  }
  
  /*
+  * Convert human readable size to long int.
+  *
+ 	 * cannot use parse_int due to its too-low limits there
+  */
+ Datum
+ pg_size_bytes(PG_FUNCTION_ARGS)
+ {
+ 	text	   *arg = PG_GETARG_TEXT_PP(0);
+ 	const char	   *cp = text_to_cstring(arg);
+ 	char		digits[MAX_DIGITS + 1];
+ 	int		ndigits = 0;
+ 	Numeric			num;
+ 	int64 result;
+ 
+ 	/* Skip leading spaces */
+ 	while (isspace((unsigned char) *cp))
+ 		cp++;
+ 
+ 	switch (*cp)
+ 	{
+ 		/* ignore plus symbol */
+ 		case '+':
+ 			cp++;
+ 			break;
+ 		case '-':
+ 			ereport(ERROR,
+ 	

Re: [HACKERS] pglogical_output - a general purpose logical decoding output plugin

2015-12-22 Thread Craig Ringer
On 22 December 2015 at 15:17, Michael Paquier 
wrote:

> On Mon, Dec 7, 2015 at 12:10 PM, Craig Ringer 
> wrote:
> > Removed, change pushed.
> >
> > Also pushed a change to expose the decoded row data to row filter hooks.
> >
> > I won't cut a v4 for this, as I'm working on merging the SGML-ified docs
> and
> > will do a v4 with that and the above readme change once that's done.
>
> Patch is moved to next CF, you seem to be still working on it.


Thanks.

Other than SGML-ified docs it's ready. The docs are a hard pre-req of
course. In any case most people appear to be waiting for the downstream
before looking at it at all, so bumping it makes sense.

I'm a touch frustrated by that, as a large part of the point of submitting
the output plugin separately and in advance of the downstream was to get
attention for it separately, as its own entity. A lot of effort has been
put into making this usable for more than just a data source for
pglogical's replication tools. In retrospect naming it pglogical_output was
probably unwise.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Experimental evaluation of PostgreSQL's query optimizer

2015-12-22 Thread Viktor Leis
On 12/22/2015 02:40 AM, Craig Ringer wrote:
> On 21 December 2015 at 23:57, Viktor Leis wrote:
>  
> 
> 
> Please have a look at Figure 6 (page 6) in
> http://www.vldb.org/pvldb/vol9/p204-leis.pdf Disabling nested loop
> joins without index scan (going from (a) to (b)) results in great
> improvements across the board. And even more importantly, it avoids
> most of the cases where queries took unreasonably long and timed
> out. Basically this amounts to being able to run the query on
> PostgreSQL or not.
> 
> 
> For that data, yes. But you're ignoring other important cases. Small or even 
> 1-element lookup tables can be one where a nestloop over a seqscan turns out 
> to be by far the fastest way to do the job.
> This can really add up if it's deep in a subplan that's excuted repeatedly, 
> or if it's part of queries that get run very frequently on a busy OLTP system.
Ok here's what I presume to be the extreme case: Joining a large table
with a 1-entry table.

create table r1 (a int not null);
create table r2 (b int not null);
insert into r1 select 1 from generate_series(1,100);
insert into r2 values (1);
analyze r1;
analyze r2;

set enable_mergejoin to off;
set enable_nestloop to on;
set enable_hashjoin to off;
explain select count(*) from r1, r2 where r1.a = r2.b;
\timing
select count(*) from r1, r2 where r1.a = r2.b;
\timing

set enable_nestloop to off;
set enable_hashjoin to on;
explain select count(*) from r1, r2 where r1.a = r2.b;
\timing
select count(*) from r1, r2 where r1.a = r2.b;
\timing

I get 128.894ms vs. 183.724ms, i.e., a 43% slowdown for the hash
join. However, let me stress that this is really the extreme case:

- If the join has few matches (due to inserting a value different from
1 into r2), hash and nested loop join have pretty much the same
performance.

- If you add just one more row to r2, the hash join is faster by a
similar margin.

- Also if there is disk IO or network involved, I suspect that you
will see no performance differences.

There are many difficult tradeoffs in any query optimizer, but I do
not think picking nested loops where a hash join can be used is one of
those. To me this seems more like a self-inflicted wound.

--
Viktor Leis




Re: [HACKERS] pgbench - allow backslash-continuations in custom scripts

2015-12-22 Thread Kyotaro HORIGUCHI
I'm terribly sorry to have overlooked Fabien's comment.

I'll rework this considering Fabien's previous comment and
Michael's comment here in the next CF.


At Tue, 22 Dec 2015 16:17:07 +0900, Michael Paquier  
wrote in 

Re: [HACKERS] Experimental evaluation of PostgreSQL's query optimizer

2015-12-22 Thread Pavel Stehule
2015-12-22 9:28 GMT+01:00 Viktor Leis :

> On 12/22/2015 02:40 AM, Craig Ringer wrote:
> > On 21 December 2015 at 23:57, Viktor Leis wrote:
> >
> >
> >
> > Please have a look at Figure 6 (page 6) in
> > http://www.vldb.org/pvldb/vol9/p204-leis.pdf Disabling nested loop
> > joins without index scan (going from (a) to (b)) results in great
> > improvements across the board. And even more importantly, it avoids
> > most of the cases where queries took unreasonably long and timed
> > out. Basically this amounts to being able to run the query on
> > PostgreSQL or not.
> >
> >
> > For that data, yes. But you're ignoring other important cases. Small or
> even 1-element lookup tables can be one where a nestloop over a seqscan
> turns out to be by far the fastest way to do the job.
> > This can really add up if it's deep in a subplan that's excuted
> repeatedly, or if it's part of queries that get run very frequently on a
> busy OLTP system.
> Ok here's what I presume to be the extreme case: Joining a large table
> with a 1-entry table.
>
> create table r1 (a int not null);
> create table r2 (b int not null);
> insert into r1 select 1 from generate_series(1,100);
> insert into r2 values (1);
> analyze r1;
> analyze r2;
>
> set enable_mergejoin to off;
> set enable_nestloop to on;
> set enable_hashjoin to off;
> explain select count(*) from r1, r2 where r1.a = r2.b;
> \timing
> select count(*) from r1, r2 where r1.a = r2.b;
> \timing
>
> set enable_nestloop to off;
> set enable_hashjoin to on;
> explain select count(*) from r1, r2 where r1.a = r2.b;
> \timing
> select count(*) from r1, r2 where r1.a = r2.b;
> \timing
>
> I get 128.894ms vs. 183.724ms, i.e., a 43% slowdown for the hash
> join. However, let me stress that this is really the extreme case:
>

> - If the join has few matches (due to inserting a value different from
> 1 into r2), hash and nested loop join have pretty much the same
> performance.
>
> - If you add just one more row to r2, the hash join is faster by a
> similar margin.
>
> - Also if there is disk IO or network involved, I suspect that you
> will see no performance differences.
>
> There are many difficult tradeoffs in any query optimizer, but I do
> not think picking nested loops where a hash join can be used is one of
> those. To me this seems more like a self-inflicted wound.
>

this is an oversimplification :( It is probably correct for OLAP, but wrong
for OLTP. The seq scan enforced by a hash join can be problematic.

Regards

Pavel


> --
> Viktor Leis
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] Some bugs in psql_complete of psql

2015-12-22 Thread Kyotaro HORIGUCHI
Hello, I returned to this since Thomas' psql-completion patch has
been committed.

This patch has been recreated on it.

"IN TABLESPACE xxx OWNED BY" has been alredy fixed so I removed it.
The attched files are the following,

1. 0001-Fix-tab-complete-of-CREATE-INDEX.patch

  Fixes completion for CREATE INDEX in ordinary way.

2. 0001-Fix-completion-for-CREATE-INDEX-type-2.patch

  An alternative for 1. Introduces multilevel matching. This
  effectively can avoid false matching, simplify each expression
  and reduce the number of matching operations.

3. 0002-Fix-tab-completion-for-DROP-INDEX.patch

  Fix of DROP INDEX completion in the type-2 way.

=
At Tue, 08 Dec 2015 17:56:19 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20151208.175619.245979824.horiguchi.kyot...@lab.ntt.co.jp>
> > I tested whether the following patterns work as expected or not.
> > 
> > CREATE UNIQUE INDEX CONCURRENTLY name ON
> > CREATE UNIQUE INDEX CONCURRENTLY ON
> > CREATE UNIQUE INDEX name ON
> > CREATE UNIQUE INDEX ON
> > CREATE INDEX CONCURRENTLY name ON
> > CREATE INDEX CONCURRENTLY ON
> > CREATE INDEX name ON
> > CREATE INDEX ON
> > 
> > Then I found the following problems.
> > 
> > "CREATE UNIQUE INDEX CONCURRENTLY " didn't suggest "ON".
> > "CREATE UNIQUE INDEX ON " suggested nothing.
> > "CREATE INDEX ON " suggested nothing.

All of the cases above are completed correctly. As a new feature,
this patch allows "IF NOT EXISTS". As a result, the following
part of the CREATE INDEX syntax is completed correctly.

CREATE [UNIQUE] INDEX [CONCURRENTLY] [[IF NOT EXISTS] iname] ON tname

> > "CREATE INDEX CONCURRENTLY " didn't suggest "ON".

Now it suggests "ON", "IF NOT EXISTS" and existing index names.

> > BTW, I found that tab-completion for DROP INDEX has the following problems.
> > 
> > "DROP INDEX " didn't suggest "CONCURRENTLY".
> > "DROP INDEX CONCURRENTLY name " suggested nothing.

They are completed by the "things" completion. The first patch
gives a wrong suggestion for DROP INDEX CONCURRENTLY with "IF NOT
EXISTS". This can be avoided by adding HeadMatches to every match
for the "CREATE INDEX..." syntaxes, but it is too bothersome.

Instead, in the second alternative patch, I enclosed the whole
"CREATE INDEX" stuff in an if(HeadMatches2("CREATE",
"UNIQUE|INDEX")) block to avoid the false match and simplify each
matching expression. As a side effect, the total number of
executions of matching expressions can be reduced. (This wouldn't
be just a bug fix, but rather a kind of refactoring.)

The third patch fixes the behavior for DROP INDEX in this way.

Thoughts? Suggestions?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 5488edb6ebdd253b1a10179ac119cc1588459791 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 22 Dec 2015 11:03:16 +0900
Subject: [PATCH] Fix tab-complete of CREATE INDEX

Completion for CREATE INDEX added CONCURRENTLY in the wrong place, and
it didn't suggest "IF NOT EXISTS". This patch corrects and adds them.
---
 src/bin/psql/tab-complete.c | 59 +
 1 file changed, 43 insertions(+), 16 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 1a7d184..9f557b3 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1922,36 +1922,63 @@ psql_completion(const char *text, int start, int end)
 	else if (TailMatches2("CREATE|UNIQUE", "INDEX"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexes,
    " UNION SELECT 'ON'"
+   " UNION SELECT 'IF NOT EXISTS'"
    " UNION SELECT 'CONCURRENTLY'");
-	/* Complete ... INDEX [] ON with a list of tables  */
-	else if (TailMatches3("INDEX", MatchAny, "ON") ||
-			 TailMatches2("INDEX|CONCURRENTLY", "ON"))
-		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tm, NULL);
-	/* If we have CREATE|UNIQUE INDEX  CONCURRENTLY, then add "ON" */
-	else if (TailMatches3("INDEX", MatchAny, "CONCURRENTLY") ||
-			 TailMatches2("INDEX", "CONCURRENTLY"))
+	/*
+	 * If we have CREATE [UNIQUE] INDEX [CONCURRENTLY], suggest existing
+	 * indexes or ON/IF NOT EXISTS.
+	 */
+	else if (TailMatches2("INDEX", "CONCURRENTLY"))
+		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexes,
+   " UNION SELECT 'ON'"
+   " UNION SELECT 'IF NOT EXISTS'");
+	/*
+	 * If we have CREATE [UNIQUE] INDEX [CONCURRENTLY] IF NOT EXISTS, suggest
+	 * existing indexes
+	 */
+	else if (TailMatches4("INDEX|CONCURRENTLY", "IF", "NOT", "EXISTS"))
+		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexes, NULL);
+	/*
+	 * If we have CREATE [UNIQUE] INDEX [CONCURRENTLY] [IF NOT EXISTS sth],
+	 * then add "ON"
+	 */
+	else if (TailMatches5("INDEX|CONCURRENTLY",
+		  "IF", "NOT", "EXISTS", MatchAny) ||
+			 TailMatches2("INDEX|CONCURRENTLY", "!ON"))
 		COMPLETE_WITH_CONST("ON");
-	/* If we have CREATE|UNIQUE INDEX , then add "ON" or "CONCURRENTLY" */
-	else if 

Re: [HACKERS] A Typo in regress/sql/privileges.sql

2015-12-22 Thread Peter Geoghegan
On Fri, Dec 18, 2015 at 12:52 PM, Robert Haas  wrote:
> Mind you, I don't think "inference specification" is very good
> terminology, but what's there right now is just wrong.

It doesn't appear in the documentation. The term "inference
specification" only appears where it's necessary to precisely describe
the input to unique index inference.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A typo in syncrep.c

2015-12-22 Thread Kyotaro HORIGUCHI
Thank you Robert and sorry for bothering you with a silly question!

I understand what I did clearly thanks to your attentive indication.

At Mon, 21 Dec 2015 07:50:40 -0500, Robert Haas  wrote 
in 
> >> >> + * lead the client to believe that the transaction is aborted, which
> >> No, that's correct the way it is.  What you're proposing wouldn't
> >> exactly be wrong, but it's a little less clear and direct.
> >
> > Hmm. I thought they are equal in meaning and make clearer, but I
> > understand they have such difference. Thank you for correcting
> > it.
> 
> The difference is that if you say "the transaction aborted" you mean
> that the transaction did something - specifically, it aborted.  If you
> say that "the transaction is aborted" you are talking about the state
> in which the transaction ended up, without really saying how it got
> that way.

What I made here was a mistake about the word class of the
"transaction", caused by somehow omitting the "that" in the original
sentence. It is not the objective case, as it would be if the "that"
were omitted, but the subjective case there. Then the "aborted" is
not an objective complement but the past tense. The "that" had
returned in my mind before I knew it but, after all, adding "is"
there utterly changes the meaning, as you pointed out.

>  In this case we are talking about whether the client might
> think that the transaction did something (aborted), not what the
> client might think about the state we ended up in (aborted), so the
> current wording seems better to me.

I understand that you're completely right. Sorry for my silly
mistake.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types

2015-12-22 Thread Pavel Stehule
Hi

2015-12-21 16:21 GMT+01:00 Artur Zakirov :

> Hi.
> I have tried to do some review of this patch. Below are my comments.
>
> Introduction:
>
> This patch fixes and adds the following functionality:
> - %TYPE - now can be used for composite types.
> - %ARRAYTYPE - new functionality, provides the array type from a variable
> or table column.
> - %ELEMENTTYPE - new functionality, provides the element type of a given
> array.
>
> New regression tests are included in the patch. Changes to the
> documentation are not provided.
>
> Initial Run:
>
> The patch applies correctly to HEAD. Regression tests pass successfully,
> without errors. It seems that the patch works as you expected.
>
> Performance:
>
> It seems the patch has no possible performance issues for the existing
> functionality.
>
> Coding:
>
> The style looks fine. I attached a patch that does some corrections in the
> code and documentation. I have corrected the indentation in pl_comp.c and in
> the "read_datatype" function in pl_gram.y. I think it would be better if the
> changes in the "read_datatype" function avoided code duplication. But I could
> be wrong, of course.
>

I merged Artur's patch and appended examples to the docs.



>
> Conclusion:
>
> The patch could be applied on master with documentation corrections. But
> I'm not sure that your task can be resolved only by adding %ARRAYTYPE and
> %ELEMENTTYPE. Maybe you could give some examples?
>

It fixes the most missed/known features related to this part of plpgsql
that I found. But any ideas for this or follow-up patches are welcome.
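
For the record, a minimal sketch of the composite %TYPE case this is
about (type and table names are made up):

CREATE TYPE complex AS (r double precision, i double precision);
CREATE TABLE samples (id int, val complex);

CREATE OR REPLACE FUNCTION first_val() RETURNS complex AS $$
DECLARE
    -- previously rejected, because samples.val is of a composite type
    v samples.val%TYPE;
BEGIN
    SELECT val INTO v FROM samples ORDER BY id LIMIT 1;
    RETURN v;
END;
$$ LANGUAGE plpgsql;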

Regards

Pavel



>
> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
>
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
new file mode 100644
index 9786242..140c81f
*** a/doc/src/sgml/plpgsql.sgml
--- b/doc/src/sgml/plpgsql.sgml
*** SELECT merge_fields(t.*) FROM table1 t W
*** 710,715 
--- 710,792 
 

  
+   
+Array Types
+ 
+ 
+ variable%ARRAYTYPE
+ 
+ 
+
+ %ARRAYTYPE provides the array type from a variable or
+ table column. %ARRAYTYPE is particularly valuable in
+ polymorphic functions, since the data types needed for internal variables can
+ change from one call to the next.  Appropriate variables can be
+ created by applying %ARRAYTYPE to the function's
+ arguments or result placeholders:
+ 
+ CREATE OR REPLACE FUNCTION array_init(v anyelement, size integer)
+ RETURNS anyarray AS $$
+ DECLARE
+ result v%ARRAYTYPE DEFAULT '{}';
+ BEGIN
+ -- prefer builtin function array_fill
+ FOR i IN 1 .. size
+ LOOP
+ result := result || v;
+ END LOOP;
+ RETURN result;
+ END;
+ $$ LANGUAGE plpgsql;
+ 
+ SELECT array_init(0::numeric, 10);
+ SELECT array_init(''::varchar, 10);
+ 
+
+   
+ 
+   
+Array Element Types
+ 
+ 
+ variable%ELEMENTTYPE
+ 
+ 
+
+ %ELEMENTTYPE provides the element type of a given
+ array.  %ELEMENTTYPE is particularly valuable in polymorphic
+ functions, since the data types needed for internal variables can
+ change from one call to the next.  Appropriate variables can be
+ created by applying %ELEMENTTYPE to the function's
+ arguments or result placeholders:
+ 
+ CREATE OR REPLACE FUNCTION bubble_sort(a anyarray)
+ RETURNS anyarray AS $$
+ DECLARE
+ aux a%ELEMENTTYPE;
+ repeat_again boolean DEFAULT true;
+ BEGIN
+ -- Don't use this code for large arrays!
+ -- use builtin sort
+ WHILE repeat_again
+ LOOP
+ repeat_again := false;
+ FOR i IN array_lower(a, 1) .. array_upper(a, 1)
+ LOOP
+ IF a[i] > a[i+1] THEN
+ aux := a[i+1];
+ a[i+1] := a[i]; a[i] := aux;
+ repeat_again := true;
+ END IF;
+ END LOOP;
+ END LOOP;
+ RETURN a;
+ END;
+ $$ LANGUAGE plpgsql;
+ 
+
+   
+ 

 Collation of PL/pgSQL Variables
  
diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
new file mode 100644
index 1ae4bb7..c819517
*** a/src/pl/plpgsql/src/pl_comp.c
--- b/src/pl/plpgsql/src/pl_comp.c
*** plpgsql_parse_tripword(char *word1, char
*** 1617,1622 
--- 1617,1677 
  	return false;
  }
  
+ /*
+  * Derive a type from any base type, controlled by reftype_mode
+  */
+ static PLpgSQL_type *
+ derive_type(PLpgSQL_type *base_type, int reftype_mode)
+ {
+ 	Oid typoid;
+ 
+ 	switch (reftype_mode)
+ 	{
+ 		case PLPGSQL_REFTYPE_TYPE:
+ 			return base_type;
+ 
+ 		case PLPGSQL_REFTYPE_ELEMENT:
+ 		{
+ 			typoid = get_element_type(base_type->typoid);
+ 			if (!OidIsValid(typoid))
+ ereport(ERROR,
+ 		(errcode(ERRCODE_DATATYPE_MISMATCH),
+ 		 errmsg("referenced variable should be an array, not type %s",
+ format_type_be(base_type->typoid))));
+ 
+ 			return plpgsql_build_datatype(typoid, -1,
+ 			plpgsql_curr_compile->fn_input_collation);
+ 		}
+ 
+ 		case PLPGSQL_REFTYPE_ARRAY:
+ 		{
+ 		

Re: [HACKERS] parallel joins, and better parallel explain

2015-12-22 Thread Dilip Kumar
On Fri, Dec 18, 2015 at 8:47 PM Robert Wrote,

>> Yes, you are right, that create_gather_path() sets parallel_safe to false
>> unconditionally but whenever we are building a non partial path, that time
>> we should carry forward the parallel_safe state to its parent, and it seems
>> like that part is missing here..

>Ah, right.  Woops.  I can't exactly replicate your results, but I've
>attempted to fix this in a systematic way in the new version attached
>here (parallel-join-v3.patch).

I have tested with the latest patch, and the problem is solved.

During my testing I observed one more behaviour in the hash join, where
the parallel hash join takes more time compared to the normal hash join.

Here I have ensured that, apart from the hash column, there is one more
condition on another column which forces random page fetches.

I think this behaviour is similar to what Amit described in the thread above:
http://www.postgresql.org/message-id/caa4ek1+s3ud2g1wskeaw_fzgp8jeyw3ycnvtueplihe_e1d...@mail.gmail.com

create table t1 (c1 int, c2 int, c3 text);

create table t2 (c1 int, c2 int, c3 text);

 insert into t1 values(generate_series(1,1000),
generate_series(1,1000), repeat('x', 1000));

insert into t2 values(generate_series(1,300),
generate_series(1,300), repeat('x', 5));
analyze t1;
analyze t2;
set max_parallel_degree=6;
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1
AND t2.c2 + t1.c1 > 100;
 QUERY PLAN


 Aggregate  (cost=474378.39..474378.40 rows=1 width=0) (actual
time=34507.573..34507.573 rows=1 loops=1)
   ->  Gather  (cost=96436.00..471878.39 rows=100 width=0) (actual
time=2004.186..33918.216 rows=250 loops=1)
 Number of Workers: 6
 ->  Hash Join  (cost=95436.00..370878.39 rows=100 width=0)
(actual time=2077.085..18651.868 rows=428564 loops=7)
   Hash Cond: (t1.c1 = t2.c1)
   Join Filter: ((t2.c2 + t1.c1) > 100)
   Rows Removed by Join Filter: 7
   ->  Parallel Seq Scan on t1  (cost=0.00..235164.93
rows=1538462 width=4) (actual time=0.741..13199.231 rows=1428571 loops=7)
   ->  Hash  (cost=46217.00..46217.00 rows=300 width=8)
(actual time=2070.827..2070.827 rows=300 loops=7)
 Buckets: 131072  Batches: 64  Memory Usage: 2861kB
 ->  Seq Scan on t2  (cost=0.00..46217.00 rows=300
width=8) (actual time=0.027..904.607 rows=300 loops=7)
 Planning time: 0.292 ms
 Execution time: 34507.857 ms
(13 rows)

postgres=# set max_parallel_degree=0;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1
AND t2.c2 + t1.c1 > 100;
   QUERY PLAN


 Aggregate  (cost=1823853.06..1823853.07 rows=1 width=0) (actual
time=17833.067..17833.067 rows=1 loops=1)
   ->  Hash Join  (cost=95436.00..1821353.06 rows=100 width=0) (actual
time=1286.788..17558.987 rows=250 loops=1)
 Hash Cond: (t1.c1 = t2.c1)
 Join Filter: ((t2.c2 + t1.c1) > 100)
 Rows Removed by Join Filter: 50
 ->  Seq Scan on t1  (cost=0.00..1528572.04 rows=1004 width=4)
(actual time=2.728..9881.659 rows=1000 loops=1)
 ->  Hash  (cost=46217.00..46217.00 rows=300 width=8) (actual
time=1279.688..1279.688 rows=300 loops=1)
   Buckets: 131072  Batches: 64  Memory Usage: 2861kB
   ->  Seq Scan on t2  (cost=0.00..46217.00 rows=300
width=8) (actual time=0.029..588.887 rows=300 loops=1)
 Planning time: 0.314 ms
 Execution time: 17833.143 ms
(11 rows)


Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Fri, Dec 18, 2015 at 8:47 PM, Robert Haas  wrote:

> On Fri, Dec 18, 2015 at 3:54 AM, Dilip Kumar 
> wrote:
> > On Fri, Dec 18, 2015 at 7.59 AM Robert Haas 
> Wrote,
> >> Uh oh.  That's not supposed to happen.  A GatherPath is supposed to
> >> have parallel_safe = false, which should prevent the planner from
> >> using it to form new partial paths.  Is this with the latest version
> >> of the patch?  The plan output suggests that we're somehow reaching
> >> try_partial_hashjoin_path() with inner_path being a GatherPath, but I
> >> don't immediately see how that's possible, because
> >> create_gather_path() sets parallel_safe to false unconditionally, and
> >> hash_inner_and_outer() never sets cheapest_safe_inner to a path unless
> >> that path is parallel_safe.
> >
> > Yes, you are right, that create_gather_path() sets parallel_safe to false
> > unconditionally but whenever we are building a non partial path, that
> time
> > we should 

Re: [HACKERS] Mention column name in error messages

2015-12-22 Thread Michael Paquier
On Sun, Oct 4, 2015 at 12:23 AM, Andres Freund  wrote:
> On 2015-09-15 12:00:25 +0200, Franck Verrot wrote:
>> diff --git a/src/backend/parser/parse_target.c 
>> b/src/backend/parser/parse_target.c
>> index 1b3fcd6..78f82cd 100644
>> --- a/src/backend/parser/parse_target.c
>> +++ b/src/backend/parser/parse_target.c
>> @@ -389,6 +389,17 @@ markTargetListOrigin(ParseState *pstate, TargetEntry 
>> *tle,
>>   * omits the column name list.  So we should usually prefer to use
>>   * exprLocation(expr) for errors that can happen in a default INSERT.
>>   */
>> +
>> +void
>> +TransformExprCallback(void *arg)
>> +{
>> + TransformExprState *state = (TransformExprState *) arg;
>> +
>> + errcontext("coercion failed for column \"%s\" of type %s",
>> + state->column_name,
>> + format_type_be(state->expected_type));
>> +}
>
> Why is this not a static function?
>
>>  Expr *
>>  transformAssignedExpr(ParseState *pstate,
>> Expr *expr,
>> @@ -405,6 +416,14 @@ transformAssignedExpr(ParseState *pstate,
>>   Oid attrcollation;  /* collation of target column 
>> */
>>   ParseExprKind sv_expr_kind;
>>
>> + ErrorContextCallback errcallback;
>> + TransformExprState testate;
>
> Why the newline between the declarations? Why none afterwards?
>
>> diff --git a/src/include/parser/parse_target.h 
>> b/src/include/parser/parse_target.h
>> index a86b79d..5e12c12 100644
>> --- a/src/include/parser/parse_target.h
>> +++ b/src/include/parser/parse_target.h
>> @@ -42,4 +42,11 @@ extern TupleDesc expandRecordVariable(ParseState *pstate, 
>> Var *var,
>>  extern char *FigureColname(Node *node);
>>  extern char *FigureIndexColname(Node *node);
>>
>> +/* Support for TransformExprCallback */
>> +typedef struct TransformExprState
>> +{
>> + const char *column_name;
>> + Oid expected_type;
>> +} TransformExprState;
>
> I see no need for this to be a struct defined in the header. Given that
> TransformExprCallback isn't public, and the whole thing is specific to
> TransformExprState...
>
>
> My major complaint though, is that this doesn't contain any tests...
>
>
> Could you update the patch to address these issues?

Patch is marked as returned with feedback, it has been two months
since the last review, and comments have not been addressed.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor code improvements to create_foreignscan_plan/ExecInitForeignScan

2015-12-22 Thread Michael Paquier
On Wed, Sep 9, 2015 at 7:08 PM, Etsuro Fujita
 wrote:
> On 2015/07/10 18:40, Etsuro Fujita wrote:
>> To save cycles, I modified create_foreignscan_plan so that it detects
>> whether any system columns are requested if scanning a base relation.
>> Also, I revised other code there a little bit.
>
> Attached is an updated version of the patch.  The previous version
> contained changes to ExecInitForeignScan, but I've dropped that part, as
> discussed before.

Moved to next CF because of a lack of reviews.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-12-22 Thread Michael Paquier
On Thu, Dec 17, 2015 at 4:27 AM, Fabien COELHO  wrote:
>
> Hello Tomas,
>
>> I'm planning to do some thorough benchmarking of the patches proposed in
>> this thread, on various types of hardware (10k SAS drives and SSDs). But is
>> that actually needed? I see Andres did some testing, as he posted summary of
>> the results on 11/12, but I don't see any actual results or even info about
>> what benchmarks were done (pgbench?).
>>
>> If yes, do we only want to compare 0001-ckpt-14-andres.patch against
>> master, or do we need to test one of the previous Fabien's patches?
>
>
> My 0.02€,
>
> Although I disagree with some aspects of Andres patch, I'm not a committer
> and I'm tired of arguing. I'm just planning to make minor changes to Andres's
> version to fix a potential issue if the file is closed while flushing is in
> progress, but that will not change the overall shape of it.
>
> So testing on Andres version seems relevant to me.
>
> For SSD the performance impact should be limited. For disk it should be
> significant if there is no big cache in front of it. There were some
> concerns raised for some loads in the thread (shared memory smaller than
> needed I think?), if you can include such cases that would be great. My
> guess is that it should not be very beneficial in this case because the
> writing is mostly done by the bgwriter & workers in this case, and these
> writes are still random.

As there are still plans to move on regarding tests (and because this
patch makes a difference), this is moved to next CF.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-12-22 Thread Michael Paquier
On Sun, Dec 20, 2015 at 6:49 AM, Pavel Stehule  wrote:
> the attached patch allows aligning to center.
>
> everywhere that left/right alignment was allowed, center alignment is now allowed

Moved to next CF, there is a fresh and new patch.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some questions about the array.

2015-12-22 Thread Uriy Zhuravlev

On Monday, 21 December 2015 at 20:28:43 MSK, Tom Lane wrote:

With is_slice false, the only valid
case is lidx==NULL, uidx!=NULL, as before for non-slice notation.

But now it becomes valid syntax:
select ('{1,2,3,4}'::int[])[NULL:NULL];
I do not think it's logical, especially if a function is used inside [:].
The behavior is unexpected.

Thanks. 
--

Uriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On columnar storage (2)

2015-12-22 Thread Michael Paquier
On Wed, Dec 9, 2015 at 3:10 PM, Jeff Janes  wrote:
> Could we get this rebased past the merge of the parallel execution commits?

+1. Alvaro, Tomas, Simon, what are the next plans with those patches?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Experimental evaluation of PostgreSQL's query optimizer

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 3:28 AM, Viktor Leis  wrote:
> I get 128.894ms vs. 183.724ms, i.e., a 43% slowdown for the hash
> join. However, let me stress that this is really the extreme case:
>
> - If the join has few matches (due to inserting a value different from
> 1 into r2), hash and nested loop join have pretty much the same
> performance.
>
> - If you add just one more row to r2, the hash join is faster by a
> similar margin.
>
> - Also if there is disk IO or network involved, I suspect that you
> will see no performance differences.
>
> There are many difficult tradeoffs in any query optimizer, but I do
> not think picking nested loops where a hash join can be used is one of
> those. To me this seems more like a self-inflicted wound.

Well, that's a fair perspective.  I don't think there's any debate
about the fact that we sometimes pick nested loops when we shouldn't,
or that that this can be very painful.  Figuring out exactly what to
do about that at the code level is harder.

From my point of view, one interesting fact about database
optimization is that the numbers 0 and 1 are phenomenally important
special cases.  It is often the case that a join will return at most 1
row per outer row, or that an aggregate will generate exactly 1 group,
or whatever.  And the code is littered with special cases - including
Nested Loop - that cater to making such cases fast.  Those cases arise
frequently because people engineer their data so that they occur
frequently.
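
For illustration only, a sketch with made-up tables of the "at most 1
row per outer row" shape -- a foreign key pointing at a primary key
guarantees it:

CREATE TABLE products (id int PRIMARY KEY, name text);
CREATE TABLE line_items (order_id int,
                         product_id int REFERENCES products (id));

-- each line_items row matches at most one products row, because
-- products.id is unique; the planner's 1-row special cases are aimed
-- at exactly this kind of join
SELECT li.order_id, p.name
  FROM line_items li JOIN products p ON p.id = li.product_id;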

If we could bias the planner against picking nested loops in cases
where they will figure to win only a little but might conceivably lose
a lot, that would probably be a good idea.  But it's not obvious
exactly how to figure that out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Experimental evaluation of PostgreSQL's query optimizer

2015-12-22 Thread Tom Lane
Robert Haas  writes:
> From my point of view, one interesting fact about database
> optimization is that the numbers 0 and 1 are phenomenally important
> special cases.

Yeah.

> It is often the case that a join will return at most 1
> row per outer row, or that an aggregate will generate exactly 1 group,
> or whatever.  And the code is littered with special cases - including
> Nested Loop - that cater to making such cases fast.  Those cases arise
> frequently because people engineer their data so that they occur
> frequently.

> If we could bias the planner against picking nested loops in cases
> where they will figure to win only a little but might conceivably lose
> a lot, that would probably be a good idea.  But it's not obvious
> exactly how to figure that out.

There was discussion awhile ago of trying to teach the planner to generate
rowcount estimates of 0 or 1 row only in cases where that was provably the
case, eg because the query selects on a unique key.  In any other
situation, rowcount estimates would be clamped to a minimum of 2 rows.
This should by itself eliminate the worst abuses of nestloop plans, since
the planner would always assume that the outer scan contains at least 2
rows unless it's known not to.  Still, there might be a lot of other less
pleasant side effects; it's hard to tell in advance of actually doing the
work.
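
A sketch of the two situations, with a made-up table: in the first
query the 1-row estimate is provable from the unique key, in the second
it is only a guess, which is where the clamp to 2 would kick in.

CREATE TABLE events (id int PRIMARY KEY, payload text, flag int);

-- provable: a primary-key equality can match at most one row, so an
-- estimate of 1 could legitimately be kept
EXPLAIN SELECT * FROM events WHERE id = 42;

-- not provable: the estimate may still come out as 1 row when the
-- conditions merely look very selective, and a nestloop built on top
-- of such a guess is what tends to hurt
EXPLAIN SELECT * FROM events WHERE payload = 'rare' AND flag = 7;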

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some questions about the array.

2015-12-22 Thread Tom Lane
Yury Zhuravlev  writes:
> New patch version in attachment.  

It's still awfully short on comments, but I'll see what I can do with it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Possible marginally-incompatible change to array subscripting

2015-12-22 Thread Tom Lane
I'm reviewing Yury Zhuravlev's patch to allow array slice boundaries to be
omitted, for example "a[4:]" means "the slice extending from element 4 to
the last element of a".  It strikes me that there's an improvement we
could easily make for the case where a mixture of slice and non-slice
syntax appears, that is something like "a[3:4][5]".  Now, this has always
meant a slice, and the way we've traditionally managed that is to treat
simple subscripts as being the range upper bound with a lower bound of 1;
that is, what this example means is exactly "a[3:4][1:5]".

ISTM that if we'd had Yury's code in there from the beginning, what we
would define this as meaning is "a[3:4][:5]", ie the implied range runs
from whatever the array lower bound is up to the specified subscript.

This would make no difference of course for the common case where the
array lower bound is 1, but it seems a lot less arbitrary when it isn't.
So I think we should strongly consider changing it to mean that, even
though it would be non-backwards-compatible in such cases.
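
A concrete sketch of the case in question, using an array literal with
a non-default lower bound (the comments spell out the two readings):

-- second dimension declared to run from 0 to 2 rather than 1 to 3
SELECT ('[3:4][0:2]={{1,2,3},{4,5,6}}'::int[])[3:4][2];

-- historical reading:  a[3:4][1:2]   (implied lower bound is literally 1)
-- proposed reading:    a[3:4][:2], i.e. a[3:4][0:2]
--                      (implied lower bound is the array's own lower bound)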

Comments?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Maintenance on git.postgresql.org

2015-12-22 Thread Magnus Hagander
I'm going to run some upgrade tasks on git.postgresql.org over the next
hour or two. Thus it's likely to drop off the net a few times and come
back, and things might work on and off.

Anything pushed should be safe there normally, but it might be inaccessible
for some periods of time, giving strange error messages :)

Also, expect the web management interface to be offline a bit longer than
the actual git server.

This does *not* affect gitmaster, just git.postgresql.org.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] On columnar storage (2)

2015-12-22 Thread Alvaro Herrera
Michael Paquier wrote:
> On Wed, Dec 9, 2015 at 3:10 PM, Jeff Janes  wrote:
> > Could we get this rebased past the merge of the parallel execution commits?
> 
> +1. Alvaro, Tomas, Simon, what are the next plans with those patches?

Yeah, I've been working intermittently on getting the whole tree rebased
and squashed, because after the last submission we made a lot of
progress.  I'll repost later.  I think it should be marked "returned
with feedback" for now.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove Windows crash dump support?

2015-12-22 Thread Craig Ringer
On 22 December 2015 at 22:50, Craig Ringer  wrote:

> Hi all
>
> Back in 2010 I submitted a small feature to allow the creation of
> minidumps when backends crashed; see
> commit dcb09b595f88a3bca6097a6acc17bf2ec935d55f .
>
> At the time Windows lacked useful support for postmortem debugging and
> crash-dump management in the operating system its self, especially for
> applications running as services. That has since improved considerably.
>
> The feature was also included in 9.4
>

Ahem. 9.1. This is what I get for multi-tasking between writing this and
packaging an extension for 9.4.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Review: GiST support for UUIDs

2015-12-22 Thread Michael Paquier
On Tue, Sep 15, 2015 at 3:03 PM, Teodor Sigaev  wrote:
>
>
> Paul Jungwirth wrote:
>>>
>>> Or something like this in pseudocode:
>>>
>>> numeric = int8_numeric(*(uint64 *)(>data[0])) *
>>> int8_numeric(MAX_INT64) + int8_numeric(*(uint64 *)(>data[8]))
>>
>>
>> This is more like what I was hoping for, rather than converting to a
>> string and back. Would you mind confirming for me: int8_numeric turns an
>> int64 into an arbitrary-precision numeric Datum? So there is no overflow
>> risk here?
>
> Sure, no risk. Numeric precision is limited to 1000 digits with magnitude 1000
>
>
>>
>> But it looks like int8_numeric takes a *signed* integer. Isn't that a
>> problem? I suppose I could get it working though by jumping through some
>> hoops.
>
> The signed vs unsigned problem does not actually exist, because the precision of
> numeric is much better than we need, and numeric_abs is available.
>
>
>> Yes, but that seems like an unrealistic concern. Even "only" 2^64 is
>> 18446744073709551616 records. Or another way of thinking about it, if 64
>
> :) "only"
>
>> bits (or 32) is a good enough penalty input for all the other data
>> types, why not for UUIDs? Keep in mind type 3-5 UUIDs should be evenly
>> distributed. Perhaps we could use the bottom half (instead of the top)
>> to ensure even distribution for type 1 and 2 too.
>
> It must be. But a UUID could come from an unknown source and we can't predict
> the distribution. I believe pg generates them correctly, but other generators
> might not be so good.
>
>> It seems to me that using only the top half should be okay, but if you
>> believe it's not I'll go with the int8_numeric approach (in three chunks
>> instead of two to work around signed-vs-unsigned).
>
> Yes, I believe. It is not a good situation when we can ruin index performance
> with a special set of values.
>
> One difficulty which I see is how to transform the numeric penalty to a double,
> as GiST requires. Maybe penalty/(INT_MAX64*INT_MAX64 + INT_MAX64)?
> Keep in mind that the penalty is how much the range is enlarged after the range union,
> so minimal penalty is 0 and maximum is 0x

This patch has been marked as "returned with feedback" for CF 2015-11
because of the presence of a review and a certain lack of activity.
Please feel free to move it to the next CF if you think that's more appropriate.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system

2015-12-22 Thread Noah Misch
On Mon, Feb 18, 2013 at 06:19:12PM -0300, Alvaro Herrera wrote:
> I have pushed it now.  Further testing, of course, is always welcome.

While investigating stats.sql buildfarm failures, mostly on animals axolotl
and shearwater, I found that this patch (commit 187492b) inadvertently removed
the collector's ability to coalesce inquiries.  Every PGSTAT_MTYPE_INQUIRY
received now causes one stats file write.  Before, pgstat_recv_inquiry() did:

if (msg->inquiry_time > last_statrequest)
last_statrequest = msg->inquiry_time;

and pgstat_write_statsfile() did:

globalStats.stats_timestamp = GetCurrentTimestamp();
... (work of writing a stats file) ...
last_statwrite = globalStats.stats_timestamp;
last_statrequest = last_statwrite;

If the collector entered pgstat_write_statsfile() with more inquiries waiting
in its socket receive buffer, it would ignore them as being too old once it
finished the write and resumed message processing.  Commit 187492b converted
last_statrequest to a "last_statrequests" list that we wipe after each write.

I modeled a machine with slow stats writes using the attached diagnostic patch
(not for commit).  It has pgstat_write_statsfiles() sleep just before renaming
the temporary file, and it logs each stats message received.  A three second
delay causes stats.sql to fail the way it did on shearwater[1] and on
axolotl[2].  Inquiries accumulate until the socket receive buffer overflows,
at which point the socket drops stats messages whose effects we later test
for.  The 3s delay makes us drop some 85% of installcheck stats messages.
Logs from a single VACUUM show receipt of five inquiries ("stats 1:") with 3s
between them:

24239 2015-12-10 04:21:03.439 GMT LOG:  connection authorized: user=nmisch 
database=test
24236 2015-12-10 04:21:03.454 GMT LOG:  stats 2: 1888 + 936 = 2824
24236 2015-12-10 04:21:03.454 GMT LOG:  stats 2: 2824 + 376 = 3200
24236 2015-12-10 04:21:03.454 GMT LOG:  stats 2: 3200 + 824 = 4024
24239 2015-12-10 04:21:03.455 GMT LOG:  statement: vacuum pg_class
24236 2015-12-10 04:21:03.455 GMT LOG:  stats 1: 4024 + 32 = 4056
24236 2015-12-10 04:21:06.458 GMT LOG:  stats 12: 4056 + 88 = 4144
24236 2015-12-10 04:21:06.458 GMT LOG:  stats 1: 4144 + 32 = 4176
24239 2015-12-10 04:21:06.463 GMT LOG:  disconnection: session time: 
0:00:03.025 user=nmisch database=test host=[local]
24236 2015-12-10 04:21:09.486 GMT LOG:  stats 1: 4176 + 32 = 4208
24236 2015-12-10 04:21:12.503 GMT LOG:  stats 1: 4208 + 32 = 4240
24236 2015-12-10 04:21:15.519 GMT LOG:  stats 1: 4240 + 32 = 4272
24236 2015-12-10 04:21:18.535 GMT LOG:  stats 9: 4272 + 48 = 4320
24236 2015-12-10 04:21:18.535 GMT LOG:  stats 2: 4320 + 936 = 5256
24236 2015-12-10 04:21:18.535 GMT LOG:  stats 2: 5256 + 376 = 5632
24236 2015-12-10 04:21:18.535 GMT LOG:  stats 2: 5632 + 264 = 5896
24236 2015-12-10 04:21:18.535 GMT LOG:  stats 12: 5896 + 88 = 5984


As for how to fix this, the most direct translation of the old logic is to
retain last_statrequests entries that could help coalesce inquiries.  I lean
toward that for an initial, back-patched fix.  It would be good, though, to
process two closely-spaced, different-database inquiries in one
pgstat_write_statsfiles() call.  We do one-database writes and all-databases
writes, but we never write "1 < N < all" databases despite the code prepared
to do so.  I tried calling pgstat_write_statsfiles() only when the socket
receive buffer empties.  That's dead simple to implement and aggressively
coalesces inquiries (even a 45s sleep did not break stats.sql), but it starves
inquirers if the socket receive buffer stays full persistently.  Ideally, I'd
want to process inquiries when the buffer empties _or_ when the oldest inquiry
is X seconds old.  I don't have a more-specific design in mind, though.

Thanks,
nm

[1] 
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shearwater=2015-09-23%2002%3A08%3A31
[2] 
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl=2015-08-04%2019%3A31%3A22
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..e6f04b0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3314,6 +3314,7 @@ pgstat_send_bgwriter(void)
 NON_EXEC_STATIC void
 PgstatCollectorMain(int argc, char *argv[])
 {
+   unsigned total = 0;
int len;
PgStat_Msg  msg;
int wr;
@@ -3425,6 +3426,10 @@ PgstatCollectorMain(int argc, char *argv[])
 errmsg("could not read 
statistics message: %m")));
}
 
+   elog(LOG, "stats %d: %u + %u = %u",
+msg.msg_hdr.m_type, total, len, total + len);
+   total += len;
+
/*
 * We ignore messages that are smaller than our common 
header

Re: [HACKERS] [Proposal] Table partition + join pushdown

2015-12-22 Thread Michael Paquier
On Fri, Nov 20, 2015 at 9:05 PM, Taiki Kondo  wrote:
> I created v3 patch for this feature, and v1 patch for regression tests.
> Please find attached.
>
> [blah review and replies]
>
> Please find from attached patch.

This new patch did not actually get a review, moved to next CF.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Commit fest status for 2015-11

2015-12-22 Thread Michael Paquier
On Tue, Dec 22, 2015 at 4:04 PM, Michael Paquier
 wrote:
> OK, if those don't get addressed soon, they will be moved to the next
> CF with the same status.

After a first pass, the numbers are coming down; I am just too tired to continue:
Needs review: 13.
Waiting on Author: 15.
Ready for Committer: 5

Two additional patches have been marked as ready for committer:
1) Bug fix for pg_dump --jobs when a password is passed as a parameter in
a connection string:
https://commitfest.postgresql.org/7/405/
2) Default roles, in Stephen's plate:
https://commitfest.postgresql.org/7/49/
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-22 Thread Robert Haas
On Mon, Dec 21, 2015 at 7:51 AM, Fabien COELHO  wrote:
> I think that the 1.5 value somewhere in the patch is much too high for the
> purpose because it shifts the checkpoint load quite a lot (50% more load at
> the end of the checkpoint) just for the purpose of avoiding a spike which
> lasts a few seconds (I think) at the beginning. A much smaller value should
> be used (1.0 <= factor < 1.1), as it would be much less disruptive and would
> probably avoid the issue just the same. I recommend not to commit with a 1.5
> factor in any case.

Wait, what?  On what workload does the FPW spike last only a few
seconds?  That's certainly not the case in testing I've done.  It
would have to be the case that almost all the writes were concentrated
on a very few pages.

> Another issue I raised is that the load change occurs both with xlog and
> time triggered checkpoints, and I'm sure it should be applied in both case.

Is this sentence missing a "not"?

> Another issue is that the patch makes sense when the WAL & relations are on
> the same disk, but might degrade performance otherwise.

Yes, that would be a good case to test.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-12-22 Thread Robert Haas
On Sun, Dec 6, 2015 at 7:25 PM, Peter Geoghegan  wrote:
> On Tue, Nov 24, 2015 at 4:33 PM, Peter Geoghegan  wrote:
>> So, the bottom line is: This patch seems very good, is unlikely to
>> have any notable downside (no case has been shown to be regressed),
>> but has yet to receive code review. I am working on a new version with
>> the first two commits consolidated, and better comments, but that will
>> have the same code, unless I find bugs or am dissatisfied. It mostly
>> needs thorough code review, and to a lesser extent some more
>> performance testing.
>
> I'm currently spending a lot of time working on parallel CREATE INDEX.
> I should not delay posting a new version of my patch series any
> further, though. I hope to polish up parallel CREATE INDEX to be able
> to show people something in a couple of weeks.
>
> This version features consolidated commits, the removal of the
> multipass_warning parameter, and improved comments and commit
> messages. It has almost entirely unchanged functionality.
>
> The only functional changes are:
>
> * The function useselection() is taught to distrust an obviously bogus
> caller reltuples hint (when it's already less than half of what we
> know to be the minimum number of tuples that the sort must sort,
> immediately after LACKMEM() first becomes true -- this is probably a
> generic estimate).
>
> * Prefetching only occurs when writing tuples. Explicit prefetching
> appears to hurt in some cases, as David Rowley has shown over on the
> dedicated thread. But it might still be that writing tuples is a case
> that is simple enough to benefit consistently, due to the relatively
> uniform processing that memory latency can hide behind for that case
> (before, the same prefetching instructions were used for CREATE INDEX
> and for aggregates, for example).
>
> Maybe we should consider trying to get patch 0002 (the memory
> pool/merge patch) committed first, something Greg Stark suggested
> privately. That might actually be an easier way of integrating this
> work, since it changes nothing about the algorithm we use for merging
> (it only improves memory locality), and so is really an independent
> piece of work (albeit one that makes a huge overall difference due to
> the other patches increasing the time spent merging in absolute terms,
> and especially as a proportion of the total).

So I was looking at the 0001 patch and came across this code:

+/*
+ * Crossover point is somewhere between where memtuples is between 40%
+ * and all-but-one of total tuples to sort.  This weighs approximate
+ * savings in I/O, against generic heap sorting cost.
+ */
+avgTupleSize = (double) memNowUsed / (double) state->memtupsize;
+
+/*
+ * Starting from a threshold of 90%, refund 7.5% per 32 byte
+ * average-size-increment.
+ */
+increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
+crossover = 0.90 - (increments * 0.075);
+
+/*
+ * Clamp, making either outcome possible regardless of average size.
+ *
+ * 40% is about the minimum point at which "quicksort with spillover"
+ * can still occur without a logical/physical correlation.
+ */
+crossover = Max(0.40, Min(crossover, 0.85));
+
+/*
+ * The point where the overhead of maintaining the heap invariant is
+ * likely to dominate over any saving in I/O is somewhat arbitrarily
+ * assumed to be the point where memtuples' size exceeds MaxAllocSize
+ * (note that overall memory consumption may be far greater).  Past
+ * this point, only the most compelling cases use replacement selection
+ * for their first run.
+ *
+ * This is not about cache characteristics so much as the O(n log n)
+ * cost of sorting larger runs dominating over the O(n) cost of
+ * writing/reading tuples.
+ */
+if (sizeof(SortTuple) * state->memtupcount > MaxAllocSize)
+crossover = avgTupleSize > 32 ? 0.90 : 0.95;

This looks like voodoo to me.  I assume you tested it and maybe it
gives correct answers, but it's got to be some kind of world record
for number of arbitrary constants per SLOC, and there's no real
justification for any of it.  The comments say, essentially, well, we
do this because it works.  But suppose I try it on some new piece of
hardware and it doesn't work well.  What do I do?  Email the author
and ask him to tweak the arbitrary constants?

The dependency on MaxAllocSize seems utterly bizarre to me.  If we
decide to modify our TOAST infrastructure so that we support datums up
to 2GB in size, or alternatively datums of up to only 512MB in size,
do you expect that to change the behavior of tuplesort.c?  I bet not,
but that's a major reason why MaxAllocSize is defined the way it is.

I wonder if there's a way to accomplish what you're trying to do here
that avoids the need to have a cost model at all.  As I understand it,
and please correct me wherever I go off the rails, the situation is:

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-12-22 Thread Robert Haas
On Mon, Dec 21, 2015 at 1:27 AM, Amit Kapila  wrote:
> On Fri, Dec 18, 2015 at 9:58 PM, Robert Haas  wrote:
>>
>> On Fri, Dec 18, 2015 at 1:16 AM, Amit Kapila 
>> wrote:
>>
>> >> Some random comments:
>> >>
>> >> - TransactionGroupUpdateXidStatus could do just as well without
>> >> add_proc_to_group.  You could just say if (group_no >= NUM_GROUPS)
>> >> break; instead.  Also, I think you could combine the two if statements
>> >> inside the loop.  if (nextidx != INVALID_PGPROCNO &&
>> >> ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
>> >> something like that.
>> >>
>
> Changed as per suggestion.
>
>> >> - memberXid and memberXidstatus are terrible names.  Member of what?
>> >
>> > How about changing them to clogGroupMemberXid and
>> > clogGroupMemberXidStatus?
>>
>> What we've currently got for group XID clearing for the ProcArray is
>> clearXid, nextClearXidElem, and backendLatestXid.  We should try to
>> make these things consistent.  Maybe rename those to
>> procArrayGroupMember, procArrayGroupNext, procArrayGroupXid
>>
>
> Here procArrayGroupXid sounds like Xid at group level, how about
> procArrayGroupMemberXid?
> Find the patch with renamed variables for PGProc
> (rename_pgproc_variables_v1.patch) attached with mail.

I sort of hate to make these member names any longer, but I wonder if
we should make it procArrayGroupClearXid etc.  Otherwise it might be
confused with some other type of grouping of PGPROCs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

2015-12-22 Thread Robert Haas
On Mon, Dec 21, 2015 at 2:55 PM, Peter Geoghegan  wrote:
> On Mon, Dec 21, 2015 at 10:53 AM, Robert Haas  wrote:
>> PFA my proposal for comment changes for 9.5 and master.  This is based
>> on your 0001, but I edited somewhat.  Please let me know your
>> thoughts.  I am not willing to go further and rearrange actual code in
>> 9.5 at this point; it just isn't necessary.
>
> Fine by me. But this revision hasn't made the important point at all
> -- which is that 0002 is safe. That's a stronger guarantee than the
> abbreviated key representation being pass-by-value.

Right.  I don't think that we should back-patch that stuff into 9.5.

>> Looking at this whole system again, I wonder if we're missing a trick
>> here.  How about if we decree from on high that the abbreviated-key
>> comparator is always just the application of an integer comparison
>> operator?  The only abbreviation function that we have right now that
>> wouldn't be happy with that is the one for numeric, and that looks
>> like it could be fixed.
>
> I gather you mean that we'd decree that they were always compared
> using a text or uuid style 3-way unsigned integer comparison, allowing
> us to hardcode that.

Right.

>> Then we could hard-code knowledge of this
>> representation in tuplesort.c in such a way that it wouldn't need to
>> call a comparator function at all - instead of doing
>> ApplySortComparator() and then maybe ApplySortAbbrevFullComparator(),
>> it would do something like:
>>
>> if (using_abbreviation && (compare = ApplyAbbrevComparator(...)) != 0)
>> return compare;
>>
>> I'm not sure if that would save enough cycles vs. the status quo to be
>> worth bothering with, but it seems like it might.  You may have a
>> better feeling for that than I do.
>
> I like this idea. I thought of it myself already, actually.
>
> It has some nice properties, because unsigned integers are so simple
> and flexible. For example, we can do it with int4 sometimes, and then
> compare two attributes at once on 64-bit platforms. Maybe the second
> attribute (the second set of 4 bytes) isn't an int4, though -- it's
> any other type's abbreviated key (which can be assumed to have a
> compatible representation). That's one more advanced possibility.

Yikes.

> It seems worthwhile because numeric is the odd one out. Once I or
> someone else gets around to adding network types, which could happen
> in 9.6, then we are done (because there is already a pending patch
> that adds support for remaining character-like types and alternative
> opclasses). I really don't foresee much additional use of abbreviated
> keys beyond these cases. There'd be a small but nice boost by not
> doing pointer chasing for the abbreviated key comparison, I imagine,
> but that benefit would be felt everywhere. Numeric is the odd one out
> currently, but as you say it can be fixed, and I foresee no other
> trouble from any other opclass/type that cares about abbreviation.
>
> Another issue is that abbreviated keys in indexes are probably not
> going to take 8 bytes, because they'll go in the ItemId array. It's
> likely to be very useful to be able to take the first two bytes, and
> know that a uint16 comparison is all that is needed, and have a basis
> to perform an interpolation search rather than a binary search
> (although that needs to be adaptive to different distributions, I
> think it could be an effective technique -- there are many cache
> misses for binary searches on internal B-Tree pages).

Hmm, that seems iffy to me.  There are plenty of data sets where 8
bytes is enough to get some meaningful information about what part of
the keyspace you're in, but 2 bytes isn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Possible marginally-incompatible change to array subscripting

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 11:51 AM, Tom Lane  wrote:
> I'm reviewing Yury Zhuravlev's patch to allow array slice boundaries to be
> omitted, for example "a[4:]" means "the slice extending from element 4 to
> the last element of a".  It strikes me that there's an improvement we
> could easily make for the case where a mixture of slice and non-slice
> syntax appears, that is something like "a[3:4][5]".  Now, this has always
> meant a slice, and the way we've traditionally managed that is to treat
> simple subscripts as being the range upper bound with a lower bound of 1;
> that is, what this example means is exactly "a[3:4][1:5]".
>
> ISTM that if we'd had Yury's code in there from the beginning, what we
> would define this as meaning is "a[3:4][:5]", ie the implied range runs
> from whatever the array lower bound is up to the specified subscript.
>
> This would make no difference of course for the common case where the
> array lower bound is 1, but it seems a lot less arbitrary when it isn't.
> So I think we should strongly consider changing it to mean that, even
> though it would be non-backwards-compatible in such cases.
>
> Comments?

Gosh, our arrays are strange.  I would have expected a[3:4][5] to mean
a[3:4][5:5].

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 2:00 AM, Michael Paquier
 wrote:
> On Tue, Dec 22, 2015 at 3:52 PM, Amit Langote
>  wrote:
>> On 2015/12/22 15:24, Michael Paquier wrote:
>>> As this debate continues, I think that moving this patch to the next
>>> CF would make the most sense then.. So done this way.
>>
>> Perhaps, this ended (?) with the following commit:
>>
>> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=385f337c9f39b21dca96ca4770552a10a6d5af24
>
> Ah, thanks! What has been committed is actually more or less
> epq-recheck-v6-efujita.patch posted upthread, I'll mark the patch as
> committed then.

+1.  And thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pglogical_output - a general purpose logical decoding output plugin

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 4:55 AM, Craig Ringer  wrote:
> I'm a touch frustrated by that, as a large part of the point of submitting
> the output plugin separately and in advance of the downstream was to get
> attention for it separately, as its own entity. A lot of effort has been put
> into making this usable for more than just a data source for pglogical's
> replication tools. In retrospect naming it pglogical_output was probably
> unwise.

It's not too late to rename it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor code improvements to create_foreignscan_plan/ExecInitForeignScan

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 7:32 AM, Michael Paquier
 wrote:
> On Wed, Sep 9, 2015 at 7:08 PM, Etsuro Fujita
>  wrote:
>> On 2015/07/10 18:40, Etsuro Fujita wrote:
>>> To save cycles, I modified create_foreignscan_plan so that it detects
>>> whether any system columns are requested if scanning a base relation.
>>> Also, I revised other code there a little bit.
>>
>> Attached is an updated version of the patch.  The previous version
>> contained changes to ExecInitForeignScan, but I've dropped that part, as
>> discussed before.
>
> Moved to next CF because of a lack of reviews.

I just took a look at this.  I think the basic idea of this patch is
good, but the comments need some work, because they don't really
explain why this should be skipped in the join case.  Maybe something
like this:

If rel is a base relation, detect whether any system columns were
requested.  (If rel is a join relation, rel->relid will be 0, but
there can be no Var in the target list with relid 0, so we skip this
in that case.) This is a bit of a kluge and might go away someday, so
we intentionally leave it out of the API presented to FDWs.

And the rest as it is currently written.

It might be good, also, to say something about why we never need
fsSystemCol to be true in the joinrel case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Getting sorted data from foreign server for merge join

2015-12-22 Thread Robert Haas
On Mon, Dec 21, 2015 at 6:34 AM, Ashutosh Bapat
 wrote:
>> I went over this patch in some detail today and did a lot of cosmetic
>> cleanup.  The results are attached.  I'm fairly happy with this
>> version, but let me know what you think.  Of course, feedback from
>> others is more than welcome also.
>>
>
> Attached patch with some cosmetic changes (listed here for your quick
> reference)
> 1. , was replaced with ; in comment "inner join, expressions in the " at one
> place, which is correct, but missed other place.
> 2. The comment "First, consider whether any each active EC is potentially"
> should use either "any" or "each". I have reworded it as "First, consider
> whether any of the active ECs is potentially ...". Or we can use "First,
> find all of the active ECs which are potentially ".
> 3. "having the remote side due the sort generally won't be any worse ..." -
> instead of "due" we should use "do"?
> 4. Added static prototype of function get_useful_ecs_for_relation().
> 5. The comment "Extract unique EC for query, if any, so we don't consider it
> again." is too crisp. Phrase "Unique EC for query" is confusing; EC can not
> be associated with a query per say and EC's are always unique because of
> canonicalisation. May be we should reword it as "Extract single EC for
> ordering of query, if any, so we don't consider it again." Is that cryptic
> as well?

Thanks.  I committed this version with one small tweak.
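
(For readers not following the patch closely, point 5 concerns logic along
the lines of the simplified sketch below; this is illustrative only, not
the committed postgres_fdw code, and the variable names are made up. The EC
behind the query's ORDER BY pathkeys is remembered so that, while scanning
the remaining equivalence classes for orderings that could enable a merge
join on the remote side, the same ordering is not generated twice.)

    /* Simplified sketch -- not the exact committed logic. */
    EquivalenceClass *query_ec = NULL;
    ListCell   *lc;

    if (root->query_pathkeys != NIL)
    {
        PathKey    *pathkey = (PathKey *) linitial(root->query_pathkeys);

        query_ec = pathkey->pk_eclass;
    }

    foreach(lc, root->eq_classes)
    {
        EquivalenceClass *cur_ec = (EquivalenceClass *) lfirst(lc);

        /* Already covered by the query's own pathkeys; skip it. */
        if (cur_ec == query_ec)
            continue;

        /* ... otherwise consider cur_ec as a useful remote sort order ... */
    }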

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel joins, and better parallel explain

2015-12-22 Thread Dilip Kumar
On Tue, Dec 22, 2015 at 8:30 PM, Robert Haas  wrote:

> On Tue, Dec 22, 2015 at 4:14 AM, Dilip Kumar 
> wrote:
> > On Fri, Dec 18, 2015 at 8:47 PM Robert Wrote,
> >>> Yes, you are right, that create_gather_path() sets parallel_safe to
> false
> >>> unconditionally but whenever we are building a non partial path, that
> >>> time
> >>> we should carry forward the parallel_safe state to its parent, and it
> >>> seems
> >>> like that part is missing here..
> >
> >>Ah, right.  Woops.  I can't exactly replicate your results, but I've
> >>attempted to fix this in a systematic way in the new version attached
> >>here (parallel-join-v3.patch).
> >
> > I Have tested with the latest patch, problem is solved..
> >
> > During my testing i observed one more behaviour in the hash join, where
> > Parallel hash join is taking more time compared to Normal hash join,
>
> I think the gather-reader-order patch will fix this.  Here's a test
> with all three patches.
>
>
Yeah, right. After applying all three patches this problem is fixed;
parallel hash join is now faster than normal hash join.

I have tested one more case that Amit mentioned, and I can see that the
parallel plan (parallel degree >= 3) is still slow in that case. In the
normal case it selects a "Hash Join", but with parallel degree >= 3 it
selects a parallel "Nested Loop" join, which makes it costlier.

CREATE TABLE t1(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 1000) g;

CREATE TABLE t2(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 300) g;
Analyze t1;
Analyze t2;

postgres=# set max_parallel_degree=0;
SET
postgres=#  Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 =
t2.c1 AND t1.c1 BETWEEN 100 AND 200;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Aggregate  (cost=223208.93..223208.94 rows=1 width=0) (actual
time=2148.840..2148.841 rows=1 loops=1)
   ->  Hash Join  (cost=204052.91..223208.92 rows=1 width=0) (actual
time=1925.309..2148.812 rows=101 loops=1)
 Hash Cond: (t2.c1 = t1.c1)
 ->  Seq Scan on t2  (cost=0.00..15406.00 rows=100 width=4)
(actual time=0.025..104.028 rows=100 loops=1)
 ->  Hash  (cost=204052.90..204052.90 rows=1 width=4) (actual
time=1925.219..1925.219 rows=101 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 12kB
   ->  Seq Scan on t1  (cost=0.00..204052.90 rows=1 width=4)
(actual time=0.029..1925.196 rows=101 loops=1)
 Filter: ((c1 >= 100) AND (c1 <= 200))
 Rows Removed by Filter: 899
 Planning time: 0.470 ms
 Execution time: 2148.928 ms
(11 rows)

postgres=# set max_parallel_degree=3;
SET
postgres=#  Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 =
t2.c1 AND t1.c1 BETWEEN 100 AND 200;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Aggregate  (cost=78278.36..78278.37 rows=1 width=0) (actual
time=19944.113..19944.113 rows=1 loops=1)
   ->  Gather  (cost=1000.00..78278.36 rows=1 width=0) (actual
time=0.682..19943.928 rows=101 loops=1)
 Number of Workers: 3
 ->  Nested Loop  (cost=0.00..77278.26 rows=1 width=0) (actual
time=690.633..6556.201 rows=25 loops=4)
   Join Filter: (t1.c1 = t2.c1)
   Rows Removed by Join Filter: 25249975
   ->  Parallel Seq Scan on t1  (cost=0.00..58300.83 rows=0
width=4) (actual time=619.198..619.262 rows=25 loops=4)
 Filter: ((c1 >= 100) AND (c1 <= 200))
 Rows Removed by Filter: 2499975
   ->  Seq Scan on t2  (cost=0.00..15406.00 rows=100
width=4) (actual time=0.008..105.757 rows=100 loops=101)
 Planning time: 0.206 ms
 Execution time: 19944.748 ms


postgres=# set max_parallel_degree=1;
SET
postgres=#  Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 =
t2.c1 AND t1.c1 BETWEEN 100 AND 200;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Aggregate  (cost=156191.39..156191.40 rows=1 width=0) (actual
time=1336.401..1336.401 rows=1 loops=1)
   ->  Hash Join  (cost=137035.38..156191.39 rows=1 width=0) (actual
time=1110.562..1336.386 rows=101 loops=1)
 Hash Cond: (t2.c1 = t1.c1)
 ->  Seq Scan on t2  (cost=0.00..15406.00 rows=100 width=4)
(actual time=0.025..101.659 rows=100 loops=1)
 ->  Hash  (cost=137035.37..137035.37 rows=1 width=4) (actual
time=1110.486..1110.486 rows=101 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 12kB
   ->  Gather  (cost=1000.00..137035.37 rows=1 width=4) (actual