Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

2017-08-04 Thread Stephen Frost
Robert,

On Fri, Aug 4, 2017 at 23:17 Robert Haas  wrote:

> On Thu, Aug 3, 2017 at 9:49 PM, Stephen Frost  wrote:
> > Thanks for the patches.  I'm planning to push them tomorrow morning
> > after a bit more review and testing.  I'll provide an update tomorrow.
>
> Obviously, the part about pushing them Friday morning didn't happen,
> and you're running out of time for providing an update on Friday, too.
> The release is due to wrap in less than 72 hours.


Unfortunately the day got away from me due to some personal... adventures
(having to do with lack of air conditioning first and then lack of gas,
amongst a lot of other things going on right now...). I just got things
back online but, well, my day tomorrow is pretty well packed solid.  I
wouldn't complain if someone has a few cycles to push these; they look
pretty good to me, but I won't be able to work on PG stuff again until
tomorrow night at the earliest.

Thanks!

Stephen


Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

2017-08-04 Thread Robert Haas
On Thu, Aug 3, 2017 at 9:49 PM, Stephen Frost  wrote:
> Thanks for the patches.  I'm planning to push them tomorrow morning
> after a bit more review and testing.  I'll provide an update tomorrow.

Obviously, the part about pushing them Friday morning didn't happen,
and you're running out of time for providing an update on Friday, too.
The release is due to wrap in less than 72 hours.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-04 Thread Robert Haas
On Fri, Aug 4, 2017 at 2:45 PM, Amit Kapila  wrote:
> I have not done anything for this comment as it doesn't sound wrong to
> me.  I think it is not making much sense in the current code and we
> can remove it or change it as part of a separate patch if you or
> others think so.

I don't get it.  The comment claims we have to _hash_getnewbuf before
releasing the metapage write lock.  But the function neither calls
_hash_getnewbuf nor releases the metapage write lock.  It then goes on
to say that for this reason we pass the new buffer rather than just
the block number.  However, even if it were true that we have to call
_hash_getnewbuf before releasing the metapage write lock, what does
that have to do with the choice of passing a buffer vs. a block
number?

Explain to me what I'm missing, please, because I'm completely befuddled!

(On another note, I committed these patches.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] A bug in mapping attributes in ATExecAttachPartition()

2017-08-04 Thread Robert Haas
On Thu, Aug 3, 2017 at 8:45 PM, Amit Langote
 wrote:
>> 0003 needs a rebase.
>
> Rebased patch attached.

Committed.  I think 0004 is a new feature, so I'm leaving that for v11.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Subscription code improvements

2017-08-04 Thread Peter Eisentraut
On 8/4/17 12:02, Masahiko Sawada wrote:
> To make ALTER SUBSCRIPTION REFRESH being transactional, I prefer
> Petr's proposal. Because it can make ALTER SUBSCRIPTION and DROP
SUBSCRIPTION stop the table sync workers that are in the middle of
copying data. I'm not sure killing table sync workers in DDL commands
would be acceptable but since it can free unnecessary slots of logical
replication workers and replication slots, I'd prefer this idea.

OK, I have committed the 0004 patch.

> However, even with this patch there is still an issue: NOTICE
> messages like "removed subscription for table public.t1" can appear
> even if we roll back the ALTER SUBSCRIPTION REFRESH command, as I
> mentioned on an earlier thread. Since I think this behaviour will
> confuse users who check server logs, I'd like to deal with it,
> though I don't have a good idea yet.

Maybe we can just remove those messages?

We don't get messages when we create a subscription about which tables
are part of it.  So why do we need such messages when we refresh a
subscription?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-08-04 Thread Peter Geoghegan
On Fri, Jul 21, 2017 at 4:51 AM, Alik Khilazhev
 wrote:
> (Latest version of pgbench Zipfian patch)

While I'm +1 on this idea, I think that it would also be nice if there
was an option to make functions like random_zipfian() actually return
a value that has undergone perfect hashing. When this option is used,
any given value that the function returns would actually be taken from
a random mapping to some other value in the same range. So, you can
potentially get a Zipfian distribution without the locality.

While I think that Zipfian distributions are common, it's probably
less common for the popular items to be clustered together within the
primary key, unless the PK has multiple attributes, and the first is
the "popular attribute". For example, there would definitely be a lot
of interest among contiguous items within an index on "(artist,
album)" where "artist" is a popular artist, which is certainly quite
possible. But, if "album" is the primary key, and it's a SERIAL PK,
then there will only be weak temporal locality within the PK, and only
because today's fads are more popular than the fads from previous
years.

Just another idea, that doesn't have to hold this patch up.

-- 
Peter Geoghegan




[HACKERS] Draft release notes up for review

2017-08-04 Thread Tom Lane
I've committed the first-draft release notes for 9.6.4 at
https://git.postgresql.org/pg/commitdiff/03378c4da598840b0520a53580dd7713c95f21c8

(If you prefer to read nicely-marked-up copy, they should be up at
https://www.postgresql.org/docs/devel/static/release-9-6-4.html
in a couple hours from now.)

Please review.

regards, tom lane




Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-08-04 Thread Peter Geoghegan
On Mon, Jul 31, 2017 at 10:54 AM, Peter Geoghegan  wrote:
> Let's wait to see what difference it makes if Alik's zipfian
> distribution pgbench test case uses unlogged tables. That may give us a
> good sense of the problem for cases with contention/concurrency.

Yura Sokolov of Postgres Pro performed this benchmark at my request.
He took the 9.5 commit immediately preceding 2ed5b87f9 as a baseline.
In all cases, logged tables were used. Note that this is not the most
effective benchmark for showing the regression, because he didn't
replace the PK with a non-unique index, though that is planned as
follow-up; I wanted to stick with Alik's Zipfian test case for the
time being.

(Using a unique index is not the most effective thing for showing the
regression because unique indexes have most LP_DEAD setting done in
_bt_check_unique(), so only SELECTs will do less LP_DEAD cleanup
there; SELECTs are 50% of all queries.)

His results with 10 minute pgbench runs:

Logged
clients | 8217fb14 | 2ed5b87f | master | hashsnap | hashsnap_lwlock
--------+----------+----------+--------+----------+----------------
 10 |   201569 |   204154 | 201095 |   201793 |  206111
 20 |   328180 |   58 | 334858 |   336721 |  370769
 40 |   196352 |   194692 | 232413 |   231968 |  393947
 70 |   121083 |   116881 | 148066 |   148443 |  224162
110 |77989 |73414 |  99305 |   111404 |  161900
160 |48958 |45830 |  65830 |82340 |  115678
230 |27699 |25510 |  38186 |57617 |   80575
310 |16369 |15137 |  21571 |39745 |   56819
400 |10327 | 9486 |  13157 |27318 |   40859
500 | 6920 | 6496 |   8638 |18677 |   29429
650 | 4315 | 3971 |   5196 |11762 |   17883
800 | 2850 | 2772 |   3523 | 7733 |   10977

Note that you also see numbers from various patches from Yura, and the
master branch mixed in here, but 8217fb14 (before) and 2ed5b87f
(after) are the interesting numbers as far as this regression goes.

There is an appreciable reduction in TPS here, though this workload is
not as bad by that measure as first thought. There is a roughly 5%
regression here past 40 clients or so. The regression in the
*consistency* of transaction *throughput* is far more interesting,
though. I've been doing analysis of this by drilling down to
individual test cases with vimdiff, as follows:

$ vimdiff test_8217fb14_logged_1_pgbench_40.out test_2ed5b87f_logged_1_pgbench_40.out

(I attach these two files as a single example. I can provide the full
results to those that ask for them privately; it's too much data to
attach to an e-mail to the list.)

You can see in this example that for most 5 second slices of the 10
minute benchmark, commit 2ed5b87f actually increases TPS somewhat,
which is good. But there are also occasional *big* drops in TPS,
sometimes by as much as 50% over a single 5 second period (when
ANALYZE runs, adding random I/O while holding an exclusive buffer
lock [1]?). When this slowdown happens, latency can be over 3 times
higher, too.

We see much more consistent performance without the B-Tree buffer pin
VACUUM optimization, where there is no discernible pattern of
performance dropping. The headline regression of 4% or 5% is not the
main problem here, it seems.

In summary, commit 2ed5b87f makes the workload have increased
throughput most of the time, but occasionally sharply reduces
throughput, which averages out to TPS being 4% or 5% lower overall. I
think we'll find that there are bigger problems TPS-wise with
non-unique indexes when that other test is performed by Yura; let's
wait for those results to come back.

Finally, I find it interesting that when Yura did the same benchmark,
but with 5% SELECTs + 95% UPDATEs, rather than 50% SELECTs + 50%
UPDATEs as above, the overall impact was surprisingly similar. His
results:

clients | 8217fb14 | 2ed5b87f | master | hashsnap | hashsnap_lwlock
--------+----------+----------+--------+----------+----------------
 20 |   187697 |   187335 | 217558 |   215059 |  266894
 50 |81272 |78784 |  97948 |97659 |  157710
 85 |49446 |47683 |  64597 |70814 |  107748
130 |32358 |30393 |  42216 |50531 |   75001
190 |19403 |17569 |  25704 |35506 |   51292
270 |10803 | 9878 |  14166 |23851 |   35257
370 | 6108 | 5645 |   7684 |15390 |   23659
500 | 3649 | 3297 |   4225 | 9172 |   14643
650 | 2239 | 2049 |   2635 | 5887 |8588
800 | 1475 | 1424 |   1804 | 4035 |5611

If nothing else, this shows how generally reliant these kinds of
workloads can be on LP_DEAD setting. And, there is one difference: the
regression is seen here at *all* client counts.

Re: [HACKERS] PostgreSQL not setting OpenSSL session id context?

2017-08-04 Thread Tom Lane
Shay Rojansky  writes:
> Great. Do you think it's possible to backport to the other maintained
> branches as well, seeing as how this is quite trivial and low-impact?

Already done, will be in next week's minor releases.  (You timed this
bug report well.)

regards, tom lane




Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-04 Thread Robert Haas
On Fri, Aug 4, 2017 at 2:49 PM, Amit Kapila  wrote:
> On Fri, Aug 4, 2017 at 10:59 PM, Robert Haas  wrote:
>> On Fri, Aug 4, 2017 at 6:22 AM, Amit Kapila  wrote:
>>> I have increased the number of hash bitmap pages as a separate patch.
>>> I am not completely sure if it is a good idea to directly increase it
>>> to 1024 as that will increase the size of hashmetapagedata from 960
>>> bytes to 4544 bytes.  Shall we increase it to 512?
>>
>> I don't quite see what the problem is with increasing it to 4544
>> bytes.  What's your concern?
>
> Nothing as such.  It is just that the previous code might have some
> reason to keep it at 128, probably if there are that many overflow
> bucket pages, then it is better to reindex the index otherwise the
> performance might suck while traversing the long overflow chains.  In
> general, I see your point that if we can allow the user to have that
> big an overflow space, then let's do it.

Yeah.  The only concern I see is that one doesn't want to run out of
space in the metapage when future patches come along and try to do new
things.  But 4544 bytes still leaves quite a bit of headroom, and I
think it's better to solve a problem we know we have right now than
try to save too much space for future needs that may or may not occur.
Maybe at some point this whole thing needs a broader rethink, but the
fact that somebody hit the current limit before we got out of beta
makes me think that we should be fairly aggressive in trying to
ameliorate the problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PostgreSQL not setting OpenSSL session id context?

2017-08-04 Thread Shay Rojansky
>
> > Doing SSL_CTX_set_session_cache_mode(context, SSL_SESS_CACHE_OFF) doesn't
> > have any effect whatsoever - I still have the same issue (session id
> > context uninitialized). I suspect session caching is an entirely different
> > feature from session tickets/RFC5077 (although it might still be a good
> > idea to disable).
>
> Right, we expected that that would have no visible effect, because there
> is no way to cache sessions in Postgres anyway.  The main point, if I
> understand Heikki's concern correctly, is that this might save some
> amount of low-level overhead from clients trying to cache connections.
>

OK, sounds right (i.e. this is a defensive measure that isn't directly
connected to my problem but makes sense).

> > Doing SSL_CTX_set_options(context, SSL_OP_NO_TICKET) indeed resolves the
> > issue, as expected.
>
> Excellent.  I'll push this patch tomorrow sometime (too late/tired
> right now).
>

Great. Do you think it's possible to backport to the other maintained
branches as well, seeing as how this is quite trivial and low-impact?


> > As I wrote above, I'd remove the #ifdef and execute it always.
>
> The reason I put the #ifdef in is that according to my research the
> SSL_OP_NO_TICKET symbol was introduced in openssl 0.9.8f, while we
> claim to support back to 0.9.8.  I'd be the first to say that you're
> nuts if you're running openssl versions that old; but this patch is not
> something to move the compatibility goalposts for when it only takes
> an #ifdef to avoid breaking older versions.
>
> (I need to check how far back SSL_SESS_CACHE_OFF goes ... we might
> need an #ifdef for that too.)
>

Ah OK, thanks for the explanation - makes perfect sense.


Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-04 Thread Amit Kapila
On Fri, Aug 4, 2017 at 10:59 PM, Robert Haas  wrote:
> On Fri, Aug 4, 2017 at 6:22 AM, Amit Kapila  wrote:
>> I have increased the number of hash bitmap pages as a separate patch.
>> I am not completely sure if it is a good idea to directly increase it
>> to 1024 as that will increase the size of hashmetapagedata from 960
>> bytes to 4544 bytes.  Shall we increase it to 512?
>
> I don't quite see what the problem is with increasing it to 4544
> bytes.  What's your concern?
>

Nothing as such.  It is just that the previous code might have some
reason to keep it at 128, probably if there are that many overflow
bucket pages, then it is better to reindex the index otherwise the
performance might suck while traversing the long overflow chains.  In
general, I see your point that if we can allow the user to have that
big an overflow space, then let's do it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] PostgreSQL not setting OpenSSL session id context?

2017-08-04 Thread Shay Rojansky
>
> On 2017-08-04 07:22:42 +0300, Shay Rojansky wrote:
> > I'm still not convinced of the risk/problem of simply setting the session
> > id context as I explained above (rather than disabling the optimization),
> > but of course either solution resolves my problem.
>
> How would that do anything? Each backend has it's own local
> memory. I.e. any cache state that openssl would maintain wouldn't be
> useful. If you want to take advantage of features around this you really
> need to cache tickets in shared memory...
>

Guys, there's no data being cached at the backend - RFC5077 is about
packaging information into a client-side opaque session ticket that allows
skipping a roundtrip on the next connection. As I said, simply setting the
session id context (*not* the session id or anything else) makes this
feature work, even though a completely new backend process is launched.


Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-04 Thread Amit Kapila
On Fri, Aug 4, 2017 at 11:45 PM, Amit Kapila  wrote:
> On Fri, Aug 4, 2017 at 11:15 PM, Robert Haas  wrote:
>> On Fri, Aug 4, 2017 at 6:22 AM, Amit Kapila  wrote:
>>> I have implemented the patch with this approach as other approach
>>> require quite extensive changes which I am not sure is the right thing
>>> to do at this stage.
>>
>> I think this approach is actually better anyway.  There's no guarantee
>> that VACUUM can be responsive enough to get the job done in time, work
>> items or no work items, and the split-cleanup is cheap enough that I
>> don't think we ought to worry about negative effects on foreground
>> workloads.  Sure, it could have some impact, but *less bloat* could
>> also have some impact in the other direction.
>>
>> +hashbucketcleanup(rel, obucket, bucket_obuf,
>> +  BufferGetBlockNumber(bucket_obuf), NULL,
>> +  maxbucket, highmask, lowmask, NULL, NULL, true,
>> +  NULL, NULL);
>> +LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
>>
>> Don't we want to do those things in the other order?
>>
>
> Earlier one of the callers was releasing the old bucket lock first, so
> I just maintained that order, but I think it is better if we release
> new bucket lock first. I think that will maintain the ordering and it
> is better to do it before actual cleanup of the old bucket as cleanup
> can take some time and there is no point in retaining a lock on the new
> bucket for that time.  I will do some testing after making the change
> and will post a patch in some time.
>

Changed as per suggestion.

>> - * which is passed in buffer nbuf, pinned and write-locked.  That lock and
>> - * pin are released here.  (The API is set up this way because we must do
>> - * _hash_getnewbuf() before releasing the metapage write lock.  So instead of
>> - * passing the new bucket's start block number, we pass an actual buffer.)
>> + * which is passed in buffer nbuf, pinned and write-locked.  The lock will be
>> + * released here and pin must be released by the caller.  (The API is set up
>> + * this way because we must do _hash_getnewbuf() before releasing the metapage
>> + * write lock.  So instead of passing the new bucket's start block number, we
>> + * pass an actual buffer.)
>>
>> Isn't the parenthesized part at the end of the comment wrong?  I
>> realize this patch isn't editing that part anyway except for
>> reflowing, but further up in that same block comment it says this:
>>
>>  * The caller must hold a pin, but no lock, on the metapage buffer.
>>  * The buffer is returned in the same state.  (The metapage is only
>>  * touched if it becomes necessary to add or remove overflow pages.)
>>
>
> The comment doesn't seem to be wrong but is not making much sense
> here.  Both the call to _hash_getnewbuf() and the release of the metapage
> lock are done in the caller.  In 9.6, we were passing the old bucket's block
> number and new bucket's buffer, so the comment tries to explain why it
> is passing buffer for the new bucket.  It makes less sense now because
> we are passing buffer for both old and new buckets, so we can remove
> it.  Do you want me to remove it as part of this patch or as a
> separate patch?
>

I have not done anything for this comment as it doesn't sound wrong to
me.  I think it is not making much sense in the current code and we
can remove it or change it as part of a separate patch if you or
others think so.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_cleanup_bucket_after_split_v2.patch
Description: Binary data



Re: [HACKERS] Small code improvement for btree

2017-08-04 Thread Peter Geoghegan
On Fri, Aug 4, 2017 at 11:12 AM, Alvaro Herrera
 wrote:
> Interesting.  We learned elsewhere that it's better to integrate the
> "!= 0" test as part of the macro definition; so a
> better formulation of this patch would be to change the
> P_INCOMPLETE_SPLIT macro and omit the comparison in the Assert.  (See
> commit 594e61a1de03 for an example).
>
>
>> -   LockBuffer(hbuffer, BUFFER_LOCK_SHARE);
>> +   LockBuffer(hbuffer, BT_READ);

+1.

One Linus Torvalds rant that I actually agreed with was a rant against
the use of bool as a type in C code. It's fine, as long as you never
forget that it's actually just another integer.

> I think BT_READ and BT_WRITE are useless, and I'd rather get rid of
> them ...

Fair enough, but we should either use them consistently or not at all.
I'm not especially concerned about which, as long as it's one of those
two.

-- 
Peter Geoghegan




Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-04 Thread Amit Kapila
On Fri, Aug 4, 2017 at 11:15 PM, Robert Haas  wrote:
> On Fri, Aug 4, 2017 at 6:22 AM, Amit Kapila  wrote:
>> I have implemented the patch with this approach as other approach
>> require quite extensive changes which I am not sure is the right thing
>> to do at this stage.
>
> I think this approach is actually better anyway.  There's no guarantee
> that VACUUM can be responsive enough to get the job done in time, work
> items or no work items, and the split-cleanup is cheap enough that I
> don't think we ought to worry about negative effects on foreground
> workloads.  Sure, it could have some impact, but *less bloat* could
> also have some impact in the other direction.
>
> +hashbucketcleanup(rel, obucket, bucket_obuf,
> +  BufferGetBlockNumber(bucket_obuf), NULL,
> +  maxbucket, highmask, lowmask, NULL, NULL, true,
> +  NULL, NULL);
> +LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
>
> Don't we want to do those things in the other order?
>

Earlier one of the callers was releasing the old bucket lock first, so
I just maintained that order, but I think it is better if we release
new bucket lock first. I think that will maintain the ordering and it
is better to do it before actual cleanup of the old bucket as cleanup
can take some time and there is no point in retaining a lock on the new
bucket for that time.  I will do some testing after making the change
and will post a patch in some time.

> - * which is passed in buffer nbuf, pinned and write-locked.  That lock and
> - * pin are released here.  (The API is set up this way because we must do
> - * _hash_getnewbuf() before releasing the metapage write lock.  So instead of
> - * passing the new bucket's start block number, we pass an actual buffer.)
> + * which is passed in buffer nbuf, pinned and write-locked.  The lock will be
> + * released here and pin must be released by the caller.  (The API is set up
> + * this way because we must do _hash_getnewbuf() before releasing the metapage
> + * write lock.  So instead of passing the new bucket's start block number, we
> + * pass an actual buffer.)
>
> Isn't the parenthesized part at the end of the comment wrong?  I
> realize this patch isn't editing that part anyway except for
> reflowing, but further up in that same block comment it says this:
>
>  * The caller must hold a pin, but no lock, on the metapage buffer.
>  * The buffer is returned in the same state.  (The metapage is only
>  * touched if it becomes necessary to add or remove overflow pages.)
>

The comment doesn't seem to be wrong but is not making much sense
here.  Both the call to _hash_getnewbuf() and the release of the metapage
lock are done in the caller.  In 9.6, we were passing the old bucket's block
number and new bucket's buffer, so the comment tries to explain why it
is passing buffer for the new bucket.  It makes less sense now because
we are passing buffer for both old and new buckets, so we can remove
it.  Do you want me to remove it as part of this patch or as a
separate patch?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Small code improvement for btree

2017-08-04 Thread Alvaro Herrera
Masahiko Sawada wrote:

> While hacking the btree code I found two points we can improve in nbtxlog.c.
> 
> @@ -135,7 +135,7 @@ _bt_clear_incomplete_split(XLogReaderState
> *record, uint8 block_id)
> Pagepage = (Page) BufferGetPage(buf);
> BTPageOpaque pageop = (BTPageOpaque)
> PageGetSpecialPointer(page);
> 
> -   Assert((pageop->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0);
> +Assert(P_INCOMPLETE_SPLIT(pageop) != 0);
> pageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;

Interesting.  We learned elsewhere that it's better to integrate the
"!= 0" test as part of the macro definition; so a
better formulation of this patch would be to change the
P_INCOMPLETE_SPLIT macro and omit the comparison in the Assert.  (See
commit 594e61a1de03 for an example).


> -   LockBuffer(hbuffer, BUFFER_LOCK_SHARE);
> +   LockBuffer(hbuffer, BT_READ);

I think BT_READ and BT_WRITE are useless, and I'd rather get rid of
them ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Page Scan Mode in Hash Index

2017-08-04 Thread Robert Haas
On Fri, Aug 4, 2017 at 9:44 AM, Amit Kapila  wrote:
> There is no need to use parentheses around opaque.  I mean there is no
> problem with that, but it is redundant and makes code less readable.

Amit, I'm sure you know this, but just for the benefit of anyone who doesn't:

We often include these kinds of extra parentheses in macros, for good
reason.  Suppose you have:

#define mul(x,y) x * y

If the user says mul(2+3,5), it will expand to 2 + 3 * 5 = 17, which
is wrong.  If you instead do this:

#define mul(x,y) (x) * (y)

...then mul(2+3,5) expands to (2 + 3) * (5) = 25, which is what the
user of the macro is expecting to get.

Outside of macro definitions, as you say, there's no point and we
should avoid it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-04 Thread Alvaro Herrera
Robert Haas wrote:

> I think this approach is actually better anyway.  There's no guarantee
> that VACUUM can be responsive enough to get the job done in time, work
> items or no work items,

Yeah, autovacuum work items don't have a guaranteed response time.
They're okay for things that "ought to be done eventually", but if the
condition causes a high-speed load to fail with errors, then halting the
load until the cleanup is done seems like the way to go.  Having to
randomly inject pauses in your workload so that autovacuum has time to
cope isn't great from the user's POV (a condition that is made worse if
you don't have any mechanism to detect that you need a pause, or how
long to pause for).

Once that part is working well, you could as a convenience (to avoid or
reduce the stalls some of the time) add autovacuum work-item support.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-04 Thread Robert Haas
On Fri, Aug 4, 2017 at 6:22 AM, Amit Kapila  wrote:
> I have implemented the patch with this approach as other approach
> require quite extensive changes which I am not sure is the right thing
> to do at this stage.

I think this approach is actually better anyway.  There's no guarantee
that VACUUM can be responsive enough to get the job done in time, work
items or no work items, and the split-cleanup is cheap enough that I
don't think we ought to worry about negative effects on foreground
workloads.  Sure, it could have some impact, but *less bloat* could
also have some impact in the other direction.

+hashbucketcleanup(rel, obucket, bucket_obuf,
+  BufferGetBlockNumber(bucket_obuf), NULL,
+  maxbucket, highmask, lowmask, NULL, NULL, true,
+  NULL, NULL);
+LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);

Don't we want to do those things in the other order?

- * which is passed in buffer nbuf, pinned and write-locked.  That lock and
- * pin are released here.  (The API is set up this way because we must do
- * _hash_getnewbuf() before releasing the metapage write lock.  So instead of
- * passing the new bucket's start block number, we pass an actual buffer.)
+ * which is passed in buffer nbuf, pinned and write-locked.  The lock will be
+ * released here and pin must be released by the caller.  (The API is set up
+ * this way because we must do _hash_getnewbuf() before releasing the metapage
+ * write lock.  So instead of passing the new bucket's start block number, we
+ * pass an actual buffer.)

Isn't the parenthesized part at the end of the comment wrong?  I
realize this patch isn't editing that part anyway except for
reflowing, but further up in that same block comment it says this:

 * The caller must hold a pin, but no lock, on the metapage buffer.
 * The buffer is returned in the same state.  (The metapage is only
 * touched if it becomes necessary to add or remove overflow pages.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-04 Thread Robert Haas
On Fri, Aug 4, 2017 at 6:22 AM, Amit Kapila  wrote:
> I have increased the number of hash bitmap pages as a separate patch.
> I am not completely sure if it is a good idea to directly increase it
> to 1024 as that will increase the size of HashMetaPageData from 960
> bytes to 4544 bytes.  Shall we increase it to 512?

I don't quite see what the problem is with increasing it to 4544
bytes.  What's your concern?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] expanding inheritance in partition bound order

2017-08-04 Thread Robert Haas
On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
 wrote:
> The current way to expand inherited tables, including partitioned tables,
> is to use either find_all_inheritors() or find_inheritance_children()
> depending on the context.  They return child table OIDs in the (ascending)
> order of those OIDs, which means the callers that need to lock the child
> tables can do so without worrying about the possibility of deadlock in
> some concurrent execution of that piece of code.  That's good.
>
> For partitioned tables, there is a possibility of returning child table
> (partition) OIDs in the partition bound order, which in addition to
> preventing the possibility of deadlocks during concurrent locking, seems
> potentially useful for other caller-specific optimizations.  For example,
> tuple-routing code can utilize that fact to implement binary-search based
> partition-searching algorithm.  For one more example, refer to the "UPDATE
> partition key" thread where it's becoming clear that it would be nice if
> the planner had put the partitions in bound order in the ModifyTable that
> it creates for UPDATE of partitioned tables [1].

I guess I don't really understand why we want to change the locking
order.  That is bound to make expanding the inheritance hierarchy more
expensive.  If we use this approach in all cases, it seems to me we're
bound to reintroduce the problem we fixed in commit
c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
the same vein.  But I don't see that there's any necessary relation
between the order of locking and the order of expansion inside the
relcache entry/plan/whatever else -- so I propose that we keep the
existing locking order and only change the other stuff.

While reading related code this morning, I noticed that
RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
*already* changed the locking order for certain operations, because
the PartitionDesc's OID array is bound-ordered not OID-ordered.  That
means that when RelationGetPartitionDispatchInfo uses the
PartitionDesc's OID array to figure out what to lock, it's potentially
going to encounter partitions in a different order than would have
been the case if it had used find_all_inheritors directly.  I'm
tempted to think that RelationGetPartitionDispatchInfo shouldn't
really be doing locking at all.  The best way to have the locking
always happen in the same order is to have only one piece of code that
determines that order - and I vote for find_all_inheritors.  Aside
from the fact that it's the way we've always done it (and still do it
in most other places), that code includes infinite-loop defenses which
the new code you've introduced lacks.

Concretely, my proposal is:

1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).

2. RelationGetPartitionDispatchInfo assumes the relations are already locked.

3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open().  Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it.  If not, there's no need.

4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
to this loop a check: if (!RelationLockHeldByMe(lfirst_oid(lc1)))
elog(ERROR, ...).  Might be interesting to stuff that check into the
relation_open(..., NoLock) path, too.

One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row.  However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list.  And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly.  I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE.  Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID.  It
can also return the number of initial elements of its return value
which have that flag set.

Then, in find_all_inheritors, we split rels_list into
partitioned_rels_list and other_rels_list, and process
partitioned_rels_list in its entirety before touching other_rels_list;
they get concatenated at the end.

Now, find_all_inheritors and find_inheritance_children can also grow a
flag bool only_partitioned_children; if set, then we skip the
unpartitioned children entirely.

With all 

Re: [HACKERS] UPDATE of partition key

2017-08-04 Thread Amit Khandekar
>
> Below are the TODOS at this point :
>
> Fix for bug reported by Rajkumar about update with join.

I had explained the root issue of this bug here: [1]

The attached patch includes the fix, which is explained below.
Currently in the patch, there is a check for whether the tuple is
concurrently deleted by another session, i.e. when heap_update() returns
HeapTupleUpdated. In such a case we set the concurrently_deleted output
param to true. We should also do the same for the HeapTupleSelfUpdated
return value.

In fact, there are other places in ExecDelete() where it can return
without doing anything. For example, if a BR DELETE trigger prevents the
delete from happening, ExecBRDeleteTriggers() returns false, in which
case ExecDelete() returns.

So what the fix does is: rename the concurrently_deleted parameter to
delete_skipped so as to indicate a more general status: whether the
delete has actually happened or was skipped. And set this param to
true only after the delete happens. This allows us to avoid adding a
new row for the trigger case also.

Added test scenario for UPDATE with JOIN case, and also TRIGGER case.

> Do something about two separate mapping tables for Transition tables
> and update tuple-routing.
On 1 July 2017 at 03:15, Thomas Munro  wrote:
> Would it make sense to have a set of functions with names like
> GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays
> m_convertors_{from,to}_by_{subplan,leaf} the first time they need
> them?

This was discussed here: [2]. I think even if we have them built when
needed, in the presence of both tuple routing and transition tables we
still do need separate arrays. So I think rather than dynamic arrays,
we can have static arrays whose elements point to a shared
TupleConversionMap structure whenever possible.
As the patch already stands, in case of insert/update tuple routing,
there is a per-leaf-partition mt_transition_tupconv_maps array for
transition tables, and a separate per-subplan array mt_resultrel_maps
for update tuple routing. *But*, what I am proposing is: for the
mt_transition_tupconv_maps[] element for which the leaf partition also
exists as a per-subplan result, that array element and the
mt_resultrel_maps[] element will point to the same TupleConversionMap
structure.

This is quite similar to how we are re-using the per-subplan
resultrels for the per-leaf result rels. We will re-use the
per-subplan TupleConversionMap for the per-leaf
mt_transition_tupconv_maps[] elements.

This is not yet implemented.

> GetUpdatedColumns() to be moved to header file.

Done. I have moved it into execnodes.h.

> More test scenarios in regression tests.
> Need to check/test whether we are correctly applying insert policies
> (and not update) while inserting a routed tuple.

Yet to do above two.

> Use getASTriggerResultRelInfo() for attrno mapping, rather than first
> resultrel, for generating child WCO/RETURNING expression.
>

Regarding generating child WithCheckOption and Returning expressions
using those of the root result relation, ModifyTablePath and
ModifyTable should have new fields rootReturningList (and
rootWithCheckOptions) which would be derived from
root->parse->returningList in inheritance_planner(). But then, similar
to per-subplan returningList, rootReturningList would have to pass
through set_plan_refs()=>set_returning_clause_references() which
requires the subplan targetlist to be passed. Because of this, for
rootReturningList, we require a subplan for the root partition, which
is not there currently; we have subplans only for child rels. That
means we would have to create such a plan only for the sake of
generating rootReturningList.

The other option is to do it the way the patch currently does in the
executor, by using the returningList of the first per-subplan result
rel to generate the other child returningLists (and WithCheckOptions).
This works by applying map_partition_varattnos() to the first
returningList. But now that we have realized that we have to specially
handle whole-row vars, map_partition_varattnos() would need some
changes to convert whole-row vars differently for
child-rel-to-child-rel mapping. For childrel-to-childrel conversion,
the whole-row var is already wrapped in a ConvertRowtypeExpr, but we
need to change its Var->vartype to the new child vartype.

I think the second option looks easier, but I am open to suggestions,
and I am myself still checking the first one.

> Address Robert's review comments on make_resultrel_ordered.patch.
>
> +typedef struct ParentChild
>
> This is a pretty generic name.  Pick something more specific and informative.

I have used ChildPartitionInfo. But suggestions welcome.

>
> +static List *append_rel_partition_oids(List *rel_list, Relation rel);
>
> One could be forgiven for thinking that this function was just going
> to append OIDs, but it actually appends ParentChild structures, so I
> think the name needs work.

Renamed it to append_child_partitions().

>
> +List 

Re: [HACKERS] pg_upgrade failed with error - ERROR: column "a" in child table must be marked NOT NULL

2017-08-04 Thread Michael Paquier
On Fri, Aug 4, 2017 at 5:50 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> So I think that the attached patch is able to do the legwork.
>
> I've pushed this into HEAD.  It seems like enough of a behavioral
> change that we wouldn't want to back-patch, but IMO it's not too late
> to be making this type of change in v10.  If we wait for the next CF
> then it will take another year for the fix to reach the field.

Thanks for applying the fix. My intention in adding that to a CF was
to make sure it did not get lost.

>> While
>> looking at the code, I have bumped into index_check_primary_key() that
>> discarded the case of sending SET NOT NULL to child tables, as
>> introduced by 88452d5b. But that's clearly an oversight IMO, and the
>> comment is wrong to begin with because SET NOT NULL is spread to child
>> tables. Using is_alter_table instead of a plain true in
>> index_check_primary_key() for the call of AlterTableInternal() is
>> defensive, but I don't think that we want to impact any modules
>> relying on this API, so recursing only when ALTER TABLE is used is the
>> safest course of action to me.
>
> I didn't find that persuasive: I think passing "recurse" as just plain
> "true" is safer and more readable.  We shouldn't be able to get there
> in a CREATE case, as per the comments; and if we did, there shouldn't
> be any child tables anyway; but if somehow there were, surely the same
> consistency argument would imply that we *must* recurse and fix those
> child tables.  So I just made it "true".

Yeah, that matches what I read as well. No complaints about switching
to a plain "true" even if it is not reached yet.
-- 
Michael




Re: [HACKERS] Subscription code improvements

2017-08-04 Thread Masahiko Sawada
On Fri, Aug 4, 2017 at 2:17 AM, Peter Eisentraut
 wrote:
> On 7/13/17 23:53, Masahiko Sawada wrote:
>> To summary, I think we now have only one issue; ALTER SUBSCRIPTION is
>> not transactional; the 0004 patch is addressing this issue.
>
> We have two competing patches for this issue.  This patch moves the
> killing to the end of the DDL transaction.  Your earlier patch makes the
> tablesync work itself responsible for exiting.  Do you wish to comment
> which direction to pursue?  (Doing both might also be an option?)
>

To make ALTER SUBSCRIPTION REFRESH transactional, I prefer Petr's
proposal, because it can make ALTER SUBSCRIPTION and DROP SUBSCRIPTION
stop the table sync workers that are in the middle of copying data.
I'm not sure killing table sync workers in DDL commands would be
acceptable, but since it can free unnecessary logical replication
worker slots and replication slots, I'd prefer this idea.

However, even with this patch there is still an issue that NOTICE
messages like "removed subscription for table public.t1" can appear
even if we roll back the ALTER SUBSCRIPTION REFRESH command, as I
mentioned on an earlier thread. Since I think this behaviour will
confuse users who check server logs, I'd like to deal with it, though
I don't have a good idea how yet.

Also, I think we can incorporate the idea of my earlier proposal with
some changes (i.e., I'd choose the third option). In the current
implementation, in the case where a subscription relation state is
accidentally removed while the corresponding table sync worker is in
the middle of copying data, the worker cannot exit from the loop in
wait_for_worker_state_change unless the apply worker dies. So, to be
more robust, a table sync worker could finish with an error if its
subscription relation state has disappeared.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] pg_upgrade failed with error - ERROR: column "a" in child table must be marked NOT NULL

2017-08-04 Thread Tom Lane
Michael Paquier  writes:
> So I think that the attached patch is able to do the legwork.

I've pushed this into HEAD.  It seems like enough of a behavioral
change that we wouldn't want to back-patch, but IMO it's not too late
to be making this type of change in v10.  If we wait for the next CF
then it will take another year for the fix to reach the field.

> While
> looking at the code, I have bumped into index_check_primary_key() that
> discarded the case of sending SET NOT NULL to child tables, as
> introduced by 88452d5b. But that's clearly an oversight IMO, and the
> comment is wrong to begin with because SET NOT NULL is spread to child
> tables. Using is_alter_table instead of a plain true in
> index_check_primary_key() for the call of AlterTableInternal() is
> defensive, but I don't think that we want to impact any modules
> relying on this API, so recursing only when ALTER TABLE is used is the
> safest course of action to me.

I didn't find that persuasive: I think passing "recurse" as just plain
"true" is safer and more readable.  We shouldn't be able to get there
in a CREATE case, as per the comments; and if we did, there shouldn't
be any child tables anyway; but if somehow there were, surely the same
consistency argument would imply that we *must* recurse and fix those
child tables.  So I just made it "true".

regards, tom lane




Re: [HACKERS] Default Partition for Range

2017-08-04 Thread Robert Haas
On Mon, Jul 31, 2017 at 8:28 AM, Beena Emerson  wrote:
> Thanks for informing.
> PFA the updated patch.
> I have changed the numbering of enum PartitionRangeDatumKind since I
> have to include DEFAULT as well. If you have better ideas, let me
> know.

Why do we need to introduce PARTITION_RANGE_DATUM_DEFAULT at all?  It
seems to me that the handling of default range partitions ought to be
similar to the way a null-accepting list partition is handled -
namely, it wouldn't show up in the "datums" or "kind" array at all,
instead just showing up in PartitionBoundInfoData's default_index
field.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page Scan Mode in Hash Index

2017-08-04 Thread Amit Kapila
On Sun, Jul 30, 2017 at 2:07 PM, Ashutosh Sharma  wrote:
> Hi,
>
> On Wed, May 10, 2017 at 2:28 PM, Ashutosh Sharma  
> wrote:
>> While doing the code coverage testing of v7 patch shared with - [1], I
>> found that there are a few lines of code in _hash_next() that are
>> redundant and need to be removed. I basically came to know this while
>> testing the scenario where a hash index scan starts when a split is in
>> progress. I have removed those lines and attached is the newer version
>> of patch.
>>
>
> Please find the new version of patches for page at a time scan in hash
> index rebased on top of the latest commit in the master branch. Also, I
> have run the pgindent script with pg_bsd_indent version 2.0 on all the
> modified files. Thanks.
>

Few comments:
1.
+_hash_kill_items(IndexScanDesc scan, bool havePin)

I think you can do without the second parameter.  Can't we detect
inside _hash_kill_items whether the page is pinned or not as we do for
btree?

2.
+ /*
+ * We remember prev and next block number along with current block
+ * number so that if fetching the tup- les using cursor we know
+ * the page from where to startwith. This is for the case where we
+ * have re- ached the end of bucket chain without finding any
+ * matching tuples.

The spellings of "tuples" and "reached" contain some unwanted symbols.
Also, add a space in "startwith", or just use "begin".

3.
+ if (!BlockNumberIsValid((opaque)->hasho_nextblkno))
+ {
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ so->currPos.nextPage = InvalidBlockNumber;
+ }

There is no need to use parentheses around opaque.  I mean there is no
problem with that, but it is redundant and makes the code less readable.
Also, there is similar usage at other places in the code; please
change all the other places as well.  I think you can save the value of
prevblkno in a local variable and use it in the else part.

4.
+ if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+ _hash_checkqual(scan, itup))
+ {
+ /* tuple is qualified, so remember it */
+ _hash_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+
+ /*
+ * No more matching tuples exist in this page. so, exit while
+ * loop.
+ */
+ break;

It is better to have braces for the else part; it makes the code look
better.  The same kind of code exists a few lines down as well; change
that too.

5.
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later.
+ */

"whether it safe"/"whether it is safe"

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] pgbench: Skipping the creating primary keys after initialization

2017-08-04 Thread Masahiko Sawada
On Fri, Aug 4, 2017 at 3:24 AM, Fabien COELHO  wrote:
>
>>> For the CREATE stuff, the script language is SQL, the command to use it
>>> is
>>> "psql"...
>>
>>
>>> The real and hard part is to fill tables with meaningful pseudo-random
>>> test data which do not violate constraints for any non trivial schema
>>> involving foreign keys and various unique constraints.
>>
>>
>>> The solution for this is SQL for trivial cases, think of:
>>>"INSERT INTO Foo() SELECT ... FROM generate_series(...);"
>>
>>
>> Yeah.  I was also thinking that complicated data-generation requirements
>> could be handled with plpgsql DO blocks, avoiding the need for hard-wired
>> code inside pgbench.
>
>
> I do not think that it is really needed for what pgbench does, though.
> See attached attempt, including a no_foreign_keys option.
>
> The only tricky thing is to have the elapsed/remaining advancement report on
> stdout, maybe with some PL/pgSQL.
>
> Timings are very similar compared to "pgbench -i".
>

Generating data with plpgsql DO blocks means that we do the data
generation on the server side rather than on the client side. I think
that's preferable in a sense, because it could speed up initialization
time by reducing network traffic.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] expanding inheritance in partition bound order

2017-08-04 Thread Ashutosh Bapat
On Fri, Aug 4, 2017 at 1:08 PM, Amit Langote
 wrote:
> The current way to expand inherited tables, including partitioned tables,
> is to use either find_all_inheritors() or find_inheritance_children()
> depending on the context.  They return child table OIDs in the (ascending)
> order of those OIDs, which means the callers that need to lock the child
> tables can do so without worrying about the possibility of deadlock in
> some concurrent execution of that piece of code.  That's good.
>
> For partitioned tables, there is a possibility of returning child table
> (partition) OIDs in the partition bound order, which in addition to
> preventing the possibility of deadlocks during concurrent locking, seems
> potentially useful for other caller-specific optimizations.  For example,
> tuple-routing code can utilize that fact to implement binary-search based
> partition-searching algorithm.  For one more example, refer to the "UPDATE
> partition key" thread where it's becoming clear that it would be nice if
> the planner had put the partitions in bound order in the ModifyTable that
> it creates for UPDATE of partitioned tables [1].

Thanks a lot for working on this. Partition-wise join can benefit from
this as well. See comment about build_simple_rel's matching algorithm
in [1]. It will become O(n) instead of O(n^2).

>
> So attached are two WIP patches:
>
> 0001 implements two interface functions:
>
>   List *get_all_partition_oids(Oid, LOCKMODE)
>   List *get_partition_oids(Oid, LOCKMODE)
>
> that resemble find_all_inheritors() and find_inheritance_children(),
> respectively, but expect that users call them only for partitioned tables.
>  Needless to mention, OIDs are returned with canonical order determined by
> that of the partition bounds and the way the partition tree structure is
> traversed (top-down, breadth-first-left-to-right).  Patch replaces all the
> calls of the old interface functions with the respective new ones for
> partitioned table parents.  That means expand_inherited_rtentry (among
> others) now calls get_all_partition_oids() if the RTE is for a partitioned
> table and find_all_inheritors() otherwise.
>
> In its implementation, get_all_partition_oids() calls
> RelationGetPartitionDispatchInfo(), which is useful to generate the result
> list in the desired partition bound order.  But the current interface and
> implementation of RelationGetPartitionDispatchInfo() needs some rework,
> because it's too closely coupled with the executor's tuple routing code.

Maybe we want to implement get_all_partition_oids() by calling
get_partition_oids() on each new entry that gets added, similar to
find_all_inheritors(). That might avoid the changes to
RelationGetPartitionDispatchInfo() and also the dependency on that
structure.

Also, almost every place which called find_all_inheritors() or
find_inheritance_children() is changed to an if () else case calling
those functions or the new function as required. Maybe we should
create macros/functions to do that routing to keep the code readable.
Maybe find_all_inheritors() and find_inheritance_children()
themselves become the routing functions and their original code moves
into new functions get_all_inheritors() and
get_inheritance_children(). We may choose other names for the
functions. The idea is to create routing functions/macros instead of
sprinkling the code with if () else conditions.

[1] 
https://www.postgresql.org/message-id/ca+tgmoberutu4osxa_ua4aorho83wxavfg8n1nqcofuujbe...@mail.gmail.com
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] More flexible LDAP auth search filters?

2017-08-04 Thread Mark Cave-Ayland
On 01/08/17 23:17, Thomas Munro wrote:

> On Wed, Aug 2, 2017 at 5:36 AM, Peter Eisentraut
>  wrote:
>> On 7/16/17 19:09, Thomas Munro wrote:
>>> On Mon, Jul 17, 2017 at 10:26 AM, Thomas Munro
>>>  wrote:
 ldap-search-filters-v2.patch
>>>
>>> Gah, it would help if I could spell "occurrences" correctly.  Fixed in
>>> the attached.
>>
>> Please also add the corresponding support for specifying search filters
>> in LDAP URLs.  See RFC 4516 for the format and
>> https://linux.die.net/man/3/ldap_url_parse for the API.  You might just
>> need to grab LDAPURLDesc.lud_filter and use it.
> 
> Good idea.  Yes, it seems to be that simple.  Here's a version like
> that.  Here's an example of how it looks in pg_hba.conf:
> 
> host   all all  127.0.0.1/32ldap
> ldapurl="ldap://localhost/ou=people1,dc=my-domain,dc=com??sub?(cn=%25u)"
> 
> Maybe we could choose a better token than %u for user name, since it
> has to be escaped when included in a URL like that, but on the other
> hand there seems to be wide precedent for %u in other software.

Yeah, mostly I only ever see LDAP URLs used programmatically, i.e. the
configuration allows you to set the various fields separately and then
the software generates the URL with the correct encoding itself. But if
it's documented, that's not a reason to reject the patch, as I
definitely see it as an improvement.

As I mentioned previously in the thread, the main barrier preventing
people from using LDAP is that the role cannot be generated from other
attributes in the directory. In a lot of real-life cases I see, that
would be enough to discount PostgreSQL's LDAP authentication completely.


ATB,

Mark.




Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-04 Thread Amit Kapila
On Fri, Aug 4, 2017 at 9:19 AM, AP  wrote:
> On Fri, Aug 04, 2017 at 08:21:01AM +0530, Amit Kapila wrote:
>> Note - AP has off list shared the data dump and we (Ashutosh Sharma
>> and me) are able to reproduce the problem and we could see that if we
>> force vacuum via the debugger, then it is able to free overflow pages.
>> The exact numbers are not available at this stage as the test is not
>> complete.
>
> I've another if you would like it. I COPYed with FILLFACTOR of 10 and
> it eventually failed but I could not recreate the index (via CREATE INDEX
> CONCURRENTLY) with the data that made it using a fillfactor of 100. If
> I created the index again (again with the same data) with fillfactor 10
> then it completed.
>
> I may be completely misunderstanding fillfactor but I always thought it was
> a performance optimisation rather than something that may allow you to store
> more (or less) index entries.
>

It impacts the split behavior.  You can read about it at:
https://www.postgresql.org/docs/9.6/static/sql-createindex.html


>
> The DB in question is now gone but I took a copy of the column as per
> before so if you'd like it I can make it available via the same means.
>

Can you try with the patches posted above?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-04 Thread AP
On Fri, Aug 04, 2017 at 08:21:01AM +0530, Amit Kapila wrote:
> Note - AP has off list shared the data dump and we (Ashutosh Sharma
> and me) are able to reproduce the problem and we could see that if we
> force vacuum via the debugger, then it is able to free overflow pages.
> The exact numbers are not available at this stage as the test is not
> complete.

I've another if you would like it. I COPYed with FILLFACTOR of 10 and
it eventually failed but I could not recreate the index (via CREATE INDEX
CONCURRENTLY) with the data that made it using a fillfactor of 100. If
I created the index again (again with the same data) with fillfactor 10
then it completed.

I may be completely misunderstanding fillfactor but I always thought it was
a performance optimisation rather than something that may allow you to store
more (or less) index entries.

The stats for the various indexes are:

After COPYs started failing:
fmmdstash=# select overflow_pages/bitmap_pages/8,* from pgstathashindex('link_datum_id_idx');
 ?column? | version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |   free_percent
----------+---------+--------------+----------------+--------------+--------------+------------+------------+------------------
     4095 |       3 |    103782169 |        4194176 |          128 |     13658343 | 5085570007 |          0 | 21014.6558371539
(1 row)

Time: 6146310.494 ms (01:42:26.310)

After the CREATE INDEX CONCURRENTLY with FILLFACTOR 100 failed:
fmmdstash=# select overflow_pages/bitmap_pages/8,* from pgstathashindex('link_datum_id_idx1');
 ?column? | version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |  free_percent
----------+---------+--------------+----------------+--------------+--------------+------------+------------+-----------------
     4095 |       3 |      6205234 |        4194176 |          128 |        86222 | 3080760746 |          0 | 615.91682922039
(1 row)

Time: 19128.527 ms (00:19.129)

After the CREATE INDEX CONCURRENTLY with FILLFACTOR 10 succeeded:
fmmdstash=# select overflow_pages/bitmap_pages/8,* from pgstathashindex('link_datum_id_idx2');
 ?column? | version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |   free_percent
----------+---------+--------------+----------------+--------------+--------------+------------+------------+------------------
     3062 |       3 |     79677471 |        2572565 |          105 |      5074888 | 3187098806 |          0 | 19027.2399324415
(1 row)

Time: 1557509.940 ms (25:57.510)

The DB in question is now gone but I took a copy of the column as per
before so if you'd like it I can make it available via the same means.

AP




Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-04 Thread Amit Kapila
On Fri, Aug 4, 2017 at 8:21 AM, Amit Kapila  wrote:
> On Wed, Aug 2, 2017 at 9:04 PM, Robert Haas  wrote:
>> On Wed, Jul 12, 2017 at 1:10 AM, Amit Kapila  wrote:
> Yes, I also think the same idea can be used, in fact, I have mentioned
> it [1] as soon as you have committed that patch.  Do we want to do
> anything at this stage for PG-10?  I don't think we should attempt
> something this late unless people feel this is a show-stopper issue
> for usage of hash indexes.  If required, I think a separate function
> can be provided to allow users to perform squeeze operation.

 Sorry, I have no idea how critical this squeeze thing is for the
 newfangled hash indexes, so I cannot comment on that.  Does this make
 the indexes unusable in some way under some circumstances?
>>>
>>> It seems so.  Basically, in the case of a large number of duplicates,
>>> we hit the maximum number of overflow pages.  There is a theoretical
>>> possibility of hitting it, but it could also happen that we are not
>>> freeing the existing unused overflow pages, due to which the count
>>> keeps growing until it hits the limit.  I have requested upthread that
>>> someone verify whether that is happening in this case and I am still
>>> waiting for the same.  The squeeze operation does free such unused
>>> overflow pages after cleaning them.  As this is a costly operation and
>>> needs a cleanup lock, we currently perform it only during vacuum and
>>> at the next split of a bucket which may have redundant overflow pages.
>>
>> Oops.  It was rather short-sighted of us not to increase
>> HASH_MAX_BITMAPS when we bumped HASH_VERSION.  Actually removing that
>> limit is hard, but we could have easily bumped it from 128 to, say,
>> 1024 without (I think) causing any problem, which would have given us
>> quite a bit of headroom here.
>
> Yes, that sounds sensible, but I think it will just delay the problem.
> I think the actual problem here is that we are not able to perform the
> squeeze operation often enough for it to free the overflow pages.
> Currently, we try to perform the squeeze only at the start of the next
> split of the bucket or during vacuum.  The reason for doing it that
> way was that the squeeze operation needs a cleanup lock, and we
> already hold one at the start of a split and during vacuum. Now, to
> solve it I have already speculated a few ways above [1], and among
> those it is feasible to do this at the end of a split, which can have
> performance implications in some workloads but will work fine for the
> case reported in this thread
>

I have implemented the patch with this approach, as the other
approaches require quite extensive changes, which I am not sure is the
right thing to do at this stage.

>
> I think we can fix it in one of above ways and increase the value of
> HASH_MAX_BITMAPS as well.
>

I have increased the number of hash bitmap pages in a separate patch.
I am not completely sure it is a good idea to directly increase it to
1024, as that would increase the size of HashMetaPageData from 960
bytes to 4544 bytes.  Shall we increase it to 512 instead?



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_cleanup_bucket_after_split_v1.patch

increase_hash_max_bitmaps_v1.patch



Re: [HACKERS] reload-through-the-top-parent switch the partition table

2017-08-04 Thread Rushabh Lathia
Here is an updated patch.  I have renamed the switch to
--load-via-partition-root
and also added documentation for the new switch to pg_dump as well
as pg_dumpall.


On Fri, Aug 4, 2017 at 7:13 AM, Amit Langote 
wrote:

> On 2017/08/04 1:08, David G. Johnston wrote:
> > On Thu, Aug 3, 2017 at 8:53 AM, Tom Lane  wrote:
> >
> >> Robert Haas  writes:
> >>> So maybe --load-via-partition-root if nobody likes my previous
> >>> suggestion of --partition-data-via-root ?
> >>
> >> WFM.
> >>
> >
> > ​+1
>
> +1.
>
> Thanks,
> Amit
>
>


Thanks,
Rushabh Lathia
www.EnterpriseDB.com
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index bafa031..65f7e98 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -889,6 +889,16 @@ PostgreSQL documentation
  
 
  
+  --load-via-partition-root
+  
+   
+Dump data via the top-most partitioned table (rather than partition
+table) when dumping data for the partition table.
+   
+  
+ 
+
+ 
--section=sectionname

  
diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml
index aa944a2..54b8e7e 100644
--- a/doc/src/sgml/ref/pg_dumpall.sgml
+++ b/doc/src/sgml/ref/pg_dumpall.sgml
@@ -431,6 +431,16 @@ PostgreSQL documentation
  
 
  
+  --load-via-partition-root
+  
+   
+Dump data via the top-most partitioned table (rather than partition
+table) when dumping data for the partition table.
+   
+  
+ 
+
+ 
   --use-set-session-authorization
   

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 144068a..ce3100f 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -157,6 +157,7 @@ typedef struct _dumpOptions
 	int			outputNoTablespaces;
 	int			use_setsessauth;
 	int			enable_row_security;
+	int			load_via_partition_root;
 
 	/* default, if no "inclusion" switches appear, is to dump everything */
 	bool		include_everything;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 393b9e2..27e23ca 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -269,6 +269,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 		const char *prefix, Archive *fout);
 static char *get_synchronized_snapshot(Archive *fout);
 static void setupDumpWorker(Archive *AHX);
+static TableInfo *getRootTableInfo(TableInfo *tbinfo);
 
 
 int
@@ -345,6 +346,7 @@ main(int argc, char **argv)
 		{"lock-wait-timeout", required_argument, NULL, 2},
 		{"no-tablespaces", no_argument, , 1},
 		{"quote-all-identifiers", no_argument, _all_identifiers, 1},
+		{"load-via-partition-root", no_argument, _via_partition_root, 1},
 		{"role", required_argument, NULL, 3},
 		{"section", required_argument, NULL, 5},
 		{"serializable-deferrable", no_argument, _deferrable, 1},
@@ -959,6 +961,7 @@ help(const char *progname)
 	printf(_("  --no-tablespaces do not dump tablespace assignments\n"));
 	printf(_("  --no-unlogged-table-data do not dump unlogged table data\n"));
 	printf(_("  --quote-all-identifiers  quote all identifiers, even if not key words\n"));
+	printf(_("  --load-via-partition-rootload partition table via the root relation\n"));
 	printf(_("  --section=SECTIONdump named section (pre-data, data, or post-data)\n"));
 	printf(_("  --serializable-deferrablewait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT  use given snapshot for the dump\n"));
@@ -1902,8 +1905,32 @@ dumpTableData_insert(Archive *fout, void *dcontext)
 			if (insertStmt == NULL)
 			{
 insertStmt = createPQExpBuffer();
+
+/*
+ * When load-via-partition-root is set, get the root relation name
+ * for the partition table, so that we can reload data through
+ * the root table.
+ */
+if (dopt->load_via_partition_root && tbinfo->ispartition)
+{
+	TableInfo  *parentTbinfo;
+
+	parentTbinfo = getRootTableInfo(tbinfo);
+
+	/*
+	 * When we loading data through the root, we will qualify
+	 * the table name. This is needed because earlier
+	 * search_path will be set for the partition table.
+	 */
+	classname = (char *) fmtQualifiedId(fout->remoteVersion,
+		parentTbinfo->dobj.namespace->dobj.name,
+		parentTbinfo->dobj.name);
+}
+else
+	classname = fmtId(tbinfo->dobj.name);
+
 appendPQExpBuffer(insertStmt, "INSERT INTO %s ",
-  fmtId(classname));
+  classname);
 
 /* corner case for zero-column table */
 if (nfields == 0)
@@ -2025,6 +2052,20 @@ dumpTableData_insert(Archive *fout, void *dcontext)
 	return 1;
 }
 
+/*
+ * getRootTableInfo:
+ * get the root TableInfo for the given partition table.
+ */
+static TableInfo *

[HACKERS] scan on inheritance parent with no children in current session

2017-08-04 Thread Ashutosh Bapat
Hi All,
Consider a parent table which has no child in the current session, but
has temporary children in other sessions.

Session 1
postgres=# create table parent (a int);
CREATE TABLE

Session 2:
postgres=# create temp table temp_child () inherits(parent);
CREATE TABLE

Before commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50, we would get a
Seq Scan plan for scanning parent in session 1:
postgres=# explain verbose select * from parent;
 QUERY PLAN
-
 Seq Scan on public.parent  (cost=0.00..0.00 rows=1 width=4)
   Output: parent.a
(2 rows)

In session 2, it would be an Append plan
postgres=# explain verbose select * from parent;
  QUERY PLAN
--
 Append  (cost=0.00..35.50 rows=2551 width=4)
   ->  Seq Scan on public.parent  (cost=0.00..0.00 rows=1 width=4)
 Output: parent.a
   ->  Seq Scan on pg_temp_4.temp_child  (cost=0.00..35.50 rows=2550 width=4)
 Output: temp_child.a
(5 rows)

After that commit in session 1, we get an Append plan
postgres=# explain verbose select * from parent;
QUERY PLAN
---
 Append  (cost=0.00..0.00 rows=1 width=4)
   ->  Seq Scan on public.parent  (cost=0.00..0.00 rows=1 width=4)
 Output: parent.a
(3 rows)

I don't think this is an intentional change. Here's a patch to fix it.
The comment in the patch uses the term "real child" in the context of
the comments about temporary children from other sessions and of the
comment at the end of the function where rte->inh is reset. Maybe we
should move the second comment to before where has_child is set in the
patch, and use "real child" in the comment at the end to avoid
repetition. But I first want to check whether we want this fix at all,
or whether we can live with the Append plan.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index cf46b74..b640639 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1374,7 +1374,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	List	   *inhOIDs;
 	List	   *appinfos;
 	ListCell   *l;
-	bool		need_append;
+	bool		has_child;
 	PartitionedChildRelInfo *pcinfo;
 	List	   *partitioned_child_rels = NIL;
 
@@ -1448,7 +1448,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 
 	/* Scan the inheritance set and expand it */
 	appinfos = NIL;
-	need_append = false;
+	has_child = false;
 	foreach(l, inhOIDs)
 	{
 		Oid			childOID = lfirst_oid(l);
@@ -1502,7 +1502,10 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		 */
 		if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
 		{
-			need_append = true;
+			/* Remember if we saw a real child. */
+			if (childOID != parentOID)
+has_child = true;
+
 			appinfo = makeNode(AppendRelInfo);
 			appinfo->parent_relid = rti;
 			appinfo->child_relid = childRTindex;
@@ -1582,7 +1585,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	 * the parent table is harmless, so we don't bother to get rid of it;
 	 * ditto for the useless PlanRowMark node.
 	 */
-	if (!need_append)
+	if (!has_child)
 	{
 		/* Clear flag before returning */
 		rte->inh = false;



[HACKERS] Small code improvement for btree

2017-08-04 Thread Masahiko Sawada
Hi,

While hacking the btree code I found two points we can improve in nbtxlog.c.

@@ -135,7 +135,7 @@ _bt_clear_incomplete_split(XLogReaderState
*record, uint8 block_id)
Pagepage = (Page) BufferGetPage(buf);
BTPageOpaque pageop = (BTPageOpaque)
PageGetSpecialPointer(page);

-   Assert((pageop->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0);
+Assert(P_INCOMPLETE_SPLIT(pageop) != 0);
pageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;

PageSetLSN(page, lsn);

We can use P_INCOMPLETE_SPLIT() instead here.

---
@@ -598,7 +598,7 @@
btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record)
UnlockReleaseBuffer(ibuffer);
return InvalidTransactionId;
}
-   LockBuffer(hbuffer, BUFFER_LOCK_SHARE);
+   LockBuffer(hbuffer, BT_READ);
hpage = (Page) BufferGetPage(hbuffer);

/*

We should use BT_READ here rather than BUFFER_LOCK_SHARE.

Since such code is sometimes hard to find with grep, it could become a
source of bugs when changing the code.
The attached small patch fixes these issues.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


code_improvement_for_btree.patch



Re: [HACKERS] UPDATE of partition key

2017-08-04 Thread Amit Langote
On 2017/08/02 19:49, Amit Khandekar wrote:
> On 2 August 2017 at 14:38, Amit Langote  wrote:
>>> One approach I had considered was to have find_inheritance_children()
>>> itself lock the children in bound order, so that everyone will have
>>> bound-ordered oids, but that would be too expensive since it requires
>>> opening all partitioned tables to initialize partition descriptors. In
>>> find_inheritance_children(), we get all oids without opening any
>>> tables. But now that I think more of it, it's only the partitioned
>>> tables that we have to open, not the leaf partitions; and furthermore,
>>> I didn't see calls to find_inheritance_children() and
>>> find_all_inheritors() in performance-critical code, except in
>>> expand_inherited_rtentry(). All of them are in DDL commands; but yes,
>>> that can change in the future.
>>
>> This approach more or less amounts to calling the new
>> RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
>> which I posted upthread.)  Maybe we can add a wrapper on top, say,
>> get_all_partition_oids() which throws away other things that
>> RelationGetPartitionDispatchInfo() returned.  In addition it locks all the
>> partitions that are returned, unlike only the partitioned ones, which is
>> what RelationGetPartitionDispatchInfo() has been taught to do.
> 
> So there are three different task items here :
> 1. Arrange the oids in consistent order everywhere.
> 2. Prepare the Partition Dispatch Info data structure in the planner
> as against during execution.
> 3. For update tuple routing, assume that the result rels are ordered
> consistently to make the searching efficient.

That's a good breakdown.

> #3 depends on #1. So for that, I have come up with a minimum set of
> changes to have expand_inherited_rtentry() generate the rels in bound
> order. When we do #2 , it may be possible that we may need to re-do my
> changes in expand_inherited_rtentry(), but those are minimum. We may
> even end up having the walker function being used at multiple places,
> but right now it is not certain.

So AFAICS:

For performance reasons, we want the order in which leaf partition
sub-plans appear in the ModifyTable node (and subsequently the leaf
partition ResultRelInfos in the ModifyTableState) to be some known
canonical order.  That's because we want to map partitions in the insert
tuple-routing data structure (which appear in a known canonical order as
determined by RelationGetPartitionDispatchInfo) to those appearing in
the ModifyTableState.  That's so that we can reuse the planner-generated
WCO and RETURNING lists in the insert code path when update
tuple-routing invokes that path.

To implement that, planner should retrieve the list of leaf partition OIDs
in the same order as ExecSetupPartitionTupleRouting() retrieves them.
Because the latter calls RelationGetPartitionDispatchInfo on the root
partitioned table, maybe the planner should do that too, instead of its
current method of getting OIDs using find_all_inheritors().  But it's
currently not possible due to the way RelationGetPartitionDispatchInfo()
and involved data structures are designed.

One way forward I see is to invent new interface functions:

  List *get_all_partition_oids(Oid, LOCKMODE)
  List *get_partition_oids(Oid, LOCKMODE)

that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect users to make sure that they are called only
for partitioned tables.  Needless to mention, OIDs are returned in the
canonical order determined by that of the partition bounds and the partition
tree structure.  We replace all the calls of the old interface functions
with the respective new ones.  That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.

> So, I think we can continue the discussion about #1 and #2 in a separate 
> thread.

I have started a new thread named "expanding inheritance in partition
bound order" and posted a couple of patches [1].

After applying those patches, you can write code for #3 without having to
worry about the concerns of partition order, which I guess you've already
done.

>>> Regarding dynamically locking specific partitions as and when needed,
>>> I think this method inherently has the issue of deadlock because the
>>> order would be random. So it feels like there is no way around other
>>> than to lock all partitions beforehand.
>>
>> I'm not sure why the order has to be random.  If and when we decide to
>> open and lock a subset of partitions for a given query, it will be done in
>> some canonical order as far as I can imagine.  Do you have some specific
>> example in mind?
> 
> Partitioned table t1 has partitions t1p1 and t1p2
> Partitioned table t2 at the same level has partitions t2p1 and t2p2
> Tuple routing causes the first row to insert into t2p2, so t2p2 is locked.
> Next insert locks t1p1 because it inserts into t1p1.
> 

[HACKERS] expanding inheritance in partition bound order

2017-08-04 Thread Amit Langote
The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context.  They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code.  That's good.

For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations.  For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm.  For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].

So attached are two WIP patches:

0001 implements two interface functions:

  List *get_all_partition_oids(Oid, LOCKMODE)
  List *get_partition_oids(Oid, LOCKMODE)

that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect that users call them only for partitioned tables.
 Needless to mention, OIDs are returned in the canonical order determined
by that of the partition bounds and the way the partition tree structure is
traversed (top-down, breadth-first, left-to-right).  The patch replaces all the
calls of the old interface functions with the respective new ones for
partitioned table parents.  That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.

In its implementation, get_all_partition_oids() calls
RelationGetPartitionDispatchInfo(), which is useful to generate the result
list in the desired partition bound order.  But the current interface and
implementation of RelationGetPartitionDispatchInfo() needs some rework,
because it's too closely coupled with the executor's tuple routing code.

Applying just 0001 will satisfy the requirements stated in [1], but it
won't look pretty as is for too long.

So, 0002 is a patch to refactor RelationGetPartitionDispatchInfo() and
relevant data structures.  For example, PartitionDispatchData has now been
simplified to contain only the partition key, partition descriptor and
indexes array, whereas previously it also stored the relation descriptor,
partition key execution state, tuple table slot, tuple conversion map
which are required for tuple-routing.  RelationGetPartitionDispatchInfo()
no longer generates those things, but returns just enough information so
that a caller can generate and manage those things by itself.  This
simplification makes it less cumbersome to call
RelationGetPartitionDispatchInfo() in other places.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoajC0J50%3D2FqnZLvB10roY%2B68HgFWhso%3DV_StkC6PWujQ%40mail.gmail.com
From 9674053fd1e57a480d8a42585cb10421e2c76a70 Mon Sep 17 00:00:00 2001
From: amit 
Date: Wed, 2 Aug 2017 17:14:59 +0900
Subject: [PATCH 1/3] Add get_all_partition_oids and get_partition_oids

Their respective counterparts find_all_inheritors() and
find_inheritance_children() read the pg_inherits catalog directly and
frame the result list in some order determined by the order of OIDs.

get_all_partition_oids() and get_partition_oids() form their result
by reading the partition OIDs from the PartitionDesc contained in the
relcache.  Hence, the order of OIDs in the resulting list is based
on that of the partition bounds.  In the case of get_all_partition_oids
which traverses the whole-tree, the order is also determined by the
fact that the tree is traversed in a breadth-first manner.
---
 contrib/sepgsql/dml.c  |   4 +-
 src/backend/catalog/partition.c|  84 ++
 src/backend/commands/analyze.c |   8 ++-
 src/backend/commands/lockcmds.c|   6 +-
 src/backend/commands/publicationcmds.c |   9 ++-
 src/backend/commands/tablecmds.c   | 124 +
 src/backend/commands/vacuum.c  |   7 +-
 src/backend/optimizer/prep/prepunion.c |   6 +-
 src/include/catalog/partition.h|   3 +
 9 files changed, 213 insertions(+), 38 deletions(-)

diff --git a/contrib/sepgsql/dml.c b/contrib/sepgsql/dml.c
index b643720e36..62d6610c43 100644
--- a/contrib/sepgsql/dml.c
+++ b/contrib/sepgsql/dml.c
@@ -332,8 +332,10 @@ sepgsql_dml_privileges(List *rangeTabls, bool abort_on_violation)
 */
if (!rte->inh)
tableIds = list_make1_oid(rte->relid);
-   else
+   else if (rte->relkind != RELKIND_PARTITIONED_TABLE)
tableIds = find_all_inheritors(rte->relid, 

Re: [HACKERS] Support for Secure Transport SSL library on macOS as OpenSSL alternative

2017-08-04 Thread Michael Paquier
On Thu, Aug 3, 2017 at 11:26 PM, Daniel Gustafsson  wrote:
>> On 03 Aug 2017, at 19:27, Michael Paquier  wrote:
>> There were no APIs to get the TLS finish message last time I looked at OSX
>> stuff, which mattered for tls-unique.  It would be nice if we could get one.
>
> Yeah, AFAICT there is no API for that.

Perhaps my last words were confusing. I meant that it would be nice to
get at least one type of channel binding working.
-- 
Michael




Re: [HACKERS] Quorum commit for multiple synchronous replication.

2017-08-04 Thread Masahiko Sawada
On Fri, Jul 28, 2017 at 2:24 PM, Noah Misch  wrote:
> On Thu, Apr 06, 2017 at 08:55:37AM +0200, Petr Jelinek wrote:
>> On 06/04/17 03:51, Noah Misch wrote:
>> > On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:
>> >> On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch  wrote:
>> >>> On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:
>>  Regarding this feature, there are some loose ends. We should work on
>>  and complete them until the release.
>> 
>>  (1)
>>  Which synchronous replication method, priority or quorum, should be
>>  chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
>>  a priority-based sync replication is chosen for keeping backward
>>  compatibility. However some hackers argued to change this decision
>>  so that a quorum commit is chosen because they think that most users
>>  prefer to a quorum.
>
>> >> The items (1) and (3) are not bugs. So I don't think that they need to be
>> >> resolved before the beta release. After the feature freeze, many users
>> >> will try and play with many new features including quorum-based syncrep.
>> >> Then if many of them complain about (1) and (3), we can change the code
>> >> at that timing. So we need more time that users can try the feature.
>> >
>> > I've moved (1) to a new section for things to revisit during beta.  If 
>> > someone
>> > feels strongly that the current behavior is Wrong and must change, speak 
>> > up as
>> > soon as you reach that conclusion.  Absent such arguments, the behavior 
>> > won't
>> > change.
>> >
>>
>> I was one of the people who said in original thread that I think the
>> default behavior should change to quorum and I am still of that opinion.
>
> This item appears under "decisions to recheck mid-beta".  If anyone is going
> to push for a change here, now is the time.

It has been one week since the previous mail. I thought that there were
others who argued for changing the behavior of the old-style setting so
that a quorum commit is chosen. If nobody is going to push for a
change, can we live with the current behavior?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

