Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server
Robert,

On Fri, Aug 4, 2017 at 23:17 Robert Haas wrote:
> On Thu, Aug 3, 2017 at 9:49 PM, Stephen Frost wrote:
> > Thanks for the patches. I'm planning to push them tomorrow morning
> > after a bit more review and testing. I'll provide an update tomorrow.
>
> Obviously, the part about pushing them Friday morning didn't happen,
> and you're running out of time for providing an update on Friday, too.
> The release is due to wrap in less than 72 hours.

Unfortunately the day got away from me due to some personal... adventures (having to do with lack of air conditioning first and then lack of gas, amongst a lot of other things going on right now...). I just got things back online but, well, my day tomorrow is pretty well packed solid. I wouldn't complain if someone has a few cycles to push these; they look pretty good to me, but I won't be able to work on PG stuff again until tomorrow night at the earliest.

Thanks!

Stephen
Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server
On Thu, Aug 3, 2017 at 9:49 PM, Stephen Frost wrote:
> Thanks for the patches. I'm planning to push them tomorrow morning
> after a bit more review and testing. I'll provide an update tomorrow.

Obviously, the part about pushing them Friday morning didn't happen, and you're running out of time for providing an update on Friday, too. The release is due to wrap in less than 72 hours.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql 10: hash indexes testing
On Fri, Aug 4, 2017 at 2:45 PM, Amit Kapila wrote:
> I have not done anything for this comment as it doesn't sound wrong to
> me. I think it is not making much sense in the current code and we
> can remove it or change it as part of the separate patch if you or
> others think so.

I don't get it. The comment claims we have to _hash_getnewbuf before releasing the metapage write lock. But the function neither calls _hash_getnewbuf nor releases the metapage write lock. It then goes on to say that for this reason we pass the new buffer rather than just the block number. However, even if it were true that we have to call _hash_getnewbuf before releasing the metapage write lock, what does that have to do with the choice of passing a buffer vs. a block number? Explain to me what I'm missing, please, because I'm completely befuddled!

(On another note, I committed these patches.)

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] A bug in mapping attributes in ATExecAttachPartition()
On Thu, Aug 3, 2017 at 8:45 PM, Amit Langote wrote:
>> 0003 needs a rebase.
>
> Rebased patch attached.

Committed. I think 0004 is a new feature, so I'm leaving that for v11.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Subscription code improvements
On 8/4/17 12:02, Masahiko Sawada wrote:
> To make ALTER SUBSCRIPTION REFRESH transactional, I prefer
> Petr's proposal, because it can make ALTER SUBSCRIPTION and DROP
> SUBSCRIPTION stop the table sync workers that are in the middle of
> copying data. I'm not sure killing table sync workers in DDL commands
> would be acceptable, but since it can free unnecessary logical
> replication worker slots and replication slots, I'd prefer this idea.

OK, I have committed the 0004 patch.

> However, even with this patch there is still an issue that NOTICE
> messages "removed subscription for table public.t1" can appear
> even if we roll back the ALTER SUBSCRIPTION REFRESH command, as I
> mentioned on an earlier thread. Since I think this behaviour will
> confuse users who check server logs, I'd like to deal with it,
> though I don't have a good idea how.

Maybe we can just remove those messages? We don't get messages when we create a subscription about which tables are part of it. So why do we need such messages when we refresh a subscription?

-- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] [WIP] Zipfian distribution in pgbench
On Fri, Jul 21, 2017 at 4:51 AM, Alik Khilazhev wrote:
> (Latest version of pgbench Zipfian patch)

While I'm +1 on this idea, I think that it would also be nice if there was an option to make functions like random_zipfian() actually return a value that has undergone perfect hashing. When this option is used, any given value that the function returns would actually be taken from a random mapping to some other value in the same range. So, you can potentially get a Zipfian distribution without the locality.

While I think that Zipfian distributions are common, it's probably less common for the popular items to be clustered together within the primary key, unless the PK has multiple attributes, and the first is the "popular attribute". For example, there would definitely be a lot of interest among contiguous items within an index on "(artist, album)" where "artist" is a popular artist, which is certainly quite possible. But, if "album" is the primary key, and it's a SERIAL PK, then there will only be weak temporal locality within the PK, and only because today's fads are more popular than the fads from previous years.

Just another idea, that doesn't have to hold this patch up.

-- Peter Geoghegan
[HACKERS] Draft release notes up for review
I've committed the first-draft release notes for 9.6.4 at https://git.postgresql.org/pg/commitdiff/03378c4da598840b0520a53580dd7713c95f21c8 (If you prefer to read nicely-marked-up copy, they should be up at https://www.postgresql.org/docs/devel/static/release-9-6-4.html in a couple hours from now.) Please review.

regards, tom lane
Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)
On Mon, Jul 31, 2017 at 10:54 AM, Peter Geoghegan wrote:
> Let's wait to see what difference it makes if Alik's zipfian
> distribution pgbench test case uses unlogged tables. That may give us a
> good sense of the problem for cases with contention/concurrency.

Yura Sokolov of Postgres Pro performed this benchmark at my request. He took the 9.5 commit immediately preceding 2ed5b87f9 as a baseline. In all cases, logged tables were used. Note that this is not the most effective benchmark for showing the regression, because he didn't replace the PK with a non-unique index, though that is planned as follow-up; I wanted to stick with Alik's Zipfian test case for the time being. (Using a unique index is not the most effective thing for showing the regression because unique indexes have most LP_DEAD setting done in _bt_check_unique(), so only SELECTs will do less LP_DEAD cleanup there; SELECTs are 50% of all queries.)

His results with 10 minute pgbench runs (logged tables):

clients | 8217fb14 | 2ed5b87f | master | hashsnap | hashsnap_lwlock
--------+----------+----------+--------+----------+----------------
     10 |   201569 |   204154 | 201095 |   201793 |          206111
     20 |   328180 |       58 | 334858 |   336721 |          370769
     40 |   196352 |   194692 | 232413 |   231968 |          393947
     70 |   121083 |   116881 | 148066 |   148443 |          224162
    110 |    77989 |    73414 |  99305 |   111404 |          161900
    160 |    48958 |    45830 |  65830 |    82340 |          115678
    230 |    27699 |    25510 |  38186 |    57617 |           80575
    310 |    16369 |    15137 |  21571 |    39745 |           56819
    400 |    10327 |     9486 |  13157 |    27318 |           40859
    500 |     6920 |     6496 |   8638 |    18677 |           29429
    650 |     4315 |     3971 |   5196 |    11762 |           17883
    800 |     2850 |     2772 |   3523 |     7733 |           10977

Note that you also see numbers from various patches from Yura, and the master branch mixed in here, but 8217fb14 (before) and 2ed5b87f (after) are the interesting numbers as far as this regression goes. There is an appreciable reduction in TPS here, though this workload is not as bad by that measure as first thought. There is a roughly 5% regression here past 40 clients or so.
The regression in the *consistency* of transaction *throughput* is far more interesting, though. I've been doing analysis of this by drilling down to individual test cases with vimdiff, as follows:

$ vimdiff test_8217fb14_logged_1_pgbench_40.out test_2ed5b87f_logged_1_pgbench_40.out

(I attach these two files as a single example. I can provide the full results to those that ask for them privately; it's too much data to attach to an e-mail to the list.)

You can see in this example that for most 5 second slices of the 10 minute benchmark, commit 2ed5b87f actually increases TPS somewhat, which is good. But there are also occasional *big* drops in TPS, sometimes by as much as 50% over a single 5 second period (when ANALYZE runs, adding random I/O while holding an exclusive buffer lock [1]?). When this slowdown happens, latency can be over 3 times higher, too. We see much more consistent performance without the B-Tree buffer pin VACUUM optimization, where there is no discernible pattern of performance dropping. The headline regression of 4% or 5% is not the main problem here, it seems.

In summary, commit 2ed5b87f makes the workload have increased throughput most of the time, but occasionally sharply reduces throughput, which averages out to TPS being 4% or 5% lower overall. I think we'll find that there are bigger problems TPS-wise with non-unique indexes when that other test is performed by Yura; let's wait for those results to come back.

Finally, I find it interesting that when Yura did the same benchmark, but with 5% SELECTs + 95% UPDATEs, rather than 50% SELECTs + 50% UPDATEs as above, the overall impact was surprisingly similar.
His results:

clients | 8217fb14 | 2ed5b87f | master | hashsnap | hashsnap_lwlock
--------+----------+----------+--------+----------+----------------
     20 |   187697 |   187335 | 217558 |   215059 |          266894
     50 |    81272 |    78784 |  97948 |    97659 |          157710
     85 |    49446 |    47683 |  64597 |    70814 |          107748
    130 |    32358 |    30393 |  42216 |    50531 |           75001
    190 |    19403 |    17569 |  25704 |    35506 |           51292
    270 |    10803 |     9878 |  14166 |    23851 |           35257
    370 |     6108 |     5645 |   7684 |    15390 |           23659
    500 |     3649 |     3297 |   4225 |     9172 |           14643
    650 |     2239 |     2049 |   2635 |     5887 |            8588
    800 |     1475 |     1424 |   1804 |     4035 |            5611

If nothing else, this shows how generally reliant these kinds of workload can be on LP_DEAD setting. And, there is one difference: The regression is seen here at *all* client
Re: [HACKERS] PostgreSQL not setting OpenSSL session id context?
Shay Rojansky writes:
> Great. Do you think it's possible to backport to the other maintained
> branches as well, seeing as how this is quite trivial and low-impact?

Already done, will be in next week's minor releases. (You timed this bug report well.)

regards, tom lane
Re: [HACKERS] pgsql 10: hash indexes testing
On Fri, Aug 4, 2017 at 2:49 PM, Amit Kapila wrote:
> On Fri, Aug 4, 2017 at 10:59 PM, Robert Haas wrote:
>> On Fri, Aug 4, 2017 at 6:22 AM, Amit Kapila wrote:
>>> I have increased the number of hash bitmap pages as a separate patch.
>>> I am not completely sure if it is a good idea to directly increase it
>>> to 1024 as that will increase the size of hashmetapagedata from 960
>>> bytes to 4544 bytes. Shall we increase it to 512?
>>
>> I don't quite see what the problem is with increasing it to 4544
>> bytes. What's your concern?
>
> Nothing as such. It is just that the previous code might have some
> reason to keep it at 128, probably if there are that many overflow
> bucket pages, then it is better to reindex the index otherwise the
> performance might suck while traversing the long overflow chains. In
> general, I see your point that if we can provide user to have that big
> overflow space, then let's do it.

Yeah. The only concern I see is that one doesn't want to run out of space in the metapage when future patches come along and try to do new things. But 4544 bytes still leaves quite a bit of headroom, and I think it's better to solve a problem we know we have right now than try to save too much space for future needs that may or may not occur. Maybe at some point this whole thing needs a broader rethink, but the fact that somebody hit the current limit before we got out of beta makes me think that we should be fairly aggressive in trying to ameliorate the problem.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] PostgreSQL not setting OpenSSL session id context?
>> Doing SSL_CTX_set_session_cache_mode(context, SSL_SESS_CACHE_OFF) doesn't
>> have any effect whatsoever - I still have the same issue (session id
>> context uninitialized). I suspect session caching is an entirely different
>> feature from session tickets/RFC5077 (although it might still be a good
>> idea to disable).
>
> Right, we expected that that would have no visible effect, because there
> is no way to cache sessions in Postgres anyway. The main point, if I
> understand Heikki's concern correctly, is that this might save some
> amount of low-level overhead from clients trying to cache connections.

OK, sounds right (i.e. this is a defensive measure that isn't directly connected to my problem but makes sense).

>> Doing SSL_CTX_set_options(context, SSL_OP_NO_TICKET) indeed resolves the
>> issue, as expected.
>
> Excellent. I'll push this patch tomorrow sometime (too late/tired
> right now).

Great. Do you think it's possible to backport to the other maintained branches as well, seeing as how this is quite trivial and low-impact?

>> As I wrote above, I'd remove the #ifdef and execute it always.
>
> The reason I put the #ifdef in is that according to my research the
> SSL_OP_NO_TICKET symbol was introduced in openssl 0.9.8f, while we
> claim to support back to 0.9.8. I'd be the first to say that you're
> nuts if you're running openssl versions that old; but this patch is not
> something to move the compatibility goalposts for when it only takes
> an #ifdef to avoid breaking older versions.
>
> (I need to check how far back SSL_SESS_CACHE_OFF goes ... we might
> need an #ifdef for that too.)

Ah OK, thanks for the explanation - makes perfect sense.
Re: [HACKERS] pgsql 10: hash indexes testing
On Fri, Aug 4, 2017 at 10:59 PM, Robert Haas wrote:
> On Fri, Aug 4, 2017 at 6:22 AM, Amit Kapila wrote:
>> I have increased the number of hash bitmap pages as a separate patch.
>> I am not completely sure if it is a good idea to directly increase it
>> to 1024 as that will increase the size of hashmetapagedata from 960
>> bytes to 4544 bytes. Shall we increase it to 512?
>
> I don't quite see what the problem is with increasing it to 4544
> bytes. What's your concern?

Nothing as such. It is just that the previous code might have some reason to keep it at 128; probably if there are that many overflow bucket pages, then it is better to reindex the index, as otherwise the performance might suck while traversing the long overflow chains. In general, I see your point that if we can provide user to have that big overflow space, then let's do it.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] PostgreSQL not setting OpenSSL session id context?
> On 2017-08-04 07:22:42 +0300, Shay Rojansky wrote:
> > I'm still not convinced of the risk/problem of simply setting the session
> > id context as I explained above (rather than disabling the optimization),
> > but of course either solution resolves my problem.
>
> How would that do anything? Each backend has its own local
> memory. I.e. any cache state that openssl would maintain wouldn't be
> useful. If you want to take advantage of features around this you really
> need to cache tickets in shared memory...

Guys, there's no data being cached at the backend - RFC5077 is about packaging information into a client-side opaque session ticket that allows skipping a roundtrip on the next connection. As I said, simply setting the session id context (*not* the session id or anything else) makes this feature work, even though a completely new backend process is launched.
Re: [HACKERS] pgsql 10: hash indexes testing
On Fri, Aug 4, 2017 at 11:45 PM, Amit Kapila wrote:
> On Fri, Aug 4, 2017 at 11:15 PM, Robert Haas wrote:
>> On Fri, Aug 4, 2017 at 6:22 AM, Amit Kapila wrote:
>>> I have implemented the patch with this approach as other approach
>>> require quite extensive changes which I am not sure is the right thing
>>> to do at this stage.
>>
>> I think this approach is actually better anyway. There's no guarantee
>> that VACUUM can be responsive enough to get the job done in time, work
>> items or no work items, and the split-cleanup is cheap enough that I
>> don't think we ought to worry about negative effects on foreground
>> workloads. Sure, it could have some impact, but *less bloat* could
>> also have some impact in the other direction.
>>
>> +hashbucketcleanup(rel, obucket, bucket_obuf,
>> +                  BufferGetBlockNumber(bucket_obuf), NULL,
>> +                  maxbucket, highmask, lowmask, NULL, NULL, true,
>> +                  NULL, NULL);
>> +LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
>>
>> Don't we want to do those things in the other order?
>
> Earlier one of the callers was releasing the old bucket lock first, so
> I just maintained that order, but I think it is better if we release
> new bucket lock first. I think that will maintain the ordering and it
> is better to do it before actual cleanup of the old bucket as cleanup
> can take some time and there is no use of retaining a lock on the new
> bucket for that time. I will do some testing after making the change
> and will post a patch in some time.

Changed as per suggestion.

>> - * which is passed in buffer nbuf, pinned and write-locked. That lock and
>> - * pin are released here. (The API is set up this way because we must do
>> - * _hash_getnewbuf() before releasing the metapage write lock. So instead of
>> - * passing the new bucket's start block number, we pass an actual buffer.)
>> + * which is passed in buffer nbuf, pinned and write-locked. The lock will be
>> + * released here and pin must be released by the caller. (The API is set up
>> + * this way because we must do _hash_getnewbuf() before releasing the metapage
>> + * write lock. So instead of passing the new bucket's start block number, we
>> + * pass an actual buffer.)
>>
>> Isn't the parenthesized part at the end of the comment wrong? I
>> realize this patch isn't editing that part anyway except for
>> reflowing, but further up in that same block comment it says this:
>>
>> * The caller must hold a pin, but no lock, on the metapage buffer.
>> * The buffer is returned in the same state. (The metapage is only
>> * touched if it becomes necessary to add or remove overflow pages.)
>
> The comment doesn't seem to be wrong but is not making much sense
> here. Both the actions _hash_getnewbuf and release of metapage lock
> is done in the caller. In 9.6, we were passing old bucket's block
> number and new bucket's buffer, so the comment tries to explain why it
> is passing buffer for the new bucket. It makes less sense now because
> we are passing buffer for both old and new buckets, so we can remove
> it. Do you want me to remove it as part of this patch or as a
> separate patch?

I have not done anything for this comment as it doesn't sound wrong to me. I think it is not making much sense in the current code and we can remove it or change it as part of the separate patch if you or others think so.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

fix_cleanup_bucket_after_split_v2.patch
Description: Binary data
Re: [HACKERS] Small code improvement for btree
On Fri, Aug 4, 2017 at 11:12 AM, Alvaro Herrera wrote:
> Interesting. We learned elsewhere that it's better to integrate the
> "!= 0" test as part of the macro definition; so a
> better formulation of this patch would be to change the
> P_INCOMPLETE_SPLIT macro and omit the comparison in the Assert. (See
> commit 594e61a1de03 for an example).
>
>> - LockBuffer(hbuffer, BUFFER_LOCK_SHARE);
>> + LockBuffer(hbuffer, BT_READ);

+1. One Linus Torvalds rant that I actually agreed with was a rant against the use of bool as a type in C code. It's fine, as long as you never forget that it's actually just another integer.

> I think BT_READ and BT_WRITE are useless, and I'd rather get rid of
> them ...

Fair enough, but we should either use them consistently or not at all. I'm not especially concerned about which, as long as it's one of those two.

-- Peter Geoghegan
Re: [HACKERS] pgsql 10: hash indexes testing
On Fri, Aug 4, 2017 at 11:15 PM, Robert Haas wrote:
> On Fri, Aug 4, 2017 at 6:22 AM, Amit Kapila wrote:
>> I have implemented the patch with this approach as other approach
>> require quite extensive changes which I am not sure is the right thing
>> to do at this stage.
>
> I think this approach is actually better anyway. There's no guarantee
> that VACUUM can be responsive enough to get the job done in time, work
> items or no work items, and the split-cleanup is cheap enough that I
> don't think we ought to worry about negative effects on foreground
> workloads. Sure, it could have some impact, but *less bloat* could
> also have some impact in the other direction.
>
> +hashbucketcleanup(rel, obucket, bucket_obuf,
> +                  BufferGetBlockNumber(bucket_obuf), NULL,
> +                  maxbucket, highmask, lowmask, NULL, NULL, true,
> +                  NULL, NULL);
> +LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
>
> Don't we want to do those things in the other order?

Earlier one of the callers was releasing the old bucket lock first, so I just maintained that order, but I think it is better if we release new bucket lock first. I think that will maintain the ordering and it is better to do it before actual cleanup of the old bucket as cleanup can take some time and there is no use of retaining a lock on the new bucket for that time. I will do some testing after making the change and will post a patch in some time.

> - * which is passed in buffer nbuf, pinned and write-locked. That lock and
> - * pin are released here. (The API is set up this way because we must do
> - * _hash_getnewbuf() before releasing the metapage write lock. So instead of
> - * passing the new bucket's start block number, we pass an actual buffer.)
> + * which is passed in buffer nbuf, pinned and write-locked. The lock will be
> + * released here and pin must be released by the caller. (The API is set up
> + * this way because we must do _hash_getnewbuf() before releasing the metapage
> + * write lock. So instead of passing the new bucket's start block number, we
> + * pass an actual buffer.)
>
> Isn't the parenthesized part at the end of the comment wrong? I
> realize this patch isn't editing that part anyway except for
> reflowing, but further up in that same block comment it says this:
>
> * The caller must hold a pin, but no lock, on the metapage buffer.
> * The buffer is returned in the same state. (The metapage is only
> * touched if it becomes necessary to add or remove overflow pages.)

The comment doesn't seem to be wrong but is not making much sense here. Both the actions _hash_getnewbuf and release of metapage lock is done in the caller. In 9.6, we were passing old bucket's block number and new bucket's buffer, so the comment tries to explain why it is passing buffer for the new bucket. It makes less sense now because we are passing buffer for both old and new buckets, so we can remove it. Do you want me to remove it as part of this patch or as a separate patch?

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Small code improvement for btree
Masahiko Sawada wrote:
> While hacking the btree code I found two points we can improve in nbtxlog.c.
>
> @@ -135,7 +135,7 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
>     Page page = (Page) BufferGetPage(buf);
>     BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
>
> -   Assert((pageop->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0);
> +   Assert(P_INCOMPLETE_SPLIT(pageop) != 0);
>     pageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;

Interesting. We learned elsewhere that it's better to integrate the "!= 0" test as part of the macro definition; so a better formulation of this patch would be to change the P_INCOMPLETE_SPLIT macro and omit the comparison in the Assert. (See commit 594e61a1de03 for an example).

> - LockBuffer(hbuffer, BUFFER_LOCK_SHARE);
> + LockBuffer(hbuffer, BT_READ);

I think BT_READ and BT_WRITE are useless, and I'd rather get rid of them ...

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Page Scan Mode in Hash Index
On Fri, Aug 4, 2017 at 9:44 AM, Amit Kapila wrote:
> There is no need to use Parentheses around opaque. I mean there is no
> problem with that, but it is redundant and makes code less readable.

Amit, I'm sure you know this, but just for the benefit of anyone who doesn't: We often include these kinds of extra parentheses in macros, for good reason. Suppose you have:

#define mul(x,y) x * y

If the user says mul(2+3,5), it will expand to 2 + 3 * 5 = 17, which is wrong. If you instead do this:

#define mul(x,y) (x) * (y)

...then mul(2+3,5) expands to (2 + 3) * (5) = 25, which is what the user of the macro is expecting to get. Outside of macro definitions, as you say, there's no point and we should avoid it.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pgsql 10: hash indexes testing
Robert Haas wrote:
> I think this approach is actually better anyway. There's no guarantee
> that VACUUM can be responsive enough to get the job done in time, work
> items or no work items,

Yeah, autovacuum work items don't have a guaranteed response time. They're okay for things that "ought to be done eventually", but if the condition causes a high-speed load to fail with errors, then halting the load until the cleanup is done seems like the way to go. Having to randomly inject pauses in your workload so that autovacuum has time to cope isn't great from the user's POV (a condition that is made worse if you don't have any mechanism to detect that you need a pause, or how long to pause for.)

Once that part is working well, you could as a convenience (to avoid or reduce the stalls some of the time) add autovacuum work-item support.

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] pgsql 10: hash indexes testing
On Fri, Aug 4, 2017 at 6:22 AM, Amit Kapila wrote:
> I have implemented the patch with this approach as other approach
> require quite extensive changes which I am not sure is the right thing
> to do at this stage.

I think this approach is actually better anyway. There's no guarantee that VACUUM can be responsive enough to get the job done in time, work items or no work items, and the split-cleanup is cheap enough that I don't think we ought to worry about negative effects on foreground workloads. Sure, it could have some impact, but *less bloat* could also have some impact in the other direction.

+hashbucketcleanup(rel, obucket, bucket_obuf,
+                  BufferGetBlockNumber(bucket_obuf), NULL,
+                  maxbucket, highmask, lowmask, NULL, NULL, true,
+                  NULL, NULL);
+LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);

Don't we want to do those things in the other order?

- * which is passed in buffer nbuf, pinned and write-locked. That lock and
- * pin are released here. (The API is set up this way because we must do
- * _hash_getnewbuf() before releasing the metapage write lock. So instead of
- * passing the new bucket's start block number, we pass an actual buffer.)
+ * which is passed in buffer nbuf, pinned and write-locked. The lock will be
+ * released here and pin must be released by the caller. (The API is set up
+ * this way because we must do _hash_getnewbuf() before releasing the metapage
+ * write lock. So instead of passing the new bucket's start block number, we
+ * pass an actual buffer.)

Isn't the parenthesized part at the end of the comment wrong? I realize this patch isn't editing that part anyway except for reflowing, but further up in that same block comment it says this:

 * The caller must hold a pin, but no lock, on the metapage buffer.
 * The buffer is returned in the same state. (The metapage is only
 * touched if it becomes necessary to add or remove overflow pages.)
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pgsql 10: hash indexes testing
On Fri, Aug 4, 2017 at 6:22 AM, Amit Kapila wrote:
> I have increased the number of hash bitmap pages as a separate patch.
> I am not completely sure if it is a good idea to directly increase it
> to 1024 as that will increase the size of hashmetapagedata from 960
> bytes to 4544 bytes. Shall we increase it to 512?

I don't quite see what the problem is with increasing it to 4544 bytes. What's your concern?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] expanding inheritance in partition bound order
On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote wrote:
> The current way to expand inherited tables, including partitioned tables,
> is to use either find_all_inheritors() or find_inheritance_children()
> depending on the context. They return child table OIDs in the (ascending)
> order of those OIDs, which means the callers that need to lock the child
> tables can do so without worrying about the possibility of deadlock in
> some concurrent execution of that piece of code. That's good.
>
> For partitioned tables, there is a possibility of returning child table
> (partition) OIDs in the partition bound order, which in addition to
> preventing the possibility of deadlocks during concurrent locking, seems
> potentially useful for other caller-specific optimizations. For example,
> tuple-routing code can utilize that fact to implement binary-search based
> partition-searching algorithm. For one more example, refer to the "UPDATE
> partition key" thread where it's becoming clear that it would be nice if
> the planner had put the partitions in bound order in the ModifyTable that
> it creates for UPDATE of partitioned tables [1].

I guess I don't really understand why we want to change the locking order. That is bound to make expanding the inheritance hierarchy more expensive. If we use this approach in all cases, it seems to me we're bound to reintroduce the problem we fixed in commit c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in the same vein. But I don't see that there's any necessary relation between the order of locking and the order of expansion inside the relcache entry/plan/whatever else -- so I propose that we keep the existing locking order and only change the other stuff.

While reading related code this morning, I noticed that RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have *already* changed the locking order for certain operations, because the PartitionDesc's OID array is bound-ordered not OID-ordered.
That means that when RelationGetPartitionDispatchInfo uses the PartitionDesc's OID array to figure out what to lock, it's potentially going to encounter partitions in a different order than would have been the case if it had used find_all_inheritors directly. I'm tempted to think that RelationGetPartitionDispatchInfo shouldn't really be doing locking at all. The best way to have the locking always happen in the same order is to have only one piece of code that determines that order - and I vote for find_all_inheritors. Aside from the fact that it's the way we've always done it (and still do it in most other places), that code includes infinite-loop defenses which the new code you've introduced lacks. Concretely, my proposal is: 1. Before calling RelationGetPartitionDispatchInfo, the calling code should use find_all_inheritors to lock all the relevant relations (or the planner could use find_all_inheritors to get a list of relation OIDs, store it in the plan in order, and then at execution time we visit them in that order and lock them). 2. RelationGetPartitionDispatchInfo assumes the relations are already locked. 3. While we're optimizing, in the first loop inside of RelationGetPartitionDispatchInfo, don't call heap_open(). Instead, use get_rel_relkind() to see whether we've got a partitioned table; if so, open it. If not, there's no need. 4. For safety, add a function bool RelationLockHeldByMe(Oid) and add to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))) elog(ERROR, ...). Might be interesting to stuff that check into the relation_open(..., NoLock) path, too. One objection to this line of attack is that there might be a good case for locking only the partitioned inheritors first and then going back and locking the leaf nodes in a second pass, or even only when required for a particular row. However, that doesn't require putting everything in bound order - it only requires moving the partitioned children to the beginning of the list.
And I think rather than having new logic for that, we should teach find_inheritance_children() to do that directly. I have a feeling Ashutosh is going to cringe at this suggestion, but my idea is to do this by denormalizing: add a column to pg_inherits indicating whether the child is of RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans pg_inherits, it can pull that flag out for free along with the relation OID, and qsort() first by the flag and then by the OID. It can also return the number of initial elements of its return value which have that flag set. Then, in find_all_inheritors, we split rels_list into partitioned_rels_list and other_rels_list, and process partitioned_rels_list in its entirety before touching other_rels_list; they get concatenated at the end. Now, find_all_inheritors and find_inheritance_children can also grow a flag bool only_partitioned_children; if set, then we skip the unpartitioned children entirely. With all
Re: [HACKERS] UPDATE of partition key
> > Below are the TODOs at this point: > > Fix for bug reported by Rajkumar about update with join. I had explained the root issue of this bug here: [1] The attached patch includes the fix, which is explained below. Currently in the patch, there is a check if the tuple is concurrently deleted by another session, i.e. when heap_update() returns HeapTupleUpdated. In such a case we set the concurrently_deleted output param to true. We should also do the same for the HeapTupleSelfUpdated return value. In fact, there are other places in ExecDelete() where it can return without doing anything. For example, if a BR DELETE trigger prevents the delete from happening, ExecBRDeleteTriggers() returns false, in which case ExecDelete() returns. So what the fix does is: rename the concurrently_deleted parameter to delete_skipped so as to indicate a more general status: whether the delete has actually happened or was skipped. And set this param to true only after the delete happens. This allows us to cover the trigger case also. Added test scenarios for the UPDATE with JOIN case, and also the TRIGGER case. > Do something about two separate mapping tables for Transition tables > and update tuple-routing. On 1 July 2017 at 03:15, Thomas Munro wrote: > Would it make sense to have a set of functions with names like > GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays > m_convertors_{from,to}_by_{subplan,leaf} the first time they need > them? This was discussed here: [2]. I think even if we have them built when needed, still in the presence of both tuple routing and transition tables, we do need separate arrays. So I think rather than dynamic arrays, we can have static arrays whose elements will point to a shared TupleConversionMap structure whenever possible.
As already in the patch, in the case of insert/update tuple routing, there is a per-leaf partition mt_transition_tupconv_maps array for transition tables, and a separate per-subplan array mt_resultrel_maps for update tuple routing. *But*, what I am proposing is: for the mt_transition_tupconv_maps[] element for which the leaf partition also exists as a per-subplan result, that array element and the mt_resultrel_maps[] element will point to the same TupleConversionMap structure. This is quite similar to how we are re-using the per-subplan resultrels for the per-leaf result rels. We will re-use the per-subplan TupleConversionMap for the per-leaf mt_transition_tupconv_maps[] elements. I have not implemented this yet. > GetUpdatedColumns() to be moved to header file. Done. I have moved it into execnodes.h > More test scenarios in regression tests. > Need to check/test whether we are correctly applying insert policies > (and not update) while inserting a routed tuple. The above two are yet to be done. > Use getASTriggerResultRelInfo() for attrno mapping, rather than first > resultrel, for generating child WCO/RETURNING expression. > Regarding generating child WithCheckOption and Returning expressions using those of the root result relation, ModifyTablePath and ModifyTable should have new fields rootReturningList (and rootWithCheckOptions) which would be derived from root->parse->returningList in inheritance_planner(). But then, similar to the per-subplan returningList, rootReturningList would have to pass through set_plan_refs()=>set_returning_clause_references(), which requires the subplan targetlist to be passed. Because of this, for rootReturningList, we require a subplan for the root partition, which is not there currently; we have subplans only for child rels. That means we would have to create such a plan only for the sake of generating rootReturningList.
The other option is to do the way the patch is currently doing in the executor by using the returningList of the first per-subplan result rel to generate the other child returningList (and WithCheckOption). This is working by applying map_partition_varattnos() to the first returningList. But now that we realized that we have to specially handle whole-row vars, map_partition_varattnos() would need some changes to convert whole row vars differently for child-rel-to-child-rel mapping. For childrel-to-childrel conversion, the whole-row var is already wrapped by ConvertRowtypeExpr, but we need to change its Var->vartype to the new child vartype. I think the second option looks easier, but I am open to suggestions, and I am myself still checking the first one. > Address Robert's review comments on make_resultrel_ordered.patch. > > +typedef struct ParentChild > > This is a pretty generic name. Pick something more specific and informative. I have used ChildPartitionInfo. But suggestions welcome. > > +static List *append_rel_partition_oids(List *rel_list, Relation rel); > > One could be forgiven for thinking that this function was just going > to append OIDs, but it actually appends ParentChild structures, so I > think the name needs work. Renamed it to append_child_partitions(). > > +List
Re: [HACKERS] pg_upgrade failed with error - ERROR: column "a" in child table must be marked NOT NULL
On Fri, Aug 4, 2017 at 5:50 PM, Tom Lane wrote: > Michael Paquier writes: >> So I think that the attached patch is able to do the legwork. > > I've pushed this into HEAD. It seems like enough of a behavioral > change that we wouldn't want to back-patch, but IMO it's not too late > to be making this type of change in v10. If we wait for the next CF > then it will take another year for the fix to reach the field. Thanks for applying the fix. My intention in adding that to a CF was to make sure it didn't get lost. >> While >> looking at the code, I have bumped into index_check_primary_key() that >> discarded the case of sending SET NOT NULL to child tables, as >> introduced by 88452d5b. But that's clearly an oversight IMO, and the >> comment is wrong to begin with because SET NOT NULL is spread to child >> tables. Using is_alter_table instead of a plain true in >> index_check_primary_key() for the call of AlterTableInternal() is >> defensive, but I don't think that we want to impact any modules >> relying on this API, so recursing only when ALTER TABLE is used is the >> safest course of action to me. > > I didn't find that persuasive: I think passing "recurse" as just plain > "true" is safer and more readable. We shouldn't be able to get there > in a CREATE case, as per the comments; and if we did, there shouldn't > be any child tables anyway; but if somehow there were, surely the same > consistency argument would imply that we *must* recurse and fix those > child tables. So I just made it "true". Yeah, that matches what I read as well. No complaints about switching to a plain "true" even if it is not reached yet. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Subscription code improvements
On Fri, Aug 4, 2017 at 2:17 AM, Peter Eisentraut wrote: > On 7/13/17 23:53, Masahiko Sawada wrote: >> To summarize, I think we now have only one issue; ALTER SUBSCRIPTION is >> not transactional; the 0004 patch is addressing this issue. > > We have two competing patches for this issue. This patch moves the > killing to the end of the DDL transaction. Your earlier patch makes the > tablesync worker itself responsible for exiting. Do you wish to comment > which direction to pursue? (Doing both might also be an option?) > To make ALTER SUBSCRIPTION REFRESH transactional, I prefer Petr's proposal, because it can make ALTER SUBSCRIPTION and DROP SUBSCRIPTION stop the table sync workers that are in the process of copying data. I'm not sure killing table sync workers in DDL commands would be acceptable, but since it can free unnecessary logical replication worker slots and replication slots I'd prefer this idea. However, even with this patch there is still an issue that the NOTICE message "removed subscription for table public.t1" can appear even if we roll back the ALTER SUBSCRIPTION REFRESH command, as I mentioned on an earlier thread. Since I think this behaviour will confuse users who check server logs I'd like to deal with it, though I don't have a good idea how. Also, I think we can incorporate the idea of my earlier proposal with some changes (i.e. I'd choose the third option). In the current implementation, in the case where a subscription relation state is accidentally removed while the corresponding table sync worker is in the process of copying data, the worker cannot exit from the loop in wait_for_worker_state_change unless the apply worker dies. So, to be more robust, table sync workers can finish with an error if their subscription relation state has disappeared.
Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_upgrade failed with error - ERROR: column "a" in child table must be marked NOT NULL
Michael Paquier writes: > So I think that the attached patch is able to do the legwork. I've pushed this into HEAD. It seems like enough of a behavioral change that we wouldn't want to back-patch, but IMO it's not too late to be making this type of change in v10. If we wait for the next CF then it will take another year for the fix to reach the field. > While > looking at the code, I have bumped into index_check_primary_key() that > discarded the case of sending SET NOT NULL to child tables, as > introduced by 88452d5b. But that's clearly an oversight IMO, and the > comment is wrong to begin with because SET NOT NULL is spread to child > tables. Using is_alter_table instead of a plain true in > index_check_primary_key() for the call of AlterTableInternal() is > defensive, but I don't think that we want to impact any modules > relying on this API, so recursing only when ALTER TABLE is used is the > safest course of action to me. I didn't find that persuasive: I think passing "recurse" as just plain "true" is safer and more readable. We shouldn't be able to get there in a CREATE case, as per the comments; and if we did, there shouldn't be any child tables anyway; but if somehow there were, surely the same consistency argument would imply that we *must* recurse and fix those child tables. So I just made it "true". regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
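For illustration, the scenario the fix targets can be reduced to a small inheritance tree. The table names here are made up, and this is a sketch of my reading of the thread rather than the exact test case: after the change, adding a primary key on the parent recurses SET NOT NULL to children, so a dump/restore (as pg_upgrade performs) no longer trips over a nullable child column.

```sql
-- Hypothetical reproduction sketch:
CREATE TABLE parent (a int);
CREATE TABLE child () INHERITS (parent);

-- With the committed fix, this marks "a" NOT NULL in the child as well,
-- avoiding: ERROR: column "a" in child table must be marked NOT NULL
-- when the dumped schema is restored.
ALTER TABLE parent ADD PRIMARY KEY (a);
```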
Re: [HACKERS] Default Partition for Range
On Mon, Jul 31, 2017 at 8:28 AM, Beena Emerson wrote: > Thanks for informing. > PFA the updated patch. > I have changed the numbering of enum PartitionRangeDatumKind since I > have to include DEFAULT as well. If you have better ideas, let me > know. Why do we need to introduce PARTITION_RANGE_DATUM_DEFAULT at all? It seems to me that the handling of default range partitions ought to be similar to the way a null-accepting list partition is handled - namely, it wouldn't show up in the "datums" or "kind" array at all, instead just showing up in PartitionBoundInfoData's default_index field. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
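The suggested handling can be pictured with a toy router (a Python stand-in, not the real PostgreSQL code): the default partition never appears in the bounds ("datums") array at all; routing falls back to it, via the separately tracked default_index, only when the binary search over the real bounds finds no covering partition.

```python
import bisect

# Toy model: three range partitions plus a default partition that is
# deliberately NOT in the bounds array.
bounds = [10, 20, 30]              # p0=[-inf,10), p1=[10,20), p2=[20,30)
partitions = ["p0", "p1", "p2"]
default_partition = "pdef"         # analogue of default_index

def route(key):
    # bisect_right finds the first range whose upper bound exceeds key
    i = bisect.bisect_right(bounds, key)
    return partitions[i] if i < len(partitions) else default_partition

print(route(5), route(10), route(42))   # p0 p1 pdef
```

Keys at or beyond the last upper bound (and, symmetrically, any key no real bound covers) land in the default partition without the default needing its own datum kind.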
Re: [HACKERS] Page Scan Mode in Hash Index
On Sun, Jul 30, 2017 at 2:07 PM, Ashutosh Sharma wrote: > Hi, > > On Wed, May 10, 2017 at 2:28 PM, Ashutosh Sharma > wrote: >> While doing the code coverage testing of the v7 patch shared with - [1], I >> found that there are a few lines of code in _hash_next() that are >> redundant and need to be removed. I basically came to know this while >> testing the scenario where a hash index scan starts when a split is in >> progress. I have removed those lines and attached is the newer version >> of the patch. >> > > Please find the new version of patches for page at a time scan in hash > index rebased on top of the latest commit in the master branch. Also, I have > run the pgindent script with pg_bsd_indent version 2.0 on all the modified > files. Thanks. > A few comments: 1. +_hash_kill_items(IndexScanDesc scan, bool havePin) I think you can do without the second parameter. Can't we detect inside _hash_kill_items whether the page is pinned or not as we do for btree? 2. + /* + * We remember prev and next block number along with current block + * number so that if fetching the tup- les using cursor we know + * the page from where to startwith. This is for the case where we + * have re- ached the end of bucket chain without finding any + * matching tuples. The spellings of "tuples" and "reached" contain some unwanted symbols. Add a space in "startwith" or just use "begin". 3. + if (!BlockNumberIsValid((opaque)->hasho_nextblkno)) + { + so->currPos.prevPage = (opaque)->hasho_prevblkno; + so->currPos.nextPage = InvalidBlockNumber; + } There is no need to use parentheses around opaque. I mean there is no problem with that, but it is redundant and makes the code less readable. Also, there is similar usage at other places in the code; please change all the other places as well. I think you can save the value of prevblkno in a local variable and use it in the else part. 4.
+ if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) && + _hash_checkqual(scan, itup)) + { + /* tuple is qualified, so remember it */ + _hash_saveitem(so, itemIndex, offnum, itup); + itemIndex++; + } + else + + /* + * No more matching tuples exist in this page. so, exit while + * loop. + */ + break; It is better to have braces for the else part. It makes the code look better. The same type of code exists a few lines down as well; change that too. 5. + /* + * We save the LSN of the page as we read it, so that we know whether it + * safe to apply LP_DEAD hints to the page later. + */ "whether it safe"/"whether it is safe" -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench: Skipping the creating primary keys after initialization
On Fri, Aug 4, 2017 at 3:24 AM, Fabien COELHO wrote: > >>> For the CREATE stuff, the script language is SQL, the command to use it >>> is >>> "psql"... >> >> >>> The real and hard part is to fill tables with meaningful pseudo-random >>> test data which do not violate constraints for any non trivial schema >>> involving foreign keys and various unique constraints. >> >> >>> The solution for this is SQL for trivial cases, think of: >>>"INSERT INTO Foo() SELECT ... FROM generate_series(...);" >> >> >> Yeah. I was also thinking that complicated data-generation requirements >> could be handled with plpgsql DO blocks, avoiding the need for hard-wired >> code inside pgbench. > > > I do not think that it is really needed for what pgbench does, though. > See attached attempt, including a no_foreign_keys option. > > The only tricky thing is to have the elapsed/remaining advancement report on > stdout, maybe with some PL/pgSQL. > > Timings are very similar compared to "pgbench -i". > Generating data with plpgsql DO blocks means that we do the data generation on the server side rather than on the client side. I think it's preferable in a sense because it could speed up initialization time by reducing network traffic. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
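The server-side approach under discussion boils down to replacing client-side row streaming with a single INSERT ... SELECT over generate_series(). A minimal sketch for a pgbench_accounts-like table (the column layout follows pgbench's accounts table; the row count and bid formula are illustrative for scale factor 1):

```sql
CREATE TABLE pgbench_accounts (
    aid      integer NOT NULL,
    bid      integer,
    abalance integer,
    filler   char(84)
) WITH (fillfactor = 100);

-- Generate all rows on the server in one statement, instead of
-- sending 100000 INSERTs (or a COPY stream) from the client:
INSERT INTO pgbench_accounts (aid, bid, abalance, filler)
SELECT aid, (aid - 1) / 100000 + 1, 0, ''
FROM generate_series(1, 100000) AS aid;

-- Build the index after the data load, as pgbench -i does:
ALTER TABLE pgbench_accounts ADD PRIMARY KEY (aid);
```

The same pattern extends to a DO block when the generation logic needs procedural control (progress reporting, per-table loops), which is the case Fabien's patch addresses.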
Re: [HACKERS] expanding inheritance in partition bound order
On Fri, Aug 4, 2017 at 1:08 PM, Amit Langote wrote: > The current way to expand inherited tables, including partitioned tables, > is to use either find_all_inheritors() or find_inheritance_children() > depending on the context. They return child table OIDs in the (ascending) > order of those OIDs, which means the callers that need to lock the child > tables can do so without worrying about the possibility of deadlock in > some concurrent execution of that piece of code. That's good. > > For partitioned tables, there is a possibility of returning child table > (partition) OIDs in the partition bound order, which in addition to > preventing the possibility of deadlocks during concurrent locking, seems > potentially useful for other caller-specific optimizations. For example, > tuple-routing code can utilize that fact to implement binary-search based > partition-searching algorithm. For one more example, refer to the "UPDATE > partition key" thread where it's becoming clear that it would be nice if > the planner had put the partitions in bound order in the ModifyTable that > it creates for UPDATE of partitioned tables [1]. Thanks a lot for working on this. Partition-wise join can benefit from this as well. See the comment about build_simple_rel's matching algorithm in [1]. It will become O(n) instead of O(n^2). > > So attached are two WIP patches: > > 0001 implements two interface functions: > > List *get_all_partition_oids(Oid, LOCKMODE) > List *get_partition_oids(Oid, LOCKMODE) > > that resemble find_all_inheritors() and find_inheritance_children(), > respectively, but expect that users call them only for partitioned tables. > Needless to mention, OIDs are returned with canonical order determined by > that of the partition bounds and the way the partition tree structure is > traversed (top-down, breadth-first-left-to-right). The patch replaces all the > calls of the old interface functions with the respective new ones for > partitioned table parents.
That means expand_inherited_rtentry (among > others) now calls get_all_partition_oids() if the RTE is for a partitioned > table and find_all_inheritors() otherwise. > > In its implementation, get_all_partition_oids() calls > RelationGetPartitionDispatchInfo(), which is useful to generate the result > list in the desired partition bound order. But the current interface and > implementation of RelationGetPartitionDispatchInfo() needs some rework, > because it's too closely coupled with the executor's tuple routing code. Maybe we want to implement get_all_partition_oids() calling get_partition_oids() on each new entry that gets added, similar to find_all_inheritors(). That might avoid changes to RelationGetPartitionDispatchInfo() and also the dependency on that structure. Also, almost every place which called find_all_inheritors() or find_inheritance_children() is changed to an if () else case calling those functions or the new function as required. Maybe we should create macros/functions to do that routing to keep the code readable. Maybe find_all_inheritors() and find_inheritance_children() themselves become the routing functions and their original code moves into new functions get_all_inheritors() and get_inheritance_children(). We may choose other names for the functions. The idea is to create routing functions/macros instead of sprinkling the code with if () else conditions. [1] https://www.postgresql.org/message-id/ca+tgmoberutu4osxa_ua4aorho83wxavfg8n1nqcofuujbe...@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More flexible LDAP auth search filters?
On 01/08/17 23:17, Thomas Munro wrote: > On Wed, Aug 2, 2017 at 5:36 AM, Peter Eisentraut > wrote: >> On 7/16/17 19:09, Thomas Munro wrote: >>> On Mon, Jul 17, 2017 at 10:26 AM, Thomas Munro >>> wrote: ldap-search-filters-v2.patch >>> >>> Gah, it would help if I could spell "occurrences" correctly. Fixed in >>> the attached. >> >> Please also add the corresponding support for specifying search filters >> in LDAP URLs. See RFC 4516 for the format and >> https://linux.die.net/man/3/ldap_url_parse for the API. You might just >> need to grab LDAPURLDesc.lud_filter and use it. > > Good idea. Yes, it seems to be that simple. Here's a version like > that. Here's an example of how it looks in pg_hba.conf: > > host all all 127.0.0.1/32 ldap > ldapurl="ldap://localhost/ou=people1,dc=my-domain,dc=com??sub?(cn=%25u)" > > Maybe we could choose a better token than %u for user name, since it > has to be escaped when included in a URL like that, but on the other > hand there seems to be wide precedent for %u in other software. Yeah, mostly I only ever see ldapurls used programmatically, i.e. the configuration allows you to set the various fields separately and then the software generates the URL with the correct encoding itself. But if it's documented, that's not a reason to reject the patch, as I definitely see it as an improvement. As I mentioned previously in the thread, the main barrier preventing people from using LDAP is that the role cannot be generated from other attributes in the directory. In a lot of real-life cases I see, that would be enough to discount PostgreSQL's LDAP authentication completely. ATB, Mark. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
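The %25u detail in the pg_hba.conf example is plain RFC 3986 percent-encoding: the literal "%" of the %u placeholder must itself be escaped inside the URL, which is why the filter is written (cn=%25u). A quick check with Python's standard library:

```python
from urllib.parse import unquote

# The filter template the server should see after URL decoding:
filter_template = "(cn=%u)"

# Inside an RFC 4516 LDAP URL, "%" must be written as "%25":
encoded = filter_template.replace("%", "%25")
print(encoded)                          # (cn=%25u)

# Decoding the URL form recovers the original placeholder:
assert unquote(encoded) == filter_template
```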
Re: [HACKERS] pgsql 10: hash indexes testing
On Fri, Aug 4, 2017 at 9:19 AM, AP wrote: > On Fri, Aug 04, 2017 at 08:21:01AM +0530, Amit Kapila wrote: >> Note - AP has off list shared the data dump and we (Ashutosh Sharma >> and me) are able to reproduce the problem and we could see that if we >> force vacuum via the debugger, then it is able to free overflow pages. >> The exact numbers are not available at this stage as the test is not >> complete. > > I've another if you would like it. I COPYed with FILLFACTOR of 10 and > it eventually failed but I could not recreate the index (via CREATE INDEX > CONCURRENTLY) with the data that made it using a fillfactor of 100. If > I created the index again (again with the same data) with fillfactor 10 > then it completed. > > I may be completely misunderstanding fillfactor but I always thought it was > a performance optimisation rather than something that may allow you to store > more (or less) index entries. > It impacts the split behavior. You can read about it at: https://www.postgresql.org/docs/9.6/static/sql-createindex.html > > The DB in question is now gone but I took a copy of the column as per > before so if you'd like it I can make it available via the same means. > Can you try with the patches posted above? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql 10: hash indexes testing
On Fri, Aug 04, 2017 at 08:21:01AM +0530, Amit Kapila wrote: > Note - AP has off list shared the data dump and we (Ashutosh Sharma > and me) are able to reproduce the problem and we could see that if we > force vacuum via the debugger, then it is able to free overflow pages. > The exact numbers are not available at this stage as the test is not > complete. I've another if you would like it. I COPYed with FILLFACTOR of 10 and it eventually failed but I could not recreate the index (via CREATE INDEX CONCURRENTLY) with the data that made it using a fillfactor of 100. If I created the index again (again with the same data) with fillfactor 10 then it completed. I may be completely misunderstanding fillfactor but I always thought it was a performance optimisation rather than something that may allow you to store more (or less) index entries. The stats for the various indexes are:

After COPYs started failing:

fmmdstash=# select overflow_pages/bitmap_pages/8,* from pgstathashindex('link_datum_id_idx');
 ?column? | version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |   free_percent
----------+---------+--------------+----------------+--------------+--------------+------------+------------+------------------
     4095 |       3 |    103782169 |        4194176 |          128 |     13658343 | 5085570007 |          0 | 21014.6558371539
(1 row)

Time: 6146310.494 ms (01:42:26.310)

After the CREATE INDEX CONCURRENTLY with FILLFACTOR 100 failed:

fmmdstash=# select overflow_pages/bitmap_pages/8,* from pgstathashindex('link_datum_id_idx1');
 ?column? | version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |  free_percent
----------+---------+--------------+----------------+--------------+--------------+------------+------------+-----------------
     4095 |       3 |      6205234 |        4194176 |          128 |        86222 | 3080760746 |          0 | 615.91682922039
(1 row)

Time: 19128.527 ms (00:19.129)

After the CREATE INDEX CONCURRENTLY with FILLFACTOR 10 succeeded:

fmmdstash=# select overflow_pages/bitmap_pages/8,* from pgstathashindex('link_datum_id_idx2');
 ?column? | version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |   free_percent
----------+---------+--------------+----------------+--------------+--------------+------------+------------+------------------
     3062 |       3 |     79677471 |        2572565 |          105 |      5074888 | 3187098806 |          0 | 19027.2399324415
(1 row)

Time: 1557509.940 ms (25:57.510)

The DB in question is now gone but I took a copy of the column as per before so if you'd like it I can make it available via the same means. AP -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql 10: hash indexes testing
On Fri, Aug 4, 2017 at 8:21 AM, Amit Kapila wrote: > On Wed, Aug 2, 2017 at 9:04 PM, Robert Haas wrote: >> On Wed, Jul 12, 2017 at 1:10 AM, Amit Kapila wrote: > Yes, I also think the same idea can be used; in fact, I have mentioned > it [1] as soon as you committed that patch. Do we want to do > anything at this stage for PG-10? I don't think we should attempt > something this late unless people feel this is a show-stopper issue > for usage of hash indexes. If required, I think a separate function > can be provided to allow users to perform the squeeze operation. Sorry, I have no idea how critical this squeeze thing is for the newfangled hash indexes, so I cannot comment on that. Does this make the indexes unusable in some way under some circumstances? >>> >>> It seems so. Basically, in the case of a large number of duplicates, >>> we hit the maximum number of overflow pages. There is a theoretical >>> possibility of hitting it, but it could also happen that we are not >>> freeing the existing unused overflow pages, due to which it keeps on >>> growing and hits the limit. I have requested upthread to verify if >>> that is happening in this case and I am still waiting for the same. The >>> squeeze operation does free such unused overflow pages after cleaning >>> them. As this is a costly operation and needs a cleanup lock, we >>> currently perform it only during Vacuum and the next split from the bucket >>> which can have redundant overflow pages. >> >> Oops. It was rather short-sighted of us not to increase >> HASH_MAX_BITMAPS when we bumped HASH_VERSION. Actually removing that >> limit is hard, but we could have easily bumped it from 128 to say 1024 >> without (I think) causing any problem, which would have given us quite >> a bit of headroom here. > > Yes, that sounds sensible, but I think it will just delay the problem. > I think the actual problem here is that we are not able to > perform the squeeze operation often enough that it frees the overflow > pages.
Currently, we try to perform the squeeze only at the start of > the next split of the bucket or during vacuum. The reason for doing it > that way was that the squeeze operation needs a cleanup lock and we already > have that during the start of split and vacuum. Now, to solve it I > have already speculated on a few ways above [1] and among those, it is > feasible to do this at the end of a split, which can have performance > implications in some workloads, but will work fine for the case > reported in this thread > I have implemented the patch with this approach, as the other approaches require quite extensive changes which I am not sure is the right thing to do at this stage. > > I think we can fix it in one of the above ways and increase the value of > HASH_MAX_BITMAPS as well. > I have increased the number of hash bitmap pages as a separate patch. I am not completely sure if it is a good idea to directly increase it to 1024 as that will increase the size of HashMetaPageData from 960 bytes to 4544 bytes. Shall we increase it to 512? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com fix_cleanup_bucket_after_split_v1.patch Description: Binary data increase_hash_max_bitmaps_v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
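The numbers quoted in this thread line up with a back-of-the-envelope calculation (assuming 8 kB blocks, 4-byte BlockNumbers, and roughly 2^15 overflow pages tracked per bitmap page; exact header overheads are glossed over):

```python
HASH_MAX_BITMAPS = 128     # the compile-time cap being discussed
BITMAP_BITS = 2 ** 15      # overflow pages trackable per bitmap page (8 kB block)

# Ceiling on overflow pages before allocation starts failing:
max_overflow = HASH_MAX_BITMAPS * (BITMAP_BITS - 1)
print(max_overflow)        # 4194176 -- matches overflow_pages in the failing index

# Metapage growth if the cap went from 128 to 1024: the hashm_mapp[]
# array holds one 4-byte BlockNumber per bitmap page.
print(960 + (1024 - 128) * 4)   # 4544 -- the size Amit quotes
```

So the pgstathashindex output earlier in the thread (overflow_pages = 4194176 with bitmap_pages = 128) is exactly the hard ceiling, and the 960-to-4544-byte figure is just the 896 extra array slots at 4 bytes each.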
Re: [HACKERS] reload-through-the-top-parent switch the partition table
Here is an update patch, now renamed the switch to --load-via-partition-root and also added the documentation for the new switch into pg_dump as well as pg_dumpall. On Fri, Aug 4, 2017 at 7:13 AM, Amit Langotewrote: > On 2017/08/04 1:08, David G. Johnston wrote: > > On Thu, Aug 3, 2017 at 8:53 AM, Tom Lane wrote: > > > >> Robert Haas writes: > >>> So maybe --load-via-partition-root if nobody likes my previous > >>> suggestion of --partition-data-via-root ? > >> > >> WFM. > >> > > > > +1 > > +1. > > Thanks, > Amit > > Thanks, Rushabh Lathia www.EnterpriseDB.com diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index bafa031..65f7e98 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -889,6 +889,16 @@ PostgreSQL documentation + --load-via-partition-root + + +Dump data via the top-most partitioned table (rather than partition +table) when dumping data for the partition table. + + + + + --section=sectionname diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml index aa944a2..54b8e7e 100644 --- a/doc/src/sgml/ref/pg_dumpall.sgml +++ b/doc/src/sgml/ref/pg_dumpall.sgml @@ -431,6 +431,16 @@ PostgreSQL documentation + --load-via-partition-root + + +Dump data via the top-most partitioned table (rather than partition +table) when dumping data for the partition table. 
+ + + + + --use-set-session-authorization diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h index 144068a..ce3100f 100644 --- a/src/bin/pg_dump/pg_backup.h +++ b/src/bin/pg_dump/pg_backup.h @@ -157,6 +157,7 @@ typedef struct _dumpOptions int outputNoTablespaces; int use_setsessauth; int enable_row_security; + int load_via_partition_root; /* default, if no "inclusion" switches appear, is to dump everything */ bool include_everything; diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 393b9e2..27e23ca 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -269,6 +269,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions, const char *prefix, Archive *fout); static char *get_synchronized_snapshot(Archive *fout); static void setupDumpWorker(Archive *AHX); +static TableInfo *getRootTableInfo(TableInfo *tbinfo); int @@ -345,6 +346,7 @@ main(int argc, char **argv) {"lock-wait-timeout", required_argument, NULL, 2}, {"no-tablespaces", no_argument, , 1}, {"quote-all-identifiers", no_argument, _all_identifiers, 1}, + {"load-via-partition-root", no_argument, _via_partition_root, 1}, {"role", required_argument, NULL, 3}, {"section", required_argument, NULL, 5}, {"serializable-deferrable", no_argument, _deferrable, 1}, @@ -959,6 +961,7 @@ help(const char *progname) printf(_(" --no-tablespaces do not dump tablespace assignments\n")); printf(_(" --no-unlogged-table-data do not dump unlogged table data\n")); printf(_(" --quote-all-identifiers quote all identifiers, even if not key words\n")); + printf(_(" --load-via-partition-rootload partition table via the root relation\n")); printf(_(" --section=SECTIONdump named section (pre-data, data, or post-data)\n")); printf(_(" --serializable-deferrablewait until the dump can run without anomalies\n")); printf(_(" --snapshot=SNAPSHOT use given snapshot for the dump\n")); @@ -1902,8 +1905,32 @@ dumpTableData_insert(Archive *fout, void *dcontext) if 
(insertStmt == NULL) { insertStmt = createPQExpBuffer(); + +/* + * When load-via-partition-root is set, get the root relation name + * for the partition table, so that we can reload data through + * the root table. + */ +if (dopt->load_via_partition_root && tbinfo->ispartition) +{ + TableInfo *parentTbinfo; + + parentTbinfo = getRootTableInfo(tbinfo); + + /* + * When we loading data through the root, we will qualify + * the table name. This is needed because earlier + * search_path will be set for the partition table. + */ + classname = (char *) fmtQualifiedId(fout->remoteVersion, + parentTbinfo->dobj.namespace->dobj.name, + parentTbinfo->dobj.name); +} +else + classname = fmtId(tbinfo->dobj.name); + appendPQExpBuffer(insertStmt, "INSERT INTO %s ", - fmtId(classname)); + classname); /* corner case for zero-column table */ if (nfields == 0) @@ -2025,6 +2052,20 @@ dumpTableData_insert(Archive *fout, void *dcontext) return 1; } +/* + * getRootTableInfo: + * get the root TableInfo for the given partition table. + */ +static TableInfo *
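The diff above is cut off right at the definition of getRootTableInfo(). As a rough illustration only — the types and field names below are hypothetical stand-ins, not the actual pg_dump structures — such a helper presumably just walks parent links until it reaches a relation that is not itself a partition:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for pg_dump's TableInfo. */
typedef struct SimpleTableInfo
{
    struct SimpleTableInfo *parent; /* immediate parent, or NULL */
    int         ispartition;        /* nonzero if this rel is a partition */
    const char *name;
} SimpleTableInfo;

/* Walk up the parent chain to the topmost (root) partitioned table. */
static SimpleTableInfo *
get_root_table(SimpleTableInfo *tbinfo)
{
    while (tbinfo->ispartition && tbinfo->parent != NULL)
        tbinfo = tbinfo->parent;
    return tbinfo;
}
```

This matches the patch's intent: when --load-via-partition-root is set, the INSERT target is the root's (schema-qualified) name rather than the leaf partition's.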
[HACKERS] scan on inheritance parent with no children in current session
Hi All, Consider a parent table which has no children in the current session, but has temporary children in other sessions. Session 1: postgres=# create table parent (a int); CREATE TABLE Session 2: postgres=# create temp table temp_child () inherits(parent); CREATE TABLE Before commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50, we would get a Seq Scan plan for scanning parent in session 1: postgres=# explain verbose select * from parent; QUERY PLAN - Seq Scan on public.parent (cost=0.00..0.00 rows=1 width=4) Output: parent.a (2 rows) In session 2, it would be an Append plan: postgres=# explain verbose select * from parent; QUERY PLAN -- Append (cost=0.00..35.50 rows=2551 width=4) -> Seq Scan on public.parent (cost=0.00..0.00 rows=1 width=4) Output: parent.a -> Seq Scan on pg_temp_4.temp_child (cost=0.00..35.50 rows=2550 width=4) Output: temp_child.a (5 rows) After that commit, in session 1 we get an Append plan: postgres=# explain verbose select * from parent; QUERY PLAN --- Append (cost=0.00..0.00 rows=1 width=4) -> Seq Scan on public.parent (cost=0.00..0.00 rows=1 width=4) Output: parent.a (3 rows) I don't think this is an intentional change. Here's a patch to fix it. The comment in the patch uses the term "real child" both in the context of the comments about temporary children from other sessions and in the comment at the end of the function where rte->inh is reset. Maybe we should move the second comment before setting has_child in the patch and use "real child" in the comment at the end to avoid repetition. But I want to first check whether we want this fix or whether we can live with the Append plan. 
-- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c index cf46b74..b640639 100644 --- a/src/backend/optimizer/prep/prepunion.c +++ b/src/backend/optimizer/prep/prepunion.c @@ -1374,7 +1374,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti) List *inhOIDs; List *appinfos; ListCell *l; - bool need_append; + bool has_child; PartitionedChildRelInfo *pcinfo; List *partitioned_child_rels = NIL; @@ -1448,7 +1448,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti) /* Scan the inheritance set and expand it */ appinfos = NIL; - need_append = false; + has_child = false; foreach(l, inhOIDs) { Oid childOID = lfirst_oid(l); @@ -1502,7 +1502,10 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti) */ if (childrte->relkind != RELKIND_PARTITIONED_TABLE) { - need_append = true; + /* Remember if we saw a real child. */ + if (childOID != parentOID) +has_child = true; + appinfo = makeNode(AppendRelInfo); appinfo->parent_relid = rti; appinfo->child_relid = childRTindex; @@ -1582,7 +1585,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti) * the parent table is harmless, so we don't bother to get rid of it; * ditto for the useless PlanRowMark node. */ - if (!need_append) + if (!has_child) { /* Clear flag before returning */ rte->inh = false; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
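The crux of the fix is that the parent's own OID is a member of the inheritance set computed for it, so the old need_append flag became true even when no real child existed; the renamed has_child flag only counts members other than the parent itself. A tiny distilled model of that check (plain C, not the planner code):

```c
#include <assert.h>

/*
 * Distilled version of the patch's test: report true only when the
 * inheritance set contains some member other than the parent itself.
 */
static int
has_real_child(const unsigned *inh_oids, int n, unsigned parent_oid)
{
    for (int i = 0; i < n; i++)
        if (inh_oids[i] != parent_oid)
            return 1;
    return 0;
}
```

With only the parent in the set (the session-1 case above, where the other session's temp children are skipped), the flag stays false and the planner can fall back to a plain Seq Scan.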
[HACKERS] Small code improvement for btree
Hi, While hacking on the btree code I found two points we can improve in nbtxlog.c. @@ -135,7 +135,7 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id) Page page = (Page) BufferGetPage(buf); BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page); - Assert((pageop->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0); +Assert(P_INCOMPLETE_SPLIT(pageop) != 0); pageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT; PageSetLSN(page, lsn); We can use P_INCOMPLETE_SPLIT() here instead. --- @@ -598,7 +598,7 @@ btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record) UnlockReleaseBuffer(ibuffer); return InvalidTransactionId; } - LockBuffer(hbuffer, BUFFER_LOCK_SHARE); + LockBuffer(hbuffer, BT_READ); hpage = (Page) BufferGetPage(hbuffer); We should use BT_READ here rather than BUFFER_LOCK_SHARE. Such code can be hard to find with grep, so it could become a source of bugs when changing the code. The attached small patch fixes both issues. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center code_improvement_for_btree.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
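The grep-ability argument is easy to see with a distilled version of the pattern. The constants and types below are simplified stand-ins rather than the actual nbtree.h definitions: wrapping the raw flag test in one named macro means every test site shows up when you grep for the macro name, instead of being scattered as ad-hoc bit operations.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the nbtree.h flag bit and page-opaque struct. */
#define BTP_INCOMPLETE_SPLIT  (1 << 7)

typedef struct BTPageOpaqueData
{
    uint16_t btpo_flags;
} BTPageOpaqueData;

/* One named test: grepping P_INCOMPLETE_SPLIT finds every caller. */
#define P_INCOMPLETE_SPLIT(opaque) \
    (((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0)
```

A raw `(pageop->btpo_flags & BTP_INCOMPLETE_SPLIT)` test expresses the same thing but is invisible to a search for the accessor, which is exactly the maintenance hazard the mail describes.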
Re: [HACKERS] UPDATE of partition key
On 2017/08/02 19:49, Amit Khandekar wrote: > On 2 August 2017 at 14:38, Amit Langotewrote: >>> One approach I had considered was to have find_inheritance_children() >>> itself lock the children in bound order, so that everyone will have >>> bound-ordered oids, but that would be too expensive since it requires >>> opening all partitioned tables to initialize partition descriptors. In >>> find_inheritance_children(), we get all oids without opening any >>> tables. But now that I think more of it, it's only the partitioned >>> tables that we have to open, not the leaf partitions; and furthermore, >>> I didn't see calls to find_inheritance_children() and >>> find_all_inheritors() in performance-critical code, except in >>> expand_inherited_rtentry(). All of them are in DDL commands; but yes, >>> that can change in the future. >> >> This approach more or less amounts to calling the new >> RelationGetPartitionDispatchInfo() (per my proposed patch, a version of >> which I posted upthread.) Maybe we can add a wrapper on top, say, >> get_all_partition_oids() which throws away other things that >> RelationGetPartitionDispatchInfo() returned. In addition it locks all the >> partitions that are returned, unlike only the partitioned ones, which is >> what RelationGetPartitionDispatchInfo() has been taught to do. > > So there are three different task items here : > 1. Arrange the oids in consistent order everywhere. > 2. Prepare the Partition Dispatch Info data structure in the planner > as against during execution. > 3. For update tuple routing, assume that the result rels are ordered > consistently to make the searching efficient. That's a good breakdown. > #3 depends on #1. So for that, I have come up with a minimum set of > changes to have expand_inherited_rtentry() generate the rels in bound > order. When we do #2 , it may be possible that we may need to re-do my > changes in expand_inherited_rtentry(), but those are minimum. 
We may > even end up having the walker function being used at multiple places, > but right now it is not certain. So AFAICS: For performance reasons, we want the order in which leaf partition sub-plans appear in the ModifyTable node (and subsequently leaf partition ResultRelInfos in the ModifyTableState) to be some known canonical order. That's because we want to map partitions in the insert tuple-routing data structure (which appear in a known canonical order as determined by RelationGetPartitionDispatchInfo) to those appearing in the ModifyTableState. That's so that we can reuse the planner-generated WCO and RETURNING lists in the insert code path when update tuple-routing invokes that path. To implement that, the planner should retrieve the list of leaf partition OIDs in the same order as ExecSetupPartitionTupleRouting() retrieves them. Because the latter calls RelationGetPartitionDispatchInfo on the root partitioned table, maybe the planner should do that too, instead of its current method of getting OIDs using find_all_inheritors(). But it's currently not possible due to the way RelationGetPartitionDispatchInfo() and the involved data structures are designed. One way forward I see is to invent new interface functions: List *get_all_partition_oids(Oid, LOCKMODE) List *get_partition_oids(Oid, LOCKMODE) that resemble find_all_inheritors() and find_inheritance_children(), respectively, but expect that callers make sure they are called only for partitioned tables. Needless to say, OIDs are returned in a canonical order determined by that of the partition bounds and the partition tree structure. We replace all the calls of the old interface functions with the respective new ones. That means expand_inherited_rtentry (among others) now calls get_all_partition_oids() if the RTE is for a partitioned table and find_all_inheritors() otherwise. > So, I think we can continue the discussion about #1 and #2 in a separate > thread. 
I have started a new thread named "expanding inheritance in partition bound order" and posted a couple of patches [1]. After applying those patches, you can write code for #3 without having to worry about the concerns of partition order, which I guess you've already done. >>> Regarding dynamically locking specific partitions as and when needed, >>> I think this method inherently has the issue of deadlock because the >>> order would be random. So it feels like there is no way around other >>> than to lock all partitions beforehand. >> >> I'm not sure why the order has to be random. If and when we decide to >> open and lock a subset of partitions for a given query, it will be done in >> some canonical order as far as I can imagine. Do you have some specific >> example in mind? > > Partitioned table t1 has partitions t1p1 and t1p2 > Partitioned table t2 at the same level has partitions t2p1 and t2p2 > Tuple routing causes the first row to insert into t2p2, so t2p2 is locked. > Next insert locks t1p1 because it inserts into t1p1. >
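The deadlock concern raised above boils down to two sessions acquiring locks on the same set of partitions in different orders; fixing one canonical acquisition order (sorted OIDs, or the bound order discussed in this thread) removes the possibility of a wait cycle. A minimal sketch of that ordering discipline, illustrative only:

```c
#include <assert.h>
#include <stdlib.h>

static int
cmp_oid(const void *a, const void *b)
{
    unsigned x = *(const unsigned *) a;
    unsigned y = *(const unsigned *) b;

    return (x > y) - (x < y);
}

/*
 * Put the to-be-locked OIDs into one canonical order before acquiring
 * any locks; if every session acquires in this order, lock waits
 * cannot form a cycle, so no deadlock is possible.
 */
static void
sort_lock_targets(unsigned *oids, int n)
{
    qsort(oids, n, sizeof(unsigned), cmp_oid);
}
```

In the t1p1/t2p2 example quoted above, both sessions would lock the lower-numbered partition first regardless of which row arrived first, which is exactly why the thread prefers locking up front in a fixed order over locking "as routed".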
[HACKERS] expanding inheritance in partition bound order
The current way to expand inherited tables, including partitioned tables, is to use either find_all_inheritors() or find_inheritance_children() depending on the context. They return child table OIDs in the (ascending) order of those OIDs, which means the callers that need to lock the child tables can do so without worrying about the possibility of deadlock in some concurrent execution of that piece of code. That's good. For partitioned tables, there is a possibility of returning child table (partition) OIDs in the partition bound order, which, in addition to preventing the possibility of deadlocks during concurrent locking, seems potentially useful for other caller-specific optimizations. For example, tuple-routing code can utilize that fact to implement a binary-search-based partition-searching algorithm. For one more example, refer to the "UPDATE partition key" thread where it's becoming clear that it would be nice if the planner had put the partitions in bound order in the ModifyTable that it creates for UPDATE of partitioned tables [1]. So attached are two WIP patches: 0001 implements two interface functions: List *get_all_partition_oids(Oid, LOCKMODE) List *get_partition_oids(Oid, LOCKMODE) that resemble find_all_inheritors() and find_inheritance_children(), respectively, but expect that users call them only for partitioned tables. Needless to say, OIDs are returned in a canonical order determined by that of the partition bounds and the way the partition tree structure is traversed (top-down, breadth-first-left-to-right). The patch replaces all the calls of the old interface functions with the respective new ones for partitioned table parents. That means expand_inherited_rtentry (among others) now calls get_all_partition_oids() if the RTE is for a partitioned table and find_all_inheritors() otherwise. 
In its implementation, get_all_partition_oids() calls RelationGetPartitionDispatchInfo(), which is useful to generate the result list in the desired partition bound order. But the current interface and implementation of RelationGetPartitionDispatchInfo() needs some rework, because it's too closely coupled with the executor's tuple routing code. Applying just 0001 will satisfy the requirements stated in [1], but it won't look pretty as is for too long. So, 0002 is a patch to refactor RelationGetPartitionDispatchInfo() and relevant data structures. For example, PartitionDispatchData has now been simplified to contain only the partition key, partition descriptor and indexes array, whereas previously it also stored the relation descriptor, partition key execution state, tuple table slot, tuple conversion map which are required for tuple-routing. RelationGetPartitionDispatchInfo() no longer generates those things, but returns just enough information so that a caller can generate and manage those things by itself. This simplification makes it less cumbersome to call RelationGetPartitionDispatchInfo() in other places. Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BTgmoajC0J50%3D2FqnZLvB10roY%2B68HgFWhso%3DV_StkC6PWujQ%40mail.gmail.com From 9674053fd1e57a480d8a42585cb10421e2c76a70 Mon Sep 17 00:00:00 2001 From: amitDate: Wed, 2 Aug 2017 17:14:59 +0900 Subject: [PATCH 1/3] Add get_all_partition_oids and get_partition_oids Their respective counterparts find_all_inheritors() and find_inheritance_children() read the pg_inherits catalog directly and frame the result list in some order determined by the order of OIDs. get_all_partition_oids() and get_partition_oids() form their result by reading the partition OIDs from the PartitionDesc contained in the relcache. Hence, the order of OIDs in the resulting list is based on that of the partition bounds. 
In the case of get_all_partition_oids which traverses the whole-tree, the order is also determined by the fact that the tree is traversed in a breadth-first manner. --- contrib/sepgsql/dml.c | 4 +- src/backend/catalog/partition.c| 84 ++ src/backend/commands/analyze.c | 8 ++- src/backend/commands/lockcmds.c| 6 +- src/backend/commands/publicationcmds.c | 9 ++- src/backend/commands/tablecmds.c | 124 + src/backend/commands/vacuum.c | 7 +- src/backend/optimizer/prep/prepunion.c | 6 +- src/include/catalog/partition.h| 3 + 9 files changed, 213 insertions(+), 38 deletions(-) diff --git a/contrib/sepgsql/dml.c b/contrib/sepgsql/dml.c index b643720e36..62d6610c43 100644 --- a/contrib/sepgsql/dml.c +++ b/contrib/sepgsql/dml.c @@ -332,8 +332,10 @@ sepgsql_dml_privileges(List *rangeTabls, bool abort_on_violation) */ if (!rte->inh) tableIds = list_make1_oid(rte->relid); - else + else if (rte->relkind != RELKIND_PARTITIONED_TABLE) tableIds = find_all_inheritors(rte->relid,
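To make the "top-down, breadth-first-left-to-right" ordering concrete, here is a minimal model of walking a partition tree whose children are already stored in bound order. The structures are hypothetical illustrations, not the catalog or relcache code:

```c
#include <assert.h>

typedef struct PartNode
{
    unsigned oid;
    struct PartNode **children;  /* already in partition-bound order */
    int nchildren;
} PartNode;

/*
 * Emit OIDs breadth-first, left to right: the root first, then each
 * level in bound order.  Returns the number of OIDs written.
 */
static int
collect_partition_oids(PartNode *root, unsigned *out, int cap)
{
    PartNode *queue[32];
    int head = 0, tail = 0, n = 0;

    queue[tail++] = root;
    while (head < tail && n < cap)
    {
        PartNode *node = queue[head++];

        out[n++] = node->oid;
        for (int i = 0; i < node->nchildren && tail < 32; i++)
            queue[tail++] = node->children[i];
    }
    return n;
}
```

Because the children array of each node is in bound order and the queue preserves insertion order, every traversal of the same tree yields the same OID list, which is what makes the result safe to use for concurrent locking as well as for caller-side binary search.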
Re: [HACKERS] Support for Secure Transport SSL library on macOS as OpenSSL alternative
On Thu, Aug 3, 2017 at 11:26 PM, Daniel Gustafsson wrote: >> On 03 Aug 2017, at 19:27, Michael Paquier wrote: >> There were no APIs to get the TLS Finished message last time I looked at the OSX stuff, which mattered for tls-unique. It would be nice if we could get one. > > Yeah, AFAICT there is no API for that. Perhaps my last words were confusing. I meant that it would be nice to get at least one type of channel binding working. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Fri, Jul 28, 2017 at 2:24 PM, Noah Misch wrote: > On Thu, Apr 06, 2017 at 08:55:37AM +0200, Petr Jelinek wrote: >> On 06/04/17 03:51, Noah Misch wrote: >> > On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote: >> >> On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch wrote: >> >>> On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote: >> Regarding this feature, there are some loose ends. We should work on >> and complete them before the release. >> >> (1) >> Which synchronous replication method, priority or quorum, should be >> chosen when neither FIRST nor ANY is specified in s_s_names? Right now, >> priority-based sync replication is chosen to keep backward >> compatibility. However some hackers argued to change this decision >> so that a quorum commit is chosen, because they think that most users >> prefer a quorum. > >> >> The items (1) and (3) are not bugs. So I don't think that they need to be >> >> resolved before the beta release. After the feature freeze, many users >> >> will try and play with many new features including quorum-based syncrep. >> >> Then if many of them complain about (1) and (3), we can change the code >> >> at that point. So we need more time for users to try the feature. >> > >> > I've moved (1) to a new section for things to revisit during beta. If >> > someone >> > feels strongly that the current behavior is Wrong and must change, speak >> > up as >> > soon as you reach that conclusion. Absent such arguments, the behavior >> > won't >> > change. >> > >> >> I was one of the people who said in the original thread that I think the >> default behavior should change to quorum, and I am still of that opinion. > > This item appears under "decisions to recheck mid-beta". If anyone is going > to push for a change here, now is the time. It has been one week since the previous mail. I thought there were others who argued to change the behavior of the old-style setting so that a quorum commit is chosen. 
If nobody is going to push for a change, can we live with the current behavior? Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
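For readers following this thread, the three forms of synchronous_standby_names at issue (PostgreSQL 10 syntax; the standby names are examples) are:

```
# Priority-based: wait for the 2 highest-priority connected standbys
synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'

# Quorum-based: wait for any 2 of the listed standbys
synchronous_standby_names = 'ANY 2 (s1, s2, s3)'

# Old-style list: currently treated as priority-based,
# i.e. equivalent to FIRST 1 (s1, s2, s3)
synchronous_standby_names = 's1, s2, s3'
```

The open question (1) is whether the third, old-style form should keep its priority-based interpretation for backward compatibility or switch to the quorum interpretation.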