Re: [HACKERS] Reduce pinning in btree indexes
Kevin Grittner <kgri...@ymail.com> wrote:
> Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>> I attached your latest patch to this mail as bt-nopin-v4.patch for
>> now. Please check that there's no problem in it.
>
> I checked out master, applied the patch and checked it against my
> latest code (merged with master), and it matched. So, looks good.

Pushed with the addition of one more paragraph of comments, regarding
something to watch out for in any follow-on patch to eliminate the pin
for a scan using a non-MVCC snapshot.

Thanks again to Kyotaro-san and Heikki for the reviews!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
On Sat, Mar 14, 2015 at 4:07 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> -1 for a time based setting. After years of consideration, bloat is
> now controllable by altering the size of the undo tablespace.

Hm. Well, fwiw the situation is rather more complicated than that.
You're correct that you can create a fixed size undo tablespace, in
which case Oracle will tell you approximately how much retention
period you can rely on given that size. But alternately you can set
the undo tablespace to be autoextend, and Oracle will tune it to
ensure there's enough retention to cover the longest-running query in
the system. You can also set a time-based parameter (undo_retention)
to ensure there's always at least enough data retained to cover a
fixed amount of time for flashback (Oracle's equivalent to our time
travel feature in ancient versions of Postgres).

--
greg
Re: [HACKERS] Reduce pinning in btree indexes
On 18 March 2015 at 03:01, Jim Nasby <jim.na...@bluetreble.com> wrote:
> On 3/16/15 11:47 AM, Robert Haas wrote:
>> I am sure there are more sophisticated things to be done here, but I
>> guess my feeling is that time is a good way to go here for a first
>> cut - lots of people have suggested it, and there's clearly a use
>> case for it. If the setting turns out to be popular, we can
>> fine-tune it more once we've got some experience with it. But I'm
>> nervous about trying to design something more complicated than that
>> right now, especially so late in the release cycle. We know that
>> this setting, with time-based units, will meet the needs of the
>> customer for whom it was developed, and that is a good-enough reason
>> to think that time is a reasonable metric for this, even if we
>> eventually have others.
>
> +1. As you said, time is easy for people to understand, and I think
> it will handle a large chunk of the use cases.

If it wasn't clear, I had already said that my idea was for the next
release, before Robert said that.

> I used to have this quote (or something close to it) on my
> whiteboard... I think it's appropriate here ;)
>
> "The perfect is the enemy of the good." -Simon Riggs

Interesting to hear I was anybody's pinup ;-) Not sure that is a
saying of mine though; it sounds more like Robert to me.

I do believe in "something now, something better later". I'd call that
incremental benefit, as opposed to incremental development, where the
benefit may not be realisable until the last point. Which is why I am
backing this patch for 9.5, even though the UI is less useful than I
think is eventually necessary.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
On 3/16/15 11:47 AM, Robert Haas wrote:
> I am sure there are more sophisticated things to be done here, but I
> guess my feeling is that time is a good way to go here for a first
> cut - lots of people have suggested it, and there's clearly a use
> case for it. If the setting turns out to be popular, we can fine-tune
> it more once we've got some experience with it. But I'm nervous about
> trying to design something more complicated than that right now,
> especially so late in the release cycle. We know that this setting,
> with time-based units, will meet the needs of the customer for whom
> it was developed, and that is a good-enough reason to think that time
> is a reasonable metric for this, even if we eventually have others.

+1. As you said, time is easy for people to understand, and I think it
will handle a large chunk of the use cases.

I used to have this quote (or something close to it) on my
whiteboard... I think it's appropriate here ;)

"The perfect is the enemy of the good." -Simon Riggs

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Thank you for rewriting.

Thank *you* for pointing out where more work was needed in the
comments and README!

> I attached your latest patch to this mail as bt-nopin-v4.patch for
> now. Please check that there's no problem in it.

I checked out master, applied the patch and checked it against my
latest code (merged with master), and it matched. So, looks good.

> - With this patch, an index scan releases buffer pins while fetching
>   index tuples in a page, so it should reduce the chance that a
>   long-running index scan blocks vacuum, because vacuum can now
>   easily overtake the current position of an index scan. I didn't
>   actually measure how effective it is, though.

That part got pretty thorough testing on end-user software. The whole
reason for writing the patch was that after applying the
snapshot-too-old PoC patch they still saw massive bloat because all
autovacuum workers blocked behind cursors which were left idle. The
initial version of this patch fixed that, preventing (in combination
with the other patch) uncontrolled bloat in a 48 hour test.

> - It causes no performance deterioration; on the contrary, it
>   accelerates index scans. That seems to be because of the removal of
>   the lock and unlock surrounding _bt_steppage in _bt_next.

I fear that the performance improvement may only show up on forward
scans -- because the whole L&Y algorithm depends on splits only moving
right, walking right is almost trivial to perform correctly with
minimal locking and pinning. Our left pointers help with scanning
backward, but there are a lot of conditions that can complicate that,
leaving us with the choice between keeping some of the locking or
potentially scanning more pages than we now do on a backward scan.
Since it wasn't clear how many cases would benefit from a change and
how many would lose, I pretty much left the backward scan locking
alone in this patch. Changing the code to work the other way would not
be outrageously difficult; but benchmarking to determine whether the
alternative was a net win would be pretty tricky and very
time-consuming. If that is to be done, I strongly feel it should be a
separate patch.

Because this patch makes a forward scan faster without having much
effect on a backward scan, the performance difference between them
(which has always existed) will get a little wider. I wonder whether
this difference should perhaps be reflected in plan costing.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
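(To illustrate the asymmetry Kevin describes: under L&Y, a page split
only ever moves tuples to the right, so a forward step can release its
hold on the current page and simply chase right-links until it reaches
a live page. A rough sketch of such a step, using nbtree naming
conventions but not taken from the committed code:)

    for (;;)
    {
        /* remember the right-link, then let go of the current page */
        blkno = opaque->btpo_next;
        _bt_relbuf(rel, buf);

        /*
         * Read the next page.  No lock coupling is needed: a concurrent
         * split can only have moved our tuples further right, where this
         * loop will still find them.
         */
        buf = _bt_getbuf(rel, blkno, BT_READ);
        page = BufferGetPage(buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        if (!P_IGNORE(opaque))
            break;              /* live page: scan its items */
        /* half-dead or deleted page: keep following right-links */
    }

(Walking backward has no such guarantee -- the left-link may be stale
after a concurrent split -- which is why the patch leaves the
backward-scan locking alone.)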
Re: [HACKERS] Reduce pinning in btree indexes
On 16 March 2015 at 12:48, Kevin Grittner <kgri...@ymail.com> wrote:
> Simon Riggs <si...@2ndquadrant.com> wrote:
>> On 13 March 2015 at 15:41, Kevin Grittner <kgri...@ymail.com> wrote:
>>> The feedback was generally fairly positive except for the fact that
>>> snapshot age (for purposes of being too old) was measured in
>>> transaction IDs assigned. There seemed to be a pretty universal
>>> feeling that this needed to be changed to a time-based setting.
>>
>> -1 for a time based setting. After years of consideration, bloat is
>> now controllable by altering the size of the undo tablespace.
>>
>> I think PostgreSQL needs something size-based also. It would need
>> some estimation to get it to work like that, true, but it is
>> actually the size of the bloat we care about, not the time. So we
>> should be thinking in terms of limits that we actually care about.
>
> Are you thinking, then, that WAL volume generated (as determined by
> LSN) would be the appropriate unit of measure for this? (We would
> still need to map that back to transaction IDs for vacuuming, of
> course.) If we did that we could allow the size units of measure,
> like '5GB' and similar. Or are you thinking of something else?

It's probably the closest and easiest measure, and the most
meaningful. We can easily accumulate that in a data structure in clog,
like the async commit LSN. For the next release though, since it will
take a little bit of thought to interpret that.

With commit timestamps enabled in 9.5, we can easily judge a time
limit, but it is less useful because it's not a measure of bloat. As
I've said, I'd be happy with just an xid limit for 9.5, if that was
the only thing we had. But I think timestamp is just as easy.

> Given that there seems to be disagreement on what is the more useful
> metric, do we want to consider allowing more than one? If so, would
> it be when *all* conditions are met or when *any* conditions are met?

Yours was the first reply to my idea, so I think it's too early to
describe that as disagreement.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
On Mon, Mar 16, 2015 at 8:48 AM, Kevin Grittner <kgri...@ymail.com> wrote:
> Given that there seems to be disagreement on what is the more useful
> metric, do we want to consider allowing more than one? If so, would
> it be when *all* conditions are met or when *any* conditions are met?

Oh, gosh. Maybe we could invent a whole policy language for this. Or,
if we want to do something less painful, we could fire nail-guns into
our eyeballs.

I'm not sure what the value of basing this on WAL volume is, and I
think we should talk about that a bit more first. The value of doing
this by rate of XID consumption is easy to understand: it's simple to
implement. The value of doing this by time is also simple to
understand: if you set the setting to 20 minutes, then you can be sure
that queries will not get cancelled unless they run for more than 20
minutes. This clearly makes the system easier to reason about. I don't
see what the benefit of doing this by WAL volume would be.

What Simon actually proposed was by *bloat*; that is, if the overall
amount of bloat in the system exceeds some metric, start whacking old
queries in order to control it. The appeal of that is obvious, but I
think it would be very hard to make work, because canceling queries
won't get rid of the bloat you've already introduced, and so you'd
tend to cancel nothing until you hit some threshold, and then start
canceling everything.

If we want this feature, let's try to keep the spec simple enough that
we actually get it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
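(For concreteness, the time-based semantics described above reduce to
a check like the following; the function name and threshold parameter
are hypothetical, and only the behavior follows the discussion:)

    /*
     * Sketch: a snapshot is "too old" once more than the configured
     * number of minutes has passed since it was taken.
     */
    static bool
    snapshot_is_too_old(TimestampTz snapshot_taken, int threshold_min)
    {
        TimestampTz limit =
            TimestampTzPlusMilliseconds(snapshot_taken,
                                        threshold_min * 60 * 1000L);

        return GetCurrentTimestamp() > limit;
    }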
Re: [HACKERS] Reduce pinning in btree indexes
On Mon, Mar 16, 2015 at 12:07 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> What Simon actually proposed was by *bloat*; that is, if the overall
>> amount of bloat in the system exceeds some metric, start whacking
>> old queries in order to control it. The appeal of that is obvious,
>
> Agreed
>
>> but I think it would be very hard to make work, because canceling
>> queries won't get rid of the bloat you've already introduced, and so
>> you'd tend to cancel nothing until you hit some threshold, and then
>> start canceling everything.
>
> That would just be an unhelpful way of using that info. There are
> many possible designs that wouldn't need to work that way. We have
> many cases where we begin a cleanup action before we run out of space
> for other server resources. VACUUM is itself a good example, but
> there are many others.
>
> The main point is that if we blindly decide things based upon xid age
> or time then it will be wrong in many cases, since apps with uniform
> write rates are rare. We need to work out how to limit bloat itself.
> If we can't do that easily at a macro level, then we'll have to do
> that more precisely at the table level, for example having VACUUM
> decide where to put the limit if we find too much
> unremovable/recently dead data.

I am sure there are more sophisticated things to be done here, but I
guess my feeling is that time is a good way to go here for a first cut
- lots of people have suggested it, and there's clearly a use case for
it. If the setting turns out to be popular, we can fine-tune it more
once we've got some experience with it. But I'm nervous about trying
to design something more complicated than that right now, especially
so late in the release cycle. We know that this setting, with
time-based units, will meet the needs of the customer for whom it was
developed, and that is a good-enough reason to think that time is a
reasonable metric for this, even if we eventually have others.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
On 16 March 2015 at 13:02, Robert Haas <robertmh...@gmail.com> wrote:
> I'm not sure what the value of basing this on WAL volume is, and I
> think we should talk about that a bit more first. The value of doing
> this by rate of XID consumption is easy to understand: it's simple to
> implement. The value of doing this by time is also simple to
> understand: if you set the setting to 20 minutes, then you can be
> sure that queries will not get cancelled unless they run for more
> than 20 minutes. This clearly makes the system easier to reason
> about. I don't see what the benefit of doing this by WAL volume
> would be.

I think we can assess that more clearly as the amount of now-dead
space left by an action. So we can include Updates and Deletes, or in
the case of aborted transactions, the total WAL written. I assumed
that was what Kevin meant.

> What Simon actually proposed was by *bloat*; that is, if the overall
> amount of bloat in the system exceeds some metric, start whacking old
> queries in order to control it. The appeal of that is obvious,

Agreed

> but I think it would be very hard to make work, because canceling
> queries won't get rid of the bloat you've already introduced, and so
> you'd tend to cancel nothing until you hit some threshold, and then
> start canceling everything.

That would just be an unhelpful way of using that info. There are many
possible designs that wouldn't need to work that way. We have many
cases where we begin a cleanup action before we run out of space for
other server resources. VACUUM is itself a good example, but there are
many others.

The main point is that if we blindly decide things based upon xid age
or time then it will be wrong in many cases, since apps with uniform
write rates are rare. We need to work out how to limit bloat itself.
If we can't do that easily at a macro level, then we'll have to do
that more precisely at the table level, for example having VACUUM
decide where to put the limit if we find too much
unremovable/recently dead data.

> If we want this feature, let's try to keep the spec simple enough
> that we actually get it.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
Simon Riggs <si...@2ndquadrant.com> wrote:
> On 13 March 2015 at 15:41, Kevin Grittner <kgri...@ymail.com> wrote:
>> The feedback was generally fairly positive except for the fact that
>> snapshot age (for purposes of being too old) was measured in
>> transaction IDs assigned. There seemed to be a pretty universal
>> feeling that this needed to be changed to a time-based setting.
>
> -1 for a time based setting. After years of consideration, bloat is
> now controllable by altering the size of the undo tablespace.
>
> I think PostgreSQL needs something size-based also. It would need
> some estimation to get it to work like that, true, but it is actually
> the size of the bloat we care about, not the time. So we should be
> thinking in terms of limits that we actually care about.

Are you thinking, then, that WAL volume generated (as determined by
LSN) would be the appropriate unit of measure for this? (We would
still need to map that back to transaction IDs for vacuuming, of
course.) If we did that we could allow the size units of measure, like
'5GB' and similar. Or are you thinking of something else?

Given that there seems to be disagreement on what is the more useful
metric, do we want to consider allowing more than one? If so, would it
be when *all* conditions are met or when *any* conditions are met?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
On 13 March 2015 at 13:17, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Mar 13, 2015 at 8:52 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> On 15 February 2015 at 00:19, Kevin Grittner <kgri...@ymail.com> wrote:
>>> What they wanted was what happened in the other database product --
>>> if a snapshot got sufficiently old that cleaning up the MVCC data
>>> was a problem *and* the snapshot was used again *and* it read a
>>> page which had been modified far enough back that it was not
>>> possible to return correct data, then they wanted to receive a
>>> "snapshot too old" error. I wrote a patch to do that...
>>
>> So, please let's see the patch. It seems useful for core Postgres.
>
> It was submitted here:
>
> http://www.postgresql.org/message-id/136937748.3364317.1423964815320.javamail.ya...@mail.yahoo.com

Thanks. I have +1'd that patch.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
On 13 March 2015 at 15:41, Kevin Grittner <kgri...@ymail.com> wrote:
> The feedback was generally fairly positive except for the fact that
> snapshot age (for purposes of being too old) was measured in
> transaction IDs assigned. There seemed to be a pretty universal
> feeling that this needed to be changed to a time-based setting.

-1 for a time based setting. After years of consideration, bloat is
now controllable by altering the size of the undo tablespace.

I think PostgreSQL needs something size-based also. It would need some
estimation to get it to work like that, true, but it is actually the
size of the bloat we care about, not the time. So we should be
thinking in terms of limits that we actually care about.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
Hello,

At Thu, 12 Mar 2015 15:27:37 -0700, Peter Geoghegan <p...@heroku.com>
wrote in CAM3SWZQ5nwTB-y4ZOj=5ckMLce5GAEUnjKJ2=m1vmhfx_ay...@mail.gmail.com
> On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgri...@ymail.com> wrote:
>> At some point we could consider building on this patch to recheck
>> index conditions for heap access when a non-MVCC snapshot is used,
>> check the visibility map for referenced heap pages when the TIDs are
>> read for an index-only scan, and/or skip LP_DEAD hinting for
>> non-WAL-logged indexes. But all those are speculative future work;
>> this is a conservative implementation that just didn't modify
>> pinning where there were any confounding factors.
>
> I don't have the bandwidth for a full review, but I took a quick look
> at this.
>
> I think you should call out those confounding factors in the README.
> It's not hard to piece them together from
> _bt_drop_lock_and_maybe_pin(), but I think you should anyway.
>
> Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing nbtree
> LockBuffer() callers do. You're inconsistent about that in V3.

I agree with you. It looks to be the only point where it is used.

In addition to that, the comment just above the point mentioned above
puzzles me.

    /* XXX: Can walking left be lighter on the locking and pins? */
    if (BTScanPosIsPinned(so->currPos))
        LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE);
    else
        so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);

I'd be happy if I could read the meaning of the comment more clearly.
I understand it to say that you want to remove the locking (and
pinning), but can't now because simultaneous splitting of the left
page would break something. I'd like to see it made clearer, whether
or not my reading is correct.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Reduce pinning in btree indexes
On 4 March 2015 at 03:16, Kevin Grittner <kgri...@ymail.com> wrote:
> How do people feel about the idea of me pushing this for 9.5 (after I
> clean up all the affected comments and README files)? I know this
> first appeared in the last CF, but the footprint is fairly small and
> the only user-visible behavior change is that a btree index scan of a
> WAL-logged table using an MVCC snapshot no longer blocks a vacuum
> indefinitely. (If there are objections I will move this to the first
> CF for the next release.)

It helps Hot Standby also, BTW. I proposed this previously, but it was
shot down, so I'm glad to see it happening.

I'm not sure why you have proposed only half the solution though?
Hopefully we aren't just submitting the difficult half that you needed
feedback on? Let's see the whole "snapshot too old" patch as well,
since that affects core PostgreSQL also.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
On 15 February 2015 at 00:19, Kevin Grittner <kgri...@ymail.com> wrote:
> What they wanted was what happened in the other database product --
> if a snapshot got sufficiently old that cleaning up the MVCC data was
> a problem *and* the snapshot was used again *and* it read a page
> which had been modified far enough back that it was not possible to
> return correct data, then they wanted to receive a "snapshot too old"
> error. I wrote a patch to do that...

So, please let's see the patch. It seems useful for core Postgres.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
On Fri, Mar 13, 2015 at 8:52 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 15 February 2015 at 00:19, Kevin Grittner <kgri...@ymail.com> wrote:
>> What they wanted was what happened in the other database product --
>> if a snapshot got sufficiently old that cleaning up the MVCC data
>> was a problem *and* the snapshot was used again *and* it read a page
>> which had been modified far enough back that it was not possible to
>> return correct data, then they wanted to receive a "snapshot too
>> old" error. I wrote a patch to do that...
>
> So, please let's see the patch. It seems useful for core Postgres.

It was submitted here:

http://www.postgresql.org/message-id/136937748.3364317.1423964815320.javamail.ya...@mail.yahoo.com

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> At Thu, 12 Mar 2015 15:27:37 -0700, Peter Geoghegan <p...@heroku.com> wrote:
>> On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgri...@ymail.com> wrote:
>>> At some point we could consider building on this patch to recheck
>>> index conditions for heap access when a non-MVCC snapshot is used,
>>> check the visibility map for referenced heap pages when the TIDs
>>> are read for an index-only scan, and/or skip LP_DEAD hinting for
>>> non-WAL-logged indexes. But all those are speculative future work;
>>> this is a conservative implementation that just didn't modify
>>> pinning where there were any confounding factors.
>>
>> I think you should call out those confounding factors in the README.
>> It's not hard to piece them together from
>> _bt_drop_lock_and_maybe_pin(), but I think you should anyway.

OK:

https://github.com/kgrittn/postgres/commit/f5f59ded30b114ac83b90a00ba1fa5ef490b994e

>> Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing nbtree
>> LockBuffer() callers do. You're inconsistent about that in V3.
>
> I agree with you. It looks to be the only point where it is used.

OK:

https://github.com/kgrittn/postgres/commit/76118b58be819ed5e68569c926d0222bc41640ea

> In addition to that, the comment just above the point mentioned above
> puzzles me.
>
>     /* XXX: Can walking left be lighter on the locking and pins? */
>     if (BTScanPosIsPinned(so->currPos))
>         LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE);
>     else
>         so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
>
> I'd be happy if I could read the meaning of the comment more clearly.
> I understand it to say that you want to remove the locking (and
> pinning), but can't now because simultaneous splitting of the left
> page would break something. I'd like to see it made clearer, whether
> or not my reading is correct.

Does this clear it up?:

https://github.com/kgrittn/postgres/commit/22066fc161a092e800e4c1e853136c4513f8771b

Since there are no changes that would affect the compiled code here,
I'm not posting a new patch yet. I'll do that once things seem to have
settled down a bit more.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Mar 13, 2015 at 8:52 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> On 15 February 2015 at 00:19, Kevin Grittner <kgri...@ymail.com> wrote:
>>> What they wanted was what happened in the other database product --
>>> if a snapshot got sufficiently old that cleaning up the MVCC data
>>> was a problem *and* the snapshot was used again *and* it read a
>>> page which had been modified far enough back that it was not
>>> possible to return correct data, then they wanted to receive a
>>> "snapshot too old" error. I wrote a patch to do that...
>>
>> So, please let's see the patch. It seems useful for core Postgres.
>
> It was submitted here:
>
> http://www.postgresql.org/message-id/136937748.3364317.1423964815320.javamail.ya...@mail.yahoo.com

The feedback was generally fairly positive except for the fact that
snapshot age (for purposes of being too old) was measured in
transaction IDs assigned. There seemed to be a pretty universal
feeling that this needed to be changed to a time-based setting. I've
been working on that, although my initial efforts proved to be a
dead-end. I've had some off-list discussions with Andrew Dunstan and
we have agreed on something that looks like it should work well, but
it's still being coded.

The other thing that is needed is to sprinkle the other index AMs with
TestForOldSnapshot() calls. In the btree code the right spots for
those were all following BufferGetPage() calls, so the work was in
seeing where each of those could be called from to see whether it was
a spot where the check was needed. Of course, docs and comment changes
are also needed, but that's probably only a day or two's effort.

Further discussion of the "snapshot too old" patch should probably go
on its thread. The spillover is largely my fault for referencing one
from the other.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
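(A sketch of the pattern just described, inferred from the
description; the exact signature in the eventual patch may differ:)

    /*
     * After each BufferGetPage() whose contents feed user-visible
     * results, check whether the snapshot is still allowed to see the
     * page; if not, this raises the "snapshot too old" error.
     */
    page = BufferGetPage(buf);
    TestForOldSnapshot(scan->xs_snapshot, rel, page);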
Re: [HACKERS] Reduce pinning in btree indexes
Kevin Grittner <kgri...@ymail.com> wrote:
> Because these changes are just to a comment and a README file, I'm
> posting a patch-on-patch v3a (to be applied on top of v3).

Here is some additional comment work that I hope will make things
easier to understand for whoever next visits this code. There is also
a minor reorganization of some code that should not have any impact
except to skip the "restore" memcpy() calls where they are not needed
in a corner case that I'm not sure we can currently even reach. That
reorganization is intended more for code clarity than for the marginal
performance it could provide.

I'm posting this as v3b to be applied on top of v3 and v3a, so that
these changes can be reviewed and commented on separately.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[Attachment: bt-nopin-v3b.patch]
Re: [HACKERS] Reduce pinning in btree indexes
On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgri...@ymail.com> wrote:
> At some point we could consider building on this patch to recheck
> index conditions for heap access when a non-MVCC snapshot is used,
> check the visibility map for referenced heap pages when the TIDs are
> read for an index-only scan, and/or skip LP_DEAD hinting for
> non-WAL-logged indexes. But all those are speculative future work;
> this is a conservative implementation that just didn't modify pinning
> where there were any confounding factors.

I don't have the bandwidth for a full review, but I took a quick look
at this.

I think you should call out those confounding factors in the README.
It's not hard to piece them together from
_bt_drop_lock_and_maybe_pin(), but I think you should anyway.

Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing nbtree
LockBuffer() callers do. You're inconsistent about that in V3.

--
Peter Geoghegan
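(For readers piecing the confounding factors together: the helper
Peter names gates the pin release on exactly the three conditions from
Kevin's list. A sketch consistent with that description; the patch's
final text may differ in detail:)

    /*
     * Release the read lock; also release the pin unless one of the
     * confounding factors applies: a non-MVCC snapshot, an index that
     * is not WAL-logged, or an index-only scan that must keep the
     * tuples around.
     */
    static void
    _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
    {
        LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);

        if (IsMVCCSnapshot(scan->xs_snapshot) &&
            RelationNeedsWAL(scan->indexRelation) &&
            !scan->xs_want_itup)
        {
            ReleaseBuffer(sp->buf);
            sp->buf = InvalidBuffer;
        }
    }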
Re: [HACKERS] Reduce pinning in btree indexes
Hello,

I found no other problems, including the performance issue, in the
patch attached to the last mail, as far as I can tell. So I'll mark
this as ready for committer after a few days with no objection, once
these comments are addressed.

>> - If (BTScanPosIsPinned(so->currPos)).
>>
>>   As I mention below for nbtsearch.c, the page acquired in the
>>   if-then block may be vacuumed so the LSN check done in the if-else
>>   block is needed also in the if-then block. It will be accomplished
>>   only by changing the position of the end of the if-else block.
>
> I'm not sure I agree on this.

Sorry, it is largely because of my poor composition.

> If the page is pinned it should have been pinned continuously since
> we initially read it, so the line pointers we have could not have
> been removed by any vacuum process. The tuples may have been pruned
> away in the heap, but that doesn't matter. Index entries may have
> been added and the index page may have been split, but that was true
> before this patch and _bt_killitems will deal with those things the
> same as it always has.

Yes. I think so.

> I don't see any benefit to doing the LSN compare in this case; if
> we've paid the costs of holding the pin through to this point, we
> might as well flag any dead entries we can.

I thought of the case where the buffer has been pinned by another
backend after this backend unpinned it. I looked again at the
definitions of BTScanPosIsPinned and BTScanPosUnpinIfPinned, and
understood that the pin should be mine if BTScanPosIsPinned.

Would you mind rewriting the comment there like this?

-    /* The buffer is still pinned, but not locked. Lock it now. */
+    /* I still hold the pin on the buffer, but not locked. Lock it now. */
|    LockBuffer(so->currPos.buf, BT_READ);

Or would you mind renaming the macro to BTScanPosIsPinnedByMe or
something like that, or any other way to notify the reader that the
pin should be mine there?

>> - _bt_killitems is called without a pin when rescanning from before,
>>   so I think the previous code has already considered the unpinned
>>   case ("if (ItemPointerEquals(ituple..." does that). Vacuum
>>   rearranges line pointers after deleting LP_DEAD items so the
>>   previous check seems enough for the purpose. The previous code is
>>   more efficient because the matching items will be processed even
>>   after vacuum.
>
> I'm not following you on this one; could you rephrase it?

Sorry, I incorrectly read btrescan as calling _bt_killitems() *after*
releasing the buffer. Please forget it.

Finally, I'd like to timidly comment on this..

+ To handle the cases where it is safe to release the pin after
+ reading the index page, the LSN of the index page is read along
+ with the index entries before the shared lock is released, and we
+ do not attempt to flag index tuples as dead if the page is not
+ pinned and the LSN does not match.

I suppose that a sentence like the following describes the mechanism
more directly and makes it easier to find the corresponding code, if
it is correct.

   To handle the cases where an index page has been unpinned when
   trying to mark the unused index tuples on the page as dead, the
   LSN of the index page is remembered when reading the index page
   for index tuples, and we do not attempt to flag index tuples as
   dead if the page is not pinned and the LSN does not match.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> I found no other problems, including the performance issue, in the
> patch attached to the last mail, as far as I can tell. So I'll mark
> this as ready for committer after a few days with no objection, once
> these comments are addressed.

Thanks for the reviews!

>> I don't see any benefit to doing the LSN compare in this case; if
>> we've paid the costs of holding the pin through to this point, we
>> might as well flag any dead entries we can.
>
> I thought of the case where the buffer has been pinned by another
> backend after this backend unpinned it. I looked again at the
> definitions of BTScanPosIsPinned and BTScanPosUnpinIfPinned, and
> understood that the pin should be mine if BTScanPosIsPinned.
>
> Would you mind rewriting the comment there like this?
>
> -    /* The buffer is still pinned, but not locked. Lock it now. */
> +    /* I still hold the pin on the buffer, but not locked. Lock it now. */
> |    LockBuffer(so->currPos.buf, BT_READ);
>
> Or would you mind renaming the macro to BTScanPosIsPinnedByMe or
> something like that, or any other way to notify the reader that the
> pin should be mine there?

I see your point, although those first person singular pronouns used
like that make me a little uncomfortable; I'll change the comment
and/or macro name, but I'll work on the name some more.

> Finally, I'd like to timidly comment on this..
>
> + To handle the cases where it is safe to release the pin after
> + reading the index page, the LSN of the index page is read along
> + with the index entries before the shared lock is released, and we
> + do not attempt to flag index tuples as dead if the page is not
> + pinned and the LSN does not match.
>
> I suppose that a sentence like the following describes the mechanism
> more directly and makes it easier to find the corresponding code, if
> it is correct.
>
>    To handle the cases where an index page has been unpinned when
>    trying to mark the unused index tuples on the page as dead, the
>    LSN of the index page is remembered when reading the index page
>    for index tuples, and we do not attempt to flag index tuples as
>    dead if the page is not pinned and the LSN does not match.

Will reword that part to try to make it clearer. Thanks!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
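(For reference, the position-state helpers being discussed have
roughly this shape; simplified from the patch, whose real macros carry
extra assertions and reset more fields:)

    /*
     * The pin, if any, is always this backend's own: the scan took it
     * when it read the page, and only these helpers release it.
     */
    #define BTScanPosIsPinned(scanpos) \
        BufferIsValid((scanpos).buf)

    #define BTScanPosUnpinIfPinned(scanpos) \
    do { \
        if (BTScanPosIsPinned(scanpos)) \
        { \
            ReleaseBuffer((scanpos).buf); \
            (scanpos).buf = InvalidBuffer; \
        } \
    } while (0)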
Re: [HACKERS] Reduce pinning in btree indexes
Kevin Grittner <kgri...@ymail.com> wrote:
> Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>> Would you mind rewriting the comment there like this?
>>
>> -    /* The buffer is still pinned, but not locked. Lock it now. */
>> +    /* I still hold the pin on the buffer, but not locked. Lock it now. */
>>
>> Or would you mind renaming the macro to BTScanPosIsPinnedByMe or
>> something like that, or any other way to notify the reader that the
>> pin should be mine there?
>
> I see your point, although those first person singular pronouns used
> like that make me a little uncomfortable; I'll change the comment
> and/or macro name, but I'll work on the name some more.

After thinking it over, I think that the BTScanPos part of the macro
name is enough of a hint that it is concerned with the actions of this
scan; it is the comment that needs the change. I went with:

    /*
     * We have held the pin on this page since we read the index tuples,
     * so all we need to do is lock it.  The pin will have prevented
     * re-use of any TID on the page, so there is no need to check the
     * LSN.
     */

>> + To handle the cases where it is safe to release the pin after
>> + reading the index page, the LSN of the index page is read along
>> + with the index entries before the shared lock is released, and we
>> + do not attempt to flag index tuples as dead if the page is not
>> + pinned and the LSN does not match.
>>
>> I suppose that a sentence like the following describes the mechanism
>> more directly and makes it easier to find the corresponding code, if
>> it is correct.
>>
>>    To handle the cases where an index page has been unpinned when
>>    trying to mark the unused index tuples on the page as dead, the
>>    LSN of the index page is remembered when reading the index page
>>    for index tuples, and we do not attempt to flag index tuples as
>>    dead if the page is not pinned and the LSN does not match.
>
> Will reword that part to try to make it clearer.

That sentence was full of passive voice, which didn't help any. I
changed it to:

| So that we can handle the cases where we attempt LP_DEAD flagging
| for a page after we have released its pin, we remember the LSN of
| the index page when we read the index tuples from it; we do not
| attempt to flag index tuples as dead if we didn't hold the pin the
| entire time and the LSN has changed.

Do these changes seem clear?

Because these changes are just to a comment and a README file, I'm
posting a patch-on-patch v3a (to be applied on top of v3).

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[Attachment: bt-nopin-v3a.patch]
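(The comment and README text above describe two branches in
_bt_killitems, which come down to roughly the following; a sketch of
the described behavior, not the patch's exact code:)

    if (BTScanPosIsPinned(so->currPos))
    {
        /*
         * The pin has been held since the index tuples were read, so no
         * TID on the page can have been recycled by vacuum: just lock
         * the page, with no LSN check.
         */
        LockBuffer(so->currPos.buf, BT_READ);
    }
    else
    {
        /*
         * The pin was dropped: re-read the page, and only proceed if
         * its LSN shows it is unchanged since the tuples were copied.
         */
        Buffer  buf = _bt_getbuf(scan->indexRelation,
                                 so->currPos.currPage, BT_READ);

        if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
            so->currPos.buf = buf;
        else
        {
            _bt_relbuf(scan->indexRelation, buf);
            return;             /* page changed; skip LP_DEAD hinting */
        }
    }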
Re: [HACKERS] Reduce pinning in btree indexes
Kevin Grittner <kgri...@ymail.com> wrote:
> It doesn't seem worth posting to the list for the small changes since
> the last version; I'll wait until I update the comments and README
> files. If you want to review or test the latest, you can peek at:
> https://github.com/kgrittn/postgres/tree/btnopin

Here is v3, with the promised README and code comment changes.

In the process of documenting the mark/restore area I noticed a
subtlety that I had missed (in the case that there was a mark, advance
to the next page, restore, advance within the page, and restore). I
fixed that, and in the process gave the patched code an additional
direct performance edge over unpatched code. For the 1000k marks,
average timings are now:

  master:  970.999 ms, stdev: 4.043
  patched: 940.460 ms, stdev: 4.874

So, in the micro-benchmark showing the biggest benefit the direct
improvement is now just over 3%. It remains the case that the primary
motivation for the patch is to reduce blocking of vacuum processes;
but that's a nice side-benefit.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[Attachment: bt-nopin-v3.patch]
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> The performance which your test shows looks great. The gain might
> come from the removal of buffer locking in _bt_next.

Yeah, I had been hoping that removing some buffer pins and locks from
the common path of scanning forward from one page to the next might
have some direct performance benefit; it's possible that this will
show up more dramatically at high concurrency levels. The main goal,
of course, was to improve performance more indirectly, by not having
autovacuum worker processes blocked for extended periods in some
situations where that can now happen.

> In nbtutils.c, _bt_killitems:
>
> - This is not in the charge of this patch, but the comment for
>   _bt_killitems looks to have a trivial typo.

As you say, that problem was already there, but no reason not to fix
it in the patch as long as we're here. Reworded.

> - The following comment looks somewhat wrong.
>
>    /* Since the else condition needs page, get it here, too. */
>
>   "the else condition needs page" means the page of the buffer is
>   needed later? But, I suppose the comment itself is not necessary.

I guess that comment isn't really worth the space it takes up; what it
says is pretty obvious from the code. Removed.

> - If (BTScanPosIsPinned(so->currPos)).
>
>   As I mention below for nbtsearch.c, the page acquired in the
>   if-then block may be vacuumed so the LSN check done in the if-else
>   block is needed also in the if-then block. It will be accomplished
>   only by changing the position of the end of the if-else block.

I'm not sure I agree on this. If the page is pinned it should have
been pinned continuously since we initially read it, so the line
pointers we have could not have been removed by any vacuum process.
The tuples may have been pruned away in the heap, but that doesn't
matter. Index entries may have been added and the index page may have
been split, but that was true before this patch and _bt_killitems will
deal with those things the same as it always has. I don't see any
benefit to doing the LSN compare in this case; if we've paid the costs
of holding the pin through to this point, we might as well flag any
dead entries we can.

> - _bt_killitems is called without a pin when rescanning from before,
>   so I think the previous code has already considered the unpinned
>   case ("if (ItemPointerEquals(ituple..." does that). Vacuum
>   rearranges line pointers after deleting LP_DEAD items so the
>   previous check seems enough for the purpose. The previous code is
>   more efficient because the matching items will be processed even
>   after vacuum.

I'm not following you on this one; could you rephrase it?

> In nbtsearch.c
>
> - _bt_first(): It now releases the share lock on the page before the
>   items to be killed are processed. This allows other backends to
>   insert items into the page simultaneously. It seems to be dangerous
>   alone, but _bt_killitems already considers the case. Anyway I think
>   it'll be better to add a comment mentioning why unlocking is not
>   dangerous.

Well, _bt_killitems does acquire a shared (read) lock to flag index
tuples as LP_DEAD, and it has always been OK for the index page to
accept inserts and allow page splits before calling _bt_killitems. I
don't really see anything new with the patch in this regard. I do
agree, though, that a lot of comments and at least two README files
refer to the pins blocking vacuum, and these must be found and
changed.

It doesn't seem worth posting to the list for the small changes since
the last version; I'll wait until I update the comments and README
files. If you want to review or test the latest, you can peek at:

https://github.com/kgrittn/postgres/tree/btnopin

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
Hello,

> It looks like the remaining performance regression was indeed a
> result of code alignment. I found two "paranoia" assignments I had
> accidentally failed to put back with the rest of the mark/restore
> optimization; after that trivial change the patched version is
> ever-so-slightly faster than master on all tests.

The performance which your test shows looks great. The gain might come
from the removal of buffer locking in _bt_next. I would also like to
see this come along with 9.5, but that is a matter for the committers.

Apart from that, I looked into the patch itself more closely. Please
reconcile as you like among my contradicting comments below :)

In nbtutils.c, _bt_killitems:

- This is not in the charge of this patch, but the comment for
  _bt_killitems looks to have a trivial typo.

   * _bt_killitems - set LP_DEAD state for items an indexscan caller has
   *      told us were killed
   *
   * scan->so contains information about the current page and killed tuples
   * thereon (generally, this should only be called if so->numKilled > 0).

  I suppose "scan->so" should be "scan->opaque" and "so->numKilled"
  should be "opaque->numKilled".

- The following comment looks somewhat wrong.

   /* Since the else condition needs page, get it here, too. */

  "the else condition needs page" means the page of the buffer is
  needed later? But, I suppose the comment itself is not necessary.

- If (BTScanPosIsPinned(so->currPos)).

  As I mention below for nbtsearch.c, the page acquired in the if-then
  block may be vacuumed so the LSN check done in the if-else block is
  needed also in the if-then block. It will be accomplished only by
  changing the position of the end of the if-else block.

- _bt_killitems is called without a pin when rescanning from before,
  so I think the previous code has already considered the unpinned
  case ("if (ItemPointerEquals(ituple..." does that). Vacuum
  rearranges line pointers after deleting LP_DEAD items so the
  previous check seems enough for the purpose. The previous code is
  more efficient because the matching items will be processed even
  after vacuum.

In nbtsearch.c

- _bt_first(): It now releases the share lock on the page before the
  items to be killed are processed. This allows other backends to
  insert items into the page simultaneously. It seems to be dangerous
  alone, but _bt_killitems already considers the case. Anyway I think
  it'll be better to add a comment mentioning why unlocking is not
  dangerous.

... Sorry, time's up for now.

regards,

> Average of 3 `make check-world` runs:
>   master: 336.288 seconds
>   patch:  332.237 seconds
>
> Average of 6 `make check` runs:
>   master: 25.409 seconds
>   patch:  25.270 seconds
>
> The following were all run 12 times, the worst 2 dropped, the rest
> averaged:
>
> Kyotaro-san's 1m mark worst case benchmark:
>   master: 962.581 ms, 6.765 stdev
>   patch:  947.518 ms, 3.584 stdev
>
> Kyotaro-san's 500k mark, 500k restore worst case benchmark:
>   master: 1366.639 ms, 5.314 stdev
>   patch:  1363.844 ms, 5.896 stdev
>
> pgbench -i -s 16 pgbench / pgbench -c 16 -j 4 -T 500 pgbench:
>   master: 265.011 tps, 4.952 stdev
>   patch:  267.164 tps, 9.094 stdev
>
> How do people feel about the idea of me pushing this for 9.5 (after I
> clean up all the affected comments and README files)? I know this
> first appeared in the last CF, but the footprint is fairly small and
> the only user-visible behavior change is that a btree index scan of a
> WAL-logged table using an MVCC snapshot no longer blocks a vacuum
> indefinitely. (If there are objections I will move this to the first
> CF for the next release.)
>
>  src/backend/access/nbtree/nbtree.c    |  50 +---
>  src/backend/access/nbtree/nbtsearch.c | 141 +-
>  src/backend/access/nbtree/nbtutils.c  |  58 ++
>  src/include/access/nbtree.h           |  36 -
>  4 files changed, 201 insertions(+), 84 deletions(-)

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Reduce pinning in btree indexes
Kevin Grittner <kgri...@ymail.com> wrote:
> [need to check performance more]

It looks like the remaining performance regression was indeed a result
of code alignment. I found two "paranoia" assignments I had
accidentally failed to put back with the rest of the mark/restore
optimization; after that trivial change the patched version is
ever-so-slightly faster than master on all tests.

Average of 3 `make check-world` runs:
  master: 336.288 seconds
  patch:  332.237 seconds

Average of 6 `make check` runs:
  master: 25.409 seconds
  patch:  25.270 seconds

The following were all run 12 times, the worst 2 dropped, the rest
averaged:

Kyotaro-san's 1m mark worst case benchmark:
  master: 962.581 ms, 6.765 stdev
  patch:  947.518 ms, 3.584 stdev

Kyotaro-san's 500k mark, 500k restore worst case benchmark:
  master: 1366.639 ms, 5.314 stdev
  patch:  1363.844 ms, 5.896 stdev

pgbench -i -s 16 pgbench / pgbench -c 16 -j 4 -T 500 pgbench:
  master: 265.011 tps, 4.952 stdev
  patch:  267.164 tps, 9.094 stdev

How do people feel about the idea of me pushing this for 9.5 (after I
clean up all the affected comments and README files)? I know this
first appeared in the last CF, but the footprint is fairly small and
the only user-visible behavior change is that a btree index scan of a
WAL-logged table using an MVCC snapshot no longer blocks a vacuum
indefinitely. (If there are objections I will move this to the first
CF for the next release.)

 src/backend/access/nbtree/nbtree.c    |  50 +---
 src/backend/access/nbtree/nbtsearch.c | 141 +-
 src/backend/access/nbtree/nbtutils.c  |  58 ++
 src/include/access/nbtree.h           |  36 -
 4 files changed, 201 insertions(+), 84 deletions(-)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Hello, I measured the performance of this patch considering
> markpos/restorepos. The loss seems to be up to about 10%,
> unfortunately.

Thanks for the test case!

I took another look at this optimization, and found that it didn't
really depend on the pin (as I had first thought), so I put it back
(keeping the rest of the patch unchanged). I saw a 1.4% increase in
run time with the patch (compared to master) for the mark-only test,
and a 1.6% increase in run time for the mark/restore test. I'll look
into what might be causing that. I have seen bigger differences just
due to changing where executable code crossed cache boundaries, but
this is big enough to worry about; it needs to be checked.

New patch attached.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[Attachment: bt-nopin-v2.patch]
Re: [HACKERS] Reduce pinning in btree indexes
>>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> writes:

 Kyotaro> Just a reminder, but I am not the author of this patch:)

Yes, I understand that.

 Kyotaro> Mmm? The patch bt-nopin-v1.patch seems not to contain the
 Kyotaro> change for ExecSupportsMarkRestore, and that very simple
 Kyotaro> function still looks to return true for T_Index(Only)Scan
 Kyotaro> after the patch is applied.

Right. I'm suggesting you change that, in order to determine what
performance cost, if any, would result from abandoning the idea of
doing mark/restore entirely.

 Kyotaro> I understand that you'd like to see the net drag on
 Kyotaro> performance by the memcpy(), right?

No. What I am suggesting is this: if mark/restore is a performance
issue, then it would be useful to know how much gain we're getting (if
any) from supporting it _at all_.

Let me try and explain it another way. If you change
ExecSupportsMarkRestore to return false for index scans, then
btmarkpos/btrestorepos will no longer be called. What is the
performance of this case compared to the original and patched
versions?

--
Andrew (irc:RhodiumToad)
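(A sketch of the experiment being proposed, assuming the pre-9.5
NodeTag-based signature of ExecSupportsMarkRestore in execAmi.c and
abbreviating the case list: claiming that index scans cannot
mark/restore makes the planner put a Materialize node above the inner
side of a merge join instead of calling btmarkpos/btrestrpos.)

    bool
    ExecSupportsMarkRestore(NodeTag plantype)
    {
        switch (plantype)
        {
            case T_IndexScan:
            case T_IndexOnlyScan:
                return false;   /* experiment: force a Materialize node */

            case T_SeqScan:
            case T_ValuesScan:
            case T_Material:
            case T_Sort:
                return true;

            default:
                break;
        }
        return false;
    }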
Re: [HACKERS] Reduce pinning in btree indexes
Just a reminder, but I am not the author of this patch:)

At Fri, 27 Feb 2015 07:26:38 +0000, Andrew Gierth
<and...@tao11.riddles.org.uk> wrote in 87zj7z6ckc@news-spur.riddles.org.uk
>>> You might want to try running the same test, but after patching
>>> ExecSupportsMarkRestore to return false for index scans. This will
>>> cause the planner to insert a Materialize node to handle the
>>> mark/restore.
>
>  Kyotaro> Mmm? The patch bt-nopin-v1.patch seems not to contain the
>  Kyotaro> change for ExecSupportsMarkRestore, and that very simple
>  Kyotaro> function still looks to return true for T_Index(Only)Scan
>  Kyotaro> after the patch is applied.
>
> Right. I'm suggesting you change that, in order to determine what
> performance cost, if any, would result from abandoning the idea of
> doing mark/restore entirely.

I understand that you'd like to see the net drag on performance by the
memcpy(), right? That doesn't seem to be possible without breaking (a
part of) the patch's function, but, concerning the btmarkpos()-only
case, the mark struct can hold garbage, so only changing the refcount
would be viable to see the pure loss from the memcpy().

The attached patch is the destruction I did, and the result was like
below.

Case 2: 1M markpos, no restorepos for 1M result rows.

IndexOnly scan: the patched version loses about 10%.
  master:  3655 ms (std dev = 2.5 ms)
  Patched: 4038 ms (std dev = 2.6 ms)
  Patched: 4036 ms (std dev = 3.5 ms)
  Broken:  3718 ms (std dev = 1.6 ms)

The second "Patched" line above is the retaken numbers; it's a little
under a 9% loss from the broken version. That is the pure drag of the
memcpy(). This seems quite big, as Heikki said. It's an extreme case,
though. The remaining 1.7% should be the share of all the other stuff.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 65daf27..4973b63 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -547,12 +547,17 @@ btmarkpos(PG_FUNCTION_ARGS)
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
 	hoge_markpos_count++;
-	memcpy(&so->markPos, &so->currPos,
-		   offsetof(BTScanPosData, items[1]) +
-		   so->currPos.lastItem * sizeof(BTScanPosItem));
-	if (so->markTuples)
-		memcpy(so->markTuples, so->currTuples,
-			   so->currPos.nextTupleOffset);
+//	memcpy(&so->markPos, &so->currPos,
+//		   offsetof(BTScanPosData, items[1]) +
+//		   so->currPos.lastItem * sizeof(BTScanPosItem));
+//	if (so->markTuples)
+//		memcpy(so->markTuples, so->currTuples,
+//			   so->currPos.nextTupleOffset);
+	if (BTScanPosIsValid(so->markPos))
+	{
+		ReleaseBuffer(so->markPos.buf);
+		so->markPos.buf = InvalidBuffer;
+	}
 
 	/* We don't take out an extra pin for the mark position. */
 	so->markPos.buf = InvalidBuffer;
@@ -599,12 +604,13 @@ btrestrpos(PG_FUNCTION_ARGS)
 
 	if (BTScanPosIsValid(so->markPos))
 	{
-		memcpy(&so->currPos, &so->markPos,
-			   offsetof(BTScanPosData, items[1]) +
-			   so->markPos.lastItem * sizeof(BTScanPosItem));
-		if (so->currTuples)
-			memcpy(so->currTuples, so->markTuples,
-				   so->markPos.nextTupleOffset);
+//		memcpy(&so->currPos, &so->markPos,
+//			   offsetof(BTScanPosData, items[1]) +
+//			   so->markPos.lastItem * sizeof(BTScanPosItem));
+//		if (so->currTuples)
+//			memcpy(so->currTuples, so->markTuples,
+//				   so->markPos.nextTupleOffset);
+		IncrBufferRefCount(so->markPos.buf);
 	}
 	else
 		BTScanPosInvalidate(&so->currPos);
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro == Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: Kyotaro Anyway I'm sorry but I have left my dev env and cannot have Kyotaro time till next week. The point is moot; rather than try and explain it further I did the test myself, and the performance penalty of disabling mark/restore is substantial, on the order of 30%, so that's a non-starter. (I was a bit surprised that it was so bad...) -- Andrew (irc:RhodiumToad)
Re: [HACKERS] Reduce pinning in btree indexes
Hello. Mmm... We might be focusing on slightly different things. 2015/02/27 17:27 Andrew Gierth: Kyotaro I understand that you'd like to see the net drag of Kyotaro performance by the memcpy(), right? No. What I am suggesting is this: if mark/restore is a performance issue, then it would be useful to know how much gain we're getting (if any) from supporting it _at all_. The mark/restore works as before with this patch, except for the early dropping of buffer pins. As far as I understand, all the stuff in the patch serves that purpose and is not intended to make btree itself faster. That is, it reduces the chance of blocking vacuum, at the price of a possible burden on general performance. And I think the issue currently in focus is that the burden might be a bit too heavy in specific situations. Therefore I tried to measure how heavy or light the burden is. Am I correct so far? Let me try and explain it another way. If you change ExecSupportsMarkRestore to return false for index scans, then btmarkpos/btrestrpos will no longer be called. What is the performance of this case compared to the original and patched versions? As you mentioned before, such a change will make the planner turn to, perhaps, a materialize plan or a rescan, or some other scan type, none of which is useful for the issue in focus: the performance drag when a btree scan does extremely frequent mark/restore, compared with the unpatched version that copies less. Your suggestion seems to aim at something different from this. Anyway I'm sorry but I have left my dev env and cannot have time till next week. regards, -- Kyotaro Horiguchi
Re: [HACKERS] Reduce pinning in btree indexes
Hello, I measured the performance of this patch considering markpos/restorepos. The loss seems to be up to about 10%, unfortunately. At Thu, 26 Feb 2015 17:49:23 +0000 (UTC), Kevin Grittner kgri...@ymail.com wrote in 440831854.629116.1424972963735.javamail.ya...@mail.yahoo.com: Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/15/2015 02:19 AM, Kevin Grittner wrote: Hmm. Did your test case actually exercise mark/restore? The memcpy()s are not that small. Especially if it's an index-only scan, you're copying a large portion of the page. Some scans call markpos on every tuple, so that could add up. I have results from the `make world` regression tests and a 48-hour customer test. Unfortunately I don't know how heavily either of those exercise this code. Do you have a suggestion for a way to test whether there is a serious regression for something approaching a worst case? ammarkpos/amrestrpos are called in merge joins. By the steps shown below, I had 1M markpos calls and no restorepos for a 1M-row result, and 500k markpos calls and the same number of restorepos calls for a 2M-row result with a slightly different configuration. I suppose we can say these are the worst case for markpos/restrpos. The call counts can be taken using the attached patch. Both binaries were compiled with -O2. shared_buffers=1GB, and all shared pages used in the query were hit when measuring. The numbers were taken 12 times for each configuration; the averages and stddevs below are over the 10 runs left after excluding the best and the worst.

Case 1. 500k markpos/restorepos for 2M result rows.

 Index scan: the patch loses about 2% (1.98%).
  master:  6166 ms (std dev = 3.2 ms)
  Patched: 6288 ms (std dev = 3.7 ms)

 IndexOnly scan: the patch loses about 2% (2.14%).
  master:  5946 ms (std dev = 5.0 ms)
  Patched: 6073 ms (std dev = 3.3 ms)

The patched version is slower by about 2%. Of course not all of it is the effect of the memcpy(), but I don't know the distribution.

Case 2. 1M markpos, no restorepos for 1M result rows.

 IndexOnly scan: the patch loses about 10%.
  master:  3655 ms (std dev = 2.5 ms)
  Patched: 4038 ms (std dev = 2.6 ms)

The loss is about 10%. The case looks a bit impractical, but unfortunately the number might not be ignorable. The distribution of the loss is unknown, too.
regards,

===
CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);
delete from t1;
delete from t2;
-- This causes 1M times of markpos and no restorepos
INSERT INTO t1 (SELECT a FROM generate_series(0, 999999) a);
INSERT INTO t2 (SELECT a FROM generate_series(0, 999999) a);
-- This causes 500k times of markpos and the same number of restorepos
-- INSERT INTO t1 (SELECT a/2 FROM generate_series(0, 999999) a);
-- INSERT INTO t2 (SELECT a/2 FROM generate_series(0, 999999) a);
CREATE INDEX it1 ON t1 (a);
CREATE INDEX it2 ON t2 (a);
ANALYZE t1;
ANALYZE t2;
SET enable_seqscan to false;
SET enable_material to false;
SET enable_hashjoin to false;
SET enable_nestloop to false;
SET enable_indexonlyscan to false; -- omit this to do indexonly scan
EXPLAIN (ANALYZE) SELECT t1.a, t2.a FROM t1 JOIN t2 on (t1.a = t2.a);

                              QUERY PLAN
--------------------------------------------------------------------------
 Merge Join (cost=2.83..322381.82 rows=3031231 width=8) (actual time=0.013..5193.566 rows=2000000 loops=1)
   Merge Cond: (t1.a = t2.a)
   -> Index Scan using it1 on t1 (cost=0.43..65681.87 rows=167 width=4) (actual time=0.004..727.557 rows=1000000 loops=1)
   -> Index Scan using it2 on t2 (cost=0.43..214642.89 rows=3031231 width=4) (actual time=0.004..1478.361 rows=1 loops=1)
 Planning time: 25.585 ms
 Execution time: 6299.521 ms
(6 rows)

-- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3af462b..04d6cec 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -543,15 +543,19 @@ btendscan(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+
 /*
  * btmarkpos() -- save current scan position
  */
+int hoge_markpos_count = 0;
+int hoge_restrpos_count = 0;
 Datum
 btmarkpos(PG_FUNCTION_ARGS)
 {
 	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
+	hoge_markpos_count++;
 	/* we aren't holding any read locks, but gotta drop the pin */
 	if (BTScanPosIsValid(so->markPos))
 	{
@@ -585,7 +589,7 @@ btrestrpos(PG_FUNCTION_ARGS)
 {
 	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
+	hoge_restrpos_count++;
 	/* Restore the marked positions of any array keys */
 	if (so->numArrayKeys)
 		_bt_restore_array_keys(scan);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 33b172b..e7c1b99 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -258,14 +258,19 @@
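(The execMain.c hunk above is cut off in the archive. Purely for illustration -- this is not Kyotaro's actual change -- surfacing the counters could look like the hypothetical sketch below, which would produce LOG lines of the kind quoted later in the thread:)

#include "postgres.h"

/* counters incremented in the nbtree.c hunks above */
extern int hoge_markpos_count;
extern int hoge_restrpos_count;

/* hypothetical helper: report and reset the counters, e.g. at the
 * end of statement execution */
static void
report_hoge_counts(void)
{
	elog(LOG, "markpos=%d, restrpos=%d",
		 hoge_markpos_count, hoge_restrpos_count);
	hoge_markpos_count = 0;
	hoge_restrpos_count = 0;
}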
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro == Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: Kyotaro ammarkpos/amrestrpos are called in merge joins. By the steps Kyotaro shown below, I had 1M markpos calls and no restorepos for Kyotaro a 1M-row result, and 500k markpos calls and the same Kyotaro number of restorepos calls for a 2M-row result with a slightly Kyotaro different configuration. I suppose we can say these are Kyotaro the worst case for markpos/restrpos. The call counts Kyotaro can be taken using the attached patch. You might want to try running the same test, but after patching ExecSupportsMarkRestore to return false for index scans. This will cause the planner to insert a Materialize node to handle the mark/restore. This way, you can get an idea of how much gain (if any) the direct support of mark/restore in the scan is actually providing. -- Andrew (irc:RhodiumToad)
Re: [HACKERS] Reduce pinning in btree indexes
Hi, At Fri, 27 Feb 2015 05:56:18 +0000, Andrew Gierth and...@tao11.riddles.org.uk wrote in 874mq77vuu@news-spur.riddles.org.uk: Kyotaro == Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: Kyotaro ammarkpos/amrestrpos are called in merge joins. By the steps Kyotaro shown below, I had 1M markpos calls and no restorepos for Kyotaro a 1M-row result, and 500k markpos calls and the same Kyotaro number of restorepos calls for a 2M-row result with a slightly Kyotaro different configuration. I suppose we can say these are Kyotaro the worst case for markpos/restrpos. The call counts Kyotaro can be taken using the attached patch. You might want to try running the same test, but after patching ExecSupportsMarkRestore to return false for index scans. This will cause the planner to insert a Materialize node to handle the mark/restore. Mmm? The patch bt-nopin-v1.patch seems not to contain the change for ExecSupportsMarkRestore, and the very simple function still returns true for T_Index(Only)Scan after the patch is applied. EXPLAIN shows that the plan does not involve a materialize step and, in addition, the counters I put into both btmarkpos() and btrestrpos() showed that they are actually called during execution, like this, for both unpatched and patched:

| LOG: markpos=1000000, restrpos=0
| STATEMENT: EXPLAIN (ANALYZE,BUFFERS) SELECT t1.a, t2.a FROM t1 JOIN t2 on (t1.a = t2.a);

To make sure, I looked into btmarkpos and btrestrpos to confirm that they really do the memcpy() we're talking about, then recompiled the whole source tree. This way, you can get an idea of how much gain (if any) the direct support of mark/restore in the scan is actually providing. Does "the scan" mean a T_Material node? But there is no Material node in the plan, and the counters in bt*pos were in fact incremented, as mentioned above. Could you point out any other possible mistake in my thinking? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro == Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: You might want to try running the same test, but after patching ExecSupportsMarkRestore to return false for index scans. This will cause the planner to insert a Materialize node to handle the mark/restore. Kyotaro Mmm? The patch bt-nopin-v1.patch seems not to contain the change Kyotaro for ExecSupportsMarkRestore, and the very simple function Kyotaro still returns true for T_Index(Only)Scan after the patch Kyotaro is applied. Right. I'm suggesting you change that, in order to determine what performance cost, if any, would result from abandoning the idea of doing mark/restore entirely. -- Andrew (irc:RhodiumToad)
Re: [HACKERS] Reduce pinning in btree indexes
Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/15/2015 02:19 AM, Kevin Grittner wrote: Interestingly, the btree README points out that using the old TID with a new tuple poses no hazard for a scan using an MVCC snapshot, because the new tuple would not be visible to a snapshot created that long ago. The first question is: Do we really need that interlock for the non-MVCC snapshots either? We don't need the interlock for non-MVCC snapshots if the conditions we limit or filter on at the index level are rechecked when we get to the heap tuple; otherwise we do. If we do: For non-MVCC snapshots, we need to ensure that all index scans that started before the VACUUM did complete before the VACUUM does. Isn't it enough to be sure that, if we're not going to recheck any index tests against the heap tuple, any process-local copies of index entries which exist at the start of the index scan phase of the vacuum are not used after the vacuum enters the phase of freeing line pointers -- that is, that any scan which has read a page when the vacuum starts to scan the index moves on to another page before vacuum finishes scanning that index? I wonder if we could find a different mechanism to enforce that. Using the pin-interlock for that has always seemed a bit silly to me. It certainly is abusing the semantics of the pin to treat it as a type of lock on the contents. Perhaps grab a new heavy-weight lock on the table whenever a non-MVCC index scan on the table begins, and have VACUUM wait on it. -1 The problem I'm most interested in fixing based on user feedback is a vacuum blocking when a cursor which is using an index scan is sitting idle. Not only does the vacuum of that table stall, but if it is an autovacuum worker, that worker is prevented from cleaning up any other tables. I have seen all autovacuum workers blocked in exactly this way, leading to massive bloat. What you suggest is just a less awkward way to keep the problem. I found that the LP_DEAD hinting would be a problem with an old TID, but I figured we could work around that by storing the page LSN into the scan position structure when the page contents were read, and only doing hinting if that matched when we were ready to do the hinting. That wouldn't work for an index which was not WAL-logged, so the patch still holds pins for those. Or you could use GetFakeLSNForUnloggedRel(). I'm unfamiliar with that, but will take a look. (I will be back at my usual development machine late tomorrow.) Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins. Why does an index-only scan need to hold the pin? Robert already answered this one, but there is a possible solution -- provide some way to check the heap's visibility map for the TIDs read from an index page before releasing the read lock on that page. Finally, there was an "optimization" for marking buffer position for possible restore that was incompatible with releasing the pin. I use quotes because the "optimization" adds overhead to every move to the next page in order to set some variables in a structure when a mark is requested, instead of running two fairly small memcpy() statements. The customer's two-day benchmark showed no performance hit, and looking at the code I would be amazed if the optimization yielded a measurable benefit. In general, optimization by adding overhead to moving through a scan to save time in a mark operation seems dubious. Hmm.
Did your test case actually exercise mark/restore? The memcpy()s are not that small. Especially if it's an index-only scan, you're copying a large portion of the page. Some scans call markpos on every tuple, so that could add up. I have results from the `make world` regression tests and a 48-hour customer test. Unfortunately I don't know how heavily either of those exercise this code. Do you have a suggestion for a way to test whether there is a serious regression for something approaching a worst case? At some point we could consider building on this patch to recheck index conditions for heap access when a non-MVCC snapshot is used, check the visibility map for referenced heap pages when the TIDs are read for an index-only scan, and/or skip LP_DEAD hinting for non-WAL-logged indexes. But all those are speculative future work; this is a conservative implementation that just didn't modify pinning where there were any confounding factors. Understood. Still, I'd like to see if we can easily get rid of the pinning altogether. That would be great, but I felt that taking care of the easy cases in one patch and following with patches to handle the trickier cases as separate follow-on patches made more sense than trying to do everything at once. Assuming we sort out the mark/restore issues for the initial patch, it provides infrastructure to keep the
Re: [HACKERS] Reduce pinning in btree indexes
On 02/15/2015 02:19 AM, Kevin Grittner wrote: Interestingly, the btree README points out that using the old TID with a new tuple poses no hazard for a scan using an MVCC snapshot, because the new tuple would not be visible to a snapshot created that long ago. The first question is: Do we really need that interlock for the non-MVCC snapshots either? If we do: For non-MVCC snapshots, we need to ensure that all index scans that started before the VACUUM did complete before the VACUUM does. I wonder if we could find a different mechanism to enforce that. Using the pin-interlock for that has always seemed a bit silly to me. Perhaps grab a new heavy-weight lock on the table whenever a non-MVCC index scan on the table begins, and have VACUUM wait on it. I found that the LP_DEAD hinting would be a problem with an old TID, but I figured we could work around that by storing the page LSN into the scan position structure when the page contents were read, and only doing hinting if that matched when we were ready to do the hinting. That wouldn't work for an index which was not WAL-logged, so the patch still holds pins for those. Or you could use GetFakeLSNForUnloggedRel(). Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins. Why does an index-only scan need to hold the pin? Finally, there was an "optimization" for marking buffer position for possible restore that was incompatible with releasing the pin. I use quotes because the "optimization" adds overhead to every move to the next page in order to set some variables in a structure when a mark is requested, instead of running two fairly small memcpy() statements. The customer's two-day benchmark showed no performance hit, and looking at the code I would be amazed if the optimization yielded a measurable benefit. In general, optimization by adding overhead to moving through a scan to save time in a mark operation seems dubious. Hmm. Did your test case actually exercise mark/restore? The memcpy()s are not that small. Especially if it's an index-only scan, you're copying a large portion of the page. Some scans call markpos on every tuple, so that could add up. At some point we could consider building on this patch to recheck index conditions for heap access when a non-MVCC snapshot is used, check the visibility map for referenced heap pages when the TIDs are read for an index-only scan, and/or skip LP_DEAD hinting for non-WAL-logged indexes. But all those are speculative future work; this is a conservative implementation that just didn't modify pinning where there were any confounding factors. Understood. Still, I'd like to see if we can easily get rid of the pinning altogether. - Heikki
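For concreteness, the LSN-match guard being discussed might look like the sketch below. The struct and function names are illustrative, not the committed code; only the shape of the check matters. For an unlogged index the page LSN is not maintained by WAL, which is where Heikki's GetFakeLSNForUnloggedRel() suggestion would come in:

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufpage.h"

/* illustrative stand-in for the scan-position bookkeeping */
typedef struct ScanPosSketch
{
	BlockNumber	currPage;	/* block the items were copied from */
	XLogRecPtr	lsn;		/* PageGetLSN() at the time of the copy */
} ScanPosSketch;

static void
hint_if_page_unchanged(Relation indexrel, ScanPosSketch *pos)
{
	Buffer		buf = _bt_getbuf(indexrel, pos->currPage, BT_READ);

	if (PageGetLSN(BufferGetPage(buf)) == pos->lsn)
	{
		/*
		 * The page is unchanged since its items were copied, so the
		 * remembered offsets still denote the same tuples: set LP_DEAD
		 * on the entries known to be dead, while the lock is held.
		 */
	}
	else
	{
		/*
		 * VACUUM (or a split) has touched the page; old TIDs may have
		 * been recycled, so skip the hinting entirely.
		 */
	}

	_bt_relbuf(indexrel, buf);	/* drops the lock and the pin */
}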
Re: [HACKERS] Reduce pinning in btree indexes
On Mon, Feb 23, 2015 at 2:48 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins. Why does an index-only scan need to hold the pin? Suppose there's a dead tuple in the heap someplace, and an index pointer pointing to that dead tuple. Then: (1) an index scan reaches the relevant page and copies all of the index pointers; (2) VACUUM removes the index pointer and marks the heap page all-visible; (3) the scan now uses the index pointers copied to backend-local memory and notes that the heap page is all-visible, so the scan sees the tuple even though that tuple is totally gone from the heap by that point. Holding the pin prevents this, because we can't reach the second heap pass until the scan is done with the page, and therefore the dead line pointer in the heap page can't be marked unused, and therefore the page can't be marked all-visible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
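The interlock Robert is describing is the buffer cleanup lock. A minimal sketch of the VACUUM side, using the real bufmgr API (the page-processing details are elided):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
vacuum_one_index_page(Relation indexrel, BlockNumber blkno)
{
	Buffer		buf = ReadBuffer(indexrel, blkno);

	/*
	 * The "super exclusive" lock: waits until no other backend holds a
	 * pin on the page, then takes the exclusive content lock.  A cursor
	 * parked on this page with a pin blocks here indefinitely.
	 */
	LockBufferForCleanup(buf);

	/* ... remove dead index tuples from the page ... */

	UnlockReleaseBuffer(buf);	/* drop lock and pin */
}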
[HACKERS] Reduce pinning in btree indexes
We have a customer who was unable to migrate from a well-known commercial database product to pg because they have a very large software base that holds cursors open for very long periods of time, preventing GlobalXmin values from advancing, leading to bloat. They have a standard test that exercises hundreds of client connections for days at a time, and saw disk space and CPU usage climbing to unacceptable levels[1]. The other database product had no such problems. We suggested the obvious solutions that we always suggest on the community lists, and they had perfectly good reasons why those were not viable for them -- we either needed to prevent bloat under their current software or they could not migrate. What they wanted was what happened in the other database product -- if a snapshot got sufficiently old that cleaning up the MVCC data was a problem *and* the snapshot was used again *and* it read a page which had been modified far enough back that it was not possible to return correct data, then they wanted to receive a "snapshot too old" error. I wrote a patch to do that, but they didn't see much improvement, because all (auto)vacuum processes blocked indefinitely on the btree index pages where a cursor was parked. So they still got massive bloat and consequent problems. It seemed odd to me that a btree implementation based on the Lehman & Yao techniques would block anything, because the concept is that it reads everything it needs off the index page and pauses between pages. We sorta do that, but the interlock between vacuum processes and other index usages basically involves holding a pin on the page we just read until we are done using the values we read off that page, and treating the pin as a lighter lock than a shared (or READ) lock -- which only conflicts with a "super exclusive" lock, which consists of getting an exclusive lock only once there are no other processes holding a pin on the page. This ensures that at the end of a vacuum pass over a btree index there are no TIDs in process-local memory from before the start of the pass, and consequently no scan can read a new tuple assigned to the same TID value after the rest of the vacuum phases run. So a pin on a btree page blocks a vacuum process indefinitely. Interestingly, the btree README points out that using the old TID with a new tuple poses no hazard for a scan using an MVCC snapshot, because the new tuple would not be visible to a snapshot created that long ago. The customer's cursors which were causing the problems all use MVCC snapshots, so I went looking to see whether we couldn't take advantage of this fact. For the most part it seemed we were OK if we dropped pins with the READ lock for a btree scan which used an MVCC snapshot. I found that the LP_DEAD hinting would be a problem with an old TID, but I figured we could work around that by storing the page LSN into the scan position structure when the page contents were read, and only doing hinting if that matched when we were ready to do the hinting. That wouldn't work for an index which was not WAL-logged, so the patch still holds pins for those. Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins. There was also a problem with the fact that the code conflated the notion of having a valid scan position with holding a pin on a block in the index.
Those two concepts needed to be teased apart, which I did using some new macros and figuring out which one to use where. Without a pin on the block, we also needed to remember what block we had been processing to be able to do the LP_DEAD hinting; for that, the block number was added to the structure which tracks scan position. Finally, there was an "optimization" for marking buffer position for possible restore that was incompatible with releasing the pin. I use quotes because the "optimization" adds overhead to every move to the next page in order to set some variables in a structure when a mark is requested, instead of running two fairly small memcpy() statements. The customer's two-day benchmark showed no performance hit, and looking at the code I would be amazed if the optimization yielded a measurable benefit. In general, optimization by adding overhead to moving through a scan to save time in a mark operation seems dubious. At some point we could consider building on this patch to recheck index conditions for heap access when a non-MVCC snapshot is used, check the visibility map for referenced heap pages when the TIDs are read for an index-only scan, and/or skip LP_DEAD hinting for non-WAL-logged indexes. But all those are speculative future work; this is a conservative implementation that just didn't modify pinning where there were any confounding factors. In combination with the snapshot-too-old patch, the 2-day customer benchmark ran without problem levels of bloat, controlling disk space growth and