Re: [HACKERS] Reduce pinning in btree indexes
Kevin Grittner <kgri...@ymail.com> wrote:
> Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>> I attached your latest patch to this mail as bt-nopin-v4.patch for
>> now. Please check that there's no problem in it.
>
> I checked out master, applied the patch and checked it against my
> latest code (merged with master), and it matched. So, looks good.

Pushed with the addition of one more paragraph of comments, regarding
something to watch out for in any follow-on patch to eliminate the pin
for a scan using a non-MVCC snapshot.

Thanks again to Kyotaro-san and Heikki for the reviews!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
On Sat, Mar 14, 2015 at 4:07 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> -1 for a time based setting. After years of consideration, bloat is
> now controllable by altering the size of the undo tablespace.

Hm. Well, fwiw the situation is rather more complicated than that.
You're correct that you can create a fixed size undo tablespace, in
which case Oracle will tell you approximately how much retention
period you can rely on given that size. But alternately you can set
the undo tablespace to be autoextend, and Oracle will tune it to
ensure there's enough retention to cover the longest-running query in
the system. You can also set a time-based parameter (undo_retention)
to ensure there's always at least enough data retained to cover a
fixed amount of time for flashback (Oracle's equivalent to our time
travel feature in ancient versions of Postgres).

--
greg
Re: [HACKERS] Reduce pinning in btree indexes
On 18 March 2015 at 03:01, Jim Nasby <jim.na...@bluetreble.com> wrote:
> On 3/16/15 11:47 AM, Robert Haas wrote:
>> I am sure there are more sophisticated things to be done here, but I
>> guess my feeling is that time is a good way to go here for a first
>> cut - lots of people have suggested it, and there's clearly a use
>> case for it. If the setting turns out to be popular, we can
>> fine-tune it more once we've got some experience with it. But I'm
>> nervous about trying to design something more complicated than that
>> right now, especially so late in the release cycle. We know that
>> this setting, with time-based units, will meet the needs of the
>> customer for whom it was developed, and that is a good-enough reason
>> to think that time is a reasonable metric for this, even if we
>> eventually have others.
>
> +1. As you said, time is easy for people to understand, and I think
> it will handle a large chunk of the use cases.

If it wasn't clear, I had already said that my idea was for the next
release, before Robert said that.

> I used to have this quote (or something close to it) on my
> whiteboard... I think it's appropriate here ;)
>
> "The perfect is the enemy of the good." -Simon Riggs

Interesting to hear I was anybody's pinup ;-) Not sure that is a
saying of mine though; it sounds more like Robert to me.

I do believe in "something now, something better later". I'd call that
incremental benefit, as opposed to incremental development, where the
benefit may not be realisable until the last point. Which is why I am
backing this patch for 9.5, even though the UI is less useful than I
think is eventually necessary.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
On 3/16/15 11:47 AM, Robert Haas wrote:
> I am sure there are more sophisticated things to be done here, but I
> guess my feeling is that time is a good way to go here for a first
> cut - lots of people have suggested it, and there's clearly a use
> case for it. If the setting turns out to be popular, we can fine-tune
> it more once we've got some experience with it. But I'm nervous about
> trying to design something more complicated than that right now,
> especially so late in the release cycle. We know that this setting,
> with time-based units, will meet the needs of the customer for whom
> it was developed, and that is a good-enough reason to think that time
> is a reasonable metric for this, even if we eventually have others.

+1. As you said, time is easy for people to understand, and I think it
will handle a large chunk of the use cases.

I used to have this quote (or something close to it) on my
whiteboard... I think it's appropriate here ;)

"The perfect is the enemy of the good." -Simon Riggs

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Thank you for rewriting.

Thank *you* for pointing out where more work was needed in the
comments and README!

> I attached your latest patch to this mail as bt-nopin-v4.patch for
> now. Please check that there's no problem in it.

I checked out master, applied the patch and checked it against my
latest code (merged with master), and it matched. So, looks good.

> - With this patch, an index scan releases buffer pins while fetching
>   index tuples in a page, so it should reduce the chance that a
>   long-running index scan blocks vacuum, because vacuum can now
>   easily overtake the current position of an index scan. I didn't
>   actually measure how effective it is, though.

That part got pretty thorough testing on end-user software. The whole
reason for writing the patch was that after applying the
snapshot-too-old PoC patch they still saw massive bloat because all
autovacuum workers blocked behind cursors which were left idle. The
initial version of this patch fixed that, preventing (in combination
with the other patch) uncontrolled bloat in a 48 hour test.

> - It causes no performance deterioration; on the contrary, it
>   accelerates index scans. That seems to be because of the removal of
>   the lock and unlock surrounding _bt_steppage in _bt_next.

I fear that the performance improvement may only show up on forward
scans -- because the whole L&Y algorithm depends on splits only moving
right, walking right is almost trivial to perform correctly with
minimal locking and pinning. Our left pointers help with scanning
backward, but there are a lot of conditions that can complicate that,
leaving us with the choice between keeping some of the locking or
potentially scanning more pages than we now do on a backward scan.
Since it wasn't clear how many cases would benefit from a change and
how many would lose, I pretty much left the backward scan locking
alone in this patch. Changing the code to work the other way would not
be outrageously difficult; but benchmarking to determine whether the
alternative was a net win would be pretty tricky and very
time-consuming. If that is to be done, I strongly feel it should be a
separate patch.

Because this patch makes a forward scan faster without having much
effect on a backward scan, the performance difference between them
(which has always existed) will get a little wider. I wonder whether
this difference should perhaps be reflected in plan costing.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
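(To illustrate the asymmetry Kevin describes: under L&Y, a page split
only ever moves tuples to the right, so a forward step can release its
hold on the current page and simply chase right-links until it reaches
a live page. A rough sketch of such a step, using nbtree naming
conventions but not taken from the committed code:)

    for (;;)
    {
        /* remember the right-link, then let go of the current page */
        blkno = opaque->btpo_next;
        _bt_relbuf(rel, buf);

        /*
         * Read the next page.  No lock coupling is needed: a concurrent
         * split can only have moved our tuples further right, where this
         * loop will still find them.
         */
        buf = _bt_getbuf(rel, blkno, BT_READ);
        page = BufferGetPage(buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        if (!P_IGNORE(opaque))
            break;              /* live page: scan its items */
        /* half-dead or deleted page: keep following right-links */
    }

(Walking backward has no such guarantee -- the left-link may be stale
after a concurrent split -- which is why the patch leaves the
backward-scan locking alone.)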
Re: [HACKERS] Reduce pinning in btree indexes
On 16 March 2015 at 12:48, Kevin Grittner <kgri...@ymail.com> wrote:
> Simon Riggs <si...@2ndquadrant.com> wrote:
>> On 13 March 2015 at 15:41, Kevin Grittner <kgri...@ymail.com> wrote:
>>> The feedback was generally fairly positive except for the fact that
>>> snapshot age (for purposes of being too old) was measured in
>>> transaction IDs assigned. There seemed to be a pretty universal
>>> feeling that this needed to be changed to a time-based setting.
>>
>> -1 for a time based setting. After years of consideration, bloat is
>> now controllable by altering the size of the undo tablespace.
>>
>> I think PostgreSQL needs something size-based also. It would need
>> some estimation to get it to work like that, true, but it is
>> actually the size of the bloat we care about, not the time. So we
>> should be thinking in terms of limits that we actually care about.
>
> Are you thinking, then, that WAL volume generated (as determined by
> LSN) would be the appropriate unit of measure for this? (We would
> still need to map that back to transaction IDs for vacuuming, of
> course.) If we did that we could allow the size units of measure,
> like '5GB' and similar. Or are you thinking of something else?

It's probably the closest and easiest measure, and the most
meaningful. We can easily accumulate that in a data structure in clog,
like the async commit LSN. For the next release though, since it will
take a little bit of thought to interpret that.

With commit timestamps enabled in 9.5, we can easily judge a time
limit, but it is less useful because it's not a measure of bloat. As
I've said, I'd be happy with just an xid limit for 9.5, if that was
the only thing we had. But I think timestamp is just as easy.

> Given that there seems to be disagreement on what is the more useful
> metric, do we want to consider allowing more than one? If so, would
> it be when *all* conditions are met or when *any* conditions are met?

Yours was the first reply to my idea, so I think it's too early to
describe that as disagreement.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
On Mon, Mar 16, 2015 at 8:48 AM, Kevin Grittner <kgri...@ymail.com> wrote:
> Given that there seems to be disagreement on what is the more useful
> metric, do we want to consider allowing more than one? If so, would
> it be when *all* conditions are met or when *any* conditions are met?

Oh, gosh. Maybe we could invent a whole policy language for this. Or,
if we want to do something less painful, we could fire nail-guns into
our eyeballs.

I'm not sure what the value of basing this on WAL volume is, and I
think we should talk about that a bit more first. The value of doing
this by rate of XID consumption is easy to understand: it's simple to
implement. The value of doing this by time is also simple to
understand: if you set the setting to 20 minutes, then you can be sure
that queries will not get cancelled unless they run for more than 20
minutes. This clearly makes the system easier to reason about. I don't
see what the benefit of doing this by WAL volume would be.

What Simon actually proposed was by *bloat*; that is, if the overall
amount of bloat in the system exceeds some metric, start whacking old
queries in order to control it. The appeal of that is obvious, but I
think it would be very hard to make work, because canceling queries
won't get rid of the bloat you've already introduced, and so you'd
tend to cancel nothing until you hit some threshold, and then start
canceling everything.

If we want this feature, let's try to keep the spec simple enough that
we actually get it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
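(For concreteness, the time-based semantics described above reduce to
a check like the following; the function name and threshold parameter
are hypothetical, and only the behavior follows the discussion:)

    /*
     * Sketch: a snapshot is "too old" once more than the configured
     * number of minutes has passed since it was taken.
     */
    static bool
    snapshot_is_too_old(TimestampTz snapshot_taken, int threshold_min)
    {
        TimestampTz limit =
            TimestampTzPlusMilliseconds(snapshot_taken,
                                        threshold_min * 60 * 1000L);

        return GetCurrentTimestamp() > limit;
    }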
Re: [HACKERS] Reduce pinning in btree indexes
On Mon, Mar 16, 2015 at 12:07 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> What Simon actually proposed was by *bloat*; that is, if the overall
>> amount of bloat in the system exceeds some metric, start whacking
>> old queries in order to control it. The appeal of that is obvious,
>
> Agreed
>
>> but I think it would be very hard to make work, because canceling
>> queries won't get rid of the bloat you've already introduced, and so
>> you'd tend to cancel nothing until you hit some threshold, and then
>> start canceling everything.
>
> That would just be an unhelpful way of using that info. There are
> many possible designs that wouldn't need to work that way. We have
> many cases where we begin a cleanup action before we run out of space
> for other server resources. VACUUM is itself a good example, but
> there are many others.
>
> The main point is that if we blindly decide things based upon xid age
> or time then it will be wrong in many cases, since apps with uniform
> write rates are rare. We need to work out how to limit bloat itself.
> If we can't do that easily at a macro level, then we'll have to do
> that more precisely at the table level, for example having VACUUM
> decide where to put the limit if we find too much
> unremovable/recently dead data.

I am sure there are more sophisticated things to be done here, but I
guess my feeling is that time is a good way to go here for a first cut
- lots of people have suggested it, and there's clearly a use case for
it. If the setting turns out to be popular, we can fine-tune it more
once we've got some experience with it. But I'm nervous about trying
to design something more complicated than that right now, especially
so late in the release cycle. We know that this setting, with
time-based units, will meet the needs of the customer for whom it was
developed, and that is a good-enough reason to think that time is a
reasonable metric for this, even if we eventually have others.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
On 16 March 2015 at 13:02, Robert Haas <robertmh...@gmail.com> wrote:
> I'm not sure what the value of basing this on WAL volume is, and I
> think we should talk about that a bit more first. The value of doing
> this by rate of XID consumption is easy to understand: it's simple to
> implement. The value of doing this by time is also simple to
> understand: if you set the setting to 20 minutes, then you can be
> sure that queries will not get cancelled unless they run for more
> than 20 minutes. This clearly makes the system easier to reason
> about. I don't see what the benefit of doing this by WAL volume
> would be.

I think we can assess that more clearly as the amount of now-dead
space left by an action. So we can include Updates and Deletes, or in
the case of aborted transactions, the total WAL written. I assumed
that was what Kevin meant.

> What Simon actually proposed was by *bloat*; that is, if the overall
> amount of bloat in the system exceeds some metric, start whacking old
> queries in order to control it. The appeal of that is obvious,

Agreed

> but I think it would be very hard to make work, because canceling
> queries won't get rid of the bloat you've already introduced, and so
> you'd tend to cancel nothing until you hit some threshold, and then
> start canceling everything.

That would just be an unhelpful way of using that info. There are many
possible designs that wouldn't need to work that way. We have many
cases where we begin a cleanup action before we run out of space for
other server resources. VACUUM is itself a good example, but there are
many others.

The main point is that if we blindly decide things based upon xid age
or time then it will be wrong in many cases, since apps with uniform
write rates are rare. We need to work out how to limit bloat itself.
If we can't do that easily at a macro level, then we'll have to do
that more precisely at the table level, for example having VACUUM
decide where to put the limit if we find too much
unremovable/recently dead data.

> If we want this feature, let's try to keep the spec simple enough
> that we actually get it.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
Simon Riggs <si...@2ndquadrant.com> wrote:
> On 13 March 2015 at 15:41, Kevin Grittner <kgri...@ymail.com> wrote:
>> The feedback was generally fairly positive except for the fact that
>> snapshot age (for purposes of being too old) was measured in
>> transaction IDs assigned. There seemed to be a pretty universal
>> feeling that this needed to be changed to a time-based setting.
>
> -1 for a time based setting. After years of consideration, bloat is
> now controllable by altering the size of the undo tablespace.
>
> I think PostgreSQL needs something size-based also. It would need
> some estimation to get it to work like that, true, but it is actually
> the size of the bloat we care about, not the time. So we should be
> thinking in terms of limits that we actually care about.

Are you thinking, then, that WAL volume generated (as determined by
LSN) would be the appropriate unit of measure for this? (We would
still need to map that back to transaction IDs for vacuuming, of
course.) If we did that we could allow the size units of measure, like
'5GB' and similar. Or are you thinking of something else?

Given that there seems to be disagreement on what is the more useful
metric, do we want to consider allowing more than one? If so, would it
be when *all* conditions are met or when *any* conditions are met?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
On 13 March 2015 at 13:17, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Mar 13, 2015 at 8:52 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> On 15 February 2015 at 00:19, Kevin Grittner <kgri...@ymail.com> wrote:
>>> What they wanted was what happened in the other database product --
>>> if a snapshot got sufficiently old that cleaning up the MVCC data
>>> was a problem *and* the snapshot was used again *and* it read a
>>> page which had been modified far enough back that it was not
>>> possible to return correct data, then they wanted to receive a
>>> "snapshot too old" error. I wrote a patch to do that...
>>
>> So, please let's see the patch. It seems useful for core Postgres.
>
> It was submitted here:
>
> http://www.postgresql.org/message-id/136937748.3364317.1423964815320.javamail.ya...@mail.yahoo.com

Thanks. I have +1'd that patch.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
On 13 March 2015 at 15:41, Kevin Grittner <kgri...@ymail.com> wrote:
> The feedback was generally fairly positive except for the fact that
> snapshot age (for purposes of being too old) was measured in
> transaction IDs assigned. There seemed to be a pretty universal
> feeling that this needed to be changed to a time-based setting.

-1 for a time based setting. After years of consideration, bloat is
now controllable by altering the size of the undo tablespace.

I think PostgreSQL needs something size-based also. It would need some
estimation to get it to work like that, true, but it is actually the
size of the bloat we care about, not the time. So we should be
thinking in terms of limits that we actually care about.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
Hello,

At Thu, 12 Mar 2015 15:27:37 -0700, Peter Geoghegan <p...@heroku.com>
wrote in CAM3SWZQ5nwTB-y4ZOj=5ckMLce5GAEUnjKJ2=m1vmhfx_ay...@mail.gmail.com
> On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgri...@ymail.com> wrote:
>> At some point we could consider building on this patch to recheck
>> index conditions for heap access when a non-MVCC snapshot is used,
>> check the visibility map for referenced heap pages when the TIDs are
>> read for an index-only scan, and/or skip LP_DEAD hinting for
>> non-WAL-logged indexes. But all those are speculative future work;
>> this is a conservative implementation that just didn't modify
>> pinning where there were any confounding factors.
>
> I don't have the bandwidth for a full review, but I took a quick look
> at this.
>
> I think you should call out those confounding factors in the README.
> It's not hard to piece them together from
> _bt_drop_lock_and_maybe_pin(), but I think you should anyway.
>
> Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing nbtree
> LockBuffer() callers do. You're inconsistent about that in V3.

I agree with you. It looks to be the only point where it is used.

In addition to that, the comment just above the point mentioned above
puzzles me.

    /* XXX: Can walking left be lighter on the locking and pins? */
    if (BTScanPosIsPinned(so->currPos))
        LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE);
    else
        so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);

I'd be happy if I could read the meaning of the comment more clearly.
I understand it to say that you want to remove the locking (and
pinning), but can't now because simultaneous splitting of the left
page would break something. I'd like to see it made clearer, whether
or not my reading is correct.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Reduce pinning in btree indexes
On 4 March 2015 at 03:16, Kevin Grittner <kgri...@ymail.com> wrote:
> How do people feel about the idea of me pushing this for 9.5 (after I
> clean up all the affected comments and README files)? I know this
> first appeared in the last CF, but the footprint is fairly small and
> the only user-visible behavior change is that a btree index scan of a
> WAL-logged table using an MVCC snapshot no longer blocks a vacuum
> indefinitely. (If there are objections I will move this to the first
> CF for the next release.)

It helps Hot Standby also, BTW. I proposed this previously, but it was
shot down, so I'm glad to see it happening.

I'm not sure why you have proposed only half the solution though?
Hopefully we aren't just submitting the difficult half that you needed
feedback on? Let's see the whole "snapshot too old" patch as well,
since that affects core PostgreSQL also.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
On 15 February 2015 at 00:19, Kevin Grittner <kgri...@ymail.com> wrote:
> What they wanted was what happened in the other database product --
> if a snapshot got sufficiently old that cleaning up the MVCC data was
> a problem *and* the snapshot was used again *and* it read a page
> which had been modified far enough back that it was not possible to
> return correct data, then they wanted to receive a "snapshot too old"
> error. I wrote a patch to do that...

So, please let's see the patch. It seems useful for core Postgres.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Re: [HACKERS] Reduce pinning in btree indexes
On Fri, Mar 13, 2015 at 8:52 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 15 February 2015 at 00:19, Kevin Grittner <kgri...@ymail.com> wrote:
>> What they wanted was what happened in the other database product --
>> if a snapshot got sufficiently old that cleaning up the MVCC data
>> was a problem *and* the snapshot was used again *and* it read a page
>> which had been modified far enough back that it was not possible to
>> return correct data, then they wanted to receive a "snapshot too
>> old" error. I wrote a patch to do that...
>
> So, please let's see the patch. It seems useful for core Postgres.

It was submitted here:

http://www.postgresql.org/message-id/136937748.3364317.1423964815320.javamail.ya...@mail.yahoo.com

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> At Thu, 12 Mar 2015 15:27:37 -0700, Peter Geoghegan <p...@heroku.com> wrote:
>> On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgri...@ymail.com> wrote:
>>> At some point we could consider building on this patch to recheck
>>> index conditions for heap access when a non-MVCC snapshot is used,
>>> check the visibility map for referenced heap pages when the TIDs
>>> are read for an index-only scan, and/or skip LP_DEAD hinting for
>>> non-WAL-logged indexes. But all those are speculative future work;
>>> this is a conservative implementation that just didn't modify
>>> pinning where there were any confounding factors.
>>
>> I think you should call out those confounding factors in the README.
>> It's not hard to piece them together from
>> _bt_drop_lock_and_maybe_pin(), but I think you should anyway.

OK:

https://github.com/kgrittn/postgres/commit/f5f59ded30b114ac83b90a00ba1fa5ef490b994e

>> Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing nbtree
>> LockBuffer() callers do. You're inconsistent about that in V3.
>
> I agree with you. It looks to be the only point where it is used.

OK:

https://github.com/kgrittn/postgres/commit/76118b58be819ed5e68569c926d0222bc41640ea

> In addition to that, the comment just above the point mentioned above
> puzzles me.
>
>     /* XXX: Can walking left be lighter on the locking and pins? */
>     if (BTScanPosIsPinned(so->currPos))
>         LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE);
>     else
>         so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
>
> I'd be happy if I could read the meaning of the comment more clearly.
> I understand it to say that you want to remove the locking (and
> pinning), but can't now because simultaneous splitting of the left
> page would break something. I'd like to see it made clearer, whether
> or not my reading is correct.

Does this clear it up?:

https://github.com/kgrittn/postgres/commit/22066fc161a092e800e4c1e853136c4513f8771b

Since there are no changes that would affect the compiled code here,
I'm not posting a new patch yet. I'll do that once things seem to have
settled down a bit more.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Mar 13, 2015 at 8:52 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> On 15 February 2015 at 00:19, Kevin Grittner <kgri...@ymail.com> wrote:
>>> What they wanted was what happened in the other database product --
>>> if a snapshot got sufficiently old that cleaning up the MVCC data
>>> was a problem *and* the snapshot was used again *and* it read a
>>> page which had been modified far enough back that it was not
>>> possible to return correct data, then they wanted to receive a
>>> "snapshot too old" error. I wrote a patch to do that...
>>
>> So, please let's see the patch. It seems useful for core Postgres.
>
> It was submitted here:
>
> http://www.postgresql.org/message-id/136937748.3364317.1423964815320.javamail.ya...@mail.yahoo.com

The feedback was generally fairly positive except for the fact that
snapshot age (for purposes of being too old) was measured in
transaction IDs assigned. There seemed to be a pretty universal
feeling that this needed to be changed to a time-based setting. I've
been working on that, although my initial efforts proved to be a
dead-end. I've had some off-list discussions with Andrew Dunstan and
we have agreed on something that looks like it should work well, but
it's still being coded.

The other thing that is needed is to sprinkle the other index AMs with
TestForOldSnapshot() calls. In the btree code the right spots for
those were all following BufferGetPage() calls, so the work was in
seeing where each of those could be called from to see whether it was
a spot where the check was needed. Of course, docs and comment changes
are also needed, but that's probably only a day or two's effort.

Further discussion of the "snapshot too old" patch should probably go
on its thread. The spillover is largely my fault for referencing one
from the other.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
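(A sketch of the pattern just described, inferred from the
description; the exact signature in the eventual patch may differ:)

    /*
     * After each BufferGetPage() whose contents feed user-visible
     * results, check whether the snapshot is still allowed to see the
     * page; if not, this raises the "snapshot too old" error.
     */
    page = BufferGetPage(buf);
    TestForOldSnapshot(scan->xs_snapshot, rel, page);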
Re: [HACKERS] Reduce pinning in btree indexes
Kevin Grittner <kgri...@ymail.com> wrote:
> Because these changes are just to a comment and a README file, I'm
> posting a patch-on-patch v3a (to be applied on top of v3).

Here is some additional comment work that I hope will make things
easier to understand for whoever next visits this code. There is also
a minor reorganization of some code that should not have any impact
except to skip the "restore" memcpy() calls where they are not needed
in a corner case that I'm not sure we can currently even reach. That
reorganization is intended more for code clarity than for the marginal
performance it could provide.

I'm posting this as v3b to be applied on top of v3 and v3a, so that
these changes can be reviewed and commented on separately.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[Attachment: bt-nopin-v3b.patch]
Re: [HACKERS] Reduce pinning in btree indexes
On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgri...@ymail.com> wrote:
> At some point we could consider building on this patch to recheck
> index conditions for heap access when a non-MVCC snapshot is used,
> check the visibility map for referenced heap pages when the TIDs are
> read for an index-only scan, and/or skip LP_DEAD hinting for
> non-WAL-logged indexes. But all those are speculative future work;
> this is a conservative implementation that just didn't modify pinning
> where there were any confounding factors.

I don't have the bandwidth for a full review, but I took a quick look
at this.

I think you should call out those confounding factors in the README.
It's not hard to piece them together from
_bt_drop_lock_and_maybe_pin(), but I think you should anyway.

Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing nbtree
LockBuffer() callers do. You're inconsistent about that in V3.

--
Peter Geoghegan
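(For readers piecing the confounding factors together: the helper
Peter names gates the pin release on exactly the three conditions from
Kevin's list. A sketch consistent with that description; the patch's
final text may differ in detail:)

    /*
     * Release the read lock; also release the pin unless one of the
     * confounding factors applies: a non-MVCC snapshot, an index that
     * is not WAL-logged, or an index-only scan that must keep the
     * tuples around.
     */
    static void
    _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
    {
        LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);

        if (IsMVCCSnapshot(scan->xs_snapshot) &&
            RelationNeedsWAL(scan->indexRelation) &&
            !scan->xs_want_itup)
        {
            ReleaseBuffer(sp->buf);
            sp->buf = InvalidBuffer;
        }
    }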
Re: [HACKERS] Reduce pinning in btree indexes
Hello,

I found no other problems, including the performance issue, in the
patch attached to the last mail, as far as I can tell. So I'll mark
this as ready for committer after a few days with no objection, once
these comments are addressed.

>> - If (BTScanPosIsPinned(so->currPos)).
>>
>>   As I mention below for nbtsearch.c, the page acquired in the
>>   if-then block may be vacuumed so the LSN check done in the if-else
>>   block is needed also in the if-then block. It will be accomplished
>>   only by changing the position of the end of the if-else block.
>
> I'm not sure I agree on this.

Sorry, it is largely because of my poor composition.

> If the page is pinned it should have been pinned continuously since
> we initially read it, so the line pointers we have could not have
> been removed by any vacuum process. The tuples may have been pruned
> away in the heap, but that doesn't matter. Index entries may have
> been added and the index page may have been split, but that was true
> before this patch and _bt_killitems will deal with those things the
> same as it always has.

Yes. I think so.

> I don't see any benefit to doing the LSN compare in this case; if
> we've paid the costs of holding the pin through to this point, we
> might as well flag any dead entries we can.

I thought of the case where the buffer has been pinned by another
backend after this backend unpinned it. I looked again at the
definitions of BTScanPosIsPinned and BTScanPosUnpinIfPinned, and
understood that the pin should be mine if BTScanPosIsPinned.

Would you mind rewriting the comment there like this?

-    /* The buffer is still pinned, but not locked. Lock it now. */
+    /* I still hold the pin on the buffer, but not locked. Lock it now. */
|    LockBuffer(so->currPos.buf, BT_READ);

Or would you mind renaming the macro to BTScanPosIsPinnedByMe or
something like that, or any other way to notify the reader that the
pin should be mine there?

>> - _bt_killitems is called without a pin when rescanning from before,
>>   so I think the previous code has already considered the unpinned
>>   case ("if (ItemPointerEquals(ituple..." does that). Vacuum
>>   rearranges line pointers after deleting LP_DEAD items so the
>>   previous check seems enough for the purpose. The previous code is
>>   more efficient because the matching items will be processed even
>>   after vacuum.
>
> I'm not following you on this one; could you rephrase it?

Sorry, I incorrectly read btrescan as calling _bt_killitems() *after*
releasing the buffer. Please forget it.

Finally, I'd like to timidly comment on this..

+ To handle the cases where it is safe to release the pin after
+ reading the index page, the LSN of the index page is read along
+ with the index entries before the shared lock is released, and we
+ do not attempt to flag index tuples as dead if the page is not
+ pinned and the LSN does not match.

I suppose that a sentence like the following describes the mechanism
more directly and makes it easier to find the corresponding code, if
it is correct.

   To handle the cases where an index page has been unpinned when
   trying to mark the unused index tuples on the page as dead, the
   LSN of the index page is remembered when reading the index page
   for index tuples, and we do not attempt to flag index tuples as
   dead if the page is not pinned and the LSN does not match.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> I found no other problems, including the performance issue, in the
> patch attached to the last mail, as far as I can tell. So I'll mark
> this as ready for committer after a few days with no objection, once
> these comments are addressed.

Thanks for the reviews!

>> I don't see any benefit to doing the LSN compare in this case; if
>> we've paid the costs of holding the pin through to this point, we
>> might as well flag any dead entries we can.
>
> I thought of the case where the buffer has been pinned by another
> backend after this backend unpinned it. I looked again at the
> definitions of BTScanPosIsPinned and BTScanPosUnpinIfPinned, and
> understood that the pin should be mine if BTScanPosIsPinned.
>
> Would you mind rewriting the comment there like this?
>
> -    /* The buffer is still pinned, but not locked. Lock it now. */
> +    /* I still hold the pin on the buffer, but not locked. Lock it now. */
> |    LockBuffer(so->currPos.buf, BT_READ);
>
> Or would you mind renaming the macro to BTScanPosIsPinnedByMe or
> something like that, or any other way to notify the reader that the
> pin should be mine there?

I see your point, although those first person singular pronouns used
like that make me a little uncomfortable; I'll change the comment
and/or macro name, but I'll work on the name some more.

> Finally, I'd like to timidly comment on this..
>
> + To handle the cases where it is safe to release the pin after
> + reading the index page, the LSN of the index page is read along
> + with the index entries before the shared lock is released, and we
> + do not attempt to flag index tuples as dead if the page is not
> + pinned and the LSN does not match.
>
> I suppose that a sentence like the following describes the mechanism
> more directly and makes it easier to find the corresponding code, if
> it is correct.
>
>    To handle the cases where an index page has been unpinned when
>    trying to mark the unused index tuples on the page as dead, the
>    LSN of the index page is remembered when reading the index page
>    for index tuples, and we do not attempt to flag index tuples as
>    dead if the page is not pinned and the LSN does not match.

Will reword that part to try to make it clearer. Thanks!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
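(For reference, the position-state helpers being discussed have
roughly this shape; simplified from the patch, whose real macros carry
extra assertions and reset more fields:)

    /*
     * The pin, if any, is always this backend's own: the scan took it
     * when it read the page, and only these helpers release it.
     */
    #define BTScanPosIsPinned(scanpos) \
        BufferIsValid((scanpos).buf)

    #define BTScanPosUnpinIfPinned(scanpos) \
    do { \
        if (BTScanPosIsPinned(scanpos)) \
        { \
            ReleaseBuffer((scanpos).buf); \
            (scanpos).buf = InvalidBuffer; \
        } \
    } while (0)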
Re: [HACKERS] Reduce pinning in btree indexes
Kevin Grittner <kgri...@ymail.com> wrote:
> Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>> Would you mind rewriting the comment there like this?
>>
>> -    /* The buffer is still pinned, but not locked. Lock it now. */
>> +    /* I still hold the pin on the buffer, but not locked. Lock it now. */
>>
>> Or would you mind renaming the macro to BTScanPosIsPinnedByMe or
>> something like that, or any other way to notify the reader that the
>> pin should be mine there?
>
> I see your point, although those first person singular pronouns used
> like that make me a little uncomfortable; I'll change the comment
> and/or macro name, but I'll work on the name some more.

After thinking it over, I think that the BTScanPos part of the macro
name is enough of a hint that it is concerned with the actions of this
scan; it is the comment that needs the change. I went with:

    /*
     * We have held the pin on this page since we read the index tuples,
     * so all we need to do is lock it.  The pin will have prevented
     * re-use of any TID on the page, so there is no need to check the
     * LSN.
     */

>> + To handle the cases where it is safe to release the pin after
>> + reading the index page, the LSN of the index page is read along
>> + with the index entries before the shared lock is released, and we
>> + do not attempt to flag index tuples as dead if the page is not
>> + pinned and the LSN does not match.
>>
>> I suppose that a sentence like the following describes the mechanism
>> more directly and makes it easier to find the corresponding code, if
>> it is correct.
>>
>>    To handle the cases where an index page has been unpinned when
>>    trying to mark the unused index tuples on the page as dead, the
>>    LSN of the index page is remembered when reading the index page
>>    for index tuples, and we do not attempt to flag index tuples as
>>    dead if the page is not pinned and the LSN does not match.
>
> Will reword that part to try to make it clearer.

That sentence was full of passive voice, which didn't help any. I
changed it to:

| So that we can handle the cases where we attempt LP_DEAD flagging
| for a page after we have released its pin, we remember the LSN of
| the index page when we read the index tuples from it; we do not
| attempt to flag index tuples as dead if we didn't hold the pin the
| entire time and the LSN has changed.

Do these changes seem clear?

Because these changes are just to a comment and a README file, I'm
posting a patch-on-patch v3a (to be applied on top of v3).

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[Attachment: bt-nopin-v3a.patch]
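(The comment and README text above describe two branches in
_bt_killitems, which come down to roughly the following; a sketch of
the described behavior, not the patch's exact code:)

    if (BTScanPosIsPinned(so->currPos))
    {
        /*
         * The pin has been held since the index tuples were read, so no
         * TID on the page can have been recycled by vacuum: just lock
         * the page, with no LSN check.
         */
        LockBuffer(so->currPos.buf, BT_READ);
    }
    else
    {
        /*
         * The pin was dropped: re-read the page, and only proceed if
         * its LSN shows it is unchanged since the tuples were copied.
         */
        Buffer  buf = _bt_getbuf(scan->indexRelation,
                                 so->currPos.currPage, BT_READ);

        if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
            so->currPos.buf = buf;
        else
        {
            _bt_relbuf(scan->indexRelation, buf);
            return;             /* page changed; skip LP_DEAD hinting */
        }
    }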
Re: [HACKERS] Reduce pinning in btree indexes
Kevin Grittner <kgri...@ymail.com> wrote:
> It doesn't seem worth posting to the list for the small changes since
> the last version; I'll wait until I update the comments and README
> files. If you want to review or test the latest, you can peek at:
> https://github.com/kgrittn/postgres/tree/btnopin

Here is v3, with the promised README and code comment changes.

In the process of documenting the mark/restore area I noticed a
subtlety that I had missed (in the case that there was a mark, advance
to the next page, restore, advance within the page, and restore). I
fixed that, and in the process gave the patched code an additional
direct performance edge over unpatched code. For the 1000k marks,
average timings are now:

  master:  970.999 ms, stdev: 4.043
  patched: 940.460 ms, stdev: 4.874

So, in the micro-benchmark showing the biggest benefit the direct
improvement is now just over 3%. It remains the case that the primary
motivation for the patch is to reduce blocking of vacuum processes;
but that's a nice side-benefit.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[Attachment: bt-nopin-v3.patch]
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> The performance which your test shows looks great. The gain might
> come from the removal of buffer locking in _bt_next.

Yeah, I had been hoping that removing some buffer pins and locks from
the common path of scanning forward from one page to the next might
have some direct performance benefit; it's possible that this will
show up more dramatically at high concurrency levels. The main goal,
of course, was to improve performance more indirectly, by not having
autovacuum worker processes blocked for extended periods in some
situations where that can now happen.

> In nbtutils.c, _bt_killitems:
>
> - This is not in the charge of this patch, but the comment for
>   _bt_killitems looks to have a trivial typo.

As you say, that problem was already there, but no reason not to fix
it in the patch as long as we're here. Reworded.

> - The following comment looks somewhat wrong.
>
>    /* Since the else condition needs page, get it here, too. */
>
>   "the else condition needs page" means the page of the buffer is
>   needed later? But, I suppose the comment itself is not necessary.

I guess that comment isn't really worth the space it takes up; what it
says is pretty obvious from the code. Removed.

> - If (BTScanPosIsPinned(so->currPos)).
>
>   As I mention below for nbtsearch.c, the page acquired in the
>   if-then block may be vacuumed so the LSN check done in the if-else
>   block is needed also in the if-then block. It will be accomplished
>   only by changing the position of the end of the if-else block.

I'm not sure I agree on this. If the page is pinned it should have
been pinned continuously since we initially read it, so the line
pointers we have could not have been removed by any vacuum process.
The tuples may have been pruned away in the heap, but that doesn't
matter. Index entries may have been added and the index page may have
been split, but that was true before this patch and _bt_killitems will
deal with those things the same as it always has. I don't see any
benefit to doing the LSN compare in this case; if we've paid the costs
of holding the pin through to this point, we might as well flag any
dead entries we can.

> - _bt_killitems is called without a pin when rescanning from before,
>   so I think the previous code has already considered the unpinned
>   case ("if (ItemPointerEquals(ituple..." does that). Vacuum
>   rearranges line pointers after deleting LP_DEAD items so the
>   previous check seems enough for the purpose. The previous code is
>   more efficient because the matching items will be processed even
>   after vacuum.

I'm not following you on this one; could you rephrase it?

> In nbtsearch.c
>
> - _bt_first(): It now releases the share lock on the page before the
>   items to be killed are processed. This allows other backends to
>   insert items into the page simultaneously. It seems to be dangerous
>   alone, but _bt_killitems already considers the case. Anyway I think
>   it'll be better to add a comment mentioning why unlocking is not
>   dangerous.

Well, _bt_killitems does acquire a shared (read) lock to flag index
tuples as LP_DEAD, and it has always been OK for the index page to
accept inserts and allow page splits before calling _bt_killitems. I
don't really see anything new with the patch in this regard. I do
agree, though, that a lot of comments and at least two README files
refer to the pins blocking vacuum, and these must be found and
changed.

It doesn't seem worth posting to the list for the small changes since
the last version; I'll wait until I update the comments and README
files. If you want to review or test the latest, you can peek at:

https://github.com/kgrittn/postgres/tree/btnopin

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
Hello,

> It looks like the remaining performance regression was indeed a
> result of code alignment. I found two "paranoia" assignments I had
> accidentally failed to put back with the rest of the mark/restore
> optimization; after that trivial change the patched version is
> ever-so-slightly faster than master on all tests.

The performance which your test shows looks great. The gain might come
from the removal of buffer locking in _bt_next. I would also like to
see this come along with 9.5, but that is a matter for the committers.

Apart from that, I looked into the patch itself more closely. Please
reconcile as you like among my contradicting comments below :)

In nbtutils.c, _bt_killitems:

- This is not in the charge of this patch, but the comment for
  _bt_killitems looks to have a trivial typo.

   * _bt_killitems - set LP_DEAD state for items an indexscan caller has
   *      told us were killed
   *
   * scan->so contains information about the current page and killed tuples
   * thereon (generally, this should only be called if so->numKilled > 0).

  I suppose "scan->so" should be "scan->opaque" and "so->numKilled"
  should be "opaque->numKilled".

- The following comment looks somewhat wrong.

   /* Since the else condition needs page, get it here, too. */

  "the else condition needs page" means the page of the buffer is
  needed later? But, I suppose the comment itself is not necessary.

- If (BTScanPosIsPinned(so->currPos)).

  As I mention below for nbtsearch.c, the page acquired in the if-then
  block may be vacuumed so the LSN check done in the if-else block is
  needed also in the if-then block. It will be accomplished only by
  changing the position of the end of the if-else block.

- _bt_killitems is called without a pin when rescanning from before,
  so I think the previous code has already considered the unpinned
  case ("if (ItemPointerEquals(ituple..." does that). Vacuum
  rearranges line pointers after deleting LP_DEAD items so the
  previous check seems enough for the purpose. The previous code is
  more efficient because the matching items will be processed even
  after vacuum.

In nbtsearch.c

- _bt_first(): It now releases the share lock on the page before the
  items to be killed are processed. This allows other backends to
  insert items into the page simultaneously. It seems to be dangerous
  alone, but _bt_killitems already considers the case. Anyway I think
  it'll be better to add a comment mentioning why unlocking is not
  dangerous.

... Sorry, time's up for now.

regards,

> Average of 3 `make check-world` runs:
>   master: 336.288 seconds
>   patch:  332.237 seconds
>
> Average of 6 `make check` runs:
>   master: 25.409 seconds
>   patch:  25.270 seconds
>
> The following were all run 12 times, the worst 2 dropped, the rest
> averaged:
>
> Kyotaro-san's 1m mark worst case benchmark:
>   master: 962.581 ms, 6.765 stdev
>   patch:  947.518 ms, 3.584 stdev
>
> Kyotaro-san's 500k mark, 500k restore worst case benchmark:
>   master: 1366.639 ms, 5.314 stdev
>   patch:  1363.844 ms, 5.896 stdev
>
> pgbench -i -s 16 pgbench / pgbench -c 16 -j 4 -T 500 pgbench:
>   master: 265.011 tps, 4.952 stdev
>   patch:  267.164 tps, 9.094 stdev
>
> How do people feel about the idea of me pushing this for 9.5 (after I
> clean up all the affected comments and README files)? I know this
> first appeared in the last CF, but the footprint is fairly small and
> the only user-visible behavior change is that a btree index scan of a
> WAL-logged table using an MVCC snapshot no longer blocks a vacuum
> indefinitely. (If there are objections I will move this to the first
> CF for the next release.)
>
>  src/backend/access/nbtree/nbtree.c    |  50 +---
>  src/backend/access/nbtree/nbtsearch.c | 141 +-
>  src/backend/access/nbtree/nbtutils.c  |  58 ++
>  src/include/access/nbtree.h           |  36 -
>  4 files changed, 201 insertions(+), 84 deletions(-)

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Reduce pinning in btree indexes
Kevin Grittner <kgri...@ymail.com> wrote:
> [need to check performance more]

It looks like the remaining performance regression was indeed a result
of code alignment. I found two "paranoia" assignments I had
accidentally failed to put back with the rest of the mark/restore
optimization; after that trivial change the patched version is
ever-so-slightly faster than master on all tests.

Average of 3 `make check-world` runs:
  master: 336.288 seconds
  patch:  332.237 seconds

Average of 6 `make check` runs:
  master: 25.409 seconds
  patch:  25.270 seconds

The following were all run 12 times, the worst 2 dropped, the rest
averaged:

Kyotaro-san's 1m mark worst case benchmark:
  master: 962.581 ms, 6.765 stdev
  patch:  947.518 ms, 3.584 stdev

Kyotaro-san's 500k mark, 500k restore worst case benchmark:
  master: 1366.639 ms, 5.314 stdev
  patch:  1363.844 ms, 5.896 stdev

pgbench -i -s 16 pgbench / pgbench -c 16 -j 4 -T 500 pgbench:
  master: 265.011 tps, 4.952 stdev
  patch:  267.164 tps, 9.094 stdev

How do people feel about the idea of me pushing this for 9.5 (after I
clean up all the affected comments and README files)? I know this
first appeared in the last CF, but the footprint is fairly small and
the only user-visible behavior change is that a btree index scan of a
WAL-logged table using an MVCC snapshot no longer blocks a vacuum
indefinitely. (If there are objections I will move this to the first
CF for the next release.)

 src/backend/access/nbtree/nbtree.c    |  50 +---
 src/backend/access/nbtree/nbtsearch.c | 141 +-
 src/backend/access/nbtree/nbtutils.c  |  58 ++
 src/include/access/nbtree.h           |  36 -
 4 files changed, 201 insertions(+), 84 deletions(-)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Hello, I measured the performance of this patch considering
> markpos/restorepos. The loss seems to be up to about 10%,
> unfortunately.

Thanks for the test case!

I took another look at this optimization, and found that it didn't
really depend on the pin (as I had first thought), so I put it back
(keeping the rest of the patch unchanged). I saw a 1.4% increase in
run time with the patch (compared to master) for the mark-only test,
and a 1.6% increase in run time for the mark/restore test. I'll look
into what might be causing that. I have seen bigger differences just
due to changing where executable code crossed cache boundaries, but
this is big enough to worry about; it needs to be checked.

New patch attached.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[Attachment: bt-nopin-v2.patch]
Re: [HACKERS] Reduce pinning in btree indexes
>>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> writes:

 Kyotaro> Just a reminder, but I am not the author of this patch:)

Yes, I understand that.

 Kyotaro> Mmm? The patch bt-nopin-v1.patch seems not to contain the
 Kyotaro> change for ExecSupportsMarkRestore, and that very simple
 Kyotaro> function still looks to return true for T_Index(Only)Scan
 Kyotaro> after the patch is applied.

Right. I'm suggesting you change that, in order to determine what
performance cost, if any, would result from abandoning the idea of
doing mark/restore entirely.

 Kyotaro> I understand that you'd like to see the net drag on
 Kyotaro> performance by the memcpy(), right?

No. What I am suggesting is this: if mark/restore is a performance
issue, then it would be useful to know how much gain we're getting (if
any) from supporting it _at all_.

Let me try and explain it another way. If you change
ExecSupportsMarkRestore to return false for index scans, then
btmarkpos/btrestorepos will no longer be called. What is the
performance of this case compared to the original and patched
versions?

--
Andrew (irc:RhodiumToad)
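(A sketch of the experiment being proposed, assuming the pre-9.5
NodeTag-based signature of ExecSupportsMarkRestore in execAmi.c and
abbreviating the case list: claiming that index scans cannot
mark/restore makes the planner put a Materialize node above the inner
side of a merge join instead of calling btmarkpos/btrestrpos.)

    bool
    ExecSupportsMarkRestore(NodeTag plantype)
    {
        switch (plantype)
        {
            case T_IndexScan:
            case T_IndexOnlyScan:
                return false;   /* experiment: force a Materialize node */

            case T_SeqScan:
            case T_ValuesScan:
            case T_Material:
            case T_Sort:
                return true;

            default:
                break;
        }
        return false;
    }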
Re: [HACKERS] Reduce pinning in btree indexes
Just a reminder, but I am not the author of this patch:)

At Fri, 27 Feb 2015 07:26:38 +0000, Andrew Gierth
<and...@tao11.riddles.org.uk> wrote in 87zj7z6ckc@news-spur.riddles.org.uk
>>> You might want to try running the same test, but after patching
>>> ExecSupportsMarkRestore to return false for index scans. This will
>>> cause the planner to insert a Materialize node to handle the
>>> mark/restore.
>
>  Kyotaro> Mmm? The patch bt-nopin-v1.patch seems not to contain the
>  Kyotaro> change for ExecSupportsMarkRestore, and that very simple
>  Kyotaro> function still looks to return true for T_Index(Only)Scan
>  Kyotaro> after the patch is applied.
>
> Right. I'm suggesting you change that, in order to determine what
> performance cost, if any, would result from abandoning the idea of
> doing mark/restore entirely.

I understand that you'd like to see the net drag on performance by the
memcpy(), right? That doesn't seem to be possible without breaking (a
part of) the patch's function, but, concerning the btmarkpos()-only
case, the mark struct can hold garbage, so only changing the refcount
would be viable to see the pure loss from the memcpy().

The attached patch is the destruction I did, and the result was like
below.

Case 2: 1M markpos, no restorepos for 1M result rows.

IndexOnly scan: the patched version loses about 10%.
  master:  3655 ms (std dev = 2.5 ms)
  Patched: 4038 ms (std dev = 2.6 ms)
  Patched: 4036 ms (std dev = 3.5 ms)
  Broken:  3718 ms (std dev = 1.6 ms)

The second "Patched" line above is the retaken numbers; it's a little
under a 9% loss from the broken version. That is the pure drag of the
memcpy(). This seems quite big, as Heikki said. It's an extreme case,
though. The remaining 1.7% should be the share of all the other stuff.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 65daf27..4973b63 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -547,12 +547,17 @@ btmarkpos(PG_FUNCTION_ARGS)
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
 	hoge_markpos_count++;
-	memcpy(&so->markPos, &so->currPos,
-		   offsetof(BTScanPosData, items[1]) +
-		   so->currPos.lastItem * sizeof(BTScanPosItem));
-	if (so->markTuples)
-		memcpy(so->markTuples, so->currTuples,
-			   so->currPos.nextTupleOffset);
+//	memcpy(&so->markPos, &so->currPos,
+//		   offsetof(BTScanPosData, items[1]) +
+//		   so->currPos.lastItem * sizeof(BTScanPosItem));
+//	if (so->markTuples)
+//		memcpy(so->markTuples, so->currTuples,
+//			   so->currPos.nextTupleOffset);
+	if (BTScanPosIsValid(so->markPos))
+	{
+		ReleaseBuffer(so->markPos.buf);
+		so->markPos.buf = InvalidBuffer;
+	}
 
 	/* We don't take out an extra pin for the mark position. */
 	so->markPos.buf = InvalidBuffer;
@@ -599,12 +604,13 @@ btrestrpos(PG_FUNCTION_ARGS)
 
 	if (BTScanPosIsValid(so->markPos))
 	{
-		memcpy(&so->currPos, &so->markPos,
-			   offsetof(BTScanPosData, items[1]) +
-			   so->markPos.lastItem * sizeof(BTScanPosItem));
-		if (so->currTuples)
-			memcpy(so->currTuples, so->markTuples,
-				   so->markPos.nextTupleOffset);
+//		memcpy(&so->currPos, &so->markPos,
+//			   offsetof(BTScanPosData, items[1]) +
+//			   so->markPos.lastItem * sizeof(BTScanPosItem));
+//		if (so->currTuples)
+//			memcpy(so->currTuples, so->markTuples,
+//				   so->markPos.nextTupleOffset);
+		IncrBufferRefCount(so->markPos.buf);
 	}
 	else
 		BTScanPosInvalidate(&so->currPos);
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro == Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: Kyotaro Anyway I'm sorry but I have left my dev env and cannot have Kyotaro time till next week. The point is moot; rather than try and explain it further I did the test myself, and the performance penalty of disabling mark/restore is substantial, on the order of 30%, so that's a non-starter. (I was a bit surprised that it was so bad...) -- Andrew (irc:RhodiumToad)
Re: [HACKERS] Reduce pinning in btree indexes
Hello. Mmm... We might be focusing on slightly different things. 2015/02/27 17:27 Andrew Gierth: Kyotaro I understand that you'd like to see the net drag of Kyotaro performance by the memcpy(), right? No. What I am suggesting is this: if mark/restore is a performance issue, then it would be useful to know how much gain we're getting (if any) from supporting it _at all_. The mark/restore works as before with this patch, except for the early dropping of buffer pins. As far as I understand, all the stuff in the patch serves that purpose and is not intended to make btree itself faster. That is, it reduces the chance of blocking vacuum, at the price of a possible burden on general performance. And I think the issue currently in focus is that the burden might be a bit too heavy in specific situations. Therefore I tried to measure how heavy or light the burden is. Am I correct so far? Let me try and explain it another way. If you change ExecSupportsMarkRestore to return false for index scans, then btmarkpos/btrestrpos will no longer be called. What is the performance of this case compared to the original and patched versions? As you mentioned before, such a change will make the planner turn to, perhaps, a materialize plan or a rescan, or some other scan type, none of which is useful for the issue in focus: the performance drag when a btree scan does extremely frequent mark/restore, compared with the unpatched version that copies less. Your suggestion seems to aim at something different from this. Anyway I'm sorry but I have left my dev env and cannot have time till next week. regards, -- Kyotaro Horiguchi
Re: [HACKERS] Reduce pinning in btree indexes
Hello, I measured the performance of this patch considering markpos/restorepos. The loss seems to be up to about 10%, unfortunately. At Thu, 26 Feb 2015 17:49:23 +0000 (UTC), Kevin Grittner kgri...@ymail.com wrote in 440831854.629116.1424972963735.javamail.ya...@mail.yahoo.com: Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/15/2015 02:19 AM, Kevin Grittner wrote: Hmm. Did your test case actually exercise mark/restore? The memcpy()s are not that small. Especially if it's an index-only scan, you're copying a large portion of the page. Some scans call markpos on every tuple, so that could add up. I have results from the `make world` regression tests and a 48-hour customer test. Unfortunately I don't know how heavily either of those exercise this code. Do you have a suggestion for a way to test whether there is a serious regression for something approaching a worst case? ammarkpos/amrestrpos are called in merge joins. By the steps shown below, I had 1M markpos calls and no restorepos for a 1M-row result, and 500k markpos calls and the same number of restorepos calls for a 2M-row result with a slightly different configuration. I suppose we can say these are the worst case for markpos/restrpos. The call counts can be taken using the attached patch. Both binaries were compiled with -O2. shared_buffers=1GB, and all shared pages used in the query were hit when measuring. The numbers were taken 12 times for each configuration; the averages and stddevs below are over the 10 runs left after excluding the best and the worst.

Case 1. 500k markpos/restorepos for 2M result rows.

 Index scan: the patch loses about 2% (1.98%).
  master:  6166 ms (std dev = 3.2 ms)
  Patched: 6288 ms (std dev = 3.7 ms)

 IndexOnly scan: the patch loses about 2% (2.14%).
  master:  5946 ms (std dev = 5.0 ms)
  Patched: 6073 ms (std dev = 3.3 ms)

The patched version is slower by about 2%. Of course not all of it is the effect of the memcpy(), but I don't know the distribution.

Case 2. 1M markpos, no restorepos for 1M result rows.

 IndexOnly scan: the patch loses about 10%.
  master:  3655 ms (std dev = 2.5 ms)
  Patched: 4038 ms (std dev = 2.6 ms)

The loss is about 10%. The case looks a bit impractical, but unfortunately the number might not be ignorable. The distribution of the loss is unknown, too.
regards,

===
CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);
delete from t1;
delete from t2;
-- This causes 1M times of markpos and no restorepos
INSERT INTO t1 (SELECT a FROM generate_series(0, 999999) a);
INSERT INTO t2 (SELECT a FROM generate_series(0, 999999) a);
-- This causes 500k times of markpos and the same number of restorepos
-- INSERT INTO t1 (SELECT a/2 FROM generate_series(0, 999999) a);
-- INSERT INTO t2 (SELECT a/2 FROM generate_series(0, 999999) a);
CREATE INDEX it1 ON t1 (a);
CREATE INDEX it2 ON t2 (a);
ANALYZE t1;
ANALYZE t2;
SET enable_seqscan to false;
SET enable_material to false;
SET enable_hashjoin to false;
SET enable_nestloop to false;
SET enable_indexonlyscan to false; -- omit this to do indexonly scan
EXPLAIN (ANALYZE) SELECT t1.a, t2.a FROM t1 JOIN t2 on (t1.a = t2.a);

                              QUERY PLAN
--------------------------------------------------------------------------
 Merge Join (cost=2.83..322381.82 rows=3031231 width=8) (actual time=0.013..5193.566 rows=2000000 loops=1)
   Merge Cond: (t1.a = t2.a)
   -> Index Scan using it1 on t1 (cost=0.43..65681.87 rows=167 width=4) (actual time=0.004..727.557 rows=1000000 loops=1)
   -> Index Scan using it2 on t2 (cost=0.43..214642.89 rows=3031231 width=4) (actual time=0.004..1478.361 rows=1 loops=1)
 Planning time: 25.585 ms
 Execution time: 6299.521 ms
(6 rows)

-- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3af462b..04d6cec 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -543,15 +543,19 @@ btendscan(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+
 /*
  * btmarkpos() -- save current scan position
  */
+int hoge_markpos_count = 0;
+int hoge_restrpos_count = 0;
 Datum
 btmarkpos(PG_FUNCTION_ARGS)
 {
 	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
+	hoge_markpos_count++;
 	/* we aren't holding any read locks, but gotta drop the pin */
 	if (BTScanPosIsValid(so->markPos))
 	{
@@ -585,7 +589,7 @@ btrestrpos(PG_FUNCTION_ARGS)
 {
 	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
+	hoge_restrpos_count++;
 	/* Restore the marked positions of any array keys */
 	if (so->numArrayKeys)
 		_bt_restore_array_keys(scan);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 33b172b..e7c1b99 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -258,14 +258,19 @@
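(The execMain.c hunk above is cut off in the archive. Purely for illustration -- this is not Kyotaro's actual change -- surfacing the counters could look like the hypothetical sketch below, which would produce LOG lines of the kind quoted later in the thread:)

#include "postgres.h"

/* counters incremented in the nbtree.c hunks above */
extern int hoge_markpos_count;
extern int hoge_restrpos_count;

/* hypothetical helper: report and reset the counters, e.g. at the
 * end of statement execution */
static void
report_hoge_counts(void)
{
	elog(LOG, "markpos=%d, restrpos=%d",
		 hoge_markpos_count, hoge_restrpos_count);
	hoge_markpos_count = 0;
	hoge_restrpos_count = 0;
}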
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro == Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: Kyotaro ammarkpos/amrestrpos are called in merge joins. By the steps Kyotaro shown below, I had 1M markpos calls and no restorepos for Kyotaro a 1M-row result, and 500k markpos calls and the same Kyotaro number of restorepos calls for a 2M-row result with a slightly Kyotaro different configuration. I suppose we can say these are Kyotaro the worst case for markpos/restrpos. The call counts Kyotaro can be taken using the attached patch. You might want to try running the same test, but after patching ExecSupportsMarkRestore to return false for index scans. This will cause the planner to insert a Materialize node to handle the mark/restore. This way, you can get an idea of how much gain (if any) the direct support of mark/restore in the scan is actually providing. -- Andrew (irc:RhodiumToad)
Re: [HACKERS] Reduce pinning in btree indexes
Hi, At Fri, 27 Feb 2015 05:56:18 +0000, Andrew Gierth and...@tao11.riddles.org.uk wrote in 874mq77vuu@news-spur.riddles.org.uk: Kyotaro == Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: Kyotaro ammarkpos/amrestrpos are called in merge joins. By the steps Kyotaro shown below, I had 1M markpos calls and no restorepos for Kyotaro a 1M-row result, and 500k markpos calls and the same Kyotaro number of restorepos calls for a 2M-row result with a slightly Kyotaro different configuration. I suppose we can say these are Kyotaro the worst case for markpos/restrpos. The call counts Kyotaro can be taken using the attached patch. You might want to try running the same test, but after patching ExecSupportsMarkRestore to return false for index scans. This will cause the planner to insert a Materialize node to handle the mark/restore. Mmm? The patch bt-nopin-v1.patch seems not to contain the change for ExecSupportsMarkRestore, and the very simple function still returns true for T_Index(Only)Scan after the patch is applied. EXPLAIN shows that the plan does not involve a materialize step and, in addition, the counters I put into both btmarkpos() and btrestrpos() showed that they are actually called during execution, like this, for both unpatched and patched:

| LOG: markpos=1000000, restrpos=0
| STATEMENT: EXPLAIN (ANALYZE,BUFFERS) SELECT t1.a, t2.a FROM t1 JOIN t2 on (t1.a = t2.a);

To make sure, I looked into btmarkpos and btrestrpos to confirm that they really do the memcpy() we're talking about, then recompiled the whole source tree. This way, you can get an idea of how much gain (if any) the direct support of mark/restore in the scan is actually providing. Does "the scan" mean a T_Material node? But there is no Material node in the plan, and the counters in bt*pos were in fact incremented, as mentioned above. Could you point out any other possible mistake in my thinking? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] Reduce pinning in btree indexes
Kyotaro == Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: You might want to try running the same test, but after patching ExecSupportsMarkRestore to return false for index scans. This will cause the planner to insert a Materialize node to handle the mark/restore. Kyotaro Mmm? The patch bt-nopin-v1.patch seems not to contain the change Kyotaro for ExecSupportsMarkRestore, and the very simple function Kyotaro still returns true for T_Index(Only)Scan after the patch Kyotaro is applied. Right. I'm suggesting you change that, in order to determine what performance cost, if any, would result from abandoning the idea of doing mark/restore entirely. -- Andrew (irc:RhodiumToad)
Re: [HACKERS] Reduce pinning in btree indexes
Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/15/2015 02:19 AM, Kevin Grittner wrote: Interestingly, the btree README points out that using the old TID with a new tuple poses no hazard for a scan using an MVCC snapshot, because the new tuple would not be visible to a snapshot created that long ago. The first question is: Do we really need that interlock for the non-MVCC snapshots either? We don't need the interlock for non-MVCC snapshots if the conditions we limit or filter on at the index level are rechecked when we get to the heap tuple; otherwise we do. If we do: For non-MVCC snapshots, we need to ensure that all index scans that started before the VACUUM did complete before the VACUUM does. Isn't it enough to be sure that, if we're not going to recheck any index tests against the heap tuple, any process-local copies of index entries which exist at the start of the index scan phase of the vacuum are not used after the vacuum enters the phase of freeing line pointers -- that is, that any scan which has read a page when the vacuum starts to scan the index moves on to another page before vacuum finishes scanning that index? I wonder if we could find a different mechanism to enforce that. Using the pin-interlock for that has always seemed a bit silly to me. It certainly is abusing the semantics of the pin to treat it as a type of lock on the contents. Perhaps grab a new heavy-weight lock on the table whenever a non-MVCC index scan on the table begins, and have VACUUM wait on it. -1 The problem I'm most interested in fixing based on user feedback is a vacuum blocking when a cursor which is using an index scan is sitting idle. Not only does the vacuum of that table stall, but if it is an autovacuum worker, that worker is prevented from cleaning up any other tables. I have seen all autovacuum workers blocked in exactly this way, leading to massive bloat. What you suggest is just a less awkward way to keep the problem. I found that the LP_DEAD hinting would be a problem with an old TID, but I figured we could work around that by storing the page LSN into the scan position structure when the page contents were read, and only doing hinting if that matched when we were ready to do the hinting. That wouldn't work for an index which was not WAL-logged, so the patch still holds pins for those. Or you could use GetFakeLSNForUnloggedRel(). I'm unfamiliar with that, but will take a look. (I will be back at my usual development machine late tomorrow.) Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins. Why does an index-only scan need to hold the pin? Robert already answered this one, but there is a possible solution -- provide some way to check the heap's visibility map for the TIDs read from an index page before releasing the read lock on that page. Finally, there was an "optimization" for marking buffer position for possible restore that was incompatible with releasing the pin. I use quotes because the "optimization" adds overhead to every move to the next page in order to set some variables in a structure when a mark is requested, instead of running two fairly small memcpy() statements. The customer's two-day benchmark showed no performance hit, and looking at the code I would be amazed if the optimization yielded a measurable benefit. In general, optimization by adding overhead to moving through a scan to save time in a mark operation seems dubious. Hmm.
Did your test case actually exercise mark/restore? The memcpy()s are not that small. Especially if it's an index-only scan, you're copying a large portion of the page. Some scans call markpos on every tuple, so that could add up. I have results from the `make world` regression tests and a 48-hour customer test. Unfortunately I don't know how heavily either of those exercise this code. Do you have a suggestion for a way to test whether there is a serious regression for something approaching a worst case? At some point we could consider building on this patch to recheck index conditions for heap access when a non-MVCC snapshot is used, check the visibility map for referenced heap pages when the TIDs are read for an index-only scan, and/or skip LP_DEAD hinting for non-WAL-logged indexes. But all those are speculative future work; this is a conservative implementation that just didn't modify pinning where there were any confounding factors. Understood. Still, I'd like to see if we can easily get rid of the pinning altogether. That would be great, but I felt that taking care of the easy cases in one patch and following with patches to handle the trickier cases as separate follow-on patches made more sense than trying to do everything at once. Assuming we sort out the mark/restore issues for the initial patch, it provides infrastructure to keep the
Re: [HACKERS] Reduce pinning in btree indexes
On 02/15/2015 02:19 AM, Kevin Grittner wrote: Interestingly, the btree README points out that using the old TID with a new tuple poses no hazard for a scan using an MVCC snapshot, because the new tuple would not be visible to a snapshot created that long ago. The first question is: Do we really need that interlock for the non-MVCC snapshots either? If we do: For non-MVCC snapshots, we need to ensure that all index scans that started before the VACUUM did complete before the VACUUM does. I wonder if we could find a different mechanism to enforce that. Using the pin-interlock for that has always seemed a bit silly to me. Perhaps grab a new heavy-weight lock on the table whenever a non-MVCC index scan on the table begins, and have VACUUM wait on it. I found that the LP_DEAD hinting would be a problem with an old TID, but I figured we could work around that by storing the page LSN into the scan position structure when the page contents were read, and only doing hinting if that matched when we were ready to do the hinting. That wouldn't work for an index which was not WAL-logged, so the patch still holds pins for those. Or you could use GetFakeLSNForUnloggedRel(). Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins. Why does an index-only scan need to hold the pin? Finally, there was an "optimization" for marking buffer position for possible restore that was incompatible with releasing the pin. I use quotes because the "optimization" adds overhead to every move to the next page in order to set some variables in a structure when a mark is requested, instead of running two fairly small memcpy() statements. The customer's two-day benchmark showed no performance hit, and looking at the code I would be amazed if the optimization yielded a measurable benefit. In general, optimization by adding overhead to moving through a scan to save time in a mark operation seems dubious. Hmm. Did your test case actually exercise mark/restore? The memcpy()s are not that small. Especially if it's an index-only scan, you're copying a large portion of the page. Some scans call markpos on every tuple, so that could add up. At some point we could consider building on this patch to recheck index conditions for heap access when a non-MVCC snapshot is used, check the visibility map for referenced heap pages when the TIDs are read for an index-only scan, and/or skip LP_DEAD hinting for non-WAL-logged indexes. But all those are speculative future work; this is a conservative implementation that just didn't modify pinning where there were any confounding factors. Understood. Still, I'd like to see if we can easily get rid of the pinning altogether. - Heikki
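For concreteness, the LSN-match guard being discussed might look like the sketch below. The struct and function names are illustrative, not the committed code; only the shape of the check matters. For an unlogged index the page LSN is not maintained by WAL, which is where Heikki's GetFakeLSNForUnloggedRel() suggestion would come in:

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufpage.h"

/* illustrative stand-in for the scan-position bookkeeping */
typedef struct ScanPosSketch
{
	BlockNumber	currPage;	/* block the items were copied from */
	XLogRecPtr	lsn;		/* PageGetLSN() at the time of the copy */
} ScanPosSketch;

static void
hint_if_page_unchanged(Relation indexrel, ScanPosSketch *pos)
{
	Buffer		buf = _bt_getbuf(indexrel, pos->currPage, BT_READ);

	if (PageGetLSN(BufferGetPage(buf)) == pos->lsn)
	{
		/*
		 * The page is unchanged since its items were copied, so the
		 * remembered offsets still denote the same tuples: set LP_DEAD
		 * on the entries known to be dead, while the lock is held.
		 */
	}
	else
	{
		/*
		 * VACUUM (or a split) has touched the page; old TIDs may have
		 * been recycled, so skip the hinting entirely.
		 */
	}

	_bt_relbuf(indexrel, buf);	/* drops the lock and the pin */
}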
Re: [HACKERS] Reduce pinning in btree indexes
On Mon, Feb 23, 2015 at 2:48 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins. Why does an index-only scan need to hold the pin? Suppose there's a dead tuple in the heap someplace, and an index pointer pointing to that dead tuple. Then: (1) an index scan reaches the relevant page and copies all of the index pointers; (2) VACUUM removes the index pointer and marks the heap page all-visible; (3) the scan now uses the index pointers copied to backend-local memory and notes that the heap page is all-visible, so the scan sees the tuple even though that tuple is totally gone from the heap by that point. Holding the pin prevents this, because we can't reach the second heap pass until the scan is done with the page, and therefore the dead line pointer in the heap page can't be marked unused, and therefore the page can't be marked all-visible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
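The interlock Robert is describing is the buffer cleanup lock. A minimal sketch of the VACUUM side, using the real bufmgr API (the page-processing details are elided):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
vacuum_one_index_page(Relation indexrel, BlockNumber blkno)
{
	Buffer		buf = ReadBuffer(indexrel, blkno);

	/*
	 * The "super exclusive" lock: waits until no other backend holds a
	 * pin on the page, then takes the exclusive content lock.  A cursor
	 * parked on this page with a pin blocks here indefinitely.
	 */
	LockBufferForCleanup(buf);

	/* ... remove dead index tuples from the page ... */

	UnlockReleaseBuffer(buf);	/* drop lock and pin */
}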
[HACKERS] Reduce pinning in btree indexes
We have a customer who was unable to migrate from a well-known commercial database product to pg because they have a very large software base that holds cursors open for very long periods of time, preventing GlobalXmin values from advancing, leading to bloat. They have a standard test that exercises hundreds of client connections for days at a time, and saw disk space and CPU usage climbing to unacceptable levels[1]. The other database product had no such problems. We suggested the obvious solutions that we always suggest on the community lists, and they had perfectly good reasons why those were not viable for them -- we either needed to prevent bloat under their current software or they could not migrate. What they wanted was what happened in the other database product -- if a snapshot got sufficiently old that cleaning up the MVCC data was a problem *and* the snapshot was used again *and* it read a page which had been modified far enough back that it was not possible to return correct data, then they wanted to receive a "snapshot too old" error. I wrote a patch to do that, but they didn't see much improvement, because all (auto)vacuum processes blocked indefinitely on the btree index pages where a cursor was parked. So they still got massive bloat and consequent problems. It seemed odd to me that a btree implementation based on the Lehman & Yao techniques would block anything, because the concept is that it reads everything it needs off the index page and pauses between pages. We sorta do that, but the interlock between vacuum processes and other index usages basically involves holding a pin on the page we just read until we are done using the values we read off that page, and treating the pin as a lighter lock than a shared (or READ) lock -- which only conflicts with a "super exclusive" lock, which consists of getting an exclusive lock only once there are no other processes holding a pin on the page. This ensures that at the end of a vacuum pass over a btree index there are no TIDs in process-local memory from before the start of the pass, and consequently no scan can read a new tuple assigned to the same TID value after the rest of the vacuum phases run. So a pin on a btree page blocks a vacuum process indefinitely. Interestingly, the btree README points out that using the old TID with a new tuple poses no hazard for a scan using an MVCC snapshot, because the new tuple would not be visible to a snapshot created that long ago. The customer's cursors which were causing the problems all use MVCC snapshots, so I went looking to see whether we couldn't take advantage of this fact. For the most part it seemed we were OK if we dropped pins with the READ lock for a btree scan which used an MVCC snapshot. I found that the LP_DEAD hinting would be a problem with an old TID, but I figured we could work around that by storing the page LSN into the scan position structure when the page contents were read, and only doing hinting if that matched when we were ready to do the hinting. That wouldn't work for an index which was not WAL-logged, so the patch still holds pins for those. Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins. There was also a problem with the fact that the code conflated the notion of having a valid scan position with holding a pin on a block in the index.
Those two concepts needed to be teased apart, which I did using some new macros and figuring out which one to use where. Without a pin on the block, we also needed to remember what block we had been processing to be able to do the LP_DEAD hinting; for that, the block number was added to the structure which tracks scan position. Finally, there was an "optimization" for marking buffer position for possible restore that was incompatible with releasing the pin. I use quotes because the "optimization" adds overhead to every move to the next page in order to set some variables in a structure when a mark is requested, instead of running two fairly small memcpy() statements. The customer's two-day benchmark showed no performance hit, and looking at the code I would be amazed if the optimization yielded a measurable benefit. In general, optimization by adding overhead to moving through a scan to save time in a mark operation seems dubious. At some point we could consider building on this patch to recheck index conditions for heap access when a non-MVCC snapshot is used, check the visibility map for referenced heap pages when the TIDs are read for an index-only scan, and/or skip LP_DEAD hinting for non-WAL-logged indexes. But all those are speculative future work; this is a conservative implementation that just didn't modify pinning where there were any confounding factors. In combination with the snapshot-too-old patch, the 2-day customer benchmark ran without problem levels of bloat, controlling disk space growth and